Heartbeat monitoring is conceptually simple: a job pings you when it runs, and you alert if the ping doesn't arrive. But the implementation has subtle edge cases that trip up even experienced engineers.
The Core Concept
A heartbeat monitor is a "dead man's switch". It alerts when something doesn't happen. Traditional monitoring watches for events. Heartbeat monitoring watches for the absence of events.
Traditional: Alert when X happens
Heartbeat: Alert when X doesn't happen
Examples:
- Backup job didn't run
- Worker stopped processing
- Scheduled report wasn't generated

Architecture Overview
A heartbeat monitoring system has four components:
- Ping Ingestion. HTTP endpoints that receive pings.
- State Storage. Tracks last ping time per monitor.
- Scheduler. Checks for missing pings on schedule.
- Alerting. Notifies when pings are late.
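The first two components can be sketched in a few lines. Here an in-memory Map stands in for real state storage; the names are illustrative, not from any particular implementation:

```typescript
// Ping ingestion + state storage, minimally: record the latest ping
// per monitor. An in-memory Map stands in for a real database.
type MonitorState = { lastPingAt: Date | null };

const monitors = new Map<string, MonitorState>();

function recordPing(monitorId: string, receivedAt: Date = new Date()): void {
  const state = monitors.get(monitorId) ?? { lastPingAt: null };
  state.lastPingAt = receivedAt;
  monitors.set(monitorId, state);
}
```

The scheduler and alerting components then only ever read this state; they never need to talk to the jobs themselves.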
Data Model
interface HeartbeatMonitor {
  id: string;
  name: string;

  // Schedule: when pings are expected
  schedule: CronExpression | IntervalSeconds;
  gracePeriod: number; // seconds to wait before alerting

  // State
  lastPingAt: Date | null;
  status: 'healthy' | 'late' | 'down';

  // Alerting
  alertChannels: string[];
  alertedAt: Date | null; // prevent duplicate alerts
}

Edge Cases
1. Grace Periods
Jobs don't run at exactly the same time every day. A 5-minute job might take 7 minutes on a busy day. Grace periods absorb natural variation.
Expected ping: 02:00:00 UTC
Grace period: 5 minutes
Alert fires: 02:05:00 UTC (if no ping)
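Once everything is in UTC, the check itself is a one-liner; a sketch (function name and signature are illustrative):

```typescript
// Overdue check: alert once `now` reaches expected time + grace period.
function isOverdue(expectedAt: Date, gracePeriodSec: number, now: Date): boolean {
  return now.getTime() >= expectedAt.getTime() + gracePeriodSec * 1000;
}
```

With a 5-minute grace period, the example above fires at exactly 02:05:00 and not a moment before.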
Too short: false alarms on slow days
Too long: late detection of real failures

2. Timezone Handling
Cron expressions are timezone-sensitive. "Run at 2 AM" means different times depending on whether that's UTC, America/New_York, or Europe/London.
// Store timezone with the monitor
schedule: "0 2 * * *",
timezone: "America/New_York"
// Convert to UTC for scheduling
nextExpectedPing = cronToUtc(schedule, timezone);
// DST edge case: 2 AM might not exist (spring forward)
// or happen twice (fall back)

3. Daylight Saving Time
- Spring forward: 2:00 AM doesn't exist. A job scheduled for 2:30 AM may or may not run.
- Fall back: 2:00 AM happens twice. The job might run twice, or once, or at an unexpected offset.
Solution: Store schedules in UTC internally, only convert for display.
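For interval-based schedules the UTC-internal approach is especially clean: epoch arithmetic ticks uniformly through DST transitions. A sketch, with dates chosen around the 2024 US spring-forward:

```typescript
// Next expected ping for an interval schedule, computed in UTC.
// Epoch milliseconds are unaffected by DST transitions.
function nextExpected(lastPingAt: Date, intervalSec: number): Date {
  return new Date(lastPingAt.getTime() + intervalSec * 1000);
}

// 2024-03-10T06:00:00Z is 1:00 AM EST; US clocks spring forward later
// that night. 24h later in UTC is 2024-03-11T06:00:00Z — 2:00 AM EDT.
// The local wall-clock time shifted, but the interval stayed exactly 24h.
const next = nextExpected(new Date("2024-03-10T06:00:00Z"), 24 * 3600);
```

Cron schedules don't get this for free, since they are defined in wall-clock terms; those still need explicit per-timezone conversion as described above.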
4. Missed Checks During Outages
If your monitoring system itself is down, you miss the window to detect a late ping. When you come back up, the job already ran (or didn't), and you don't know which.
Timeline:
02:00 - Job should run
02:05 - Grace period expires (would have alerted)
02:10 - Monitoring system was down during 02:00-02:10
02:11 - System comes back up
Question: Did the job run at 02:00?
Options:
A) Assume it ran (optimistic) - risky
B) Assume it didn't (pessimistic) - noisy
C) Check job logs/output - best but complex

5. Clock Skew
If the job's server clock is wrong, pings arrive at unexpected times. A 5-minute clock skew can cause false alarms or missed detection.
// Record when the ping was sent (from job) vs received
{
  receivedAt: "2024-01-15T02:03:00Z", // Our clock
  sentAt: "2024-01-15T02:08:00Z",     // Job's clock (5 min ahead)
  // Job thinks it's late, but it's actually on time.
  // Use receivedAt for alerting, but log the skew for debugging.
}

6. Duplicate Pings
Network retries can cause duplicate pings. Your system should be idempotent.
// Naive: update lastPingAt on every request
// Problem: rapid retries spam your database
// Better: debounce within a window
if (now.getTime() - lastPingAt.getTime() < 30_000) { // within 30 seconds
  return; // ignore duplicate
}
lastPingAt = now;

Scaling Considerations
High-Frequency Pings
If jobs ping every 30 seconds and you have 10,000 monitors, that's 20,000 writes/minute. Design for write-heavy workloads.
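Write coalescing is often the cheapest win: absorb pings in memory and flush one batched update per monitor on a timer. A sketch, where `writeBatch` is a stand-in for your real storage layer:

```typescript
// Coalesce high-frequency pings: keep only the latest ping time per
// monitor in memory, then flush everything as one batch.
const pendingPings = new Map<string, Date>();

function onPing(monitorId: string, at: Date): void {
  const prev = pendingPings.get(monitorId);
  if (!prev || at.getTime() > prev.getTime()) pendingPings.set(monitorId, at);
}

// Run every few seconds; `writeBatch` stands in for the database layer.
function flush(writeBatch: (batch: Map<string, Date>) => void): void {
  if (pendingPings.size === 0) return;
  writeBatch(new Map(pendingPings));
  pendingPings.clear();
}
```

Ten pings in a flush window collapse into one write; the trade-off is that stored lastPingAt values lag reality by up to one flush interval, which must be smaller than your shortest grace period.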
Options:
1. Time-series database (InfluxDB, TimescaleDB)
2. Redis with TTL-based expiration
3. Write coalescing / batch updates
4. Separate hot/cold storage

Distributed Scheduler
The scheduler that checks for late pings must be highly available. If it goes down, no alerts fire.
Approaches:
1. Multiple schedulers with leader election
2. Partition monitors across workers
3. Pull-based: each check runs independently

Implementation Patterns
Pattern 1: Push-Based (Recommended)
Job pings the monitoring service. Simplest to implement and debug.
# In your cron job
curl -fsS -m 10 https://agents.upti.my/v1/heartbeat/TOKEN
# Monitoring service updates lastPingAt
# Scheduler checks for overdue pings every minute

Pattern 2: Pull-Based
Monitoring service queries job status. More complex, but works for jobs that can't make outbound HTTP calls.
# Job writes status to shared location
echo "success:$(date +%s)" > /status/backup-job
# Monitoring service reads periodically
status = read_file("/status/backup-job")
if (status.timestamp < now - grace_period) alert()

Pattern 3: Hybrid
Combine push for normal operation with pull for verification.
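A sketch of that combination: push pings update lastPingAt as usual, and the scheduler pulls the job's own status before paging anyone. Here `pullStatus` is an illustrative stand-in for reading the job's status file or API:

```typescript
// Hybrid check: trust push pings first; on a miss, verify via pull
// before alerting, so a lost ping alone doesn't page anyone.
function checkMonitor(
  lastPingAt: Date | null,
  windowStart: Date, // the run we expected a ping for
  pullStatus: () => Date | null, // job's own last-success time, or null
): "healthy" | "down" {
  if (lastPingAt && lastPingAt.getTime() >= windowStart.getTime()) return "healthy";
  const pulled = pullStatus(); // e.g. read /status/backup-job
  return pulled !== null && pulled.getTime() >= windowStart.getTime()
    ? "healthy"
    : "down";
}
```

The pull probe only runs on the slow path, so the extra complexity is paid only when something already looks wrong.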
upti.my Implementation
- Flexible schedules. Cron expressions with timezone support.
- Configurable grace periods. Per-monitor customization.
- Start/end pings. Detect hung jobs, not just missing ones.
- Exit code tracking. Distinguish "didn't run" from "ran but failed".
- Duration monitoring. Alert when jobs take too long.
Key Takeaways
1. Heartbeat monitoring detects when expected events don't happen
2. Grace periods absorb natural variation in job timing
3. Timezone and DST handling require careful implementation
4. Plan for monitoring-system outages: what happens when you're down?
5. Design for write-heavy workloads at scale
Building a robust heartbeat monitoring system is harder than it looks. Consider using a purpose-built solution rather than building your own.