Cron jobs and scheduled tasks fail silently. The backup that was supposed to run at 2 AM? It stopped three days ago and nobody noticed. The data sync that feeds your dashboard? It hung mid-run and never completed. These are the failures that heartbeat monitoring is built to catch.
Heartbeat monitoring works like a dead man's switch: your job pings a monitoring endpoint each time it runs, and you get alerted when the ping doesn't arrive. It detects missed runs, late runs, and hanging jobs — the common cron failure modes that traditional monitoring misses entirely. If you want a broader overview of cron monitoring strategies, see our guide on how to monitor cron jobs properly.
The concept is simple, but the implementation has subtle edge cases that trip up even experienced engineers. This guide covers the architecture, the edge cases, and the patterns that work in production.
The Core Concept
A heartbeat monitor is a dead man's switch: it alerts when something doesn't happen. Traditional monitoring watches for events; heartbeat monitoring watches for the absence of events.
Traditional: Alert when X happens
Heartbeat: Alert when X doesn't happen
Examples:
- Backup job didn't run
- Worker stopped processing
- Scheduled report wasn't generated
Architecture Overview
A heartbeat monitoring system has four components:
- Ping Ingestion. HTTP endpoints that receive pings.
- State Storage. Tracks last ping time per monitor.
- Scheduler. Checks for missing pings on schedule.
- Alerting. Notifies when pings are late.
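The interplay of these four components can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: the function names, the interval-based schedule, and the `Map`-backed state store are all assumptions for the sketch.

```typescript
type Status = "healthy" | "late" | "down";

interface MonitorState {
  intervalSeconds: number;    // how often a ping is expected
  gracePeriodSeconds: number;
  lastPingAt: number | null;  // epoch milliseconds
  status: Status;
}

// 2. State storage: tracks last ping time per monitor (in-memory here).
const monitors = new Map<string, MonitorState>();

// 1. Ping ingestion: the HTTP handler would call this for each incoming ping.
function recordPing(id: string, now: number): void {
  const m = monitors.get(id);
  if (!m) return; // unknown token: ignore (or log)
  m.lastPingAt = now;
  m.status = "healthy";
}

// 3. Scheduler: run periodically to find overdue monitors.
// 4. Alerting: here we just flip status; a real system would notify channels.
function checkOverdue(now: number): string[] {
  const overdue: string[] = [];
  for (const [id, m] of monitors) {
    if (m.lastPingAt === null) continue; // never pinged yet
    const deadline =
      m.lastPingAt + (m.intervalSeconds + m.gracePeriodSeconds) * 1000;
    if (now > deadline) {
      m.status = "down";
      overdue.push(id);
    }
  }
  return overdue;
}
```

A real system would persist state and deduplicate alerts, but the shape is the same: ingestion writes timestamps, the scheduler reads them against deadlines.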
Data Model
interface HeartbeatMonitor {
  id: string;
  name: string;

  // Schedule: when pings are expected
  schedule: CronExpression | IntervalSeconds;
  gracePeriod: number; // seconds to wait before alerting

  // State
  lastPingAt: Date | null;
  status: 'healthy' | 'late' | 'down';

  // Alerting
  alertChannels: string[];
  alertedAt: Date | null; // prevent duplicate alerts
}
Edge Cases
1. Grace Periods
Jobs don't run at exactly the same time every day. A 5-minute job might take 7 minutes on a busy day. Grace periods absorb natural variation.
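The lateness check itself is small. Here is one way to express it (a sketch; the function name and millisecond-based signature are illustrative, not a fixed API):

```typescript
// A monitor is late when the grace period after the expected ping time has
// elapsed and no ping has arrived for that run. All times in epoch millis.
function isLate(
  expectedAt: number,
  gracePeriodSeconds: number,
  lastPingAt: number | null,
  now: number
): boolean {
  const deadline = expectedAt + gracePeriodSeconds * 1000;
  if (now < deadline) return false; // still within the grace period
  return lastPingAt === null || lastPingAt < expectedAt; // no ping for this run
}
```

For a ping expected at 02:00 with a 5-minute grace period, this alerts at 02:05 only if no ping has arrived since 02:00.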
Expected ping: 02:00:00 UTC
Grace period: 5 minutes
Alert fires: 02:05:00 UTC (if no ping)
Too short: false alarms on slow days
Too long: late detection of real failures
2. Timezone Handling
Cron expressions are timezone-sensitive. "Run at 2 AM" means different times depending on whether that's UTC, America/New_York, or Europe/London.
// Store timezone with the monitor
schedule: "0 2 * * *",
timezone: "America/New_York"
// Convert to UTC for scheduling
nextExpectedPing = cronToUtc(schedule, timezone);
// DST edge case: 2 AM might not exist (spring forward)
// or happen twice (fall back)
3. Daylight Saving Time
- Spring forward: 2:00 AM doesn't exist. A job scheduled for 2:30 AM may or may not run.
- Fall back: 2:00 AM happens twice. The job might run twice, or once, or at an unexpected offset.
Solution: store schedules in UTC internally and convert to local time only for display.
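A sketch of the UTC-internal approach: next-run arithmetic happens on plain UTC timestamps, and timezone conversion is pushed to the display layer via `Intl.DateTimeFormat`, which handles DST at render time. This covers interval-based schedules; cron expressions evaluated in a local timezone need a tz-aware library (e.g. cron-parser on npm). The function names here are illustrative.

```typescript
// Internally, the next expected ping is plain UTC arithmetic -- no DST math.
function nextExpectedPing(
  lastExpectedUtc: number, // epoch milliseconds
  intervalSeconds: number
): number {
  return lastExpectedUtc + intervalSeconds * 1000;
}

// Timezone conversion happens only at display time. Intl.DateTimeFormat
// applies the zone's DST rules when rendering.
function formatForDisplay(utcMillis: number, timeZone: string): string {
  return new Intl.DateTimeFormat("en-US", {
    timeZone,
    dateStyle: "short",
    timeStyle: "long",
  }).format(new Date(utcMillis));
}
```

Because no local-time arithmetic ever happens internally, the "2:30 AM doesn't exist" and "2:00 AM happens twice" cases can't corrupt the schedule state.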
4. Missed Checks During Outages
If your monitoring system itself is down, you miss the window to detect a late ping. When you come back up, the job already ran (or didn't), and you don't know which.
Timeline:
02:00 - Job should run
02:05 - Grace period expires (would have alerted)
02:10 - Monitoring system was down during 02:00-02:10
02:11 - System comes back up
Question: Did the job run at 02:00?
Options:
A) Assume it ran (optimistic) - risky
B) Assume it didn't (pessimistic) - noisy
C) Check job logs/output - best but complex
5. Clock Skew
If the job's server clock is wrong, pings arrive at unexpected times. A 5-minute clock skew can cause false alarms or missed detection.
// Record when the ping was sent (from job) vs received
{
  receivedAt: "2024-01-15T02:03:00Z", // Our clock
  sentAt: "2024-01-15T02:08:00Z",     // Job's clock (5 min ahead)
  // Job thinks it's late, but it's actually on time.
  // Use receivedAt for alerting, but log skew for debugging.
}
6. Duplicate Pings
Network retries can cause duplicate pings. Your system should be idempotent.
// Naive: update lastPingAt on every request
// Problem: rapid retries spam your database

// Better: debounce within a window
const DEBOUNCE_MS = 30_000;
if (lastPingAt !== null && now.getTime() - lastPingAt.getTime() < DEBOUNCE_MS) {
  return; // ignore duplicate
}
lastPingAt = now;
Scaling Considerations
High-Frequency Pings
If jobs ping every 30 seconds and you have 10,000 monitors, that's 20,000 writes/minute. Design for write-heavy workloads.
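One common mitigation is write coalescing: buffer the latest ping per monitor in memory and flush the batch to storage on a timer, turning thousands of writes per minute into one batched update. A sketch (the class and method names are illustrative):

```typescript
// Coalesce high-frequency ping writes into periodic batch flushes.
// flush() would back one batched UPDATE instead of one write per ping.
class PingBuffer {
  private latest = new Map<string, number>(); // monitor id -> last ping (ms)

  record(id: string, at: number): void {
    const prev = this.latest.get(id);
    if (prev === undefined || at > prev) this.latest.set(id, at);
  }

  // Call on a timer (e.g. every few seconds); returns the batch to persist.
  flush(): Array<[string, number]> {
    const batch = [...this.latest.entries()];
    this.latest.clear();
    return batch;
  }
}
```

The trade-off is a small detection delay (up to one flush interval), which is usually negligible next to the grace period.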
Options:
1. Time-series database (InfluxDB, TimescaleDB)
2. Redis with TTL-based expiration
3. Write coalescing / batch updates
4. Separate hot/cold storage
Distributed Scheduler
The scheduler that checks for late pings must be highly available. If it goes down, no alerts fire.
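Partitioning monitors across workers can be as simple as hashing the monitor id, so each scheduler instance checks a disjoint, stable subset. A sketch (the hash function and names are illustrative, not a recommendation for production hashing):

```typescript
// Deterministically assign a monitor to one of N scheduler workers.
function partitionFor(monitorId: string, workerCount: number): number {
  let hash = 0;
  for (const ch of monitorId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple string hash
  }
  return hash % workerCount;
}

// Each worker filters the full monitor list down to its own partition.
function monitorsForWorker(
  ids: string[],
  workerIndex: number,
  workerCount: number
): string[] {
  return ids.filter((id) => partitionFor(id, workerCount) === workerIndex);
}
```

Note that static hashing alone doesn't handle worker failure; a dead worker's partition goes unchecked until something reassigns it, which is why this is often combined with leader election or health-checked membership.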
Approaches:
1. Multiple schedulers with leader election
2. Partition monitors across workers
3. Pull-based: each check runs independently
Implementation Patterns
Pattern 1: Push-Based (Recommended)
Job pings the monitoring service. Simplest to implement and debug.
# In your cron job
curl -fsS -m 10 https://agents.upti.my/v1/heartbeat/TOKEN
# Monitoring service updates lastPingAt
# Scheduler checks for overdue pings every minute
Pattern 2: Pull-Based
Monitoring service queries job status. More complex, but works for jobs that can't make outbound HTTP calls.
# Job writes status to shared location
echo "success:$(date +%s)" > /status/backup-job
# Monitoring service reads periodically
status = read_file("/status/backup-job")
if (status.timestamp < now - grace_period) alert()
Pattern 3: Hybrid
Combine push for normal operation with pull for verification.
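One way to structure the hybrid: trust the push path while pings arrive, but run a pull-based probe before alerting, so a dropped ping doesn't page anyone when the job actually ran. A sketch with an injected probe (all names here are illustrative):

```typescript
// Hybrid check: the push path flags the monitor as overdue, and a
// caller-supplied pull probe (e.g. reading the job's status file or
// querying its logs) confirms before an alert fires.
function shouldAlert(pingIsOverdue: boolean, probe: () => boolean): boolean {
  if (!pingIsOverdue) return false; // push path looks healthy
  return !probe();                  // alert only if the pull path agrees
}
```

This trades a slower alert (one extra probe) for fewer false alarms from lost pings.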
upti.my Implementation
- Flexible schedules. Cron expressions with timezone support.
- Configurable grace periods. Per-monitor customization.
- Start/end pings. Detect hung jobs, not just missing ones.
- Exit code tracking. Distinguish "didn't run" from "ran but failed".
- Duration monitoring. Alert when jobs take too long.
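The start/end-ping idea generalizes to any heartbeat service: wrap the job so it signals "started", "succeeded", or "failed" separately. A sketch with an injected ping function; the "/start" and "/fail" suffixes are illustrative only, not upti.my's actual API.

```typescript
// Wrap a job with start/end pings so hung jobs become detectable:
// a "/start" ping with no matching success ping means the job is stuck.
async function withHeartbeat<T>(
  ping: (suffix: string) => Promise<void>, // e.g. an HTTP call to the monitor
  job: () => Promise<T>
): Promise<T> {
  await ping("/start"); // "job began" -- enables hang and duration detection
  try {
    const result = await job();
    await ping(""); // success ping
    return result;
  } catch (err) {
    await ping("/fail"); // distinguish "ran but failed" from "didn't run"
    throw err;
  }
}
```

Injecting `ping` keeps the wrapper testable and independent of any particular monitoring endpoint.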
Key Takeaways
1. Heartbeat monitoring detects when expected events don't happen.
2. Grace periods absorb natural variation in job timing.
3. Timezone and DST handling require careful implementation.
4. Plan for monitoring system outages: what happens when you're down?
5. Design for write-heavy workloads at scale.
Building a robust heartbeat monitoring system is harder than it looks. Consider using a purpose-built solution rather than building your own.