Cron jobs are among the most neglected parts of most infrastructure. They run silently in the background, doing their thing day after day. Until they stop. And when they stop, no one notices until it's too late.

The Problem with Cron

Cron has no built-in monitoring. When a job fails, cron doesn't alert anyone; at most it mails the job's output to a local mailbox. When a job never runs because the crontab was deleted, there's no notification at all. This silence is dangerous.

Real-World Horror Stories

Backup job stopped 3 weeks ago. Server update cleared the crontab. Discovered during disaster recovery.
Daily report never ran. Script permissions changed, cron sent email to /dev/null.
Database cleanup hung. Job started but never finished, tables grew unbounded.
Certificate renewal failed. Let's Encrypt cron job errored, site went down on expiry day.

Five Cron Failure Modes

1. Job Never Started

The most common failure: the job didn't run at all.

Crontab deleted or modified
Cron daemon not running
Server rebooted and cron didn't start
Wrong user's crontab edited

Detection: Heartbeat monitoring. The job must ping an external service every time it runs; if no ping arrives within the expected window, the job never started.
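
To make the idea concrete, here is a minimal sketch of the check a heartbeat service performs on its side. It is an illustration under stated assumptions, not how any particular service is implemented: the function name `heartbeat_is_stale` and its arguments are hypothetical, and all values are in seconds.

```shell
#!/bin/bash
# Hypothetical helper: decide whether an expected heartbeat is overdue.
# Usage: heartbeat_is_stale LAST_PING NOW INTERVAL GRACE
#   LAST_PING, NOW  Unix timestamps in seconds
#   INTERVAL        expected time between runs, in seconds
#   GRACE           extra tolerance before alerting, in seconds
# Succeeds (exit 0) when the ping is overdue, fails (exit 1) otherwise.
heartbeat_is_stale() {
  local last_ping="$1" now="$2" interval="$3" grace="$4"
  [ $((now - last_ping)) -gt $((interval + grace)) ]
}

# Example: a daily job (86400 s) with a 10-minute grace period
# if heartbeat_is_stale "$LAST_PING" "$(date +%s)" 86400 600; then
#   echo "ALERT: job has not checked in" >&2
# fi
```

The job itself never has to report "I did not run"; the absence of a ping past the interval plus grace period is the signal.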

2. Job Started but Failed

The job ran, but exited with an error.

Script error or exception
Missing dependencies
Permission denied
Database connection failed

Detection: Track exit codes. Ping the heartbeat only on successful completion.
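
One way to sketch this is a small wrapper that runs any command and pings only when the exit code is 0. The function names `run_and_ping` and `send_ping` are hypothetical, and `your-token` is a placeholder, as in the examples later in this article.

```shell
#!/bin/bash
# Placeholder URL: replace your-token with a real heartbeat token.
HEARTBEAT_URL="https://agents.upti.my/v1/heartbeat/your-token"

# Hypothetical helper: send the success ping.
# -f fails on HTTP errors, -sS is quiet but still shows errors,
# -m 10 caps the whole request at 10 seconds.
send_ping() {
  curl -fsS -m 10 "$HEARTBEAT_URL" > /dev/null
}

# Hypothetical wrapper: run the given command, ping only on exit code 0,
# and return the command's own exit code to the caller.
run_and_ping() {
  "$@"
  local status=$?
  if [ "$status" -eq 0 ]; then
    send_ping
  else
    echo "job failed with exit code $status; heartbeat skipped" >&2
  fi
  return "$status"
}

# Example: run_and_ping /path/to/your/job.sh
```

Because a failed run sends nothing, the monitoring service sees a missed heartbeat and alerts, exactly as it would if the job had never started.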

3. Job Started but Never Finished

The job started, but it hangs or is killed before it completes.

Deadlock or infinite loop
Waiting on locked resource
Network timeout (no timeout configured)
OOM killed mid-execution

Detection: Duration monitoring. Set maximum expected runtime and alert if exceeded.
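
In addition to alerting, you can enforce the limit locally. The sketch below assumes the `timeout` command from GNU coreutils is available (it is not installed by default on every system, macOS for example); `run_with_limit` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: run a command with a maximum runtime in seconds.
# Usage: run_with_limit SECONDS command [args...]
# GNU timeout exits with status 124 when it had to terminate the command.
run_with_limit() {
  local limit="$1"
  shift
  timeout "$limit" "$@"
  local status=$?
  if [ "$status" -eq 124 ]; then
    echo "job exceeded ${limit}s and was terminated" >&2
  fi
  return "$status"
}

# Example: allow the job at most one hour
# run_with_limit 3600 /path/to/your/job.sh
```

A terminated job exits non-zero, so a wrapper that pings only on success will skip the heartbeat and the hang surfaces as a missed ping.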

4. Job Succeeded but Data is Wrong

The job completed successfully, but produced incorrect results.

Zero records processed (but no error)
Partial completion (some records skipped)
Wrong environment variables
Stale credentials

Detection: Include result data in the heartbeat ping. Validate expected outcomes.
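
The validation half can be sketched as a guard that runs before the ping. `check_min_records` is a hypothetical helper, and the threshold is something you choose per job. How result data is attached to the ping depends on your monitoring service, so that part is deliberately left out.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if COUNT is a whole number >= MIN.
# Usage: check_min_records COUNT MIN
# Returns 0 if the count is acceptable, 1 if too low, 2 if not a number.
check_min_records() {
  local count="$1" min="$2"
  case "$count" in
    ''|*[!0-9]*)
      echo "record count '$count' is not a number" >&2
      return 2
      ;;
  esac
  if [ "$count" -lt "$min" ]; then
    echo "only $count records processed, expected at least $min" >&2
    return 1
  fi
}

# Example: ROWS is assumed to be set by your job's own output.
# check_min_records "$ROWS" 1 && curl -fsS -m 10 https://agents.upti.my/v1/heartbeat/your-token
```

With this guard, a run that "succeeds" while processing zero records no longer sends a success ping.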

5. Job Runs but Too Slowly

The job completes, but takes much longer than expected.

Growing data volume
Resource contention
Inefficient queries
Network degradation

Detection: Track execution duration over time. Alert on increasing trends.
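
If you want a local check as well, the sketch below compares the latest run against the mean of earlier runs. This is a simple baseline comparison, not a true trend analysis. The function name, the log format (one duration in seconds per line), and the 150% default threshold are all assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: flag a run that is much slower than the historical mean.
# Usage: duration_regressed LOG_FILE LATEST_SECONDS [PERCENT]
#   LOG_FILE  one past duration (in seconds) per line
#   PERCENT   threshold relative to the mean; default 150 (1.5x the mean)
# Succeeds (exit 0) when LATEST_SECONDS exceeds the threshold.
# Fails when the run is within the threshold or there is no history yet.
duration_regressed() {
  local log="$1" latest="$2" pct="${3:-150}"
  awk -v latest="$latest" -v pct="$pct" '
    NF { sum += $1; n++ }
    END {
      if (n == 0) exit 1   # no history: nothing to compare against
      exit !(latest * 100 > (sum / n) * pct)
    }' "$log"
}

# Example (the log path is a placeholder):
# if duration_regressed /var/tmp/job-durations.log "$DURATION"; then
#   echo "WARNING: job took ${DURATION}s, well above its usual runtime" >&2
# fi
# echo "$DURATION" >> /var/tmp/job-durations.log
```

Checking before appending keeps the latest run out of its own baseline.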

Implementing Heartbeat Monitoring

The solution is simple: your cron job pings an external service when it runs. If the ping doesn't arrive, you get alerted.

Basic Implementation

backup.sh

#!/bin/bash
# backup.sh - nightly backup with heartbeat monitoring

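# Abort on any error, unset variable, or failed pipeline stage
set -euo pipefail

# Compute the file name once so both steps use the same date,
# even if the job runs across midnight
BACKUP_FILE="/backups/mydb-$(date +%Y%m%d).sql"

# Run the backup
pg_dump mydb > "$BACKUP_FILE"

# Compress
gzip "$BACKUP_FILE"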

# Only ping heartbeat if everything succeeded
curl -fsS -m 10 https://agents.upti.my/v1/heartbeat/your-token

With Duration Tracking

job-with-duration.sh

#!/bin/bash
START_TIME=$(date +%s)

# Signal start
curl -fsS -m 10 https://agents.upti.my/v1/heartbeat/your-token/start

# Run the job
/path/to/your/job.sh
EXIT_CODE=$?

# Calculate duration
DURATION=$(($(date +%s) - START_TIME))

# Signal completion with status
curl -fsS -m 10 "https://agents.upti.my/v1/heartbeat/your-token?exit_code=$EXIT_CODE&duration=$DURATION"

# Exit with the job's own status so callers see the real result, not curl's
exit "$EXIT_CODE"

Monitoring with upti.my

upti.my provides purpose-built heartbeat monitoring for cron jobs:

Flexible schedules. Cron expressions, intervals, or calendar-based.
Grace periods. Allow for natural variation in execution time.
Exit code tracking. Distinguish between "didn't run" and "ran but failed".
Duration monitoring. Alert when jobs take too long.
Start/end pings. Detect hanging jobs.

Common Mistakes

Pinging only at the start. A start ping says nothing about the outcome; send the success ping at the end, after the job completes.
No timeout on the ping. Use curl's -m 10 option to prevent the ping itself from hanging.
Relying on cron email. Cron email goes to a local mailbox that nobody reads.
Same grace period for all jobs. A 5-minute job needs different tolerance than a 5-hour job.

Key Takeaways

1. Cron has no built-in failure notification
2. Jobs can fail in five distinct ways; monitor all of them
3. Heartbeat monitoring is the solution: jobs ping when they complete
4. Track duration and exit codes, not just "job ran"
5. Set appropriate grace periods for natural variation

Don't wait for a disaster to discover that your backup hasn't run in weeks. Set up heartbeat monitoring for every scheduled job.

Cron Job Monitoring: Common Failure Modes