Cron Job Monitoring: Common Failure Modes

Your nightly backup job failed 3 weeks ago. Here is how to catch silent cron failures before they become disasters.

Cron jobs are among the most neglected parts of an infrastructure. They run silently in the background, day after day, until they stop. And when they stop, no one notices until it's too late.

The Problem with Cron

Cron has no built-in monitoring. When a job fails, the most cron does is mail the job's output to a local mailbox; it doesn't alert anyone. When a job never runs because the crontab was deleted, there's no notification at all. This silence is dangerous.

Real-World Horror Stories

  • Backup job stopped 3 weeks ago. Server update cleared the crontab. Discovered during disaster recovery.
  • Daily report never ran. Script permissions changed, cron sent email to /dev/null.
  • Database cleanup hung. Job started but never finished, tables grew unbounded.
  • Certificate renewal failed. Let's Encrypt cron job errored, site went down on expiry day.

Five Cron Failure Modes

1. Job Never Started

The most common failure: the job didn't run at all.

  • Crontab deleted or modified
  • Cron daemon not running
  • Server rebooted and cron didn't start
  • Wrong user's crontab edited

Detection: Heartbeat monitoring. The job pings an external service on every run; if no ping arrives on schedule, the service alerts you.
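
To make the idea concrete, here is a minimal sketch of the check a heartbeat monitor performs on its side. The function name is_overdue and its arguments are hypothetical, chosen only for illustration; a real service does this for you.

```shell
#!/bin/sh
# Sketch of the "is the ping late?" check (hypothetical helper).
# All arguments are in seconds:
#   $1 last_ping  Unix time of the most recent ping
#   $2 now        current Unix time
#   $3 interval   expected time between runs (86400 for a daily job)
#   $4 grace      extra slack allowed before alerting
# Returns 0 (overdue) when more than interval + grace has passed.
is_overdue() {
    local last_ping="$1" now="$2" interval="$3" grace="$4"
    [ $((now - last_ping)) -gt $((interval + grace)) ]
}

# Example use (the timestamp variable is assumed to be stored elsewhere):
#   if is_overdue "$LAST_PING" "$(date +%s)" 86400 600; then
#       echo "nightly job missed its heartbeat" >&2
#   fi
```

Because the check only looks at the absence of a ping, it catches a deleted crontab or a stopped cron daemon just as well as a broken script.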

2. Job Started but Failed

The job ran, but exited with an error.

  • Script error or exception
  • Missing dependencies
  • Permission denied
  • Database connection failed

Detection: Track exit codes. Ping the heartbeat only on successful completion.
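
One way to apply this is a small wrapper that runs the job, pings only when the exit code is 0, and passes the job's exit code on. This is a sketch, not a prescribed implementation: run_and_ping, ping_heartbeat, and the HEARTBEAT_URL variable are illustrative names, and the URL is a placeholder.

```shell
#!/bin/sh
# Placeholder URL; substitute your own heartbeat endpoint.
HEARTBEAT_URL="${HEARTBEAT_URL:-https://agents.upti.my/v1/heartbeat/your-token}"

# Send the heartbeat. -m 10 caps the whole request at 10 seconds.
ping_heartbeat() {
    curl -fsS -m 10 "$HEARTBEAT_URL" >/dev/null
}

# Run the given command, ping only if it exits 0, and return its exit code.
run_and_ping() {
    "$@"
    local code=$?
    if [ "$code" -eq 0 ]; then
        # A failed ping is logged but does not change the job's result.
        ping_heartbeat || echo "heartbeat ping failed" >&2
    else
        echo "job failed with exit code $code; heartbeat withheld" >&2
    fi
    return "$code"
}

# Example use from a crontab entry or wrapper script:
#   run_and_ping /path/to/your/job.sh
```

Withholding the ping turns a failed run into a missed heartbeat, so the same alert covers both "never ran" and "ran but failed".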

3. Job Started but Never Finished

The job started but never reported completion, usually because it hangs indefinitely or is killed mid-run.

  • Deadlock or infinite loop
  • Waiting on locked resource
  • Network timeout (no timeout configured)
  • OOM killed mid-execution

Detection: Duration monitoring. Set maximum expected runtime and alert if exceeded.
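
Alongside alerting, a local guard can stop a hung job from running forever. The sketch below assumes GNU coreutils timeout is installed (it may be missing on some systems); run_with_limit is a hypothetical helper name. This enforces a limit locally and complements, rather than replaces, duration alerts from a monitoring service.

```shell
#!/bin/sh
# Run a command with a hard time limit using GNU coreutils `timeout`.
#   $1    limit in seconds
#   $2... the command to run
# `timeout` exits with status 124 when the limit is reached.
run_with_limit() {
    local limit="$1"
    shift
    timeout "$limit" "$@"
    local code=$?
    if [ "$code" -eq 124 ]; then
        echo "job exceeded ${limit}s and was terminated" >&2
    fi
    return "$code"
}

# Example use: allow the job at most one hour.
#   run_with_limit 3600 /path/to/your/job.sh
```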

4. Job Succeeded but Data is Wrong

The job completed successfully, but produced incorrect results.

  • Zero records processed (but no error)
  • Partial completion (some records skipped)
  • Wrong environment variables
  • Stale credentials

Detection: Include result data in the heartbeat ping. Validate expected outcomes.
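
As a rough illustration, the job can check its own output before pinging and attach the result to the ping URL. Both helper names are hypothetical, and the records query parameter is an assumption made for this sketch; check which parameters your monitoring service actually accepts.

```shell
#!/bin/sh
# Succeed only if the job processed at least the expected minimum.
#   $1 records processed, $2 minimum expected (both integers)
check_record_count() {
    local processed="$1" minimum="$2"
    if [ "$processed" -lt "$minimum" ]; then
        echo "only $processed records processed, expected at least $minimum" >&2
        return 1
    fi
}

# Build a ping URL that carries the result.
# The `records` parameter is hypothetical, used here for illustration only.
build_ping_url() {
    local base="$1" processed="$2"
    printf '%s?records=%s\n' "$base" "$processed"
}

# Example use (PROCESSED is assumed to be set by the job itself):
#   check_record_count "$PROCESSED" 1 &&
#       curl -fsS -m 10 "$(build_ping_url "$HEARTBEAT_URL" "$PROCESSED")"
```

A job that exits 0 after processing zero records then fails the check, and the heartbeat is never sent.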

5. Job Runs but Too Slowly

The job completes, but takes much longer than expected.

  • Growing data volume
  • Resource contention
  • Inefficient queries
  • Network degradation

Detection: Track execution duration over time. Alert on increasing trends.
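
A crude local version of this is to log each run's duration and compare the latest run against the average of the earlier ones. The sketch below is deliberately simple (a mean comparison, not real trend analysis), and the function names, history file, and factor are all assumptions made for illustration.

```shell
#!/bin/sh
# Append this run's duration in seconds to a history file, one value per line.
record_duration() {
    local history_file="$1" duration="$2"
    echo "$duration" >> "$history_file"
}

# Return 0 if the latest duration exceeds the mean of all earlier runs
# by more than the given factor (for example 1.5).
# Returns non-zero if there are fewer than two entries.
is_slower_than_usual() {
    local history_file="$1" factor="$2"
    awk -v factor="$factor" '
        { prev_sum += last; last = $1; n++ }
        END {
            if (n < 2) exit 1
            mean = prev_sum / (n - 1)
            exit !(last > mean * factor)
        }' "$history_file"
}

# Example use (DURATION is assumed to be measured by the wrapper script):
#   record_duration /var/tmp/backup-durations.log "$DURATION"
#   is_slower_than_usual /var/tmp/backup-durations.log 1.5 &&
#       echo "backup is running slower than usual" >&2
```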

Implementing Heartbeat Monitoring

The solution is simple: your cron job pings an external service when it runs. If the ping doesn't arrive, you get alerted.

Basic Implementation

backup.sh
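#!/bin/bash
# backup.sh - nightly backup with heartbeat monitoring

# Exit on the first error, unset variable, or failed pipeline stage
set -euo pipefail

# Compute the filename once so both steps use the same date,
# even if the job runs across midnight
BACKUP_FILE="/backups/mydb-$(date +%Y%m%d).sql"

# Run the backup
pg_dump mydb > "$BACKUP_FILE"

# Compress
gzip "$BACKUP_FILE"

# Only ping heartbeat if everything succeeded
curl -fsS -m 10 https://agents.upti.my/v1/heartbeat/your-token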

With Duration Tracking

job-with-duration.sh
#!/bin/bash
START_TIME=$(date +%s)

# Signal start
curl -fsS -m 10 https://agents.upti.my/v1/heartbeat/your-token/start

# Run the job
/path/to/your/job.sh
EXIT_CODE=$?

# Calculate duration
DURATION=$(($(date +%s) - START_TIME))

# Signal completion with status
curl -fsS -m 10 "https://agents.upti.my/v1/heartbeat/your-token?exit_code=$EXIT_CODE&duration=$DURATION"

# Propagate the job's exit code rather than curl's
exit "$EXIT_CODE"

Monitoring with upti.my

upti.my provides purpose-built heartbeat monitoring for cron jobs:

  • Flexible schedules. Cron expressions, intervals, or calendar-based.
  • Grace periods. Allow for natural variation in execution time.
  • Exit code tracking. Distinguish between "didn't run" and "ran but failed".
  • Duration monitoring. Alert when jobs take too long.
  • Start/end pings. Detect hanging jobs.

Common Mistakes

  • Pinging only at the start. The success ping belongs at the end, after successful completion.
  • No timeout on the ping. Use curl's -m 10 option to prevent the ping itself from hanging.
  • Relying on cron email. Cron email goes to a local mailbox that nobody reads.
  • Same grace period for all jobs. A 5-minute job needs different tolerance than a 5-hour job.

Key Takeaways

  1. Cron has no built-in failure notification
  2. Jobs can fail in five distinct ways; monitor all of them
  3. Heartbeat monitoring is the solution: jobs ping when they complete
  4. Track duration and exit codes, not just "job ran"
  5. Set appropriate grace periods for natural variation

Don't wait for a disaster to discover that your backup hasn't run in weeks. Set up heartbeat monitoring for every scheduled job.