Cron is a 50-year-old scheduler with zero observability built in. It runs your job, maybe logs something to syslog, and moves on. If the job fails, crashes, or never starts, cron does not care. That is your problem.
Most teams learn this the hard way. A backup job fails for two weeks, a billing script silently errors out, or a database cleanup stops running after a server migration. By the time someone notices, the damage is done.
Here is how to monitor cron jobs properly so that doesn't happen.
The Heartbeat Pattern
The most reliable way to monitor a cron job is with a heartbeat (also called a dead man's switch). The idea is simple: your cron job pings a URL when it completes successfully. If the ping doesn't arrive within the expected window, you get alerted.
This catches three failure modes that exit-code checks and process monitoring tend to miss:
- The job never ran (crontab deleted, server rebooted, wrong timezone)
- The job started but crashed before finishing
- The job ran but took too long
#!/bin/bash
set -e
# Run backup
# Compute the filename once so a run that crosses midnight still gzips the right file
BACKUP_FILE="/backups/mydb_$(date +%Y%m%d).sql"
pg_dump mydb > "$BACKUP_FILE"
gzip "$BACKUP_FILE"
# Signal success
curl -s https://heartbeats.upti.my/v1/heartbeat/your-check-id
Why curl at the end?
If any command fails (thanks to set -e), the script exits before reaching the curl. No ping means no success signal, which means you get alerted.
Setting Proper Timeouts
A heartbeat check needs a grace period. If your job runs every hour, you don't want an alert if it's 30 seconds late. But you do want an alert if it's 15 minutes late.
A good rule of thumb:
- Period: Match your cron schedule. Job runs every hour? Period is 1 hour.
- Grace period: 2x the typical job duration. Job takes 5 minutes? Grace period is 10 minutes.
Too tight and you get false alarms. Too loose and you don't catch failures fast enough. Start conservative and tighten as you learn your job's actual behavior.
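The rule of thumb above can be sketched as a small calculation. The helper names and sample numbers below are illustrative assumptions, not part of any monitoring service's API; a hosted heartbeat check performs an equivalent comparison for you.

```shell
#!/bin/bash
# Sketch: derive a grace period and decide whether a check is overdue.
# grace_seconds and check_status are hypothetical helpers for illustration.

# Grace period = 2x the typical job duration (both in seconds).
grace_seconds() {
    local typical_duration=$1
    echo $(( typical_duration * 2 ))
}

# A check is overdue once now > last_ping + period + grace.
# Prints "overdue" or "ok". All arguments are epoch seconds or durations.
check_status() {
    local last_ping=$1 period=$2 grace=$3 now=$4
    if (( now > last_ping + period + grace )); then
        echo "overdue"
    else
        echo "ok"
    fi
}

# Example: an hourly job that typically takes 5 minutes.
period=3600
grace=$(grace_seconds 300)                # 600 seconds
check_status 0 "$period" "$grace" 3630    # 30 seconds late: prints ok
check_status 0 "$period" "$grace" 4500    # 15 minutes late: prints overdue
```

With these numbers the alert deadline is 70 minutes after the last ping, so a run that is 30 seconds late passes quietly while one that is 15 minutes late triggers an alert.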
Monitoring Job Duration
A job that used to take 2 minutes and now takes 45 minutes is a problem even if it still completes. Growing duration usually means growing data, resource contention, or a query that lost its index.
#!/bin/bash
set -e
# Signal start (|| true so a monitoring outage doesn't abort the backup under set -e)
curl -s "https://heartbeats.upti.my/v1/heartbeat/your-check-id?step=start" || true
# Run backup (compute the filename once so both commands use the same path)
BACKUP_FILE="/backups/mydb_$(date +%Y%m%d).sql"
pg_dump mydb > "$BACKUP_FILE"
gzip "$BACKUP_FILE"
# Signal completion
curl -s "https://heartbeats.upti.my/v1/heartbeat/your-check-id?step=complete"
By pinging both at the start and end, you can track job duration over time and set alerts when it crosses a threshold.
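Independently of the start and completion pings, the script can also time itself locally as a safety net. This is a minimal sketch: the `run_timed` helper and the `MAX_SECONDS` threshold are hypothetical names introduced for illustration, and the 10-minute default is an assumption you would replace with a value based on your job's real history.

```shell
#!/bin/bash
# Sketch: time a command locally and warn when it exceeds a threshold.
# run_timed and MAX_SECONDS are hypothetical names for illustration.

MAX_SECONDS=${MAX_SECONDS:-600}   # assumed threshold: 10 minutes

# Runs the given command, logs its duration to stderr,
# and returns the command's own exit code.
run_timed() {
    local start elapsed status=0
    start=$(date +%s)
    "$@" || status=$?             # capture failure without tripping set -e
    elapsed=$(( $(date +%s) - start ))
    echo "duration=${elapsed}s exit=${status}" >&2
    if (( elapsed > MAX_SECONDS )); then
        echo "WARNING: job exceeded ${MAX_SECONDS}s" >&2
    fi
    return "$status"
}

# Usage (example command):
#   run_timed pg_dump mydb > /backups/mydb.sql
```

Because the duration goes to stderr, it ends up wherever cron sends the job's output, which gives you a local record even if the pings themselves fail to send.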
Validating Job Output
A job can complete successfully and still produce bad results. The backup finishes, but the file is 0 bytes. The report generates, but it's empty. The sync runs, but it processed 0 records.
#!/bin/bash
set -e
BACKUP_FILE="/backups/mydb_$(date +%Y%m%d).sql.gz"
# Derive the uncompressed path from BACKUP_FILE so the date is only computed once
pg_dump mydb > "${BACKUP_FILE%.gz}"
gzip "${BACKUP_FILE%.gz}"
# Validate output
FILE_SIZE=$(stat -f%z "$BACKUP_FILE" 2>/dev/null || stat -c%s "$BACKUP_FILE")
if [ "$FILE_SIZE" -lt 1000 ]; then
    echo "Backup file suspiciously small: $FILE_SIZE bytes"
    exit 1
fi
curl -s https://heartbeats.upti.my/v1/heartbeat/your-check-id
Handling Failure Signals
When a job fails, you want to know why. Send the exit code and any error output along with your failure ping:
#!/bin/bash
PING_URL="https://heartbeats.upti.my/v1/heartbeat/your-check-id"
curl -s "$PING_URL/start"
OUTPUT=$(your_actual_command 2>&1)
EXIT_CODE=$?
if [ $EXIT_CODE -ne 0 ]; then
    curl -s "$PING_URL/fail" -d "$OUTPUT"
    exit $EXIT_CODE
fi
curl -s "$PING_URL" -d "$OUTPUT"
Common Mistakes
1. Monitoring the Cron Daemon Instead of the Job
Checking that crond is running tells you nothing about whether your specific jobs are executing. The daemon can be healthy while half your jobs are silently failing.
2. Only Checking Exit Codes
Many scripts don't use set -e and don't check return values. A command fails midway, the script continues, and exits 0. Your monitoring sees success.
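To see how a mid-script failure can hide behind exit code 0, compare the same failing step with and without strict mode. This is an illustrative sketch: `false` stands in for any command that fails midway, and `exit_code_of` is a hypothetical helper. The `pipefail` option goes slightly beyond the `set -e` discussed here; it is included because `set -e` alone does not catch a failure on the left side of a pipe.

```shell
#!/bin/bash
# Sketch: how a mid-script failure can still end in exit code 0.
# exit_code_of is a hypothetical helper: it runs a snippet in a child
# bash process and prints that process's exit code.

exit_code_of() {
    local status=0
    bash -c "$1" >/dev/null || status=$?
    echo "$status"
}

# No strict mode: the failure is ignored and the script "succeeds".
exit_code_of 'false; echo "cleanup done"'            # prints 0

# set -e: the script stops at the failing command.
exit_code_of 'set -e; false; echo "cleanup done"'    # prints 1

# set -e alone: a failure feeding a pipe is still masked.
exit_code_of 'set -e; false | cat'                   # prints 0

# set -e plus pipefail: the pipeline failure is surfaced.
exit_code_of 'set -eo pipefail; false | cat'         # prints 1
```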
3. Forgetting Timezone Issues
Cron uses the system timezone. If you migrated to a cloud VM in UTC but your crontab expects EST, your jobs run at the wrong time. Heartbeat monitoring catches this because the ping arrives outside the expected window.
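To make the mismatch concrete, the sketch below prints the hour of a single instant in two timezones. It assumes GNU `date` (the `-d @EPOCH` form is not available in BSD/macOS `date`), and `hour_in_zone` is a hypothetical helper. `UTC0` and `EST5` are POSIX-style TZ strings for fixed offsets, used here so the example does not depend on an installed timezone database.

```shell
#!/bin/bash
# Sketch: print the hour of a given instant in a given timezone.
# hour_in_zone is a hypothetical helper; "date -d @EPOCH" assumes GNU date.

hour_in_zone() {
    local zone=$1 epoch=$2
    TZ=$zone date -d "@$epoch" +%H
}

# Epoch 7200 is 02:00 UTC, standing in for a "0 2 * * *" entry
# firing on a server whose system timezone is UTC.
hour_in_zone UTC0 7200   # prints 02
hour_in_zone EST5 7200   # prints 21
```

A job the team expected at 02:00 EST therefore fires at what they would read as 21:00 the previous evening, five hours earlier than planned.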
4. Not Monitoring Jobs After Infrastructure Changes
Server migrations, container rebuilds, and OS upgrades are the events that silently kill crontabs. If you rely on heartbeat monitoring, you catch this as soon as the first expected ping is missed.
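As a complement to heartbeats, a quick post-migration check can confirm that the entry still exists in the crontab. A minimal sketch, assuming a hypothetical `has_cron_entry` helper and an example script path:

```shell
#!/bin/bash
# Sketch: verify that a crontab still contains an active entry for a script.
# has_cron_entry is a hypothetical helper; it reads crontab text on stdin
# and succeeds only if a non-comment line mentions the given path.

has_cron_entry() {
    local script_path=$1
    # Drop comment lines, then search for the path as a fixed string.
    grep -v '^[[:space:]]*#' | grep -F -- "$script_path" >/dev/null
}

# Usage after a migration (the path is an example):
#   crontab -l 2>/dev/null | has_cron_entry /usr/local/bin/backup.sh \
#       || echo "backup.sh is missing from the crontab" >&2
```

This only proves the entry is present, not that the job runs or succeeds, so it supplements the heartbeat rather than replacing it.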
Setting This Up in upti.my
- Create a heartbeat healthcheck in the dashboard
- Set the expected period and grace time
- Copy the ping URL and add it to your cron script
- Optionally add start/fail pings for duration tracking and failure details
- Configure alerts to Slack, email, or webhook
Every ping is logged with timestamps, so you get a full history of when your job ran, how long it took, and whether it succeeded.
📌 Key Takeaways
1. Use heartbeat (dead man's switch) monitoring, not process checks
2. Ping at the end of the script so failures prevent the success signal
3. Set grace periods based on actual job duration, not arbitrary values
4. Track job duration over time to catch performance degradation
5. Validate job output before signaling success
6. Always use set -e in bash scripts to catch mid-script failures