Infrastructure · 8 min read

How to Monitor Cron Jobs Properly

Cron jobs fail silently by design. Here is a practical guide to monitoring scheduled tasks with heartbeats, timeout detection, and output validation.

Cron is a 50-year-old scheduler with zero observability built in. It runs your job, maybe logs something to syslog, and moves on. If the job fails, crashes, or never starts, cron does not care. That is your problem.

Most teams learn this the hard way. A backup job fails for two weeks, a billing script silently errors out, or a database cleanup stops running after a server migration. By the time someone notices, the damage is done.

Here is how to monitor cron jobs properly so that doesn't happen.

The Heartbeat Pattern

The most reliable way to monitor a cron job is with a heartbeat (also called a dead man's switch). The idea is simple: your cron job pings a URL when it completes successfully. If the ping doesn't arrive within the expected window, you get alerted.

This catches three failure modes that exit-code checks and log alerts usually miss:

  • The job never ran (crontab deleted, server rebooted, wrong timezone)
  • The job started but crashed before finishing
  • The job ran but took too long
backup.sh
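#!/bin/bash
set -e

# Compute the filename once so both commands agree, even if the job runs across midnight
BACKUP_FILE="/backups/mydb_$(date +%Y%m%d).sql"

# Run backup
pg_dump mydb > "$BACKUP_FILE"
gzip "$BACKUP_FILE"

# Signal success (-f: fail on HTTP errors, -m: time limit in seconds, --retry: ride out brief network blips)
curl -fsS -m 10 --retry 3 -o /dev/null https://heartbeats.upti.my/v1/heartbeat/your-check-id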

Why curl at the end?

Put the heartbeat ping at the end of your script, after all work is done. If any command fails (and you have set -e), the script exits before reaching the curl. No ping means no success signal means you get alerted.
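The same pattern can be factored into a small wrapper, so a job gets a heartbeat without being edited itself. The sketch below is an illustration rather than upti.my's own tooling: `send_ping` and `run_with_heartbeat` are hypothetical helper names, and the curl limits (10 seconds, 3 retries) are assumed defaults you would tune.

```shell
#!/bin/bash
# Hypothetical helper: send one heartbeat ping.
# -f fails on HTTP errors, -m caps total time, --retry covers brief network blips.
send_ping() {
  curl -fsS -m 10 --retry 3 -o /dev/null "$1"
}

# Hypothetical wrapper: run a command and ping only if it exits 0.
# Usage: run_with_heartbeat <ping-url> <command> [args...]
run_with_heartbeat() {
  local url="$1"
  shift
  "$@" || return $?   # on failure, skip the ping and pass the exit code through
  send_ping "$url"
}

# Usage sketch (nightly-report.sh is a made-up job name):
# run_with_heartbeat "https://heartbeats.upti.my/v1/heartbeat/your-check-id" /usr/local/bin/nightly-report.sh
```

Because the wrapper returns the job's own exit code, cron and any calling script still see the original failure.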

Setting Proper Timeouts

A heartbeat check needs a grace period. If your job runs every hour, you don't want an alert if it's 30 seconds late. But you do want an alert if it's 15 minutes late.

A good rule of thumb:

  • Period: Match your cron schedule. Job runs every hour? Period is 1 hour.
  • Grace period: 2x the typical job duration. Job takes 5 minutes? Grace period is 10 minutes.

Too tight and you get false alarms. Too loose and you don't catch failures fast enough. Start with a generous grace period and tighten it as you learn your job's actual behavior.
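As a worked example of the rule of thumb: with a 1 hour period and a 10 minute grace period, a ping that last arrived at 02:00 makes the check overdue once the clock passes 03:10. The function below is a generic model of that arithmetic, not a description of how upti.my evaluates checks internally; `is_overdue` and its argument order are made up for illustration.

```shell
#!/bin/bash
# Hypothetical model of a heartbeat deadline. All values are in seconds.
# Usage: is_overdue <last_ping_epoch> <period> <grace> <now_epoch>
# Succeeds (exit 0) when "now" is past last_ping + period + grace.
is_overdue() {
  local last_ping="$1" period="$2" grace="$3" now="$4"
  [ "$now" -gt $(( last_ping + period + grace )) ]
}

# Hourly job (3600 s) with a 10 minute grace period (600 s):
# the deadline is 4200 s after the last ping.
if is_overdue 0 3600 600 4000; then echo "overdue"; else echo "on time"; fi   # prints "on time"
if is_overdue 0 3600 600 4500; then echo "overdue"; else echo "on time"; fi   # prints "overdue"
```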

Monitoring Job Duration

A job that used to take 2 minutes and now takes 45 minutes is a problem even if it still completes. Growing duration usually means growing data, resource contention, or a query that lost its index.

backup-with-timing.sh
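#!/bin/bash
set -e

PING_URL="https://heartbeats.upti.my/v1/heartbeat/your-check-id"
BACKUP_FILE="/backups/mydb_$(date +%Y%m%d).sql"

# Signal start. The URL is quoted so the shell never treats "?" as a glob,
# and "|| true" keeps a failed start ping from aborting the backup under set -e.
curl -fsS -m 10 --retry 3 -o /dev/null "${PING_URL}?step=start" || true

# Run backup
pg_dump mydb > "$BACKUP_FILE"
gzip "$BACKUP_FILE"

# Signal completion
curl -fsS -m 10 --retry 3 -o /dev/null "${PING_URL}?step=complete"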

By pinging both at the start and end, you can track job duration over time and set alerts when it crosses a threshold.
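If you also want a local record of how long each run took, independent of the monitoring service, the script can time itself. This is a minimal sketch: `timed_run`, the `ELAPSED` variable, and the 600 second threshold in the usage comment are illustrative choices, not part of any upti.my API.

```shell
#!/bin/bash
# Hypothetical helper: run a command, store its wall-clock duration
# (whole seconds) in ELAPSED, and return the command's own exit code.
timed_run() {
  local start end status=0
  start=$(date +%s)
  "$@" || status=$?        # capture the exit code without tripping set -e
  end=$(date +%s)
  ELAPSED=$(( end - start ))
  return "$status"
}

# Usage sketch (the 600 s threshold is an assumed value):
# timed_run pg_dump -f /backups/mydb.sql mydb
# if [ "$ELAPSED" -gt 600 ]; then
#   echo "backup took ${ELAPSED}s, expected under 600s" >&2
# fi
```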

Validating Job Output

A job can complete successfully and still produce bad results. The backup finishes, but the file is 0 bytes. The report generates, but it's empty. The sync runs, but it processed 0 records.

backup-validated.sh
#!/bin/bash
set -e

# Build both filenames once so the dump, the gzip step, and the check all refer to the same file
SQL_FILE="/backups/mydb_$(date +%Y%m%d).sql"
BACKUP_FILE="$SQL_FILE.gz"

pg_dump mydb > "$SQL_FILE"
gzip "$SQL_FILE"

# Validate output (wc -c reports the size in bytes on both Linux and macOS;
# the arithmetic expansion strips any padding in its output)
FILE_SIZE=$(( $(wc -c < "$BACKUP_FILE") ))
if [ "$FILE_SIZE" -lt 1000 ]; then
  echo "Backup file suspiciously small: $FILE_SIZE bytes" >&2
  exit 1
fi

curl -fsS -m 10 --retry 3 -o /dev/null https://heartbeats.upti.my/v1/heartbeat/your-check-id
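Size is only a rough proxy. For a gzip archive you can also ask gzip to verify the file's integrity, which catches truncated or corrupted output. The sketch below combines both checks; `validate_backup` and the default 1000 byte minimum are illustrative, and the right threshold depends on your data.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the file exists, is at least
# <min_bytes> bytes (default 1000), and passes gzip's integrity test.
# Usage: validate_backup <file.gz> [min_bytes]
validate_backup() {
  local file="$1" min_bytes="${2:-1000}" size
  if [ ! -f "$file" ]; then
    echo "Backup file missing: $file" >&2
    return 1
  fi
  # The arithmetic expansion strips the padding some wc implementations add.
  size=$(( $(wc -c < "$file") ))
  if [ "$size" -lt "$min_bytes" ]; then
    echo "Backup file suspiciously small: $size bytes" >&2
    return 1
  fi
  # gzip -t reads the whole archive and fails if it is truncated or corrupt.
  if ! gzip -t "$file" 2>/dev/null; then
    echo "Backup file failed gzip integrity test: $file" >&2
    return 1
  fi
}

# Usage sketch: only signal success when the archive looks sane.
# validate_backup "$BACKUP_FILE" || exit 1
```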

Handling Failure Signals

When a job fails, you want to know why. Capture the exit code and send the command's output along with your failure ping:

job-with-error-handling.sh
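#!/bin/bash
# No set -e here: the exit code is captured and handled explicitly.

PING_URL="https://heartbeats.upti.my/v1/heartbeat/your-check-id"

curl -fsS -m 10 --retry 3 -o /dev/null "$PING_URL/start"

# Capture stdout and stderr together, then the command's exit code
OUTPUT=$(your_actual_command 2>&1)
EXIT_CODE=$?

if [ "$EXIT_CODE" -ne 0 ]; then
  # --data-raw sends the text as-is, even if it begins with "@"
  curl -fsS -m 10 --retry 3 -o /dev/null "$PING_URL/fail" --data-raw "$OUTPUT"
  exit "$EXIT_CODE"
fi

curl -fsS -m 10 --retry 3 -o /dev/null "$PING_URL" --data-raw "$OUTPUT"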

Common Mistakes

1. Monitoring the Cron Daemon Instead of the Job

Checking that crond is running tells you nothing about whether your specific jobs are executing. The daemon can be healthy while half your jobs are silently failing.

2. Only Checking Exit Codes

Many scripts don't use set -e and don't check return values. A command fails midway, the script continues, and exits 0. Your monitoring sees success.
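The difference is easy to reproduce. The snippet below runs the same script body with and without `set -e`, then shows a related gap: in a pipeline, bash reports only the last command's status unless `set -o pipefail` is also enabled. It assumes `bash` is on the PATH, and `exit_code_of` is a throwaway helper name.

```shell
#!/bin/bash
# Throwaway helper: run a script body in a fresh bash and print its exit code.
exit_code_of() {
  local status=0
  bash -c "$1" >/dev/null 2>&1 || status=$?
  echo "$status"
}

exit_code_of 'false; echo "cleanup done"'            # prints 0: the failure is masked
exit_code_of 'set -e; false; echo "cleanup done"'    # prints 1: the script stops at the failure
exit_code_of 'set -e; false | cat'                   # prints 0: cat succeeded, so the failure is hidden
exit_code_of 'set -e -o pipefail; false | cat'       # prints 1: pipefail surfaces it
```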

3. Forgetting Timezone Issues

Cron uses the system timezone. If you migrated to a cloud VM in UTC but your crontab expects EST, your jobs run at the wrong time. Heartbeat monitoring catches this because the ping arrives outside the expected window.
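A quick way to spot this after a migration is to compare the machine's UTC offset with the one your schedule was written for. A minimal sketch, where `tz_offset_matches` is a hypothetical helper and `-0500` (EST) is just the example value from the paragraph above:

```shell
#!/bin/bash
# Hypothetical helper: succeed if the system's current UTC offset
# (for example +0000 or -0500) equals the offset the crontab was written for.
# Offsets shift with daylight saving time, so treat this as a coarse check.
tz_offset_matches() {
  local expected="$1" actual
  actual=$(date +%z)
  if [ "$actual" != "$expected" ]; then
    echo "Timezone mismatch: system offset is $actual, crontab assumes $expected" >&2
    return 1
  fi
}

# Usage sketch: warn before installing a crontab written for EST.
# tz_offset_matches -0500 || echo "review cron schedule times" >&2
```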

4. Not Monitoring Jobs After Infrastructure Changes

Server migration, container rebuild, OS upgrade. These are the events that silently kill crontabs. If you rely on heartbeat monitoring, you catch this as soon as the first expected ping is missed, rather than weeks later.
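After any such change it is also cheap to confirm directly that the entry survived. A small sketch, assuming the job lives in the current user's crontab; `crontab_has_job` is a hypothetical helper that reads a crontab listing on standard input so it can be fed from `crontab -l`:

```shell
#!/bin/bash
# Hypothetical helper: succeed if a crontab listing (read from stdin) contains
# an active, non-commented line that mentions the given command.
# Usage: crontab -l | crontab_has_job <command-substring>
crontab_has_job() {
  awk -v needle="$1" '
    /^[ \t]*#/ { next }                      # skip commented-out entries
    index($0, needle) > 0 { found = 1 }      # plain substring match, no regex
    END { exit !found }
  '
}

# Usage sketch:
# crontab -l 2>/dev/null | crontab_has_job /usr/local/bin/backup.sh \
#   || echo "backup.sh is missing from the crontab" >&2
```

This complements the heartbeat rather than replacing it: the listing check confirms the entry exists, while the heartbeat confirms the job actually ran and finished.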

Setting This Up in upti.my

  1. Create a heartbeat healthcheck in the dashboard
  2. Set the expected period and grace time
  3. Copy the ping URL and add it to your cron script
  4. Optionally add start/fail pings for duration tracking and failure details
  5. Configure alerts to Slack, email, or webhook

Every ping is logged with timestamps, so you get a full history of when your job ran, how long it took, and whether it succeeded.

📌 Key Takeaways

  1. Use heartbeat (dead man's switch) monitoring, not process checks
  2. Ping at the end of the script so failures prevent the success signal
  3. Set grace periods based on actual job duration, not arbitrary values
  4. Track job duration over time to catch performance degradation
  5. Validate job output before signaling success
  6. Always use set -e in bash scripts to catch mid-script failures