Infrastructure · 9 min read

Detecting Silent Failures in Background Workers

Queue workers fail without fanfare. Learn patterns for detecting when your background jobs stop processing.

Background workers are the invisible backbone of modern applications. They process emails, resize images, sync data, and handle everything users don't want to wait for. And when they stop working, nobody notices. Not until the damage is done.

The Silent Failure Problem

Unlike web servers that immediately show errors to users, background workers fail silently. There's no user to complain. No HTTP 500. No immediate symptoms.

🚨

How Workers Fail Silently

  • Worker process dies. Container crash, OOM kill, or unhandled exception.
  • Worker stuck on one job. Infinite loop, deadlock, or waiting on external service.
  • Worker can't connect to queue. Redis/RabbitMQ credentials expired or network issue.
  • Worker processing but failing every job. All jobs error out and move to dead letter queue.
  • Worker too slow. Processing but can't keep up with incoming job rate.

Detection Patterns

Pattern 1: Heartbeat Pings

The simplest approach: your worker pings an external service at regular intervals. If the pings stop, the worker is dead.

email-worker.ts
// Worker with heartbeat
class EmailWorker {
  private heartbeatInterval?: NodeJS.Timeout;

  start() {
    // Ping every minute while processing; a failed ping
    // should never crash the worker itself
    this.heartbeatInterval = setInterval(() => {
      fetch('https://agents.upti.my/v1/heartbeat/email-worker')
        .catch(() => { /* ignore transient ping failures */ });
    }, 60000);

    this.processQueue();
  }

  stop() {
    if (this.heartbeatInterval) {
      clearInterval(this.heartbeatInterval);
    }
  }
}
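The monitoring side is the mirror image: record the timestamp of each ping and alert when the gap exceeds a few intervals. A minimal sketch of that check (the threshold of 3 missed pings is an illustrative default, not an upti.my setting):

```typescript
// Decide whether a worker is dead based on its last heartbeat.
// A single missed ping may just be network jitter, so only
// alert after several consecutive intervals with no ping.
function isWorkerDead(
  lastPingMs: number,   // timestamp of the most recent ping
  nowMs: number,        // current time
  intervalMs: number,   // expected ping interval (60s above)
  missedThreshold = 3   // how many missed pings count as dead
): boolean {
  return nowMs - lastPingMs > intervalMs * missedThreshold;
}
```

For example, `isWorkerDead(lastPing, Date.now(), 60000)` fires only after more than three minutes of silence.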

Pattern 2: Queue Depth Monitoring

Monitor the queue length over time. A growing queue means workers aren't keeping up.

queue-monitor.ts
// Monitor queue depth
const queueDepth = await redis.llen('jobs:email');
await fetch('https://upti.my/metrics', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    metric: 'queue_depth',
    queue: 'email',
    value: queueDepth
  })
});

// Alert if queue depth growing consistently
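One way to act on that comment: keep a short window of recent depth samples and alert only when the trend is consistently upward, which filters out normal bursts that drain again. A sketch (the window logic is illustrative):

```typescript
// Returns true when queue depth has grown across the whole window:
// every sample is >= the previous one and the last is strictly
// larger than the first. Brief spikes that drain won't trigger it.
function isConsistentlyGrowing(depths: number[]): boolean {
  if (depths.length < 2) return false;
  for (let i = 1; i < depths.length; i++) {
    if (depths[i] < depths[i - 1]) return false;
  }
  return depths[depths.length - 1] > depths[0];
}
```

Feed it the last handful of samples, e.g. one per minute: `isConsistentlyGrowing([5, 8, 8, 12])` alerts, `isConsistentlyGrowing([5, 8, 3, 12])` does not.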

Pattern 3: Job Age Monitoring

Track how long jobs wait before being processed. Old jobs indicate worker problems.

job-age-check.ts
// Check oldest job in queue (with LPUSH the oldest entry sits
// at the tail). Entries are JSON strings, so parse before
// reading the enqueue timestamp.
const oldestJob = await redis.lindex('jobs:email', -1);
if (oldestJob) {
  const jobAge = Date.now() - JSON.parse(oldestJob).createdAt;
  if (jobAge > 5 * 60 * 1000) { // 5 minutes
    alert('Email queue backlogged');
  }
}
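For this check to work, the producer has to stamp each job when it is enqueued. A sketch of that envelope (the field names are illustrative, not a queue-library convention):

```typescript
// Wrap a payload in an envelope that records when it was
// enqueued. The age check above parses this back out of Redis.
function makeJob(payload: unknown): string {
  return JSON.stringify({ createdAt: Date.now(), payload });
}

// Age in milliseconds of a serialized job, given the current time.
function jobAgeMs(serialized: string, nowMs: number): number {
  return nowMs - JSON.parse(serialized).createdAt;
}
```

The producer then pushes with something like `redis.lpush('jobs:email', makeJob({ to: 'user@example.com' }))`.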

Pattern 4: Throughput Monitoring

Track jobs processed per minute. A drop in throughput indicates problems.

throughput-tracker.ts
// Track successful job completion per minute bucket
function getMinuteBucket(): number {
  return Math.floor(Date.now() / 60000);
}

async function processJob(job) {
  try {
    await handleJob(job);
    // Increment success counter for the current minute
    await redis.incr('stats:email:success:' + getMinuteBucket());
  } catch (error) {
    await redis.incr('stats:email:failure:' + getMinuteBucket());
    throw error;
  }
}
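Once the per-minute counters exist, detection is comparing the latest bucket against recent history. A sketch, assuming the counters have already been read back into an array (the 50% threshold is an arbitrary starting point, not a recommendation):

```typescript
// Alert when the most recent minute processed less than half the
// average of the preceding minutes. `history` holds per-minute
// success counts, oldest first, with the current minute last.
function throughputDropped(history: number[], ratio = 0.5): boolean {
  if (history.length < 2) return false;
  const past = history.slice(0, -1);
  const baseline = past.reduce((a, b) => a + b, 0) / past.length;
  const current = history[history.length - 1];
  return baseline > 0 && current < baseline * ratio;
}
```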

Pattern 5: Dead Letter Queue Monitoring

Failed jobs eventually end up in a dead letter queue. Monitor its growth.

dlq-monitor.ts
// Monitor dead letter queue
const dlqDepth = await redis.llen('jobs:email:dead');
if (dlqDepth > 100) {
  alert('Too many failed email jobs');
}

Combining Patterns

No single pattern catches all failures. Use multiple approaches:

💡

Failure Mode Detection Matrix

Failure Mode           Detection Pattern
Worker process died    Heartbeat stops
Worker stuck on job    Heartbeat stops, queue grows
All jobs failing       DLQ growing, throughput drops
Worker too slow        Queue depth growing, job age increasing
Partial failures       Error rate increasing, DLQ growing

Implementation with upti.my

upti.my provides integrated monitoring for background workers:

  • Heartbeat endpoints. Workers ping on regular intervals.
  • Custom metrics. Track queue depth, throughput, and error rates.
  • Threshold alerts. Alert when metrics cross boundaries.
  • Grace periods. Avoid false alarms during deployments.
⚠️

Common Mistakes

  • Only monitoring the queue server. Redis being up doesn't mean workers are processing.
  • Heartbeat too frequent. Creates noise, wastes resources.
  • Heartbeat from wrong place. Ping from the processing loop, not the main process.
  • Ignoring error rates. Workers can be "working" while failing every job.
  • No baseline. You need to know normal throughput to detect anomalies.
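That baseline can be as simple as a rolling mean and standard deviation over recent samples, flagging values far outside the normal range. A hedged sketch (the 3-sigma cutoff is a common default, not a rule):

```typescript
// Flag a sample as anomalous when it sits more than `sigmas`
// standard deviations from the mean of the baseline window.
function isAnomaly(baseline: number[], sample: number, sigmas = 3): boolean {
  const mean = baseline.reduce((a, b) => a + b, 0) / baseline.length;
  const variance =
    baseline.reduce((a, b) => a + (b - mean) ** 2, 0) / baseline.length;
  const stddev = Math.sqrt(variance);
  // A perfectly flat baseline makes any different value anomalous.
  if (stddev === 0) return sample !== mean;
  return Math.abs(sample - mean) > sigmas * stddev;
}
```

With jobs-per-minute counts of `[100, 102, 98, 100]`, a sudden reading of 120 is flagged while 101 is not.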

📌 Key Takeaways

  1. Background workers fail silently and no user sees the error
  2. Use heartbeats to detect dead or stuck workers
  3. Monitor queue depth and job age for capacity issues
  4. Track throughput and error rates for quality issues
  5. Combine multiple patterns because no single metric catches everything

Background workers are critical infrastructure. Give them the monitoring they deserve.


Written by

Engineering Team

Ready to try upti.my?

14-day free trial of Pro plan. No credit card required.