Every team that has been on-call long enough has the same complaint: too many alerts. Slack channels filled with red notifications. Phone buzzing at 3 AM for something that resolved itself 30 seconds later. Engineers who stop reading alerts because 90% of them are noise.
The standard advice is "reduce the number of alerts." That is only half the answer. The real problem is not volume. It is routing. The wrong alerts go to the wrong people at the wrong time through the wrong channels.
Why Alert Fatigue Is Dangerous
Alert fatigue does not just annoy engineers. It causes real incidents to get missed. When every alert feels like a false alarm, the real failure gets the same response as the rest: ignored.
Studies across industries (medicine, aviation, IT operations) show the same pattern: when alert rates exceed a threshold, response quality drops to near zero. Engineers develop coping mechanisms that bypass the alerting system entirely.
The Four Causes of Alert Noise
1. No Incident Conditions
Every failed check triggers an alert. A single timeout, a brief network hiccup, a CDN edge node restarting. Without incident conditions that require multiple consecutive failures before alerting, you get notifications for transient issues that resolve on their own.
```json
{
  "trigger": {
    "type": "consecutive_failures",
    "count": 3
  },
  "resolve": {
    "type": "consecutive_successes",
    "count": 2
  }
}
```

Requiring 3 consecutive failures before alerting eliminates most false positives. Requiring 2 consecutive successes before resolving prevents flapping (alert, resolve, alert, resolve).
2. No Severity Differentiation
When every alert has the same priority, none of them feel urgent. A staging environment SSL warning and a production database outage should not arrive through the same channel with the same sound.
3. No Time-Based Routing
Waking someone up at 3 AM for a non-critical issue is destructive. During business hours, a Slack message is fine. After hours, only critical issues should page. Time-based routing matches urgency to the disruption level.
4. Everyone Gets Everything
The frontend engineer doesn't need alerts about database replication lag. The backend engineer doesn't need alerts about the marketing site's SSL certificate. Route alerts to the team or person who can actually fix the problem.
Building a Smarter Routing Strategy
Step 1: Classify Your Checks by Severity
Not all checks are equal. Classify them:
- Critical: Production API down, database unreachable, payment processing broken. Page immediately.
- Warning: Response time degradation, SSL expiring in 7 days, queue depth growing. Notify during business hours.
- Info: Staging environment issues, non-critical endpoint degradation. Log it, review weekly.
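The classification can live in a simple lookup table. A sketch with hypothetical check names and a deliberately safe default (unclassified checks notify but never page):

```python
# Severity tiers and the policy attached to each (illustrative, not an API).
SEVERITY_POLICY = {
    "critical": {"channel": "page", "when": "always"},
    "warning": {"channel": "slack", "when": "business_hours"},
    "info": {"channel": "dashboard", "when": "weekly_review"},
}

# Hypothetical check names mapped to a tier.
CHECK_SEVERITY = {
    "prod-api-health": "critical",
    "db-primary-ping": "critical",
    "ssl-expiry-7d": "warning",
    "staging-smoke": "info",
}

def policy_for(check: str) -> dict:
    # Default unclassified checks to "warning": visible, but they never page.
    return SEVERITY_POLICY[CHECK_SEVERITY.get(check, "warning")]
```

Defaulting to "warning" rather than "critical" is a deliberate choice: a new, unreviewed check should never be able to wake someone up.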
Step 2: Set Up Incident Conditions
Require multiple failures before creating an incident. A single failed ping is noise. Three consecutive failures from multiple monitoring locations is a real problem.
Step 3: Route by Severity and Time
- Critical + any time: Phone call, SMS, push notification
- Critical + business hours: Slack + phone call
- Warning + business hours: Slack message
- Warning + after hours: Queue for morning review
- Info: Dashboard only
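The matrix above reduces to one small function. A sketch, assuming 09:00–18:00 business hours (adjust to your team's schedule):

```python
from datetime import time

BUSINESS_START, BUSINESS_END = time(9, 0), time(18, 0)  # assumed hours

def route(severity: str, now: time) -> list[str]:
    """Return notification channels per the severity/time matrix."""
    business_hours = BUSINESS_START <= now < BUSINESS_END
    if severity == "critical":
        channels = ["phone", "sms", "push"]
        if business_hours:
            channels.insert(0, "slack")  # critical also posts to Slack in hours
        return channels
    if severity == "warning":
        return ["slack"] if business_hours else ["morning_queue"]
    return ["dashboard"]  # info: never notifies anyone directly
```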
Step 4: Route by Component
Payment system alerts go to the payments team. Infrastructure alerts go to the platform team. Marketing site alerts go to the web team. If your team is small enough that everyone handles everything, skip this step.
Step 5: Add Escalation
If the primary on-call engineer does not acknowledge within 10 minutes, escalate to the backup. If the backup does not acknowledge, escalate to the team lead. Dead simple, but most teams skip this and end up with alerts that nobody sees.
If you are unsure where to begin, start with the wake-up test: for each alert, ask whether it would be worth waking an engineer at 3 AM. If the answer is no, it should never page after hours.
Measuring Improvement
Track these metrics to know if your routing is working:
- Alert-to-incident ratio: How many alerts result in real incidents? Target above 50%.
- After-hours pages per week: Track and aim to reduce.
- Mean time to acknowledge: If it is increasing, people are ignoring alerts.
- False positive rate: Track alerts that were acknowledged but required no action.
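These four metrics fall out of a single pass over your alert history. A sketch, assuming each alert record carries `real_incident`, `required_action`, `after_hours`, and `minutes_to_ack` fields (None if never acknowledged):

```python
def alert_metrics(alerts: list[dict]) -> dict:
    """Compute routing-health metrics from a list of alert records."""
    acked = [a for a in alerts if a["minutes_to_ack"] is not None]
    return {
        # Fraction of alerts that were real incidents; target above 0.5.
        "alert_to_incident_ratio": sum(a["real_incident"] for a in alerts) / len(alerts),
        "after_hours_pages": sum(a["after_hours"] for a in alerts),
        "mean_time_to_ack": sum(a["minutes_to_ack"] for a in acked) / len(acked),
        # Acknowledged but required no action: the quietest form of noise.
        "false_positive_rate": sum(not a["required_action"] for a in acked) / len(acked),
    }
```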
In upti.my, you can build these routing rules using workflows. Define conditions based on the check, the severity, the time of day, and the affected component. Chain actions together: notify Slack, wait 10 minutes, escalate to phone if unacknowledged.
Key Takeaways
- Alert fatigue is primarily a routing problem, not a volume problem
- Require multiple consecutive failures before alerting to eliminate transients
- Differentiate alerts by severity: critical pages, warnings notify, info logs
- Use time-based routing to protect after-hours rest
- Route to the person who can fix the problem, not everyone
- Track alert-to-incident ratio to measure noise reduction