Every team that has been on-call long enough has the same complaint: too many alerts. Slack channels filled with red notifications. Phone buzzing at 3 AM for something that resolved itself 30 seconds later. Engineers who stop reading alerts because 90% of them are noise.
The standard advice is "reduce the number of alerts." That is only half the answer. The real problem is not volume. It is routing. The wrong alerts go to the wrong people at the wrong time through the wrong channels.
Why Alert Fatigue Is Dangerous
Alert fatigue does not just annoy engineers. It causes real incidents to get missed. When every alert feels like a false alarm, the real failure gets the same response as the rest: ignored.
Studies across industries (medicine, aviation, IT operations) show the same pattern: when alert rates exceed a threshold, response quality drops to near zero. Engineers develop coping mechanisms that bypass the alerting system entirely.
The Four Causes of Alert Noise
1. No Incident Conditions
Every failed check triggers an alert. A single timeout, a brief network hiccup, a CDN edge node restarting. Without incident conditions that require multiple consecutive failures before alerting, you get notifications for transient issues that resolve on their own.
```json
{
  "trigger": {
    "type": "consecutive_failures",
    "count": 3
  },
  "resolve": {
    "type": "consecutive_successes",
    "count": 2
  }
}
```

Requiring 3 consecutive failures before alerting eliminates most false positives. Requiring 2 consecutive successes before resolving prevents flapping (alert, resolve, alert, resolve).
2. No Severity Differentiation
When every alert has the same priority, none of them feel urgent. A staging environment SSL warning and a production database outage should not arrive through the same channel with the same sound.
3. No Time-Based Routing
Waking someone up at 3 AM for a non-critical issue is destructive. During business hours, a Slack message is fine. After hours, only critical issues should page. Time-based routing matches urgency to the disruption level.
4. Everyone Gets Everything
The frontend engineer doesn't need alerts about database replication lag. The backend engineer doesn't need alerts about the marketing site's SSL certificate. Route alerts to the team or person who can actually fix the problem.
Building a Smarter Routing Strategy
Step 1: Classify Your Checks by Severity
Not all checks are equal. Classify them:
- Critical: Production API down, database unreachable, payment processing broken. Page immediately.
- Warning: Response time degradation, SSL expiring in 7 days, queue depth growing. Notify during business hours.
- Info: Staging environment issues, non-critical endpoint degradation. Log it, review weekly.
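The classification can live in a simple lookup table. A sketch with hypothetical check names and a deliberately safe default (unclassified checks notify but never page):

```python
# Severity tiers and the policy attached to each (illustrative, not an API).
SEVERITY_POLICY = {
    "critical": {"channel": "page", "when": "always"},
    "warning": {"channel": "slack", "when": "business_hours"},
    "info": {"channel": "dashboard", "when": "weekly_review"},
}

# Hypothetical check names mapped to a tier.
CHECK_SEVERITY = {
    "prod-api-health": "critical",
    "db-primary-ping": "critical",
    "ssl-expiry-7d": "warning",
    "staging-smoke": "info",
}

def policy_for(check: str) -> dict:
    # Default unclassified checks to "warning": visible, but they never page.
    return SEVERITY_POLICY[CHECK_SEVERITY.get(check, "warning")]
```

Defaulting to "warning" rather than "critical" is a deliberate choice: a new, unreviewed check should never be able to wake someone up.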
Step 2: Set Up Incident Conditions
Require multiple failures before creating an incident. A single failed ping is noise. Three consecutive failures from multiple monitoring locations is a real problem.
Step 3: Route by Severity and Time
- Critical + any time: Phone call, SMS, push notification
- Critical + business hours: Slack + phone call
- Warning + business hours: Slack message
- Warning + after hours: Queue for morning review
- Info: Dashboard only
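The matrix above reduces to one small function. A sketch, assuming 09:00–18:00 business hours (adjust to your team's schedule):

```python
from datetime import time

BUSINESS_START, BUSINESS_END = time(9, 0), time(18, 0)  # assumed hours

def route(severity: str, now: time) -> list[str]:
    """Return notification channels per the severity/time matrix."""
    business_hours = BUSINESS_START <= now < BUSINESS_END
    if severity == "critical":
        channels = ["phone", "sms", "push"]
        if business_hours:
            channels.insert(0, "slack")  # critical also posts to Slack in hours
        return channels
    if severity == "warning":
        return ["slack"] if business_hours else ["morning_queue"]
    return ["dashboard"]  # info: never notifies anyone directly
```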
Step 4: Route by Component
Payment system alerts go to the payments team. Infrastructure alerts go to the platform team. Marketing site alerts go to the web team. If your team is small enough that everyone handles everything, skip this step.
Step 5: Add Escalation
If the primary on-call engineer does not acknowledge within 10 minutes, escalate to the backup. If the backup does not acknowledge, escalate to the team lead. Dead simple, but most teams skip this and end up with alerts that nobody sees.
If you are unsure where to begin, start with the wake-up test: for each alert, ask whether it would be worth waking an engineer at 3 AM. If the answer is no, it should never page after hours.
Measuring Improvement
Track these metrics to know if your routing is working:
- Alert-to-incident ratio: How many alerts result in real incidents? Target above 50%.
- After-hours pages per week: Track and aim to reduce.
- Mean time to acknowledge: If it is increasing, people are ignoring alerts.
- False positive rate: Track alerts that were acknowledged but required no action.
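These four metrics fall out of a single pass over your alert history. A sketch, assuming each alert record carries `real_incident`, `required_action`, `after_hours`, and `minutes_to_ack` fields (None if never acknowledged):

```python
def alert_metrics(alerts: list[dict]) -> dict:
    """Compute routing-health metrics from a list of alert records."""
    acked = [a for a in alerts if a["minutes_to_ack"] is not None]
    return {
        # Fraction of alerts that were real incidents; target above 0.5.
        "alert_to_incident_ratio": sum(a["real_incident"] for a in alerts) / len(alerts),
        "after_hours_pages": sum(a["after_hours"] for a in alerts),
        "mean_time_to_ack": sum(a["minutes_to_ack"] for a in acked) / len(acked),
        # Acknowledged but required no action: the quietest form of noise.
        "false_positive_rate": sum(not a["required_action"] for a in acked) / len(acked),
    }
```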
In upti.my, you can build these routing rules using workflows. Define conditions based on the check, the severity, the time of day, and the affected component. Chain actions together: notify Slack, wait 10 minutes, escalate to phone if unacknowledged.
Key Takeaways
- Alert fatigue is primarily a routing problem, not a volume problem
- Require multiple consecutive failures before alerting to eliminate transients
- Differentiate alerts by severity: critical pages, warnings notify, info logs
- Use time-based routing to protect after-hours rest
- Route to the person who can fix the problem, not everyone
- Track alert-to-incident ratio to measure noise reduction