Incident Conditions
Define the conditions that determine when incidents are created. Incident conditions evaluate healthcheck results and create incidents when real problems are detected.
Overview
Incident conditions are the rules that tell upti.my when to create an incident. Every healthcheck can have one or more conditions attached to it. When a condition is met, upti.my automatically opens a new incident and records the affected healthcheck, timestamp, and severity.
Conditions focus on one thing: deciding whether a problem is real enough to warrant an incident. They do not handle notifications, enrichment, or routing. That part is handled by Workflows, where you configure destinations, message formatting, escalation chains, and everything else using a visual drag-and-drop builder.
ℹ️ Conditions Create. Workflows Notify.
Think of it this way: incident conditions decide when an incident is created. Workflows decide what happens next. This separation keeps your setup clean. You define detection logic in conditions and notification logic in workflows.
Common Settings
All condition types share the following configurable settings:
| Setting | Description |
|---|---|
| Severity | The severity assigned to incidents created by this condition: critical, warning, or info. Workflows can use severity to route notifications to the right channels. |
| Cooldown Period | Minimum time between incidents from the same condition. Prevents creating duplicate incidents for the same ongoing problem. |
| Working Hours | Optionally restrict incident creation to specific hours and days. Issues outside working hours are still recorded but do not create incidents until the next working period. |
| Tags | Organize and filter conditions with custom tags. Tags carry through to the created incident and can be used in workflow routing. |
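As an illustration of how a cooldown period suppresses duplicates, here is a minimal sketch. The function name and signature are hypothetical, not part of the upti.my API:

```python
import time

def should_create_incident(last_incident_at, cooldown_seconds, now=None):
    """Return True when the cooldown since the previous incident from the
    same condition has fully elapsed (hypothetical helper, for illustration).

    last_incident_at: Unix timestamp of the last incident, or None.
    """
    now = time.time() if now is None else now
    if last_incident_at is None:
        return True  # no previous incident: nothing to suppress
    return (now - last_incident_at) >= cooldown_seconds

# With a 300-second cooldown, a condition that fired 120 seconds ago
# is still suppressed; once 300 seconds have passed it may fire again.
print(should_create_incident(1000.0, 300, now=1120.0))  # False
print(should_create_incident(1000.0, 300, now=1300.0))  # True
```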
Condition Types
1. Simple
The most straightforward condition. It creates an incident when a healthcheck fails a specified number of consecutive times. This is the default condition type and works well for clear-cut "is it up or down" monitoring.
| Field | Type | Description |
|---|---|---|
| threshold_count | integer | Number of consecutive failures before creating an incident. Default: 3. |
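The consecutive-failure logic can be sketched as follows (an illustrative model, not the upti.my implementation):

```python
def simple_condition_met(results, threshold_count=3):
    """results: list of booleans, oldest first (True = check passed).
    Fires when the most recent `threshold_count` results are all failures."""
    if len(results) < threshold_count:
        return False
    return all(not ok for ok in results[-threshold_count:])

print(simple_condition_met([True, False, False, False]))   # True: 3 failures in a row
print(simple_condition_met([False, False, True, False]))   # False: streak was broken
```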
```json
{
  "type": "simple",
  "threshold_count": 3,
  "severity": "critical",
  "cooldown_seconds": 300
}
```

2. Threshold
Creates an incident based on the percentage of failures within a rolling time window. This is useful for services with occasional transient failures: you might tolerate a 10% failure rate but want an incident at 50%.
| Field | Type | Description |
|---|---|---|
| failure_percentage | integer (0-100) | Failure percentage that triggers incident creation, e.g., 50 means 50% failures. |
| window_seconds | integer | Rolling time window in seconds. Default: 300 (5 minutes). |
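The rolling-window calculation can be sketched like this (an illustrative model; function and data shapes are assumptions, not the upti.my API):

```python
def threshold_condition_met(results, failure_percentage, window_seconds, now):
    """results: list of (timestamp, ok) pairs. Fires when the failure rate
    among results inside the rolling window reaches failure_percentage."""
    window = [ok for ts, ok in results if now - ts <= window_seconds]
    if not window:
        return False  # no data in the window: nothing to evaluate
    failed = sum(1 for ok in window if not ok)
    return failed * 100 / len(window) >= failure_percentage

# 3 failures out of 6 results in the last 600 seconds = 50% failure rate.
recent = [(0, True), (100, False), (200, True), (300, False), (400, False), (500, True)]
print(threshold_condition_met(recent, failure_percentage=50, window_seconds=600, now=500))  # True
```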
```json
{
  "type": "threshold",
  "failure_percentage": 50,
  "window_seconds": 600,
  "severity": "warning",
  "cooldown_seconds": 600
}
```

ℹ️ Window Size Matters
Shorter windows (1 to 5 minutes) detect issues faster but may create incidents from transient failures. Longer windows (10 to 30 minutes) are more stable but slower to react. Match the window size to the criticality of the service.
3. Pattern
Detects specific failure patterns rather than simple counts or percentages. Pattern conditions excel at identifying flapping services (rapidly alternating between up and down) and specific sequences of errors that indicate a degrading system.
| Field | Type | Description |
|---|---|---|
| pattern_type | string | Pattern to detect: flapping or consecutive_errors |
| flap_threshold | integer | For flapping: number of state changes within the window that triggers incident creation. Default: 5. |
| consecutive_count | integer | For consecutive_errors: number of errors in a row. Default: 5. |
| window_seconds | integer | Time window for pattern evaluation. Default: 600 (10 minutes). |
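Flapping detection amounts to counting up/down transitions inside the window, roughly like this (an illustrative sketch, not the upti.my implementation):

```python
def flapping_detected(results, flap_threshold=5, window_seconds=600, now=0):
    """results: list of (timestamp, ok) pairs, oldest first. Counts
    up/down state changes inside the window and fires when the count
    reaches flap_threshold."""
    window = [ok for ts, ok in results if now - ts <= window_seconds]
    changes = sum(1 for prev, cur in zip(window, window[1:]) if prev != cur)
    return changes >= flap_threshold

# A service alternating up/down every minute produces 5 state changes
# across 6 results, which meets the default flap_threshold of 5.
checks = [(0, True), (60, False), (120, True), (180, False), (240, True), (300, False)]
print(flapping_detected(checks, flap_threshold=5, window_seconds=600, now=300))  # True
```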
```json
{
  "type": "pattern",
  "pattern_type": "flapping",
  "flap_threshold": 5,
  "window_seconds": 600,
  "severity": "warning",
  "cooldown_seconds": 900
}
```

4. Escalation
A multi-stage condition that creates an incident and raises its severity over time if the problem persists. Each stage has its own delay and severity level. The incident is created at the first stage's severity, then automatically escalated through subsequent stages while it remains unresolved.
| Field | Type | Description |
|---|---|---|
| stages | array | Array of escalation stages, each with a delay and severity. |
| stages[].delay_seconds | integer | Time in seconds after the initial failure before this stage activates. |
| stages[].severity | string | Severity for this stage: info, warning, or critical |
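Selecting the active stage from the elapsed time since the initial failure can be sketched as follows (an illustrative model, not the upti.my implementation):

```python
def current_stage(stages, elapsed_seconds):
    """stages: list of {"delay_seconds": ..., "severity": ...} sorted by
    delay. Returns the severity of the latest stage whose delay has
    elapsed, or None before the first stage activates."""
    severity = None
    for stage in stages:
        if elapsed_seconds >= stage["delay_seconds"]:
            severity = stage["severity"]
    return severity

stages = [
    {"delay_seconds": 0, "severity": "info"},
    {"delay_seconds": 300, "severity": "warning"},
    {"delay_seconds": 900, "severity": "critical"},
]
print(current_stage(stages, 0))     # info
print(current_stage(stages, 400))   # warning
print(current_stage(stages, 1200))  # critical
```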
```json
{
  "type": "escalation",
  "stages": [
    {
      "delay_seconds": 0,
      "severity": "info"
    },
    {
      "delay_seconds": 300,
      "severity": "warning"
    },
    {
      "delay_seconds": 900,
      "severity": "critical"
    }
  ]
}
```

💡 Pair with Workflow Escalation
Escalation conditions work great with conditional workflows. The condition bumps the incident severity over time, and your workflow uses severity-based routing to send initial notifications to Slack, then escalate to PagerDuty if the incident reaches critical.
5. Composite
Combines conditions from multiple healthchecks into a single rule using logical operators. Composite conditions create an incident only when the combined condition is met, reducing noise in complex environments. For example, create an incident only when both the API and database checks fail simultaneously.
| Field | Type | Description |
|---|---|---|
| operator | string | Logical operator: AND (all must fail) or OR (any must fail) |
| conditions | array | Array of sub-conditions, each referencing a healthcheck and failure criteria. |
| conditions[].healthcheck_id | string | ID of the healthcheck to evaluate |
| conditions[].threshold_count | integer | Number of consecutive failures for this sub-condition. Default: 1. |
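The operator semantics reduce to a logical combination of the sub-condition states, which can be sketched like this (the function and its inputs are illustrative, not part of upti.my):

```python
def composite_met(operator, sub_results):
    """sub_results: mapping of healthcheck_id -> whether that sub-condition's
    consecutive-failure threshold is currently met."""
    if operator == "AND":
        return all(sub_results.values())  # every sub-condition must be failing
    if operator == "OR":
        return any(sub_results.values())  # any single failing sub-condition suffices
    raise ValueError(f"unknown operator: {operator}")

print(composite_met("AND", {"hc_api_server": True, "hc_database": True}))   # True
print(composite_met("AND", {"hc_api_server": True, "hc_database": False}))  # False
print(composite_met("OR",  {"hc_api_server": False, "hc_database": True}))  # True
```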
```json
{
  "type": "composite",
  "operator": "AND",
  "conditions": [
    {
      "healthcheck_id": "hc_api_server",
      "threshold_count": 3
    },
    {
      "healthcheck_id": "hc_database",
      "threshold_count": 2
    }
  ],
  "severity": "critical",
  "cooldown_seconds": 600
}
```

ℹ️ When to Use Composite Conditions
Use AND composite conditions to reduce false positives. If your API depends on a database, a composite that requires both to fail confirms a real outage. Use OR composite conditions to monitor redundant systems where any single failure is worth investigating.
Working Hours
Restrict when incidents are created by configuring working hours. Failures that occur outside the defined window are still recorded in healthcheck results but do not create incidents until the next working period.
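The schedule check can be sketched as follows. This is an illustrative model (function name and schedule shape are assumptions), using Python's standard `zoneinfo` module (Python 3.9+) for timezone conversion; remember that in the real product, critical-severity conditions bypass this check by default:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

SCHEDULE = {  # mirrors the working_hours schedule in the config example
    "monday":    ("09:00", "18:00"),
    "tuesday":   ("09:00", "18:00"),
    "wednesday": ("09:00", "18:00"),
    "thursday":  ("09:00", "18:00"),
    "friday":    ("09:00", "18:00"),
}

def within_working_hours(now_utc, timezone="America/New_York", schedule=SCHEDULE):
    """Return True if now_utc falls inside the schedule, evaluated in the
    configured timezone. Days absent from the schedule never match."""
    local = now_utc.astimezone(ZoneInfo(timezone))
    day = local.strftime("%A").lower()
    if day not in schedule:
        return False  # e.g. weekend: no working hours defined
    start, end = schedule[day]
    # Zero-padded HH:MM strings compare correctly lexicographically.
    return start <= local.strftime("%H:%M") < end

# 18:00 UTC on Wed 2024-06-12 is 14:00 in New York (EDT) -> inside hours.
print(within_working_hours(datetime(2024, 6, 12, 18, 0, tzinfo=ZoneInfo("UTC"))))  # True
# Sat 2024-06-15 has no schedule entry -> outside hours.
print(within_working_hours(datetime(2024, 6, 15, 18, 0, tzinfo=ZoneInfo("UTC"))))  # False
```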
```json
{
  "working_hours": {
    "enabled": true,
    "timezone": "America/New_York",
    "schedule": {
      "monday": { "start": "09:00", "end": "18:00" },
      "tuesday": { "start": "09:00", "end": "18:00" },
      "wednesday": { "start": "09:00", "end": "18:00" },
      "thursday": { "start": "09:00", "end": "18:00" },
      "friday": { "start": "09:00", "end": "18:00" }
    }
  }
}
```

⚠️ Critical Severity Bypasses Working Hours
By default, conditions with critical severity always bypass working hours restrictions. Critical incidents are created immediately regardless of the schedule. This ensures genuine outages are never missed, even outside business hours.
What Happens After an Incident is Created?
Once a condition creates an incident, the incident enters the incident lifecycle (Detected, Acknowledged, Investigating, Resolved). From there, Workflows take over.
In the workflow builder, you configure everything that happens after detection:
- Destinations - where notifications go (Slack, Discord, Email, Teams, Telegram, PagerDuty, custom webhooks)
- Enrichment - add context from external APIs, format messages with templates, attach runbook links
- Routing - use conditions to send critical incidents to PagerDuty and warnings to Slack
- Escalation chains - add delays between notification stages so your team has time to respond
- Rate limiting - prevent notification floods during major outages
This separation means you can change how you get notified without touching your detection logic, and vice versa.