
Incident Conditions

Define the conditions that determine when incidents are created. Incident conditions evaluate healthcheck results and create incidents when real problems are detected.

Overview

Incident conditions are the rules that tell upti.my when to create an incident. Every healthcheck can have one or more conditions attached to it. When a condition is met, upti.my automatically opens a new incident and records the affected healthcheck, timestamp, and severity.

Conditions focus on one thing: deciding whether a problem is real enough to warrant an incident. They do not handle notifications, enrichment, or routing. That part is handled by Workflows, where you configure destinations, message formatting, escalation chains, and everything else using a visual drag-and-drop builder.

ℹ️ Conditions Create. Workflows Notify.

Think of it this way: incident conditions decide when an incident is created. Workflows decide what happens next. This separation keeps your setup clean. You define detection logic in conditions and notification logic in workflows.

Common Settings

All condition types share the following configurable settings:

  • Severity - The severity assigned to incidents created by this condition: critical, warning, or info. Workflows can use severity to route notifications to the right channels.
  • Cooldown Period - Minimum time between incidents from the same condition. Prevents creating duplicate incidents for the same ongoing problem.
  • Working Hours - Optionally restrict incident creation to specific hours and days. Issues outside working hours are still recorded but do not create incidents until the next working period.
  • Tags - Organize and filter conditions with custom tags. Tags carry through to the created incident and can be used in workflow routing.
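
The cooldown period behaves like a simple time gate on incident creation. The sketch below is illustrative Python, not upti.my's actual implementation; the function name and the idea of storing the last incident timestamp per condition are assumptions:

```python
import time

def should_create_incident(last_incident_at, cooldown_seconds, now=None):
    """Illustrative cooldown gate: suppress a new incident if this
    condition already created one within the cooldown window."""
    now = time.time() if now is None else now
    if last_incident_at is None:
        return True  # no prior incident from this condition
    return (now - last_incident_at) >= cooldown_seconds
```

With a cooldown of 300 seconds, a condition that keeps matching on every check still produces at most one incident per five minutes.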

Condition Types

1. Simple

The most straightforward condition. It creates an incident when a healthcheck fails a specified number of consecutive times. This is the default condition type and works well for clear-cut "is it up or down" monitoring.

  • threshold_count (integer) - Number of consecutive failures before creating an incident. Default: 3.
Simple Condition Example
{
  "type": "simple",
  "threshold_count": 3,
  "severity": "critical",
  "cooldown_seconds": 300
}
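
The consecutive-failure check behind a simple condition fits in a few lines. This is a sketch of the technique, not upti.my's implementation; `results` is an assumed oldest-to-newest list in which True marks a failed check:

```python
def simple_condition_met(results, threshold_count=3):
    """True when the most recent threshold_count results are all failures.
    results: chronological list of booleans, True = failed check."""
    if len(results) < threshold_count:
        return False  # not enough history yet
    return all(results[-threshold_count:])
```

A single recovery anywhere in the last `threshold_count` results resets the decision, which is why this type suits clear-cut up/down monitoring.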

2. Threshold

Creates an incident based on the percentage of failures within a rolling time window. This is useful for services with occasional transient failures: you might tolerate a 10% failure rate but want an incident at 50%.

  • failure_percentage (integer, 0-100) - Failure percentage that triggers incident creation, e.g., 50 means 50% failures.
  • window_seconds (integer) - Rolling time window in seconds. Default: 300 (5 minutes).
Threshold Condition Example
{
  "type": "threshold",
  "failure_percentage": 50,
  "window_seconds": 600,
  "severity": "warning",
  "cooldown_seconds": 600
}
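
A rolling-window failure rate can be computed roughly as follows. This is an illustrative Python sketch under the assumption that each sample is a (timestamp, failed) pair; it is not upti.my's actual code:

```python
def threshold_condition_met(samples, failure_percentage, window_seconds, now):
    """True when the failure rate inside the rolling window ending at
    `now` reaches the configured percentage."""
    window = [failed for ts, failed in samples if now - ts <= window_seconds]
    if not window:
        return False  # no samples in the window yet
    rate = 100 * sum(window) / len(window)  # booleans sum as 0/1
    return rate >= failure_percentage
```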

ℹ️ Window Size Matters

Shorter windows (1 to 5 minutes) detect issues faster but may create incidents from transient failures. Longer windows (10 to 30 minutes) are more stable but slower to react. Match the window size to the criticality of the service.

3. Pattern

Detects specific failure patterns rather than simple counts or percentages. Pattern conditions excel at identifying flapping services (rapidly alternating between up and down) and specific sequences of errors that indicate a degrading system.

  • pattern_type (string) - Pattern to detect: flapping or consecutive_errors.
  • flap_threshold (integer) - For flapping: number of state changes within the window that triggers an incident. Default: 5.
  • consecutive_count (integer) - For consecutive_errors: number of errors in a row. Default: 5.
  • window_seconds (integer) - Time window for pattern evaluation. Default: 600 (10 minutes).
Flapping Detection Example
{
  "type": "pattern",
  "pattern_type": "flapping",
  "flap_threshold": 5,
  "window_seconds": 600,
  "severity": "warning",
  "cooldown_seconds": 900
}
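
Flapping detection reduces to counting up/down transitions inside the window. The following is a minimal sketch of that counting, assuming a chronological list of up/down states, not upti.my's implementation:

```python
def flapping_detected(states, flap_threshold=5):
    """True when the number of state changes inside the window reaches
    flap_threshold. states: chronological booleans, True = up."""
    changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
    return changes >= flap_threshold
```

A service that alternates up/down on every check hits the default threshold of 5 within six results, while a service that fails once and stays down produces only one state change.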

4. Escalation

Multi-stage conditions that create incidents with increasing severity over time if the problem persists. Each stage has its own delay and severity level. The incident is initially created at the first stage's severity, then automatically escalated through subsequent stages if unresolved.

  • stages (array) - Array of escalation stages, each with a delay and severity.
  • stages[].delay_seconds (integer) - Time in seconds after the initial failure before this stage activates.
  • stages[].severity (string) - Severity for this stage: info, warning, or critical.
Escalation Condition Example
{
  "type": "escalation",
  "stages": [
    {
      "delay_seconds": 0,
      "severity": "info"
    },
    {
      "delay_seconds": 300,
      "severity": "warning"
    },
    {
      "delay_seconds": 900,
      "severity": "critical"
    }
  ]
}
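
Selecting the active stage is a matter of comparing elapsed time against each stage's delay. The helper below is hypothetical (the name `current_severity` is an assumption), but it shows how the stages in the example above map time since the initial failure to a severity:

```python
def current_severity(stages, seconds_since_failure):
    """Severity of the latest stage whose delay has elapsed, or None if
    no stage is active yet. stages must be sorted by delay_seconds."""
    active = [s for s in stages if seconds_since_failure >= s["delay_seconds"]]
    return active[-1]["severity"] if active else None
```

With the example configuration, the incident opens at info, becomes warning after 5 minutes, and critical after 15 minutes if still unresolved.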

💡 Pair with Workflow Escalation

Escalation conditions work great with conditional workflows. The condition bumps the incident severity over time, and your workflow uses severity-based routing to send initial notifications to Slack, then escalate to PagerDuty if the incident reaches critical.

5. Composite

Combines conditions from multiple healthchecks into a single rule using logical operators. Composite conditions create an incident only when the combined condition is met, reducing noise in complex environments. For example, create an incident only when both the API and database checks fail simultaneously.

  • operator (string) - Logical operator: AND (all must fail) or OR (any must fail).
  • conditions (array) - Array of sub-conditions, each referencing a healthcheck and failure criteria.
  • conditions[].healthcheck_id (string) - ID of the healthcheck to evaluate.
  • conditions[].threshold_count (integer) - Number of consecutive failures for this sub-condition. Default: 1.
Composite Condition Example (AND)
{
  "type": "composite",
  "operator": "AND",
  "conditions": [
    {
      "healthcheck_id": "hc_api_server",
      "threshold_count": 3
    },
    {
      "healthcheck_id": "hc_database",
      "threshold_count": 2
    }
  ],
  "severity": "critical",
  "cooldown_seconds": 600
}
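
The evaluation of a composite rule can be sketched as follows. This is illustrative Python under the assumption that the platform tracks a consecutive-failure count per healthcheck; the function and parameter names are not upti.my's API:

```python
def composite_condition_met(operator, sub_conditions, failures_by_check):
    """Evaluate a composite rule. failures_by_check maps healthcheck_id
    to its current consecutive-failure count."""
    results = [
        failures_by_check.get(c["healthcheck_id"], 0) >= c.get("threshold_count", 1)
        for c in sub_conditions
    ]
    return all(results) if operator == "AND" else any(results)
```

Applied to the AND example above, three consecutive API failures alone do not open an incident; the database must also reach two consecutive failures.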

ℹ️ When to Use Composite Conditions

Use AND composite conditions to reduce false positives. If your API depends on a database, a composite that requires both to fail confirms a real outage. Use OR composite conditions to monitor redundant systems where any single failure is worth investigating.

Working Hours

Restrict when incidents are created by configuring working hours. Failures that occur outside the defined window are still recorded in healthcheck results but do not create incidents until the next working period.

Working Hours Configuration
{
  "working_hours": {
    "enabled": true,
    "timezone": "America/New_York",
    "schedule": {
      "monday": { "start": "09:00", "end": "18:00" },
      "tuesday": { "start": "09:00", "end": "18:00" },
      "wednesday": { "start": "09:00", "end": "18:00" },
      "thursday": { "start": "09:00", "end": "18:00" },
      "friday": { "start": "09:00", "end": "18:00" }
    }
  }
}
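
Conceptually, the schedule check converts the failure time into the configured timezone and compares it against that day's window. The sketch below is an assumption about how such a check could work (a real implementation would also apply the critical-severity bypass described next before this check):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def within_working_hours(config, when_utc):
    """True if when_utc (an aware datetime) falls inside the configured
    schedule. Days missing from the schedule never match."""
    if not config.get("enabled", False):
        return True  # no restriction configured
    local = when_utc.astimezone(ZoneInfo(config["timezone"]))
    slot = config["schedule"].get(local.strftime("%A").lower())
    if slot is None:
        return False  # e.g. weekends in the example above
    return slot["start"] <= local.strftime("%H:%M") < slot["end"]
```

Comparing "HH:MM" strings works here because zero-padded 24-hour times sort lexicographically in chronological order.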

⚠️ Critical Severity Bypasses Working Hours

By default, conditions with critical severity always bypass working hours restrictions. Critical incidents are created immediately regardless of the schedule. This ensures genuine outages are never missed, even outside business hours.

What Happens After an Incident is Created?

Once a condition creates an incident, the incident enters the incident lifecycle (Detected, Acknowledged, Investigating, Resolved). From there, Workflows take over.

In the workflow builder, you configure everything that happens after detection:

  • Destinations - where notifications go (Slack, Discord, Email, Teams, Telegram, PagerDuty, custom webhooks)
  • Enrichment - add context from external APIs, format messages with templates, attach runbook links
  • Routing - use conditions to send critical incidents to PagerDuty and warnings to Slack
  • Escalation chains - add delays between notification stages so your team has time to respond
  • Rate limiting - prevent notification floods during major outages

This separation means you can change how you get notified without touching your detection logic, and vice versa.