Incident Management
Track, manage, and resolve incidents with automatic detection, timeline tracking, and post-incident analysis.
Overview
upti.my provides a built-in incident management system that tracks every outage from detection to resolution. Incidents are created automatically when incident conditions are met, or manually when your team identifies an issue through other channels. Every incident captures a complete timeline, affected services, linked health checks, and resolution details.
The incident system integrates tightly with incident conditions, workflows, and status pages. When an incident is created, your workflows execute to send notifications and route alerts. As the incident progresses through its lifecycle, status updates can be published to your public or private status pages automatically.
Incident Lifecycle
Every incident follows a structured lifecycle with four stages. Moving between stages is tracked with timestamps for accurate metric calculation.
| Stage | Description |
|---|---|
| Detected | The incident has been identified, either automatically by an incident condition or manually by a team member. Workflows execute to send notifications and the timeline begins. |
| Acknowledged | A team member has confirmed that they are aware of the issue. Acknowledging the incident stops any escalation timers started by escalation conditions. |
| Investigating | The team is actively investigating the root cause. Status updates can be posted to keep stakeholders informed. |
| Resolved | The issue has been fixed and the affected services are back to normal. Resolution time is recorded and MTTR is calculated. |
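The lifecycle above can be modeled as a small state machine that stamps each transition. This is only an illustrative sketch, not upti.my's implementation: the `Incident` class, the `advance` method, and the rule that stages only move forward (skipping ahead is allowed) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum


class Stage(Enum):
    DETECTED = "detected"
    ACKNOWLEDGED = "acknowledged"
    INVESTIGATING = "investigating"
    RESOLVED = "resolved"


# Assumed rule: stages only move forward, in definition order.
ORDER = list(Stage)


@dataclass
class Incident:
    detected_at: datetime
    stage: Stage = Stage.DETECTED
    # Timestamp of entry into each stage, kept for metric calculation.
    entered: dict = field(default_factory=dict)

    def __post_init__(self):
        self.entered[Stage.DETECTED] = self.detected_at

    def advance(self, new_stage: Stage, at: datetime) -> None:
        """Move to a later stage and record when the move happened."""
        if ORDER.index(new_stage) <= ORDER.index(self.stage):
            raise ValueError(
                f"cannot move from {self.stage.value} to {new_stage.value}"
            )
        self.stage = new_stage
        self.entered[new_stage] = at

    def time_to_resolve(self):
        """Seconds from detection to resolution, or None while still open."""
        if Stage.RESOLVED not in self.entered:
            return None
        delta = self.entered[Stage.RESOLVED] - self.entered[Stage.DETECTED]
        return delta.total_seconds()


t0 = datetime(2025, 6, 15, 14, 30, tzinfo=timezone.utc)
incident = Incident(detected_at=t0)
incident.advance(Stage.ACKNOWLEDGED, t0 + timedelta(minutes=2))
incident.advance(Stage.INVESTIGATING, t0 + timedelta(minutes=5))
incident.advance(Stage.RESOLVED, t0 + timedelta(minutes=20))
print(incident.time_to_resolve())  # 1200.0
```

Keeping one timestamp per stage is what makes per-incident resolution time, and therefore MTTR, straightforward to derive later.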
ℹ️ Automatic Incident Creation
When an incident condition is met, upti.my automatically creates an incident if one does not already exist for the affected health check. If an open incident already exists, the new event is linked to the existing incident instead of creating a duplicate.
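The create-or-link behavior described above might look roughly like the following. The `IncidentStore` class, its `handle_condition_met` method, and the field names are hypothetical and exist only to illustrate the rule of one open incident per health check.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    id: int
    health_check_id: str
    open: bool = True
    events: list = field(default_factory=list)


class IncidentStore:
    """In-memory stand-in for wherever incidents are persisted."""

    def __init__(self):
        self._incidents = []

    def handle_condition_met(self, health_check_id: str, event: dict) -> Incident:
        # Reuse the open incident for this health check if there is one.
        for incident in self._incidents:
            if incident.open and incident.health_check_id == health_check_id:
                incident.events.append(event)
                return incident
        # Otherwise open a new incident and record the triggering event.
        incident = Incident(
            id=len(self._incidents) + 1, health_check_id=health_check_id
        )
        incident.events.append(event)
        self._incidents.append(incident)
        return incident


store = IncidentStore()
first = store.handle_condition_met("api-health", {"reason": "timeout"})
second = store.handle_condition_met("api-health", {"reason": "http 500"})
print(first is second, len(first.events))  # True 2
```

Once the first incident is marked as no longer open, the next failing event for the same health check starts a new incident.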
Incident Timeline
Every incident maintains a detailed timeline that records all events from detection to resolution. Timeline entries are created automatically for system events and can be added manually by team members.
- Condition triggered - Records which incident condition was met and which health check failed
- Stage transitions - Timestamps for each lifecycle stage change (Detected, Acknowledged, Investigating, Resolved)
- Status updates - Manual updates posted by team members with free-text descriptions
- Affected services changed - When services are added to or removed from the incident
- Team assignments - When responders are assigned or changed
- Recovery actions executed - If self-healing actions ran, their results are logged here
{
  "timestamp": "2025-06-15T14:32:00Z",
  "event_type": "status_update",
  "author": "jane@example.com",
  "message": "Root cause identified: database connection pool exhaustion. Scaling up pool size from 20 to 50 connections.",
  "stage": "investigating"
}
Affected Services
Each incident can be linked to one or more impacted services. Services are associated automatically based on the health checks that triggered the alert. You can also add or remove affected services manually as the investigation reveals the full scope of impact.
Affected services are displayed on your status pages, giving your users real-time visibility into which parts of your system are experiencing issues.
Status Page Integration
Incident updates can be automatically published to your upti.my status pages. When you post a status update to an incident, you can choose to push it to one or more status pages. This keeps your users informed without requiring manual updates to your status page.
- Automatic publishing - Configure incidents to auto-publish to status pages when created
- Selective updates - Choose which status updates are public-facing and which are internal-only
- Service impact levels - Set the impact level per service: Operational, Degraded Performance, Partial Outage, Major Outage
- Scheduled maintenance - Create planned incidents that appear on the status page before the maintenance window
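To make selective updates and per-service impact levels concrete, here is a minimal sketch. The four level names come from the list above; the `StatusUpdate` type, the `public` flag, and the "worst level wins" roll-up are assumptions for illustration, not documented upti.my behavior.

```python
from dataclasses import dataclass
from enum import IntEnum


class Impact(IntEnum):
    # Ordered from least to most severe so max() picks the worst level.
    OPERATIONAL = 0
    DEGRADED_PERFORMANCE = 1
    PARTIAL_OUTAGE = 2
    MAJOR_OUTAGE = 3


@dataclass
class StatusUpdate:
    message: str
    public: bool  # False means internal-only


def public_updates(updates):
    """Return only the updates that should appear on a status page."""
    return [u for u in updates if u.public]


def worst_impact(service_impacts):
    """Hypothetical roll-up: the most severe level across all services."""
    return max(service_impacts.values(), default=Impact.OPERATIONAL)


updates = [
    StatusUpdate("Investigating elevated error rates.", public=True),
    StatusUpdate("Suspect connection pool config on db-2.", public=False),
]
impacts = {"api-eu": Impact.PARTIAL_OUTAGE, "cdn-eu": Impact.DEGRADED_PERFORMANCE}

print([u.message for u in public_updates(updates)])
print(worst_impact(impacts).name)  # PARTIAL_OUTAGE
```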
💡 Communicate Early and Often
Post an initial status update within 5 minutes of detection, even if you do not yet know the root cause. Users appreciate knowing that you are aware of the issue. Follow up with updates every 15 to 30 minutes until the incident is resolved.
Response Team Assignment
Assign responders to incidents to track who is working on the issue. Responders are notified when they are assigned and receive all subsequent status updates for the incident. You can configure default response teams per service or per incident condition.
| Feature | Description |
|---|---|
| Default responders | Configure default team members who are automatically assigned when an incident is created for a specific service |
| Manual assignment | Add or change responders at any time during the incident lifecycle |
| On-call integration | Integrate with PagerDuty or OpsGenie to automatically assign the current on-call engineer |
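One way to picture how the three assignment sources combine is a merge that preserves order and drops duplicates. The function name, its parameters, and the precedence (service defaults, then on-call, then manual additions) are assumptions for this sketch.

```python
def resolve_responders(service_ids, defaults_by_service, manual=(), on_call=None):
    """Merge default, on-call, and manually added responders without duplicates."""
    ordered = []
    for service_id in service_ids:
        ordered.extend(defaults_by_service.get(service_id, []))
    if on_call:
        ordered.append(on_call)
    ordered.extend(manual)
    # dict.fromkeys keeps first-seen order while dropping repeats.
    return list(dict.fromkeys(ordered))


defaults = {
    "api-eu": ["alice@example.com", "bob@example.com"],
    "cdn-eu": ["bob@example.com"],
}
responders = resolve_responders(
    ["api-eu", "cdn-eu"],
    defaults,
    manual=["carol@example.com"],
    on_call="alice@example.com",
)
print(responders)
# ['alice@example.com', 'bob@example.com', 'carol@example.com']
```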
Incident Metrics
upti.my calculates key incident metrics automatically so your team can measure reliability and response performance.
| Metric | Description |
|---|---|
| MTTD (Mean Time to Detect) | Average time from when a failure starts to when it is detected by a health check or alert. Lower MTTD means faster detection. |
| MTTR (Mean Time to Recover) | Average time from incident detection to resolution. This is the primary measure of your team's response efficiency. |
| Incident Count | Total number of incidents over a given period, broken down by severity, service, and status. |
| Uptime Percentage | Calculated from incident duration relative to total monitored time. Displayed on status pages and dashboards. |
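The definitions in the table can be expressed as a short calculation. This is a sketch under stated assumptions rather than upti.my's exact formula: the `started_at`, `detected_at`, and `resolved_at` keys are hypothetical, downtime is taken as the sum of detection-to-resolution durations, and incidents are assumed not to overlap.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean


def summarize(incidents, period_start, period_end):
    """Compute MTTD, MTTR, count, and uptime for resolved incidents.

    Each incident is a dict with hypothetical keys started_at, detected_at,
    and resolved_at, all timezone-aware datetimes.
    """
    detect = [(i["detected_at"] - i["started_at"]).total_seconds() for i in incidents]
    recover = [(i["resolved_at"] - i["detected_at"]).total_seconds() for i in incidents]
    total = (period_end - period_start).total_seconds()
    # Assumption: downtime is the sum of detection-to-resolution durations
    # and incidents do not overlap in time.
    downtime = sum(recover)
    return {
        "mttd_seconds": mean(detect) if detect else None,
        "mttr_seconds": mean(recover) if recover else None,
        "incident_count": len(incidents),
        "uptime_percentage": round(100 * (1 - downtime / total), 2),
    }


day = datetime(2025, 6, 1, tzinfo=timezone.utc)
incidents = [
    {
        "started_at": day,
        "detected_at": day + timedelta(seconds=30),
        "resolved_at": day + timedelta(seconds=630),
    },
    {
        "started_at": day + timedelta(hours=6),
        "detected_at": day + timedelta(hours=6, seconds=60),
        "resolved_at": day + timedelta(hours=6, seconds=1260),
    },
]
report = summarize(incidents, day, day + timedelta(days=1))
print(report)
```

With these two sample incidents over a one-day window, detection takes 30 and 60 seconds and recovery takes 600 and 1200 seconds, so the averages are 45 and 900 seconds, and 1800 seconds of downtime out of 86400 gives roughly 97.92% uptime.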
{
  "period": "2025-06-01T00:00:00Z/2025-06-30T23:59:59Z",
  "metrics": {
    "mttd_seconds": 45,
    "mttr_seconds": 1230,
    "incident_count": 7,
    "uptime_percentage": 99.94,
    "by_severity": {
      "critical": 2,
      "warning": 3,
      "info": 2
    }
  }
}
Post-Incident Analysis
After an incident is resolved, upti.my provides tools for post-incident analysis. Each incident includes a dedicated notes section where your team can document root cause, contributing factors, and action items. This helps your team learn from incidents and improve reliability over time.
- Root cause documentation - Record what caused the incident and how it was identified
- Contributing factors - Document environmental or systemic factors that contributed to the issue
- Action items - Track follow-up tasks to prevent recurrence
- Timeline review - Review the complete incident timeline to identify response bottlenecks
- Metric comparison - Compare MTTD and MTTR against your team's historical averages
ℹ️ Blameless Post-Mortems
upti.my encourages blameless post-incident analysis. Focus on systemic improvements rather than individual blame. The incident timeline and metrics provide objective data points for identifying process improvements and automation opportunities.
Creating Incidents Manually
While most incidents are created automatically by incident conditions, you can also create incidents manually for issues detected through other channels. Manual incidents support all the same features: lifecycle tracking, timeline, affected services, status page publishing, and metrics.
{
  "title": "Elevated API latency in EU region",
  "description": "Users in the EU region are reporting slow API response times. CDN cache hit rate has dropped significantly.",
  "severity": "warning",
  "affected_services": ["api-eu", "cdn-eu"],
  "responders": ["oncall@example.com"],
  "publish_to_status_pages": ["public-status"]
}
⚠️ Duplicate Prevention
Before creating a manual incident, check the active incidents list to ensure a related incident does not already exist. If a related incident is open, add your findings as a status update to the existing incident instead of creating a new one.
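The check described above could be scripted along these lines. The shape of the active incidents list, the `stage` and `affected_services` keys, and the idea of matching on shared services are all assumptions for this sketch; how you fetch active incidents depends on your setup.

```python
def find_related(active_incidents, affected_services):
    """Return open incidents that share at least one affected service."""
    wanted = set(affected_services)
    return [
        inc
        for inc in active_incidents
        if inc["stage"] != "resolved" and wanted & set(inc["affected_services"])
    ]


# Hypothetical snapshot of the active incidents list.
active = [
    {"id": 101, "stage": "investigating", "affected_services": ["api-eu"]},
    {"id": 102, "stage": "resolved", "affected_services": ["cdn-eu"]},
    {"id": 103, "stage": "detected", "affected_services": ["billing"]},
]

related = find_related(active, ["api-eu", "cdn-eu"])
# Prefer updating an existing incident over opening a duplicate.
action = "add_status_update" if related else "create_incident"
print(action, [inc["id"] for inc in related])  # add_status_update [101]
```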