Incident Management

Track, manage, and resolve incidents with automatic detection, timeline tracking, and post-incident analysis.

Overview

upti.my provides a built-in incident management system that tracks every outage from detection to resolution. Incidents are created automatically when incident conditions are met, or manually when your team identifies an issue through other channels. Every incident captures a complete timeline, affected services, linked health checks, and resolution details.

The incident system integrates tightly with incident conditions, workflows, and status pages. When an incident is created, your workflows execute to send notifications and route alerts. As the incident progresses through its lifecycle, status updates can be published to your public or private status pages automatically.

Incident Lifecycle

Every incident follows a structured lifecycle with five statuses. Each transition between statuses is timestamped so that metrics such as MTTR can be calculated accurately.

Status	Description
Ongoing	The incident has been created and is active. Workflows execute to send notifications and the timeline begins.
Investigating	The team is actively investigating the root cause. Status updates can be posted to keep stakeholders informed.
Identified	The root cause has been identified and the team is working on a fix. This signals progress to stakeholders.
Monitoring	A fix has been applied and the team is monitoring to confirm the issue is fully resolved. Entering this status stops escalation timers.
Resolved	The issue has been fixed and the affected services are back to normal. Resolution time is recorded and MTTR is calculated.
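As an illustration of how timestamped transitions support metric calculation, here is a minimal Python sketch. The Incident class, its fields, and its methods are hypothetical names chosen for this example and do not represent the upti.my API. The sketch does not enforce any ordering between statuses, since whether statuses can be skipped is not specified here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum


class Status(Enum):
    ONGOING = "Ongoing"
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    MONITORING = "Monitoring"
    RESOLVED = "Resolved"


@dataclass
class Incident:
    """Hypothetical model for illustration; not upti.my's data structure."""
    title: str
    created_at: datetime
    status: Status = Status.ONGOING
    # (status, timestamp) pairs, oldest first
    transitions: list = field(default_factory=list)

    def __post_init__(self):
        # An incident starts as Ongoing at creation time.
        self.transitions.append((Status.ONGOING, self.created_at))

    def move_to(self, status: Status, at: datetime) -> None:
        """Change status and record when the change happened."""
        self.status = status
        self.transitions.append((status, at))

    def time_to_resolve(self):
        """Time from creation to the latest Resolved entry, or None if open."""
        for status, at in reversed(self.transitions):
            if status is Status.RESOLVED:
                return at - self.created_at
        return None


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
incident = Incident("Checkout API errors", created_at=start)
incident.move_to(Status.INVESTIGATING, start + timedelta(minutes=5))
incident.move_to(Status.MONITORING, start + timedelta(minutes=30))
incident.move_to(Status.RESOLVED, start + timedelta(minutes=45))
print(incident.time_to_resolve())  # 0:45:00
```

Storing every transition, rather than only the current status, is what makes it possible to derive durations after the fact.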

ℹ️ Automatic Incident Creation

When an incident condition is met, upti.my automatically creates an incident if one does not already exist for the affected health check. If an open incident already exists, the new event is linked to the existing incident instead of creating a duplicate.
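The create-or-link behavior described above can be sketched as a lookup keyed by health check. This is a simplified illustration under stated assumptions: IncidentStore, on_condition_met, and resolve are hypothetical names, and incidents are plain dictionaries rather than anything upti.my actually stores.

```python
class IncidentStore:
    """Hypothetical in-memory store; illustrates duplicate prevention only."""

    def __init__(self):
        # Maps a health check ID to its currently open incident, if any.
        self.open_by_check = {}

    def on_condition_met(self, check_id, event):
        """Create an incident for the check, or link the event to the open one."""
        incident = self.open_by_check.get(check_id)
        if incident is None:
            incident = {"check_id": check_id, "events": []}
            self.open_by_check[check_id] = incident
        incident["events"].append(event)
        return incident

    def resolve(self, check_id):
        """Close the open incident so the next event creates a new one."""
        return self.open_by_check.pop(check_id, None)


store = IncidentStore()
first = store.on_condition_met("check-api", "HTTP 503 from /health")
second = store.on_condition_met("check-api", "HTTP 503 from /health (retry)")
print(first is second)       # True
print(len(first["events"]))  # 2
```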

Incident Timeline

Every incident maintains a detailed timeline that records all events from detection to resolution. Timeline entries are created automatically for system events and can be added manually by team members.

Condition triggered - Records which incident condition was met and which health check failed
Status transitions - Timestamps for each status change (Ongoing, Investigating, Identified, Monitoring, Resolved)
Status updates - Manual updates posted by team members with free-text descriptions

Affected Services

Each incident can be linked to one or more impacted services. Services are associated automatically based on the health checks that triggered the incident. You can also add or remove affected services manually as the investigation reveals the full scope of impact.

Affected services are displayed on your status pages, giving your users real-time visibility into which parts of your system are experiencing issues.

Status Page Integration

By default, incidents are created as private, visible only to your team in the dashboard. This lets you triage and investigate without exposing anything to your users. When you are ready to communicate, you can make an incident public to display it on one or more of your status pages.

Private by default - New incidents are only visible to your team. Workflows still fire and your team is notified, but nothing appears on public status pages.
Make public - When you want to communicate with your users, mark the incident as public and select which status pages it should appear on.
Selective updates - Choose which status updates are public-facing and which remain internal-only.
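The visibility rules above amount to two filters: the incident must be public on the page in question, and each update must itself be marked public. A minimal Python sketch follows; IncidentRecord, StatusUpdate, and visible_updates are hypothetical names used for illustration, not part of upti.my.

```python
from dataclasses import dataclass, field


@dataclass
class StatusUpdate:
    message: str
    public: bool = False  # internal-only unless explicitly marked public


@dataclass
class IncidentRecord:
    """Hypothetical record; private by default, as described above."""
    title: str
    public: bool = False
    status_pages: set = field(default_factory=set)
    updates: list = field(default_factory=list)


def visible_updates(incident, page_id):
    """Return the update messages a given status page would show."""
    if not incident.public or page_id not in incident.status_pages:
        return []
    return [u.message for u in incident.updates if u.public]


incident = IncidentRecord("Database failover")
incident.updates.append(StatusUpdate("Replica lag suspected", public=False))
incident.updates.append(StatusUpdate("We are investigating slow queries", public=True))
print(visible_updates(incident, "public-page"))  # []

# Make the incident public on one status page.
incident.public = True
incident.status_pages.add("public-page")
print(visible_updates(incident, "public-page"))
# ['We are investigating slow queries']
```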

💡 Communicate Early and Often

Post an initial status update within 5 minutes of detection, even if you do not yet know the root cause. Users appreciate knowing that you are aware of the issue. Follow up with updates every 15 to 30 minutes until the incident is resolved.

Response Team Assignment

Assign responders to incidents to track who is working on the issue. Responders are notified when they are assigned and receive all subsequent status updates for the incident. You can configure default response teams per service or per incident condition.

Feature	Description
Default responders	Configure default team members who are automatically assigned when an incident is created for a specific service
Manual assignment	Add or change responders at any time during the incident lifecycle
On-call integration	Integrate with PagerDuty to automatically assign the current on-call engineer
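One way to picture default assignment is as a merge of the responders configured for the incident condition and for each affected service. The sketch below is an assumption, not documented behavior: how upti.my combines the two sources is not stated here, so the function simply takes the union, listing condition-level responders first. All names are hypothetical.

```python
def default_responders(services, condition_id, by_service, by_condition):
    """Merge configured responders, keeping first-seen order without duplicates.

    Assumption: condition-level and service-level defaults are combined
    rather than one overriding the other.
    """
    sources = [by_condition.get(condition_id, [])]
    sources += [by_service.get(service, []) for service in services]

    responders = []
    for source in sources:
        for person in source:
            if person not in responders:
                responders.append(person)
    return responders


by_service = {"api": ["alice", "bob"]}
by_condition = {"latency-high": ["bob", "carol"]}
print(default_responders(["api"], "latency-high", by_service, by_condition))
# ['bob', 'carol', 'alice']
```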

Incident Metrics

upti.my calculates key incident metrics automatically, giving your team data-driven insights into your reliability and response performance.

Metric	Description
MTTD (Mean Time to Detect)	Average time from when a failure starts to when it is detected by a health check or alert. Lower MTTD means faster detection.
MTTR (Mean Time to Recover)	Average time from incident detection to resolution. This is the primary measure of your team's response efficiency.
Incident Count	Total number of incidents over a given period, broken down by severity, service, and status.
Uptime Percentage	The share of total monitored time during which no incident was in progress, calculated from incident durations. Displayed on status pages and dashboards.
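To make the definitions in the table concrete, here is a worked example in Python. It is a sketch under stated assumptions: each incident is a dictionary with hypothetical started, detected, and resolved timestamps, downtime is measured from detection to resolution, and incidents are assumed not to overlap. The exact formulas upti.my uses may differ.

```python
from datetime import datetime, timedelta


def mean(durations):
    """Average of a non-empty list of timedeltas."""
    return sum(durations, timedelta()) / len(durations)


def mttd(incidents):
    # Mean time from the start of a failure to its detection.
    return mean([i["detected"] - i["started"] for i in incidents])


def mttr(incidents):
    # Mean time from detection to resolution.
    return mean([i["resolved"] - i["detected"] for i in incidents])


def uptime_percentage(incidents, monitored):
    # Assumes incidents do not overlap, so durations can simply be summed.
    downtime = sum((i["resolved"] - i["detected"] for i in incidents), timedelta())
    return 100 * (1 - downtime / monitored)


day = datetime(2024, 1, 1)
incidents = [
    {
        "started": day.replace(hour=12, minute=0),
        "detected": day.replace(hour=12, minute=2),
        "resolved": day.replace(hour=12, minute=32),
    },
    {
        "started": day.replace(hour=15, minute=0),
        "detected": day.replace(hour=15, minute=4),
        "resolved": day.replace(hour=15, minute=24),
    },
]

print(mttd(incidents))  # 0:03:00
print(mttr(incidents))  # 0:25:00
print(round(uptime_percentage(incidents, timedelta(days=1)), 2))  # 96.53
```

Here the two incidents account for 50 minutes of downtime in a 1,440-minute day, which gives roughly 96.53% uptime.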

Post-Incident Analysis

After an incident is resolved, upti.my provides tools for post-incident analysis. Each incident includes a dedicated notes section where your team can document root cause, contributing factors, and action items. This helps your team learn from incidents and improve reliability over time.

Root cause documentation - Record what caused the incident and how it was identified
Contributing factors - Document environmental or systemic factors that contributed to the issue
Action items - Track follow-up tasks to prevent recurrence
Timeline review - Review the complete incident timeline to identify response bottlenecks
Metric comparison - Compare MTTD and MTTR against your team's historical averages

ℹ️ Blameless Post-Mortems

upti.my encourages blameless post-incident analysis. Focus on systemic improvements rather than individual blame. The incident timeline and metrics provide objective data points for identifying process improvements and automation opportunities.

Creating Incidents Manually

While most incidents are created automatically by incident conditions, you can also create incidents manually from the dashboard. Click "Create Incident" and fill in the title, description, severity, and affected services. Manual incidents support all the same features: lifecycle tracking, timeline, affected services, status page publishing, and metrics.

⚠️ Duplicate Prevention

Before creating a manual incident, check the active incidents list to ensure a related incident does not already exist. If a related incident is open, add your findings as a status update to the existing incident instead of creating a new one.