upti.my

From detection to resolution, in one place

Incident Management

Incidents are created from real check failures with full context. Alerts route to the right team. Status pages update automatically. No more switching between tools during an outage.

The Problem with Disconnected Incident Management

When your monitoring tool and your incident tracker are separate systems, every incident starts with manual work: someone has to notice the alert, create the incident, copy the context, notify the team, and update the status page. Each step takes time and can be forgotten.

During an outage, time is the most expensive resource. Every minute spent on coordination is a minute not spent on resolution.

Delayed incident creation

Someone has to see the alert and decide to create an incident. If they are asleep or in a meeting, the incident is not tracked.

Lost context between tools

The monitoring tool knows what failed and when. The incident tool does not. Someone has to copy the information across by hand, and the copy is often incomplete.

Status page falls behind

Updating the status page is a separate task that gets deprioritized during an active incident. Customers see no communication.

Timeline is reconstructed after the fact

Without automatic tracking, the incident timeline has to be rebuilt from Slack messages and memory during the retrospective.

How It Works in upti.my

1. Health check detects a failure (or multiple failures match your conditions)
2. Incident is created automatically with full diagnostic context
3. Alert workflow routes the notification to the right team member
4. Status page updates automatically for affected components
5. Self-healing agent attempts automated recovery (if configured)
6. Check recovers, incident resolves, status page updates

Every step happens automatically. The team that responds starts with context and a timeline, not a blank incident form.
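The six steps above can be sketched as an incident record that appends to its own timeline as each step runs. This is only an illustration of the idea, not upti.my's implementation: the `Incident` class, the `run_lifecycle` function, and the event names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    """Hypothetical incident record that logs its own timeline."""
    title: str
    status: str = "open"
    timeline: list = field(default_factory=list)

    def record(self, event, at=None):
        """Append a timestamped event to the timeline."""
        at = at or datetime.now(timezone.utc)
        self.timeline.append((at, event))

    def resolve(self, at=None):
        """Mark the incident resolved and log the resolution."""
        self.status = "resolved"
        self.record("resolved", at)


def run_lifecycle(title, healed):
    """Walk the steps in order; `healed` stands in for the check recovering."""
    incident = Incident(title)
    incident.record("check failed")
    incident.record("incident created")
    incident.record("alert routed")
    incident.record("status page updated")
    incident.record("self-healing attempted")
    if healed:
        incident.resolve()
    return incident
```

Calling `run_lifecycle("API /v2/users returning 503", healed=True)` returns a resolved incident whose timeline already holds every step, which is the property that makes manual reconstruction unnecessary.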

What an Incident Looks Like in Practice

Your API /v2/users starts returning 503 errors. Here is what happens when monitoring and incident management are connected:

1. Check fails from Frankfurt, Mumbai, and Johannesburg within 60 seconds
2. Incident created: "API /v2/users returning 503" with response bodies, latency, and affected regions attached
3. Alert workflow routes to backend-oncall via Slack. Escalation: SMS to team lead after 10 minutes if unacknowledged
4. Status page component "User API" moves to Degraded Performance automatically
5. Self-healing agent restarts api-server-2 (the node returning errors)
6. Check recovers from all regions. Incident auto-resolves. Status page returns to Operational

What the responder sees when opening the incident:

  • Which check failed and from which regions
  • Response status code and body from each location
  • When the failure started and how long it has been active
  • Related incidents affecting the same service
  • Self-healing action log (what ran, whether it worked)
  • Full notification timeline (who was paged, who acknowledged)

Total time from detection to recovery: under 2 minutes. The on-call engineer was notified, the status page was updated, and the service was restarted. All from one platform.

Figure: Incident detail page showing timeline, affected check, response data, notification log, and self-healing action status

What Incident Management Includes

Automatic incident creation

Define conditions: number of failures, duration, regions. Incidents are created when conditions match.
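As a rough illustration of how such conditions might be evaluated, here is a minimal sketch. It assumes a condition is made of three thresholds (failure count, distinct regions, and a time window). The names `CheckResult`, `IncidentCondition`, and `should_open_incident` are hypothetical and are not upti.my's configuration format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CheckResult:
    """One probe result from one monitoring region (hypothetical shape)."""
    region: str
    ok: bool
    at: datetime


@dataclass(frozen=True)
class IncidentCondition:
    """Hypothetical thresholds: failures, distinct regions, time window."""
    min_failures: int
    min_regions: int
    window: timedelta


def should_open_incident(results, condition, now):
    """Return True when recent failures satisfy every threshold."""
    cutoff = now - condition.window
    recent_failures = [r for r in results if not r.ok and r.at >= cutoff]
    failing_regions = {r.region for r in recent_failures}
    return (
        len(recent_failures) >= condition.min_failures
        and len(failing_regions) >= condition.min_regions
    )
```

With `IncidentCondition(min_failures=3, min_regions=3, window=timedelta(seconds=60))`, failures from Frankfurt, Mumbai, and Johannesburg inside the same minute would open an incident, while a single failing region would not.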

Connected alert routing

Incidents trigger your alert workflow. Route by severity, service, or team. Escalate if unacknowledged.
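Routing by severity or service can be pictured as an ordered list of rules where the first match wins. The sketch below assumes that model. The `Route` class, `pick_target`, and every target name other than `backend-oncall` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    """Hypothetical routing rule; a field left as None matches anything."""
    target: str
    severity: Optional[str] = None
    service: Optional[str] = None

    def matches(self, severity, service):
        return (self.severity in (None, severity)
                and self.service in (None, service))


def pick_target(routes, severity, service, default="default-oncall"):
    """Return the target of the first matching rule, in list order."""
    for route in routes:
        if route.matches(severity, service):
            return route.target
    return default
```

Putting the most specific rules first keeps the behavior predictable: a service-specific rule can claim its incidents before a broad severity rule sees them.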

Status page integration

Affected components update automatically. Add manual updates during the incident for additional context.
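One way to think about automatic component updates is as a status derived from the set of open incidents. The sketch below assumes exactly that. The status names are taken from the example on this page, while the function name and the incident shape (a dict with a `components` list) are hypothetical simplifications.

```python
# Status names "Operational" and "Degraded Performance" come from the example
# on this page; the incident shape is a hypothetical simplification.
def component_statuses(components, open_incidents):
    """Map each component to a status based on open incidents affecting it."""
    affected = {name for inc in open_incidents for name in inc["components"]}
    return {
        name: "Degraded Performance" if name in affected else "Operational"
        for name in components
    }
```

Because the status is recomputed from open incidents, resolving the last incident for a component returns it to Operational without a separate manual step.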

Full incident timeline

Creation, notifications, acknowledgment, updates, and resolution are all tracked automatically.

Team coordination

See who has been notified, who has acknowledged, and what actions have been taken.

Post-incident data

The full timeline is preserved for retrospectives. No manual reconstruction required.

Frequently Asked Questions

Incidents can be created automatically when a health check fails based on conditions you define (number of failures, duration, affected regions). You can also create incidents manually. Either way, the incident carries full context from the monitoring data.

Incidents can update your status page automatically. When an incident is created, it can post a status page update for the affected components. When the incident resolves, the status page updates again. You can also add manual updates during the incident for additional communication.

Alert routing workflows control who gets notified and when. You can set up escalation chains so if the first responder does not acknowledge within a time window, the alert escalates to the next person or team. Routing can be based on severity, service, time of day, or custom conditions.
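An escalation chain of this kind can be sketched as a list of steps, each with a delay. The example below is a simplified model under stated assumptions: the `EscalationStep` class and `due_steps` function are hypothetical, and `acknowledged` is treated as "acknowledged before any later step came due".

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class EscalationStep:
    """Hypothetical step: notify `target` via `channel` once `after` elapses."""
    after: timedelta
    channel: str
    target: str


def due_steps(chain, elapsed, acknowledged):
    """Return the steps that should have fired by `elapsed`.

    The first step always fires once its delay has passed; later steps
    fire only while the incident remains unacknowledged.
    """
    due = []
    for i, step in enumerate(chain):
        if elapsed < step.after:
            continue  # this step's delay has not passed yet
        if i == 0 or not acknowledged:
            due.append(step)
    return due
```

Using the values from the example earlier on this page, a chain of Slack to backend-oncall at zero minutes followed by SMS to the team lead at ten minutes would fire only the Slack step if the alert is acknowledged in time, and both steps if it is not.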

Each incident includes which check failed, when, from which monitoring location, what the response looked like, how long the failure has been active, and any related incidents. Your team starts investigating with data, not guesswork.

The full incident timeline is preserved after resolution: when the incident was created, who was notified, what actions were taken, when it was acknowledged, and when it resolved. This data is available for retrospectives without manual note-taking.

Run reliability as one connected workflow

Detect failures early, route alerts clearly, coordinate incidents, and keep status updates in sync from one system.