Self-Healing Recovery Actions
Configure deterministic repair recipes with allowed actions, retries, verification steps, and safety limits.
Overview
In the open-source agent, remediation is configured as recipes (step sequences) and repairs (rules mapping checks to recipes). Source of truth: github.com/uptimy/uptimy-agent.
Only allowlisted actions can run. Forbidden actions are blocked by hardcoded guardrails.
ℹ️ Recovery Action Flow
Flow: a check fails, the matching repair rule triggers, and the recipe steps run in order. The verification step(s) then run, and the incident closes on success or remains open on failure.
Core Configuration Model
Remediation is configured with two blocks:
- recipes: ordered action steps with optional retries, duration, params, and on_failure_only
- repairs: rules mapping a check name to a recipe, with max_repairs_per_hour
```yaml
recipes:
  - name: heal-api
    steps:
      - action: restart_container
        retries: 2
        params:
          container: api
      - action: wait
        duration: 15s
      - action: healthcheck
        check: api-health
      - action: webhook
        on_failure_only: true
        params:
          url: ${ALERT_WEBHOOK_URL}
          method: POST

repairs:
  - rule: api-down
    check: api-health
    recipe: heal-api
    max_repairs_per_hour: 5
```

Allowed Actions
Recovery actions are the operations the agent can perform to fix detected issues. Recipe orchestration steps (wait, healthcheck, webhook) control the flow between recovery actions.
Recovery Actions
| Action | Key Params | Description |
|---|---|---|
| restart_container | container, timeout | Restart a Docker container |
| recreate_container | container | Remove and recreate a Docker container |
| start_container | container | Start a stopped Docker container |
| stop_container | container, timeout | Stop a running Docker container |
| update_swarm_service | service | Force-update a Docker Swarm service |
| scale_swarm_service | service, replicas | Scale a Docker Swarm service to a target replica count |
| restart_service | service | Restart a systemd service |
| start_service | service | Start a stopped systemd service |
| stop_service | service | Stop a running systemd service |
| reboot_host | delay | Reboot the host machine (use with extreme caution) |
| clear_temp | path, age | Remove old files from a directory |
| rotate_logs | path, max_size | Rotate or truncate log files |
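
As a rough sketch of how the Key Params column might map onto a recipe step, the fragment below uses scale_swarm_service with both of its listed parameters. It assumes these parameters sit under params, the way container does for restart_container in the heal-api example. The recipe name, service name, replica count, wait duration, and check name are hypothetical placeholders, not values taken from the agent.

```yaml
recipes:
  - name: scale-web                 # hypothetical recipe name
    steps:
      - action: scale_swarm_service
        params:
          service: web              # hypothetical Swarm service name
          replicas: 3               # illustrative target replica count
      - action: wait
        duration: 30s               # illustrative pause before verifying
      - action: healthcheck
        check: web-health           # hypothetical check name
```

Following the scaling step with wait and healthcheck mirrors the heal-api recipe, so the repair is verified rather than assumed to have worked.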
Recipe Orchestration Steps
These steps control the flow of a recipe but are not recovery actions themselves.
| Step | Key Params | Description |
|---|---|---|
| wait | duration | Pause between repair steps |
| healthcheck | check | Re-run a check to verify remediation |
| webhook | url, method | Send an HTTP webhook notification |
Forbidden Actions
These actions are always blocked by the safety model: shell_exec, delete_files, and modify_secrets.
Safety Guardrails
| Guardrail | Description |
|---|---|
| Rate limit | max_repairs_per_hour per repair rule |
| Cooldowns | Minimum interval between executions of the same action |
| Concurrency cap | Up to 10 concurrent repairs |
| Audit trail | Repair executions are logged and persisted (BoltDB) |
💡 Best Practices
Start with low max_repairs_per_hour values (2-3), use healthcheck verification steps after repair actions, and add webhook escalation with on_failure_only: true.
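
One way these practices could look together is sketched below for a systemd service. It assumes restart_service takes its service parameter under params, as restart_container does with container. The recipe, rule, check, and service names are hypothetical, as are the retry count and wait duration. ALERT_WEBHOOK_URL is the same environment variable used in the heal-api example.

```yaml
recipes:
  - name: heal-nginx                # hypothetical recipe name
    steps:
      - action: restart_service
        retries: 1                  # illustrative retry count
        params:
          service: nginx            # hypothetical systemd unit
      - action: wait
        duration: 10s               # illustrative settle time
      - action: healthcheck         # verify the repair
        check: nginx-health         # hypothetical check name
      - action: webhook             # escalate only if the recipe failed
        on_failure_only: true
        params:
          url: ${ALERT_WEBHOOK_URL}
          method: POST

repairs:
  - rule: nginx-down                # hypothetical rule name
    check: nginx-health
    recipe: heal-nginx
    max_repairs_per_hour: 2         # conservative starting limit
```

Keeping the hourly limit low means a service that keeps failing is handed to a human through the webhook instead of being restarted indefinitely.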