upti.my

Self-Healing Recovery Actions

Configure deterministic repair recipes with allowed actions, retries, verification steps, and safety limits.

Overview

In the open-source agent, remediation is configured as recipes (step sequences) and repairs (rules mapping checks to recipes). Source of truth: github.com/uptimy/uptimy-agent.

Only allowlisted actions can run. Forbidden actions are blocked by hardcoded guardrails.

ℹ️ Recovery Action Flow

Flow: a check fails, the matching repair rule triggers, the recipe steps run in order, and the verification step(s) run. The incident closes on success or remains open on failure.
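
The flow above can be sketched in a few lines of Python. This is an illustrative model only, not the agent's implementation: the names Step, Incident, and handle_failed_check are hypothetical, retries are omitted, and the sketch assumes that a failed step stops the remaining regular steps while on_failure_only steps run only after a failure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    action: str
    run: Callable[[], bool]        # hypothetical executor: returns True on success
    on_failure_only: bool = False


@dataclass
class Incident:
    check: str
    status: str = "open"
    log: List[str] = field(default_factory=list)


def handle_failed_check(
    check: str,
    repairs: Dict[str, str],          # check name -> recipe name
    recipes: Dict[str, List[Step]],   # recipe name -> ordered steps
) -> Incident:
    incident = Incident(check=check)
    recipe_name = repairs.get(check)
    if recipe_name is None:
        incident.log.append("no matching repair rule")
        return incident               # nothing to run, incident stays open

    failed = False
    for step in recipes[recipe_name]:
        # Assumption: regular steps run until the first failure;
        # on_failure_only steps run only once something has failed.
        if step.on_failure_only != failed:
            incident.log.append(f"skip {step.action}")
            continue
        ok = step.run()
        incident.log.append(f"{step.action}: {'ok' if ok else 'failed'}")
        if not ok and not step.on_failure_only:
            failed = True

    if not failed:
        incident.status = "closed"    # all steps, including verification, passed
    return incident
```

In this model a healthcheck step is just another Step whose run function re-runs the check, so a failed verification leaves the incident open and lets an escalation step fire.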

Core Configuration Model

Repairs are configured with two blocks:

  • recipes: ordered action steps with optional retries, duration, params, and on_failure_only
  • repairs: rules mapping a check name to a recipe, with max_repairs_per_hour

Recipe + Repair Rule Example

recipes:
  - name: heal-api
    steps:
      - action: restart_container
        retries: 2
        params:
          container: api
      - action: wait
        duration: 15s
      - action: healthcheck
        check: api-health
      - action: webhook
        on_failure_only: true
        params:
          url: ${ALERT_WEBHOOK_URL}
          method: POST

repairs:
  - rule: api-down
    check: api-health
    recipe: heal-api
    max_repairs_per_hour: 5
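
Because each repair rule refers to a recipe by name, a typo in either block would leave the rule pointing at nothing. The sketch below shows one way to cross-check the two blocks before deploying a config. It is a hypothetical helper, not part of the agent: it assumes the YAML has already been parsed into plain dictionaries with the same keys as the example above, and find_config_errors is an invented name.

```python
from typing import Any, Dict, List


def find_config_errors(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a parsed recipes/repairs config."""
    errors: List[str] = []

    # Collect recipe names and flag duplicates.
    seen = set()
    for recipe in config.get("recipes", []):
        name = recipe.get("name")
        if name in seen:
            errors.append(f"duplicate recipe name {name!r}")
        seen.add(name)

    # Every repair rule must point at a recipe that exists.
    for repair in config.get("repairs", []):
        rule = repair.get("rule", "<unnamed>")
        target = repair.get("recipe")
        if target not in seen:
            errors.append(f"{rule}: unknown recipe {target!r}")

    return errors


# The example config from this page, written as the dictionaries a YAML parser would produce.
config = {
    "recipes": [
        {
            "name": "heal-api",
            "steps": [
                {"action": "restart_container", "retries": 2, "params": {"container": "api"}},
                {"action": "wait", "duration": "15s"},
                {"action": "healthcheck", "check": "api-health"},
            ],
        }
    ],
    "repairs": [
        {"rule": "api-down", "check": "api-health", "recipe": "heal-api", "max_repairs_per_hour": 5}
    ],
}

for problem in find_config_errors(config):
    print(problem)
```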

Allowed Actions

Recovery actions are the operations the agent can perform to fix detected issues. Recipe orchestration steps (wait, healthcheck, webhook) control the flow between recovery actions.

Recovery Actions

| Action | Key Params | Description |
| --- | --- | --- |
| restart_container | container, timeout | Restart a Docker container |
| recreate_container | container | Remove and recreate a Docker container |
| start_container | container | Start a stopped Docker container |
| stop_container | container, timeout | Stop a running Docker container |
| update_swarm_service | service | Force-update a Docker Swarm service |
| scale_swarm_service | service, replicas | Scale a Docker Swarm service to a target replica count |
| restart_service | service | Restart a systemd service |
| start_service | service | Start a stopped systemd service |
| stop_service | service | Stop a running systemd service |
| reboot_host | delay | Reboot the host machine (use with extreme caution) |
| clear_temp | path, age | Remove old files from a directory |
| rotate_logs | path, max_size | Rotate or truncate log files |

Recipe Orchestration Steps

These steps control the flow of a recipe but are not recovery actions themselves.

| Step | Key Params | Description |
| --- | --- | --- |
| wait | duration | Pause between repair steps |
| healthcheck | check | Re-run a check to verify remediation |
| webhook | url, method | Send an HTTP webhook notification |

Forbidden Actions

These actions are always blocked by the safety model: shell_exec, delete_files, and modify_secrets.
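
The allowlist and the forbidden list together amount to a deny-by-default check. The following sketch illustrates that idea using the action names from this page. It is an assumption about how such a gate could look, not the agent's actual guardrail code, and check_action is a hypothetical name.

```python
# Action names taken from the tables on this page.
RECOVERY_ACTIONS = frozenset({
    "restart_container", "recreate_container", "start_container", "stop_container",
    "update_swarm_service", "scale_swarm_service",
    "restart_service", "start_service", "stop_service",
    "reboot_host", "clear_temp", "rotate_logs",
})
ORCHESTRATION_STEPS = frozenset({"wait", "healthcheck", "webhook"})
FORBIDDEN_ACTIONS = frozenset({"shell_exec", "delete_files", "modify_secrets"})

ALLOWED = RECOVERY_ACTIONS | ORCHESTRATION_STEPS


def check_action(action: str) -> None:
    """Raise PermissionError unless the action is explicitly allowlisted."""
    if action in FORBIDDEN_ACTIONS:
        raise PermissionError(f"{action} is forbidden by the safety model")
    if action not in ALLOWED:
        # Deny by default: unknown names are rejected, not passed through.
        raise PermissionError(f"{action} is not an allowlisted action")


check_action("restart_container")   # passes silently
try:
    check_action("shell_exec")
except PermissionError as exc:
    print(exc)
```

Checking the forbidden set first is only for a clearer error message; an unknown action is rejected either way.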

Safety Guardrails

| Guardrail | Description |
| --- | --- |
| Rate limit | max_repairs_per_hour per repair rule |
| Cooldowns | Minimum interval between executions of the same action |
| Concurrency cap | Up to 10 concurrent repairs |
| Audit trail | Repair executions are logged and persisted (BoltDB) |
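
To make the rate limit concrete, here is a minimal sketch of a per-rule hourly limiter. It assumes a sliding one-hour window; the agent may count repairs differently (for example, in fixed clock-hour buckets), so treat this as an illustration of the concept only. HourlyRepairLimiter and its allow method are hypothetical names.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict

WINDOW_SECONDS = 3600


class HourlyRepairLimiter:
    """Tracks repair timestamps per rule and enforces a max-per-hour limit."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._history: Dict[str, Deque[float]] = {}

    def allow(self, rule: str, max_repairs_per_hour: int) -> bool:
        now = self._clock()
        stamps = self._history.setdefault(rule, deque())
        # Drop executions that have aged out of the one-hour window.
        while stamps and now - stamps[0] >= WINDOW_SECONDS:
            stamps.popleft()
        if len(stamps) >= max_repairs_per_hour:
            return False              # limit reached: skip this repair
        stamps.append(now)            # record the repair that is about to run
        return True


limiter = HourlyRepairLimiter()
if limiter.allow("api-down", max_repairs_per_hour=5):
    print("repair may run")
```

Each rule keeps its own history, so a noisy rule that exhausts its budget does not block repairs for other rules.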

💡 Best Practices

Start with low max_repairs_per_hour values (2-3), use healthcheck verification steps after repair actions, and add webhook escalation with on_failure_only: true.