Incident Response Playbook
This runbook defines safe triage and recovery procedures for common production incidents.
Incident Triage Flow
- Confirm daemon status:
syncctl daemon status
- Collect health and event signals:
syncctl doctor
syncctl stats show
syncctl events list --limit 100
- Drill into one failing run:
syncctl trace show <trace-id>
- Identify reason code and use the matching procedure below.
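The read-only commands in the triage flow can be captured in one pass so their output is available for later review. The following is a minimal sketch, not part of syncctl: the function name collect_triage, the output directory name, and the file names are assumptions made for illustration. Only the syncctl subcommands listed above are invoked.

```shell
#!/bin/sh
# Hypothetical helper: save triage output to a directory for later review.
# Usage: collect_triage [output-dir]
collect_triage() {
    # Default directory name is an assumption, not a syncctl convention.
    out_dir="${1:-triage-$(date +%Y%m%dT%H%M%S)}"
    mkdir -p "$out_dir" || return 1

    # Each command runs independently so one failure does not hide the others.
    syncctl daemon status            > "$out_dir/daemon-status.txt" 2>&1
    syncctl doctor                   > "$out_dir/doctor.txt"        2>&1
    syncctl stats show               > "$out_dir/stats.txt"         2>&1
    syncctl events list --limit 100  > "$out_dir/events.txt"        2>&1

    # Print the directory so callers can attach it to the incident record.
    echo "$out_dir"
}
```

The per-run drill-down (syncctl trace show <trace-id>) is left out on purpose, since the trace ID has to be picked by the operator from the collected events.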
Auth Failure Recovery
Symptoms:
- Repeated permanent_error events tied to a provider source.
- syncctl doctor indicates auth/credential issues.
Procedure:
- Verify source configuration (syncctl source list / config file).
- Validate current token:
syncctl auth test <source-id>
- Re-authenticate if needed:
syncctl auth login <source-id>
- Rerun targeted sync:
syncctl repo sync <path> --dry-run
- Resume daemon cycle after successful validation.
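The validation and re-authentication steps above can be chained for a single source. This is a sketch under one stated assumption: that syncctl auth test exits non-zero when the token is invalid. This runbook does not confirm that behavior, so check it before relying on the script. The function name recover_auth is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: re-validate credentials for one source, then dry-run a sync.
# Assumption: `syncctl auth test` exits non-zero when the token is invalid.
# Usage: recover_auth <source-id> <path>
recover_auth() {
    source_id="$1"
    repo_path="$2"

    if ! syncctl auth test "$source_id"; then
        # Token rejected: re-authenticate, then confirm the new token works.
        syncctl auth login "$source_id" || return 1
        syncctl auth test "$source_id" || return 1
    fi

    # Dry run only, as in the procedure above.
    syncctl repo sync "$repo_path" --dry-run
}
```

Resuming the daemon cycle is deliberately left as a manual step after the operator has reviewed the dry-run output.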
Lock Contention Recovery
Symptoms:
- Frequent repo_locked events for the same repositories.
Procedure:
- Confirm only one daemon instance is active.
- Check recent traces for long-running operations.
- Tune concurrency if needed:
  - lower daemon.max_parallel_repos
  - lower daemon.max_parallel_per_source
- Restart daemon if lock state appears stale.
- Verify lock contention returns to normal levels in syncctl stats show.
Restart Storm Recovery
Symptoms:
- Frequent daemon restarts while sync work is active.
- Repeated short-lived in-flight runs with degraded throughput.
Procedure:
- Stabilize process lifecycle (disable crash loop trigger, stop external restarts temporarily).
- Confirm in-flight recovery behavior after restart:
syncctl doctor
syncctl events list --limit 100
- Verify no stale in-flight state remains and new runs progress normally.
- Resume normal daemon restart policy once run completion and lock behavior normalize.
Escalation criteria:
- If repeated restart storms continue for more than 2 scheduler intervals, escalate as a service reliability incident.
- If lock contention and restart storms occur together, reduce concurrency and treat as sev-2 until stable.
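The two escalation rules above can be written as a small decision helper. This is an illustrative sketch: the function name, its arguments, and the "no escalation" message are assumptions, and counting storm intervals is left to the operator or to monitoring.

```shell
#!/bin/sh
# Hypothetical helper encoding the escalation criteria for restart storms.
# $1: number of consecutive scheduler intervals with restart storms
# $2: "yes" if lock contention is also present, anything else otherwise
restart_storm_actions() {
    intervals="$1"
    lock_contention="$2"
    matched=0

    # Rule 1: storms lasting more than 2 scheduler intervals.
    if [ "$intervals" -gt 2 ]; then
        echo "escalate: service reliability incident"
        matched=1
    fi

    # Rule 2: storms and lock contention occurring together.
    if [ "$intervals" -gt 0 ] && [ "$lock_contention" = "yes" ]; then
        echo "reduce concurrency; treat as sev-2 until stable"
        matched=1
    fi

    # Fallback message is an assumption, not wording from this runbook.
    if [ "$matched" -eq 0 ]; then
        echo "no escalation; continue procedure"
    fi
}
```

Both rules are checked independently, so a long storm with lock contention prints both actions.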
Update Failure and Rollback Recovery
Symptoms:
- update_failed followed by update_rollback.
Procedure:
- Confirm the service is healthy on the rolled-back version.
- Check release manifest/checksum availability and integrity.
- Inspect update trace events for exact failure phase.
- Retry update once root cause is fixed:
syncctl update check
syncctl update apply --channel stable
- If repeated failures persist, pin current version and open an incident.
Network and Provider Throttling Recovery
Symptoms:
- Repeated network_error, timeout, or provider_rate_limited.
Procedure:
- Validate outbound connectivity, proxy, and DNS.
- Confirm provider status page and account quota limits.
- Reduce pressure by tuning daemon concurrency and retry settings.
- Re-run syncctl doctor and monitor event trends for 1-2 cycles.
Severity Matrix and Response Timelines
| Severity | Typical Conditions | Initial Response Target | Escalation Target |
|---|---|---|---|
| sev-1 | Service-wide outage, repeated crash loops without recovery, or safety guarantees at risk | 15 minutes | 30 minutes |
| sev-2 | Multi-source sync degradation, restart storm with rising backlog, or active SLO breach risk | 30 minutes | 60 minutes |
| sev-3 | Localized source/repo failures, persistent lock contention, degraded but operating service | 4 hours | Next business window |
| sev-4 | Low-impact anomalies, intermittent warnings, or non-critical operator friction | 1 business day | Backlog grooming |
Reason-Code to Severity Defaults
| Reason Code / Class | Default Severity | Notes |
|---|---|---|
| repo_conflicts, repo_staged_changes, repo_unstaged_changes, repo_untracked_files | sev-4 | Safety-preserving skips; user action usually required |
| repo_locked | sev-3 | Raise to sev-2 if sustained across many repos or accompanied by restart storm |
| provider_rate_limited, network_error, timeout | sev-3 | Raise to sev-2 when widespread and affecting freshness SLO |
| retry_budget_exceeded, permanent_error | sev-2 | Can become sev-1 if global and prolonged |
| update_failed, update_rollback | sev-2 | Raise to sev-1 if rollback fails or binary health is degraded |
| install_* failures in production rollout | sev-3 | Raise based on blast radius (fleet-wide => sev-2) |
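For scripted triage, the defaults in this table and the response targets in the severity matrix can be looked up with a pair of shell functions. This is a sketch only: the function names and the output format are assumptions, and the "raise to" conditions in the Notes column are not encoded because they need operator judgment.

```shell
#!/bin/sh
# Hypothetical helper: map a reason code to its default severity.
default_severity() {
    case "$1" in
        repo_conflicts|repo_staged_changes|repo_unstaged_changes|repo_untracked_files)
            echo "sev-4" ;;
        repo_locked|provider_rate_limited|network_error|timeout)
            echo "sev-3" ;;
        retry_budget_exceeded|permanent_error|update_failed|update_rollback)
            echo "sev-2" ;;
        install_*)
            echo "sev-3" ;;
        *)
            echo "unknown reason code: $1" >&2
            return 1 ;;
    esac
}

# Hypothetical helper: print the response targets for a severity level.
response_targets() {
    case "$1" in
        sev-1) echo "initial=15 minutes escalation=30 minutes" ;;
        sev-2) echo "initial=30 minutes escalation=60 minutes" ;;
        sev-3) echo "initial=4 hours escalation=next business window" ;;
        sev-4) echo "initial=1 business day escalation=backlog grooming" ;;
        *)
            echo "unknown severity: $1" >&2
            return 1 ;;
    esac
}
```

For example, default_severity repo_locked prints sev-3, and passing that result to response_targets prints the 4-hour initial response target.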
Post-Incident Checklist
- Capture trace IDs and top reason codes from the incident window.
- Document operator actions taken and resulting state.
- Add follow-up items for config tuning, token rotation, or docs updates.
- If safety guards triggered (repo_*, non_fast_forward), keep manual intervention audit notes.