Service Level Objectives (SLOs)
This document defines operator-facing reliability objectives for git-project-sync.
Scope
- Applies to daemon-driven synchronization in
userandsystemservice modes. - SLO compliance is evaluated over a rolling 30-day window unless noted otherwise.
Core SLOs
SLO-1: Sync Freshness
- Objective: 99% of enabled repositories are synchronized within 2 daemon intervals after the latest remote default-branch update is detectable.
- Measurement source: daemon run history + event timestamps (
sync_completed,repo_locked, retry reasons). - Alert threshold: breach risk if >2% of repos exceed freshness target for 3 consecutive cycles.
SLO-2: Sync Success Rate
- Objective: >= 99.5% of attempted repo sync jobs complete without terminal error in each 24-hour period.
- Excludes policy skips that are safety-preserving (
repo_staged_changes,repo_unstaged_changes,repo_untracked_files,repo_conflicts). - Alert threshold: terminal failures (
permanent_error,retry_budget_exceeded,update_failed) >0.5% of attempts.
SLO-3: Daemon Recovery Time
- Objective: 95% of daemon restarts recover in-flight state and resume scheduling within 5 minutes.
- Measurement source: restart timestamp, subsequent recovered run markers, and next successful cycle completion.
- Alert threshold: recovery >5 minutes for two or more consecutive restarts.
SLO-4: Update Safety
- Objective: 100% of update applies are checksum-validated before replacement; 100% of failed applies either rollback cleanly or preserve prior binary.
- Measurement source: update events (
update_started,update_succeeded,update_failed,update_rollback) and release artifact checksums.
Error Budget Policy
- Freshness error budget: 1% of repo-cycle opportunities per 30 days.
- Sync success error budget: 0.5% of attempted jobs per 24 hours.
- Recovery error budget: 5% of restart events per 30 days.
- If any error budget is exhausted:
- freeze non-essential feature rollout,
- prioritize reliability fixes and runbook hardening,
- require explicit incident owner sign-off before resuming normal rollout pace.
Severity Mapping Overview
Detailed severity matrix is maintained in docs/operations/incident-response-playbook.md.
- Sev-1: sustained global service unavailability or data safety risk.
- Sev-2: major degradation affecting multiple sources/repos or SLO breach in progress.
- Sev-3: localized degradation with workaround available.
- Sev-4: low-impact operational defect or informational anomaly.