Skip to content

Service Level Objectives (SLOs)

This document defines operator-facing reliability objectives for git-project-sync.

Scope

  • Applies to daemon-driven synchronization in user and system service modes.
  • SLO compliance is evaluated over a rolling 30-day window unless noted otherwise.

Core SLOs

SLO-1: Sync Freshness

  • Objective: 99% of enabled repositories are synchronized within 2 daemon intervals after the latest remote default-branch update is detectable.
  • Measurement source: daemon run history + event timestamps (sync_completed, repo_locked, retry reasons).
  • Alert threshold: breach risk if >2% of repos exceed freshness target for 3 consecutive cycles.

SLO-2: Sync Success Rate

  • Objective: >= 99.5% of attempted repo sync jobs complete without terminal error in each 24-hour period.
  • Excludes policy skips that are safety-preserving (repo_staged_changes, repo_unstaged_changes, repo_untracked_files, repo_conflicts).
  • Alert threshold: terminal failures (permanent_error, retry_budget_exceeded, update_failed) >0.5% of attempts.

SLO-3: Daemon Recovery Time

  • Objective: 95% of daemon restarts recover in-flight state and resume scheduling within 5 minutes.
  • Measurement source: restart timestamp, subsequent recovered run markers, and next successful cycle completion.
  • Alert threshold: recovery >5 minutes for two or more consecutive restarts.

SLO-4: Update Safety

  • Objective: 100% of update applies are checksum-validated before replacement; 100% of failed applies either rollback cleanly or preserve prior binary.
  • Measurement source: update events (update_started, update_succeeded, update_failed, update_rollback) and release artifact checksums.

Error Budget Policy

  • Freshness error budget: 1% of repo-cycle opportunities per 30 days.
  • Sync success error budget: 0.5% of attempted jobs per 24 hours.
  • Recovery error budget: 5% of restart events per 30 days.
  • If any error budget is exhausted:
    • freeze non-essential feature rollout,
    • prioritize reliability fixes and runbook hardening,
    • require explicit incident owner sign-off before resuming normal rollout pace.

Severity Mapping Overview

Detailed severity matrix is maintained in docs/operations/incident-response-playbook.md.

  • Sev-1: sustained global service unavailability or data safety risk.
  • Sev-2: major degradation affecting multiple sources/repos or SLO breach in progress.
  • Sev-3: localized degradation with workaround available.
  • Sev-4: low-impact operational defect or informational anomaly.