Mean Time to Repair is the average time between when a incidents is detected and when it is resolved.
What is the intended behavior?
Improve the ability to more rapidly resolve system instability and service outages.
How to improve it
- Make sure the pipeline alway deployable.
- Keep build cycle time short to allow roll-forward.
- Implement feature flags for larger feature changes to allow the them to be deactivated without re-deploying.
- Identify stability issues and prioritize them in the backlog.
How to game it
- Updating support incidents to “closed” before service is restored.
Metrics to use in combination with this metric to prevent unintended consequences.
- Quality decreases if issues re-occur due to lack of improving pipeline quality gates.
- “Accelerate” Ch2: Measuring Performance - Nicole Forsgren PhD, Jez Humble & Gene Kim