Mean Time to Recovery
ELI5 — The Vibe Check
Mean Time to Recovery (MTTR) is how fast you bounce back from production incidents. Elite teams recover in under an hour. If it takes you a day to recover from an outage, that's a problem. MTTR rewards fast detection, good runbooks, and easy rollback — not preventing all failures (that's impossible).
Real Talk
Mean Time to Recovery is a DORA metric measuring the average time between the start of a production incident and service restoration. Elite performers achieve less than one hour. Low MTTR requires fast detection (monitoring, alerting), clear incident procedures, automated rollback capabilities, and practiced response.
When You'll Hear This
"Our MTTR improved from 4 hours to 30 minutes after implementing automated rollback." / "Feature flags let us disable broken features in seconds — that's how we keep MTTR low."
Related Terms
Change Failure Rate
Change Failure Rate is the percentage of deployments that cause a production incident. Deploy 100 times, 5 cause issues? That's 5% CFR.
DORA Metrics
DORA Metrics are four numbers that measure how well your team delivers software: how often you deploy, how fast code goes from commit to production, how of...
Incident Response
Incident Response is the process your team follows when production breaks. Who gets paged? Who's the incident commander?
Rollback
A rollback is the panic button. When you deploy something and it breaks production, you hit rollback and the system reverts to the last working version — l...