Incident Response
ELI5 — The Vibe Check
Incident Response is the process your team follows when production breaks. Who gets paged? Who's the incident commander? How do you communicate status to users? It's like a fire drill, but for software. Teams that practice it tend to resolve incidents in minutes; teams that don't can take hours, with lots of chaos along the way.
Real Talk
Incident Response is a structured process for detecting, triaging, and resolving service disruptions. Key roles include an incident commander (coordinates the response), a technical lead (investigates the cause), and a communications lead (updates stakeholders). The process is documented in runbooks and practiced regularly via game days.
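To make the roles and stages concrete, here is a minimal Python sketch of an incident record. The class name, field names, and the three-state lifecycle (detected, triaged, resolved) are assumptions chosen for illustration; real incident tooling usually tracks more states and more metadata.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Status(Enum):
    DETECTED = "detected"
    TRIAGED = "triaged"
    RESOLVED = "resolved"


# Allowed forward transitions (hypothetical: a real process may add
# states such as "mitigated" or "monitoring").
_NEXT = {Status.DETECTED: Status.TRIAGED, Status.TRIAGED: Status.RESOLVED}


@dataclass
class Incident:
    title: str
    incident_commander: str    # coordinates the response
    technical_lead: str        # investigates
    communications_lead: str   # updates stakeholders
    status: Status = Status.DETECTED
    timeline: list = field(default_factory=list)

    def advance(self, note: str) -> Status:
        """Move to the next stage and record a timestamped timeline entry."""
        if self.status not in _NEXT:
            raise ValueError("incident is already resolved")
        self.status = _NEXT[self.status]
        self.timeline.append((datetime.now(timezone.utc), self.status, note))
        return self.status


incident = Incident(
    title="Checkout returning 500s",
    incident_commander="alice",
    technical_lead="bob",
    communications_lead="carol",
)
incident.advance("Confirmed user impact, severity assigned")
incident.advance("Bad deploy rolled back")
```

Keeping a timestamped timeline as the incident progresses also gives the postmortem a ready-made record of what happened and when.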
When You'll Hear This
"Follow the incident response playbook — step one is declare an incident in Slack." / "Good incident response reduced our MTTR from 2 hours to 20 minutes."
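MTTR in the second quote is commonly expanded as "mean time to recovery" (some teams say "repair" or "resolve" instead). As a rough sketch, assuming it is measured from detection to resolution and averaged over incidents, it could be computed like this in Python. The function name and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not ints
    )
    return total / len(incidents)


# Hypothetical sample data: one 30-minute and one 10-minute incident.
sample = [
    (datetime(2024, 1, 5, 3, 0), datetime(2024, 1, 5, 3, 30)),
    (datetime(2024, 2, 9, 14, 0), datetime(2024, 2, 9, 14, 10)),
]
print(mean_time_to_recovery(sample))  # prints 0:20:00
```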
Related Terms
Alerting
Alerting is the part of monitoring that actually wakes people up when something goes wrong.
Incident
An incident is when something has gone wrong in production and users are affected.
On-call
On-call means it's your turn to be the person who gets woken up at 3am if production breaks.
Playbook
A Playbook is like a runbook but bigger — it covers a whole category of operations, not just one specific scenario.
Postmortem
A Postmortem is the meeting you have after an incident to figure out what went wrong and how to prevent it from happening again.
Runbook
A Runbook is a step-by-step guide for handling a specific operational task or incident.