Runbook
ELI5 — The Vibe Check
A Runbook is a step-by-step guide for handling a specific operational task or incident. It's like the instruction manual for when things go wrong — 'database is slow, follow these steps.' When you get paged at 3am with a foggy brain, a good runbook means you don't have to figure everything out from scratch.
Real Talk
A runbook is a documented set of procedures for performing operational tasks, particularly incident response. Runbooks range from fully automated (auto-remediation scripts) to human-executed checklists. They reduce MTTR (mean time to recovery), let junior engineers handle incidents, and capture institutional knowledge.
Show Me The Code
# Runbook: Database Connection Exhausted
## Symptoms
- 503 errors on /api endpoints
- DB connection pool utilization > 95%
## Steps
1. Check active connections: SELECT count(*) FROM pg_stat_activity
2. Kill idle connections: SELECT pg_terminate_backend(pid)...
3. Restart app pods: kubectl rollout restart deployment/api
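A runbook like the one above sits at the human-executed end of the spectrum. As a rough sketch of a first step toward automation, the same steps can be stored as data and printed only when the symptom check fires. The names here (`Step`, `pool_exhausted`, `render_checklist`, `DB_CONNECTION_RUNBOOK`) are hypothetical, the `in_use` and `pool_size` values would come from whatever metrics system you use, and nothing is executed against a database or cluster.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    description: str
    command: str  # shown to the operator; this sketch never executes it


def pool_exhausted(in_use: int, pool_size: int, threshold: float = 0.95) -> bool:
    """Return True when pool utilization is above the threshold (the > 95% symptom)."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return in_use / pool_size > threshold


# Steps copied from the runbook above; the second command stays elided,
# exactly as it appears in the original.
DB_CONNECTION_RUNBOOK: List[Step] = [
    Step("Check active connections", "SELECT count(*) FROM pg_stat_activity"),
    Step("Kill idle connections", "SELECT pg_terminate_backend(pid)..."),
    Step("Restart app pods", "kubectl rollout restart deployment/api"),
]


def render_checklist(steps: List[Step]) -> str:
    """Format the steps as a numbered checklist for the on-call engineer."""
    return "\n".join(
        f"{i}. {s.description}: {s.command}" for i, s in enumerate(steps, start=1)
    )


if __name__ == "__main__":
    # Example values only; real numbers would come from your monitoring.
    if pool_exhausted(in_use=97, pool_size=100):
        print(render_checklist(DB_CONNECTION_RUNBOOK))
```

Keeping the commands as plain strings means a human still decides what to run, which is usually the safer default for destructive steps such as terminating connections or restarting pods.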
When You'll Hear This
"Write a runbook for the most common incidents so the on-call rotation isn't miserable." / "Follow the database runbook — don't improvise during an incident."
Related Terms
Incident Response
Incident Response is the process your team follows when production breaks. Who gets paged? Who's the incident commander?
On-call
On-call means it's your turn to be the person who gets woken up at 3am if production breaks.
Playbook
A Playbook is like a runbook but bigger — it covers a whole category of operations, not just one specific scenario.
Postmortem
A Postmortem is the meeting you have after an incident to figure out what went wrong and how to prevent it from happening again.
SRE (Site Reliability Engineering)
SRE is Google's version of DevOps with a more engineering-focused twist.