SRE
Site Reliability Engineering
ELI5 — The Vibe Check
SRE is Google's version of DevOps with a more engineering-focused twist. SREs are software engineers who apply code and automation to the problem of keeping systems running reliably. If DevOps is a philosophy, SRE is a specific job title and set of practices with concrete metrics like SLOs and error budgets.
Real Talk
Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to infrastructure and operations problems. It originated at Google. SREs take a quantitative approach to reliability: they define SLOs, track error budgets, and automate toil to maintain system health at scale.
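To make "error budget" concrete, here is a minimal sketch of the arithmetic, assuming a request-based availability SLO. The function name and the example numbers are hypothetical and chosen only for illustration; real teams often compute this over a rolling window from their monitoring data.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exactly
    exhausted, and a negative value means the SLO has been missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The error budget is whatever the SLO does not promise:
    # a 99.9% target leaves 0.1% of requests that are allowed to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical month: 1,000,000 requests against a 99.9% SLO.
# About 1,000 failures are allowed; 250 have occurred,
# so roughly three quarters of the budget is left.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.0%}")
```

When the result drops to zero or below, the budget is exhausted, which is the situation in which a team might pause feature work to focus on reliability.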
When You'll Hear This
"The SRE team owns the on-call rotation and production reliability." / "SRE said our error budget is exhausted — no new features until we fix reliability."
Related Terms
DevOps
DevOps is the culture and practice of tearing down the wall between the people who write code (Dev) and the people who run it in production (Ops).
On-call
On-call means it's your turn to be the person who gets woken up at 3am if production breaks.
Site Reliability Engineering
Site Reliability Engineering is the discipline of making sure your app stays up and runs well, using the same rigorous engineering methods you'd use to wri...
SLA (Service Level Agreement)
An SLA is a contract between you and your users about how reliable your service will be: 'We promise the app will be up 99.9% of the time.'
SLI (Service Level Indicator)
An SLI is the actual measurement you track to know if you're hitting your SLO. If the SLO says 'be fast,' the SLI is the actual timer measuring speed.
SLO (Service Level Objective)
An SLO is the internal target you set for how well your service should run. Think of the SLO as a personal fitness goal and the SLA as a race you signed up for.
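The three terms above fit together in a simple way: the SLI is what you measure, the SLO is the internal bar you compare it to, and the SLA is the external promise. The sketch below illustrates that relationship. It assumes a 30-day window, an event-based availability SLI, and an internal SLO (99.95%) that is stricter than the 99.9% SLA; all names and numbers other than the 99.9% figure are hypothetical.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Minutes of downtime a target tolerates over the window (30 days assumed)."""
    return (1.0 - availability_target) * window_days * 24 * 60


def availability_sli(good_events: int, total_events: int) -> float:
    """The SLI: the measured fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def meets_target(sli: float, target: float) -> bool:
    """Compare the measurement (SLI) against a target (an SLO or an SLA)."""
    return sli >= target


SLA_TARGET = 0.999   # the external promise from the SLA example above
SLO_TARGET = 0.9995  # hypothetical, stricter internal goal

# A 99.9% target over 30 days tolerates roughly 43 minutes of downtime.
print(f"SLA downtime allowance: {allowed_downtime_minutes(SLA_TARGET):.1f} min")

# Hypothetical measurement: 99,930 good events out of 100,000.
sli = availability_sli(99_930, 100_000)
print("Meets SLA:", meets_target(sli, SLA_TARGET))
print("Meets SLO:", meets_target(sli, SLO_TARGET))
```

With these hypothetical numbers the service is still inside its SLA but has already missed its stricter SLO, which is the early warning an internal target is meant to give.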