Chaos Monkey
ELI5 — The Vibe Check
Netflix built a program that randomly kills servers in production, on purpose. It's like hiring someone to randomly unplug things in your office to make sure nothing important goes down with them. If your system shrugs off Chaos Monkey, an unexpected server failure shouldn't take it down either. The idea is to cause small failures deliberately so you build resilience before a real outage tests it.
Real Talk
A tool originally created by Netflix that randomly terminates production instances to verify that services keep working when individual instances fail unexpectedly. Originally released as part of the Simian Army suite, it reinforces the design principle that services must be stateless and fault-tolerant: because any instance can disappear at any time, a service cannot depend on any single one surviving.
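To make the mechanism concrete, here is a minimal simulation sketch in Python. It is not Chaos Monkey's actual code and calls no cloud API: `InstanceGroup`, `unleash_monkey`, and `run_demo` are hypothetical names invented for illustration, and the "auto-scaling" is just a loop that refills an in-memory list. It only shows the shape of the idea: pick a random instance, terminate it, and check that the group recovers.

```python
import random
from dataclasses import dataclass, field


@dataclass
class InstanceGroup:
    """Hypothetical stand-in for an auto-scaling group of identical instances."""
    name: str
    desired: int
    instances: list = field(default_factory=list)
    _counter: int = 0

    def reconcile(self):
        """Launch replacements until the group is back at its desired size."""
        while len(self.instances) < self.desired:
            self._counter += 1
            self.instances.append(f"{self.name}-{self._counter}")


def unleash_monkey(group, probability, rng):
    """With the given probability, terminate one randomly chosen instance.

    Returns the terminated instance id, or None if nothing was terminated.
    """
    if not group.instances or rng.random() >= probability:
        return None
    victim = rng.choice(group.instances)
    group.instances.remove(victim)
    return victim


def run_demo(rounds=5, seed=0):
    """Alternate random terminations with recovery; return the group and the kill log."""
    rng = random.Random(seed)
    group = InstanceGroup(name="web", desired=3)
    group.reconcile()
    killed = []
    for _ in range(rounds):
        victim = unleash_monkey(group, probability=0.5, rng=rng)
        if victim is not None:
            killed.append(victim)
        group.reconcile()  # the recovery mechanism being exercised
    return group, killed


if __name__ == "__main__":
    group, killed = run_demo()
    print("terminated:", killed)
    print("still running:", group.instances)
```

In this sketch the thing under test is `reconcile`, not the monkey: if the group returns to its desired size after every termination, the failure was survivable. A stateful service would fail the same exercise, because a replacement instance starts without whatever the terminated one held in memory.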
When You'll Hear This
"Chaos Monkey killed 3 instances during lunch and nobody noticed — our auto-scaling handled it." / "We run Chaos Monkey during business hours because that's when failures need to be survivable."
Related Terms
Canary Analysis
Named after canaries in coal mines — you send a small version of your new deployment into production first to see if it dies. Route 5% of traffic to the ne
Gremlin
Gremlin is the enterprise version of breaking things on purpose. It's a commercial chaos engineering platform with a nice UI where you can inject CPU spike
Litmus Chaos
Litmus Chaos is like Chaos Monkey but specifically for Kubernetes. It's a chaos engineering platform that can kill pods, stress CPUs, inject network latenc