Jailbreak
ELI5 — The Vibe Check
A jailbreak is a sneaky prompt that tricks an AI into ignoring its safety rules. It's like convincing the strict teacher to let you skip homework by telling an elaborate story. People craft these creative prompts to make the AI do things it normally wouldn't — like generating harmful content or pretending it has no rules. AI labs play constant whack-a-mole patching them.
Real Talk
In the context of AI, a jailbreak is a prompt injection technique designed to bypass a language model's safety guardrails and content filters. Methods include role-playing scenarios, hypothetical framings, character switching, and multi-step social engineering. AI providers continuously update models to resist known jailbreak patterns while maintaining helpful behavior.
When You'll Hear This
"Someone posted a new jailbreak on Twitter — the safety team is on it." / "The latest model is much more resistant to jailbreaks."
Related Terms
AI Safety
AI Safety is the field of making sure AI doesn't go off the rails.
Alignment
Alignment is the AI safety challenge of making sure AI does what we actually want, not just what we literally said.
Prompt Injection
Prompt injection is the SQL injection of the AI world.
Red Teaming
Red teaming in AI is trying to break the AI on purpose — like hiring someone to try to rob your bank so you can find the security holes.