RLHF
Reinforcement Learning from Human Feedback
ELI5 — The Vibe Check
RLHF is like training a puppy — instead of giving the AI a textbook, you let humans rate its answers with thumbs up or thumbs down. Over time, it learns to give responses humans actually prefer. It's the reason ChatGPT sounds helpful and polite instead of spewing random internet text. Basically, humans are the dog trainers and the AI is a very eager golden retriever.
Real Talk
RLHF is a training technique where a language model is fine-tuned using human preference data. The process involves: (1) supervised fine-tuning on demonstrations, (2) training a reward model on human comparisons, and (3) optimizing the policy model with reinforcement learning — typically PPO (Proximal Policy Optimization) — against the reward model. It aligns model behavior with human preferences beyond what supervised training alone achieves.
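The heart of step (2) is surprisingly simple: fit a model so that preferred responses score higher than rejected ones, via the Bradley-Terry loss -log sigmoid(r_preferred - r_rejected). Here's a minimal sketch with made-up toy features standing in for real response embeddings — a real reward model would be a neural network, not a dot product:

```python
import math
import random

def reward(w, x):
    # Toy linear reward model: score = w . x
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(comparisons, dim, lr=0.1, epochs=200, seed=0):
    """Fit a toy reward model on (preferred, rejected) feature pairs
    using the Bradley-Terry loss: -log sigmoid(r_pref - r_rej)."""
    rng = random.Random(seed)
    w = [rng.gauss(0, 0.1) for _ in range(dim)]
    for _ in range(epochs):
        for pref, rej in comparisons:
            margin = reward(w, pref) - reward(w, rej)
            # Gradient of -log sigmoid(margin) w.r.t. w is
            # -(1 - sigmoid(margin)) * (pref - rej); step against it.
            g = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                w[i] += lr * g * (pref[i] - rej[i])
    return w

# Hypothetical features: index 0 = "helpfulness", index 1 = "rudeness".
# Human labelers preferred the helpful, polite response in each pair.
comparisons = [
    ([1.0, 0.0], [0.2, 0.8]),
    ([0.9, 0.1], [0.1, 0.9]),
    ([0.8, 0.0], [0.3, 0.7]),
]
w = train_reward_model(comparisons, dim=2)
assert reward(w, [1.0, 0.0]) > reward(w, [0.0, 1.0])  # polite beats rude
```

Step (3) then uses a reward model like this as the "thumbs up/down" signal inside an RL loop, nudging the language model toward higher-scoring outputs.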
When You'll Hear This
"The model went through RLHF to reduce harmful outputs." / "RLHF is why Claude gives nuanced answers instead of being a chaos agent."
Related Terms
Alignment
Alignment is the AI safety challenge of making sure AI does what we actually want, not just what we literally said.
Constitutional AI
Constitutional AI is Anthropic's approach to making AI behave — instead of relying on a giant team of human reviewers, the AI essentially reviews itself using a written set of principles (a "constitution").
DPO (Direct Preference Optimization)
DPO is RLHF's more efficient younger sibling — it learns directly from preference pairs, skipping the separate reward model and RL loop entirely.
Reinforcement Learning
Reinforcement Learning is how you train an AI by giving it rewards and punishments instead of labeled examples.