Benchmark
ELI5 — The Vibe Check
In AI, a benchmark is a standardized test that measures how good a model is — like the SAT for AI. MMLU tests general knowledge, HumanEval tests coding, HellaSwag tests common sense. Every new model release comes with a table showing benchmark scores, and everyone immediately argues about whether the benchmarks even measure the right things.
Real Talk
AI benchmarks are standardized evaluation suites measuring model capabilities across specific dimensions. Key benchmarks include MMLU (knowledge), HumanEval/SWE-bench (coding), GSM8K (math), GPQA (expert reasoning), HellaSwag (common sense), and ARC (grade-school science reasoning). While essential for comparing models, benchmarks face criticism for data contamination, narrow measurement, and potential gaming.
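At its core, a benchmark is just a fixed dataset plus a scoring rule. Here's a minimal sketch of that idea in Python — `toy_model` and the three-question dataset are made-up stand-ins, not any real benchmark, and real suites use far more sophisticated scoring than exact match:

```python
# Minimal sketch of benchmark-style evaluation: score a "model" against
# a fixed set of (question, answer) pairs using exact-match accuracy.
# toy_model and the dataset are hypothetical, not a real benchmark.

def toy_model(question: str) -> str:
    # Placeholder "model": canned answers keyed by question text.
    canned = {
        "What is 2 + 2?": "4",
        "Capital of France?": "Paris",
        "Largest planet?": "Saturn",  # deliberately wrong answer
    }
    return canned.get(question, "")

def evaluate(model, dataset):
    """Return exact-match accuracy of `model` over (question, answer) pairs."""
    correct = sum(model(q).strip() == a for q, a in dataset)
    return correct / len(dataset)

dataset = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
    ("Largest planet?", "Jupiter"),
]

score = evaluate(toy_model, dataset)
print(f"accuracy: {score:.2f}")  # 2 of 3 correct → 0.67
```

Because the dataset and scoring rule are fixed, any two models can be compared on the same number — which is also why leakage of benchmark questions into training data (contamination) inflates scores.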
When You'll Hear This
"The new model topped MMLU but real-world performance is what matters." / "Benchmarks are necessary but not sufficient — you need to eval on your specific use case."
Related Terms
Evaluation
Evaluation in AI is figuring out if your model actually works — not just on test data, but in the real world.
LLM (Large Language Model)
An LLM is a humongous AI that read basically the entire internet and learned to predict what words come next, really really well.
Model
A model is the trained AI — the finished product.
Training
Training is the long, expensive process where an AI learns from data.