Multi-Head Attention
ELI5 — The Vibe Check
Multi-head attention is running multiple attention mechanisms in parallel — like having several detectives investigate the same crime scene but looking for different clues. One head might focus on grammar, another on meaning, another on long-range references. Each 'head' has its own perspective, and together they build a richer understanding. It's why transformers are so powerful.
Real Talk
Multi-head attention runs self-attention multiple times in parallel with different learned linear projections of the queries, keys, and values. Each head can learn to attend to different aspects of the input (syntactic relationships, semantic similarity, positional patterns). The outputs are concatenated and projected, providing the model with diverse representational subspaces.
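The mechanics above can be sketched in a few lines of NumPy. This is a minimal illustrative toy, not any real model's implementation: the shapes, head count, and random weights are all assumptions chosen just to show the split-attend-concatenate-project flow.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); each weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project inputs, then split the feature dimension into separate heads.
    def project_and_split(w):
        return (x @ w).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = project_and_split(w_q)  # (num_heads, seq_len, d_head)
    k = project_and_split(w_k)
    v = project_and_split(w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                  # (heads, seq, d_head)

    # Concatenate the heads back together and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

# Toy usage: 4 tokens, model width 8, 2 heads (all hypothetical sizes).
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)  # (4, 8) — same shape as the input
```

Note the key design point: each head attends in its own d_model / num_heads subspace, so adding heads diversifies what the layer can attend to without increasing the total computation much.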
When You'll Hear This
"The model uses 32 attention heads per layer." / "Pruning half the attention heads barely hurt performance."
Related Terms
Attention Mechanism
The attention mechanism is how AI decides what to focus on — like when you're reading a long email and your eyes jump to the part that mentions your name.
Deep Learning
Deep Learning is Machine Learning that's been hitting the gym: the same idea, but with many layers of neural networks stacked on top of each other to learn more complex patterns.
Neural Network
A neural network is a system loosely inspired by the human brain — lots of little math nodes connected together, passing numbers to each other.
Self-Attention
Self-attention is how a model looks at a sentence and figures out which words are most important to each other.
Transformer
The Transformer is THE architecture behind all modern AI. ChatGPT, Claude, Midjourney, Whisper — all transformers under the hood. The key innovation? Self-attention: instead of processing text one word at a time, it looks at every word in relation to every other word at once.