Layer Normalization
ELI5 — The Vibe Check
Layer normalization is batch norm's sibling — but instead of normalizing across the batch, it normalizes across the features of each individual example. This makes it work great for sequences and transformers, where sequence lengths vary, batches can be tiny, and each example needs to stand on its own. It's the normalization technique of choice for every modern LLM.
Real Talk
Layer Normalization normalizes across the feature dimension of each individual sample, independent of other samples in the batch. Unlike batch normalization, it doesn't depend on batch statistics, making it suitable for variable-length sequences, small batch sizes, and autoregressive models. It's the standard normalization in Transformer architectures (pre-norm or post-norm variants).
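The per-sample computation described above can be sketched in a few lines of NumPy — a minimal illustration, not a production implementation (real frameworks fuse this and handle arbitrary normalized shapes):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics are computed over the feature dimension (last axis)
    # of each sample separately — no dependence on the rest of the batch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta  # learnable scale and shift

# Two samples, four features; note the statistics are per-row.
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With gamma=1 and beta=0, each row of `out` ends up with mean 0 and standard deviation ~1, regardless of the other rows — which is exactly why batch size and sequence length stop mattering.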
When You'll Hear This
"Transformers use layer norm, not batch norm." / "Pre-layer-norm makes training more stable than post-layer-norm."
Related Terms
Batch Normalization
Batch normalization is like hitting the reset button on each layer of a neural network so the numbers don't spiral out of control.
Deep Learning
Deep Learning is Machine Learning that's been hitting the gym.
Neural Network
A neural network is a system loosely inspired by the human brain — lots of little math nodes connected together, passing numbers to each other.
Training
Training is the long, expensive process where an AI learns from data.
Transformer
The Transformer is THE architecture behind all modern AI. ChatGPT, Claude, Midjourney, Whisper — all transformers under the hood. The key innovation? Attention — letting every token look directly at every other token.