KV Cache
ELI5 — The Vibe Check
KV cache is how LLMs remember previous tokens without recomputing them. Every time the model generates a token, it stores that token's attention keys and values at every layer. On the next step, it reuses those cached keys and values instead of recomputing them, so each new token only costs one token's worth of attention work. Without the cache, every step would have to reprocess the entire sequence so far.
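The idea above can be sketched in a toy, single-head decode loop. Everything here (vector sizes, values, the `attend` and `decode_step` helpers) is made up for illustration, not any real library's API:

```python
# Toy sketch of KV caching in one attention head, pure Python.
import math

def attend(q, keys, values):
    """Dot-product attention for one query over the cached keys/values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

k_cache, v_cache = [], []

def decode_step(k_new, v_new, q_new):
    """Append this token's key/value, then attend over the whole cache.
    Keys/values from earlier tokens are reused, never recomputed."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    return attend(q_new, k_cache, v_cache)

# Generate three tokens; each step reuses the cache built by previous steps.
for t in range(3):
    out = decode_step([1.0, float(t)], [float(t), 0.0], [0.5, 0.5])

print(len(k_cache))  # cache holds one key per token seen so far
```

Each step does O(cache length) attention work instead of re-running the whole sequence, which is exactly the savings the cache buys.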
Real Talk
The key-value (KV) cache is a core optimization in transformer inference: it stores the attention keys and values computed for previous tokens, avoiding redundant computation during autoregressive generation. It is critical to inference efficiency at long contexts, because KV cache memory grows linearly with context length and often dominates GPU memory for long sequences. Common optimizations include paged attention (vLLM), KV quantization, and sliding-window attention.
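That linear memory growth is easy to estimate back-of-envelope. The model shape below (32 layers, 32 KV heads, head dim 128, fp16) is an illustrative assumption roughly matching a 7B-class model, not a spec for any particular one:

```python
# Back-of-envelope KV cache sizing. Model dimensions are assumed, not exact.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2, batch=1):
    # 2x for keys AND values; one (head_dim)-sized entry per layer, head, token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

for ctx in (1024, 4096, 32768):
    print(f"{ctx:>6} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
# Linear in context length: 0.5 GiB at 1k, 2.0 GiB at 4k, 16.0 GiB at 32k.
```

At 32k context a single sequence already needs 16 GiB for the cache alone under these assumptions, which is why paged and quantized KV caches matter so much in practice.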
When You'll Hear This
"Running out of GPU memory? Check KV cache size first." / "Paged KV cache cut our memory footprint 4x."
Related Terms
Context Window
A context window is how much text an AI can 'see' at once — its working memory.
Inference
Inference is when the AI actually runs and generates output — as opposed to training, which is when it's learning.
Prefix Cache
Prefix cache is when an AI provider reuses computation from shared prompt prefixes.
Transformer
The Transformer is THE architecture behind all modern AI. ChatGPT, Claude, Midjourney, Whisper: all transformers under the hood. The key innovation? Attention, which lets the model weigh every token against every other token in parallel instead of reading one at a time.