Prompt Caching
ELI5 — The Vibe Check
Prompt caching is a speed and cost optimization where the AI remembers the beginning of your prompt so it doesn't have to re-process it every time. If your system prompt is 5,000 tokens and you send 100 messages, the model processes those 5,000 tokens once and reuses the cached result. It's like pre-heating the oven — the first pizza takes 20 minutes, but every pizza after that takes 5.
Real Talk
Prompt caching lets API providers store the key-value (KV) pairs computed while processing a prompt prefix. Subsequent requests that start with the exact same prefix skip that prefill computation, cutting both latency and cost (Anthropic bills cache reads at roughly 10% of the normal input-token price; OpenAI discounts cached tokens by about 50%). Both providers offer prompt caching on their APIs, and it is particularly beneficial for long system prompts and few-shot examples that repeat across requests.
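The idea can be sketched in a few lines. This is a conceptual simulation, not any provider's real implementation: the `PrefixCache` class and its names are hypothetical, and the expensive transformer prefill pass is replaced with a stand-in function so you can see the hit/miss behavior.

```python
import hashlib

class PrefixCache:
    """Toy model of prefix caching: identical prefixes pay prefill cost once."""

    def __init__(self):
        self._store = {}        # prefix hash -> precomputed "KV state"
        self.prefill_calls = 0  # how many times we paid the full cost

    def _prefill(self, text):
        # Stand-in for the transformer prefill pass that builds KV pairs.
        self.prefill_calls += 1
        return f"kv-state-for-{len(text)}-chars"

    def get_kv(self, prefix):
        key = hashlib.sha256(prefix.encode()).hexdigest()
        if key not in self._store:
            self._store[key] = self._prefill(prefix)  # cache miss: full cost
        return self._store[key]                       # cache hit: near-free

cache = PrefixCache()
system_prompt = "You are a helpful assistant. " * 200  # long shared prefix

for user_msg in ["hi", "what's 2+2?", "bye"]:
    kv = cache.get_kv(system_prompt)  # reused after the first request
    # ...decoding of user_msg against kv would happen here...

print(cache.prefill_calls)  # the prefix was processed only once
```

Note the hash on the full prefix: real providers also match on the exact byte sequence, which is why even a one-character edit near the start of a cached prompt invalidates the cache.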
When You'll Hear This
"Prompt caching cut our API costs by 60% since our system prompt is huge." / "The cached prefix makes responses start in under 200ms."
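Numbers like "60% cheaper" fall out of simple arithmetic. A back-of-envelope sketch, assuming a 90% discount on cached input tokens as described above (the function name and token counts are illustrative, not real pricing):

```python
def input_cost_savings(cached_tokens, fresh_tokens, cached_discount=0.90):
    """Fraction saved on input-token cost when cached tokens are discounted."""
    full = cached_tokens + fresh_tokens
    with_cache = cached_tokens * (1 - cached_discount) + fresh_tokens
    return 1 - with_cache / full

# A cached 5,000-token system prompt plus ~1,000 fresh tokens per request:
print(round(input_cost_savings(5000, 1000), 2))  # → 0.75
```

So the bigger the cached prefix relative to the fresh tokens, the closer your savings get to the per-token discount itself.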
Related Terms
API (Application Programming Interface)
An API is like a menu at a restaurant. The kitchen (server) can do a bunch of things, but you can only order what's on the menu.
Context Window
A context window is how much text an AI can 'see' at once — its working memory.
Inference
Inference is when the AI actually runs and generates output — as opposed to training, which is when it's learning.
System Prompt
A system prompt is the secret instruction manual you give the AI before the conversation starts. It sets the personality, rules, knowledge, and behavior.
Token
In AI-land, a token is a chunk of text — roughly 3/4 of a word.