Inference
ELI5 — The Vibe Check
Inference is when the AI actually runs and generates output — as opposed to training, which is when it's learning. When you type a prompt and hit enter, that's inference. Training is the expensive months-long process; inference is the moment-to-moment work of generating answers. Inference costs are what you're paying for on your API bill.
Real Talk
Inference is the process of running a trained model on new input to generate predictions or outputs. For LLMs, inference involves a forward pass through the network for each generated token. Inference speed and cost are determined by model size, hardware (GPU/TPU), batching, and optimization techniques like quantization and KV caching.
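The "forward pass per generated token" idea can be sketched as a toy loop. The lookup table below is a hypothetical stand-in for a real network's weights, and the cache list is a stand-in for a real KV cache; the point is the shape of the loop, not the math inside a transformer.

```python
# Toy sketch of autoregressive inference: one "forward pass" per generated
# token. NEXT_TOKEN is a fake model standing in for a real network; real
# LLM inference runs the full transformer each step, reusing cached
# key/value tensors (the KV cache) for the tokens already processed.

NEXT_TOKEN = {  # hypothetical "trained weights": last token -> next token
    "<bos>": "Hello",
    "Hello": ",",
    ",": "world",
    "world": "<eos>",
}

def generate(prompt_token: str, max_tokens: int = 10) -> list[str]:
    tokens = [prompt_token]
    kv_cache = []  # stand-in: a real KV cache stores attention keys/values
    for _ in range(max_tokens):
        # one "forward pass": compute the next token from the current state
        nxt = NEXT_TOKEN[tokens[-1]]
        kv_cache.append(tokens[-1])  # cache grows by one entry per step
        if nxt == "<eos>":
            break
        tokens.append(nxt)
    return tokens[1:]  # generated tokens, excluding the prompt

print(generate("<bos>"))  # → ['Hello', ',', 'world']
```

This is also why inference latency scales with output length: each new token requires another pass, and the KV cache is what keeps those passes from redoing all the work for earlier tokens.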
When You'll Hear This
"Inference latency is too high — users see a 5 second delay." / "We run inference on A100 GPUs."
Related Terms
GPU (Graphics Processing Unit)
A GPU was originally built for rendering graphics in games, but it turns out it's also perfect for AI.
Model
A model is the trained AI — the finished product.
Temperature
Temperature controls how creative (or chaotic) an AI's responses are. Low temperature (like 0.1) makes it boring, safe, and predictable — great for code. High temperature makes it more varied and surprising.
Token
In AI-land, a token is a chunk of text — roughly 3/4 of a word.
Training
Training is the long, expensive process where an AI learns from data.
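The Temperature entry above has a simple mechanical meaning at inference time: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. A minimal sketch, with made-up logit values for illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature before softmax:
    # temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical raw scores for three tokens
cold = softmax_with_temperature(logits, 0.1)
hot = softmax_with_temperature(logits, 2.0)
# At low temperature nearly all probability lands on the top token
# (predictable output); at high temperature it spreads out (more varied).
```

Running this, `cold` puts essentially all the probability on the first token, while `hot` leaves a real chance of sampling the others — which is exactly the boring-vs-chaotic trade-off described above.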