
Inference

Medium — good to know · AI & ML

ELI5 — The Vibe Check

Inference is when the AI actually runs and generates output — as opposed to training, which is when it's learning. When you type a prompt and hit enter, that's inference. Training is the expensive, months-long process; inference is the moment-to-moment work of generating answers. Inference costs are what your API bill pays for.

Real Talk

Inference is the process of running a trained model on new input to generate predictions or outputs. For LLMs, inference involves a forward pass through the network for each generated token. Inference speed and cost are determined by model size, hardware (GPU/TPU), batching, and optimization techniques like quantization and KV caching.
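The "one forward pass per generated token" loop can be sketched in a few lines. This is a toy illustration, not a real LLM: the `forward` function here is a stand-in bigram lookup, and all names (`forward`, `generate`, `<eos>`) are hypothetical.

```python
def forward(tokens):
    """Stand-in for a model forward pass: predict the next token
    from the context. A real LLM runs the whole network here."""
    bigrams = {"the": "cat", "cat": "sat", "sat": "down"}
    return bigrams.get(tokens[-1], "<eos>")

def generate(prompt, max_tokens=10):
    """Autoregressive decoding: one forward pass per generated token,
    appending each prediction to the context before the next pass."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        nxt = forward(tokens)
        if nxt == "<eos>":  # stop token ends generation early
            break
        tokens.append(nxt)
    return " ".join(tokens)

print(generate("the"))  # → the cat sat down
```

This per-token loop is why optimizations like KV caching matter: without a cache, each pass would redo attention computation over the entire growing context.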

When You'll Hear This

"Inference latency is too high — users see a 5 second delay." / "We run inference on A100 GPUs."

Made with passive-aggressive love by manoga.digital. Powered by Claude.