Model Serving
ELI5 — The Vibe Check
Model serving is the infrastructure that takes a trained AI model and makes it available as a fast, reliable API. Training a model is like making a great recipe — model serving is like opening a restaurant. You need to handle multiple orders (requests), keep the kitchen running smoothly (GPU memory), and not keep customers waiting (latency). It's way harder than most people think.
Real Talk
Model serving is the deployment and infrastructure layer for running ML model inference in production. It encompasses loading models onto GPU/CPU, handling request routing, batching, caching, auto-scaling, and monitoring. Tools include vLLM, TGI (Text Generation Inference), Triton Inference Server, BentoML, and cloud services like SageMaker and Vertex AI Prediction.
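Of the features listed above, dynamic batching is the one that most defines modern serving stacks like vLLM and TGI: the server briefly holds incoming requests so it can run many prompts through the model in one GPU pass. Here's a minimal, stdlib-only sketch of that idea — `BatchingServer` and `run_model_on_batch` are hypothetical names, and the "model" is just a toy function standing in for real inference:

```python
import time
from queue import Queue, Empty
from threading import Thread

# Toy stand-in for a real model; "inference" here is just uppercasing.
def run_model_on_batch(prompts):
    return [p.upper() for p in prompts]

class BatchingServer:
    """Sketch of dynamic batching: collect requests for a short window
    (or until the batch is full), then run them as one batch."""

    def __init__(self, max_batch_size=8, max_wait_s=0.01):
        self.queue = Queue()
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        Thread(target=self._loop, daemon=True).start()

    def submit(self, prompt):
        # Each request carries a one-slot queue to wait on for its result.
        result = Queue(maxsize=1)
        self.queue.put((prompt, result))
        return result.get()  # block until the batch containing it has run

    def _loop(self):
        while True:
            batch = [self.queue.get()]  # wait for at least one request
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.queue.get(timeout=remaining))
                except Empty:
                    break  # window expired with a partial batch
            prompts = [p for p, _ in batch]
            outputs = run_model_on_batch(prompts)
            for (_, result), output in zip(batch, outputs):
                result.put(output)

# Usage: concurrent callers get batched together transparently.
server = BatchingServer()
print(server.submit("hello"))  # → HELLO
```

The trade-off baked into `max_wait_s` is the core serving tension: waiting longer builds bigger batches (better GPU throughput) but adds latency to every request in the batch.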
When You'll Hear This
"We use vLLM for model serving — it handles batching and caching automatically." / "Model serving is 80% of the work in production ML."
Related Terms
API (Application Programming Interface)
An API is like a menu at a restaurant. The kitchen (server) can do a bunch of things, but you can only order what's on the menu.
Deployment
A deployment is the event of pushing your code live — it's both the action and the thing you deployed.
GPU (Graphics Processing Unit)
A GPU was originally built for rendering graphics in games, but because it does tons of simple calculations in parallel, it turns out to be perfect for AI too.
Inference
Inference is when the AI actually runs and generates output — as opposed to training, which is when it's learning.