Response Streaming
ELI5 — The Vibe Check
Response streaming is sending data to the client bit by bit instead of all at once. It's like watching a movie on Netflix (streaming) versus downloading the entire file first. Perfect for large files, real-time data, or AI chatbot responses that appear word by word.
Real Talk
Response streaming sends HTTP response data incrementally using chunked transfer encoding. Instead of buffering the entire response in memory, data is sent as it becomes available. This reduces time-to-first-byte and memory usage, and enables real-time delivery of LLM outputs, large file downloads, and server-sent events.
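The chunked transfer encoding mentioned above frames each piece of data with its size, so the server never needs to know the total length up front. A minimal sketch of the HTTP/1.1 wire format (illustrative only, not a production encoder):

```python
def encode_chunk(data: bytes) -> bytes:
    # One HTTP/1.1 chunk: size in hex, CRLF, the data, CRLF
    return f"{len(data):x}\r\n".encode() + data + b"\r\n"

def encode_chunked_body(chunks) -> bytes:
    # A chunked body is chunks back to back, ended by a zero-length chunk
    return b"".join(encode_chunk(c) for c in chunks) + b"0\r\n\r\n"

body = encode_chunked_body([b"Hello, ", b"world!"])
# Each chunk is framed independently, so it can be sent the moment it exists.
```

Because every chunk carries its own length, the server can flush data as soon as it is produced and signal the end with the zero-length terminator.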
Show Me The Code
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate():
    # llm is assumed to be a client whose .stream() yields text chunks
    for chunk in llm.stream('Tell me a story'):
        yield chunk

@app.get('/stream')
async def stream():
    return StreamingResponse(generate(), media_type='text/plain')
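To make the behavior concrete without running a server, here is a self-contained sketch of how an async generator like the one above is consumed chunk by chunk; the token list and the sleep are invented stand-ins for a model producing text:

```python
import asyncio

async def generate():
    # Stand-in for llm.stream(): yields tokens one at a time
    for token in ["Once ", "upon ", "a ", "time"]:
        await asyncio.sleep(0)  # simulate waiting on the model
        yield token

async def consume():
    # The consumer sees each chunk the moment it is yielded --
    # nothing waits for the full response to be assembled
    received = []
    async for chunk in generate():
        received.append(chunk)
    return received

tokens = asyncio.run(consume())
```

This is the same shape StreamingResponse uses internally: it iterates the generator and writes each yielded chunk to the socket as it arrives.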
When You'll Hear This
"We stream the AI response so users see tokens appearing in real-time." / "Use response streaming for that CSV export — it's too large to buffer in memory."
Related Terms
Server-Sent Events
Server-Sent Events (SSE) is like subscribing to a news feed from the server.
Streaming
Streaming is when the AI sends you its response word by word as it generates, instead of making you wait for the whole thing at once.
WebSocket
WebSocket is like upgrading a walkie-talkie from push-to-talk to a full phone call.