Prompt Compression
ELI5 — The Vibe Check
Prompt compression is shrinking a prompt so it fits more context or costs less, without losing meaning. It can be manual (rewording), automated (tools like LLMLingua), or semantic (embedding-based summarization).
Real Talk
Prompt compression is any technique that reduces prompt token count while preserving semantic content. Techniques include manual rewriting, automated tools (LLMLingua, LongLLMLingua), embedding-based retrieval (replacing long text with relevant excerpts), and model-based summarization. It is particularly valuable for cost optimization and long-context scenarios.
When You'll Hear This
"Prompt compression cut our token bill by 60%." / "LLMLingua compresses our RAG context 4x."
Related Terms
Context Compaction
Context compaction is summarizing a long AI conversation down to just the important bits so the model can keep going without hitting context limits.
Prompt Pruning
Prompt pruning is cutting unnecessary instructions out of a long prompt without hurting quality. Every word costs tokens and attention.
RAG (Retrieval Augmented Generation)
RAG is how you give an AI access to your private documents without retraining it.
Token Budget
A token budget is the cap on how many tokens a request, session, or user can consume. Like a food budget but for AI.
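A token budget can be enforced with a simple running counter. A minimal sketch, assuming a crude 4-characters-per-token estimate (real systems would use the model's actual tokenizer):

```python
# Toy token budget: cap total estimated tokens per session.
# The chars-to-tokens heuristic is an assumption for illustration only.

class TokenBudget:
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def estimate_tokens(self, text: str) -> int:
        # Rough heuristic: ~4 characters per token for English text.
        return max(1, len(text) // 4)

    def charge(self, text: str) -> bool:
        """Record usage and return True if the request fits the budget."""
        cost = self.estimate_tokens(text)
        if self.used + cost > self.limit:
            return False
        self.used += cost
        return True

budget = TokenBudget(limit=100)
print(budget.charge("a" * 200))  # 50 tokens, fits -> True
print(budget.charge("a" * 300))  # 75 more would exceed 100 -> False
```

This is where prompt compression pays off directly: shrinking each prompt stretches the same budget across more requests.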