[{"data":1,"prerenderedAt":74},["ShallowReactive",2],{"term-k\u002Fkv-cache":3,"related-k\u002Fkv-cache":59},{"id":4,"title":5,"acronym":6,"body":7,"category":40,"description":41,"difficulty":42,"extension":43,"letter":44,"meta":45,"navigation":46,"path":47,"related":48,"seo":53,"sitemap":54,"stem":57,"subcategory":6,"__hash__":58},"terms\u002Fterms\u002Fk\u002Fkv-cache.md","KV Cache",null,{"type":8,"value":9,"toc":33},"minimark",[10,15,19,23,26,30],[11,12,14],"h2",{"id":13},"eli5-the-vibe-check","ELI5 — The Vibe Check",[16,17,18],"p",{},"KV cache is how LLMs remember previous tokens without recomputing them. Every time the model generates a token, it caches that token's attention keys and values at every layer. Without KV caching, each new token would require recomputing attention over the entire sequence so far, making generation drastically slower as the context grows.",[11,20,22],{"id":21},"real-talk","Real Talk",[16,24,25],{},"The key-value (KV) cache is a core optimization in transformer inference: it stores the attention keys and values computed for previous tokens, avoiding redundant computation during autoregressive generation. It is critical to inference efficiency at long contexts, because KV cache memory grows linearly with context length and dominates GPU memory for long sequences. Common optimizations include paged attention (vLLM), KV quantization, and sliding-window attention.",[11,27,29],{"id":28},"when-youll-hear-this","When You'll Hear This",[16,31,32],{},"\"Running out of GPU memory? Check KV cache size first.\" \u002F \"Paged KV cache cut our memory footprint 4x.\"",{"title":34,"searchDepth":35,"depth":35,"links":36},"",2,[37,38,39],{"id":13,"depth":35,"text":14},{"id":21,"depth":35,"text":22},{"id":28,"depth":35,"text":29},"ai","KV cache is how LLMs remember previous tokens without recomputing them.","advanced","md","k",{},true,"\u002Fterms\u002Fk\u002Fkv-cache",[49,50,51,52],"Transformer","Inference","Context Window","Prefix Cache",{"title":5,"description":41},{"changefreq":55,"priority":56},"weekly",0.7,"terms\u002Fk\u002Fkv-cache","RMp-SFA-3KF7lV6oglw_hnSXyifnCPZ__XNFkW6270A",[60,65,68,71],{"title":51,"path":61,"acronym":6,"category":62,"difficulty":63,"description":64},"\u002Fterms\u002Fc\u002Fcontext-window","vibecoding","intermediate","A context window is how much text an AI can 'see' at once — its working memory.",{"title":50,"path":66,"acronym":6,"category":40,"difficulty":63,"description":67},"\u002Fterms\u002Fi\u002Finference","Inference is when the AI actually runs and generates output — as opposed to training, which is when it's learning.",{"title":52,"path":69,"acronym":6,"category":40,"difficulty":42,"description":70},"\u002Fterms\u002Fp\u002Fprefix-cache","Prefix cache is when an AI provider reuses computation from shared prompt prefixes.",{"title":49,"path":72,"acronym":6,"category":40,"difficulty":63,"description":73},"\u002Fterms\u002Ft\u002Ftransformer","The Transformer is THE architecture behind all modern AI. ChatGPT, Claude, Midjourney, Whisper — all transformers under the hood. The key innovation?",1776518290490]