Key-Value Cache Quantization: Reducing Memory at Long Context

Type: KNOWLEDGE

Verification: unverified - Evidence: ungraded

Quality: public

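The KV cache grows linearly with sequence length and batch size: it holds 2 × layers × kv_heads × d_head × seq_len × batch elements (the factor 2 covers keys and values; kv_heads is smaller than the attention head count under grouped-query attention), and its size in bytes is that count times the bytes per element (2 for FP16). Quantizing the cache from FP16 to INT8 or INT4 cuts its memory by roughly 2× or 4× respectively, slightly less once scales and zero-points are stored, with a reported <0.5% quality loss on benchmarks. Methods differ in quantization granularity: per-channel, per-token, or grouped. Deployed in llama.cpp and TGI. At a fixed KV cache memory budget, this allows up to about 2× (INT8) or 4× (INT4) longer context or larger batch.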
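The size formula and the effect of scale overhead can be checked with a short sketch. Everything named here is an assumption for illustration: the function names, the model shape (32 layers, 32 KV heads, d_head 128), the group size of 128, and the symmetric absmax scheme with one FP16 scale per group and no zero-point. It is not the scheme used by llama.cpp or TGI.

```python
def kv_cache_bytes(layers, kv_heads, d_head, seq_len, batch,
                   bits=16, group_size=None, scale_bits=16):
    """Approximate KV cache size in bytes.

    The factor 2 covers keys and values. If group_size is set, every group
    of group_size elements also stores one scale of scale_bits bits
    (assumed symmetric scheme, no zero-point).
    """
    elements = 2 * layers * kv_heads * d_head * seq_len * batch
    total_bits = elements * bits
    if group_size is not None:
        total_bits += (elements // group_size) * scale_bits
    return total_bits // 8


def quantize_per_token_int8(vector):
    """Symmetric absmax INT8 quantization of one token's K or V vector."""
    # Fall back to a scale of 1.0 for an all-zero vector.
    scale = max(abs(x) for x in vector) / 127.0 or 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale


def dequantize(codes, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]


# Hypothetical model shape: 32 layers, 32 KV heads, d_head 128, 4096 tokens.
shape = dict(layers=32, kv_heads=32, d_head=128, seq_len=4096, batch=1)
fp16 = kv_cache_bytes(**shape)
int8 = kv_cache_bytes(**shape, bits=8, group_size=128)
int4 = kv_cache_bytes(**shape, bits=4, group_size=128)

print(f"FP16: {fp16 / 2**30:.2f} GiB")
print(f"INT8: {int8 / 2**30:.2f} GiB ({fp16 / int8:.2f}x smaller)")
print(f"INT4: {int4 / 2**30:.2f} GiB ({fp16 / int4:.2f}x smaller)")
```

Under these assumptions the FP16 cache is 2 GiB, and the stored scales pull the savings slightly below the nominal figures: about 1.97× for INT8 and 3.88× for INT4. Smaller groups track outliers more closely but add more scale overhead.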