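KV cache size grows linearly with sequence length and batch size: (2 × layers × kv_heads × d_head × seq_len × batch × bytes_per_element) bytes, where the leading 2 counts keys and values and kv_heads is the number of key/value heads (fewer than the attention heads under grouped-query attention). Quantizing the KV cache from FP16 to INT8 or INT4 reduces memory by roughly 2× or 4× respectively, slightly less once the stored scales are counted, with a reported <0.5% quality loss on benchmarks. Methods differ in how scales are shared: per-channel, per-token, or grouped quantization. Deployed in llama.cpp, TGI. Enables up to 4× longer context or 4× larger batch at the same memory budget (INT4; about 2× for INT8).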
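The size formula and the per-token method can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation used by llama.cpp or TGI: the function names, the `bytes_per_elem` parameter, and the model shape (32 layers, 32 KV heads, d_head = 128) are hypothetical, and the size estimate ignores the extra bytes needed to store quantization scales.

```python
def kv_cache_bytes(layers, kv_heads, d_head, seq_len, batch, bytes_per_elem=2.0):
    """KV cache size in bytes; the leading 2 counts keys and values."""
    return 2 * layers * kv_heads * d_head * seq_len * batch * bytes_per_elem


def quantize_per_token_int8(token_vec):
    """Symmetric INT8 quantization of one token's vector with a single shared scale."""
    max_abs = max(abs(x) for x in token_vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid division by zero
    q = [max(-127, min(127, round(x / scale))) for x in token_vec]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [v * scale for v in q]


# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=32, kv_heads=32, d_head=128, seq_len=4096, batch=1)
fp16 = kv_cache_bytes(**shape, bytes_per_elem=2.0)
int8 = kv_cache_bytes(**shape, bytes_per_elem=1.0)
int4 = kv_cache_bytes(**shape, bytes_per_elem=0.5)
print(f"FP16: {fp16 / 2**30:.2f} GiB, INT8: {int8 / 2**30:.2f} GiB, INT4: {int4 / 2**30:.2f} GiB")
# prints "FP16: 2.00 GiB, INT8: 1.00 GiB, INT4: 0.50 GiB"

# Round trip for one token's key vector: the error per element is at most scale / 2.
vec = [0.5, -1.27, 0.0, 0.3]
q, scale = quantize_per_token_int8(vec)
restored = dequantize(q, scale)
```

Per-channel quantization would compute the scale across tokens for each channel instead, and grouped quantization would compute one scale per fixed-size block of elements. Smaller groups lower the error but raise the scale-storage overhead that this sketch leaves out.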
- kv-cache
- quantization
- int8
- memory
- long-context