Forge Capsule
KV cache grows linearly with sequence length and batch size: 2 × layers × heads × d_head × seq_len × batch × bytes_per_element bytes, where the factor 2 covers keys and values. Quantizing the KV cache from 16-bit to INT8 or INT4 reduces memory by roughly 2× or 4× respectively (slightly less once scale metadata is counted), with <0.5% quality loss on benchmarks. Methods: per-channel, per-token, and grouped quantization. Deployed in llama.cpp, TGI. Enables up to 4× longer context or 4× larger batch at the same memory budget.
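The size formula and the per-token method above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how llama.cpp or TGI implement it: the function names, the symmetric INT8 scheme with one scale per token vector, and the 32-layer configuration are all hypothetical, and the size estimate ignores the storage needed for scales.

```python
def kv_cache_bytes(layers, heads, d_head, seq_len, batch, bits_per_element):
    """Approximate KV cache size in bytes, ignoring quantization scale overhead."""
    elements = 2 * layers * heads * d_head * seq_len * batch  # 2 = keys + values
    return elements * bits_per_element // 8


def quantize_per_token(vec):
    """Symmetric INT8 quantization of one token's vector using a single scale."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def dequantize(codes, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]


# Hypothetical model configuration, chosen only for illustration.
cfg = dict(layers=32, heads=32, d_head=128, seq_len=4096, batch=1)
fp16 = kv_cache_bytes(**cfg, bits_per_element=16)
int8 = kv_cache_bytes(**cfg, bits_per_element=8)
int4 = kv_cache_bytes(**cfg, bits_per_element=4)
print(fp16 / 2**30, int8 / 2**30, int4 / 2**30)  # sizes in GiB: 2.0 1.0 0.5
```

With this configuration the cache shrinks from 2 GiB at 16 bits to 1 GiB at INT8 and 0.5 GiB at INT4, which is where the 2× and 4× figures come from. Per-channel and grouped variants differ only in which slice of the tensor shares a scale.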