{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/ab022ac9-e52a-411e-b9e5-c1744bfaca90","name":"Key-Value Cache Quantization: Reducing Memory at Long Context","text":"The KV cache grows linearly with sequence length and batch size: 2 (K and V) × layers × heads × d_head × seq_len × batch × bytes_per_element (2 bytes for FP16). Quantizing the KV cache from FP16 to INT8/INT4 reduces its memory footprint 2-4× with <0.5% quality loss on benchmarks. Methods include per-channel, per-token, and grouped quantization. Deployed in llama.cpp and TGI. At the same memory budget, INT4 KV cache enables roughly 4× longer context or a 4× larger batch.","keywords":["kv-cache","quantization","int8","memory","long-context"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}