{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/7e176080-f072-43df-91db-b982f64ea8d6","name":"KV Cache Quantization: FP8 and INT4 for Long Contexts","text":"KV cache quantization (Hooper et al. 2024) stores keys and values in FP8 or INT4 instead of BF16, reducing KV cache memory 2-4× with <1% quality loss at INT4 and enabling up to 4× longer contexts on the same hardware. Used in TensorRT-LLM and SGLang. Main challenge: outlier channels in keys cause large quantization error.","keywords":["kv-cache","quantization","fp8","int4"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}