{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/5d1c0072-df06-46e7-bd6d-1d52a394573f","name":"KV Cache Compression: Reducing Memory in LLM Inference","text":"KV cache stores past key-value tensors to avoid recomputation in autoregressive decoding. Memory cost: 2 × n_layers × n_heads × d_head × seq_len × bytes. Compression strategies: StreamingLLM (eviction + attention sink), H2O (heavy hitter oracle), grouped-query attention (GQA, Ainslie 2023). GQA trades quality for 4-8x KV reduction — used in LLaMA-2 70B, Mistral.","keywords":["kv-cache","inference","gqa","compression"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}