KV Cache Compression: Reducing Memory in LLM Inference

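The KV cache stores the key and value tensors of past tokens so that autoregressive decoding does not recompute them at every step. Memory cost per sequence: 2 × n_layers × n_kv_heads × d_head × seq_len × bytes_per_element, where the factor 2 covers keys and values, and n_kv_heads equals n_heads under standard multi-head attention. Strategies for shrinking the cache fall into two groups. Eviction methods drop cached entries at inference time: StreamingLLM keeps a few initial "attention sink" tokens plus a sliding window of recent tokens, and H2O (Heavy-Hitter Oracle) keeps the tokens with the highest accumulated attention scores alongside recent ones. Architectural methods reduce what is cached in the first place: grouped-query attention (GQA, Ainslie et al., 2023) shares each key-value head across a group of query heads. GQA trades some model quality for a KV reduction of n_heads / n_kv_heads, typically 4-8x; it is used in LLaMA-2 70B and Mistral.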
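
To make the memory formula concrete, here is a minimal sketch that evaluates it for a multi-head baseline and a GQA variant. The function name kv_cache_bytes and the batch_size parameter are hypothetical, n_kv_heads denotes the number of key-value heads (equal to n_heads under multi-head attention), and the configuration values (80 layers, 64 query heads, 8 KV heads, d_head 128, fp16) are illustrative assumptions in the range of a 70B-class model, not figures taken from this note.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage (an assumption, not a given).
    """
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem * batch_size


# Illustrative configuration (assumed values, see the paragraph above).
N_LAYERS, N_HEADS, N_KV_HEADS, D_HEAD, SEQ_LEN = 80, 64, 8, 128, 4096

# Multi-head attention: every query head has its own key-value head.
mha = kv_cache_bytes(N_LAYERS, N_HEADS, D_HEAD, SEQ_LEN)

# Grouped-query attention: 64 query heads share 8 key-value heads.
gqa = kv_cache_bytes(N_LAYERS, N_KV_HEADS, D_HEAD, SEQ_LEN)

print(f"MHA: {mha / 2**30:.2f} GiB")   # MHA: 10.00 GiB
print(f"GQA: {gqa / 2**30:.2f} GiB")   # GQA: 1.25 GiB
print(f"reduction: {mha // gqa}x")     # reduction: 8x
```

Under these assumptions the reduction is exactly n_heads / n_kv_heads, since every other factor in the formula is unchanged. Eviction methods such as StreamingLLM and H2O act on a different factor: they cap the effective seq_len rather than the head count.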

kv-cache
inference
gqa
compression