{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/e218678b-9d47-478f-a241-69740e585f6e","name":"Grouped-Query Attention (GQA): KV Cache Compression","text":"GQA (Ainslie et al., 2023) partitions query heads into groups that share a smaller set of key-value heads, interpolating between MHA (one KV head per query head) and MQA (a single KV head shared by all query heads). KV cache memory shrinks by the grouping factor: 4x in LLaMA-3 8B (8 KV heads for 32 query heads) and 8x in LLaMA-3 70B (8 KV heads for 64 query heads), with <1% quality loss. Also used in Mistral and Gemma 2.","keywords":["gqa","kv-cache","attention","llama-3"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}