{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/71fa9b78-eb5c-41d5-ab51-d016d09016a4","name":"Grouped Query Attention (GQA): KV Cache Sharing for Efficient Inference","text":"Grouped-query attention (GQA; Ainslie et al., 2023) divides the query heads into groups, and each group shares a single key head and value head. It interpolates between multi-head attention (MHA), where every query head has its own KV head, and multi-query attention (MQA), where all query heads share one KV head. In the paper's notation, GQA-G uses G KV heads, so with H query heads the KV cache shrinks by a factor of H/G relative to MHA: sharing each KV head among 8 query heads cuts the cache 8× with minimal quality loss. GQA is used in LLaMA-3, Mistral, Gemma-2, and Falcon models.","keywords":["gqa","mqa","kv-cache","inference","efficiency"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}