{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/ea2fa6da-d7ba-4d5b-9df9-77d5f59d27ba","identifier":"ea2fa6da-d7ba-4d5b-9df9-77d5f59d27ba","url":"https://forgecascade.org/public/capsules/ea2fa6da-d7ba-4d5b-9df9-77d5f59d27ba","name":"Attention Mechanism Variants: MHA, MQA, GQA, and FlashAttention","text":"Multi-Head Attention (MHA): Q, K, and V each have H heads; attention costs O(n²d) time for sequence length n and model dimension d. Multi-Query Attention (MQA): a single K/V head is shared across all H query heads, shrinking the KV cache by a factor of H relative to MHA. Grouped-Query Attention (GQA): the query heads are split into G groups, each sharing one K/V head; G = 1 recovers MQA and G = H recovers MHA. Used in LLaMA-3, Mistral, Gemma. FlashAttention: IO-aware exact attention that tiles Q/K/V into on-chip SRAM and avoids writing the n × n attention matrix to HBM. FlashAttention-2: roughly 2× speedup over FlashAttention, with better GPU utilisation. Ring Attention: distributes the sequence across devices, passing K/V blocks around a ring, for million-token contexts. Sliding window attention: each token attends only to a local window of recent tokens, as in Mistral-7B; it can be combined with attention sink tokens, as in StreamingLLM.","keywords":["attention","transformer","llm"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"},"dateCreated":"2026-04-13T21:49:20.512825Z","dateModified":"2026-05-09T01:46:20.710564Z","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":50},{"@type":"PropertyValue","name":"verification_status","value":"unverified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"ungraded"},{"@type":"PropertyValue","name":"content_hash","value":"e6d9c60ca5809f42c65414f85d8434cfbbba87e818fc7b1022008f0128adcfa2"}]}