Forge Capsule

Attention Mechanism Variants: MHA, MQA, GQA, and FlashAttention

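Multi-Head Attention (MHA): Q, K, and V each have H heads; compute is O(n²d) for sequence length n and model dimension d. Multi-Query Attention (MQA): a single K/V head shared across all H query heads, which shrinks the KV cache by a factor of H relative to MHA. Grouped-Query Attention (GQA): G K/V heads, each shared by H/G query heads; G = H recovers MHA and G = 1 recovers MQA, so G sets the trade-off between quality and cache size. Used in LLaMA-3, Mistral, Gemma 2. FlashAttention: exact, IO-aware algorithm that tiles Q/K/V into on-chip SRAM and never materialises the full n×n score matrix in HBM, cutting HBM round-trips without changing the result. FlashAttention-2: roughly 2× faster than FlashAttention, through better work partitioning and higher GPU utilisation. Ring Attention: shards the sequence across devices and passes K/V blocks around a ring, targeting million-token contexts. Sliding window attention: each token attends only to a fixed local window of preceding tokens (Mistral-7B). Sink tokens are a separate technique from StreamingLLM, which keeps the first few tokens in the cache alongside the window.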
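The relationship between MHA, MQA, and GQA comes down to how many K/V heads exist and which one each query head reads. The following is a minimal pure-Python sketch of that mapping, the resulting KV cache size, and an unoptimised, non-causal GQA forward pass. All function names are hypothetical, and the layer, head, and sequence sizes in the usage note are illustrative values, not the configuration of any particular model.

```python
import math


def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Return the index of the K/V head that query head `q_head` reads."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # query heads per K/V head
    return q_head // group_size


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """KV cache size in bytes; the leading 2 counts K and V separately.

    bytes_per_value=2 assumes 16-bit storage (an assumption of this sketch).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_query_attention(q, k, v):
    """Non-causal GQA on nested lists.

    q has shape [H][n][d]; k and v have shape [G][n][d].
    G == H behaves as MHA, G == 1 behaves as MQA.
    Returns a nested list of shape [H][n][d].
    """
    num_q_heads, num_kv_heads = len(q), len(k)
    d = len(q[0][0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for h in range(num_q_heads):
        g = kv_head_for_query_head(h, num_q_heads, num_kv_heads)
        head_out = []
        for q_vec in q[h]:
            # Scaled dot-product scores against every key in the shared head.
            scores = [scale * sum(a * b for a, b in zip(q_vec, k_vec))
                      for k_vec in k[g]]
            weights = softmax(scores)
            # Weighted sum of the value vectors of the same shared head.
            head_out.append([
                sum(w * v_vec[j] for w, v_vec in zip(weights, v[g]))
                for j in range(d)
            ])
        out.append(head_out)
    return out
```

With 32 layers, 32 query heads, head dimension 128, an 8192-token context, and 16-bit values, kv_cache_bytes gives 4 GiB for MHA (32 K/V heads), 1 GiB for GQA with 8 K/V heads, and 128 MiB for MQA (1 K/V head). The cache scales linearly with the number of K/V heads, which is why reducing G is the lever these variants pull.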
