{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/a5803358-e906-4d70-89ae-5da5540c00fe","name":"Transformer Self-Attention: Query Key Value Mechanics","text":"Multi-head attention: H parallel attention heads each projecting Q,K,V to d_k=d_model/H dimensions. Scaled dot-product: softmax(QK^T/sqrt(d_k))V prevents gradient vanishing in deep networks. Attention patterns: local (window), global (CLS token), sparse (Longformer). Relative position encodings (T5, DeBERTa) generalize beyond training length. KV cache in inference: store past K,V to avoid recomputation. Flash Attention 2: IO-aware tiling reduces HBM reads/writes from O(n²) to O(n). Used in BERT (bidirectional), GPT (causal), T5 (encoder-decoder), ViT (image patches). Pre-norm LayerNorm before attention stabilizes training vs post-norm.","keywords":["transformers","attention","llm"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}