Transformer Self-Attention: Query, Key, Value Mechanics

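Multi-head attention: H parallel attention heads, each projecting Q, K, V to d_k = d_model/H dimensions; head outputs are concatenated and passed through an output projection. Scaled dot-product: softmax(QK^T/sqrt(d_k))V, where dividing by sqrt(d_k) keeps the logits from growing with d_k, which would otherwise saturate the softmax and shrink its gradients. Attention patterns: local (sliding window), global (e.g. a CLS token attending to all positions), sparse (Longformer, which combines windowed and global attention). Relative position encodings (T5, DeBERTa) tend to generalize to lengths beyond those seen in training better than absolute position embeddings. KV cache in inference: store the K, V of past tokens so each decoding step computes projections only for the new token. Flash Attention 2: IO-aware tiling computes exact attention without materializing the n×n score matrix in HBM, cutting extra memory from O(n²) to O(n) in sequence length and substantially reducing HBM reads/writes. Used in...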
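A minimal NumPy sketch of the multi-head, scaled dot-product, and KV cache points above. It is illustrative only: the function names (split_heads, attention, mha_full, mha_cached) and the weight matrices Wq, Wk, Wv, Wo are assumptions made for this example, not taken from any library, and it does not implement Flash Attention tiling.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(x, num_heads):
    # (n, d_model) -> (H, n, d_k) with d_k = d_model / H
    n, d_model = x.shape
    d_k = d_model // num_heads
    return x.reshape(n, num_heads, d_k).transpose(1, 0, 2)

def merge_heads(x):
    # (H, n, d_k) -> (n, H * d_k): concatenate the head outputs
    h, n, d_k = x.shape
    return x.transpose(1, 0, 2).reshape(n, h * d_k)

def attention(q, k, v, causal=False):
    # softmax(Q K^T / sqrt(d_k)) V, applied independently per head.
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (H, n_q, n_k)
    if causal:
        n_q, n_k = scores.shape[-2:]
        # Query i sits at absolute position n_k - n_q + i; mask later keys.
        mask = np.triu(np.ones((n_q, n_k), dtype=bool), k=n_k - n_q + 1)
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v

def mha_full(x, Wq, Wk, Wv, Wo, num_heads):
    # Causal multi-head attention over the whole sequence at once.
    q, k, v = (split_heads(x @ W, num_heads) for W in (Wq, Wk, Wv))
    return merge_heads(attention(q, k, v, causal=True)) @ Wo

def mha_cached(x, Wq, Wk, Wv, Wo, num_heads):
    # Token-by-token decoding: K and V of past tokens are stored and reused,
    # so each step projects only the newest token.
    k_cache, v_cache, outs = [], [], []
    for t in range(x.shape[0]):
        xt = x[t:t + 1]                              # (1, d_model)
        q = split_heads(xt @ Wq, num_heads)          # (H, 1, d_k)
        k_cache.append(split_heads(xt @ Wk, num_heads))
        v_cache.append(split_heads(xt @ Wv, num_heads))
        k = np.concatenate(k_cache, axis=1)          # (H, t + 1, d_k)
        v = np.concatenate(v_cache, axis=1)
        # No mask needed: the cache holds only current and past positions.
        outs.append(merge_heads(attention(q, k, v)) @ Wo)
    return np.concatenate(outs, axis=0)
```

Under these assumptions, mha_cached and mha_full should agree up to floating-point error on the same input, which is the property that makes the cache safe to use for autoregressive decoding.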

transformers
attention
llm