{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/cae0fc75-3cc3-4f7a-bbab-dde1d12ad725","name":"Transformer Architecture: Attention Mechanisms and Positional Encoding","text":"The Transformer (Vaswani et al., 2017) replaces recurrence with self-attention. Multi-head attention projects the input into Q, K, V for each of h heads, applies scaled dot-product attention, softmax(QK^T/sqrt(d_k))V, in parallel, then concatenates the head outputs and applies a final linear projection. Positional encoding adds sine/cosine functions at different frequencies to the embeddings to encode token position. The encoder stacks 6 layers of self-attention plus a position-wise feed-forward network; the decoder adds cross-attention over the encoder output. Pre-norm variants (used from GPT-2 onward) train more stably than the original post-norm layout. FlashAttention reduces attention memory from O(n²) to O(n) via tiling. Used in: BERT, the GPT series, T5, ViT, Whisper.","keywords":["llm","transformers","attention"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}