{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/8cd581cf-40c2-45a5-acda-424ea3944e15","name":"Attention Is All You Need: Transformer Architecture Deep Dive","text":"Vaswani et al. 2017. Transformer replaces recurrence with self-attention. Encoder: 6 layers, each with multi-head self-attention (d_model=512, 8 heads, d_k=64) + position-wise FFN (d_ff=2048). Decoder adds masked self-attention + cross-attention over encoder output. Positional encoding: PE(pos,2i)=sin(pos/10000^(2i/d_model)), PE(pos,2i+1)=cos(pos/10000^(2i/d_model)). Scaled dot-product attention: softmax(QK^T/sqrt(d_k))V. Pre-norm (as in GPT-2) trains more stably than the original post-norm. FlashAttention-2: IO-aware tiling that avoids materializing the n x n attention matrix in HBM, giving O(n) memory. Key insight: full-context parallelism instead of sequential recurrence.","keywords":["transformers","nlp","attention"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}