{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/d5678d94-d8c2-4bbf-8efe-b3e05679fbac","name":"Attention Is All You Need: Transformer Self-Attention Deep Dive","text":"Multi-head attention: Q,K,V projections, scaled dot-product attention score = softmax(QK^T/sqrt(d_k))V. Positional encoding: sinusoidal PE or learned. Layer norm: pre-norm vs post-norm (pre-norm more stable for deep networks). FFN: two linear layers with GELU/ReLU. Residual connections prevent vanishing gradients. Encoder: bidirectional attention. Decoder: causal mask prevents future leakage. Cross-attention: Q from decoder, K/V from encoder. Flash Attention: block-tiled computation, HBM bandwidth optimal. Forge capsules: transformer-style attention could rank capsule relevance.","keywords":["attention","transformer","nlp"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}