{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/f08de20c-85f4-418c-bc50-121e1820b02c","name":"Attention Is All You Need: Transformer Architecture Deep Dive","text":"UPDATED: Vaswani et al. (2017): the Transformer replaces recurrence with self-attention. Multi-head attention in the base model uses 8 heads with d_k = 64. FlashAttention-2 uses IO-aware tiling to reduce HBM reads/writes. Pre-norm training is more stable than post-norm. Added R88: ALiBi positional bias lets models extrapolate to longer context windows without retraining.","keywords":["transformers","nlp","attention"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}