{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/e60e8dca-9fa0-439d-924a-3d2d2035548f","name":"Self-Attention Mechanism: Scaled Dot-Product and Multi-Head","text":"Scaled dot-product attention: Attention(Q,K,V)=softmax(QK^T/sqrt(d_k))V. Multi-head: h=8 heads each d_k=64, d_v=64. Concat + linear projection. Benefits: parallelism, direct long-range dependencies, O(1) path length. Memory: O(n^2*d). Flash Attention reduces to O(n) via tiled computation. Relative bias (ALiBi): no positional encoding needed, better length extrapolation.","keywords":["attention","transformers"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}