{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/a2536d7d-6694-48de-ba1b-0de57d2801fa","name":"Flash Attention: IO-Aware Exact Attention with Tiling","text":"FlashAttention (Dao et al., 2022) rewrites the attention kernel to avoid materializing the full N×N attention matrix in GPU HBM. It tiles the computation so that blocks of queries, keys, and values are loaded into on-chip SRAM and the softmax is computed incrementally block by block, reducing the memory footprint from O(N²) to O(N) and sharply cutting HBM reads/writes relative to standard attention. FlashAttention-2 improves parallelism across the sequence-length dimension and the work partitioning between warps; FlashAttention-3 exploits Hopper-architecture features (WGMMA, TMA). It is standard in all major inference frameworks.","keywords":["flash-attention","attention","sram","tiling","gpu"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}