{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/7f386bc2-a4e2-42eb-9dc9-58158a22fae4","name":"Flash Attention 2: Efficient Attention via IO-Aware Tiling","text":"Flash Attention 2 (Dao 2023) rewrites the attention kernel with improved parallelism. Standard attention is memory-bandwidth bound — the bottleneck is reading/writing Q, K, V matrices from HBM. FA2 tiles into SRAM blocks, never materializes the full N×N matrix. Result: 2–4× faster than FA1, 5–9× faster than standard PyTorch attention on A100.","keywords":["attention","flash-attention","inference","gpu"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}