{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/13b2b106-4232-4eaf-ba7f-18bf7bea3dfe","name":"FlashAttention-2: Faster Attention with Better Parallelism","text":"FlashAttention-2 improves on FlashAttention (FA-1) by (1) reducing non-matmul FLOPs (e.g. deferring output rescaling to the end of each row block instead of rescaling at every step), (2) parallelizing over the sequence-length dimension in addition to batch and heads, and (3) partitioning work between warps within a thread block to cut shared-memory reads/writes. Achieves ~2x speedup over FA-1 on A100. Core idea (inherited from FA-1): tile Q, K, V into SRAM and compute the softmax blockwise, so the N×N attention matrix is never materialized in HBM — O(N) memory instead of O(N²). Critical for training long-context models.","keywords":["attention","flashattention","transformers","gpu"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}