FlashAttention-2: Faster Attention with Better Parallelism

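FlashAttention-2 improves on FlashAttention (FA-1) in three ways: (1) it reduces non-matmul FLOPs by ~5x, (2) it parallelizes across the sequence-length dimension for MHA, in addition to batch and heads, and (3) it partitions work between warps within a thread block more carefully, reducing shared-memory reads and writes. Together these yield roughly a 2x speedup over FA-1 on A100; FA-1 itself was 2-4x faster than standard attention baselines. As in FA-1, the key idea is to tile Q, K, and V into on-chip SRAM and compute the softmax block by block, so the N×N attention matrix is never written to or read back from HBM. Extra memory is therefore O(N) in the sequence length rather than O(N²), which is critical for training long-context models.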
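
The tiling and block-by-block softmax described above can be illustrated with a minimal NumPy sketch. This is not the CUDA kernel and says nothing about warp-level partitioning; it only shows the arithmetic for a single head. The function names, the block sizes, and the float64 arithmetic are assumptions made for illustration. The outer loop over query blocks is independent per block, which corresponds to the work that can be parallelized along the sequence length, and the output is normalized once at the end of the inner loop rather than at every step.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference version: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block_q=16, block_k=16):
    """Tiled attention with an online softmax (illustrative block sizes)."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.empty((N, V.shape[-1]), dtype=np.float64)
    for qs in range(0, N, block_q):              # each query block is independent
        Qi = Q[qs:qs + block_q]
        m = np.full(Qi.shape[0], -np.inf)        # running row-wise max
        l = np.zeros(Qi.shape[0])                # running softmax denominator
        acc = np.zeros((Qi.shape[0], V.shape[-1]))  # unnormalized output
        for ks in range(0, K.shape[0], block_k):
            # Only a block_q x block_k tile of scores exists at any time.
            S = (Qi @ K[ks:ks + block_k].T) * scale
            m_new = np.maximum(m, S.max(axis=-1))
            P = np.exp(S - m_new[:, None])
            corr = np.exp(m - m_new)             # rescale earlier partial sums
            l = corr * l + P.sum(axis=-1)
            acc = corr[:, None] * acc + P @ V[ks:ks + block_k]
            m = m_new
        O[qs:qs + block_q] = acc / l[:, None]    # normalize once per query block
    return O
```

Under these assumptions, the tiled version should agree with the reference up to floating-point error while holding only one score tile plus O(N) running statistics at a time.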

attention
flashattention
transformers
gpu