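FlashAttention-2 improves on FlashAttention (FA-1) by (1) reducing non-matmul FLOPs by ~5x, (2) parallelizing across the sequence-length dimension in addition to batch and heads, and (3) partitioning work between warps more effectively to reduce shared-memory reads and writes. It achieves roughly 2x speedup over FA-1 on A100. Key idea, carried over from FA-1: Q, K, and V are tiled into on-chip SRAM and the softmax is computed block by block, so the full N×N attention matrix is never written to or read from HBM. Memory is O(N) in sequence length versus O(N²) for standard attention. Critical for training long-context models.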
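The tiling idea can be sketched in plain Python. This is a minimal CPU illustration of block-wise attention with an online softmax, not the actual CUDA kernel. The names `naive_attention`, `tiled_attention`, and `block` are made up for this sketch. The output is normalized once per row at the end of the loop rather than after every block, which is one way to cut non-matmul work. Each query row is processed independently, which is what allows parallelism along the sequence dimension.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(Q, K, V):
    """Reference version: builds a full row of N scores per query (O(N^2) overall)."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * _dot(q, k) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[c] for w, v in zip(weights, V)) / total
            for c in range(len(V[0]))
        ])
    return out


def tiled_attention(Q, K, V, block=2):
    """Block-wise attention with an online softmax (illustrative sketch).

    Only `block` scores exist at any time for a given query row, so the
    extra memory per row does not grow with the number of keys.
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:  # rows are independent: parallelizable along sequence length
        m = -math.inf      # running maximum of the scores seen so far
        l = 0.0            # running sum of exp(score - m)
        acc = [0.0] * dv   # unnormalized output accumulator
        for start in range(0, len(K), block):
            Kb = K[start:start + block]
            Vb = V[start:start + block]
            s = [scale * _dot(q, k) for k in Kb]
            m_new = max(m, max(s))
            corr = math.exp(m - m_new)  # rescales earlier partial results
            p = [math.exp(x - m_new) for x in s]
            l = l * corr + sum(p)
            acc = [
                a * corr + sum(pj * vj[c] for pj, vj in zip(p, Vb))
                for c, a in enumerate(acc)
            ]
            m = m_new
        out.append([a / l for a in acc])  # single normalization at the end
    return out
```

For any `block` size, `tiled_attention` should match `naive_attention` up to floating-point rounding. In a real kernel the block size would be chosen so that the tiles fit in SRAM.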
- attention
- flashattention
- transformers
- gpu