FlashAttention-3 (2024) targets NVIDIA H100 (Hopper) GPUs, combining asynchronous pipelining, warp specialization, and low-precision FP8. It achieves a 1.5–2× speedup over FlashAttention-2 (FA2). The key idea is to overlap GEMM and softmax work through producer-consumer warp groups. Reported throughput: 740 TFLOPS on H100 vs. 560 for FA2.
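The producer-consumer split can be illustrated with a small CPU analogy. The sketch below is not the FlashAttention-3 kernel, which is written in CUDA for Hopper hardware. It only assumes that a producer thread stands in for the data-moving warp group and that the main thread stands in for the consumer that runs the two GEMMs and the online softmax over each K/V tile. The names `tiled_attention`, `reference_attention`, `tile`, and `depth` are hypothetical, chosen for illustration.

```python
import queue
import threading

import numpy as np


def reference_attention(q, k, v):
    """Plain softmax(QK^T / sqrt(d)) V, used only to check the tiled version."""
    s = (q @ k.T) / np.sqrt(q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v


def tiled_attention(q, k, v, tile=64, depth=2):
    """Single-head attention over K/V tiles fed by a producer thread."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    # Bounded buffer: the producer may run at most `depth` tiles ahead.
    tiles = queue.Queue(maxsize=depth)

    def producer():
        # Analogy for the producer warp group: it only moves data.
        for start in range(0, k.shape[0], tile):
            tiles.put((k[start:start + tile], v[start:start + tile]))
        tiles.put(None)  # sentinel: no more tiles

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    m = np.full(n, -np.inf)           # running row-wise max of the scores
    l = np.zeros(n)                   # running softmax denominator
    acc = np.zeros((n, v.shape[1]))   # unnormalized output accumulator

    # Analogy for the consumer warp group: GEMM, softmax update, GEMM.
    while True:
        item = tiles.get()
        if item is None:
            break
        k_t, v_t = item
        s = (q @ k_t.T) * scale                    # first GEMM: scores for this tile
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])             # softmax numerator for this tile
        correction = np.exp(m - m_new)             # rescales earlier tiles (0 on the first tile)
        l = l * correction + p.sum(axis=1)
        acc = acc * correction[:, None] + p @ v_t  # second GEMM
        m = m_new

    worker.join()
    return acc / l[:, None]
```

Calling `tiled_attention(q, k, v)` on 2-D NumPy arrays should agree with `reference_attention(q, k, v)` up to floating-point error. Python threads do not reproduce the hardware-level overlap of tensor-core GEMMs with softmax; the sketch only shows the division of labor and the bounded buffer between the two roles.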
- flash-attention
- h100
- gpu
- attention