FlashAttention-3: Hardware-Aware Triangular Attention

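FlashAttention-3 (2024) targets NVIDIA H100 (Hopper) GPUs using three techniques: asynchronous execution of Tensor Core matmuls and TMA data movement, warp specialization, and FP8 low precision. It achieves a 1.5–2× speedup over FlashAttention-2 (FA2). Key idea: producer warps issue loads while consumer warpgroups compute, and the softmax of one block is overlapped with the GEMMs of another. In FP16 it reaches up to 740 TFLOPS on H100 (about 75% utilization), where FA2 reaches only about 35% utilization.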
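The producer-consumer idea can be sketched on the CPU as a rough analogy. The code below is not the CUDA kernel: it uses a Python thread as a stand-in for a producer warp and a bounded queue as a stand-in for a multi-stage shared-memory buffer. The function names, the `tile` and `stages` parameters, and the single-query setup are all assumptions made for illustration. The consumer uses the running-max rescaling ("online softmax") common to the FlashAttention family, which is how tiles can be processed one at a time. The sketch shows only the overlap of loading and computing, not the finer-grained overlap of GEMM and softmax between warpgroups.

```python
import math
import queue
import threading


def reference_attention(q, K, V):
    """Plain softmax(q . K^T / sqrt(d)) V for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]


def pipelined_attention(q, K, V, tile=2, stages=2):
    """Hypothetical sketch: a producer thread streams K/V tiles through a
    bounded buffer while the consumer folds them into a running softmax."""
    buf = queue.Queue(maxsize=stages)  # stand-in for a multi-stage buffer

    def producer():
        # Analogy for the producer warp: only moves data, never computes.
        for start in range(0, len(K), tile):
            buf.put((K[start:start + tile], V[start:start + tile]))
        buf.put(None)  # sentinel: no more tiles

    loader = threading.Thread(target=producer)
    loader.start()

    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf       # largest score seen so far
    running_sum = 0.0             # softmax denominator, relative to running_max
    acc = [0.0] * len(V[0])       # unnormalized output accumulator

    while True:
        item = buf.get()
        if item is None:
            break
        k_tile, v_tile = item
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new max (0.0 on the first tile).
        correction = math.exp(running_max - new_max)
        p = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(p)
        acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, v_tile))
               for j, a in enumerate(acc)]
        running_max = new_max

    loader.join()
    return [a / running_sum for a in acc]


# Small deterministic inputs; 7 keys with tile=2 exercises a ragged last tile.
q = [0.1, -0.2, 0.3, 0.4]
K = [[math.sin(4 * i + j) for j in range(4)] for i in range(7)]
V = [[math.cos(3 * i + j) for j in range(3)] for i in range(7)]

out = pipelined_attention(q, K, V)
ref = reference_attention(q, K, V)
max_abs_diff = max(abs(a - b) for a, b in zip(out, ref))
```

The bounded queue is the design point here: with `stages=2` the producer can run at most two tiles ahead, which mirrors how a fixed number of buffer stages limits how far loads can get ahead of compute.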

flash-attention
h100
gpu
attention