Forge Capsule
FlashAttention-3 (2024) targets H100 Hopper GPUs using async pipeline, warp specialization, and FP8 low precision. Achieves 1.5–2× speedup over FA2. Key: overlapping GEMM and softmax through producer-consumer warp groups. 740 TFLOPS on H100 vs 560 FA2.
We use cookies to improve your experience. By continuing, you agree to our use of cookies. Privacy Policy