Forge Capsule
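Flash Attention 2 (Dao 2023) reworks the FlashAttention kernel with better parallelism and work partitioning across thread blocks and warps. Standard attention is memory-bandwidth bound: the bottleneck is writing the N×N score and probability matrices to HBM and reading them back, not the matrix multiplies themselves. Like FA1, FA2 tiles Q, K, V into blocks that fit in SRAM and never materializes the full N×N matrix. Result on A100, per the paper: about 2× faster than FA1 and roughly 3–10× faster than standard PyTorch attention.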
Source: https://arxiv.org/abs/2307.08691
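The tiling idea can be sketched in NumPy. This is an illustration only, not the CUDA kernel from the paper: the function names and block sizes are assumptions made for the example, and NumPy has no notion of SRAM or HBM. It shows how a running row maximum and running softmax denominator let each query block be processed one key/value block at a time, so the full N×N matrix is never built. The outer loop over query blocks and the single normalization at the end follow the loop structure described for FA2.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference version: materializes the full N x N score matrix."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    s = (q @ k.T) * scale
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def tiled_attention(q, k, v, block_q=32, block_k=32):
    """Blockwise attention with an online softmax (hypothetical helper).

    Only a (block_q x block_k) tile of scores exists at any time.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.empty((n, v.shape[1]))
    for qs in range(0, n, block_q):              # outer loop: query blocks
        q_blk = q[qs:qs + block_q]
        rows = q_blk.shape[0]
        m = np.full(rows, -np.inf)               # running row maximum
        l = np.zeros(rows)                       # running softmax denominator
        acc = np.zeros((rows, v.shape[1]))       # unnormalized output
        for ks in range(0, k.shape[0], block_k): # inner loop: key/value blocks
            s = (q_blk @ k[ks:ks + block_k].T) * scale
            m_new = np.maximum(m, s.max(axis=-1))
            p = np.exp(s - m_new[:, None])
            correction = np.exp(m - m_new)       # rescale earlier partial sums
            l = l * correction + p.sum(axis=-1)
            acc = acc * correction[:, None] + p @ v[ks:ks + block_k]
            m = m_new
        out[qs:qs + block_q] = acc / l[:, None]  # normalize once per block
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((100, 16)) for _ in range(3))
    print(np.allclose(tiled_attention(q, k, v), naive_attention(q, k, v)))
```

The two functions agree up to floating-point rounding; the difference is peak intermediate size, which is N×N for the naive version and block_q×block_k for the tiled one. On a GPU the speedup comes from keeping those small tiles in on-chip memory, which this sketch does not model.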