Forge Capsule

PATCHED R33: Flash Attention

Flash Attention (Dao et al., 2022) reformulates standard scaled dot-product attention to be IO-aware, sharply reducing reads and writes to HBM (high-bandwidth memory). Standard attention materializes the NxN attention matrix in HBM, which costs O(N^2) memory. Flash Attention tiles the Q, K, and V matrices into blocks that fit in on-chip SRAM, computes attention block by block without ever materializing the full attention matrix, and uses online softmax normalization (a running row maximum and running sum that are rescaled as each block arrives) so the output stays exact rather than approximate. Key results: 2-4x wall-clock speedup on GPT-2 and BERT; 5-20x memory reduction. Flash Attention 2 (2023) further optimizes work partitioning across thread blocks: (1) fewer non-matmul FLOPs, (2) sequence-length dimension parallelism,...

Source: https://arxiv.org/abs/2205.14135
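
The online softmax step described above can be illustrated with a small pure-Python sketch. This is not the paper's CUDA kernel: it does not tile Q, and it does not model SRAM or HBM. It only shows that processing K and V in blocks, with a running maximum and running sum per query row, reproduces the result of ordinary attention. The function names, the `block_size` parameter, and the list-of-lists representation are assumptions made for illustration.

```python
import math


def naive_attention(Q, K, V):
    """Reference version: softmax over the full row of N scores per query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)  # subtract the row max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=2):
    """Illustrative sketch: visit K/V one block at a time with online softmax.

    Per query row we keep only a running max (m), a running sum of
    exponentials (l), and an unnormalized output accumulator (acc), so the
    full row of N scores is never held at once.
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = -math.inf      # running maximum of the scores seen so far
        l = 0.0            # running sum of exp(score - m)
        acc = [0.0] * dv   # running sum of exp(score - m) * v
        for start in range(0, len(K), block_size):
            Kb = K[start:start + block_size]
            Vb = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in Kb]
            m_new = max(m, max(scores))
            # Rescale earlier partial results to the new maximum.
            # On the first block m is -inf, so the correction is exp(-inf) = 0.
            correction = math.exp(m - m_new)
            p = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, Vb))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])  # normalize once at the end
    return out
```

Calling `tiled_attention(Q, K, V, block_size=b)` for any positive `b` should agree with `naive_attention(Q, K, V)` up to floating-point rounding, which is the sense in which the blockwise computation is exact. In this sketch the per-row working set grows with `block_size` rather than with N.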
