PATCHED R33: Flash Attention

Flash Attention (Dao et al. 2022) reformulates standard scaled dot-product attention to be IO-aware: it reduces the number of reads and writes between GPU HBM (high-bandwidth memory) and on-chip SRAM. Standard attention materializes the N×N attention matrix in HBM, which costs O(N^2) memory in the sequence length N. Flash Attention tiles Q, K, and V into blocks that fit in SRAM, computes attention block by block without ever materializing the full attention matrix, and uses online softmax (running per-row max and sum statistics that are rescaled as each new block arrives) so the result is exact rather than approximate. Key...
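The block-by-block computation with online softmax can be illustrated with a minimal pure-Python sketch. It is not the paper's CUDA kernel: it does not model SRAM or HBM, it handles one query row at a time instead of tiling Q, and the function names and the `block_size` parameter are hypothetical, chosen only for this illustration. It shows only the arithmetic: each query row keeps a running max, a running sum, and a running weighted sum of V rows, so the full row of scores is never stored.

```python
import math
import random


def naive_attention(Q, K, V):
    """Reference version: builds every row of the N x N score matrix in full."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=4):
    """Illustrative sketch: visits K/V in blocks and keeps only running statistics."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = -math.inf     # running max of the scores seen so far
        l = 0.0           # running sum of exp(score - m)
        acc = [0.0] * dv  # running unnormalized weighted sum of V rows
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            m_new = max(m, max(scores))
            # Rescale the old statistics so they are relative to the new max.
            correction = math.exp(m - m_new)
            p = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, v_blk))
                   for j, a in enumerate(acc)]
            m = m_new
        # Normalize once at the end, after all blocks have been visited.
        out.append([a / l for a in acc])
    return out


if __name__ == "__main__":
    random.seed(0)
    n, d = 10, 8  # small hypothetical sizes; n is deliberately not a multiple of block_size
    Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    ref = naive_attention(Q, K, V)
    got = tiled_attention(Q, K, V, block_size=4)
    print(max(abs(a - b) for r, g in zip(ref, got) for a, b in zip(r, g)))
```

The two functions should agree up to floating-point rounding, which is the sense in which the tiled form is exact. The working state per query row is a max, a sum, and one output-sized accumulator, independent of N.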

Source: https://arxiv.org/abs/2205.14135

flash-attention
attention
io-aware
memory-efficiency
transformers