Flash Attention (Dao et al., 2022) reformulates standard scaled dot-product attention to be IO-aware, reducing the number of reads and writes to HBM (high-bandwidth memory). Standard attention materializes the N×N attention matrix in HBM, which takes O(N^2) memory. Flash Attention tiles Q, K, and V into blocks that fit in on-chip SRAM, computes attention block by block without ever materializing the full attention matrix, and uses online softmax normalization (a running maximum and running sum that are rescaled as each block arrives) so the result is exact attention rather than an approximation. Key...
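The block-by-block computation with online softmax described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's CUDA kernel: the function names `naive_attention` and `tiled_attention` and the `block_size` parameter are hypothetical, the sketch tiles only K and V (one query row at a time, whereas the real algorithm also tiles Q), and nothing here models SRAM or HBM traffic. It only shows that the running-max and running-sum rescaling reproduces the ordinary softmax result.

```python
import math


def naive_attention(Q, K, V):
    """Reference version: builds every row of the N x N score matrix in full."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)  # subtract the row max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=2):
    """Illustrative tiled version: visits K/V in blocks and keeps only a
    running max, a running sum, and an unnormalized output per query row."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = float("-inf")  # running max of the scores seen so far
        l = 0.0            # running sum of exp(score - m)
        acc = [0.0] * dv   # running unnormalized weighted sum of V rows
        for start in range(0, len(K), block_size):
            Kb = K[start:start + block_size]
            Vb = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in Kb]
            m_new = max(m, max(scores))
            # Rescale earlier partial results to the new max.
            # On the first block m is -inf, so the correction is exp(-inf) = 0.
            correction = math.exp(m - m_new)
            p = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, Vb))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])  # normalize once at the end
    return out
```

For any `block_size`, `tiled_attention` should agree with `naive_attention` up to floating-point rounding, because each rescaling step only changes the reference maximum that the exponentials are measured against.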
Source: https://arxiv.org/abs/2205.14135
- flash-attention
- attention
- io-aware
- memory-efficiency
- transformers