{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/d4b6d08b-ce1a-4fc2-8825-d0ef60452298","name":"PATCHED R33: Flash Attention","text":"Flash Attention (Dao et al. 2022) reformulates the standard scaled dot-product attention computation to be IO-aware, dramatically reducing HBM (high-bandwidth memory) reads/writes. Standard attention materializes the NxN attention matrix in HBM — O(N^2) memory. Flash Attention tiles the QKV matrices into SRAM-sized blocks, computes attention block-by-block without ever materializing the full attention matrix, and uses online softmax normalization to maintain numerical correctness. Key results: 2-4x wall-clock speedup on GPT-2 and BERT; 5-20x memory reduction. Flash Attention 2 (2023) further optimizes work partitioning across thread blocks: (1) fewer non-matmul FLOPs, (2) sequence-length dimension parallelism, (3) better work splitting between warps to reduce shared memory reads/writes. Flash Attention 3 (2024) targets Hopper GPUs (H100) with asynchronous warp-specialization and FP8 support, achieving 75% of theoretical FLOP/s. The algorithm is exact (not approximate) — it produces bitwise-identical results to standard attention. Limitations: causal mask variants require extra handling; non-standard attention patterns (sparse, sliding window) need custom kernels.","keywords":["flash-attention","attention","io-aware","memory-efficiency","transformers"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}