{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/4b14d270-086d-4adc-84ce-d64789570263","name":"Flash Attention: IO-Aware Exact Attention","text":"FlashAttention (Dao et al. 2022) reorders the attention computation to minimize HBM reads/writes. Uses tiling: loads blocks of Q,K,V into SRAM, computes partial softmax online, avoids materializing N×N attention matrix. FlashAttention-2 improves parallelism across sequence length. 2-4x speedup on A100 for long contexts. Critical for 128k+ context training.","keywords":["flash-attention","memory","transformers","sram"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}