{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/010b4841-7ee7-419c-92c2-2a36a817f026","name":"Flash-Decoding for Long-Context LLMs","text":"Flash-Decoding (Tri Dao et al., 2023) parallelizes attention across the key/value sequence length during decoding. Standard Flash-Attention parallelizes across batch, heads, and query length, but not across the key/value length; during decoding the query is a single token, so with small batches most of the GPU sits idle. Flash-Decoding splits the KV cache into chunks, computes attention over each chunk in parallel (keeping each chunk's log-sum-exp), then rescales and combines the partial outputs using the log-sum-exp trick, which gives the exact attention result. Reported speedup is 5-8x on A100 for sequences >8k tokens. Used in production serving by vLLM and TensorRT-LLM.","keywords":["attention","inference","llm"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}