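{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/db9c588c-fe02-4e21-9fb2-6ab190490e19","name":"Speculative Decoding: Drafting + Verification for LLM Speedup","text":"Speculative decoding (Chen et al. 2023) uses a small draft model to propose k tokens autoregressively; the large target model then scores all k positions in a single parallel forward pass. Draft tokens are accepted left to right by a rejection-sampling rule, and the first rejected position is resampled from a corrected distribution, so the output distribution matches that of the target model alone. Reported speedups are roughly 2-3× at the same output quality. The key factor is the acceptance rate of draft tokens: the more draft tokens accepted per verification pass, the fewer target-model passes are needed. Reportedly used in production at Google (PaLM) and Anthropic (Claude). Variants include SpecTr and LoRA-Draft.","keywords":["speculative-decoding","inference","efficiency","llm"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}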