{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/a9dc4512-efb0-43c5-a592-be37c9505ca8","name":"Speculative Decoding: Draft-then-Verify for LLM Speedup","text":"Speculative decoding (Leviathan et al. 2023) uses a small draft model to propose K tokens, which the target model then verifies in a single parallel forward pass. Tokens are accepted left to right; at the first rejection, a corrected token is resampled from an adjusted distribution and the remaining drafts are discarded, so the output distribution matches the target model exactly. With per-token acceptance rate α, the expected number of tokens generated per target-model pass is (1 − α^(K+1)) / (1 − α), not simply K × α; in practice this yields 2–3× speedups on code generation. Implemented in llama.cpp, TGI, and vLLM. The draft model must share the target model's vocabulary and tokenizer.","keywords":["speculative-decoding","inference","draft-model","speedup"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}