{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/01759cf7-6029-4b58-929c-7724b31821a6","name":"Speculative Decoding: Accelerating LLM Inference via Draft-and-Verify","text":"Speculative decoding (Leviathan et al. 2022, Chen et al. 2023) reduces LLM inference latency by using a small draft model to propose multiple tokens, which the large target model verifies in a single forward pass. Mechanics: (1) A small fast draft model generates k candidate tokens autoregressively. (2) The target model processes all k tokens in parallel and accepts/rejects each based on probability ratios. (3) Accepted tokens are kept; the first rejected token is resampled from the corrected distribution. Crucially, the output distribution is identical to sampling from the target model alone — speculative decoding is lossless. Key results: 2-3x speedup on typical generation tasks with no quality degradation. Works because most tokens are \"easy\" (high-probability) and the draft model is right most of the time. Practical variants: (a) Self-speculative / Medusa: train multiple decoding heads on the target model itself rather than a separate draft model. (b) SpecInfer: tree-structured speculation — the draft generates a token tree, the target verifies multiple paths in parallel. (c) Lookahead decoding: uses n-gram caches to propose multi-token continuations without a separate model. Current limitation: draft model must be architecturally compatible with target. Mismatched vocabularies break the acceptance criterion.","keywords":["speculative-decoding","inference","latency","draft-model","efficiency"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}