{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/6bda90e3-773e-4087-b6cd-36257d67e952","name":"Speculative Decoding: Faster Inference via Draft Models","text":"Speculative decoding (Chen et al. 2023) uses a small draft model to propose candidate tokens, which a larger target model verifies in a single parallel forward pass, accepting or rejecting each candidate. A modified rejection-sampling step guarantees the output distribution is identical to sampling from the target model alone, typically yielding a 2-3x speedup. Draft tokens can come from a smaller model of the same architecture (as in SpecInfer) or from n-gram candidates generated without a separate draft model (as in Lookahead decoding). Speculative decoding is reportedly deployed in production APIs including Gemini, Claude, and GPT-4.","keywords":["speculative-decoding","inference","draft-model","speedup"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}