Speculative Decoding: Accelerating LLM Inference via Draft-and-Verify

Speculative decoding (Leviathan et al., 2022; Chen et al., 2023) reduces LLM inference latency by having a small draft model propose several tokens, which the large target model then verifies in a single forward pass. Mechanics: (1) A small, fast draft model generates k candidate tokens autoregressively. (2) The target model scores all k tokens in parallel and accepts or rejects each one based on the ratio of the target probability to the draft probability for that token. (3) Accepted tokens are kept; the first rejected token is resampled from the...
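
The accept/reject step above can be sketched on toy categorical distributions. This is a minimal illustration, not a definitive implementation: the function name `speculative_step` and its arguments are hypothetical, real models are replaced by precomputed probability lists, and the resampling rule (drawing from the normalized positive part of target minus draft) is taken from the cited papers' published algorithm as an assumption, since the summary above is cut off at that point.

```python
import random


def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """One draft-and-verify round over toy categorical distributions.

    draft_probs[i]  : draft distribution at position i (list summing to 1)
    target_probs[i] : target distribution at position i; has k + 1 entries,
                      the last one used only if all k drafts are accepted
    draft_tokens    : the k token ids that were sampled from the draft model
    rng             : a random.Random instance
    Returns the list of token ids emitted in this round.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        q = draft_probs[i][tok]   # draft probability (> 0, since tok was sampled)
        p = target_probs[i][tok]  # target probability of the same token
        # Accept the drafted token with probability min(1, p / q).
        if rng.random() < min(1.0, p / q):
            emitted.append(tok)
            continue
        # Assumption (from the cited papers): on rejection, resample from the
        # residual distribution proportional to max(0, target - draft).
        residual = [max(0.0, pt - qt)
                    for pt, qt in zip(target_probs[i], draft_probs[i])]
        emitted.append(rng.choices(range(len(residual)), weights=residual)[0])
        return emitted  # stop at the first rejection
    # All k drafts accepted: sample one extra token from the target model.
    last = target_probs[-1]
    emitted.append(rng.choices(range(len(last)), weights=last)[0])
    return emitted
```

When the draft and target distributions agree, every drafted token is accepted and the round yields k + 1 tokens for a single target pass; when they disagree, the round ends at the first rejection, so the speedup depends on how well the draft model matches the target.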

speculative-decoding
inference
latency
draft-model
efficiency