Speculative decoding (Leviathan et al. 2023; Chen et al. 2023) accelerates autoregressive LLM inference by having a small, fast draft model propose K tokens, which the larger target model then verifies in a single parallel forward pass. The key insight: single-token decoding is typically limited by memory bandwidth rather than compute, so a target forward pass over K tokens costs barely more than a pass over 1 token (the KV cache for the existing prefix is reused either way). If most draft tokens are accepted, each target step yields up to K tokens, plus one token sampled from the target model itself, instead of 1. Acceptance criterion: token x_i is...
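The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal toy under stated assumptions, not the papers' reference implementation: `toy_draft`, `toy_target`, `speculative_step`, and the 4-token vocabulary are hypothetical names invented for illustration. Because the statement of the acceptance criterion above is truncated, the sketch assumes the rejection-sampling rule from the cited papers: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution, and on rejection resample from the normalized residual max(0, p - q).

```python
import random

VOCAB = 4  # hypothetical toy vocabulary size


def toy_target(ctx):
    """Hypothetical target model: strongly favours (last token + 1) mod VOCAB."""
    probs = [0.1] * VOCAB
    probs[(ctx[-1] + 1) % VOCAB] = 0.7
    return probs


def toy_draft(ctx):
    """Hypothetical draft model: same preference as the target, but less confident."""
    probs = [0.2] * VOCAB
    probs[(ctx[-1] + 1) % VOCAB] = 0.4
    return probs


def sample(weights, rng):
    """Draw an index from non-negative weights (they need not sum to 1)."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_step(prefix, draft_probs, target_probs, k, rng):
    """One draft-then-verify step; returns between 1 and k + 1 new tokens."""
    # 1. Draft: sample k tokens autoregressively from the draft model.
    drafted, q_dists = [], []
    ctx = list(prefix)
    for _ in range(k):
        q = draft_probs(ctx)
        x = sample(q, rng)
        drafted.append(x)
        q_dists.append(q)
        ctx.append(x)

    # 2. Verify: get target distributions at all k + 1 positions.
    # A real system computes these in one batched forward pass;
    # the toy model is simply called once per position.
    p_dists = [target_probs(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3. Accept or reject the drafted tokens left to right.
    out = []
    for i, x in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)  # accepted: keep the drafted token
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pj - qj) for pj, qj in zip(p, q)]
        out.append(sample(residual, rng))
        return out

    # All k drafts accepted: take one extra token from the target model.
    out.append(sample(p_dists[k], rng))
    return out


rng = random.Random(0)
tokens = [0]
while len(tokens) < 20:
    tokens.extend(speculative_step(tokens, toy_draft, toy_target, k=4, rng=rng))
print(tokens)
```

With this rule, each emitted token is distributed as if it had been sampled from the target model alone, so the draft model affects only speed, not output quality. In the toy setup the first token after the prefix `[0]` is `1` with probability 0.4 (drafted and always accepted) plus 0.3 (some other token drafted, rejected, then resampled), which matches the target's 0.7.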
Source: https://arxiv.org/abs/2211.17192
- speculative-decoding
- llm-inference
- draft-model
- latency
- autoregressive