PATCHED R35: Speculative Decoding Guide

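Speculative decoding (Leviathan et al. 2023; Chen et al. 2023) accelerates autoregressive LLM inference by having a small, fast draft model propose K tokens, which the larger target model then verifies in a single parallel forward pass. The key insight: small-batch decoding is typically limited by memory bandwidth rather than compute, so a target forward pass that scores K positions (reusing the KV cache for the existing prefix) costs little more than one that scores a single position. If most draft tokens are accepted, each target pass yields several tokens instead of 1, up to K+1 when every draft token is accepted, since the same pass also supplies the target's distribution for the next position. Acceptance criterion: token x_i is...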
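
A minimal sketch of one verification step, assuming the rejection-sampling rule described in Leviathan et al. 2023: a draft token x is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q is the draft distribution at that position. On the first rejection, a replacement token is sampled from the normalized positive part of p - q and the step ends. The names here (`speculative_step`, `sample`, `draft_probs`, `target_probs`, `bonus_probs`) are hypothetical, and the distributions are toy dictionaries standing in for real model outputs.

```python
import random


def sample(weights, rng):
    """Draw one token from a dict of token -> non-negative weight."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


def speculative_step(draft_probs, target_probs, draft_tokens, bonus_probs, rng):
    """Verify K draft tokens against the target model's distributions.

    draft_probs[i], target_probs[i]: dicts token -> probability at position i
    draft_tokens[i]: the token the draft model sampled at position i
    bonus_probs: target distribution for the position after the last draft token
    Returns the tokens emitted this step (between 1 and K+1 of them).
    """
    emitted = []
    for q, p, x in zip(draft_probs, target_probs, draft_tokens):
        # Accept the draft token with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p.get(x, 0.0) / q[x]):
            emitted.append(x)
            continue
        # Rejected: resample from the positive part of (p - q), then stop.
        # rng.choices normalizes the weights implicitly.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        emitted.append(sample(residual, rng))
        return emitted
    # Every draft token was accepted: take one extra token from the target.
    emitted.append(sample(bonus_probs, rng))
    return emitted


if __name__ == "__main__":
    rng = random.Random(0)
    q = {"a": 0.7, "b": 0.3}   # toy draft distribution (assumed values)
    p = {"a": 0.4, "b": 0.6}   # toy target distribution (assumed values)
    drafts = [sample(q, rng) for _ in range(3)]
    print(speculative_step([q] * 3, [p] * 3, drafts, p, rng))
```

When the draft and target distributions agree exactly, every draft token is accepted and the step emits K+1 tokens. When the target assigns zero probability to a draft token, that token is always rejected and the step emits a single token drawn from the target's remaining mass.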

Source: https://arxiv.org/abs/2211.17192

speculative-decoding
llm-inference
draft-model
latency
autoregressive