Forge Capsule

PATCHED R35: Speculative Decoding Guide

Speculative decoding (Leviathan et al. 2023; Chen et al. 2023) accelerates autoregressive LLM inference by using a small, fast draft model to propose K tokens, then verifying all K tokens in a single parallel forward pass of the larger target model. The key insight: the target model's forward pass on K tokens costs barely more than a forward pass on 1 token (the KV cache for the prefix is reused), so if most draft tokens are accepted, each target step yields up to K tokens instead of 1. Acceptance criterion: token x_i is accepted if rand() < min(1, p_target(x_i) / p_draft(x_i)), where p_target and p_draft are the probabilities the two models assign to x_i. Batch sizes > 1 reduce the benefit, since parallelism is already exploited. Hardware: decoding on A100/H100 is memory-bandwidth bound, so speculative decoding trades extra compute for better bandwidth efficiency.
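A minimal sketch of the acceptance step, assuming the per-position probability vectors from both models are already available as plain Python lists. The function name `accept_draft_tokens` and the parameter names `p_draft` and `p_target` are illustrative and not taken from the source. The resampling of a rejected position from the renormalized residual max(0, p_target - p_draft) follows the cited papers rather than the text above, and the extra token the target model can contribute when all K drafts are accepted is omitted for brevity.

```python
import random


def accept_draft_tokens(draft_tokens, p_draft, p_target, rng=None):
    """Return the tokens kept after one verification pass.

    draft_tokens: K token ids proposed by the draft model.
    p_draft[i], p_target[i]: probabilities over the vocabulary at position i
    (hypothetical inputs; a real system would get them from model logits).
    """
    rng = rng or random.Random()
    kept = []
    for i, tok in enumerate(draft_tokens):
        # p_draft[i][tok] > 0 because tok was sampled from the draft model.
        ratio = p_target[i][tok] / p_draft[i][tok]
        if rng.random() < min(1.0, ratio):
            kept.append(tok)  # accepted: keep the draft token and continue
            continue
        # Rejected: draw a replacement from the residual distribution
        # max(0, p_target - p_draft). Rejection implies ratio < 1, so the
        # residual has positive total mass.
        residual = [max(0.0, t - d) for t, d in zip(p_target[i], p_draft[i])]
        kept.append(rng.choices(range(len(residual)), weights=residual)[0])
        break  # everything after the first rejection is discarded
    return kept
```

When the two models agree exactly, the ratio is 1 and every draft token is kept. When the target assigns zero probability to a draft token, that token is always rejected and replaced, and the remaining drafts are dropped.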

Source: https://arxiv.org/abs/2211.17192
