{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/d05d5f2b-4092-4cac-be23-3720219db69f","name":"PATCHED R35: Speculative Decoding Guide","text":"Speculative decoding (Leviathan et al. 2023; Chen et al. 2023) accelerates autoregressive LLM inference by using a small, fast draft model to propose K tokens, which the larger target model then verifies in a single parallel forward pass. The key insight: because decoding is memory-bandwidth bound (the model weights are read once either way), scoring K tokens costs barely more than scoring 1, so each target pass can yield between 1 and K+1 tokens instead of exactly 1. Acceptance criterion: draft token x_i is accepted if u < min(1, p_target(x_i|context) / p_draft(x_i|context)) for u ~ Uniform(0,1). On the first rejection, a replacement token is sampled from the residual distribution norm(max(0, p_target - p_draft)); if all K drafts are accepted, one bonus token is sampled from the target. This rejection-sampling scheme guarantees the output distribution exactly matches the target model's, so there is no quality loss. The acceptance rate α is typically 60-80% for well-matched draft/target pairs; assuming a position-independent α, the expected number of tokens per target pass is (1 - α^(K+1)) / (1 - α). Notable variants: Medusa attaches extra decoding heads to the target model so that no separate draft model is needed, and DeepMind's SpS (Chen et al. 2023) is the independently developed speculative-sampling formulation. Self-speculative decoding uses early-exit layers of the same model as the draft; tree-based speculative decoding verifies a token tree rather than a single chain. Key constraints: draft and target must share a tokenizer and vocabulary, and the benefit shrinks at batch sizes > 1 because the GPU's parallelism is already being exploited. On bandwidth-bound hardware such as A100/H100, speculative decoding effectively trades spare compute for bandwidth efficiency.","keywords":["speculative-decoding","llm-inference","draft-model","latency","autoregressive"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}
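The accept/resample rule described in the capsule can be sketched as follows. This is a minimal toy illustration, not the original papers' reference code: the next-token distributions `p_target` and `p_draft` are supplied directly as arrays (a real system would obtain `p_target` for all K positions from one batched forward pass of the target model), and the function name `speculative_step` is hypothetical.

```python
import numpy as np

def speculative_step(p_target, p_draft, draft_tokens, rng):
    """One verification step of speculative sampling (toy sketch).

    p_target / p_draft: arrays of shape (K, vocab) holding the target and
    draft models' next-token distributions at each draft position.
    draft_tokens: the K token ids the draft model proposed.
    Returns the accepted prefix plus one corrective token on the first
    rejection, so every step emits at least one token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        # Accept token with probability min(1, p_target / p_draft);
        # rng.random() is in [0, 1), so a ratio >= 1 always accepts.
        if rng.random() < p_target[i, tok] / p_draft[i, tok]:
            out.append(int(tok))
        else:
            # Rejection: resample from the residual distribution
            # norm(max(0, p_target - p_draft)). This correction is what
            # makes emitted tokens follow the target model exactly.
            residual = np.maximum(p_target[i] - p_draft[i], 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(residual), p=residual)))
            return out
    # All K drafts accepted; a full implementation would also sample a
    # bonus (K+1)-th token from the target's next distribution here.
    return out
```

A quick way to see the losslessness claim: with K = 1, repeatedly drafting from `p_draft` and running one step yields tokens whose empirical frequencies converge to `p_target`, not to the draft distribution.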