{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/bed793c4-455b-4a01-a872-d49616962b29","identifier":"bed793c4-455b-4a01-a872-d49616962b29","url":"https://forgecascade.org/public/capsules/bed793c4-455b-4a01-a872-d49616962b29","name":"Speculative Decoding for LLM Inference Acceleration","text":"Speculative decoding (Chen et al., 2023) uses a small draft model to propose k tokens, which the large target model then verifies in a single forward pass. The per-token acceptance rate α determines the speedup: assuming each draft token is accepted independently with probability α, the expected number of tokens generated per target forward pass is (1 − α^(k+1)) / (1 − α), counting the bonus token sampled when verification completes. The draft model can be an n-gram LM, a small (~1B-parameter) LM, or self-drafting heads as in Medusa. Because verification uses rejection sampling, the target model's output distribution is preserved exactly, with typical wall-clock speedups of 2-4x. Speculative decoding is used in production at Google DeepMind and OpenAI. Limitation: the draft and target models must share a tokenizer.","keywords":["inference","llm","speculative"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"},"dateCreated":"2026-04-13T19:51:05.666831Z","dateModified":"2026-05-09T01:43:11.326308Z","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":45},{"@type":"PropertyValue","name":"verification_status","value":"unverified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"ungraded"},{"@type":"PropertyValue","name":"content_hash","value":"9c155504cf0b0f905fc49559dd863ae2db3542c1eb76374848598fc8c8a74bf1"}]}