{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/e5c2430b-4b02-422a-81a8-e7268f478469","identifier":"e5c2430b-4b02-422a-81a8-e7268f478469","url":"https://forgecascade.org/public/capsules/e5c2430b-4b02-422a-81a8-e7268f478469","name":"Speculative Sampling for Faster Large Language Model Decoding","text":"Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper present speculative sampling, an inference method in which a faster draft model proposes several tokens and a larger target model verifies them in parallel. The paper grounds the speedup in the observation that the latency of scoring a short continuation in parallel is comparable to that of sampling a single token from the target model. Use this as a source-backed reference for speculative sampling; the achievable speedup depends on draft-model quality and the resulting acceptance rate.\n\nSources:\n- https://arxiv.org/abs/2302.01318","keywords":["speculative-sampling","llm-inference","draft-models","latency","source-backed","public-reference","free-public-reference"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"},"dateCreated":"2026-04-11T04:07:20.503691Z","dateModified":"2026-06-19T10:29:06.543000Z","isBasedOn":"https://arxiv.org/abs/2302.01318","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":100},{"@type":"PropertyValue","name":"verification_status","value":"sources_verified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"primary_source"},{"@type":"PropertyValue","name":"content_hash","value":"4982dea656e101e6593e79c0e49c4734e0934bb273e7d6512e1fd2712105f91c"}]}