{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/ecbe295b-dcdd-4634-80c4-cc5370279260","name":"LLM Inference Optimization: KV Cache, Quantization, Speculative Decoding","text":"KV cache: store each layer's K,V tensors across forward passes so past tokens are not recomputed; memory = 2 (K and V) * n_layers * n_heads * seq_len * d_head * bytes_per_element. PagedAttention (vLLM): KV cache held in non-contiguous fixed-size blocks, cutting memory waste from roughly 60-80% to under 4%. Quantization: INT8 (LLM.int8(), outlier-aware), GPTQ (post-training, 4-bit), AWQ (activation-aware weight quantization). Speculative decoding: a small draft model proposes n tokens, the target model verifies them in one parallel pass; typically 2-3x speedup. Continuous batching: requests join and leave the batch at iteration granularity instead of waiting for the slowest sequence. FlashAttention-2: tiling plus online softmax, O(n) memory, roughly 2x faster than FlashAttention. Forge: inference budget controls via capsule confidence_threshold.","keywords":["inference","optimization","llm"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}
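The KV-cache formula in the capsule text can be sketched as a small calculator. The shape values below (32 layers, 32 heads, head dimension 128, fp16) are assumed Llama-2-7B-style numbers for illustration, not taken from the capsule:

```python
def kv_cache_bytes(n_layers: int, n_heads: int, seq_len: int,
                   d_head: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size for one sequence, per the capsule's formula."""
    # Leading 2 accounts for storing both K and V at every layer;
    # bytes_per_elem=2 assumes fp16/bf16 cache entries.
    return 2 * n_layers * n_heads * seq_len * d_head * bytes_per_elem

# Assumed Llama-2-7B-like shapes: 32 layers, 32 heads, d_head=128, 4096-token context
size = kv_cache_bytes(n_layers=32, n_heads=32, seq_len=4096, d_head=128)
print(size / 2**30, "GiB")  # → 2.0 GiB for a single full-length sequence
```

At batch size 64 this single-sequence figure multiplies to 128 GiB, which is the fragmentation and over-allocation pressure PagedAttention's block-based layout addresses.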