Continuous Batching in LLM Serving: Iteration-Level Scheduling

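Continuous batching, introduced as iteration-level scheduling in Orca (OSDI 2022), lets new requests join an in-flight batch at iteration (token) boundaries instead of waiting for the whole batch to complete. Finished requests leave the batch at the same boundaries, so their slots are reused immediately. The Orca paper reports a 36.9× throughput improvement over NVIDIA FasterTransformer at the same latency level on GPT-3 175B. The technique is used in vLLM, TGI, and TensorRT-LLM.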
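
The scheduling difference can be illustrated with a toy simulation. This is a minimal sketch, not how Orca or vLLM is implemented: the function names, the (arrival, length) request tuples, and the assumption that every decode iteration costs one time unit regardless of batch size are all hypothetical simplifications, and prefill cost and KV-cache memory limits are ignored.

```python
from collections import deque

def static_batching(requests, max_batch):
    """Toy static batching: a batch runs until its longest request finishes.

    requests: list of (arrival_step, tokens_to_generate), tokens >= 1.
    Returns {request_index: step at which its result is returned}.
    """
    pending = deque(sorted(enumerate(requests), key=lambda x: x[1][0]))
    finished = {}
    step = 0
    while pending:
        # Idle until the next request has arrived.
        step = max(step, pending[0][1][0])
        batch = []
        while pending and pending[0][1][0] <= step and len(batch) < max_batch:
            batch.append(pending.popleft())
        # The whole batch is held until its longest member is done.
        step += max(length for _, (_, length) in batch)
        for rid, _ in batch:
            finished[rid] = step
    return finished

def continuous_batching(requests, max_batch):
    """Toy iteration-level scheduling: admit and retire at every iteration."""
    pending = deque(sorted(enumerate(requests), key=lambda x: x[1][0]))
    running = {}  # request index -> tokens still to generate
    finished = {}
    step = 0
    while pending or running:
        # Iteration boundary: fill free slots with requests that have arrived.
        while pending and pending[0][1][0] <= step and len(running) < max_batch:
            rid, (_, length) = pending.popleft()
            running[rid] = length
        # One decode iteration: each running request emits one token.
        step += 1
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                # Retire immediately so the slot is free next iteration.
                del running[rid]
                finished[rid] = step
    return finished
```

With requests = [(0, 8), (0, 2), (1, 2), (1, 2)] and max_batch = 2, the static scheduler holds the short request at index 1 until step 8 because it shares a batch with the 8-token request, and the two later arrivals only finish at step 10. The continuous scheduler returns the short requests at steps 2, 4, and 6 while the long request still finishes at step 8.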

continuous-batching
vllm
serving
inference