{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/b7a0ae77-9dc6-4f06-9889-f5cdbf3f1bc9","identifier":"b7a0ae77-9dc6-4f06-9889-f5cdbf3f1bc9","url":"https://forgecascade.org/public/capsules/b7a0ae77-9dc6-4f06-9889-f5cdbf3f1bc9","name":"A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability","text":"# A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability\n\n**Authors:** Ruitao Liu, Xinyang Tian, Shuo Chen, Tingrui Zhang, Guang Yang\n**arXiv:** https://arxiv.org/abs/2605.18750v1\n**Published:** 2026-05-18T17:59:18Z\n\n## Abstract\nPipeline parallelism is a key technique for scaling large-model training, but modern workloads exhibit runtime variability in computation and communication. Existing pipeline systems typically consume static, profiled, or adaptively generated schedules as pre-committed execution orders. When realized task readiness diverges from the pre-committed order, stages may wait for not-yet-ready work even though other executable work is available, creating stage misalignment, idle bubbles, and reduced utilization.\n\nWe present Runtime-Readiness-First Pipeline (RRFP), a readiness-driven runtime for pipeline-parallel training. RRFP changes how schedules are consumed at runtime: instead of treating a schedule as a sequence that stages must wait to follow, it treats the schedule as a non-binding hint order for ranking currently ready work. To support this model, RRFP combines message-driven asynchronous communication, lightweight tensor-parallel coordination for collective consistency, and ready-set arbitration for low-overhead dispatch.\n\nWe implement RRFP in a Megatron-based training framework and evaluate it on language-only and multimodal workloads using up to 128 GPUs. RRFP improves over fixed-order pipeline baselines across all settings. Using the BFW hint, RRFP achieves up to 1.77$\\times$ speedup on language-only workloads and up to 2.77$\\times$ on multimodal workloads. In cross-framework comparisons, RRFP with the default BF hint outperforms the faster of the available external systems by up to 1.84$\\times$ while preserving training correctness.","keywords":["cs.DC","cs.LG"],"about":[{"@type":"Thing","name":"admin@338"},{"@type":"Thing","name":"Metador"},{"@type":"Thing","name":"BPFDoor"},{"@type":"Thing","name":"PUBLOAD"},{"@type":"Thing","name":"Sagerunex"},{"@type":"Thing","name":"Compromise Software Dependencies and Development Tools"},{"@type":"Thing","name":"Runtime Data Manipulation"},{"@type":"Thing","name":"Poisoned Pipeline Execution"}],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"},"dateCreated":"2026-05-19T06:00:07.196000Z","dateModified":"2026-05-19T06:00:07.196000Z","isBasedOn":"https://arxiv.org/abs/2605.18750v1","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":65},{"@type":"PropertyValue","name":"verification_status","value":"source_linked"},{"@type":"PropertyValue","name":"evidence_level","value":"primary_source"}]}