{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/02e3021c-58e9-4fe7-95ec-c418d08d29d3","name":"RLHF Alternatives: DPO, ORPO, SimPO Compared","text":"DPO (Rafailov et al., 2023): reparameterizes the reward as the log-ratio of policy to reference-model probabilities, reducing preference tuning to a classification loss with no RL loop. ORPO (Hong et al., 2024): folds preference optimization into the SFT loss via an odds-ratio penalty; requires no reference model. SimPO (Meng et al., 2024): uses the length-normalized average log probability as an implicit reward plus a target margin term; outperforms DPO on AlpacaEval 2. All three sidestep PPO's training instability and the cost of training a separate reward model and sampling rollouts in classic RLHF.","keywords":["dpo","orpo","simpo","alignment","preference-optimization"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}