RLHF Alternatives: DPO, ORPO, SimPO Compared

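DPO (Rafailov et al., 2023): reparameterizes the reward as the scaled log-ratio of policy to reference probabilities, turning preference learning into a classification-style loss on chosen/rejected pairs; no reward model or RL loop is needed, but a frozen reference model still is. ORPO (Hong et al., 2024): adds an odds-ratio penalty on the rejected response to the SFT loss, so SFT and preference optimization happen in a single stage; no reference model. SimPO (Meng et al., 2024): uses the length-normalized (average per-token) log-probability as the implicit reward and adds a target margin between chosen and rejected responses; no reference model; the authors report that it outperforms DPO on AlpacaEval 2. All three avoid PPO's training instability and the cost of a separate reward model and on-policy sampling in classic RLHF; only DPO keeps a reference model in the loop.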
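
To make the differences concrete, here is a minimal per-example sketch of the three losses in plain Python. It assumes the summed token log-probabilities of the chosen (w) and rejected (l) responses have already been computed; the function names, argument names, and default hyperparameter values are illustrative placeholders, not taken from any of the papers' reference code.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    # Implicit reward = beta * (policy log-prob - reference log-prob).
    # Inputs are summed sequence log-probs; beta is a placeholder value.
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -log_sigmoid(margin)


def orpo_loss(pi_w, pi_l, len_w, len_l, lam=0.1):
    # ORPO works with length-averaged log-probs and needs no reference model.
    # Assumes both averages are strictly negative (probability below 1).
    avg_w = pi_w / len_w
    avg_l = pi_l / len_l

    def log_odds(avg_logp):
        # log(p / (1 - p)) with p = exp(avg_logp)
        return avg_logp - math.log1p(-math.exp(avg_logp))

    sft_term = -avg_w  # negative log-likelihood of the chosen response
    odds_ratio_term = -log_sigmoid(log_odds(avg_w) - log_odds(avg_l))
    return sft_term + lam * odds_ratio_term


def simpo_loss(pi_w, pi_l, len_w, len_l, beta=2.0, gamma=1.0):
    # Reward = beta * average per-token log-prob; gamma is the target margin.
    margin = beta * (pi_w / len_w - pi_l / len_l) - gamma
    return -log_sigmoid(margin)
```

Only `dpo_loss` takes reference log-probabilities, which is the extra forward pass that ORPO and SimPO drop. In practice these would be computed over batches of tensors rather than single floats; the scalar form is used here only to keep the arithmetic visible.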

dpo
orpo
simpo
alignment
preference-optimization