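DPO (Rafailov et al., 2023): reparameterizes the reward as β times the log-ratio of policy to reference probabilities, so preferences are fit with a classification-style loss; no reward model or RL loop needed, but a frozen reference model is still required. ORPO (Hong et al., 2024): adds an odds-ratio preference penalty to the SFT loss, giving a single-stage objective; no reference model. SimPO (Meng et al., 2024): uses the length-normalized (average per-token) log-probability as the reward, plus a target margin term; no reference model; reported to outperform DPO on AlpacaEval 2. All three avoid the separate reward model and PPO instability of classic RLHF; ORPO and SimPO also drop the reference-model forward pass.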
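The three objectives can be compared side by side for a single preference pair. The sketch below is a minimal illustration, not any library's implementation: it assumes the summed sequence log-probabilities (`pi_w`, `pi_l` for the chosen and rejected responses under the policy, `ref_w`, `ref_l` under the reference) and token counts (`len_w`, `len_l`) have already been computed. All function names, argument names, and default hyperparameter values (`beta`, `gamma`, `lam`) are hypothetical choices for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """DPO: implicit reward is beta * (log pi(y) - log pi_ref(y))."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -log_sigmoid(margin)


def simpo_loss(pi_w, pi_l, len_w, len_l, beta=2.0, gamma=1.0):
    """SimPO: reward is the average per-token log-prob; gamma is the target margin."""
    margin = beta * (pi_w / len_w - pi_l / len_l) - gamma
    return -log_sigmoid(margin)


def orpo_loss(pi_w, pi_l, len_w, len_l, lam=0.1):
    """ORPO: SFT loss on the chosen response plus a weighted odds-ratio penalty.

    Assumes average log-probs are strictly negative (probability < 1).
    """
    avg_w = pi_w / len_w
    avg_l = pi_l / len_l

    def log_odds(avg_logp):
        # log(p / (1 - p)) with p = exp(avg_logp)
        return avg_logp - math.log1p(-math.exp(avg_logp))

    sft = -avg_w  # negative log-likelihood of the chosen response, per token
    odds_ratio_term = -log_sigmoid(log_odds(avg_w) - log_odds(avg_l))
    return sft + lam * odds_ratio_term


if __name__ == "__main__":
    # Hypothetical log-probs: chosen response is more likely than rejected.
    print("DPO  ", dpo_loss(-10.0, -30.0, -12.0, -28.0))
    print("SimPO", simpo_loss(-10.0, -30.0, 10, 10))
    print("ORPO ", orpo_loss(-10.0, -30.0, 10, 10))
```

Only `dpo_loss` takes reference log-probabilities, which mirrors the point above: ORPO and SimPO need no reference model, while DPO still needs one forward pass through it per response.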
- dpo
- orpo
- simpo
- alignment
- preference-optimization