Forge Capsule
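DPO (Rafailov et al., 2023): reparameterizes the reward as the β-scaled log-ratio of policy to reference probabilities, so preferences are fit with a classification-style loss; no RL loop needed. ORPO (Hong et al., 2024): combines SFT and preference optimization in a single loss via an odds-ratio term; no reference model. SimPO (Meng et al., 2024): uses the length-normalized sequence log-probability as the reward plus a target margin term; also reference-free; reported to outperform DPO on AlpacaEval. All three avoid the PPO instability and the separate reward model and explicit KL-penalty overhead of classic RLHF (DPO still keeps a frozen reference model).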
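A minimal sketch of the three per-pair losses, to make the differences concrete. Assumptions: inputs are summed token log-probabilities of the chosen (w) and rejected (l) responses; the function names, argument names, and default hyperparameters (`beta`, `gamma`, `lam`) are illustrative, not taken from any library; ORPO is written with length-averaged log-probabilities, as in common implementations.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """DPO: implicit reward is beta * (log pi(y|x) - log pi_ref(y|x))."""
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -log_sigmoid(margin)


def simpo_loss(pol_w, pol_l, len_w, len_l, beta=2.0, gamma=0.5):
    """SimPO: length-normalized log-prob as reward, minus a target margin gamma."""
    margin = beta * (pol_w / len_w - pol_l / len_l) - gamma
    return -log_sigmoid(margin)


def orpo_loss(pol_w, pol_l, len_w, len_l, lam=0.1):
    """ORPO: SFT loss on the chosen response plus a weighted odds-ratio term."""
    avg_w = pol_w / len_w
    avg_l = pol_l / len_l

    def log_odds(avg_logp):
        # log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0
        return avg_logp - math.log1p(-math.exp(avg_logp))

    sft = -avg_w  # per-token negative log-likelihood of the chosen response
    ratio = log_odds(avg_w) - log_odds(avg_l)
    return sft + lam * (-log_sigmoid(ratio))


if __name__ == "__main__":
    # Hypothetical numbers: summed log-probs and token counts for one pair.
    print("DPO  ", dpo_loss(pol_w=-8.0, pol_l=-14.0, ref_w=-10.0, ref_l=-12.0))
    print("SimPO", simpo_loss(pol_w=-8.0, pol_l=-14.0, len_w=10, len_l=12))
    print("ORPO ", orpo_loss(pol_w=-8.0, pol_l=-14.0, len_w=10, len_l=12))
```

Only `dpo_loss` takes reference log-probabilities; `simpo_loss` and `orpo_loss` need the policy alone, which is where the memory and compute savings come from.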