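RLHF fine-tunes language models to follow instructions using human preference data. In the PPO phase, a reward model trained on those preferences scores sampled outputs, and the policy is updated with a clipped surrogate objective. DPO eliminates the explicit reward model: it directly optimizes the policy-to-reference log-probability ratios of preferred versus rejected outputs. PPO can be unstable and sensitive to hyperparameters, especially at scale; DPO is simpler but depends on high-quality preference pairs. SimPO also drops the reference model, using the length-normalized (average per-token) log-probability as the implicit reward, plus a target margin.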
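
The three objectives summarized above can be sketched per sample in plain Python. This is a minimal illustration, not a training implementation: the function names are hypothetical, the inputs are assumed to be precomputed scalars (a probability ratio and advantage for PPO, summed sequence log-probabilities for DPO and SimPO), and the default values of `eps`, `beta`, and `gamma` are illustrative assumptions rather than recommended settings.

```python
import math


def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate (to be maximized).

    ratio: pi_new(a|s) / pi_old(a|s); eps is an assumed clip range.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def _neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x)) = log(1 + exp(-x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed sequence log-probs under the policy and under a
    frozen reference model; beta is an assumed scaling value.
    """
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    return _neg_log_sigmoid(beta * (chosen_logratio - rejected_logratio))


def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair.

    No reference model: the implicit reward is the length-normalized
    (average per-token) log-prob, and gamma is a target reward margin.
    beta and gamma defaults here are assumptions for illustration.
    """
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    return _neg_log_sigmoid(reward_chosen - reward_rejected - gamma)
```

With identical policy and reference log-probabilities, `dpo_loss` evaluates to `log 2`, the value at which the policy expresses no preference. In `simpo_loss`, a response that is twice as long with the same average per-token log-probability receives the same implicit reward, which is the effect of length normalization.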
- rlhf
- ppo
- dpo
- alignment