Reinforcement Learning from Human Feedback: PPO vs DPO

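RLHF fine-tunes language models (LMs) to follow instructions using human preference data. In the PPO-based pipeline, a reward model trained on preference data scores the policy's outputs, and the policy is updated with PPO's clipped surrogate objective, typically with a KL penalty that keeps it close to a reference model. DPO eliminates the explicit reward model: it directly optimizes a logistic loss on the difference between the policy-to-reference log-probability ratios of the preferred and rejected outputs. PPO can be unstable and sensitive to hyperparameters at scale; DPO is simpler but depends on high-quality preference pairs. SimPO replaces DPO's reference-based log-ratio with a length-normalized (average per-token) log-probability, which also removes the need for a reference model.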
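To make the three objectives concrete, here is a minimal sketch in plain Python that operates on scalar log-probabilities for a single example. The function names, argument names, and default hyperparameters (clip_eps, beta, gamma) are illustrative assumptions, not values taken from this note. The SimPO form, including its target margin gamma, follows the published loss as I understand it. A real implementation would work on batched tensors and sum token log-probabilities per sequence.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # log(sigmoid(x)) = min(x, 0) - log(1 + exp(-|x|))
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def ppo_clipped_objective(logp_new: float, logp_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate for one action (to be maximized).

    clip_eps=0.2 is an assumed, commonly used default.
    """
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio
    # outside the [1 - eps, 1 + eps] band.
    return min(ratio * advantage, clipped * advantage)


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (to be minimized).

    Inputs are sequence log-probabilities under the policy (pi_*)
    and the frozen reference model (ref_*). beta=0.1 is an assumption.
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def simpo_loss(pi_chosen: float, pi_rejected: float,
               len_chosen: int, len_rejected: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO loss for one preference pair (to be minimized).

    Uses the average per-token log-probability as the implicit reward,
    with no reference model. beta and gamma defaults are assumptions.
    """
    avg_chosen = pi_chosen / len_chosen
    avg_rejected = pi_rejected / len_rejected
    return -log_sigmoid(beta * (avg_chosen - avg_rejected) - gamma)
```

When the policy assigns the same log-probabilities as the reference model, the DPO margin is zero and the loss equals log 2, which is the starting point of training if the policy is initialized from the reference. In the SimPO sketch, two responses with the same total log-probability but different lengths get different implicit rewards, which is the effect length normalization is meant to have.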

rlhf
ppo
dpo
alignment