Reinforcement Learning from Human Feedback: PPO vs DPO

Type: KNOWLEDGE

Verification: unverified - Evidence: ungraded

Quality: public

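RLHF fine-tunes language models (LMs) to follow instructions using human preference data. In the PPO variant, a reward model trained on preference comparisons scores sampled outputs, and the policy is updated with PPO's clipped surrogate objective, usually with a KL penalty that keeps it close to a reference model. DPO eliminates the separate reward model and the sampling loop: it minimizes a logistic loss on the difference between the policy-to-reference log-probability ratios of the preferred and rejected outputs. PPO is often reported to be unstable and sensitive to hyperparameters, and it is costly at scale because several models (policy, reference, reward, and usually a value model) must be kept in memory; DPO is simpler, but its results depend heavily on the quality of the preference pairs. SimPO modifies the DPO objective by using the length-normalized (per-token average) log-probability as the implicit reward, which removes the need for a reference model, and by adding a target reward margin.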
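The three objectives named above can be compared side by side in a minimal per-sample sketch. This is an illustration under stated assumptions, not a training implementation: it works on scalar sequence log-probabilities rather than tensors, the function names are hypothetical, and the default values for `clip_eps`, `beta`, and `gamma` are illustrative placeholders. The SimPO function includes a target margin `gamma` and omits the reference model, following the published SimPO formulation; the text above only requires the length normalization.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    # log sigmoid(x) = min(x, 0) - log(1 + exp(-|x|))
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def ppo_clipped_objective(ratio: float, advantage: float,
                          clip_eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate (a quantity to maximize).

    ratio     = pi_new(a|s) / pi_old(a|s)
    advantage = estimated advantage (in RLHF, derived from reward-model scores)
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective a pessimistic bound, so large
    # policy changes are not rewarded beyond the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss (a quantity to minimize).

    Inputs are summed token log-probabilities of each full response under
    the trainable policy and the frozen reference model.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


def simpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
               len_chosen: int, len_rejected: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """Per-pair SimPO loss (a quantity to minimize); no reference model.

    The implicit reward is the per-token average log-probability scaled by
    beta; gamma is the target reward margin.
    """
    reward_chosen = beta * policy_logp_chosen / len_chosen
    reward_rejected = beta * policy_logp_rejected / len_rejected
    return -log_sigmoid(reward_chosen - reward_rejected - gamma)
```

When the policy equals the reference model, both DPO log-ratios are zero and `dpo_loss` evaluates to log 2 regardless of `beta`; the loss drops below that value only as the policy shifts probability toward the preferred response relative to the reference. In `simpo_loss`, two responses with the same summed log-probability but different lengths receive different rewards, which is the effect of the length normalization.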