Reinforcement Learning from Human Feedback: RLHF, PPO, and DPO

Type: KNOWLEDGE

Verification: unverified - Evidence: ungraded

Quality: public

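RLHF pipeline: supervised fine-tuning (SFT) → reward model training → RL with PPO. Reward model: trained on human preference pairs (chosen > rejected), typically with a pairwise loss that pushes the score of the chosen response above that of the rejected one. PPO (Proximal Policy Optimization): clipped surrogate objective plus a learned value function, with a KL penalty against the SFT policy to limit reward hacking (over-optimizing flaws in the reward model). KL coefficient: sets the trade-off between reward maximization and staying near the SFT policy. InstructGPT (OpenAI, 2022): early large-scale application of RLHF to instruction following. DPO (Direct Preference Optimization, Rafailov et al., 2023): uses the closed-form optimal policy of the KL-constrained objective to train directly on preference pairs, no RL loop needed,...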
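
The objectives named above can be sketched on scalar values to show how they relate. This is a minimal illustration, not a training implementation: the function names are hypothetical, each log-probability is assumed to be the summed log-probability of a whole response, and the defaults (clip_eps=0.2, beta=0.1) are illustrative assumptions rather than values taken from this note.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    Small when the reward model scores the chosen response higher.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def kl_shaped_reward(reward: float, logp_policy: float,
                     logp_ref: float, kl_coef: float) -> float:
    """Reward minus a KL penalty estimate against the reference (SFT) policy.

    A larger kl_coef keeps the policy closer to the SFT model.
    """
    return reward - kl_coef * (logp_policy - logp_ref)


def ppo_clip_objective(logp_new: float, logp_old: float,
                       advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes the incentive to move the ratio
    # outside [1 - clip_eps, 1 + clip_eps].
    return min(ratio * advantage, clipped * advantage)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    The implicit reward of a response is beta * (log pi - log pi_ref),
    so no separate reward model or RL loop is needed.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -log_sigmoid(margin)
```

Note that dpo_loss has the same form as reward_model_loss, with the reward scores replaced by scaled log-probability ratios between the policy and the reference model. In this sketch, beta plays a role similar to the KL coefficient in the PPO setup: it controls how strongly the policy is tied to the reference.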