RLHF and Constitutional AI: Aligning Large Language Models

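RLHF (Reinforcement Learning from Human Feedback) is a three-stage pipeline: supervised fine-tuning (SFT) → reward model trained on human preference comparisons → policy optimization with PPO. InstructGPT (Ouyang et al., 2022) is the key paper. Reward hacking is Goodhart's law in RL: the policy learns to exploit flaws in the learned reward model rather than improve on what the reward is meant to measure. A KL penalty against the reference (SFT) policy limits this exploitation by keeping the policy close to its starting point. DPO (Direct Preference Optimization) eliminates the RL loop and trains directly on preference pairs. SimPO (Simple Preference Optimization) uses a length-normalized reward. Constitutional AI (Anthropic) uses a critique-revision loop with AI-generated feedback. RLAIF: an AI labeler replaces the human labeler. Process reward...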
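The objectives named above can be made concrete with a small per-example sketch. It is illustrative only, using scalar sequence log-probabilities and the Python standard library rather than any training framework. The function names (`kl_shaped_reward`, `dpo_loss`, `simpo_loss`), the argument names, and the default values of `kl_coef`, `beta`, and `gamma` are assumptions made for this example. The margin term `gamma` comes from the published SimPO formulation and is not mentioned in the summary above.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def kl_shaped_reward(reward: float, policy_logp: float, ref_logp: float,
                     kl_coef: float = 0.1) -> float:
    """Reward-model score minus a KL penalty (RLHF with PPO).

    (policy_logp - ref_logp) is a single-sample estimate of the KL
    divergence between the policy and the reference (SFT) model.
    kl_coef is an illustrative value, not a recommended setting.
    """
    return reward - kl_coef * (policy_logp - ref_logp)


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Inputs are summed token log-probabilities of each full response.
    The implicit reward is beta * log(pi(y|x) / pi_ref(y|x)).
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


def simpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
               len_chosen: int, len_rejected: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO loss for one preference pair.

    The reward is the length-normalized (average per-token) log-probability
    scaled by beta; no reference model is used. gamma is a target margin
    between the chosen and rejected rewards (illustrative default).
    """
    reward_chosen = beta * policy_logp_chosen / len_chosen
    reward_rejected = beta * policy_logp_rejected / len_rejected
    return -log_sigmoid(reward_chosen - reward_rejected - gamma)
```

When the policy equals the reference, both DPO log-ratios are zero and the loss is log 2; it falls below that as the policy raises the chosen response's probability relative to the rejected one. In practice these quantities are computed batch-wise on tensors, with per-token log-probabilities summed over the response tokens.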

rlhf
alignment
llm