Forge Capsule

RLHF and Constitutional AI: Aligning Large Language Models

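RLHF (Reinforcement Learning from Human Feedback): three-stage pipeline, SFT → reward model → PPO. Key paper: InstructGPT (Ouyang et al., 2022).
Reward hacking: Goodhart's law in RL; the policy exploits flaws in the learned reward model instead of improving real quality. A KL penalty against the SFT (reference) policy limits this exploitation.
DPO (Direct Preference Optimization): eliminates the RL loop and the separate reward model; trains directly on preference pairs.
SimPO (Simple Preference Optimization): reference-free; uses the length-normalized log-probability of the response as the implicit reward.
Constitutional AI (Anthropic): critique-revision loop guided by written principles, with AI-generated feedback.
RLAIF: an AI labeler replaces the human labeler.
Process reward models (PRMs): reward each reasoning step, not just the final answer. Reportedly used in OpenAI o1 (training details are not public).
RLHF pitfalls: mode collapse, sycophancy, reward over-optimization.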
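The objectives named above can be sketched for a single preference pair using plain Python floats. This is a minimal illustration, not any library's implementation: the function names, the argument names, and the default values of `beta` and `gamma` are all assumptions chosen for readability, and sequence log-probabilities are assumed to be already summed over tokens.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss used to train an RLHF reward model."""
    return -log_sigmoid(r_chosen - r_rejected)


def kl_shaped_reward(rm_score: float, logp_policy: float, logp_ref: float,
                     beta: float = 0.1) -> float:
    """Reward-model score minus a per-sample KL penalty estimate.

    (logp_policy - logp_ref) is a single-sample estimate of the KL
    divergence between the policy and the reference (SFT) model.
    """
    return rm_score - beta * (logp_policy - logp_ref)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss: the implicit reward is beta * log(pi / pi_ref)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -log_sigmoid(margin)


def simpo_loss(logp_chosen: float, len_chosen: int,
               logp_rejected: float, len_rejected: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO loss: length-normalized log-prob as reward, no reference model.

    gamma is the target reward margin between chosen and rejected.
    """
    margin = beta * (logp_chosen / len_chosen
                     - logp_rejected / len_rejected) - gamma
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Hypothetical numbers: the policy prefers the chosen response
    # slightly more than the reference model does.
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
    print(simpo_loss(-10.0, 20, -12.0, 20))
    print(kl_shaped_reward(1.0, -2.0, -3.0))
```

In practice these would be computed over batches of tensors with automatic differentiation; the scalar form is used here only to make the structure of each objective visible. Note that DPO and SimPO differ in exactly two places: SimPO drops the reference log-probabilities and divides by response length.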
