Forge Capsule
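RLHF (Reinforcement Learning from Human Feedback): three stages, SFT → reward model → PPO. Key paper: InstructGPT (Ouyang et al., 2022). Reward hacking: Goodhart's law in RL; the policy exploits flaws in the learned reward model instead of improving actual quality. A KL penalty against the reference (SFT) policy limits reward model exploitation. DPO (Direct Preference Optimization): eliminates the RL loop and the separate reward model, trains directly on preference pairs. SimPO (Simple Preference Optimization): reference-free, uses the length-normalized log-probability as the reward. Constitutional AI (Anthropic): critique-revision loop guided by written principles, with AI-generated feedback. RLAIF: an AI labeler replaces the human labeler. Process reward models (PRMs): reward each reasoning step, not just the final answer. Often reported as used in OpenAI o1, though OpenAI has not published the training details. RLHF pitfalls: mode collapse, sycophancy, reward over-optimization.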
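The objectives named above can be sketched per example in plain Python. This is a minimal illustration, not a training implementation: the function names are hypothetical, the log-probabilities are assumed to be sequence-level sums supplied by the caller, and the default values of `beta` and `gamma` are illustrative assumptions rather than recommended settings.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def kl_shaped_reward(rm_score: float, logp_policy: float, logp_ref: float,
                     beta: float = 0.1) -> float:
    """RLHF-style reward: reward model score minus a KL penalty.

    (logp_policy - logp_ref) is a single-sample estimate of the KL term;
    it grows as the policy drifts away from the reference (SFT) model.
    """
    return rm_score - beta * (logp_policy - logp_ref)


def dpo_loss(logp_w: float, logp_l: float, ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (w = chosen, l = rejected).

    No reward model and no RL loop: the margin is the difference of
    policy-vs-reference log-ratios for the two responses.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)


def simpo_loss(logp_w: float, logp_l: float, len_w: int, len_l: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO loss for one preference pair.

    Reference-free: the reward is the length-normalized log-probability,
    and gamma is a target margin between chosen and rejected responses.
    """
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy drifted from the reference: the shaped reward is reduced.
    print(kl_shaped_reward(rm_score=1.0, logp_policy=-4.0, logp_ref=-5.0))
    # Policy favors the chosen response more than the reference does.
    print(dpo_loss(logp_w=-4.0, logp_l=-6.0, ref_logp_w=-5.0, ref_logp_l=-5.0))
    # Same per-token log-probability at different lengths: only gamma separates them.
    print(simpo_loss(logp_w=-10.0, logp_l=-20.0, len_w=10, len_l=20))
```

When the policy equals the reference, the DPO margin is zero and the loss is log 2; it falls as the policy raises the chosen response relative to the rejected one. In the SimPO sketch, dividing by length means a longer response does not score lower merely for having more tokens.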