Forge Capsule

Reinforcement Learning from Human Feedback: RLHF, PPO, and DPO

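RLHF pipeline: supervised fine-tuning (SFT) → reward model training → RL fine-tuning with PPO. Reward model: trained on human preference pairs (chosen > rejected), typically with a pairwise Bradley-Terry loss. PPO (Proximal Policy Optimization): clipped surrogate objective plus a learned value function; in RLHF, a KL penalty against the SFT policy is added to limit drift and curb reward hacking. KL coefficient: sets the trade-off between reward maximization and staying near the SFT policy. InstructGPT (OpenAI 2022): early large-scale application of RLHF to instruction following. DPO (Direct Preference Optimization, Rafailov et al. 2023): uses the closed-form optimum of the KL-regularized objective to train directly on preference pairs with a classification-style loss; no separate reward model or RL loop needed, often more stable in practice. RLAIF: AI feedback in place of human labels. Constitutional AI (Anthropic): self-critique and revision guided by written principles. GRPO (Group Relative Policy Optimization): introduced in DeepSeekMath and used in DeepSeek-R1; replaces the learned value function with a group-relative baseline over sampled responses. Reward hacking: over-optimizing a proxy reward at the expense of the true objective. Forge: RLHF k...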
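
The objectives named above can be sketched on scalar inputs in plain Python. This is a minimal illustration, not any library's implementation: it assumes per-sequence log-probabilities and rewards are already computed, and the function names, the default values (`clip_eps=0.2`, `kl_coef=0.1`, `beta=0.1`), and the `eps` stabilizer are assumptions chosen for the example. The GRPO function shows only the group-normalized advantage, not the full objective.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)) = min(x, 0) - log(1 + exp(-|x|))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(r_chosen - r_rejected)


def ppo_clip_objective(logp_new: float, logp_old: float,
                       advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum gives a pessimistic bound on the improvement.
    return min(ratio * advantage, clipped * advantage)


def kl_shaped_reward(reward: float, logp_policy: float,
                     logp_ref: float, kl_coef: float = 0.1) -> float:
    """Reward minus a sample-based KL penalty against the reference (SFT) policy."""
    return reward - kl_coef * (logp_policy - logp_ref)


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: each reward normalized by the group mean and std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

When the policy equals the reference, `dpo_loss` returns log 2 regardless of `beta`, which is the loss at the start of DPO training if the policy is initialized from the reference model. In `ppo_clip_objective`, a ratio above `1 + clip_eps` with a positive advantage earns no extra credit, which is what keeps each update close to the old policy.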
