Forge Capsule

Reinforcement Learning from Human Feedback: RLHF and Constitutional AI

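RLHF pipeline: supervised fine-tuning (SFT) → reward model training → PPO fine-tuning. The reward model is fit to preference pairs under the Bradley-Terry model: P(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l)). KL divergence penalty: the PPO reward is reduced by β·log(π_RL(y|x)/π_SFT(y|x)), which keeps the policy close to the SFT model. Constitutional AI (Anthropic): the model critiques and revises its own responses against a written set of principles, and AI-generated preference labels replace human labels for harmlessness; humans still write the principles. DPO (Rafailov et al., 2023): optimizes the LM directly on preference pairs, treating it as an implicit reward model, so no separate reward model or RL loop is trained. IPO (identity preference optimization): replaces DPO's log-sigmoid objective with a squared loss toward a fixed margin, which limits overfitting to the preference data. SPIN (self-play fine-tuning): the previous iteration of the model generates responses and serves as the reference, and the new model is trained to prefer the human-written SFT responses over them. Application: aligning Forge knowledge AI with curator preferences.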
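The objectives named above can be compared side by side on a single preference pair. The following is a minimal sketch, not a training implementation: it assumes scalar rewards and sequence-level log-probabilities have already been computed, and all function names, argument names, and default values for `beta` and `tau` are hypothetical choices made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return math.exp(log_sigmoid(r_chosen - r_rejected))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of one preference pair under Bradley-Terry."""
    return -log_sigmoid(r_chosen - r_rejected)


def kl_shaped_reward(reward: float, logp_rl: float, logp_sft: float,
                     beta: float = 0.1) -> float:
    """Reward used in the PPO stage: r - beta * log(pi_RL / pi_SFT)."""
    return reward - beta * (logp_rl - logp_sft)


def log_ratio_margin(pol_chosen: float, pol_rejected: float,
                     ref_chosen: float, ref_rejected: float) -> float:
    """Policy-vs-reference log-ratio of the chosen response minus that of
    the rejected response. All inputs are sequence log-probabilities."""
    return (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log sigmoid(beta * margin)."""
    h = log_ratio_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected)
    return -log_sigmoid(beta * h)


def ipo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             tau: float = 0.1) -> float:
    """IPO loss for one pair: squared distance of the margin from 1/(2*tau)."""
    h = log_ratio_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected)
    return (h - 1.0 / (2.0 * tau)) ** 2


if __name__ == "__main__":
    # Made-up log-probabilities for one (chosen, rejected) pair.
    pair = dict(pol_chosen=-1.0, pol_rejected=-5.0,
                ref_chosen=-3.0, ref_rejected=-2.0)
    print("DPO loss:", dpo_loss(**pair))
    print("IPO loss:", ipo_loss(**pair))
```

When the policy equals the reference, the margin is zero and the DPO loss is log 2. The DPO loss keeps decreasing as the margin grows, whereas the IPO loss is minimized at a finite margin of 1/(2·tau), which is the mechanism behind its resistance to overfitting.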
