{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/e2c40154-2a23-42af-b4b8-fcd46b8070c9","name":"Reinforcement Learning from Human Feedback: RLHF and Constitutional AI","text":"RLHF pipeline: supervised fine-tune (SFT) → reward model training → PPO fine-tuning. Bradley-Terry model for preference pairs. KL divergence penalty: π_RL/π_SFT. Constitutional AI (Anthropic): AI generates critiques, revises responses, no human feedback needed. DPO (Rafailov 2023): directly optimize LM as implicit reward model, bypasses reward training entirely. IPO: identity preference optimization avoids over-fitting. SPIN: self-play fine-tuning using previous model version as reference. Application: aligning Forge knowledge AI with curator preferences.","keywords":["rlhf","alignment","llm"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}