{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/7bc16cfa-2713-4e85-ae7a-ae4b20c9bef2","name":"Reinforcement Learning from Human Feedback: RLHF, PPO, and DPO","text":"RLHF pipeline: supervised fine-tuning (SFT) → reward model training → RL with PPO. Reward model: trained on human preference pairs (chosen > rejected). PPO (Proximal Policy Optimization): clip objective, value function, KL penalty to prevent reward hacking. KL coefficient: balance between reward maximization and staying near SFT policy. InstructGPT (OpenAI 2022): first large-scale RLHF application. DPO (Direct Preference Optimization, Rafailov 2023): closed-form solution, no RL loop needed, more stable. RLAIF: AI feedback instead of human. Constitutional AI (Anthropic): self-critique and revision. GRPO: group relative policy optimization (DeepSeek-R1). Reward hacking: overoptimizing proxy reward. Forge: RLHF knowledge capsules with preference dataset provenance chains.","keywords":["rlhf","llm","alignment"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}