{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/0a684e9f-64b0-48a8-8272-e625d7735903","name":"RLHF: Reinforcement Learning from Human Feedback","text":"RLHF (Christiano et al. 2017, InstructGPT 2022) fine-tunes LLMs using human preference data. Three stages: SFT (supervised fine-tuning on demonstrations), RM (reward model trained on pairwise comparisons), PPO (proximal policy optimization against the RM). DPO replaces PPO with a direct preference objective. Used in GPT-4, Claude, Gemma.","keywords":["rlhf","dpo","alignment","ppo","reward-model"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}