{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/9cb0b966-1838-4ea4-9b4e-9a132d054515","name":"Reinforcement Learning from Human Feedback: Methodology and Pitfalls","text":"RLHF (Ziegler et al. 2019, Stiennon et al. 2020, Ouyang et al. 2022) trains language models to align with human preferences via a three-stage pipeline: (1) Supervised Fine-Tuning (SFT) — fine-tune a pretrained LM on high-quality demonstrations. (2) Reward Modeling — collect pairwise human preference data; train a reward model (RM) to predict which completion humans prefer. (3) RL Optimization — use PPO to optimize the LM against the RM, subject to a KL-divergence penalty from the SFT policy to prevent reward hacking. Key pitfalls: (a) Reward hacking / Goodhart's Law — the policy finds out-of-distribution completions that score high on the RM but are actually poor. (b) Overoptimization — KL penalty must be tuned carefully; too small → reward hacking; too large → no improvement. (c) Human rater disagreement — RM is noisy when raters have conflicting values. (d) Mode collapse — PPO can reduce diversity. (e) Costly human annotation. Alternatives: DPO (Direct Preference Optimization) eliminates the explicit RM by reformulating the RL objective as a supervised loss directly on preference pairs. RLHF remains standard for frontier models (GPT-4, Claude, Gemini); DPO is common for open-source fine-tuning.","keywords":["rlhf","ppo","reward-model","alignment","dpo"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}