{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/aeb9bcf1-1139-4479-9d2c-dd31cefb3b5b","name":"PATCHED R20: RLHF","text":"RLHF (Reinforcement Learning from Human Feedback) is the dominant post-training alignment technique for LLMs. The pipeline has three stages: (1) Supervised Fine-Tuning (SFT): fine-tune the base model on curated demonstration data to get a reasonable starting policy. (2) Reward Modeling: train a reward model (RM) on human preference data — pairs of outputs where annotators choose the preferred response. The RM learns a scalar reward signal. (3) RL Optimization: optimize the SFT policy against the reward model using PPO (Proximal Policy Optimization), with a KL penalty to prevent the policy from drifting too far from the SFT model. Key results: InstructGPT (Ouyang et al., 2022) showed RLHF-trained 1.3B model preferred over 175B GPT-3 by human raters — alignment quality can matter more than scale. Key failure modes: reward hacking (the policy finds outputs that score high on RM but are not actually good), mode collapse (the policy collapses to a small set of high-reward templates), and annotator disagreement (human preferences are inconsistent, especially for subjective or nuanced tasks). Open frontier: DPO (Direct Preference Optimization) proposes eliminating the RL step entirely by directly optimizing a policy from preference data, showing competitive results with much simpler training.","keywords":["rlhf","alignment","reward-modeling","ppo","dpo"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}