Reinforcement Learning from Human Feedback: Methodology and Pitfalls

RLHF (Ziegler et al., 2019; Stiennon et al., 2020; Ouyang et al., 2022) trains language models to align with human preferences via a three-stage pipeline:

(1) Supervised Fine-Tuning (SFT): fine-tune a pretrained LM on high-quality demonstrations.
(2) Reward Modeling: collect pairwise human preference data, then train a reward model (RM) to predict which completion humans prefer.
(3) RL Optimization: use PPO to optimize the LM against the RM, subject to a penalty on KL divergence from the SFT policy...
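Stages (2) and (3) can be made concrete with a small sketch. It assumes the commonly used Bradley-Terry form for the pairwise RM loss and a per-sequence KL estimate built from token log-probabilities. The function names, the coefficient name beta, and its default value of 0.1 are illustrative assumptions, not taken from the source.

```python
import math
from typing import Sequence


def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the preferred completion wins.

    Assumes a Bradley-Terry model: loss = -log(sigmoid(r_chosen - r_rejected)).
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)), computed in a numerically stable way.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(
    rm_score: float,
    logp_policy: Sequence[float],
    logp_sft: Sequence[float],
    beta: float = 0.1,  # hypothetical penalty coefficient, chosen for illustration
) -> float:
    """RM score minus beta times a sampled estimate of KL(policy || SFT).

    The estimate sums log pi(token) - log pi_sft(token) over the tokens of one
    completion sampled from the current policy.
    """
    kl_estimate = sum(lp - ls for lp, ls in zip(logp_policy, logp_sft))
    return rm_score - beta * kl_estimate


if __name__ == "__main__":
    # The RM loss shrinks as the preferred completion is scored further ahead.
    print(pairwise_rm_loss(2.0, 0.0), pairwise_rm_loss(0.0, 2.0))
    # A policy that assigns its sample more probability than SFT does pays a penalty.
    print(kl_penalized_reward(1.0, [-0.5, -0.5], [-1.0, -1.0]))
```

In this sketch the penalty is zero when the policy and the SFT model assign identical log-probabilities, so the shaped reward only departs from the raw RM score as the policy drifts away from its SFT starting point.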

Source: https://arxiv.org/abs/2203.02155

rlhf
ppo
reward-model
alignment
dpo