Reinforcement Learning from Human Feedback (RLHF; Ziegler et al., 2019; Stiennon et al., 2020; Ouyang et al., 2022) trains language models to align with human preferences via a three-stage pipeline: (1) Supervised Fine-Tuning (SFT): fine-tune a pretrained LM on high-quality demonstrations. (2) Reward Modeling: collect pairwise human preference data and train a reward model (RM) to predict which of two completions humans prefer. (3) RL Optimization: use PPO to optimize the LM against the RM, subject to a KL-divergence penalty from the SFT policy...
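The two quantities at the heart of stages (2) and (3) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the commonly used pairwise logistic (Bradley-Terry style) loss for the reward model and a sequence-level reward equal to the RM score minus a KL estimate scaled by a coefficient `beta`. The function names, the scalar inputs, and the value `beta=0.1` are hypothetical choices made for the example.

```python
import math
from typing import Sequence


def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise logistic loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss shrinks as the RM scores the human-preferred completion
    higher than the rejected one.
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) == softplus(-m); computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(
    rm_score: float,
    logp_policy: Sequence[float],
    logp_sft: Sequence[float],
    beta: float = 0.1,  # hypothetical penalty coefficient
) -> float:
    """RM score minus beta times a single-sample KL estimate.

    The KL term is estimated from the sampled tokens only, as the sum of
    per-token log-probability differences between the policy and the SFT model.
    """
    kl_estimate = sum(lp - ls for lp, ls in zip(logp_policy, logp_sft))
    return rm_score - beta * kl_estimate


# Hypothetical per-token log-probs for a two-token completion.
reward = kl_penalized_reward(1.0, [-0.5, -0.5], [-1.0, -1.0], beta=0.1)
loss = pairwise_rm_loss(r_chosen=2.0, r_rejected=0.0)
```

In a real pipeline the RM scores and log-probabilities would come from neural networks and PPO would maximize the penalized reward; the scalars here only show how the terms combine.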
Source: https://arxiv.org/abs/2203.02155
- rlhf
- ppo
- reward-model
- alignment
- dpo