RLHF (Reinforcement Learning from Human Feedback) is a widely used post-training alignment technique for LLMs. The pipeline has three stages: (1) Supervised Fine-Tuning (SFT): fine-tune the base model on curated demonstration data to obtain a reasonable starting policy. (2) Reward Modeling: train a reward model (RM) on human preference data, i.e. pairs of outputs for the same prompt where annotators mark the preferred response. The RM learns to assign each response a scalar reward. (3) RL Optimization: optimize the SFT policy against...
- rlhf
- alignment
- reward-modeling
- ppo
- dpo
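
The reward-modeling stage (2) can be illustrated with a small sketch. The note does not say which loss the RM is trained with; the code below assumes the commonly used pairwise Bradley-Terry objective, `-log(sigmoid(r_chosen - r_rejected))`, and works on plain floats rather than a real model. The function names `pairwise_preference_loss` and `mean_preference_loss` are hypothetical and chosen for illustration only.

```python
import math
from typing import Iterable, Tuple


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Assumed Bradley-Terry loss for one preference pair.

    Computes -log(sigmoid(reward_chosen - reward_rejected)) in a
    numerically stable way, using the identity
    -log(sigmoid(m)) = log(1 + exp(-m)).
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        # exp(-margin) <= 1 here, so it cannot overflow.
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite log(1 + exp(-m)) as -m + log(1 + exp(m)).
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average the pairwise loss over (reward_chosen, reward_rejected) pairs."""
    losses = [pairwise_preference_loss(c, r) for c, r in pairs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Equal rewards: the RM is indifferent, so the loss is log(2).
    print(pairwise_preference_loss(0.0, 0.0))
    # The RM scores the preferred response higher: small loss.
    print(pairwise_preference_loss(2.0, 0.0))
    # The RM scores the rejected response higher: large loss.
    print(pairwise_preference_loss(0.0, 2.0))
```

The loss falls as the RM scores the annotator-preferred response above the rejected one, which is what pushes the scalar reward to agree with the human preference data. In an actual pipeline the two rewards would come from a neural network evaluated on (prompt, response) pairs, and the loss would be minimized by gradient descent.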