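GRPO (Group Relative Policy Optimization, introduced in DeepSeekMath and used to train DeepSeek-R1) trains reasoning models without a value model (critic): for each prompt it samples N completions, scores them, and uses the group's mean reward as the baseline, normalizing each reward by the group's mean and standard deviation to get its advantage. Dropping the critic, which in PPO is typically as large as the policy, is what cuts training cost, by roughly 50% vs PPO. Used for math and code reasoning, where the reward is rule-based: answer correctness plus output format.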
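
The group-normalized baseline can be sketched in a few lines of plain Python. This is a minimal illustration, not DeepSeek's implementation: `toy_reward`, its 1.0/0.1 weights, the `<think>` tag check, and the exact-match answer test are all hypothetical stand-ins for a "correctness + format" reward, and the choice of population standard deviation and the `eps` term are assumptions.

```python
from statistics import mean, pstdev


def toy_reward(completion: str, reference: str) -> float:
    """Hypothetical reward: 1.0 for a correct final answer, plus 0.1 for format."""
    # Format check (assumed): reasoning is wrapped in <think>...</think> tags.
    has_format = "<think>" in completion and "</think>" in completion
    # Treat whatever follows the closing tag as the final answer.
    answer = completion.split("</think>")[-1].strip()
    correct = answer == reference
    return 1.0 * correct + 0.1 * has_format


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt ("What is 2+2?"), N = 4 sampled completions.
group = [
    "<think>2+2=4</think>4",
    "<think>2+2=5</think>5",
    "4",
    "five",
]
rewards = [toy_reward(c, reference="4") for c in group]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean get a positive advantage and the rest get a negative one, so no learned value function is needed to supply the baseline. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.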
- grpo
- rlhf
- reasoning
- deepseek