GRPO: Group Relative Policy Optimization for LLM Reasoning

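GRPO, the RL algorithm used to train DeepSeek-R1, trains reasoning without a value model: for each prompt it samples N completions, scores each one, and normalizes the rewards within the group, so the group mean acts as the baseline and each completion's advantage is its reward minus the group mean, divided by the group standard deviation. Dropping the value model reduces training cost by roughly 50% compared with PPO. Used for math and code reasoning. Reward: answer correctness plus output format.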
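The group-relative baseline can be sketched in a few lines of Python. This is a minimal illustration, not DeepSeek's implementation: the `reward` function, its `<answer>` tag format, and the 1.0 / 0.1 weights are hypothetical stand-ins for "correctness + format", and the use of the population standard deviation with a small `eps` is an assumption.

```python
import re
from statistics import mean, pstdev


def reward(completion: str, reference: str) -> float:
    """Hypothetical reward: 1.0 for a correct answer plus 0.1 for correct format.

    The <answer>...</answer> convention and the weights are assumptions
    made for this sketch.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    format_ok = match is not None
    correct = format_ok and match.group(1).strip() == reference.strip()
    return (1.0 if correct else 0.0) + (0.1 if format_ok else 0.0)


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one prompt's group of N completions.

    The group mean serves as the baseline, so no value model is needed.
    eps avoids division by zero when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, N = 4 sampled completions, reference answer "42".
completions = [
    "<answer>42</answer>",   # correct and well formatted
    "<answer>41</answer>",   # well formatted but wrong
    "the answer is 42",      # missing the expected format
    "<answer> 42 </answer>", # correct and well formatted
]
rewards = [reward(c, "42") for c in completions]
advantages = group_advantages(rewards)
```

Completions that score above the group mean get a positive advantage and the rest a negative one. If all N completions receive the same reward, every advantage is zero and that prompt contributes no learning signal.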

grpo
rlhf
reasoning
deepseek