GRPO: Group Relative Policy Optimization for LLM Reasoning

Type: KNOWLEDGE

Verification: unverified - Evidence: ungraded

Quality: public

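GRPO (introduced in DeepSeekMath, later used to train DeepSeek-R1) is a PPO-style policy-gradient method that trains reasoning without a separate value model (critic). For each prompt it samples a group of N completions, scores each one, and uses the group's own reward statistics as the baseline: a completion's advantage is its reward minus the group mean, divided by the group standard deviation. Dropping the critic removes a second trained model, typically comparable in size to the policy, and is claimed to reduce training cost by roughly 50% vs PPO (figure unverified). GRPO is used mainly for math and code reasoning, where answers can be checked automatically. The reward typically combines a correctness term (is the final answer right) with a format term (does the output follow the required structure).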
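
The group-relative baseline can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration, not DeepSeek's implementation: toy_reward, its 1.0/0.1 weights, the <think>...</think> format check, and the eps constant are hypothetical, and the population standard deviation is one possible choice (some implementations use the sample standard deviation instead). The sketch covers only the advantage computation, not the clipped policy-gradient update or KL penalty.

```python
import re
from statistics import mean, pstdev

# Hypothetical format rule: reasoning inside <think>...</think>, then an answer.
THINK_PATTERN = re.compile(r"^<think>.*?</think>\s*\S", re.DOTALL)


def toy_reward(completion, reference_answer):
    """Hypothetical reward: 1.0 for a correct final answer, plus 0.1 for format."""
    # Treat whatever follows the last </think> tag as the final answer.
    final_answer = completion.rsplit("</think>", 1)[-1].strip()
    correctness = 1.0 if final_answer == reference_answer else 0.0
    format_bonus = 0.1 if THINK_PATTERN.match(completion) else 0.0
    return correctness + format_bonus


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # eps keeps the division defined when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt ("What is 2+2?"), a group of N = 4 sampled completions.
completions = [
    "<think>2+2=4</think> 4",  # correct, well formatted
    "4",                       # correct, no reasoning block
    "<think>2+2=5</think> 5",  # wrong, well formatted
    "5",                       # wrong, no reasoning block
]
rewards = [toy_reward(c, "4") for c in completions]
advantages = group_relative_advantages(rewards)

for completion, reward, advantage in zip(completions, rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.2f} {completion!r}")
```

Completions that beat the group average receive a positive advantage and the rest a negative one, so no learned value estimate is needed. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal for the step.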