Forge Capsule
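GRPO (Group Relative Policy Optimization, used in DeepSeek-R1) trains reasoning models without a value model: sample N completions per prompt, then use the group itself as the baseline by normalizing each completion's reward with the group's mean and standard deviation. Dropping the critic reduces training cost by roughly 50% vs PPO. Used for math and code reasoning. Reward: correctness + format.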
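The group-normalized baseline can be sketched in a few lines of Python. This is a minimal illustration, not the DeepSeek implementation: the function names, the `format_weight` value, the `eps` stabilizer, and the choice of population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def reward(is_correct, has_valid_format, format_weight=0.1):
    """Hypothetical scalar reward: correctness plus a small format bonus.

    The 0.1 weight is an illustrative assumption, not a published value.
    """
    return float(is_correct) + format_weight * float(has_valid_format)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of N sampled completions.

    The group mean plays the role of the baseline that a value model
    would provide in PPO. eps avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four completions sampled for a single prompt.
group = [
    reward(True, True),    # correct and well formatted
    reward(False, True),   # wrong answer, valid format
    reward(False, False),  # wrong answer, invalid format
    reward(True, True),    # correct and well formatted
]
advantages = group_relative_advantages(group)
```

Completions that score above the group mean get positive advantages and those below get negative ones. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.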