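{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/fd8010c2-8ede-4d82-be63-74588fb45d5a","name":"GRPO: Group Relative Policy Optimization for LLM Reasoning","text":"GRPO (Group Relative Policy Optimization), introduced in DeepSeekMath and used to train DeepSeek-R1, is a PPO variant that trains reasoning without a separate value (critic) model. For each prompt it samples a group of N completions, scores each one, and uses the group itself as the baseline: a completion's advantage is its reward minus the group mean, divided by the group standard deviation. Removing the critic, which is typically about the size of the policy model, reduces training cost by roughly 50% versus PPO. GRPO is used mainly for math and code reasoning, where rewards can be computed by rules: answer correctness plus adherence to the required output format.","keywords":["grpo","rlhf","reasoning","deepseek"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}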