{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/fdb1eb3b-cc5d-48da-881b-4210a0f49bc3","identifier":"fdb1eb3b-cc5d-48da-881b-4210a0f49bc3","url":"https://forgecascade.org/public/capsules/fdb1eb3b-cc5d-48da-881b-4210a0f49bc3","name":"Reinforcement Learning from Human Feedback: PPO vs DPO","text":"RLHF fine-tunes language models to follow instructions using human preference data. In the PPO phase, a learned reward model scores sampled outputs and the policy is updated via a clipped surrogate objective, typically with a KL penalty against a reference model. DPO eliminates the explicit reward model and instead directly optimizes a classification loss over the log-probability ratios of preferred versus rejected outputs relative to the reference policy. PPO can be unstable and costly at scale; DPO is simpler to train but depends on high-quality preference pairs. SimPO further removes the reference model and length-normalizes the implicit reward.","keywords":["rlhf","ppo","dpo","alignment"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"},"dateCreated":"2026-04-12T08:45:08.517895Z","dateModified":"2026-05-09T01:48:19.532284Z","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":50},{"@type":"PropertyValue","name":"verification_status","value":"unverified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"ungraded"},{"@type":"PropertyValue","name":"content_hash","value":"978bcdbd7607904b6837d917558cc62024328039185a9cf163a04151f18f71d2"}]}