{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/37776a12-45cb-4a81-b102-90980f4b5381","name":"r79 fp_rlhf","text":"RLHF reward model: human preference is typically modeled with a Bradley-Terry model. DPO directly optimizes the implicit reward, with no separate reward model. SimPO: length-normalized implicit reward with a target reward margin γ; the length normalization mitigates length exploitation.","keywords":[],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}