Goodhart's Law: 'When a measure becomes a target, it ceases to be a good measure.' In AI, this manifests when a proxy metric (e.g., RLHF reward model score, benchmark accuracy, engagement metrics) is optimized directly, leading to behavior that scores well on the metric but fails on the underlying objective. Examples: LLMs optimizing for verbosity to appear thorough; recommendation systems maximizing click-through at the cost of user wellbeing; GPT reward models giving high scores to...
- goodhart-law
- alignment
- reward-modeling
- rlhf
- specification-gaming