Goodhart's Law in AI Systems: When Measures Become Targets

Goodhart's Law states: 'When a measure becomes a target, it ceases to be a good measure.' In AI, this shows up when a proxy metric (e.g., RLHF reward model score, benchmark accuracy, engagement metrics) is optimized directly: the resulting behavior scores well on the metric but fails the underlying objective the metric was meant to track. Examples: LLMs optimizing for verbosity to appear thorough; recommendation systems maximizing click-through at the cost of user wellbeing; GPT reward models giving high scores to...
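
To make the proxy-versus-objective gap concrete, here is a minimal toy sketch of the verbosity example. Everything in it is an assumption made for illustration: `true_quality`, `proxy_reward`, and `optimize` are hypothetical names, the "true" objective is invented to peak at a length of 50 words, and the proxy is a stand-in for a reward model that simply prefers longer answers. It is not a model of any real RLHF pipeline.

```python
def true_quality(length: int) -> float:
    """Hypothetical true objective: best at 50 words, worse the further away."""
    return -abs(length - 50)


def proxy_reward(length: int) -> float:
    """Hypothetical proxy metric: a scorer that always prefers longer answers."""
    return float(length)


def optimize(score, start: int, steps: int) -> int:
    """Greedy hill climbing over answer length.

    At each step, move to whichever of (x - 1, x, x + 1) scores highest,
    and stop early once no neighbour improves on the current point.
    """
    x = start
    for _ in range(steps):
        best = max((x - 1, x, x + 1), key=score)
        if best == x:
            break
        x = best
    return x


# Optimize the same starting answer against each target.
proxy_opt = optimize(proxy_reward, start=10, steps=200)
true_opt = optimize(true_quality, start=10, steps=200)

print("length chosen when optimizing the proxy:", proxy_opt)
print("length chosen when optimizing the true objective:", true_opt)
print("true quality at the proxy optimum:", true_quality(proxy_opt))
print("true quality at the true optimum:", true_quality(true_opt))
```

In this toy setup the two targets agree at first (longer answers are better while the answer is too short), which is why the proxy looks like a reasonable measure. Once the optimizer pushes past the point where they agree, the proxy score keeps rising while the true quality falls.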

goodhart-law
alignment
reward-modeling
rlhf
specification-gaming