Constitutional AI: RLAIF with Self-Critique

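Constitutional AI (CAI; Anthropic, 2022) trains a model to critique and revise its own outputs against a written set of principles (the constitution). Two phases: SL-CAI (supervised fine-tuning on the model's self-revised responses) and RL-CAI (RL from AI feedback, RLAIF: a model guided by the constitution labels which of two responses is better, a preference model is trained on those labels, and it serves as the reward signal). Reduces harmful outputs without human-labeled harmlessness data; human feedback is still used for helpfulness. Basis for Anthropic's Claude models.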
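
The two phases can be sketched roughly as follows. This is a minimal illustration, not Anthropic's implementation: the `Model` type, the prompt templates, the two example principles, and `toy_model` are all hypothetical stand-ins. Principles are cycled in order here only to keep the example deterministic, and the preference step returns a hard A/B label as a simplification.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model call: prompt text in, completion out.
Model = Callable[[str], str]

# Illustrative principles only; these are NOT the actual constitution text.
PRINCIPLES: List[str] = [
    "Point out ways the response is harmful, unethical, or dangerous.",
    "Point out ways the response is misleading or disrespectful.",
]


def critique_and_revise(model: Model, prompt: str,
                        principles: List[str], rounds: int = 1) -> str:
    """SL-CAI data step: draft, then critique and revise against a principle."""
    response = model(prompt)
    for i in range(rounds):
        principle = principles[i % len(principles)]  # assumed: simple cycling
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Revision request: Rewrite the response to address the critique."
        )
    return response


def build_sl_dataset(model: Model, prompts: List[str],
                     principles: List[str]) -> List[Tuple[str, str]]:
    """Pairs of (prompt, revised response) for supervised fine-tuning."""
    return [(p, critique_and_revise(model, p, principles)) for p in prompts]


def ai_preference_label(model: Model, prompt: str, response_a: str,
                        response_b: str, principle: str) -> int:
    """RL-CAI data step: the feedback model picks the better response.

    Returns 0 if A is preferred, 1 otherwise. Labels like this would be
    used to train a preference model that supplies the RL reward.
    """
    answer = model(
        f"Prompt: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        f"Principle: {principle}\n"
        "Which response better follows the principle? Answer A or B."
    )
    return 0 if answer.strip().upper().startswith("A") else 1


def toy_model(text: str) -> str:
    """Toy stand-in for a real model so the sketch runs end to end."""
    if "Revision request" in text:
        return "revised answer"
    if "Critique request" in text:
        return "the draft is too blunt"
    if "Answer A or B" in text:
        return "B"
    return "draft answer"


if __name__ == "__main__":
    print(build_sl_dataset(toy_model, ["How do I pick a lock?"], PRINCIPLES))
    print(ai_preference_label(toy_model, "q", "x", "y", PRINCIPLES[0]))
```

In a real setup, `toy_model` would be replaced by calls to an actual language model, the SL-CAI pairs would feed a fine-tuning run, and the AI-generated labels would train a preference model rather than being used directly.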

cai
constitutional-ai
rlhf
anthropic