Constitutional AI: RLAIF with Self-Critique

Type: KNOWLEDGE

Verification: unverified - Evidence: ungraded

Quality: public

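Constitutional AI (CAI; Anthropic, 2022) trains a model to critique and revise its own outputs against a written set of principles (the constitution). It has two phases. SL-CAI: the model generates responses, critiques and revises them against principles sampled from the constitution, and is then fine-tuned with supervised learning on the revised responses. RL-CAI: an AI feedback model compares pairs of responses against the constitution, a preference model is trained on those AI-generated labels, and that preference model supplies the reward for RL (RL from AI feedback, RLAIF). The constitution therefore guides the labels; it is not itself the reward model. CAI reduces harmful outputs without human labels for harmlessness, although human feedback is still used for helpfulness. Anthropic reports using CAI in training its Claude models.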
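The two phases can be illustrated with a minimal Python sketch. It is not Anthropic's implementation: the names `Model`, `critique_and_revise`, `label_preference`, and `toy_model` are hypothetical, the prompt wording is invented for illustration, and the feedback step uses a hard A/B choice as a simplification. The first function produces a revised response of the kind SL-CAI fine-tunes on; the second produces a (chosen, rejected) pair of the kind a preference model could be trained on for RL-CAI.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: any function that maps a prompt string to a completion.
Model = Callable[[str], str]


def critique_and_revise(
    model: Model,
    prompt: str,
    principles: List[str],
    rounds: int = 1,
    rng: Optional[random.Random] = None,
) -> str:
    """SL-CAI data step (sketch): draft, then critique and revise per round."""
    rng = rng or random.Random(0)
    response = model(prompt)
    for _ in range(rounds):
        # Each round critiques against one principle sampled from the constitution.
        principle = rng.choice(principles)
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response  # (prompt, response) pairs become supervised training data


def label_preference(
    feedback_model: Model, prompt: str, a: str, b: str, principle: str
) -> Tuple[str, str]:
    """RL-CAI labeling step (sketch): return (chosen, rejected) for one pair."""
    verdict = feedback_model(
        f"Prompt: {prompt}\n(A) {a}\n(B) {b}\n"
        f"Which response better follows this principle: {principle}? "
        "Answer A or B."
    )
    # Assumption: a hard choice; anything not starting with "A" counts as B.
    if verdict.strip().upper().startswith("A"):
        return a, b
    return b, a


def toy_model(text: str) -> str:
    """Deterministic stand-in for a language model, for demonstration only."""
    if "Answer A or B" in text:
        return "B"
    if "Rewrite the response" in text:
        return "REVISED"
    if "Critique the response" in text:
        return "CRITIQUE"
    return "DRAFT"


if __name__ == "__main__":
    constitution = ["Choose the response that is least harmful."]
    print(critique_and_revise(toy_model, "Hello", constitution))
    print(label_preference(toy_model, "Hello", "x", "y", constitution[0]))
```

In a real pipeline, `toy_model` would be replaced by calls to an actual language model, the revised responses would feed supervised fine-tuning, and the labeled pairs would train a preference model whose score serves as the RL reward.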