Constitutional AI: Training Harmless Assistants via Self-Critique

Constitutional AI (CAI) is Anthropic's framework for training AI systems to be helpful, harmless, and honest through a two-phase process. Phase 1 (SL-CAI): the model critiques and revises its own outputs according to a list of principles (the 'constitution'), and is then fine-tuned on the revised responses. Phase 2 (RL-CAI): reinforcement learning from AI feedback (RLAIF) replaces human feedback on harmlessness: an AI evaluator compares pairs of responses against the constitution, producing preference data that trains the preference model used as the RL reward signal, in place of the human labels used in RLHF. Key results: CAI models show reduced harmful...
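The two phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `Model`, `critique_and_revise`, `ai_preference_label`, `toy_model`, and all prompt wording are hypothetical names and text introduced here, and the model is treated as a plain function from prompt to text.

```python
import random
from typing import Callable, Dict, List, Tuple

# Assumption: a language model is any callable mapping a prompt to text.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principles: List[str],
                        rounds: int = 1, seed: int = 0) -> Tuple[str, str]:
    """Phase 1 sketch: draft, critique against a principle, then revise.

    Returns a (prompt, revised_response) pair that could serve as a
    supervised fine-tuning example.
    """
    rng = random.Random(seed)
    response = model(prompt)
    for _ in range(rounds):
        principle = rng.choice(principles)  # one principle per round
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response


def ai_preference_label(evaluator: Model, prompt: str, response_a: str,
                        response_b: str, principle: str) -> Dict[str, str]:
    """Phase 2 sketch: an AI evaluator picks the response that better
    follows the principle, yielding one preference-data record."""
    verdict = evaluator(
        f"Prompt: {prompt}\n(A) {response_a}\n(B) {response_b}\n"
        f"Which response better follows this principle: {principle}? "
        "Answer A or B."
    )
    if verdict.strip().upper().startswith("A"):
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_model(text: str) -> str:
    """Deterministic stand-in for a real model, for demonstration only."""
    if "Answer A or B" in text:
        return "B"
    if "Rewrite the response" in text:
        return "revised response"
    if "Critique the response" in text:
        return "the draft could be safer"
    return "initial draft"


if __name__ == "__main__":
    principles = ["Choose the least harmful response."]
    print(critique_and_revise(toy_model, "How do I stay safe online?", principles))
    print(ai_preference_label(toy_model, "q", "draft one", "draft two", principles[0]))
```

In a real pipeline the `(prompt, revised_response)` pairs would become the fine-tuning set for SL-CAI, and the `chosen`/`rejected` records would train a preference model whose score is the reward during RL. Both steps are omitted here.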

ai-safety
rlhf
constitutional-ai
alignment