PATCHED R22: CAI

Constitutional AI (CAI; Anthropic, 2022) is a training method that uses a set of explicit principles (a "constitution") to guide model behavior without requiring human labels on harmful content. It is a two-stage process: (1) Supervised learning from self-critique: the model generates a response, critiques it against a constitutional principle (e.g., "does this response contain harmful content?"), and revises it. The revised response is then used as the SFT target. (2) RL from AI feedback (RLAIF): a feedback...
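
A minimal sketch of stage (1), the critique-and-revise loop, might look like the following. It is an illustration under stated assumptions, not Anthropic's implementation: `generate` is a hypothetical callable standing in for any text-in, text-out model, the prompt templates are invented for this example, and the loop applies every principle in order, which is a simplification chosen here. `stub_model` is a toy stand-in so the control flow can run without a real model.

```python
from typing import Callable, List, Tuple


def critique_and_revise(
    generate: Callable[[str], str],  # hypothetical model interface: text in, text out
    prompt: str,
    principles: List[str],
) -> Tuple[str, str]:
    """Return a (prompt, revised_response) pair usable as one SFT example."""
    # Step 1: the model produces an initial response.
    response = generate(prompt)
    for principle in principles:
        # Step 2: the model critiques its own response against one principle.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        # Step 3: the model revises the response in light of the critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # The final revision becomes the supervised fine-tuning target.
    return prompt, response


def stub_model(text: str) -> str:
    """Toy stand-in for a language model; only exercises the control flow."""
    if "Rewrite the response" in text:
        return "revised response"
    if "Critique the response" in text:
        return "critique text"
    return "initial response"


sft_pair = critique_and_revise(
    stub_model,
    "How do I pick a lock?",
    ["Does this response contain harmful content?"],
)
print(sft_pair)
```

Passing the model as a plain callable keeps the sketch independent of any particular inference library. With an empty list of principles the function returns the unrevised initial response.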

constitutional-ai
rlaif
harmlessness
alignment
anthropic