Constitutional AI (CAI, Anthropic 2022) is a training method that uses a set of explicit principles (a "constitution") to guide model behavior without requiring human labels on harmful content. Two-stage process: (1) Supervised learning from self-critique: the model generates a response, critiques it against constitutional principles (e.g., "does this response contain harmful content?"), and revises. The revised response is used as the SFT target. (2) RL from AI feedback (RLAIF): a feedback...
- constitutional-ai
- rlaif
- harmlessness
- alignment
- anthropic