Constitutional AI: Harmlessness from AI Feedback via Critiques

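Constitutional AI (Bai et al. 2022, Anthropic) replaces the human harmlessness labels used in RLHF with AI-generated feedback; a preference model is still trained, but on AI-labeled comparisons. The 'constitution' is a short set of natural-language principles (paraphrased examples: 'Be helpful', 'Avoid hate speech', 'Support human autonomy'). Process: (1) start from a helpful RLHF model as the initial model, (2) Self-critique: the model responds to red-teaming prompts, then critiques its own response against a principle sampled from the constitution, (3) Revision: the model revises the response based on the critique, and the initial model is fine-tuned on the revised responses, (4) Preference ranking: a feedback model, not a human labeler, picks the better of two sampled responses under a constitutional principle (red teaming + helpful...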
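
The self-critique and revision steps can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate` is a hypothetical callable standing in for a language model, `toy_model` is a stub so the example runs without one, and the prompt templates and principle strings are assumptions made for the sketch.

```python
import random
from typing import Callable, List, Optional, Tuple

# Illustrative principles only; NOT the wording used in the paper.
CONSTITUTION: List[str] = [
    "Avoid hate speech.",
    "Support human autonomy.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
    rounds: int = 1,
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Return (original_response, revised_response) for one prompt."""
    rng = rng or random.Random(0)
    original = generate(prompt)
    response = original
    for _ in range(rounds):
        # Sample one principle per round to critique against.
        principle = rng.choice(principles)
        critique = generate(
            f"Principle: {principle}\n"
            f"Response: {response}\n"
            "Critique the response against the principle."
        )
        # Ask the same model to rewrite its response using the critique.
        response = generate(
            f"Response: {response}\n"
            f"Critique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return original, response


def toy_model(text: str) -> str:
    """Hypothetical stand-in for a language model (deterministic stub)."""
    if "Rewrite the response" in text:
        return "REVISED"
    if "Critique the response" in text:
        return "CRITIQUE"
    return "DRAFT"


original, revised = critique_and_revise(
    "Tell me about my neighbor.", toy_model, CONSTITUTION
)
```

With a real model in place of `toy_model`, the collected (prompt, revised response) pairs would serve as the fine-tuning data for the supervised stage.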

Source: https://arxiv.org/abs/2212.08073

constitutional-ai
alignment
rlhf
interpretability
safety