Constitutional AI: Harmlessness from AI Feedback

Constitutional AI (CAI; Bai et al., 2022) trains harmless AI assistants using AI-generated feedback rather than human labels for harmful outputs. Training proceeds in two stages: (1) Supervised learning stage (SL-CAI): the model is asked to critique its own harmful responses according to a set of principles (the constitution), then revise them. The revised, less harmful responses form a supervised fine-tuning dataset. (2) Reinforcement learning from AI feedback (RLAIF): an AI...
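The critique-and-revise loop of stage (1) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `generate` stands in for any language-model call, `toy_generate` is a canned stub so the example runs without a model, the two principles in `CONSTITUTION` are made up for illustration, and the prompt templates and random choice of principle per round are assumptions of this sketch.

```python
import random
from typing import Callable, Dict, List, Optional

# Hypothetical principles for illustration only; not quoted from the paper.
CONSTITUTION: List[str] = [
    "Identify ways the response is harmful, unethical, or dangerous.",
    "Identify ways the response encourages illegal activity.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
    n_rounds: int = 1,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Return one fine-tuning example: the prompt paired with a revised response."""
    rng = rng or random.Random(0)
    response = generate(prompt)  # initial, possibly harmful response
    for _ in range(n_rounds):
        principle = rng.choice(principles)  # assumption: one principle per round
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Revision request: rewrite the response to address the critique."
        )
    return {"prompt": prompt, "completion": response}


def build_sft_dataset(
    prompts: List[str],
    generate: Callable[[str], str],
    principles: List[str],
) -> List[Dict[str, str]]:
    """Collect revised responses into a supervised fine-tuning dataset."""
    return [critique_and_revise(p, generate, principles) for p in prompts]


def toy_generate(text: str) -> str:
    """Canned stand-in for a model call (hypothetical, for demonstration)."""
    if "Revision request" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Critique request" in text:
        return "The response gives harmful instructions."
    return "Sure, here is how to do it."


dataset = build_sft_dataset(["How do I pick a lock?"], toy_generate, CONSTITUTION)
print(dataset)
```

Only the final revision is stored as the training target; the intermediate critique is discarded. A real pipeline would replace `toy_generate` with an actual model call and would likely need few-shot examples in the critique and revision prompts.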

Source: https://arxiv.org/abs/2212.08073

constitutional-ai
rlaif
harmlessness
alignment
anthropic