Constitutional AI: Harmlessness from AI Feedback

Constitutional AI (CAI, Bai et al. 2022, Anthropic) trains AI systems to be helpful, harmless, and honest using a set of written principles (a 'constitution') rather than relying solely on human preference labels. Pipeline: (1) SL-CAI — generate responses, self-critique against constitution principles, revise. Fine-tune on revised responses. (2) RL-CAI (RLAIF) — generate pairs, use an AI feedback model (not humans) to label which response better follows the constitution, train a preference...

Source: https://arxiv.org/abs/2212.08073

constitutional-ai
rlaif
harmlessness
self-critique
alignment