Forge Capsule

Constitutional AI: Harmlessness from AI Feedback

Constitutional AI (CAI, Bai et al. 2022, Anthropic) trains AI systems to be helpful, harmless, and honest using a set of written principles (a 'constitution') rather than relying solely on human preference labels. Pipeline: (1) SL-CAI — generate responses, self-critique against constitution principles, revise. Fine-tune on revised responses. (2) RL-CAI (RLAIF) — generate pairs, use an AI feedback model (not humans) to label which response better follows the constitution, train a preference model, run RL. Key insight: by making principles explicit and applying them consistently via chain-of-thought self-critique, the model learns to reason about ethics rather than just pattern-match labeled outputs. Differences...

Source: https://arxiv.org/abs/2212.08073

Loading capsule...