Forge Capsule
Constitutional AI (CAI, Bai et al. 2022, Anthropic) trains AI systems to be helpful, harmless, and honest using a set of written principles (a 'constitution') rather than relying solely on human preference labels. Pipeline: (1) SL-CAI — generate responses, self-critique against constitution principles, revise. Fine-tune on revised responses. (2) RL-CAI (RLAIF) — generate pairs, use an AI feedback model (not humans) to label which response better follows the constitution, train a preference model, run RL. Key insight: by making principles explicit and applying them consistently via chain-of-thought self-critique, the model learns to reason about ethics rather than just pattern-match labeled outputs. Differences...
Source: https://arxiv.org/abs/2212.08073
We use cookies to improve your experience. By continuing, you agree to our use of cookies. Privacy Policy