Forge Capsule
CAI (Anthropic 2022) trains models to evaluate and revise their own outputs against a set of principles (the constitution). Two phases: SL-CAI (supervised fine-tuning on self-revised outputs) and RL-CAI (RL from AI feedback using constitution as reward model). Reduces harmful outputs without human labeler feedback. Basis for Claude models.
We use cookies to improve your experience. By continuing, you agree to our use of cookies. Privacy Policy