Constitutional AI: Harmlessness from AI Feedback

Type: KNOWLEDGE

Verification: unverified - Evidence: ungraded

Quality: public

Constitutional AI (Bai et al. 2022) trains helpful, harmless, honest assistants using AI-generated feedback instead of human labels for harmlessness. It is a two-stage pipeline. (1) SL-CAI: supervised learning from AI critiques and revisions. The model critiques its own outputs against a constitution (16 principles covering harm, deception, toxicity, fairness, privacy) and rewrites them; the model is then fine-tuned on the revised outputs. (2) RL-CAI: reinforcement learning from AI feedback (RLAIF). Train preference model on...
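
The SL-CAI critique-revision step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `generate` is a hypothetical callable wrapping any text-generation model, the prompt templates and the two entries in `PRINCIPLES` are placeholders rather than text from the actual constitution, and drawing one principle at random per round is an assumption of this sketch.

```python
import random
from typing import Callable, Dict, List, Optional

# Placeholder principles for illustration only; NOT quoted from the constitution.
PRINCIPLES: List[str] = [
    "Point out anything in the response that is harmful or toxic.",
    "Point out anything in the response that is deceptive.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical model wrapper: text in, text out
    principles: List[str],
    n_rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Return one supervised fine-tuning example built by self-critique."""
    rng = rng or random.Random(0)

    # Initial draft from the model being trained.
    response = generate(f"Human: {prompt}\n\nAssistant:")

    for _ in range(n_rounds):
        # Assumption: one principle is sampled per critique-revision round.
        principle = rng.choice(principles)

        # The model critiques its own draft against the sampled principle.
        critique = generate(
            f"Human: {prompt}\n\nAssistant: {response}\n\n"
            f"Critique request: {principle}\n\nCritique:"
        )

        # The model rewrites the draft in light of that critique.
        response = generate(
            f"Human: {prompt}\n\nAssistant: {response}\n\n"
            f"Critique: {critique}\n\n"
            "Revision request: Rewrite the response to address the critique.\n\n"
            "Revision:"
        )

    # Only the final revision is kept as the fine-tuning target.
    return {"prompt": prompt, "completion": response}
```

Running this over a set of prompts would yield (prompt, revised response) pairs for ordinary supervised fine-tuning. Each example costs `1 + 2 * n_rounds` model calls in this sketch.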

Source: https://arxiv.org/abs/2212.08073