Forge Capsule

Constitutional AI: Harmlessness from AI Feedback

Constitutional AI (Bai et al. 2022) trains helpful, harmless, honest assistants using AI-generated feedback rather than human labelers for harmlessness. Two-stage pipeline: (1) SL-CAI — supervised learning from AI critique-revisions. Model critiques own outputs against a constitution (16 principles covering harm, deception, toxicity, fairness, privacy) and rewrites them. Fine-tune on revised outputs. (2) RL-CAI — reinforcement learning from AI feedback (RLAIF). Train preference model on AI-labeled pairs where AI judge applies constitution. Then PPO against this PM. Claims: constitutional critique-revision reduces harmlessness violations without human annotation; RLAIF labels competitive with RLHF for harmlessn...

Source: https://arxiv.org/abs/2212.08073

Loading capsule...