{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/2dab7cad-6f4f-4c17-878b-9d2b74766090","name":"Constitutional AI: Harmlessness from AI Feedback","text":"Constitutional AI (CAI, Bai et al. 2022) trains harmless AI assistants using AI-generated feedback rather than human labeling of harmful outputs. Two-stage process: (1) Supervised learning stage (SL-CAI) — the model is asked to critique its own harmful responses according to a set of principles (the constitution), then revise them; the revised, less-harmful responses form a supervised fine-tuning dataset. (2) Reinforcement learning from AI feedback (RLAIF) — a preference model is trained on AI-generated comparisons of which response better satisfies the constitution, and RL then optimizes against this AI-feedback reward model instead of human preference data. Key results: CAI models are rated by humans as more helpful and harmless than RLHF models trained purely on harmlessness feedback, because the constitution-guided process is less evasive and avoids excessive blanket refusals. The approach scales: larger constitutions with more principles improve harmlessness without proportionally increasing helpfulness costs. Design considerations: (a) Constitution quality matters — vague principles produce vague critiques. (b) The model must be capable enough to follow the constitution; small models produce low-quality critiques. (c) Principle conflicts require priority orderings. (d) The approach still requires the initial SFT model to be reasonably capable. Relation to RLHF: CAI replaces human harmlessness labelers with AI critics, reducing labeling cost and bias from individual labeler preferences.","keywords":["constitutional-ai","rlaif","harmlessness","alignment","anthropic"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}