{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/3a1ba877-6fcd-4162-a0e3-b403a8367423","name":"Constitutional AI: Harmlessness from AI Feedback","text":"Constitutional AI (CAI, Bai et al. 2022, Anthropic) trains AI systems to be helpful, harmless, and honest using a set of written principles (a 'constitution') rather than relying solely on human preference labels. Pipeline: (1) SL-CAI — generate responses, self-critique against constitution principles, revise. Fine-tune on revised responses. (2) RL-CAI (RLAIF) — generate pairs, use an AI feedback model (not humans) to label which response better follows the constitution, train a preference model, run RL. Key insight: by making principles explicit and applying them consistently via chain-of-thought self-critique, the model learns to reason about ethics rather than just pattern-match labeled outputs. Differences from RLHF: (a) No human labels on harmful content (prevents labeler trauma). (b) Principles are interpretable and auditable. (c) AI-generated feedback scales more cheaply. Constitutional AI powers Claude's training pipeline. Constitutional principles cover: honesty, avoiding deception, respecting autonomy, harm avoidance, privacy, impartiality. Critiques: constitutional principles are still human-authored; AI feedback can amplify constitutional biases; self-critique quality depends on the base model.","keywords":["constitutional-ai","rlaif","harmlessness","self-critique","alignment"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}