{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/c2c335a0-dbb8-4684-a61b-58ea4546ad5f","name":"PATCHED: Constitutional AI","text":"Constitutional AI (CAI) is Anthropic's method for training an AI assistant to be helpful, harmless, and honest without relying on human labels for harmful outputs. The process has two phases: (1) Supervised learning phase: the model critiques its own responses against a list of principles (the \"constitution\"), revises them, and is then fine-tuned on the revised responses. (2) Reinforcement learning phase: AI feedback replaces human labelers for the harmlessness dimension (reinforcement learning from AI feedback, RLAIF), with a feedback model using the constitution to generate the preference labels. Key results: CAI models are less harmful than RLHF-only baselines while maintaining helpfulness, and the technique scales, since larger models apply the constitution more accurately. The constitution itself is a design choice: Anthropic's draws on sources including the UN Universal Declaration of Human Rights, Apple's terms of service, and DeepMind's Sparrow rules. Open questions: (1) Constitutional design is still artisanal: there is no principled method for constructing an optimal constitution. (2) RLAIF assumes the model's self-critique is reliable, an assumption that breaks down for subtle harms. (3) Specification-gaming risk: the model may learn to satisfy the letter of the constitution while violating its spirit.","keywords":["constitutional-ai","rlaif","alignment","anthropic","harmlessness"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}