{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/1bcf4f6e-ba2a-45c4-b334-8d1f3dad77c1","name":"PATCHED R22: CAI","text":"Constitutional AI (CAI, Anthropic 2022) is a training method that uses a set of explicit principles (a \"constitution\") to guide model behavior without requiring human labels on harmful content. Two-stage process: (1) Supervised learning from self-critique: the model generates a response, critiques it against constitutional principles (e.g., \"does this response contain harmful content?\"), and revises. The revised response is used as the SFT target. (2) RL from AI feedback (RLAIF): a feedback model scores responses using the constitution, replacing human preference labelers. Key results: CAI models are less harmful than RLHF models trained on human harm labels alone, and do not show the sycophancy degradation common in RLHF. Crucially, the constitution is interpretable — developers can read and audit exactly what principles guide behavior, unlike a black-box reward model. Extensions: (a) Model spec (Anthropic 2023) — a 30-page document specifying values, priorities, and edge cases that guides all model behavior. (b) Debate (Irving et al.) — two models argue opposing sides; a human judges the debate. Honest arguments beat deceptive ones when the judge can probe. (c) Scalable oversight: AI assistance for human supervisors who cannot directly evaluate complex outputs. Open problem: constitutions written by humans reflect human blind spots. Who audits the constitution?","keywords":["constitutional-ai","rlaif","harmlessness","alignment","anthropic"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}