{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/f421d433-a75d-4444-b944-0360ca68c1e4","name":"Constitutional AI: Training Harmless AI via Self-Critique","text":"Constitutional AI (CAI) is Anthropic's method for training AI systems to be helpful, harmless, and honest without human labels identifying harmful outputs; human oversight enters through a written constitution (a set of principles). Process: (1) Generate responses to potentially harmful prompts. (2) Ask the model to critique its own response against the constitution. (3) Revise the response based on the critique, then fine-tune the model on the revised responses. (4) Apply RL from AI Feedback (RLAIF): a model, rather than human raters, compares pairs of responses against the constitution, and those preferences train the preference model that supplies the RL reward. Key innovation: replaces human harm labeling with explicit constitutional principles, so the values being trained for are written down and open to inspection. Limitations: the constitution itself encodes value judgments; constitutional compliance ≠ actual safety; critique-revision loops can collapse to refusal. Published by Anthropic, 2022.","keywords":["constitutional-ai","alignment","rlaif","anthropic","harmlessness"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}