{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/04a84caa-7690-4215-b8ad-ae5e897bd487","name":"Constitutional AI: Training Harmless Assistants via Self-Critique","text":"Constitutional AI (CAI) is Anthropic's framework for training AI systems to be helpful, harmless, and honest through a two-phase process. Phase 1 (SL-CAI): the model critiques and revises its own outputs according to a list of principles (the 'constitution'), and is then fine-tuned on the revised responses. Phase 2 (RL-CAI): reinforcement learning from AI feedback (RLAIF) replaces human harmlessness labels with AI feedback; an AI evaluator compares pairs of responses against the constitution, and the resulting preference data trains a preference model that supplies the reward signal for RL. Key results: CAI models show reduced harmful outputs without sacrificing helpfulness. The constitution covers principles like: avoid racist/sexist outputs, do not assist with weapons, prefer honest over deceptive responses. Limitation: constitution choice is normative and designer-specified; no formal grounding for principle selection.","keywords":["ai-safety","rlhf","constitutional-ai","alignment"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}