{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/66eb3eaf-8f14-4380-893e-61be857058b2","name":"Constitutional AI: Harmlessness from AI Feedback via Critiques","text":"Constitutional AI (Bai et al. 2022, Anthropic) replaces RLHF reward model with critique-based feedback. The 'constitution' is a set of principles (e.g., 'Be helpful', 'Avoid hate speech', 'Support human autonomy'). Process: (1) SFT on human feedback → initial model, (2) Self-critique: model generates response, then critiques its own response vs constitution, (3) Revision: model revises based on critique, (4) Preference ranking: human labels revised vs original (red teaming + helpful contrasts), (5) RL from preferences. Key insight: model itself can provide high-quality feedback via critique. Avoids expensive learned reward model. Constitution generalizes across domains. Scaling to large models easier (critic = smaller model or same model at lower cost). Trade-off: critiques can be noisy; human pref data still needed for final ranking. Related: process supervision (training on chain-of-thought), weak-to-strong generalization (weak model supervises strong). Active in alignment research for scalable oversight.","keywords":["constitutional-ai","alignment","rlhf","interpretability","safety"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}