Forge Capsule

Constitutional AI: Harmlessness from AI Feedback via Critiques

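Constitutional AI (Bai et al. 2022, Anthropic) replaces the human harmlessness labels used in RLHF with AI-generated feedback guided by a written 'constitution', a set of natural-language principles (e.g., 'Be helpful', 'Avoid hate speech', 'Support human autonomy'). Process: (1) Start from an initial model, a helpful assistant trained with RLHF on human helpfulness feedback, (2) Self-critique: the model responds to a red-teaming prompt, then critiques its own response against a principle drawn from the constitution, (3) Revision: the model rewrites the response based on the critique, and the initial model is fine-tuned on the revised responses, (4) Preference ranking: the fine-tuned model samples pairs of responses and an AI feedback model, not a human, picks the better one under a constitutional principle; these AI harmlessness labels are combined with human helpfulness labels to train a preference model, (5) RL from preferences: the policy is trained with RL using the preference model as the reward signal (RL from AI Feedback, RLAIF). Key insight: the model itself can provide high-quality feedback via critique. This avoids expensive human labeling of harmful outputs; a learned preference model is still trained, but on AI-generated labels. Constitution generalizes across domains. Scaling to large models easier (critic =...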
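The critique, revision, and AI preference-labeling steps can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `generate` callable, the prompt wording, the two-item `CONSTITUTION` list, and `toy_model` are all hypothetical stand-ins, and fine-tuning, preference-model training, and RL are left out.

```python
import random
from typing import Callable, Tuple

# Hypothetical principles for illustration; not the paper's actual wording.
CONSTITUTION = [
    "Avoid hate speech",
    "Support human autonomy",
]

# Assumed interface: any text-in, text-out language model call.
Generate = Callable[[str], str]


def critique_and_revise(generate: Generate, prompt: str, rng: random.Random,
                        rounds: int = 1) -> Tuple[str, str]:
    """Return (original, revised) responses for one prompt."""
    original = generate(prompt)
    response = original
    for _ in range(rounds):
        # Each round critiques against one randomly drawn principle.
        principle = rng.choice(CONSTITUTION)
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return original, response


def ai_preference(judge: Generate, prompt: str, a: str, b: str,
                  rng: random.Random) -> int:
    """Return 0 if the judge prefers response a, 1 if it prefers b."""
    principle = rng.choice(CONSTITUTION)
    verdict = judge(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {a}\n(B) {b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    return 0 if verdict.strip().upper().startswith("A") else 1


def toy_model(text: str) -> str:
    """Stand-in for a language model so the sketch runs without one."""
    if "Answer A or B" in text:
        return "B"
    if "Rewrite the response" in text:
        return "revised response"
    if "Critique the response" in text:
        return "critique text"
    return "draft response"


rng = random.Random(0)

# Supervised stage: the (prompt, revised) pair becomes fine-tuning data.
original, revised = critique_and_revise(toy_model, "example prompt", rng)
sft_example = ("example prompt", revised)

# RL stage: two samples (placeholders here) are labeled by the AI judge;
# the labeled row becomes preference-model training data.
candidate_a, candidate_b = "sample one", "sample two"
label = ai_preference(toy_model, "example prompt", candidate_a, candidate_b, rng)
preference_example = ("example prompt", candidate_a, candidate_b, label)
```

In a real setup `generate` and `judge` would wrap actual model calls, and the collected `sft_example` and `preference_example` rows would feed the fine-tuning and preference-model training steps respectively.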

Source: https://arxiv.org/abs/2212.08073
