Forge Capsule

Constitutional AI: Harmlessness via Self-Critique

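Constitutional AI (CAI; Bai et al., 2022) trains a helpful, harmless assistant without human labels identifying harmful outputs. SL-CAI (supervised stage): the model critiques and revises its own harmful responses against a written constitution, then is fine-tuned on the revisions. RL-CAI (RL stage): a preference model is trained on AI-generated harmlessness comparisons, then used as the reward signal for RL. Result: harmlessness comparable to RLHF at far lower human-labeling cost.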
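The two data-generation steps can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Generate` interface, the prompt wording, `stub_model`, and the hard A/B label are all assumptions made for the sketch (a real pipeline would call an actual language model, and could use soft labels instead of a hard choice).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: any function mapping a prompt string to a completion.
Generate = Callable[[str], str]


def critique_and_revise(
    generate: Generate, prompt: str, principles: List[str]
) -> Tuple[str, str]:
    """SL-CAI-style step: draft a response, then critique and revise it once
    per principle. Returns (prompt, final_revision) as a fine-tuning pair."""
    response = generate(f"Human: {prompt}\n\nAssistant:")
    for principle in principles:
        critique = generate(
            f"Critique request: {principle}\n\nResponse: {response}\n\nCritique:"
        )
        response = generate(
            "Revision request: rewrite the response to address the critique.\n\n"
            f"Response: {response}\n\nCritique: {critique}\n\nRevision:"
        )
    return prompt, response


def label_preference(
    judge: Generate, prompt: str, a: str, b: str, principle: str
) -> Dict[str, str]:
    """RL-CAI-style step: ask a model which of two responses better follows
    a principle, and record the result as a comparison for preference-model
    training. Simplification: a hard A/B choice parsed from text."""
    question = (
        f"Consider this conversation:\n\nHuman: {prompt}\n\n{principle}\n\n"
        f"(A) {a}\n(B) {b}\n\nAnswer with A or B:"
    )
    answer = judge(question).strip().upper()
    chosen, rejected = (a, b) if answer.startswith("A") else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def stub_model(text: str) -> str:
    """Stand-in for a language model so the sketch runs without one."""
    if text.startswith("Critique request"):
        return "The response gives unsafe detail."
    if text.startswith("Revision request"):
        return "I can't help with that, but here is why."
    if text.startswith("Consider this conversation"):
        return "B"
    return "Sure, here is how."
```

With a real model in place of `stub_model`, the pairs returned by `critique_and_revise` would form the supervised fine-tuning set, and the records returned by `label_preference` would form the comparison set for the preference model.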

Source: https://arxiv.org/abs/2212.08073
