Constitutional AI: Harmlessness via Self-Critique

Type: KNOWLEDGE

Verification: unverified - Evidence: ungraded

Quality: public

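Constitutional AI (CAI; Bai et al. 2022) trains a helpful, harmless assistant without human labels identifying harmful outputs. Human oversight of harmlessness comes from a short list of written principles (the "constitution"); helpfulness still relies on human feedback. SL-CAI (supervised stage): the model critiques and revises its own responses to red-team prompts, and a model is then fine-tuned on the revised responses. RL-CAI (RL stage, also called RLAIF): a model compares pairs of SL-CAI samples against a constitutional principle, a preference model is trained on these AI-generated comparisons, and the policy is trained with RL against that preference model. The paper reports that the resulting models are rated at least as harmless as RLHF baselines trained on human harmlessness labels, while being less evasive.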
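The two stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `Generate`, `critique_and_revise`, `ai_comparison`, `toy_model`, and `toy_judge` are hypothetical names, the principles and prompt templates are illustrative wording, and the "models" are plain callables so the sketch runs without any ML library.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-in for a language model: prompt text in, completion out.
Generate = Callable[[str], str]

# Illustrative principles only; not the wording used in the paper.
PRINCIPLES = [
    "Identify ways the response is harmful, unethical, or illegal.",
    "Identify ways the response could encourage dangerous behavior.",
]


def critique_and_revise(
    generate: Generate,
    prompt: str,
    n_rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> Tuple[str, List[str]]:
    """SL-CAI style loop: draft, then repeatedly critique and revise.

    Returns the final revision and the full list of drafts.
    """
    rng = rng or random.Random(0)
    response = generate(prompt)
    history = [response]
    for _ in range(n_rounds):
        principle = rng.choice(PRINCIPLES)  # one principle sampled per round
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}\nCritique:"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Revision request: Rewrite the response to address the critique.\n"
            "Revision:"
        )
        history.append(response)
    return response, history


def ai_comparison(
    judge: Generate, prompt: str, a: str, b: str, principle: str
) -> Tuple[str, str]:
    """RL-CAI style label: ask a judge model which response better fits a principle.

    Returns (chosen, rejected). Simplification: a hard A/B choice is used here.
    """
    question = (
        f"Consider this prompt: {prompt}\n{principle}\n"
        f"(A) {a}\n(B) {b}\nAnswer:"
    )
    verdict = judge(question).strip().upper()
    return (a, b) if verdict.startswith("A") else (b, a)


def toy_model(text: str) -> str:
    # Toy deterministic "model" for demonstration; a real system would call an LLM.
    if text.rstrip().endswith("Critique:"):
        return "The response gives unsafe instructions."
    if text.rstrip().endswith("Revision:"):
        return "I can't help with that, but here is why it is unsafe."
    return "Sure, here is how."


def toy_judge(text: str) -> str:
    # Toy judge: answers "B" when option (B) contains a refusal, otherwise "A".
    b_start = text.index("(B)")
    return "B" if "can't" in text[b_start:] else "A"


if __name__ == "__main__":
    final, drafts = critique_and_revise(toy_model, "How do I pick a lock?")
    # Supervised fine-tuning pairs would be (prompt, final revision).
    print(final)
    chosen, rejected = ai_comparison(
        toy_judge, "How do I pick a lock?", drafts[0], final,
        "Choose the less harmful response.",
    )
    # Preference-model training pairs would be (chosen, rejected).
    print(chosen)
```

In this sketch, the (prompt, final revision) pairs would form the supervised fine-tuning set, and the (chosen, rejected) pairs would form the preference-model training set. Passing the models in as callables keeps the control flow separate from whichever inference backend is actually used.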

Source: https://arxiv.org/abs/2212.08073