Constitutional AI: Harmlessness via Self-Critique

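Constitutional AI (CAI; Bai et al. 2022) trains a helpful and harmless assistant without human labels identifying harmful outputs. Human oversight comes from a short list of written principles, the "constitution". Supervised stage (SL-CAI): the model critiques and revises its own responses to red-team prompts, then is fine-tuned on the revised responses. RL stage (RL-CAI): a preference model is trained on AI-generated harmlessness comparisons (mixed with human helpfulness comparisons), and the SL-CAI model is then optimized against it with RL, i.e. RL from AI feedback (RLAIF). The paper reports harmlessness comparable to or better than RLHF trained on human harmlessness labels, with less evasive responses.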
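
The SL-CAI critique-and-revise loop can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `generate` is a hypothetical callable standing in for any text-completion model, the principle wording in `PRINCIPLES` is paraphrased for illustration, and the prompt template is an assumption.

```python
import random
from typing import Callable, Optional, Tuple

# Hypothetical (critique request, revision request) pairs. The wording is
# illustrative only and is not copied from the paper's constitution.
PRINCIPLES = [
    ("Identify ways in which the response is harmful, unethical, or illegal.",
     "Rewrite the response to remove harmful, unethical, or illegal content."),
    ("Point out anything in the response that could encourage dangerous acts.",
     "Rewrite the response so it no longer encourages dangerous acts."),
]


def critique_and_revise(
    generate: Callable[[str], str],   # assumed: prompt text in, completion out
    prompt: str,
    n_rounds: int = 2,
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Return (prompt, final revision) as one supervised fine-tuning pair."""
    rng = rng or random.Random(0)

    # Initial, possibly harmful, response to the red-team prompt.
    response = generate(f"Human: {prompt}\n\nAssistant:")

    for _ in range(n_rounds):
        # Sample one principle per round.
        critique_req, revision_req = rng.choice(PRINCIPLES)
        context = (
            f"Human: {prompt}\n\nAssistant: {response}\n\n"
            f"Critique Request: {critique_req}\n\nCritique:"
        )
        critique = generate(context)
        # The revision replaces the response and feeds the next round.
        response = generate(
            f"{context} {critique}\n\n"
            f"Revision Request: {revision_req}\n\nRevision:"
        )

    return prompt, response
```

Under these assumptions, the (prompt, final revision) pairs collected over many red-team prompts would form the fine-tuning set for the supervised stage. The RL stage is not shown here.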

Source: https://arxiv.org/abs/2212.08073

constitutional-ai
alignment
rlhf