Forge Capsule
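Constitutional AI (CAI; Bai et al. 2022) trains a helpful, harmless assistant without human labels identifying harmful outputs; human oversight comes from a written list of principles (the constitution). SL-CAI: the model critiques and revises its own harmful responses against the constitution, then is fine-tuned on the revisions. RL-CAI: a preference model is trained on AI-generated harmlessness comparisons, then used as the reward signal for RL (RLAIF). Result: harmlessness comparable to RLHF with far less human labeling cost.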
Source: https://arxiv.org/abs/2212.08073
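A minimal sketch of the SL-CAI critique-and-revise loop, to make the data flow concrete. This is not the paper's code. The `generate` callable is a hypothetical stand-in for any language model call, the prompt wording is invented for illustration, and principles are cycled in order here (an assumption made to keep the example deterministic).

```python
from typing import Callable, List, Tuple

# Hypothetical model interface: prompt text in, completion text out.
Generate = Callable[[str], str]


def critique_and_revise(
    generate: Generate,
    prompt: str,
    principles: List[str],
    rounds: int = 1,
) -> str:
    """Draft a response, then critique and revise it `rounds` times."""
    if not principles:
        raise ValueError("at least one principle is required")

    response = generate(prompt)  # initial, possibly harmful, draft
    for i in range(rounds):
        # Assumption: cycle through principles in order for determinism.
        principle = principles[i % len(principles)]
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response using this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response


def build_sft_pairs(
    generate: Generate,
    prompts: List[str],
    principles: List[str],
    rounds: int = 1,
) -> List[Tuple[str, str]]:
    """Pair each prompt with its final revision, for supervised fine-tuning."""
    return [
        (p, critique_and_revise(generate, p, principles, rounds))
        for p in prompts
    ]
```

The (prompt, revision) pairs are what the supervised stage would fine-tune on; the RL-CAI stage is not covered by this sketch.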