Content Moderation at Scale: Detection and False Positive Reduction

Type: KNOWLEDGE

Verification: unverified - Evidence: ungraded

Quality: public

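Content moderation pipeline: input normalization → rule-based filters → ML classifier → human review queue. Normalization comes first so that later stages see canonical text instead of obfuscated variants. Attack patterns: homoglyph substitution (visually similar characters from other scripts), leetspeak (4g3nt → agent), Unicode obfuscation (for example zero-width or fullwidth characters), and encoding tricks. False positive reduction: context-aware scoring, whitelisted domains, and trust score multipliers that scale risk by source or account reputation. Threshold tuning: choose the operating point from the ROC curve and the precision-recall tradeoff; a lower threshold catches more violations but flags more legitimate content. Related production work: Meta's WPIE (Whole Post Integrity Embeddings) and Google's TCAV (Testing with Concept Activation Vectors, an interpretability method rather than a moderation system). Forge ATIS: 65 blocked patterns, 10 active bypasses, 3 false positives (FPs) at R87. Recommended...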