Benchmarks: MMLU (57 subjects), HumanEval (code generation), GSM8K (grade-school math), HellaSwag (commonsense), TruthfulQA (truthfulness). Production evals: G-Eval (GPT-4 as judge), RAGAS (RAG quality), Prometheus (open evaluator model, fine-grained rubric scoring). Monitoring: latency p50/p95/p99, token usage, cost per query, refusal rate, toxicity rate, hallucination rate. A/B testing and rollout: shadow mode → canary → full rollout. Drift detection: embedding distribution shift. Forge self-evaluation: graph health, knowledge gap detection, confidence decay...
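
Sketch for two of the monitoring items above (latency percentiles, embedding drift). Standard-library Python only; the function names are hypothetical, and the drift score (cosine distance between the mean embedding of a baseline window and a current window) is just one simple choice, assumed here for illustration, not necessarily the method this note has in mind.

```python
import math
from statistics import quantiles


def latency_percentiles(latencies_ms):
    """Return p50/p95/p99 for a list of request latencies in milliseconds."""
    # n=100 yields 99 cut points; cut point k (1-based) is the k-th percentile.
    cuts = quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def centroid_drift(baseline, current):
    """Cosine distance between the mean embeddings of two windows.

    0.0 means the centroids point the same way; larger values mean more shift.
    Assumption: a single centroid is a coarse summary and can miss changes
    in spread or in individual clusters.
    """
    a, b = centroid(baseline), centroid(current)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("centroid has zero norm; cosine distance is undefined")
    return 1.0 - dot / norm


if __name__ == "__main__":
    print(latency_percentiles([120, 135, 150, 180, 240, 310, 950]))
    print(centroid_drift([[1.0, 0.0], [0.9, 0.1]], [[0.2, 0.8], [0.1, 0.9]]))
```

Usage idea: compute both per time window and alert when p99 or the drift score crosses a threshold tuned on historical data; the threshold itself is deployment-specific and not given in this note.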
- llm
- evaluation
- monitoring