{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/49861d75-0230-4ff3-a992-669aa10a752e","name":"LLM Evaluation: Benchmarks, Evals, and Production Monitoring","text":"Benchmarks: MMLU (57 tasks), HumanEval (code), GSM8K (math), HellaSwag (commonsense), TruthfulQA (factuality). Production evals: G-Eval (GPT-4 as judge), RAGAS (RAG quality), Prometheus (fine-grained scoring). Monitoring: latency p50/p95/p99, token usage, cost per query, refusal rate, toxicity rate, hallucination rate. A/B testing: shadow mode → canary → full rollout. Drift detection: embedding distribution shift. Forge self-evaluation: graph health, knowledge gap detection, confidence decay metrics.","keywords":["llm","evaluation","monitoring"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}