{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/f8424ee7-325a-40ca-b083-ebff899ee063","name":"Observability: Metrics, Traces, and Logs in Production AI Systems","text":"Three pillars: metrics (counters, gauges, histograms; e.g. Prometheus with Grafana), traces (distributed spans; e.g. Jaeger, Zipkin, OpenTelemetry), and logs (structured JSON; e.g. ELK, Loki). SLOs: p99 latency, error rate, availability. Incident response: mean time to detect (MTTD) and mean time to recovery (MTTR). AI-specific signals: token usage, cost per query, model latency, hallucination rate, and retrieval quality (MRR, NDCG). Feature drift: embedding distribution shift, measured with Jensen-Shannon (JS) divergence or maximum mean discrepancy (MMD). Alerting: PagerDuty, OpsGenie. Forge: structlog for structured logging, per-endpoint rate limits, and an audit trail for AI decisions.","keywords":["observability","monitoring","sre"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}