Three pillars: metrics (counters, gauges, histograms; Prometheus + Grafana), traces (distributed spans; Jaeger, Zipkin, OpenTelemetry), logs (structured JSON; ELK, Loki). SLOs: targets on indicators such as p99 latency, error rate, availability. Incident metrics: MTTD (mean time to detect), MTTR (mean time to recovery). AI-specific: token usage, cost per query, model latency, hallucination rate, retrieval quality (MRR, NDCG). Feature drift: embedding distribution shift (JS divergence, MMD). Alerting: PagerDuty, OpsGenie. Forge: structlog, per-endpoint rate limits, audit trail for...
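
The retrieval-quality and drift metrics named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how Forge computes them: the function names are hypothetical, NDCG uses linear gains with the ideal ranking taken from the returned list only, and JS divergence is computed over two histograms (for example, binned values of a 1-D projection of embeddings) using log base 2, so the result falls in [0, 1].

```python
import math
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """MRR over queries; each inner sequence holds 0/1 relevance in ranked order."""
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels, start=1):
            if rel > 0:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


def ndcg_at_k(gains: Sequence[float], k: int) -> float:
    """NDCG@k with linear gains (assumption; some variants use 2**rel - 1)."""

    def dcg(values: Sequence[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(values[:k]))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (base 2) between two histograms of equal length."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    sum_p, sum_q = sum(p), sum(q)
    if sum_p <= 0 or sum_q <= 0:
        raise ValueError("histograms must have positive total mass")
    # Normalize raw counts to probability distributions.
    p_norm = [x / sum_p for x in p]
    q_norm = [x / sum_q for x in q]
    mix = [(a + b) / 2 for a, b in zip(p_norm, q_norm)]

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Bins with zero mass in `a` contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p_norm, mix) + 0.5 * kl(q_norm, mix)
```

A drift alert could then compare `js_divergence(baseline_hist, current_hist)` against a threshold tuned on historical data; the threshold value itself is a deployment choice and is not specified here.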
- observability
- monitoring
- sre