MLOps

AI Observability: Metrics and Dashboards That Actually Matter

A practical guide to AI observability for production systems — including latency, drift, token usage, retrieval quality, and the dashboards teams actually use during incidents.

5 min read · 983 words

A lot of AI observability dashboards look impressive and answer almost none of the questions engineers need during an incident.

You don't need fifty charts. You need the few metrics that tell you:

  • Is the system healthy right now?
  • Is quality degrading?
  • Is a specific model, tenant, or workflow causing the problem?
  • Is cost moving in the wrong direction?

Production AI is not just an API with higher GPU bills. The signals are different from normal backend services, and the failure modes are wider: drift, retrieval regressions, prompt changes, token spikes, model routing errors, and user-visible quality drops without hard downtime.

Here's the dashboard stack we recommend.

Layer 1: Service Health

Start with the basics. If you don't have strong service health telemetry, the AI-specific charts won't save you.

Track:

  • request volume
  • success/error rate
  • p50/p95 latency
  • timeout rate
  • queue depth
  • saturation by dependency

These should be broken down by:

  • model
  • endpoint
  • tenant
  • route type

For example, "overall p95 latency" is less useful than "p95 latency for the generate-summary route on the 8B model for enterprise tenants."
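As a sketch of what that breakdown looks like in practice, here is a stdlib-only aggregation of per-request latencies into p95 by (model, route, tenant). The field names and the nearest-rank p95 are illustrative assumptions; in production you would usually let a histogram metric do this.

```python
from collections import defaultdict

def p95(latencies):
    """Nearest-rank p95 over a list of latency samples (ms)."""
    ordered = sorted(latencies)
    idx = max(0, int(0.95 * len(ordered)) - 1)
    return ordered[idx]

def latency_by_dimension(requests):
    """Group request latencies by (model, route, tenant) and report p95 per group.

    Each request is assumed to be a dict like:
    {"model": "8b", "route": "generate-summary", "tenant": "enterprise", "latency_ms": 420}
    """
    groups = defaultdict(list)
    for r in requests:
        groups[(r["model"], r["route"], r["tenant"])].append(r["latency_ms"])
    return {key: p95(values) for key, values in groups.items()}
```

The same grouping key works for error rate and timeout rate; the point is that every health metric carries these dimensions, not that you compute percentiles by hand.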

Layer 2: Model Runtime Metrics

This is the layer many teams skip.

For model-serving workloads, add:

  • model load time
  • GPU utilization
  • GPU memory pressure
  • OOM events
  • active sequences or in-flight requests
  • tokens per second
  • prompt tokens and completion tokens

A minimal metric set for this layer, in Prometheus-style naming, might be:

metrics:
  - model_request_latency_ms
  - model_tokens_input_total
  - model_tokens_output_total
  - gpu_utilization_percent
  - gpu_memory_used_bytes
  - model_runtime_oom_total

These metrics explain behavior that normal HTTP metrics hide. A service can be "up" while token throughput collapses and users wait 20 seconds for responses.
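To make that failure mode concrete, here is a minimal sketch of deriving token throughput per request and flagging a collapse across recent samples. The 10 tokens/sec floor is an arbitrary illustrative threshold, not a recommendation.

```python
def token_throughput(completion_tokens: int, latency_ms: int) -> float:
    """Output tokens per second for a single request."""
    return completion_tokens / (latency_ms / 1000.0)

def throughput_collapsed(samples, floor_tps: float = 10.0) -> bool:
    """True when median throughput over recent requests drops below a floor.

    `samples` is a list of (completion_tokens, latency_ms) pairs. Every one of
    these requests can still report HTTP 200 while this returns True.
    """
    rates = sorted(token_throughput(tokens, ms) for tokens, ms in samples)
    median = rates[len(rates) // 2]
    return median < floor_tps
```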

Layer 3: AI Quality Signals

This is where observability becomes AI-specific.

Examples of quality signals:

  • drift score by feature
  • abstention rate
  • hallucination review rate
  • citation usage rate
  • retrieval hit rate
  • fallback model activation
  • human escalation rate

These usually aren't infrastructure metrics. They come from product instrumentation, eval pipelines, and lightweight annotations.

That means observability for AI needs coordination across platform, ML, and product teams.
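As one example of a drift score by feature, here is a minimal Population Stability Index (PSI) sketch comparing a live feature sample against a training-time baseline. The binning scheme and the usual reading (below 0.1 stable, 0.1-0.25 moderate drift, above 0.25 significant) are conventions, not part of any particular library.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a live sample.

    Bin edges are fixed from the baseline's min/max; live values outside that
    range are clamped into the edge bins. A small epsilon avoids log(0).
    """
    lo, hi = min(expected), max(expected)

    def frac(values):
        counts = [0] * bins
        for v in values:
            if hi > lo:
                i = min(max(int((v - lo) / (hi - lo) * bins), 0), bins - 1)
            else:
                i = 0
            counts[i] += 1
        return [(c + 1e-6) / (len(values) + bins * 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Run per feature on a schedule, this gives you the "drift score by feature" panel directly.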

Layer 4: Cost and Efficiency

AI systems can degrade financially before they degrade technically.

Track:

  • cost per request
  • cost per tenant
  • tokens per workflow
  • GPU-hours per model
  • cache hit rate
  • percentage of requests handled by the cheapest acceptable model

If your routing rules drift and the expensive model handles every request, the problem may show up in finance before engineering notices.
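A back-of-the-envelope sketch of cost-per-request and routing-drift tracking, assuming token-based pricing. The model names and per-million-token prices here are entirely hypothetical; substitute your provider's rates or your amortized GPU cost per token.

```python
# Hypothetical per-million-token prices -- replace with real numbers.
PRICE_PER_MTOK = {
    "small-8b":  {"input": 0.20, "output": 0.60},
    "large-70b": {"input": 2.00, "output": 6.00},
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated dollar cost of one request from its token counts."""
    price = PRICE_PER_MTOK[model]
    return (prompt_tokens * price["input"] + completion_tokens * price["output"]) / 1_000_000

def cheap_model_share(requests) -> float:
    """Fraction of requests served by the cheapest acceptable model --
    the routing-drift signal described above."""
    cheap = sum(1 for r in requests if r["model"] == "small-8b")
    return cheap / len(requests)
```

If `cheap_model_share` trends toward zero after a routing change, that is the "finance notices first" failure in numbers.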

Dashboards We Actually Use

1. Executive Summary Dashboard

This is the high-level view:

  • requests per minute
  • overall success rate
  • overall p95 latency
  • estimated daily cost
  • top failing workflows

Keep it simple. This is for quickly deciding whether something is broadly wrong.

2. Model Runtime Dashboard

This is the operator dashboard during serving issues:

  • per-model p95 latency
  • active requests per replica
  • token throughput
  • GPU utilization and memory usage
  • pod restarts and OOMs
  • model load durations

If a model rollout goes wrong, this dashboard tells you whether the issue is compute, concurrency, or bad pod behavior.

3. Retrieval and RAG Dashboard

For RAG systems, this deserves its own view:

  • retrieval hit rate
  • no-result rate
  • average documents retrieved
  • stale-document usage
  • reranker latency
  • citation presence

Without this dashboard, RAG incidents often get misdiagnosed as model failures.
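Several of these panels fall out of per-request logs directly. A minimal sketch, assuming each record carries a `retrieved_doc_count` and a `has_citation` flag (the field names mirror the structured-log example later in this article, with `has_citation` as an added assumption):

```python
def rag_metrics(requests):
    """Aggregate core RAG signals from per-request log records."""
    n = len(requests)
    no_result = sum(1 for r in requests if r["retrieved_doc_count"] == 0)
    cited = sum(1 for r in requests if r["has_citation"])
    return {
        "no_result_rate": no_result / n,
        "retrieval_hit_rate": 1 - no_result / n,
        "avg_docs_retrieved": sum(r["retrieved_doc_count"] for r in requests) / n,
        "citation_rate": cited / n,
    }
```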

4. Evaluation and Quality Dashboard

This dashboard tracks whether the system is getting better or worse over time:

  • eval pass rate by release
  • human review scores
  • hallucination or policy violation counts
  • task completion rate
  • abstention rate

If you change prompts or models frequently, this dashboard is essential.
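The key join for this dashboard is tagging every eval run with the release it tested. A minimal sketch, assuming eval results arrive as (release, passed) pairs:

```python
from collections import defaultdict

def pass_rate_by_release(eval_results):
    """Pass rate per release tag, so quality can be lined up with deploys.

    eval_results: iterable of (release, passed) pairs, e.g. from an eval
    pipeline that records which release each run exercised.
    """
    totals = defaultdict(lambda: [0, 0])  # release -> [passed, total]
    for release, passed in eval_results:
        totals[release][0] += int(passed)
        totals[release][1] += 1
    return {release: p / t for release, (p, t) in totals.items()}
```

With that in place, a quality regression shows up as a step change at a specific release tag rather than an unexplained drift.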

Instrument the Request Path

Every AI request should emit structured logs or traces with enough detail to reconstruct behavior later.

{
  "request_id": "req_8f2a",
  "route": "customer-support-rag",
  "model": "llama3-8b-instruct",
  "tenant": "acme-co",
  "prompt_tokens": 1821,
  "completion_tokens": 264,
  "latency_ms": 2480,
  "retrieved_doc_count": 6,
  "fallback_used": false,
  "status": "success"
}

This data is what makes dashboards and incident reviews useful instead of speculative.
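One way to emit that record is a single helper on the request path, a sketch using only the standard library (the logger name and argument list are assumptions, not a prescribed API):

```python
import json
import logging
import time

logger = logging.getLogger("ai.requests")

def log_ai_request(request_id, route, model, tenant,
                   prompt_tokens, completion_tokens,
                   started_at, retrieved_doc_count, fallback_used, status):
    """Emit one structured record per request, matching the JSON shape above.

    `started_at` is a time.monotonic() timestamp captured when the request began.
    """
    record = {
        "request_id": request_id,
        "route": route,
        "model": model,
        "tenant": tenant,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": int((time.monotonic() - started_at) * 1000),
        "retrieved_doc_count": retrieved_doc_count,
        "fallback_used": fallback_used,
        "status": status,
    }
    logger.info(json.dumps(record))
    return record
```

Because the record is JSON, the same fields feed log search during an incident and batch aggregation for the dashboards above.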

Alert on Changes, Not Just Outages

AI systems frequently degrade gradually.

Useful alerts include:

  • p95 latency increased 30% over baseline
  • abstention rate doubled for a specific workflow
  • retrieval hit rate dropped below threshold
  • prompt token size spiked after a product release
  • GPU memory pressure stayed above safe threshold
  • fallback model usage increased unexpectedly

Static "service down" alerts are necessary but not sufficient.
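The first two alerts above reduce to comparisons against a baseline. A minimal sketch of those checks, with the 30% threshold taken from the example and everything else an illustrative default:

```python
def relative_increase_alert(current: float, baseline: float, threshold: float = 0.30) -> bool:
    """Fire when a metric rises more than `threshold` over its baseline --
    e.g. p95 latency up 30%, rather than 'service down'."""
    if baseline <= 0:
        return current > 0
    return (current - baseline) / baseline > threshold

def rate_doubled(current_rate: float, baseline_rate: float) -> bool:
    """Fire when a rate (abstention, fallback usage) at least doubles."""
    return baseline_rate > 0 and current_rate >= 2 * baseline_rate
```

In a Prometheus setup the equivalent lives in recording and alerting rules; the logic is the same, a ratio against a rolling baseline instead of a fixed threshold.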

Common Dashboard Mistakes

We see these often:

  • Only tracking infrastructure metrics, with no quality signals
  • No breakdown by model or tenant, so hotspots are invisible
  • No cost telemetry, even though spend is part of system health
  • No link between evaluations and production releases
  • Too many charts, making incident response slower instead of faster

The dashboard should help someone decide what to do next. If it only looks sophisticated, it isn't doing its job.

A Practical Starting Stack

For many teams, this is enough:

  • Prometheus for service and runtime metrics
  • Grafana for dashboards
  • OpenTelemetry for traces
  • Structured application logs
  • a lightweight eval pipeline tied to releases

You can get far with that setup before you need specialized AI observability tooling.

Final Takeaway

AI observability is not about collecting more telemetry. It's about collecting the right telemetry at the right layers: service health, model runtime, answer quality, retrieval quality, and cost.

If your dashboards can't tell you whether a bad customer experience came from latency, drift, routing, retrieval, or cost pressure, they aren't finished yet.

Need help designing observability for production AI? We help teams instrument serving stacks, RAG systems, and model pipelines so incidents are diagnosable and costs stay visible. Book a free infrastructure audit to review your current monitoring gaps.


Resilio Tech Team

Building AI infrastructure tools and sharing knowledge to help companies deploy ML systems reliably.

Article Info

Published 3/27/2026