
Setting Up ML Model Monitoring with Prometheus and Grafana: A Complete Guide

A deep, practical guide to ML model monitoring with Prometheus and Grafana, including instrumentation, Prometheus scrape configs, recording rules, Grafana dashboards, and alert rules for latency, errors, and drift.


Most model monitoring guides stop at generic advice: track latency, watch drift, build a dashboard.

That is not enough when you are running real inference traffic.

Production ML monitoring has to answer concrete questions:

  • Is the serving path healthy right now?
  • Is model quality degrading even though the API still returns 200?
  • Did a rollout break one model version or one feature pipeline?
  • Which alerts should wake up an operator, and which ones should stay on the dashboard?

This guide shows how to build a working ML model monitoring stack with Prometheus and Grafana for production use. The focus is practical:

  • application instrumentation
  • Prometheus scrape configuration
  • recording rules and alert rules
  • Grafana dashboards for operators
  • a downloadable starter dashboard JSON

The examples assume a Python model service, but the same monitoring design works for Java, Go, or Rust serving stacks.

What a Useful ML Monitoring Stack Must Cover

Traditional service monitoring is necessary, but it is only the first layer.

For model-serving systems, you need at least four categories of telemetry:

1. Service health

This is the baseline:

  • request rate
  • success and error rate
  • p50, p95, and p99 latency
  • saturation and queue depth
  • replica restarts

Without this layer, you cannot distinguish a model issue from a normal service incident.

2. Inference behavior

This is where ML systems start to diverge from normal web apps:

  • per-model latency
  • per-version traffic share
  • batch size
  • feature completeness
  • prediction distribution shifts

These metrics tell you whether the model server is behaving differently even when the infrastructure looks healthy.

3. Data and quality proxies

Most teams cannot compute ground-truth accuracy in real time. That is normal. Instead, they monitor leading indicators:

  • drift score by feature
  • rate of null or defaulted features
  • out-of-range values
  • prediction class balance
  • business proxy metrics such as approval rate, fraud review rate, or escalation rate

This is the minimum viable approach to model performance monitoring in the real world. You need signals that move before customers complain.
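
The drift score behind these indicators can be any stable statistic. One common choice is the Population Stability Index, shown here as a dependency-free sketch; the bin proportions and the usual interpretation thresholds are illustrative, not something this stack mandates:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index over pre-binned proportions.

    expected/actual are per-bin proportions that each sum to ~1.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift.
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)  # guard against log(0) on empty bins
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

# Identical distributions score 0; shifted ones grow.
baseline = [0.25, 0.25, 0.25, 0.25]
window = [0.10, 0.20, 0.30, 0.40]
print(round(psi(baseline, window), 3))  # → 0.228
```

A batch job that computes this per feature is exactly the kind of producer the drift exporter in Step 2 is built to expose.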

4. Actionable alerting

Dashboards are for diagnosis. Alerts are for interruption. They should be reserved for conditions with a clear response path:

  • latency above SLO
  • error rate above threshold
  • sustained drift on critical features
  • sudden changes in prediction mix
  • missing-feature rate above safe bounds

The mistake is not a lack of alerts. The mistake is alerts with no owner and no runbook.

Reference Architecture

The cleanest way to set up ML monitoring is to split metrics into two planes:

  1. Online metrics emitted by the inference service
  2. Batch or streaming quality metrics computed outside the request path

That gives you a monitoring architecture like this:

Clients
  |
  v
Inference API / Model Server ----> /metrics endpoint ----> Prometheus
  |                                                      /    |    \
  |                                                     /     |     \
  |---- structured logs ------------------------------>/   Alerting   Grafana
  |
  |---- predictions + features ----> feature store / warehouse
                                           |
                                           v
                              drift job / quality batch job
                                           |
                                           v
                                   metrics exporter
                                           |
                                           v
                                      Prometheus

The key design choice is that drift and quality metrics are usually not computed inline on every request. They are produced by scheduled jobs or stream processors and exposed to Prometheus through an exporter.

That keeps the serving path fast while still giving operators visibility into model behavior.

Step 1: Instrument the Model Service

Start with request, error, and model-specific metrics in the application itself. For a Python service, prometheus_client is enough.

from fastapi import FastAPI, HTTPException
from prometheus_client import Counter, Histogram, Gauge, generate_latest
from prometheus_client import CONTENT_TYPE_LATEST
from starlette.responses import Response
import time

app = FastAPI()

INFERENCE_REQUESTS = Counter(
    "ml_inference_requests_total",
    "Total inference requests",
    ["model_name", "model_version", "route", "status"]
)

INFERENCE_LATENCY = Histogram(
    "ml_inference_latency_seconds",
    "Inference latency in seconds",
    ["model_name", "model_version", "route"],
    buckets=(0.01, 0.025, 0.05, 0.1, 0.2, 0.35, 0.5, 0.75, 1.0, 2.0, 5.0)
)

FEATURE_MISSING_RATE = Gauge(
    "ml_feature_missing_ratio",
    "Missing feature ratio for the last evaluation window",
    ["model_name", "feature_name"]
)

PREDICTION_SCORE = Histogram(
    "ml_prediction_score",
    "Distribution of model prediction scores",
    ["model_name", "model_version"],
    buckets=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
)

MODEL_INFO = Gauge(
    "ml_model_info",
    "Static information about the deployed model",
    ["model_name", "model_version", "framework"]
)

MODEL_INFO.labels(
    model_name="claims-risk",
    model_version="2026-04-01",
    framework="xgboost"
).set(1)

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.post("/predict")
def predict(payload: dict):
    start = time.perf_counter()
    model_name = "claims-risk"
    model_version = "2026-04-01"
    route = "predict"

    try:
        feature_count = len(payload.get("features", {}))
        if feature_count == 0:
            raise HTTPException(status_code=400, detail="missing features")

        # Placeholder score: replace with a real model call in production.
        score = 0.82
        PREDICTION_SCORE.labels(model_name, model_version).observe(score)
        INFERENCE_REQUESTS.labels(model_name, model_version, route, "success").inc()
        return {"score": score, "model_version": model_version}
    except Exception:
        INFERENCE_REQUESTS.labels(model_name, model_version, route, "error").inc()
        raise
    finally:
        duration = time.perf_counter() - start
        INFERENCE_LATENCY.labels(model_name, model_version, route).observe(duration)

Three implementation details matter here:

  • Label by model_name and model_version, or you will not be able to isolate a bad rollout.
  • Keep label cardinality controlled. Do not use customer_id, request_id, or raw feature values as labels.
  • Put prediction-distribution metrics behind a histogram or pre-aggregated gauge. Do not emit every score as a unique log-only event and then expect Prometheus to save you.
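
The reason a histogram is cardinality-safe is that Prometheus never stores individual observations: each score collapses into a fixed set of cumulative bucket counters labeled by `le` (less-or-equal), plus a `+Inf` bucket. A pure-Python sketch of that aggregation, reusing the score buckets from the service code above:

```python
import bisect

# Upper bounds mirror the ml_prediction_score buckets from the service code.
BOUNDS = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

def cumulative_buckets(observations, bounds=BOUNDS):
    """Return {le: count} the way a Prometheus histogram exposes it:
    each bucket counts observations <= its upper bound, plus +Inf."""
    ordered = sorted(observations)
    counts = {}
    for b in bounds:
        # Cumulative count: everything at or below this bound.
        counts[b] = bisect.bisect_right(ordered, b)
    counts[float("inf")] = len(ordered)
    return counts

c = cumulative_buckets([0.05, 0.45, 0.45, 0.82, 0.97])
print(c[0.5], c[float("inf")])  # → 3 5
```

However many requests arrive, the series count stays fixed at one per bucket per label combination, which is what keeps `histogram_quantile` queries cheap later.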

Step 2: Export Drift and Data Quality Signals

Latency and errors come from the live service. Drift usually comes from a scheduled comparison job.

For example, you might compare the latest 15-minute inference window against a training baseline stored in object storage or a feature store snapshot. That job can expose metrics like:

  • ml_data_drift_score{model_name="claims-risk",feature_name="income"}
  • ml_data_drift_detected{model_name="claims-risk",feature_name="income"}
  • ml_feature_missing_ratio{model_name="claims-risk",feature_name="zipcode"}

A lightweight exporter process is often enough:

from prometheus_client import Gauge, start_http_server
import time

DRIFT_SCORE = Gauge(
    "ml_data_drift_score",
    "Current drift score by feature",
    ["model_name", "feature_name"]
)

DRIFT_DETECTED = Gauge(
    "ml_data_drift_detected",
    "1 when drift threshold is exceeded",
    ["model_name", "feature_name"]
)

def compute_feature_drift():
    # Replace with your actual statistical test or Evidently output.
    return {
        "income": 0.19,
        "zipcode": 0.04,
        "claim_amount": 0.23,
    }

if __name__ == "__main__":
    start_http_server(9109)
    model_name = "claims-risk"

    while True:
        scores = compute_feature_drift()
        for feature_name, score in scores.items():
            DRIFT_SCORE.labels(model_name, feature_name).set(score)
            DRIFT_DETECTED.labels(model_name, feature_name).set(1 if score > 0.15 else 0)
        time.sleep(60)

You can run this exporter as:

  • a sidecar next to the batch worker
  • a standalone deployment fed by Kafka or a warehouse query
  • an Airflow-triggered service that refreshes gauges every few minutes

The correct choice depends on your platform. The metric shape is what matters.
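
The `compute_feature_drift` stub above can be backed by any two-sample test. One common option is the Kolmogorov-Smirnov statistic; this is a dependency-free sketch for illustration, while libraries such as SciPy or Evidently provide production-grade versions:

```python
import bisect

def ks_statistic(baseline, current):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples.
    0.0 means identical distributions, 1.0 means fully separated."""
    b, c = sorted(baseline), sorted(current)
    max_gap = 0.0
    for x in set(b) | set(c):
        cdf_b = bisect.bisect_right(b, x) / len(b)
        cdf_c = bisect.bisect_right(c, x) / len(c)
        max_gap = max(max_gap, abs(cdf_b - cdf_c))
    return max_gap

print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # → 0.5
```

Fed with a training baseline sample and the latest inference window per feature, its output slots directly into the `DRIFT_SCORE` gauge.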

Step 3: Configure Prometheus to Scrape Both the Model Service and the Drift Exporter

You now have two scrape targets: the inference API and the quality exporter.

Here is a working prometheus.yml pattern:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/ml-recording-rules.yml
  - /etc/prometheus/rules/ml-alert-rules.yml

scrape_configs:
  - job_name: model-inference
    metrics_path: /metrics
    static_configs:
      - targets:
          - model-api.monitoring.svc.cluster.local:8000
        labels:
          service: model-api
          environment: production

  - job_name: drift-exporter
    metrics_path: /metrics
    static_configs:
      - targets:
          - model-drift-exporter.monitoring.svc.cluster.local:9109
        labels:
          service: drift-exporter
          environment: production

  - job_name: node-exporter
    static_configs:
      - targets:
          - node-exporter.monitoring.svc.cluster.local:9100

If you are already on the Prometheus Operator, the same idea usually becomes a ServiceMonitor. The underlying design does not change:

  • scrape live inference metrics frequently
  • scrape drift and quality metrics frequently enough to reflect the batch update interval

Do not scrape drift once every 15 seconds if the underlying computation only refreshes every 15 minutes. That creates visual noise without adding signal.
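
Prometheus supports a per-job `scrape_interval` override for exactly this. A sketch against the config above; the 5m value is an assumption chosen to roughly match a hypothetical batch cadence:

```yaml
  - job_name: drift-exporter
    scrape_interval: 5m   # slower than the 15s global default on purpose
    metrics_path: /metrics
    static_configs:
      - targets:
          - model-drift-exporter.monitoring.svc.cluster.local:9109
        labels:
          service: drift-exporter
          environment: production
```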

Step 4: Add Recording Rules for Fast Dashboards

PromQL is powerful, but heavy percentile queries across raw histograms can make dashboards sluggish. Recording rules create stable, reusable series for dashboards and alerting.

groups:
  - name: ml-recording-rules
    interval: 30s
    rules:
      - record: job:ml_inference_rps:rate5m
        expr: |
          sum by (model_name, model_version, route) (
            rate(ml_inference_requests_total[5m])
          )

      - record: job:ml_inference_error_rate:ratio5m
        expr: |
          sum by (model_name, model_version, route) (
            rate(ml_inference_requests_total{status="error"}[5m])
          )
          /
          sum by (model_name, model_version, route) (
            rate(ml_inference_requests_total[5m])
          )

      - record: job:ml_inference_latency_p95_seconds:5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, model_name, model_version, route) (
              rate(ml_inference_latency_seconds_bucket[5m])
            )
          )

      - record: job:ml_prediction_score_p50:30m
        expr: |
          histogram_quantile(
            0.50,
            sum by (le, model_name, model_version) (
              rate(ml_prediction_score_bucket[30m])
            )
          )

      - record: job:ml_drift_features_triggered:current
        expr: sum by (model_name) (ml_data_drift_detected)

These recording rules pay off immediately:

  • dashboards render faster
  • alert expressions stay readable
  • operators reuse the same definitions everywhere

If your team debates whether p95 was calculated differently in two dashboards, you already have a monitoring consistency problem. Recording rules are the fix.

Step 5: Define Alert Rules for Latency, Errors, and Drift

This is the part most teams underinvest in. They build a dashboard but never decide what should actually trigger a page.

Start with a small alert set that covers the main failure modes.

groups:
  - name: ml-alert-rules
    rules:
      - alert: MLInferenceHighLatencyP95
        expr: job:ml_inference_latency_p95_seconds:5m > 0.35
        for: 10m
        labels:
          severity: page
          team: ml-platform
        annotations:
          summary: "P95 inference latency is above 350ms"
          description: "Model {{ $labels.model_name }} version {{ $labels.model_version }} on route {{ $labels.route }} has sustained high latency."

      - alert: MLInferenceHighErrorRate
        expr: job:ml_inference_error_rate:ratio5m > 0.03
        for: 10m
        labels:
          severity: page
          team: ml-platform
        annotations:
          summary: "Inference error rate is above 3%"
          description: "Model {{ $labels.model_name }} version {{ $labels.model_version }} is returning elevated errors."

      - alert: MLDriftCriticalFeatureDetected
        expr: ml_data_drift_detected{feature_name=~"income|claim_amount|age"} == 1
        for: 15m
        labels:
          severity: ticket
          team: applied-ml
        annotations:
          summary: "Drift detected on a critical feature"
          description: "Model {{ $labels.model_name }} has sustained drift on feature {{ $labels.feature_name }}."

      - alert: MLFeatureMissingRatioHigh
        expr: ml_feature_missing_ratio > 0.05
        for: 15m
        labels:
          severity: ticket
          team: data-platform
        annotations:
          summary: "Feature missing ratio exceeded 5%"
          description: "Feature {{ $labels.feature_name }} for model {{ $labels.model_name }} is missing more often than expected."

      - alert: MLPredictionDistributionShift
        expr: |
          abs(
            job:ml_prediction_score_p50:30m
            -
            job:ml_prediction_score_p50:30m offset 24h
          ) > 0.20
        for: 30m
        labels:
          severity: ticket
          team: applied-ml
        annotations:
          summary: "Median prediction score shifted materially"
          description: "Model {{ $labels.model_name }} version {{ $labels.model_version }} has a large change in prediction distribution compared with 24 hours ago."

A few operational points:

  • Page on latency and error rate because they usually require immediate action.
  • Ticket on drift first unless the model is high-risk and directly customer-facing.
  • Split ownership. Missing-feature alerts belong with the data pipeline owner, not just the model owner.

This is where a lot of monitoring setups fail. Everything gets routed to one platform team, which means nobody owns root cause.

Step 6: Route Alerts with Alertmanager So the Right Team Sees the Right Failure

Prometheus rules are only half the job. You also need routing logic that matches the labels you put on alerts.

A minimal alertmanager.yml can already separate service-health issues from data-quality issues:

route:
  receiver: default-slack
  group_by: ["alertname", "team", "model_name"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="page"
        - team="ml-platform"
      receiver: pagerduty-ml-platform

    - matchers:
        - severity="ticket"
        - team="applied-ml"
      receiver: slack-applied-ml

    - matchers:
        - severity="ticket"
        - team="data-platform"
      receiver: slack-data-platform

receivers:
  - name: default-slack
    slack_configs:
      - channel: "#ml-observability"
        send_resolved: true

  - name: pagerduty-ml-platform
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_ROUTING_KEY}"
        severity: "critical"

  - name: slack-applied-ml
    slack_configs:
      - channel: "#applied-ml-alerts"
        send_resolved: true

  - name: slack-data-platform
    slack_configs:
      - channel: "#data-platform-alerts"
        send_resolved: true

This looks simple, but it is one of the highest-leverage parts of the setup.

If latency pages the ML researchers at 2 a.m., your routing is wrong. If a feature pipeline regression only lands in a generic shared Slack channel, your routing is also wrong. Alert labels only become useful once they control delivery.

Step 7: Keep a Few PromQL Drill-Down Queries Ready for Incident Response

Dashboards are useful, but during incidents people still need ad hoc queries. Keep a few standard PromQL expressions in the runbook so engineers are not inventing them under pressure.

# Which model version is driving errors right now?
sum by (model_version) (
  rate(ml_inference_requests_total{status="error"}[5m])
)

# Which routes are over the latency SLO?
job:ml_inference_latency_p95_seconds:5m > 0.35

# Which features are currently in drift for a given model?
ml_data_drift_detected{model_name="claims-risk"} == 1

# Which feature started going missing after a deploy?
max_over_time(ml_feature_missing_ratio[1h])

# Did the prediction distribution move compared with yesterday?
job:ml_prediction_score_p50:30m
- on(model_name, model_version)
job:ml_prediction_score_p50:30m offset 24h

The point is not to memorize PromQL. The point is to standardize the questions your operators ask first:

  • Is this isolated to one model version?
  • Is it only one route?
  • Is the issue request-path health or input-data quality?
  • Did the symptoms start after a model or feature release?

Monitoring systems become more reliable when teams standardize the first five diagnostic questions instead of improvising them.

Step 8: Build the Grafana Dashboard Operators Actually Need

Grafana should not be a museum of charts. It should support triage.

A good first dashboard has these sections:

Row 1: Is the service healthy?

  • requests per second
  • p95 inference latency
  • error rate
  • current replicas or pod availability

If these are red, the operator starts with service health before asking whether the model has drifted.

Row 2: Which model or version is responsible?

  • traffic by model version
  • latency by version
  • error rate by version
  • top routes

This row catches bad rollouts quickly.

Row 3: Is the input data changing?

  • drift score by feature
  • number of triggered drift features
  • missing-feature ratio
  • prediction-score distribution trend

This row is the bridge between infra monitoring and model monitoring.

Row 4: What changed recently?

  • deploy marker annotations
  • training baseline version
  • feature pipeline release version

If you do not annotate releases, incident review becomes guesswork.
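
Deploy markers can be pushed from the release pipeline via Grafana's annotation HTTP API (`POST /api/annotations`). A sketch; the field names follow the documented API, while the tag scheme and the URL/token plumbing are assumptions for illustration:

```python
import json
import time
import urllib.request

def build_deploy_annotation(model_version, environment="production"):
    # time/tags/text are Grafana annotation API fields; the tag scheme
    # ("deploy", "model:<version>") is our own convention.
    return {
        "time": int(time.time() * 1000),  # Grafana expects epoch milliseconds
        "tags": ["deploy", f"model:{model_version}", environment],
        "text": f"Deployed model version {model_version}",
    }

def post_annotation(grafana_url, api_token, payload):
    # Hypothetical URL and token: call this from the deploy pipeline.
    req = urllib.request.Request(
        f"{grafana_url}/api/annotations",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
    )
    return urllib.request.urlopen(req)

payload = build_deploy_annotation("2026-04-01")
print(payload["tags"])  # → ['deploy', 'model:2026-04-01', 'production']
```

Panels can then overlay annotations filtered by the `deploy` tag, which turns "what changed recently?" into a query instead of a Slack archaeology session.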

You can download a starter dashboard JSON here:

Download the starter Grafana dashboard JSON

That dashboard includes panels for:

  • request rate
  • p95 latency
  • error rate
  • traffic by version
  • drift score by feature
  • missing-feature ratio

If you provision dashboards as code, keep the provider config in the repo too:

apiVersion: 1

providers:
  - name: ml-monitoring
    folder: ML Monitoring
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards

That matters because teams often version alert rules but leave dashboards as hand-edited UI artifacts. You want both under source control.

Step 9: Wire Alerts into an Actual Response Flow

Monitoring without response design just creates noise.

For this stack, a basic ownership model looks like:

  • MLInferenceHighLatencyP95 -> platform/on-call
  • MLInferenceHighErrorRate -> platform/on-call
  • MLFeatureMissingRatioHigh -> data platform
  • MLDriftCriticalFeatureDetected -> applied ML during business hours unless the use case is high risk

Each alert should link to:

  • dashboard URL
  • runbook URL
  • rollback or mitigation procedure

A useful latency runbook usually starts with:

  1. Check whether the issue is limited to one model_version.
  2. Check pod restarts, CPU saturation, and memory pressure.
  3. Check whether feature-fetch latency or downstream dependencies regressed.
  4. If only the canary version is affected, shift traffic back immediately.

A useful drift runbook starts with:

  1. Confirm that the drift signal is on real production traffic and not a broken batch job.
  2. Identify the affected features and the upstream system that produces them.
  3. Compare against recent schema, ETL, or feature-store releases.
  4. Decide whether the mitigation is pipeline rollback, threshold adjustment, or model retraining.

This distinction is important. A drift alert is often a data problem long before it becomes a model retraining problem.

Common Mistakes When Teams Set This Up

Treating logs as the primary monitoring layer

Logs are necessary for debugging individual requests. They are a poor substitute for metrics during an incident. If an operator needs to grep logs to learn the error rate, the system is under-instrumented.

Overloading labels

Prometheus is not a data warehouse. High-cardinality labels such as user IDs, prompt text, policy IDs, or transaction IDs will make the system slower and more expensive. Put those in logs or traces instead.

Alerting on drift with no baseline hygiene

Drift scores are only as useful as the baseline they compare against. If the reference data is stale, mixed across incompatible populations, or updated ad hoc, your drift alerts will be noisy and untrustworthy.
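
A cheap guard against this failure mode is to timestamp the baseline snapshot and refuse to publish drift scores once it goes stale. A minimal sketch; the 30-day window is an arbitrary assumption to tune per model:

```python
from datetime import datetime, timedelta, timezone

# Assumption: baselines older than this are not trustworthy references.
MAX_BASELINE_AGE = timedelta(days=30)

def baseline_is_usable(baseline_created_at, now=None):
    """Return False when the reference snapshot is older than the
    agreed refresh window, so the drift job can skip emitting scores
    (or emit a dedicated staleness metric) instead of alerting on noise."""
    now = now or datetime.now(timezone.utc)
    return (now - baseline_created_at) <= MAX_BASELINE_AGE

fresh = datetime.now(timezone.utc) - timedelta(days=3)
stale = datetime.now(timezone.utc) - timedelta(days=90)
print(baseline_is_usable(fresh), baseline_is_usable(stale))  # → True False
```

Exposing the baseline age itself as a gauge is a natural extension, so operators can see at a glance which drift scores are built on expired references.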

No separation between health metrics and quality metrics

Latency and 5xx errors are infrastructure-adjacent. Drift and prediction shifts are model-adjacent. Treating them as the same class of incident is how teams end up paging the wrong people.

No version dimension

If you cannot slice metrics by model version, deployment analysis is mostly blind. Every meaningful dashboard in production ML eventually becomes a dashboard grouped by version.

A Practical Rollout Plan

If you are building this from scratch, do it in four phases:

Phase 1: Service health

Ship:

  • request counters
  • latency histograms
  • error counters
  • Grafana panels for RPS, p95, and error rate

This gives you immediate visibility and is usually enough to replace guesswork during incidents.

Phase 2: Model-aware metrics

Add:

  • model version labels
  • prediction distribution histogram
  • feature-missing gauges

This is where the monitoring stack becomes genuinely ML-specific.

Phase 3: Drift metrics

Add:

  • scheduled drift computation
  • exporter service
  • drift dashboards
  • ticket-level alerts

Do not wait for perfect online quality monitoring. Drift proxies deliver value early.

Phase 4: Release-aware operations

Add:

  • deployment annotations in Grafana
  • rollback runbooks
  • model version to traffic share panels
  • SLO-based alerts tied to owner teams

At this point, the monitoring stack becomes part of the release process instead of a separate observability project.

Final Takeaway

The goal of monitoring ML models with Prometheus and Grafana is not to produce prettier charts. It is to reduce the time between a model starting to misbehave and a team knowing exactly what changed.

If you remember only one design principle from this guide, make it this:

  • health metrics come from the serving path
  • drift and quality metrics come from analysis jobs
  • Grafana ties them together
  • alerts map to clear owners

That is the foundation of a monitoring system operators can actually trust.

If you implement the instrumentation, scrape config, recording rules, alert rules, and dashboard structure above, you will have a production-ready baseline instead of a collection of disconnected charts.


Resilio Tech Team

Building AI infrastructure tools and sharing knowledge to help companies deploy ML systems reliably.

Article Info

Published: 4/8/2026