Most model monitoring guides stop at generic advice: track latency, watch drift, build a dashboard.
That is not enough when you are running real inference traffic.
Production ML monitoring has to answer concrete questions:
- Is the serving path healthy right now?
- Is model quality degrading even though the API still returns 200?
- Did a rollout break one model version or one feature pipeline?
- Which alerts should wake up an operator, and which ones should stay on the dashboard?
This guide shows how to build a working ML model monitoring stack on Prometheus and Grafana for production use. The focus is practical:
- application instrumentation
- Prometheus scrape configuration
- recording rules and alert rules
- Grafana dashboards for operators
- a downloadable starter dashboard JSON
The examples assume a Python model service, but the same monitoring design works for Java, Go, or Rust serving stacks.
What a Useful ML Monitoring Stack Must Cover
Traditional service monitoring is necessary, but it is only the first layer.
For model-serving systems, you need at least four categories of telemetry:
1. Service health
This is the baseline:
- request rate
- success and error rate
- p50, p95, and p99 latency
- saturation and queue depth
- replica restarts
Without this layer, you cannot distinguish a model issue from a normal service incident.
2. Inference behavior
This is where ML systems start to diverge from normal web apps:
- per-model latency
- per-version traffic share
- batch size
- feature completeness
- prediction distribution shifts
These metrics tell you whether the model server is behaving differently even when the infrastructure looks healthy.
3. Data and quality proxies
Most teams cannot compute ground-truth accuracy in real time. That is normal. Instead, they monitor leading indicators:
- drift score by feature
- rate of null or defaulted features
- out-of-range values
- prediction class balance
- business proxy metrics such as approval rate, fraud review rate, or escalation rate
This is the minimum viable answer to model performance monitoring in the real world. You need signals that move before customers complain.
4. Actionable alerting
Dashboards are for diagnosis. Alerts are for interruption. They should be reserved for conditions with a clear response path:
- latency above SLO
- error rate above threshold
- sustained drift on critical features
- sudden changes in prediction mix
- missing-feature rate above safe bounds
The mistake is not a lack of alerts. The mistake is alerts with no owner and no runbook.
Reference Architecture
The cleanest way to set up ML monitoring is to split metrics into two planes:
- Online metrics emitted by the inference service
- Batch or streaming quality metrics computed outside the request path
That gives you a monitoring architecture like this:
Clients
   |
   v
Inference API / Model Server ---- /metrics ----> Prometheus ----> Alerting
   |                                                  |
   |---- structured logs (debugging)                  +----> Grafana
   |
   +---- predictions + features ----> feature store / warehouse
                                             |
                                             v
                                 drift job / quality batch job
                                             |
                                             v
                                     metrics exporter ----> Prometheus
The key design choice is that drift and quality metrics are usually not computed inline on every request. They are produced by scheduled jobs or stream processors and exposed to Prometheus through an exporter.
That keeps the serving path fast while still giving operators visibility into model behavior.
Step 1: Instrument the Model Service
Start with request, error, and model-specific metrics in the application itself. For a Python service, prometheus_client is enough.
from fastapi import FastAPI, HTTPException
from prometheus_client import Counter, Histogram, Gauge, generate_latest
from prometheus_client import CONTENT_TYPE_LATEST
from starlette.responses import Response
import time

app = FastAPI()

INFERENCE_REQUESTS = Counter(
    "ml_inference_requests_total",
    "Total inference requests",
    ["model_name", "model_version", "route", "status"]
)

INFERENCE_LATENCY = Histogram(
    "ml_inference_latency_seconds",
    "Inference latency in seconds",
    ["model_name", "model_version", "route"],
    buckets=(0.01, 0.025, 0.05, 0.1, 0.2, 0.35, 0.5, 0.75, 1.0, 2.0, 5.0)
)

FEATURE_MISSING_RATE = Gauge(
    "ml_feature_missing_ratio",
    "Missing feature ratio for the last evaluation window",
    ["model_name", "feature_name"]
)

PREDICTION_SCORE = Histogram(
    "ml_prediction_score",
    "Distribution of model prediction scores",
    ["model_name", "model_version"],
    buckets=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
)

MODEL_INFO = Gauge(
    "ml_model_info",
    "Static information about the deployed model",
    ["model_name", "model_version", "framework"]
)
MODEL_INFO.labels(
    model_name="claims-risk",
    model_version="2026-04-01",
    framework="xgboost"
).set(1)

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.post("/predict")
def predict(payload: dict):
    start = time.perf_counter()
    model_name = "claims-risk"
    model_version = "2026-04-01"
    route = "predict"
    try:
        feature_count = len(payload.get("features", {}))
        if feature_count == 0:
            raise HTTPException(status_code=400, detail="missing features")
        score = 0.82  # placeholder: call your real model here
        PREDICTION_SCORE.labels(model_name, model_version).observe(score)
        INFERENCE_REQUESTS.labels(model_name, model_version, route, "success").inc()
        return {"score": score, "model_version": model_version}
    except Exception:
        INFERENCE_REQUESTS.labels(model_name, model_version, route, "error").inc()
        raise
    finally:
        duration = time.perf_counter() - start
        INFERENCE_LATENCY.labels(model_name, model_version, route).observe(duration)
Three implementation details matter here:
- Label by model_name and model_version, or you will not be able to isolate a bad rollout.
- Keep label cardinality controlled. Do not use customer_id, request_id, or raw feature values as labels.
- Put prediction-distribution metrics behind a histogram or pre-aggregated gauge. Do not emit every score as a unique log-only event and then expect Prometheus to save you.
Step 2: Export Drift and Data Quality Signals
Latency and errors come from the live service. Drift usually comes from a scheduled comparison job.
For example, you might compare the latest 15-minute inference window against a training baseline stored in object storage or a feature store snapshot. That job can expose metrics like:
- ml_data_drift_score{model_name="claims-risk",feature_name="income"}
- ml_data_drift_detected{model_name="claims-risk",feature_name="income"}
- ml_feature_missing_ratio{model_name="claims-risk",feature_name="zipcode"}
A lightweight exporter process is often enough:
from prometheus_client import Gauge, start_http_server
import time

DRIFT_SCORE = Gauge(
    "ml_data_drift_score",
    "Current drift score by feature",
    ["model_name", "feature_name"]
)

DRIFT_DETECTED = Gauge(
    "ml_data_drift_detected",
    "1 when drift threshold is exceeded",
    ["model_name", "feature_name"]
)

def compute_feature_drift():
    # Replace with your actual statistical test or Evidently output.
    return {
        "income": 0.19,
        "zipcode": 0.04,
        "claim_amount": 0.23,
    }

if __name__ == "__main__":
    start_http_server(9109)
    model_name = "claims-risk"
    while True:
        scores = compute_feature_drift()
        for feature_name, score in scores.items():
            DRIFT_SCORE.labels(model_name, feature_name).set(score)
            DRIFT_DETECTED.labels(model_name, feature_name).set(1 if score > 0.15 else 0)
        time.sleep(60)
You can run this exporter as:
- a sidecar next to the batch worker
- a standalone deployment fed by Kafka or a warehouse query
- an Airflow-triggered service that refreshes gauges every few minutes
The correct choice depends on your platform. The metric shape is what matters.
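To make the placeholder compute_feature_drift concrete, here is one common choice of drift statistic: a population stability index (PSI) in plain NumPy. This is a sketch, not the only valid test; the 10-bin layout and the 0.15 alert threshold used above are illustrative conventions, and the function name is my own.

```python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """PSI between a training baseline sample and the latest inference window."""
    # Bin edges come from the baseline so both windows share the same buckets.
    # Note: current values outside the baseline range fall out of the histogram.
    edges = np.histogram_bin_edges(baseline, bins=bins)
    expected, _ = np.histogram(baseline, bins=edges)
    actual, _ = np.histogram(current, bins=edges)
    # Convert counts to proportions; a small epsilon avoids log(0) and 0/0.
    eps = 1e-6
    expected = expected / expected.sum() + eps
    actual = actual / actual.sum() + eps
    return float(np.sum((actual - expected) * np.log(actual / expected)))
```

A widely used rule of thumb treats PSI below 0.1 as stable and above 0.2 as significant drift, but the right cutoffs depend on your features and should be tuned against real traffic before they drive alerts.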
Step 3: Configure Prometheus to Scrape Both the Model Service and the Drift Exporter
You now have two scrape targets: the inference API and the quality exporter.
Here is a working prometheus.yml pattern:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/rules/ml-recording-rules.yml
  - /etc/prometheus/rules/ml-alert-rules.yml

scrape_configs:
  - job_name: model-inference
    metrics_path: /metrics
    static_configs:
      - targets:
          - model-api.monitoring.svc.cluster.local:8000
        labels:
          service: model-api
          environment: production

  - job_name: drift-exporter
    metrics_path: /metrics
    static_configs:
      - targets:
          - model-drift-exporter.monitoring.svc.cluster.local:9109
        labels:
          service: drift-exporter
          environment: production

  - job_name: node-exporter
    static_configs:
      - targets:
          - node-exporter.monitoring.svc.cluster.local:9100
If you are already on the Prometheus Operator, the same idea usually becomes a ServiceMonitor. The underlying design does not change:
- scrape live inference metrics frequently
- scrape drift and quality metrics frequently enough to reflect the batch update interval
Do not scrape drift once every 15 seconds if the underlying computation only refreshes every 15 minutes. That creates visual noise without adding signal.
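On the Prometheus Operator, the drift-exporter job might look roughly like this ServiceMonitor. The namespace, label selector, and port name are assumptions about how your Service is defined; the slower interval reflects the batch refresh cadence discussed above.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: model-drift-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      service: drift-exporter   # assumes the Service carries this label
  endpoints:
    - port: metrics             # assumes the Service names its port "metrics"
      path: /metrics
      interval: 60s             # slower than the 15s default, matching the batch job
```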
Step 4: Add Recording Rules for Fast Dashboards
PromQL is powerful, but heavy percentile queries across raw histograms can make dashboards sluggish. Recording rules create stable, reusable series for dashboards and alerting.
groups:
  - name: ml-recording-rules
    interval: 30s
    rules:
      - record: job:ml_inference_rps:rate5m
        expr: |
          sum by (model_name, model_version, route) (
            rate(ml_inference_requests_total[5m])
          )

      - record: job:ml_inference_error_rate:ratio5m
        expr: |
          sum by (model_name, model_version, route) (
            rate(ml_inference_requests_total{status="error"}[5m])
          )
          /
          sum by (model_name, model_version, route) (
            rate(ml_inference_requests_total[5m])
          )

      - record: job:ml_inference_latency_p95_seconds:5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, model_name, model_version, route) (
              rate(ml_inference_latency_seconds_bucket[5m])
            )
          )

      - record: job:ml_prediction_score_p50:30m
        expr: |
          histogram_quantile(
            0.50,
            sum by (le, model_name, model_version) (
              rate(ml_prediction_score_bucket[30m])
            )
          )

      - record: job:ml_drift_features_triggered:current
        expr: sum by (model_name) (ml_data_drift_detected)
These recording rules pay off immediately:
- dashboards render faster
- alert expressions stay readable
- operators reuse the same definitions everywhere
If your team debates whether p95 was calculated differently in two dashboards, you already have a monitoring consistency problem. Recording rules are the fix.
Step 5: Define Alert Rules for Latency, Errors, and Drift
This is the part most teams underinvest in. They build a dashboard but never decide what should actually trigger a page.
Start with a small alert set that covers the main failure modes.
groups:
  - name: ml-alert-rules
    rules:
      - alert: MLInferenceHighLatencyP95
        expr: job:ml_inference_latency_p95_seconds:5m > 0.35
        for: 10m
        labels:
          severity: page
          team: ml-platform
        annotations:
          summary: "P95 inference latency is above 350ms"
          description: "Model {{ $labels.model_name }} version {{ $labels.model_version }} on route {{ $labels.route }} has sustained high latency."

      - alert: MLInferenceHighErrorRate
        expr: job:ml_inference_error_rate:ratio5m > 0.03
        for: 10m
        labels:
          severity: page
          team: ml-platform
        annotations:
          summary: "Inference error rate is above 3%"
          description: "Model {{ $labels.model_name }} version {{ $labels.model_version }} is returning elevated errors."

      - alert: MLDriftCriticalFeatureDetected
        expr: ml_data_drift_detected{feature_name=~"income|claim_amount|age"} == 1
        for: 15m
        labels:
          severity: ticket
          team: applied-ml
        annotations:
          summary: "Drift detected on a critical feature"
          description: "Model {{ $labels.model_name }} has sustained drift on feature {{ $labels.feature_name }}."

      - alert: MLFeatureMissingRatioHigh
        expr: ml_feature_missing_ratio > 0.05
        for: 15m
        labels:
          severity: ticket
          team: data-platform
        annotations:
          summary: "Feature missing ratio exceeded 5%"
          description: "Feature {{ $labels.feature_name }} for model {{ $labels.model_name }} is missing more often than expected."

      - alert: MLPredictionDistributionShift
        expr: |
          abs(
            job:ml_prediction_score_p50:30m
            -
            job:ml_prediction_score_p50:30m offset 24h
          ) > 0.20
        for: 30m
        labels:
          severity: ticket
          team: applied-ml
        annotations:
          summary: "Median prediction score shifted materially"
          description: "Model {{ $labels.model_name }} version {{ $labels.model_version }} has a large change in prediction distribution compared with 24 hours ago."
A few operational points:
- Page on latency and error rate because they usually require immediate action.
- Ticket on drift first unless the model is high-risk and directly customer-facing.
- Split ownership. Missing-feature alerts belong with the data pipeline owner, not just the model owner.
This is where a lot of monitoring setups fail. Everything gets routed to one platform team, which means nobody owns root cause.
Step 6: Route Alerts with Alertmanager So the Right Team Sees the Right Failure
Prometheus rules are only half the job. You also need routing logic that matches the labels you put on alerts.
A minimal alertmanager.yml can already separate service-health issues from data-quality issues:
route:
  receiver: default-slack
  group_by: ["alertname", "team", "model_name"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="page"
        - team="ml-platform"
      receiver: pagerduty-ml-platform
    - matchers:
        - severity="ticket"
        - team="applied-ml"
      receiver: slack-applied-ml
    - matchers:
        - severity="ticket"
        - team="data-platform"
      receiver: slack-data-platform

receivers:
  - name: default-slack
    slack_configs:
      - channel: "#ml-observability"
        send_resolved: true
  - name: pagerduty-ml-platform
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_ROUTING_KEY}"
        severity: "critical"
  - name: slack-applied-ml
    slack_configs:
      - channel: "#applied-ml-alerts"
        send_resolved: true
  - name: slack-data-platform
    slack_configs:
      - channel: "#data-platform-alerts"
        send_resolved: true
This looks simple, but it is one of the highest-leverage parts of the setup.
If latency pages the ML researchers at 2 a.m., your routing is wrong. If a feature pipeline regression only lands in a generic shared Slack channel, your routing is also wrong. Alert labels only become useful once they control delivery.
Step 7: Keep a Few PromQL Drill-Down Queries Ready for Incident Response
Dashboards are useful, but during incidents people still need ad hoc queries. Keep a few standard PromQL expressions in the runbook so engineers are not inventing them under pressure.
# Which model version is driving errors right now?
sum by (model_version) (
rate(ml_inference_requests_total{status="error"}[5m])
)
# Which routes are over the latency SLO?
job:ml_inference_latency_p95_seconds:5m > 0.35
# Which features are currently in drift for a given model?
ml_data_drift_detected{model_name="claims-risk"} == 1
# Which feature started going missing after a deploy?
max_over_time(ml_feature_missing_ratio[1h])
# Did the prediction distribution move compared with yesterday?
job:ml_prediction_score_p50:30m
- on(model_name, model_version)
job:ml_prediction_score_p50:30m offset 24h
The point is not to memorize PromQL. The point is to standardize the questions your operators ask first:
- Is this isolated to one model version?
- Is it only one route?
- Is the issue request-path health or input-data quality?
- Did the symptoms start after a model or feature release?
Monitoring systems become more reliable when teams standardize the first five diagnostic questions instead of improvising them.
Step 8: Build the Grafana Dashboard Operators Actually Need
Grafana should not be a museum of charts. It should support triage.
A good first dashboard has these sections:
Row 1: Is the service healthy?
- requests per second
- p95 inference latency
- error rate
- current replicas or pod availability
If these are red, the operator starts with service health before asking whether the model has drifted.
Row 2: Which model or version is responsible?
- traffic by model version
- latency by version
- error rate by version
- top routes
This row catches bad rollouts quickly.
Row 3: Is the input data changing?
- drift score by feature
- number of triggered drift features
- missing-feature ratio
- prediction-score distribution trend
This row is the bridge between infra monitoring and model monitoring.
Row 4: What changed recently?
- deploy marker annotations
- training baseline version
- feature pipeline release version
If you do not annotate releases, incident review becomes guesswork.
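Deploy markers can be pushed from your release pipeline via Grafana's annotation HTTP API (POST /api/annotations). Here is a minimal sketch; the Grafana URL and token are placeholders you would supply from your environment, and the helper names are my own.

```python
import json
import time
import urllib.request

GRAFANA_URL = "http://grafana.monitoring.svc.cluster.local:3000"  # assumption: in-cluster Grafana
API_TOKEN = "REPLACE_ME"  # assumption: a Grafana service-account token with annotation rights

def build_deploy_annotation(model_name, model_version):
    """Payload for Grafana's POST /api/annotations endpoint."""
    return {
        "time": int(time.time() * 1000),  # Grafana expects epoch milliseconds
        "tags": ["deploy", model_name, model_version],
        "text": f"Deployed {model_name} version {model_version}",
    }

def post_annotation(payload):
    req = urllib.request.Request(
        f"{GRAFANA_URL}/api/annotations",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
    )
    return urllib.request.urlopen(req)
```

Tagging annotations with the model name and version lets dashboard panels filter deploy markers to the model they display.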
You can download a starter dashboard JSON here:
Download the starter Grafana dashboard JSON
That dashboard includes panels for:
- request rate
- p95 latency
- error rate
- traffic by version
- drift score by feature
- missing-feature ratio
If you provision dashboards as code, keep the provider config in the repo too:
apiVersion: 1
providers:
  - name: ml-monitoring
    folder: ML Monitoring
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
That matters because teams often version alert rules but leave dashboards as hand-edited UI artifacts. You want both under source control.
Step 9: Wire Alerts into an Actual Response Flow
Monitoring without response design just creates noise.
For this stack, a basic ownership model looks like:
- MLInferenceHighLatencyP95 -> platform on-call
- MLInferenceHighErrorRate -> platform on-call
- MLFeatureMissingRatioHigh -> data platform
- MLDriftCriticalFeatureDetected -> applied ML during business hours unless the use case is high risk
Each alert should link to:
- dashboard URL
- runbook URL
- rollback or mitigation procedure
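One low-effort way to attach those links is through the alert rule's annotations block, so they travel with every notification. The URLs below are placeholders for your own runbook and dashboard locations.

```yaml
annotations:
  summary: "P95 inference latency is above 350ms"
  dashboard_url: "https://grafana.example.com/d/ml-overview"      # placeholder
  runbook_url: "https://runbooks.example.com/ml/latency-p95"      # placeholder
```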
A useful latency runbook usually starts with:
- Check whether the issue is limited to one model_version.
- Check pod restarts, CPU saturation, and memory pressure.
- Check whether feature-fetch latency or downstream dependencies regressed.
- If only the canary version is affected, shift traffic back immediately.
A useful drift runbook starts with:
- Confirm that the drift signal is on real production traffic and not a broken batch job.
- Identify the affected features and the upstream system that produces them.
- Compare against recent schema, ETL, or feature-store releases.
- Decide whether the mitigation is pipeline rollback, threshold adjustment, or model retraining.
This distinction is important. A drift alert is often a data problem long before it becomes a model retraining problem.
Common Mistakes When Teams Set This Up
Treating logs as the primary monitoring layer
Logs are necessary for debugging individual requests. They are a poor substitute for metrics during an incident. If an operator needs to grep logs to learn the error rate, the system is under-instrumented.
Overloading labels
Prometheus is not a data warehouse. High-cardinality labels such as user IDs, prompt text, policy IDs, or transaction IDs will make the system slower and more expensive. Put those in logs or traces instead.
Alerting on drift with no baseline hygiene
Drift scores are only as useful as the baseline they compare against. If the reference data is stale, mixed across incompatible populations, or updated ad hoc, your drift alerts will be noisy and untrustworthy.
No separation between health metrics and quality metrics
Latency and 5xx errors are infrastructure-adjacent. Drift and prediction shifts are model-adjacent. Treating them as the same class of incident is how teams end up paging the wrong people.
No version dimension
If you cannot slice metrics by model version, deployment analysis is mostly blind. Every meaningful dashboard in production ML eventually becomes a dashboard grouped by version.
A Practical Rollout Plan
If you are building this from scratch, do it in four phases:
Phase 1: Service health
Ship:
- request counters
- latency histograms
- error counters
- Grafana panels for RPS, p95, and error rate
This gives you immediate visibility and is usually enough to replace guesswork during incidents.
Phase 2: Model-aware metrics
Add:
- model version labels
- prediction distribution histogram
- feature-missing gauges
This is where the monitoring stack becomes genuinely ML-specific.
Phase 3: Drift metrics
Add:
- scheduled drift computation
- exporter service
- drift dashboards
- ticket-level alerts
Do not wait for perfect online quality monitoring. Drift proxies deliver value early.
Phase 4: Release-aware operations
Add:
- deployment annotations in Grafana
- rollback runbooks
- model version to traffic share panels
- SLO-based alerts tied to owner teams
At this point, the monitoring stack becomes part of the release process instead of a separate observability project.
Final Takeaway
The goal of monitoring ML models with Prometheus and Grafana is not to produce prettier charts. It is to reduce the time between a model starting to misbehave and a team knowing exactly what changed.
If you remember only one design principle from this guide, make it this:
- health metrics come from the serving path
- drift and quality metrics come from analysis jobs
- Grafana ties them together
- alerts map to clear owners
That is the foundation of a monitoring system operators can actually trust.
If you implement the instrumentation, scrape config, recording rules, alert rules, and dashboard structure above, you will have a production-ready baseline instead of a collection of disconnected charts.