Production ML monitoring has to answer a specific question: is model quality degrading even though the API still returns 200? To answer this, you need a monitoring stack that connects infrastructure health with model-specific behavior like drift and prediction distribution shifts.
The Monitoring Stack
For most teams, Prometheus and Grafana are a practical starting point. Prometheus scrapes and stores the metrics, Grafana visualizes them, and together they let you track standard service SLOs alongside custom ML metrics in one place.
Prometheus Recording Rules for Faster Dashboards
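Calculating p95 latency from raw histogram buckets on every dashboard refresh can be expensive. Recording rules evaluate the expression on a fixed interval and store the result as a new time series, so dashboards and the queries in your incident response runbooks read a precomputed value: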
groups:
  - name: ml_performance_rules
    rules:
      - record: job:ml_inference_latency_p95:5m
        expr: histogram_quantile(0.95, sum by (le, model_name) (rate(ml_inference_latency_seconds_bucket[5m])))
      - record: job:ml_prediction_drift:score
        expr: abs(ml_prediction_score_mean - ml_prediction_score_baseline)
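Prometheus only evaluates these rules once it has loaded the file that contains them. Below is a minimal sketch of the relevant part of prometheus.yml. The file path and the one-minute evaluation interval are assumptions for illustration, not values from your setup:

```yaml
global:
  # How often recording and alerting rules are evaluated (assumed value).
  evaluation_interval: 1m

rule_files:
  # Hypothetical path; point this at wherever the rule group above is saved.
  - "rules/ml_performance_rules.yml"
```

Running `promtool check rules` against the rule file before reloading Prometheus catches indentation and expression errors early.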
Actionable Alerting
An alert without a runbook is just noise. Your stack should alert on:
- Critical Latency: When p95 exceeds your SLA targets.
- Feature Drift: When input distributions shift meaningfully from training.
- Accuracy Proxies: When prediction class balance moves unexpectedly.
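The first item, plus a drift check built on the recorded prediction drift score, might look like the sketch below. The thresholds (0.5 seconds, 0.1), the `for` durations, the alert names, and the runbook URLs are all placeholders. It also assumes the drift metrics carry a `model_name` label. Feature drift and class balance alerts would follow the same pattern once you export metrics for them.

```yaml
groups:
  - name: ml_alerting_rules
    rules:
      - alert: MLInferenceLatencyHigh
        # Placeholder threshold in seconds; substitute your SLA target.
        expr: job:ml_inference_latency_p95:5m > 0.5
        # Require the condition to hold for 10 minutes to avoid flapping.
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "p95 inference latency above SLA for {{ $labels.model_name }}"
          # Hypothetical URL; every alert should link to its runbook.
          runbook_url: "https://example.com/runbooks/ml-latency"

      - alert: MLPredictionDriftHigh
        # Placeholder threshold; calibrate against historical drift scores.
        expr: job:ml_prediction_drift:score > 0.1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Prediction mean has drifted from baseline for {{ $labels.model_name }}"
          runbook_url: "https://example.com/runbooks/ml-drift"
```

Alerting on the recorded series rather than the raw expressions keeps the alert condition and the dashboard panel reading the same numbers, which makes it easier to reason about an alert during an incident.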
Final Takeaway
Monitoring is what gets you from "it works in my notebook" to "it works in production." By standardizing your metrics, dashboards, and alerts, you shorten the time between a model failing and your team fixing it.