Model Deployment

Zero-Downtime Model Updates: Blue-Green and Rolling Deployments for ML Systems

A deep guide to zero-downtime model deployment, covering blue-green and rolling updates for ML systems, stateful model servers, in-flight request handling, and warm-up strategies.

4 min read · 661 words

Most deployment advice for ML systems stops too early. It tells you to use canaries and roll back quickly if metrics degrade. That is useful, but it is not enough when the serving system itself is stateful, memory-heavy, or expensive to warm.

Zero-downtime model deployment for ML systems needs more than a generic rolling update. It requires a strategy that accounts for GPU memory stabilization, KV-cache warming, and weight-loading time.

In this guide, we'll explore how to handle these operational details in production, building on our foundations for serving open-source LLMs with vLLM on Kubernetes.

Why Zero-Downtime Is Harder for ML Systems

A typical stateless web service passes readiness quickly. A model server, however, can look "up" before it is genuinely ready to handle production load. If you rely on a generic rolling update without ML-aware health checks, you will see latency spikes as the new replicas struggle to load weights or compile kernels.

Blue-Green Versus Rolling: The Core Tradeoff

Rolling deployments

Rolling updates replace replicas gradually. This keeps the peak capacity cost low, but it can cause partial capacity drops while new replicas warm. That is acceptable for lightweight classifiers but risky for large models.

Blue-green deployments

Blue-green keeps the current environment intact while the new environment is prepared separately. This is the stronger option for heavyweight runtimes like vLLM or Triton. It allows you to run shadow traffic validation on the "green" environment before switching users over.
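At the Kubernetes level, the simplest form of blue-green is a Service whose selector is flipped between two otherwise identical Deployments. A minimal sketch; the `color` label and port numbers are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: model-server
spec:
  selector:
    app: model-server
    color: blue          # flip this to "green" to cut all traffic over atomically
  ports:
  - port: 80
    targetPort: 8000     # the model server's HTTP port
```

The cutover is then a single selector patch, e.g. `kubectl patch service model-server -p '{"spec":{"selector":{"app":"model-server","color":"green"}}}'`, and rollback is the same patch in reverse.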

Automating Model Validation with Argo Rollouts

For production ML on Kubernetes, we recommend tools like Argo Rollouts or Flagger. These allow you to define an AnalysisTemplate that automatically checks model metrics before promoting a new version.

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-latency-check
spec:
  metrics:
  - name: p95-latency
    interval: 1m
    successCondition: result[0] < 0.2  # 200 ms; the histogram is recorded in seconds
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          histogram_quantile(0.95, sum(rate(model_inference_latency_seconds_bucket[2m])) by (le, version))
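This template can then gate promotion in a blue-green Rollout. A minimal sketch; the resource names, image, and replica count are illustrative:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: llm-server
spec:
  replicas: 4
  selector:
    matchLabels:
      app: llm-server
  template:
    metadata:
      labels:
        app: llm-server
    spec:
      containers:
      - name: vllm
        image: registry.example.com/llm-server:v2  # hypothetical image
  strategy:
    blueGreen:
      activeService: llm-active         # Service receiving production traffic
      previewService: llm-preview       # Service for shadow/validation traffic
      autoPromotionEnabled: false       # analysis (or a human) must promote
      prePromotionAnalysis:
        templates:
        - templateName: model-latency-check
      scaleDownDelaySeconds: 1800       # keep "blue" warm for 30 min after cutover
```

With `autoPromotionEnabled: false` and a `prePromotionAnalysis`, the green ReplicaSet only receives live traffic once the latency check passes, and `scaleDownDelaySeconds` implements the soak period described later in this article.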

Readiness Checks Must Reflect Model Reality

A normal HTTP health check is not enough. For ML systems, your readiness probe should validate:

  • Model artifact is fully loaded into GPU memory.
  • Tokenizer resources are available.
  • A synthetic inference passes within a specific threshold.
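These checks can be expressed as a slow-tolerant startup probe plus a readiness probe that exercises a synthetic inference. A sketch; the endpoint paths are assumptions to adapt to your server (vLLM, for example, exposes a plain `/health`):

```yaml
containers:
- name: vllm
  # ... image, ports, resources ...
  startupProbe:
    httpGet:
      path: /health          # basic liveness of the server process
      port: 8000
    periodSeconds: 10
    failureThreshold: 60     # tolerate slow weight loading: 60 x 10s = 10 min
  readinessProbe:
    httpGet:
      path: /health/ready    # hypothetical endpoint running a small synthetic inference
      port: 8000
    periodSeconds: 15
    timeoutSeconds: 5        # fail readiness if the synthetic request exceeds 5s
```

The generous `failureThreshold` on the startup probe is what prevents Kubernetes from killing a replica that is still loading multi-gigabyte weights, while the readiness probe's timeout enforces the inference-latency threshold.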

Handling In-Flight Requests Correctly

When an old replica is terminated, you must avoid interrupting long-running generations. Use preStop hooks to stop new traffic while allowing active work to complete. This is critical for LLMs where a single request might take several seconds to stream.

lifecycle:
  preStop:
    exec:
      # Hold the pod in Terminating so it is removed from Service endpoints
      # (no new traffic) before the process receives SIGTERM.
      command: ["/bin/sh", "-c", "sleep 30"]
# Give long streaming generations time to finish after SIGTERM.
terminationGracePeriodSeconds: 120

Scaling the Rollout

As you roll out new models, ensure your infrastructure can handle the temporary surge. If you are using GPU autoscaling, verify that you have enough headroom in your cluster to stand up the "green" environment without triggering node-pressure evictions.
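One common way to guarantee that headroom is the overprovisioning pattern: a low-priority placeholder Deployment that reserves GPU capacity and is preempted the moment real workloads need it. A sketch under those assumptions; the PriorityClass value and replica count are illustrative:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                    # lower than any real workload, so these pods are evicted first
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-headroom
spec:
  replicas: 2                 # enough reserved capacity to stand up "green"
  selector:
    matchLabels:
      app: gpu-headroom
  template:
    metadata:
      labels:
        app: gpu-headroom
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          limits:
            nvidia.com/gpu: 1   # each placeholder reserves one GPU
```

When the green environment schedules, the scheduler preempts these placeholders immediately instead of waiting minutes for the autoscaler to provision a new GPU node.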

The Stronger Pattern: Shadow to Blue-Green

For high-stakes models, follow this sequence:

  1. Deploy Green: Load and warm the new model version.
  2. Shadow Traffic: Send a copy of production traffic to the green environment.
  3. Analyze: Compare outputs and latency between blue and green.
  4. Cutover: Flip the switch at the gateway level (e.g., Istio VirtualService or Nginx ingress).
  5. Soak: Keep blue alive for 15-30 minutes for instant rollback.
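The gateway-level flip in step 4 can be sketched as an Istio VirtualService with weighted routing; the hostname and subset names are illustrative:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-server
spec:
  hosts:
  - model.internal.example.com
  http:
  - route:
    - destination:
        host: model-server
        subset: blue
      weight: 0              # drained after cutover, but kept warm for the soak period
    - destination:
        host: model-server
        subset: green
      weight: 100            # all production traffic to the new model
```

During the shadow phase (step 2), the same resource can keep `weight: 100` on blue and add a `mirror` destination pointing at the green subset, so green sees real traffic while its responses are discarded.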

Final Takeaway

Zero-downtime model deployment is not just a configuration setting; it's an observed, data-driven process.

The Key Lessons:

  1. Match the Strategy to the Load: Use Blue-Green for heavy models (LLMs) and Rolling for lightweight ones.
  2. Validate Behavior, Not Just Health: Use synthetic requests and shadow traffic to prove the model is ready.
  3. Graceful Draining is Mandatory: Don't kill pods while they're mid-generation.

At Resilio Tech, we specialize in building these robust deployment pipelines for enterprise AI. We ensure that your "state-of-the-art" model is backed by "state-of-the-art" reliability.

Ready to harden your deployment pipeline? Check out our MLOps services or contact us to design a zero-downtime deployment strategy for your GPU fleet.


Resilio Tech Team

Building AI infrastructure tools and sharing knowledge to help companies deploy ML systems reliably.

Published 4/10/2026