Serving Open-Source LLMs with vLLM on Kubernetes
A practical guide to deploying open-source LLMs with vLLM on Kubernetes — covering GPU sizing, request routing, autoscaling, batching, and safe rollouts.
Authored by the Resilio Tech Team.
3/30/2026 • 6 min read
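As a starting point for the deployment side of the guide, here is a minimal sketch of a vLLM serving setup on Kubernetes: a single-replica Deployment exposing vLLM's OpenAI-compatible server behind a Service. The model name, image tag, namespace-level details, and resource figures are illustrative assumptions, not values from this guide; adjust GPU count and `--max-model-len` to your hardware.

```yaml
# Sketch only: model, image tag, and resource sizes are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # official vLLM serving image
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical model choice
            - "--max-model-len"
            - "8192"
          ports:
            - containerPort: 8000          # vLLM's default API port
          resources:
            limits:
              nvidia.com/gpu: "1"          # requires the NVIDIA device plugin
          readinessProbe:
            httpGet:
              path: /health                # vLLM health endpoint
              port: 8000
            initialDelaySeconds: 60        # allow time for model weights to load
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama
spec:
  selector:
    app: vllm-llama
  ports:
    - port: 80
      targetPort: 8000
```

The generous `initialDelaySeconds` matters in practice: large models can take minutes to load, and a probe that fires too early will restart the pod before it ever becomes ready.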