Serving Open-Source LLMs with vLLM on Kubernetes
A practical guide to deploying open-source LLMs with vLLM on Kubernetes — covering GPU sizing, request routing, autoscaling, batching, and safe rollouts.
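For orientation, here is a minimal sketch of the kind of manifest such a deployment starts from — the image tag, model name, and GPU count below are illustrative assumptions, not values from the guide:

```yaml
# Minimal vLLM Deployment sketch -- model, replica count, and
# resource sizes are placeholders, not recommendations.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "mistralai/Mistral-7B-Instruct-v0.2"]
          ports:
            - containerPort: 8000   # OpenAI-compatible API endpoint
          resources:
            limits:
              nvidia.com/gpu: 1     # schedule onto a GPU node
```

The real guide goes into sizing that GPU request per model and wiring routing and rollouts on top of this.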
Posts authored by the Resilio Tech Team.
How to autoscale GPU-backed inference clusters without wasting money, including queue-based scaling, warm capacity, and right-sizing by workload profile.
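As a taste of the queue-based approach, here is a toy replica calculator — the throughput figure and the min/max bounds are made-up numbers, not recommendations from the post:

```python
import math

def desired_replicas(queue_depth, per_replica_throughput,
                     min_replicas=1, max_replicas=16):
    """Queue-based scaling: request enough replicas to drain the
    current backlog in roughly one scaling interval, clamped to a
    floor (warm capacity) and a ceiling (budget guard)."""
    if queue_depth <= 0:
        return min_replicas
    target = math.ceil(queue_depth / per_replica_throughput)
    return max(min_replicas, min(max_replicas, target))

# 100 queued requests, each replica drains 10 per interval -> 10 replicas.
print(desired_replicas(100, 10))  # 10
```

The floor keeps warm capacity for latency; the ceiling is what stops a traffic spike from becoming a billing spike.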
Practical strategies for reducing GPU infrastructure costs — covering spot instances, GPU scheduling, model optimization, and right-sizing — without degrading inference quality.
How to serve many ML models on shared infrastructure without noisy-neighbor problems, unpredictable latency, or runaway GPU spend.
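The noisy-neighbor half of that problem is commonly attacked with per-tenant rate limiting; a minimal token-bucket sketch (the class name and rates here are our own, hypothetical choices):

```python
import time
from collections import defaultdict

class TenantRateLimiter:
    """Per-tenant token bucket: caps each tenant's request rate so
    one noisy tenant cannot starve the shared GPU pool. The default
    rate and burst values are illustrative assumptions."""

    def __init__(self, rate_per_s=5.0, burst=10.0):
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = defaultdict(lambda: burst)   # start with a full bucket
        self.last = defaultdict(time.monotonic)    # last refill time per tenant

    def allow(self, tenant):
        """Refill this tenant's bucket for elapsed time, then try
        to spend one token; reject the request if the bucket is dry."""
        now = time.monotonic()
        elapsed = now - self.last[tenant]
        self.last[tenant] = now
        self.tokens[tenant] = min(self.burst,
                                  self.tokens[tenant] + elapsed * self.rate)
        if self.tokens[tenant] >= 1.0:
            self.tokens[tenant] -= 1.0
            return True
        return False
```

Because every tenant gets its own bucket, a burst from one tenant exhausts only that tenant's tokens; everyone else's latency stays predictable.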
A practical guide to batching LLM inference workloads, including static batching, dynamic batching, queue controls, and when higher throughput starts hurting latency.
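The dynamic-batching idea from that post fits in a few lines: block for the first request, then keep collecting until the batch is full or a wait budget runs out (the function name and limits are illustrative, not from the post):

```python
import queue
import time

def collect_batch(q, max_batch_size=8, max_wait_s=0.05):
    """Dynamic batching: wait for the first request, then keep
    pulling more until the batch is full or the wait budget expires."""
    batch = [q.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # wait budget spent: ship a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived in time
    return batch

# Usage: 20 queued requests drain as full batches plus one remainder.
q = queue.Queue()
for i in range(20):
    q.put(f"req-{i}")
sizes = []
while not q.empty():
    sizes.append(len(collect_batch(q, max_batch_size=8, max_wait_s=0.01)))
print(sizes)  # [8, 8, 4]
```

`max_wait_s` is exactly the throughput/latency dial the post discusses: raise it and batches fill up (better GPU utilization), lower it and requests ship sooner (better tail latency).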
How to use Terraform to provision AI infrastructure safely, with practical guidance on GPU node pools, registries, pipeline dependencies, and avoiding drift across environments.
How to run ML training workloads on spot or preemptible capacity safely, with checkpointing, interruption handling, retry policy, and pipeline design for fault tolerance.
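The core of that pattern — atomic checkpoints plus a termination hook — fits in a short sketch; the file path, step counts, and checkpoint interval here are hypothetical:

```python
import json
import os
import signal
import sys

CKPT = "train_ckpt.json"  # hypothetical checkpoint path

def load_checkpoint():
    """Resume from the last saved step, or start from scratch."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["step"]
    return 0

def save_checkpoint(step):
    """Write to a temp file, then rename: an interruption mid-write
    can never leave a half-written checkpoint behind."""
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, CKPT)  # atomic on POSIX filesystems

def train(total_steps=100, ckpt_every=10):
    step = load_checkpoint()  # pick up where the last run stopped
    # Spot/preemptible VMs typically deliver the termination notice as
    # SIGTERM: checkpoint immediately, exit, and let the retry policy resume.
    signal.signal(signal.SIGTERM,
                  lambda *_: (save_checkpoint(step), sys.exit(0)))
    while step < total_steps:
        step += 1  # one training step (model update elided)
        if step % ckpt_every == 0:
            save_checkpoint(step)
    return step
```

With this shape, a preemption costs at most `ckpt_every` steps of repeated work — which is the trade-off against checkpointing overhead that the post walks through.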
3/30/2026 • 6 min read