There is a point where "just use a bigger GPU" stops being a strategy. Whether it's because a model like Llama-3-70B simply won't fit in 80GB of VRAM with a decent KV cache, or because your capacity planning requires higher throughput than one machine can provide, you will eventually face the transition to distributed inference.
Scaling vLLM on Kubernetes from a single pod to a multi-node cluster is a significant architectural leap. It introduces networking bottlenecks, collective communication overhead (NCCL), and complex failure domains.
Tensor Parallelism vs. Pipeline Parallelism
When you scale ML inference beyond a single GPU, you have two primary levers:
1. Tensor Parallelism (TP)
TP splits individual layers across multiple GPUs. This is the gold standard for low-latency inference on large models because it parallelizes the actual matrix multiplications. However, it requires extremely high-bandwidth interconnects (e.g., NVLink within a node), because every layer triggers a collective operation across the participating GPUs.
# Example: vLLM with Tensor Parallelism across 4 GPUs
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.9
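The arithmetic behind `--tensor-parallel-size 4` is worth spelling out. A back-of-envelope sketch (fp16 weights only; the KV cache and activations consume additional memory on top of this):

```shell
# Rough memory math for Llama-3-70B in fp16, split 4 ways
params=70          # parameter count, in billions
bytes_per_param=2  # fp16 = 2 bytes per parameter
tp=4               # tensor-parallel degree

total_gb=$(( params * bytes_per_param ))  # ~140 GB of weights total
per_gpu_gb=$(( total_gb / tp ))           # ~35 GB of weights per GPU
echo "${per_gpu_gb} GB of weights per GPU"
```

At ~35 GB of weights per 80 GB card, roughly half of each GPU remains for the KV cache and activations, which is why TP=4 is a common starting point for this model class.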
2. Pipeline Parallelism (PP)
PP splits the model layers into stages and places each stage on a different device or node. While it helps with memory constraints, it introduces "pipeline bubbles" where GPUs sit idle waiting for the next stage. This is often combined with TP for ultra-large models.
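For models too large even for a single node's combined VRAM, the two levers compose: TP within a node, PP across nodes. A sketch of what a combined launch might look like, assuming 2 nodes with 4 GPUs each (the flag values here are illustrative, not a tuned configuration):

```shell
# Hypothetical: 2 nodes x 4 GPUs
# - TP=4 splits each layer across the 4 GPUs within a node (fast NVLink)
# - PP=2 splits the layer stack into 2 stages, one per node (slower network)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2
```

Keeping TP inside the node and pushing only PP across the network is the usual pattern, since PP exchanges comparatively small activations between stages rather than the per-layer all-reduce traffic TP generates.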
Orchestrating Multi-Node with Ray
For true multi-node serving (not just multi-GPU on one node), tools like Ray are essential. Ray handles worker placement and process orchestration across nodes, while NCCL provides the collective communication that lets the GPUs exchange tensors over the network.
Technical Snippet: RayService on Kubernetes
Using the KubeRay operator is the modern way to manage multi-node model serving across a cluster.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llama-3-dist
spec:
  rayClusterConfig:
    workerGroupSpecs:
      - replicas: 2 # Two worker nodes, each with GPUs
        groupName: gpu-group
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                resources:
                  limits:
                    nvidia.com/gpu: "4"
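Under the hood, a RayService like the one above bootstraps a Ray cluster that vLLM's distributed executor attaches to. The manual equivalent, useful for debugging outside Kubernetes, looks roughly like this sketch (`<head-ip>` is a placeholder for your head node's address, and the parallelism values are illustrative):

```shell
# On the head node: start the Ray cluster
ray start --head --port=6379

# On each worker node: join the cluster
ray start --address=<head-ip>:6379

# From the head node: launch vLLM with the Ray executor,
# which places workers across the joined nodes
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B \
  --tensor-parallel-size 4 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray
```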
The Networking Bottleneck: NCCL and RoCE
In a multi-node inference setup, your network becomes the backplane. If you are running over standard 10GbE without RoCE (RDMA over Converged Ethernet), cross-node tensor parallelism will likely be slower than a single-node setup, because every layer's all-reduce pays the full network latency.
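When diagnosing cross-node throughput, a few NCCL environment variables are usually the first lever. The interface name below is an assumption for illustration; substitute the name of your actual high-speed NIC:

```shell
# Common NCCL knobs for multi-node runs (values are environment-specific)
export NCCL_SOCKET_IFNAME=eth0  # pin NCCL to the intended NIC, not a slow mgmt interface
export NCCL_DEBUG=INFO          # log which transport NCCL actually selected (IB/RoCE vs. plain TCP)
export NCCL_IB_DISABLE=0        # allow RDMA transports when the hardware supports them
```

The `NCCL_DEBUG=INFO` output is the key sanity check: if it reports a plain socket transport on a cluster you believed was RDMA-capable, the fabric is misconfigured and no amount of model-level tuning will recover the lost bandwidth.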
Engineering leaders must ensure that GPU autoscaling policies account for node locality—keeping participating GPUs as close as possible in the network topology.
Final Takeaway
Moving from single-GPU to multi-node inference is an inevitable step for any organization scaling high-performance AI. However, it is not a "magic button." Success requires a deep understanding of interconnect bandwidth, model partitioning strategies, and robust orchestration via tools like Ray and vLLM.
Resilio Tech specializes in the "hard" parts of distributed inference. We help companies architect multi-node clusters that actually deliver on the promise of higher throughput, rather than just adding latency. From NCCL tuning to KubeRay implementation, we ensure your largest models run with enterprise-grade stability.
Outgrowing your single-node setup? Contact Resilio Tech for a distributed inference roadmap and performance audit.