One of the most common debugging mistakes in vLLM on Kubernetes is assuming every memory failure is a GPU issue. It's not. A pod can fail because it exceeded its Kubernetes system memory limit (OOMKilled) or because the model runtime hit CUDA out-of-memory (OOM) on the GPU itself.
If you don't separate these two, you'll spend hours tuning VRAM when the real problem is your host RAM, or vice versa. This is a critical part of capacity planning for LLM inference.
The Distinction: OOMKilled vs. CUDA OOM
- Kubernetes OOMKilled (Exit Code 137): The Linux kernel killed the container because it exceeded its limits.memory. This is a system RAM issue.
- CUDA OOM: The application logs show a RuntimeError: CUDA out of memory. This is a VRAM issue on the GPU.
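As a rough illustration of the distinction above, the following sketch classifies a failed pod from two inputs: the pod object as returned by `kubectl get pod <name> -o json`, and the container's log text. The function name `classify_memory_failure` is hypothetical, and the check assumes the runtime's error message contains the substring "CUDA out of memory".

```python
def classify_memory_failure(pod: dict, logs: str) -> str:
    """Return 'OOMKilled', 'CUDA OOM', or 'unknown' for a failed pod."""
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # Look at the current termination record and the previous one
        # (lastState is populated after the container has restarted).
        for state_key in ("state", "lastState"):
            terminated = (cs.get(state_key) or {}).get("terminated") or {}
            # Exit code 137 accompanies OOMKilled, but on its own it only
            # means SIGKILL, so the reason field is the deciding signal.
            if terminated.get("reason") == "OOMKilled":
                return "OOMKilled"
    # Assumption: the runtime logs the usual CUDA out-of-memory message.
    if "CUDA out of memory" in logs:
        return "CUDA OOM"
    return "unknown"


if __name__ == "__main__":
    example_pod = {
        "status": {
            "containerStatuses": [
                {"lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
            ]
        }
    }
    print(classify_memory_failure(example_pod, logs=""))
```

Checking the pod status first matters because a container killed by the kernel usually has no chance to log anything, so an empty log plus an OOMKilled reason points at host RAM.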
Technical Implementation: Kubernetes Resource Configuration
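When defining your deployment, you must provide enough headroom for both host (CPU) memory and GPU memory. The memory request and limit in the pod spec govern host RAM only; the GPU's VRAM is fixed by the device you are allocated. A common mistake is setting memory limits too tight, leading to OOMKilled crashes during model loading, when weights are often staged in host RAM before being copied to the GPU.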
# Recommended GPU Deployment Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-server
spec:
  selector:
    matchLabels:
      app: llm-inference-server   # selector is required by apps/v1; label is illustrative
  template:
    metadata:
      labels:
        app: llm-inference-server
    spec:
      containers:
        - name: inference-engine
          image: vllm/vllm-openai:latest
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"        # Host RAM Request
              nvidia.com/gpu: "1"   # GPU Request
            limits:
              cpu: "8"
              memory: "32Gi"        # Host RAM Limit (provide headroom!)
              nvidia.com/gpu: "1"   # GPU request and limit must match
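To make "headroom" concrete, here is a minimal sketch that converts a Kubernetes memory quantity such as "32Gi" into bytes and checks it against a peak you have measured yourself. The helper names `parse_quantity` and `has_headroom`, the 20% default margin, and the example peak value are all assumptions for illustration; only the common binary and decimal suffixes are handled.

```python
# Binary (power-of-two) and decimal suffixes used by Kubernetes quantities.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(quantity: str) -> int:
    """Convert a memory quantity like '16Gi' or '500M' to bytes."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain byte count, no suffix


def has_headroom(limit: str, observed_peak_bytes: int, margin: float = 0.2) -> bool:
    """True if the limit covers the observed peak plus a safety margin."""
    return observed_peak_bytes * (1 + margin) <= parse_quantity(limit)


if __name__ == "__main__":
    # Hypothetical peak of 24 GiB measured during model loading.
    print(has_headroom("32Gi", 24 * 2**30))
```

The peak should come from a real measurement of the loading phase, for example the container's working-set memory during a test rollout, rather than from the steady-state figure.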
Monitoring Both Planes with Prometheus
To prevent these issues, you need to monitor both host memory and GPU memory. A common setup for AI infrastructure observability combines the NVIDIA DCGM Exporter for GPU metrics, the kubelet's cAdvisor metrics for container memory usage, and kube-state-metrics for the configured limits.
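# Alert if a container's working-set memory is > 90% of its memory limit.
# Both sides are aggregated to the same labels so the division can match series.
sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
  / sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
  > 0.9
# GPU VRAM in use as a fraction of total framebuffer memory
# (DCGM_FI_DEV_FB_TOTAL may need to be enabled in the exporter's metrics list)
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_TOTAL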
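If you want to act on the host-memory ratio outside of an alert rule, the result of an instant query against the Prometheus HTTP API (`/api/v1/query`) can be parsed as sketched below. The function name `containers_over_threshold` and the sample payload are hypothetical; the sketch assumes the query result carries `namespace` and `pod` labels, and it does no network I/O so that fetching the JSON is left to whatever client you already use.

```python
def containers_over_threshold(response: dict) -> list:
    """Extract (namespace, pod, ratio) tuples from an instant-vector response,
    sorted with the highest memory ratio first."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    rows = []
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        _timestamp, value = sample["value"]  # value arrives as a string
        rows.append((labels.get("namespace", ""), labels.get("pod", ""), float(value)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    # Hand-written payload in the shape of an instant-vector result.
    sample_response = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"namespace": "ml", "pod": "llm-a"}, "value": [1700000000, "0.93"]},
                {"metric": {"namespace": "ml", "pod": "llm-b"}, "value": [1700000000, "0.97"]},
            ],
        },
    }
    for namespace, pod, ratio in containers_over_threshold(sample_response):
        print(f"{namespace}/{pod}: {ratio:.0%} of memory limit")
```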
Memory-Aware Batching in Python
If your preprocessing logic is intensive, it can spike host RAM before data ever reaches the GPU. Implementing a simple memory check before expanding batches can prevent OOMKilled events.
import psutil


def should_process_next_batch(threshold_percent=85):
    # psutil reports node-wide memory, not this container's cgroup usage,
    # so treat this as a coarse early-warning signal for OOMKilled risk.
    mem = psutil.virtual_memory()
    if mem.percent > threshold_percent:
        print(f"Host memory high ({mem.percent}%). Throttling batch...")
        return False
    return True


# In your inference loop
if should_process_next_batch():
    # process large tensor batch
    pass
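Because the OOM killer enforces the container's own limit, a check against the cgroup files can be closer to what actually triggers OOMKilled than a node-wide reading. The sketch below is one way to do that, assuming the standard cgroup v2 files (`memory.current`, `memory.max`) or their cgroup v1 equivalents are mounted under `/sys/fs/cgroup`. The function names and the `root` parameter (included so the logic can be exercised against a temporary directory) are hypothetical.

```python
from pathlib import Path
from typing import Optional


def cgroup_memory_fraction(root: str = "/sys/fs/cgroup") -> Optional[float]:
    """Return usage / limit for this container, or None if no limit is visible."""
    base = Path(root)
    candidates = [
        # cgroup v2 layout
        (base / "memory.current", base / "memory.max"),
        # cgroup v1 layout
        (base / "memory" / "memory.usage_in_bytes",
         base / "memory" / "memory.limit_in_bytes"),
    ]
    for usage_path, limit_path in candidates:
        if usage_path.exists() and limit_path.exists():
            limit_raw = limit_path.read_text().strip()
            if limit_raw == "max":  # cgroup v2 marker for "no limit set"
                return None
            # Note: this usage figure includes page cache, so it can read
            # higher than the working-set metric used in dashboards.
            return int(usage_path.read_text().strip()) / int(limit_raw)
    return None


def should_process_next_batch_cgroup(threshold: float = 0.85,
                                     root: str = "/sys/fs/cgroup") -> bool:
    """Allow the next batch unless container memory is above the threshold."""
    fraction = cgroup_memory_fraction(root)
    return fraction is None or fraction <= threshold
```

When no limit can be read, the sketch allows the batch; falling back to the node-wide check would be a reasonable alternative.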
Scaling and Prevention
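As you implement GPU autoscaling, ensure your scaling triggers account for both VRAM saturation and system memory pressure. Many teams discover that their pods crash during scale-up because the model-loading phase needs more host RAM than the steady state: the scheduler places pods based on memory requests, so a load-time spike above the request can run into the pod's limit or into memory pressure on a tightly packed node.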
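A minimal sketch of what a dual-signal trigger could look like follows. Both function names and the threshold defaults are hypothetical, and the inputs are assumed to be fractions and byte counts you already collect from your metrics pipeline.

```python
def should_scale_out(vram_fraction: float,
                     host_ram_fraction: float,
                     vram_threshold: float = 0.90,
                     host_ram_threshold: float = 0.85) -> bool:
    """Trigger a scale-out when either memory plane is saturated."""
    return vram_fraction >= vram_threshold or host_ram_fraction >= host_ram_threshold


def load_peak_fits_request(load_peak_bytes: int, memory_request_bytes: int) -> bool:
    """Check that a new replica's model-loading peak fits inside the memory
    request the scheduler reserves for it."""
    return load_peak_bytes <= memory_request_bytes


if __name__ == "__main__":
    # VRAM looks fine, but host RAM is under pressure: still scale out.
    print(should_scale_out(vram_fraction=0.55, host_ram_fraction=0.91))
    # Hypothetical 20 GiB loading peak against a 16 GiB request.
    print(load_peak_fits_request(20 * 2**30, 16 * 2**30))
```

Treating the two signals with a logical OR reflects the point above: either plane can take a pod down on its own.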
Final Takeaway
Debugging GPU pods in Kubernetes requires a dual-plane approach. You must monitor and tune for both host RAM and GPU VRAM independently. OOMKilled events come from host RAM exhaustion, typically during model loading or large request buffering, not from the GPU running out of VRAM.
Resilio Tech specializes in hardening Kubernetes clusters for AI workloads. We help companies eliminate "flaky" GPU pods by optimizing resource limits, implementing robust observability, and ensuring your MLOps stack is resilient to memory spikes.