One of the most common debugging mistakes in GPU-backed model serving is assuming every memory failure is the same.
It is not.
A pod can fail because:
- the container exceeded its Kubernetes memory limit
- the model runtime hit CUDA out-of-memory on the GPU
- both happened in sequence
If you do not separate those cases, you can spend hours tuning VRAM usage while the real problem is normal system RAM, or vice versa.
This is the practical path to fixing an OOMKilled GPU pod in Kubernetes in production.
First: Kubernetes OOMKilled and CUDA OOM Are Different Failures
This distinction matters more than almost anything else.
Kubernetes OOMKilled
This usually means the Linux kernel killed the container because it exceeded the pod or container memory limit for system RAM.
Signals:
- pod status shows `OOMKilled`
- `kubectl describe pod` shows the container termination reason
- restart count increases
This is a CPU memory or host memory limit issue, even if the workload uses a GPU.
CUDA OOM
This means the process could not allocate additional GPU memory.
Signals:
- application logs show CUDA out-of-memory errors
- the pod may stay running or crash with an application-level error
- Kubernetes may not show `OOMKilled` at all
This is a VRAM issue, not a Kubernetes RAM limit issue.
The first rule of debugging out-of-memory GPU pods in Kubernetes is simple:
- `OOMKilled` usually points at system memory
- `CUDA out of memory` points at GPU memory
Do not collapse them into one category.
Why GPU Pods Still Die From Normal Memory Limits
Teams often assume that because the expensive resource is the GPU, the main memory risk is VRAM.
In practice, GPU-serving pods also consume plenty of normal memory for:
- model weights staged in CPU memory
- tokenizer state
- request buffers
- batch assembly
- framework overhead
- Python workers and multiprocessing
- pinned host memory for transfers
This is why a serving stack can have:
- 40% GPU memory free
- and still get `OOMKilled`
The pod did not die because VRAM was exhausted. It died because the container crossed its RAM limit.
Step 1: Confirm What Actually Failed
Start with the pod status, not assumptions.
```shell
kubectl describe pod llm-serving-7d8f6f9c6b-4mnl2
```
If you see:
```
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
```
you are dealing with a Kubernetes memory kill.
If the pod is not marked OOMKilled, inspect the app logs:
```shell
kubectl logs llm-serving-7d8f6f9c6b-4mnl2 --previous
```
If the runtime logs something like:
```
RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB
```
then the first issue is VRAM pressure.
This distinction should happen in the first five minutes of debugging.
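This triage decision is mechanical enough to encode, for example in alert enrichment. A minimal sketch (the function name and inputs are illustrative, not from any specific tool): it checks the container's last termination state first, then falls back to scanning the application log tail.

```python
def classify_oom(exit_code, termination_reason, log_tail):
    """Classify a GPU pod failure as Kubernetes OOM, CUDA OOM, or unknown.

    exit_code and termination_reason come from the pod's last container state;
    log_tail is the tail of `kubectl logs --previous`.
    """
    # Kernel kill for exceeding the container memory limit:
    # exit code 137 (128 + SIGKILL) with the OOMKilled termination reason.
    if termination_reason == "OOMKilled" or exit_code == 137:
        return "kubernetes-oom"
    # The runtime failed a GPU allocation; Kubernetes may not flag this at all.
    if "CUDA out of memory" in log_tail:
        return "cuda-oom"
    return "unknown"

print(classify_oom(137, "OOMKilled", ""))  # kubernetes-oom
print(classify_oom(1, "Error",
                   "RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB"))  # cuda-oom
```

The point is not the code itself but the ordering: termination state is authoritative for the Kubernetes kill, while CUDA OOM only shows up in application logs.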
Step 2: Check Both Memory Planes
For GPU pods, inspect:
- container RAM usage
- GPU memory usage
For Kubernetes memory:
```shell
kubectl top pod llm-serving-7d8f6f9c6b-4mnl2 --containers
```
For GPU memory, use your exporter or node-level tooling. A common quick check on the node is:
```shell
nvidia-smi
```
In production, you usually want this exposed through metrics instead of SSHing to nodes. But during triage, the goal is simple: compare container RAM pressure and VRAM pressure at the same time.
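For scripted triage, `nvidia-smi` can emit machine-readable output via `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader`. A small sketch of parsing one such line (the sample string below is illustrative, not captured from a real node):

```python
def parse_gpu_memory(csv_line):
    """Parse one line of `nvidia-smi --query-gpu=memory.used,memory.total
    --format=csv,noheader` output into (used_mib, total_mib)."""
    used_str, total_str = csv_line.split(",")
    used = int(used_str.strip().split()[0])    # "10240 MiB" -> 10240
    total = int(total_str.strip().split()[0])
    return used, total

# Illustrative sample line
used, total = parse_gpu_memory("10240 MiB, 40960 MiB")
print(f"VRAM used: {used}/{total} MiB")  # VRAM used: 10240/40960 MiB
```

Reading this next to `kubectl top pod` output gives you the side-by-side view of both memory planes.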
If RAM is near the pod limit while GPU memory is moderate, increase focus on:
- system memory limits
- worker count
- request batching
- tokenizer and preprocessing footprint
If VRAM is saturated while container RAM is stable, focus on:
- model size
- batch size
- concurrent sequences
- KV cache behavior
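KV cache behavior is worth quantifying, because it scales with both sequence length and concurrency. A back-of-the-envelope sketch using the standard transformer KV cache formula; the model dimensions below are hypothetical (roughly a 7B-class model with grouped-query attention):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes):
    """Per-sequence KV cache size: 2 tensors (K and V) per layer,
    each num_kv_heads * head_dim * seq_len elements."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes)
per_seq = kv_cache_bytes(32, 8, 128, 4096, 2)
print(per_seq // 2**20, "MiB per 4k-token sequence")        # 512 MiB
print(64 * per_seq // 2**30, "GiB for 64 concurrent sequences")  # 32 GiB
```

Numbers like these explain why raising concurrency can exhaust VRAM even when the weights themselves fit comfortably.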
Step 3: Inspect the Resource Limits You Actually Set
A lot of incidents come from limits that were copied from a CPU service and never revisited.
Example:
```yaml
resources:
  requests:
    cpu: "2"
    memory: "8Gi"
    nvidia.com/gpu: "1"
  limits:
    cpu: "4"
    memory: "8Gi"
    nvidia.com/gpu: "1"
```
That might be too low for a GPU-serving workload if:
- the model is loaded into CPU memory before GPU placement
- the runtime keeps large host-side buffers
- batching creates spikes in RAM usage
A common pattern is that teams size the GPU but guess at RAM.
That is why fixing OOM on GPU pods in Kubernetes often starts with correcting a misleading configuration, not rewriting the model server.
Common Causes of OOMKilled on GPU Pods
1. Loading large weights into CPU memory before transfer
Some runtimes briefly hold substantial model state in host memory during load or reload.
2. Too many workers per pod
Multiple Python workers or serving processes can duplicate runtime overhead and tokenizer memory.
3. Aggressive batching
Bigger batches can improve throughput while also increasing:
- input tensor staging
- preprocessing memory
- host-side queue state
4. Large request payloads
Prompt-heavy LLM workloads and multimodal requests can create large host-memory spikes before GPU execution even begins.
5. Memory limits equal to requests with no headroom
This leaves no room for transient spikes during warmup, model swap, or traffic bursts.
A Practical Debugging Flow
Use this order:
- Confirm whether the pod was `OOMKilled` or the app hit CUDA OOM.
- Check container RAM against the configured limit.
- Check GPU memory at the same time.
- Review recent changes in batch size, worker count, or model version.
- Look for spikes during startup, reload, or peak concurrency.
If the pod dies on startup, suspect:
- model load behavior
- duplicated workers
- low RAM limit
If the pod dies only under traffic, suspect:
- batching
- request size
- concurrency
- KV cache or host-side buffering
Prevention: Set Memory Limits for the Actual Serving Pattern
The safest approach is to size memory from measurements, not intuition.
For each model service, measure:
- startup RAM peak
- steady-state RAM
- peak RAM under expected traffic
- startup GPU memory
- steady-state GPU memory
- peak GPU memory under expected traffic
Then set limits with real headroom.
For example:
```yaml
resources:
  requests:
    cpu: "4"
    memory: "16Gi"
    nvidia.com/gpu: "1"
  limits:
    cpu: "6"
    memory: "24Gi"
    nvidia.com/gpu: "1"
```
This is not about choosing large numbers. It is about choosing numbers that match the serving behavior of the model runtime.
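Translating measurements into a limit can be mechanical. A sketch, assuming you have measured startup and traffic peaks and want roughly 30% headroom on the worst case, rounded up to whole GiB (the headroom factor is a judgment call for your workload, not a standard):

```python
import math

def recommend_memory_limit_gib(startup_peak_gib, traffic_peak_gib, headroom=1.3):
    """Limit = worst measured RAM peak plus headroom, rounded up to whole GiB."""
    worst = max(startup_peak_gib, traffic_peak_gib)
    return math.ceil(worst * headroom)

# Hypothetical measurements: 14 GiB startup peak, 17 GiB peak under load
print(recommend_memory_limit_gib(14, 17), "Gi")  # 23 Gi
```

The same shape of calculation applies to choosing requests from steady-state usage.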
Monitor GPU Memory and System Memory Separately
If you only monitor one plane, you will miss half the problem.
Track at minimum:
- container memory working set
- pod restarts
- `OOMKilled` events
- GPU memory used
- GPU utilization
- batch size
- request queue depth
In Prometheus-style terms, that usually means combining normal Kubernetes metrics with GPU exporter metrics.
Useful signals include:
- `container_memory_working_set_bytes`
- `kube_pod_container_status_last_terminated_reason`
- `DCGM_FI_DEV_FB_USED`
That mix is how you tell whether the service is CPU-memory-bound, GPU-memory-bound, or both.
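Once both planes are in one place, the classification is a comparison of two ratios. A minimal sketch (the 0.9 threshold is an illustrative choice, not a standard):

```python
def memory_bound_plane(ram_working_set, ram_limit, vram_used, vram_total,
                       threshold=0.9):
    """Classify which memory plane is under pressure from metric snapshots.

    All arguments in the same unit (e.g. GiB); threshold is the fraction of
    the limit/capacity considered 'hot'.
    """
    ram_hot = ram_working_set / ram_limit >= threshold
    vram_hot = vram_used / vram_total >= threshold
    if ram_hot and vram_hot:
        return "both"
    if ram_hot:
        return "cpu-memory-bound"
    if vram_hot:
        return "gpu-memory-bound"
    return "healthy"

# 7.6 GiB working set of an 8 GiB limit, 16 GiB of 40 GiB VRAM used
print(memory_bound_plane(7.6, 8.0, 16.0, 40.0))  # cpu-memory-bound
```

This is the case from earlier in the article: plenty of GPU memory free, yet the pod is one traffic spike away from `OOMKilled`.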
The Most Important Prevention Habit
Load test memory behavior before production.
Not just latency.
Memory behavior.
Teams often benchmark throughput and p95 latency, then discover in production that:
- startup memory is higher than expected
- one traffic shape triggers large host-side buffers
- a new model version raises RAM requirements quietly
Memory regressions are release risks. Treat them like release risks.
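One lightweight way to make memory part of the load test is to record peak RSS around the benchmark. A sketch using Python's stdlib `resource` module (Unix-only; note `ru_maxrss` is kilobytes on Linux but bytes on macOS, so interpret units per platform; the workload function below is a stand-in for the real benchmark):

```python
import resource

def peak_rss_after(workload):
    """Run workload() and return the process's peak RSS afterwards
    (ru_maxrss units: KiB on Linux, bytes on macOS)."""
    workload()
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def fake_inference_batch():
    # Stand-in for the real benchmark: allocate ~50 MB transiently
    buf = bytearray(50 * 1024 * 1024)
    return len(buf)

peak = peak_rss_after(fake_inference_batch)
print("peak RSS:", peak)
```

Tracking this number per release is enough to catch a model version that quietly raises RAM requirements before it reaches production.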
Final Takeaway
When a GPU pod fails, do not jump straight to VRAM tuning.
Separate:
- Kubernetes `OOMKilled`
- CUDA OOM
Then inspect:
- system memory limits
- GPU memory usage
- batching
- worker count
- model load behavior
That is how you fix an OOMKilled GPU pod in Kubernetes without wasting time on the wrong memory layer.