
Kubernetes OOMKilled on GPU Pods: Causes, Debugging, and Prevention

A tactical guide to debugging OOMKilled GPU pods on Kubernetes, including the difference between CUDA OOM and Kubernetes OOMKilled, memory-limit tuning for model serving, and how to monitor GPU versus system memory.

3 min read · 507 words

One of the most common debugging mistakes when running vLLM on Kubernetes is assuming every memory failure is a GPU issue. It is not. A pod can fail because its container exceeded the Kubernetes memory limit for system RAM (OOMKilled) or because the model runtime ran out of memory on the GPU itself (CUDA OOM).

If you don't separate these two, you'll spend hours tuning VRAM when the real problem is your host RAM, or vice versa. This is a critical part of capacity planning for LLM inference.

The Distinction: OOMKilled vs. CUDA OOM

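  • Kubernetes OOMKilled (Exit Code 137): The Linux kernel's OOM killer sent SIGKILL to the container (137 = 128 + signal 9) because it exceeded its limits.memory. The pod status shows the reason OOMKilled. This is a system RAM issue.
  • CUDA OOM: The application logs show a RuntimeError: CUDA out of memory, and the container status does not report OOMKilled as the reason. This is a VRAM issue on the GPU.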
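
As a rough illustration of the distinction above, the sketch below classifies each container from a pod object such as the JSON printed by kubectl get pod <name> -o json. The function name, the verdict strings, and the sample pod are hypothetical; only the status fields it reads (containerStatuses, lastState, state, terminated, reason, exitCode) come from the Kubernetes pod status.

```python
def classify_containers(pod: dict) -> dict:
    """Return {container_name: verdict} for a pod object parsed from JSON."""
    verdicts = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        # A restarted container reports its previous exit under lastState;
        # a container that has not restarted yet reports it under state.
        terminated = (
            status.get("lastState", {}).get("terminated")
            or status.get("state", {}).get("terminated")
        )
        if not terminated:
            verdicts[status["name"]] = "no-termination-recorded"
        elif terminated.get("reason") == "OOMKilled":
            # Host RAM: the container exceeded its memory limit.
            verdicts[status["name"]] = "host-ram-oomkilled"
        elif terminated.get("exitCode", 0) != 0:
            # Possibly CUDA OOM: confirm by searching the container logs.
            verdicts[status["name"]] = "app-failure-check-logs"
        else:
            verdicts[status["name"]] = "clean-exit"
    return verdicts


# Hypothetical pod status, trimmed to the fields the function reads.
example_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "inference-engine",
                "state": {"running": {}},
                "lastState": {"terminated": {"exitCode": 137, "reason": "OOMKilled"}},
            },
            {
                "name": "sidecar",
                "state": {"terminated": {"exitCode": 1, "reason": "Error"}},
                "lastState": {},
            },
        ]
    }
}

print(classify_containers(example_pod))
```

A verdict of app-failure-check-logs is only a prompt to read the logs; the pod status alone cannot confirm a CUDA OOM.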

Technical Implementation: Kubernetes Resource Configuration

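When defining your deployment, you must provide enough headroom for both host RAM and GPU memory. Kubernetes schedules nvidia.com/gpu as whole devices and does not enforce a VRAM limit, so GPU memory is governed by the serving runtime's own settings, while host RAM is governed by requests.memory and limits.memory. A common mistake is setting the memory limit too tight, leading to crashes during model loading.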

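# Recommended GPU Deployment Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference-server
spec:
  selector:
    matchLabels:
      app: llm-inference-server   # Must match the pod template labels
  template:
    metadata:
      labels:
        app: llm-inference-server
    spec:
      containers:
      - name: inference-engine
        image: vllm/vllm-openai:latest  # Pin a specific tag in production
        resources:
          requests:
            cpu: "4"
            memory: "16Gi"      # Host RAM Request (used for scheduling)
            nvidia.com/gpu: "1" # GPU Request (must equal the GPU limit)
          limits:
            cpu: "8"
            memory: "32Gi"      # Host RAM Limit (provide headroom!)
            nvidia.com/gpu: "1"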
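
To sanity-check a memory limit before deploying, a back-of-the-envelope calculation can help. The sketch below converts a Kubernetes memory quantity to bytes and subtracts an estimated loading peak. The helper names, the 14 GiB checkpoint, and the 4 GiB runtime overhead are all hypothetical, and the assumption that peak host RAM is roughly checkpoint size plus a fixed overhead is a simplification; measure your own workload before relying on it.

```python
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL_SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_memory_quantity(quantity: str) -> int:
    """Convert a quantity such as '32Gi' or '500M' to bytes.

    Handles only the common suffixes above, not the full Kubernetes
    quantity grammar (no exponents, no milli-units).
    """
    for suffix, factor in {**BINARY_SUFFIXES, **DECIMAL_SUFFIXES}.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # No suffix means plain bytes.


def loading_headroom_bytes(limit: str, checkpoint_bytes: int, overhead_bytes: int) -> int:
    """Bytes left under the limit at the assumed loading peak (negative means too tight)."""
    return parse_memory_quantity(limit) - (checkpoint_bytes + overhead_bytes)


GIB = 2**30
# Hypothetical numbers: a 14 GiB checkpoint plus 4 GiB of runtime overhead.
headroom = loading_headroom_bytes("32Gi", 14 * GIB, 4 * GIB)
print(f"Headroom at assumed loading peak: {headroom / GIB:.1f} GiB")
```

A negative result suggests the limit would be exceeded during loading under these assumptions; a small positive one leaves little room for request buffering.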

Monitoring Both Planes with Prometheus

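To prevent these issues, you need to monitor both host memory and GPU memory. A common setup for AI infrastructure observability combines three sources: cAdvisor metrics from the kubelet for container memory usage, Kube-State-Metrics for the configured limits, and the NVIDIA DCGM Exporter for GPU memory.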

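# Alert if host RAM usage is > 90% of the container's memory limit.
# The two metrics come from different exporters with different label sets,
# so aggregate both to their shared labels before dividing.
(
  sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
  /
  sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
) > 0.9

# Monitor the fraction of GPU VRAM (framebuffer) in use.
# DCGM_FI_DEV_FB_TOTAL may not be exported, depending on the exporter's metrics
# configuration; used + free approximates the total when it is missing.
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)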
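
Ratio queries like these can also be checked from a script through the Prometheus HTTP API. The sketch below builds an instant-query URL for the /api/v1/query endpoint and filters an instant-vector response by threshold. The function names, the prometheus:9090 address, and the sample response are hypothetical, and no network call is made here.

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def series_over_threshold(response: dict, threshold: float) -> list:
    """Return (labels, value) pairs from an instant-vector response above threshold."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    hits = []
    for series in response["data"]["result"]:
        # Each sample is [unix_timestamp, "value-as-string"].
        _timestamp, raw_value = series["value"]
        value = float(raw_value)
        if value > threshold:
            hits.append((series["metric"], value))
    return hits


# Hypothetical response, shaped like the JSON returned for an instant vector.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "llm-a", "container": "inference-engine"},
             "value": [1712750000.0, "0.93"]},
            {"metric": {"pod": "llm-b", "container": "inference-engine"},
             "value": [1712750000.0, "0.41"]},
        ],
    },
}

print(build_query_url("http://prometheus:9090/", "up"))
print(series_over_threshold(sample_response, 0.9))
```

In practice the response dict would come from an HTTP GET against the built URL; alerting rules remain the better place for continuous checks.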

Memory-Aware Batching in Python

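If your preprocessing logic is intensive, it can spike host RAM before data ever reaches the GPU. A simple memory check before expanding batches can help prevent OOMKilled events. Inside a container, measure against the cgroup memory limit rather than node-wide memory: psutil.virtual_memory() reports the node's RAM, which can look healthy while the container is close to its own limit.

from pathlib import Path
import psutil

# cgroup v2 files as seen inside the container; cgroup v1 uses different paths.
CGROUP_USAGE = Path("/sys/fs/cgroup/memory.current")
CGROUP_LIMIT = Path("/sys/fs/cgroup/memory.max")

def container_memory_percent():
    # Prefer the container's cgroup counters; fall back to node-wide psutil data.
    try:
        limit = CGROUP_LIMIT.read_text().strip()
        if limit != "max":  # "max" means no memory limit is set
            return 100.0 * int(CGROUP_USAGE.read_text()) / int(limit)
    except (OSError, ValueError):
        pass
    return psutil.virtual_memory().percent

def should_process_next_batch(threshold_percent=85):
    # Check memory usage against the limit to prevent OOMKilled
    percent = container_memory_percent()
    if percent > threshold_percent:
        print(f"Container memory high ({percent:.1f}%). Throttling batch...")
        return False
    return True

# In your inference loop
if should_process_next_batch():
    # process large tensor batch
    pass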

Scaling and Prevention

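As you implement GPU autoscaling, ensure your scaling triggers account for both VRAM saturation and system memory pressure. Many teams discover that new pods crash during scale-up because host memory peaks during model loading. If that peak exceeds limits.memory, the container is OOMKilled. If requests.memory is set well below the real peak, the scheduler can overpack a node, which exposes pods to eviction or the node-level OOM killer under memory pressure.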
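
One way to express a trigger that watches both planes is sketched below. The MemorySample type, the function name, and the 0.85 and 0.90 thresholds are hypothetical placeholders, not recommendations; some serving runtimes reserve most of the GPU's memory up front, so the VRAM threshold in particular needs tuning per runtime.

```python
from dataclasses import dataclass


@dataclass
class MemorySample:
    host_used_bytes: int
    host_limit_bytes: int
    gpu_used_bytes: int
    gpu_total_bytes: int


def scale_out_reasons(sample: MemorySample,
                      host_threshold: float = 0.85,
                      gpu_threshold: float = 0.90) -> list:
    """Return the memory planes whose usage fraction exceeds its threshold."""
    reasons = []
    if sample.host_used_bytes / sample.host_limit_bytes > host_threshold:
        reasons.append("host-ram")
    if sample.gpu_used_bytes / sample.gpu_total_bytes > gpu_threshold:
        reasons.append("gpu-vram")
    return reasons


GIB = 2**30
# Hypothetical sample: host RAM near its limit, VRAM half used.
sample = MemorySample(
    host_used_bytes=30 * GIB,
    host_limit_bytes=32 * GIB,
    gpu_used_bytes=40 * GIB,
    gpu_total_bytes=80 * GIB,
)
print(scale_out_reasons(sample))
```

Returning the list of planes, rather than a single boolean, keeps the cause visible in logs when a scale-out fires.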

Final Takeaway

Debugging GPU pods in Kubernetes requires a dual-plane approach: monitor and tune host RAM and GPU VRAM independently. An OOMKilled event is always a host-memory failure, typically triggered during model loading or large request buffering. GPU memory exhaustion shows up separately, as a CUDA OOM error in the application logs.

Resilio Tech specializes in hardening Kubernetes clusters for AI workloads. We help companies eliminate "flaky" GPU pods by optimizing resource limits, implementing robust observability, and ensuring your MLOps stack is resilient to memory spikes.

Tired of random GPU pod crashes? Talk to Resilio Tech for a deep-dive audit of your Kubernetes AI infrastructure.


Resilio Tech Team

Building AI infrastructure tools and sharing knowledge to help companies deploy ML systems reliably.

Article Info

Published: 4/10/2026