Low GPU utilization usually gets blamed on the wrong layer.
Teams assume the model is too small, the hardware is oversized, or the cloud bill is just the price of doing ML. In practice, low GPU utilization in ML workloads is often a pipeline problem. The GPU is waiting on something else:
- CPU preprocessing
- tokenization
- synchronous input loading
- tiny batches
- framework overhead
- network or storage delays
That is why debugging GPU utilization for inference should start with one rule:
- do not optimize the GPU first; prove what the GPU is waiting on
If a GPU is sitting at 20 percent, the question is not “how do I make CUDA faster?” The question is “what keeps the device idle between kernels?”
This post gives you a practical debugging flow using nvidia-smi, system metrics, and workload profiling so you can find the real bottleneck before making expensive hardware or architecture changes.
Step 1: Confirm That the GPU Is Actually Underutilized
Do not rely on one screenshot from a dashboard.
You want a time series over a representative window. Start with nvidia-smi at short intervals:
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total,power.draw --format=csv -l 1
What to look for:
- low utilization.gpu with low memory usage: probably a small model, small batches, or not much work
- low utilization.gpu with high memory usage: the model is resident, but kernels are not running consistently
- spiky utilization pattern: the GPU is being fed in bursts, then waiting
- high memory utilization but low compute: often batch sizing, data stalls, or inefficient kernels
This matters because “20 percent utilization” can mean different things:
- the workload is naturally bursty
- one request path is too small to saturate the card
- the model server is starved by the CPU
- the input pipeline cannot keep up
If you skip that distinction, you start tuning blind.
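If you redirect the command above to a file (for example gpu_log.csv), a short script can summarize how busy the device really was over the window. This is a minimal sketch, not part of any standard tooling: the column name follows the query field (utilization.gpu [%]), so adjust it if your driver formats headers differently, and the 10 percent "busy" cutoff is an arbitrary choice.

import csv

BUSY_THRESHOLD = 10  # percent; below this we treat the GPU as effectively idle (arbitrary cutoff)

samples = []
with open("gpu_log.csv") as f:
    for row in csv.DictReader(f, skipinitialspace=True):
        # Header follows the --query-gpu field; values look like "23 %"
        value = row["utilization.gpu [%]"].strip().rstrip("%").strip()
        samples.append(float(value))

if samples:
    avg = sum(samples) / len(samples)
    busy = sum(1 for u in samples if u > BUSY_THRESHOLD) / len(samples)
    print(f"samples: {len(samples)}")
    print(f"average utilization: {avg:.1f}%")
    print(f"time above {BUSY_THRESHOLD}% (rough duty cycle): {busy:.0%}")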
Step 2: Check Whether the GPU Is Waiting on the CPU
This is one of the most common causes of an underutilized GPU in machine learning systems.
The GPU can be fine while the host is overloaded by:
- tokenization
- image transforms
- feature normalization
- JSON parsing
- Python overhead
- synchronous request handling
Look at host-side signals at the same time as GPU metrics:
top
vmstat 1
iostat -xz 1
In containerized environments, also check pod CPU throttling and per-container usage. A very common pattern is:
- GPU utilization stays low
- one or two CPU cores are pegged
- request queue grows
- throughput plateaus
That usually means your “GPU problem” is a preprocessing problem.
For inference systems, tokenization and request formatting are frequent culprits. For training jobs, image augmentation or data decoding often dominates.
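To watch both sides at once, a rough sketch like the one below prints per-core CPU load next to GPU utilization. It assumes psutil and the NVIDIA management library bindings (pynvml) are installed; neither is required, it is just a convenient way to see the "one core pegged, GPU idle" pattern in a single stream.

import psutil   # assumed installed: pip install psutil
import pynvml   # assumed installed: pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

for _ in range(30):  # roughly 30 seconds of samples
    gpu_util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    cores = psutil.cpu_percent(interval=1.0, percpu=True)
    pegged = sum(1 for c in cores if c > 90)
    print(f"gpu={gpu_util:3d}%  max_core={max(cores):5.1f}%  cores_over_90={pegged}")

pynvml.nvmlShutdown()

If gpu stays low while max_core sits near 100 percent, the bottleneck is on the host, not the device.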
Step 3: Identify Preprocessing Bottlenecks
If the CPU is suspicious, profile preprocessing directly.
Examples of expensive stages:
- resizing images one at a time in Python
- converting formats on every request
- per-record feature joins
- tokenization done synchronously in the web process
- reading compressed inputs from slow object storage
A fast way to validate this is to time the stages around the actual GPU call:
import time

def predict(request):
    t0 = time.perf_counter()
    features = preprocess(request)
    t1 = time.perf_counter()
    outputs = model_infer(features)
    t2 = time.perf_counter()
    result = postprocess(outputs)
    t3 = time.perf_counter()
    return {
        "result": result,
        "timing_ms": {
            "preprocess": (t1 - t0) * 1000,
            "infer": (t2 - t1) * 1000,
            "postprocess": (t3 - t2) * 1000,
            "total": (t3 - t0) * 1000,
        },
    }
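Logging timing_ms for a sample of real traffic is usually enough to show where the time goes; you do not need a profiler to see the split.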
If preprocessing is taking longer than inference, the GPU is not the bottleneck even if the workload feels “slow.”
Fixes usually include:
- moving transforms into compiled libraries
- batching preprocessing instead of doing it per item
- caching reusable intermediate outputs
- tokenizing asynchronously
- precomputing stable features upstream
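As one illustration of the asynchronous tokenization item above, here is a minimal sketch that moves a CPU-bound tokenize step onto a worker pool so the request handler is not blocked. tokenize_batch, the pool size, and handle_request are placeholders, not a specific framework's API.

import asyncio
from concurrent.futures import ThreadPoolExecutor

def tokenize_batch(texts):
    # Placeholder for a real tokenizer call; the point is that it is CPU-bound
    return [text.lower().split() for text in texts]

# A small pool keeps tokenization off the request path without oversubscribing the host
tokenizer_pool = ThreadPoolExecutor(max_workers=4)

async def handle_request(texts):
    loop = asyncio.get_running_loop()
    # Offload the CPU-bound step so the server keeps accepting and batching requests
    return await loop.run_in_executor(tokenizer_pool, tokenize_batch, texts)

if __name__ == "__main__":
    print(asyncio.run(handle_request(["GPU utilization", "is a pipeline problem"])))

With a tokenizer that releases the GIL, a thread pool is enough; for pure-Python tokenization the same pattern works with a ProcessPoolExecutor.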
If you are serving LLMs, this often overlaps with the latency problems described in Why Your LLM Responses Are Slow, but the focus here is utilization: the device cannot stay busy if the host cannot prepare work fast enough.
Step 4: Tune Batch Size Before Buying More GPUs
Small batches are one of the fastest ways to waste expensive hardware.
If each request reaches the GPU alone, kernel launches, memory movement, and framework overhead dominate too much of the total path. You end up paying for GPU residency without giving the device enough parallel work.
Start by measuring throughput and latency at multiple batch sizes:
1, 2, 4, 8, 16, 32
Track:
- GPU utilization
- latency P50/P95
- memory usage
- requests per second or items per second
The right batch size is not “as large as possible.” It is the largest size that improves throughput without breaking latency or memory constraints.
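A sweep over those sizes does not need special tooling; timing a fixed number of items at each batch size is enough to see the curve. This is a sketch, with model_infer standing in for your real inference call and random tensors standing in for real inputs.

import time

import torch

def model_infer(batch):
    # Placeholder for your real inference call
    return batch @ batch.T

def sweep(batch_sizes=(1, 2, 4, 8, 16, 32), items=1024, dim=512):
    for bs in batch_sizes:
        batches = [torch.randn(bs, dim) for _ in range(items // bs)]
        start = time.perf_counter()
        for batch in batches:
            model_infer(batch)
        # If the model runs on the GPU, call torch.cuda.synchronize() here before reading the clock
        elapsed = time.perf_counter() - start
        count = len(batches) * bs
        print(f"batch={bs:3d}  items/s={count / elapsed:9.1f}  latency/item={elapsed / count * 1000:.3f} ms")

if __name__ == "__main__":
    sweep()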
Typical signs you are under-batched:
- GPU utilization stays low even under demand
- memory headroom is still large
- throughput increases sharply as batch size rises
- CPU overhead per request is high relative to compute time
Typical signs you are over-batched:
- queueing delay jumps
- tail latency gets much worse
- memory pressure rises sharply
- throughput flattens instead of improving
For more on the throughput-latency tradeoff, see Batching Strategies for LLM Inference.
Step 5: Fix the Input Pipeline With Async Loading
In training and offline inference, low GPU utilization often comes from the data loader, not the model runtime.
Classic symptoms:
- the GPU runs briefly, then waits
- disk or network IO is uneven
- workers spend time decoding or fetching
- training step time has long gaps
This is where asynchronous loading, prefetching, and pinned memory matter.
Example in PyTorch:
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,
    pin_memory=True,
    persistent_workers=True,
    prefetch_factor=4,
)
What these settings help with:
- num_workers: parallelize input decoding and preparation
- pin_memory: improve host-to-device transfer behavior
- persistent_workers: avoid repeated worker startup cost
- prefetch_factor: keep future batches ready
This is not magic. You still need to test whether storage, decoding, or network fetch is the limiting factor. But many teams leave the defaults untouched and then wonder why the GPU spends half its time waiting for the next batch.
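One direct way to test it is to measure how long each step waits for the next batch versus how long the step itself takes. A rough sketch, assuming the loader above and a train_step function you already have (for accurate GPU numbers, synchronize inside train_step before it returns):

import time

def profile_loader(loader, train_step, max_steps=100):
    wait_total, compute_total = 0.0, 0.0
    it = iter(loader)
    for _ in range(max_steps):
        t0 = time.perf_counter()
        try:
            batch = next(it)      # time spent waiting on the input pipeline
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)         # time spent in forward/backward/optimizer
        t2 = time.perf_counter()
        wait_total += t1 - t0
        compute_total += t2 - t1
    total = wait_total + compute_total
    if total > 0:
        print(f"waiting on data: {wait_total / total:.0%} of step time")
        print(f"compute:         {compute_total / total:.0%} of step time")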
For Kubernetes workloads, also check whether the node has enough local disk, CPU, and network bandwidth to support the loader. A powerful GPU on a weak node profile is a common waste pattern.
Step 6: Verify That the Model Runtime Is Efficient
Sometimes the pipeline is fine and the runtime is the problem.
Common issues:
- eager execution where a compiled path would help
- unsupported operator patterns
- no mixed precision
- suboptimal serving runtime
- poor kernel fusion
- too much framework overhead around small models
Depending on the workload, useful levers include:
- FP16 or BF16 inference
- quantization
- TorchScript or torch.compile
- TensorRT or ONNX Runtime
- vLLM or TensorRT-LLM for LLM serving
This matters most when you have already confirmed:
- preprocessing is not dominant
- input loading is healthy
- batching is reasonable
- the device still does not stay busy
At that point, model runtime optimization is a credible next step. Before that, it is often premature.
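When you do reach that point, this is roughly what two of the levers above look like in PyTorch: BF16 autocast plus torch.compile. The model and input here are placeholders, a recent PyTorch is assumed, and whether it helps depends entirely on the model and the hardware.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder model and input; substitute your own
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10),
).to(device).eval()

compiled = torch.compile(model)  # remove if torch.compile is unavailable in your environment

x = torch.randn(32, 512, device=device)

with torch.inference_mode():
    # autocast runs supported ops in bfloat16 where the hardware allows it
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        out = compiled(x)

print(out.shape)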
Step 7: Separate Utilization Problems From Sizing Problems
Not every low-utilization number is a bug.
Sometimes the GPU is simply too large for the workload. For example:
- a small classifier on an oversized card
- low-QPS product traffic on a dedicated A100
- bursty internal jobs on always-on capacity
That is not a tuning issue. That is a sizing issue.
The distinction matters:
- if the GPU is idle because the pipeline is stalled, fix the system
- if the GPU is idle because the workload is too small, right-size the hardware or consolidate traffic
This is where utilization debugging connects directly to cost work. Teams often chase runtime micro-optimizations when the better move is simply moving the workload to a smaller card or sharing infrastructure more effectively.
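The arithmetic behind that tradeoff is worth writing down before any migration. The hourly rate and peak throughput below are illustrative assumptions, not quoted prices or benchmarks.

# Illustrative only: the hourly rate and peak throughput are assumptions, not quotes or benchmarks
HOURLY_RATE = 2.00        # dollars per GPU-hour
PEAK_THROUGHPUT = 100.0   # requests per second the card could sustain if kept busy

def cost_per_million(utilization):
    served_per_hour = PEAK_THROUGHPUT * utilization * 3600
    cost = HOURLY_RATE / served_per_hour * 1_000_000
    print(f"utilization {utilization:.0%}: ${cost:.2f} per million requests")

cost_per_million(0.20)  # mostly idle card
cost_per_million(0.90)  # same card kept busy: roughly 4.5x cheaper per request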
A Practical Debugging Order
If you want a fast field checklist, use this order:
- Measure utilization.gpu, memory usage, and duty cycle with nvidia-smi.
- Compare GPU behavior with CPU, disk, and network metrics.
- Time preprocessing, inference, and postprocessing separately.
- Test larger batch sizes within latency and VRAM limits.
- Tune async data loading and prefetch behavior.
- Optimize the model runtime only after the feed path is healthy.
- Revisit hardware sizing if utilization is still naturally low.
That order prevents a lot of wasted effort.
Final Takeaway
Most low GPU utilization problems in ML are not fixed by buying bigger GPUs or rewriting the model first. They are fixed by proving what the device is waiting on.
If you are trying to debug GPU utilization for inference, focus on the full path:
- host preprocessing
- batch formation
- data loading
- runtime efficiency
- workload sizing
Once those are measured directly, the answer is usually much less mysterious than it seemed from a single 20 percent utilization graph.