Model Deployment

How We Cut GPU Costs by 40% Without Sacrificing Model Performance

Practical strategies for reducing GPU infrastructure costs — covering spot instances, GPU scheduling, model optimization, and right-sizing — without degrading inference quality.

7 min read · 1,373 words

GPU costs are typically the largest line item in any AI infrastructure budget. An 8×A100 p4d.24xlarge instance on AWS costs over $30/hour on-demand. Run a few models across dev, staging, and production environments, and you're looking at five-figure monthly bills before you've served a single customer.

We've helped multiple teams slash their GPU spend by 30-50% using a combination of scheduling, spot instances, model optimization, and right-sizing. Here's the playbook.

The GPU Cost Problem

Most teams approach GPU infrastructure the same way they approach CPU — provision for peak, run 24/7, and hope finance doesn't notice.

The result:

  • GPUs sitting idle 60-80% of the time during off-peak hours
  • Oversized instances because "we might need the VRAM later"
  • No separation between training and inference workloads
  • On-demand pricing for workloads that could tolerate interruption

Strategy 1: Spot/Preemptible Instances for Training

Training workloads are inherently batch-oriented and can be checkpointed. This makes them perfect for spot instances, which run 60-90% cheaper than on-demand.
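How often to checkpoint is itself a tradeoff: save too rarely and an interruption redoes a lot of work; save too often and you pay the write cost continuously. A back-of-envelope cost model (all numbers here are illustrative assumptions, not measurements):

```python
# checkpoint_math.py: rough cost model for picking a checkpoint interval
def expected_overhead_pct(interval_steps: int, step_s: float,
                          save_s: float, interrupts_per_day: float) -> float:
    """Percent of wall-clock lost to writing checkpoints plus redoing work.

    On average, an interruption loses half a checkpoint interval of progress.
    """
    day_s = 86_400
    saves_per_day = day_s / (interval_steps * step_s)
    save_cost = saves_per_day * save_s
    redo_cost = interrupts_per_day * (interval_steps * step_s) / 2
    return 100 * (save_cost + redo_cost) / day_s

# e.g. checkpoint every 100 steps at 1 s/step, 5 s per save, 2 interruptions/day
overhead = expected_overhead_pct(100, 1.0, 5.0, 2)  # ~5.1%
```

Even with the 60-90% spot discount, it's worth checking that this overhead stays in the low single digits before shrinking the interval further.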

Setting Up Fault-Tolerant Training on Spot

# training/checkpointed_trainer.py
import torch
import signal
import sys
from pathlib import Path

class SpotAwareTrainer:
    """Training wrapper that handles spot instance interruptions."""

    def __init__(self, model, optimizer, checkpoint_dir: str, checkpoint_interval: int = 100):
        self.model = model
        self.optimizer = optimizer
        self.checkpoint_dir = Path(checkpoint_dir)
        self.checkpoint_interval = checkpoint_interval
        self.current_epoch = 0
        self.current_step = 0

        # Handle SIGTERM (spot termination warning)
        signal.signal(signal.SIGTERM, self._handle_termination)

        # Resume from latest checkpoint if available
        self._restore_checkpoint()

    def _handle_termination(self, signum, frame):
        """Save checkpoint on spot termination (2 min warning)."""
        print("⚠️ Spot termination notice received. Saving checkpoint...")
        self._save_checkpoint(reason="spot_termination")
        sys.exit(0)

    def _save_checkpoint(self, reason: str = "scheduled"):
        checkpoint = {
            "epoch": self.current_epoch,
            "step": self.current_step,
            "model_state": self.model.state_dict(),
            "optimizer_state": self.optimizer.state_dict(),
            "reason": reason,
        }
        path = self.checkpoint_dir / f"checkpoint_e{self.current_epoch}_s{self.current_step}.pt"
        torch.save(checkpoint, path)
        print(f"Checkpoint saved: {path}")

    def _restore_checkpoint(self):
        # Sort by mtime: a lexicographic sort would put checkpoint_e10 before checkpoint_e2
        checkpoints = sorted(self.checkpoint_dir.glob("checkpoint_*.pt"),
                             key=lambda p: p.stat().st_mtime)
        if checkpoints:
            latest = checkpoints[-1]
            checkpoint = torch.load(latest)
            self.model.load_state_dict(checkpoint["model_state"])
            self.optimizer.load_state_dict(checkpoint["optimizer_state"])
            self.current_epoch = checkpoint["epoch"]
            self.current_step = checkpoint["step"]
            print(f"Resumed from {latest} (epoch {self.current_epoch}, step {self.current_step})")
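The SIGTERM handler above assumes something actually delivers the signal. On Kubernetes, the AWS Node Termination Handler does this for pods; on a bare EC2 spot instance you can poll the instance metadata endpoint for the interruption notice yourself and raise the signal in-process. A sketch (the metadata URL is AWS's documented one; the watcher around it is illustrative):

```python
# spot/termination_watcher.py: illustrative companion to the trainer above.
# AWS publishes the 2-minute spot interruption notice at this metadata path;
# it returns 404 until an interruption is scheduled, then a small JSON body.
import json
import os
import signal
import time
import urllib.error
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def parse_instance_action(body: str) -> bool:
    """Return True if the metadata body announces a stop/terminate action."""
    try:
        action = json.loads(body).get("action")
    except json.JSONDecodeError:
        return False
    return action in ("stop", "terminate")

def watch(poll_seconds: float = 5.0) -> None:
    """Poll metadata; deliver SIGTERM in-process so the trainer checkpoints."""
    # Note: instances enforcing IMDSv2 additionally require a session token header.
    while True:
        try:
            with urllib.request.urlopen(METADATA_URL, timeout=2) as resp:
                if parse_instance_action(resp.read().decode()):
                    os.kill(os.getpid(), signal.SIGTERM)
                    return
        except urllib.error.URLError:
            pass  # 404 / unreachable metadata service: no interruption scheduled
        time.sleep(poll_seconds)
```

Run `watch()` in a background thread alongside training; the existing SIGTERM handler then takes care of the checkpoint.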

Kubernetes Node Pool for Spot GPUs

# k8s/spot-gpu-nodepool.yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-spot-training
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values:
        - g5.xlarge    # 1x A10G — good for fine-tuning
        - g5.2xlarge   # 1x A10G + more CPU/RAM
        - p3.2xlarge   # 1x V100 — good price/performance
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
  limits:
    resources:
      nvidia.com/gpu: "8"  # max 8 GPUs across spot nodes
  consolidation:
    enabled: true  # v1alpha5 disallows combining this with ttlSecondsAfterEmpty

Cost impact: Training costs reduced by 60-70%.

Don't Use Spot for Inference — Spot instances can be terminated with 2 minutes' notice. For training, that's fine — you checkpoint and resume. For real-time inference, that means dropped requests. Keep inference on on-demand or reserved instances.

Strategy 2: GPU Time-Sharing and Scheduling

Most inference models don't need a full GPU. A text classification model might use 2GB of a 24GB A10G. Without GPU sharing, you're paying for 22GB of unused VRAM.
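A quick way to estimate how many GPUs a fleet of small models actually needs is to bin-pack their VRAM footprints. A first-fit-decreasing sketch (the footprints and the 2 GB headroom for CUDA context are assumptions):

```python
# packing_sketch.py: estimate GPUs needed when co-locating small models
def first_fit_decreasing(model_vram_gb, gpu_vram_gb=24.0, headroom_gb=2.0):
    """Assign model VRAM footprints to as few GPUs as possible."""
    budget = gpu_vram_gb - headroom_gb  # reserve headroom per GPU
    free = []        # remaining VRAM on each allocated GPU
    placement = {}   # model index -> GPU index
    order = sorted(range(len(model_vram_gb)), key=lambda i: -model_vram_gb[i])
    for i in order:
        size = model_vram_gb[i]
        assert size <= budget, "model does not fit on this GPU type at all"
        for g, remaining in enumerate(free):
            if size <= remaining:  # first GPU with room wins
                free[g] -= size
                placement[i] = g
                break
        else:
            free.append(budget - size)  # open a new GPU
            placement[i] = len(free) - 1
    return len(free), placement

# five small models packed onto 24 GB A10Gs
gpus_needed, _ = first_fit_decreasing([2.0, 3.0, 2.0, 4.0, 6.0])  # 1 GPU
```

First-fit decreasing is a heuristic, but for a handful of models it typically matches the optimum and makes the "3-5 models per GPU" math concrete.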

NVIDIA MPS for GPU Sharing

# k8s/gpu-sharing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-mps-config
data:
  startup.sh: |
    #!/bin/bash
    # Enable Multi-Process Service for GPU sharing
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
    export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
    nvidia-cuda-mps-control -d
    echo "MPS daemon started"

---
# Deploy multiple small models on a single GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-model-a
spec:
  template:
    spec:
      containers:
        - name: model
          resources:
            limits:
              nvidia.com/gpu-shared: "1"  # using GPU sharing plugin
            requests:
              memory: "2Gi"

Time-Sliced Scheduling

For bursty workloads (high traffic during business hours, low at night), let an autoscaler track demand so GPU replica count follows the daily curve:

# k8s/scheduled-scaling.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: recommendation-model
  minReplicas: 1  # overnight minimum
  maxReplicas: 10
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600  # wait 10 min before scaling down
      policies:
        - type: Pods
          value: 2
          periodSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_requests_per_second
        target:
          type: AverageValue
          averageValue: "50"

Cost impact: 3-5 small models on one GPU instead of 3-5 separate GPU instances = 60-80% savings on serving costs.
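One gap in the HPA above: minReplicas: 1 applies around the clock. If you want a higher floor during business hours and the true minimum only overnight, a CronJob can patch the floor on a schedule. A sketch that only builds the patch body (the hour boundaries, replica counts, and HPA name are assumptions):

```python
# scheduling/min_replica_schedule.py: illustrative, not from our deployments
import json

BUSINESS_HOURS = range(8, 20)  # 08:00-19:59 local time, an assumed busy window

def desired_min_replicas(hour: int, day_min: int = 3, night_min: int = 1) -> int:
    """Higher replica floor during busy hours, minimal floor overnight."""
    return day_min if hour in BUSINESS_HOURS else night_min

def hpa_patch(hour: int) -> str:
    """Merge-patch body for: kubectl patch hpa inference-hpa --type=merge -p '<body>'"""
    return json.dumps({"spec": {"minReplicas": desired_min_replicas(hour)}})
```

The HPA keeps scaling on load as before; the cron patch only moves the floor, so cold-start latency is avoided during the day while overnight capacity drops to one replica.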

Strategy 3: Model Optimization

Sometimes the cheapest GPU is the one you don't need. Model optimization techniques can reduce resource requirements dramatically.

Quantization: INT8 Without Quality Loss

# optimization/quantize_model.py
from pathlib import Path

from optimum.onnxruntime import ORTQuantizer, AutoQuantizationConfig

def get_model_size(path: str) -> float:
    """Total size of the files under a model directory, in MB."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file()) / 1e6

def quantize_model(model_path: str, output_path: str):
    """Quantize a model from FP32 to INT8, reducing size by ~4x."""

    quantizer = ORTQuantizer.from_pretrained(model_path)

    qconfig = AutoQuantizationConfig.avx512_vnni(
        is_static=False,  # dynamic quantization — no calibration data needed
        per_channel=True,
    )

    quantizer.quantize(
        save_dir=output_path,
        quantization_config=qconfig,
    )

    print(f"Quantized model saved to {output_path}")
    print(f"Original size: {get_model_size(model_path):.1f} MB")
    print(f"Quantized size: {get_model_size(output_path):.1f} MB")

Results we've seen across deployments:

| Technique | Size Reduction | Latency Improvement | Quality Impact |
|---|---|---|---|
| INT8 Dynamic Quantization | 3-4x smaller | 2-3x faster | < 1% accuracy loss |
| FP16 Mixed Precision | 2x smaller | 1.5-2x faster | Negligible |
| Knowledge Distillation | 5-10x smaller | 3-5x faster | 1-3% accuracy loss |
| ONNX Runtime | Same size | 1.5-2x faster | None |
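The "< 1% accuracy loss" row is less surprising once you see the error scale: symmetric INT8 rounding error is bounded by half a quantization step. A toy sketch in pure Python (this is the concept, not how optimum implements it internally):

```python
# quant_sketch.py: toy symmetric INT8 quantization to show the error scale
def quantize_int8(weights):
    """Map floats to [-127, 127] ints with a single symmetric scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.2, 0.03, 0.91]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# per-weight error is bounded by scale / 2, here roughly 0.005
```

With real models the scale is computed per channel (as in the per_channel=True config above), which shrinks the step size further for channels with small weights.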

Batching: Amortize GPU Overhead

Dynamic batching collects multiple inference requests and processes them together:

# serving/dynamic_batcher.py
import asyncio
from collections import deque
from dataclasses import dataclass, field
from typing import Any

@dataclass
class InferenceRequest:
    input_data: Any
    future: asyncio.Future = field(default_factory=asyncio.Future)

class DynamicBatcher:
    """Collect requests and batch them for efficient GPU utilization."""

    def __init__(self, model, max_batch_size: int = 32, max_wait_ms: float = 50):
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue: deque[InferenceRequest] = deque()
        self._flush_task: asyncio.Task | None = None

    async def predict(self, input_data):
        """Submit a single prediction request."""
        request = InferenceRequest(input_data=input_data)
        self.queue.append(request)

        if len(self.queue) >= self.max_batch_size:
            # Batch is full: process immediately
            await self._process_batch()
        elif self._flush_task is None or self._flush_task.done():
            # Otherwise flush the partial batch after max_wait_ms
            self._flush_task = asyncio.create_task(self._flush_after_wait())

        return await request.future

    async def _flush_after_wait(self):
        await asyncio.sleep(self.max_wait_ms / 1000)
        await self._process_batch()

    async def _process_batch(self):
        batch = []
        while self.queue and len(batch) < self.max_batch_size:
            batch.append(self.queue.popleft())

        if not batch:
            return

        # Process entire batch in one GPU call
        inputs = [r.input_data for r in batch]
        results = self.model.batch_predict(inputs)

        for request, result in zip(batch, results):
            request.future.set_result(result)

Cost impact: Batching increases throughput by 3-8x per GPU, meaning fewer instances needed.
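The 3-8x figure falls out of a simple cost model: every GPU call pays a roughly fixed launch-and-transfer overhead plus a per-item compute cost, so batching amortizes the fixed part. A sketch with assumed timings:

```python
# batching_math.py: why batching multiplies throughput (timings are assumptions)
def throughput(batch_size: int, overhead_ms: float = 10.0,
               per_item_ms: float = 1.0) -> float:
    """Requests/sec when each GPU call costs overhead + per_item * batch_size."""
    return batch_size * 1000 / (overhead_ms + per_item_ms * batch_size)

speedup = throughput(32) / throughput(1)  # ~8.4x with these assumed timings
```

The flip side is latency: a request can wait up to max_wait_ms for its batch to fill, so tune that bound against your latency SLO.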

Strategy 4: Right-Sizing GPU Instances

Not every model needs an A100. Here's a decision matrix:

| Workload | Recommended GPU | VRAM | Monthly Cost (approx) |
|---|---|---|---|
| Small model inference (< 1B params) | T4 | 16 GB | $250-400 |
| Medium model inference (1-7B params) | A10G | 24 GB | $500-800 |
| Large model inference (7-13B params) | A100 40GB | 40 GB | $1,500-2,500 |
| Fine-tuning (< 7B params) | A10G | 24 GB | $500-800 |
| Pre-training / heavy fine-tuning | A100 80GB | 80 GB | $2,500-4,000 |

Profiling Before Provisioning

# Profile actual GPU utilization before right-sizing
nvidia-smi dmon -s um -d 5 -o TD | tee gpu_profile.log

# Check peak VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu \
  --format=csv -l 5

If your model peaks at 8GB VRAM and averages 30% GPU utilization, you don't need an A100. Drop to a T4 and save 80%.
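Turning that CSV output into a recommendation is mechanical. A sketch against the tiers in the matrix above (the 25% headroom factor is an assumption; compute utilization deserves its own check):

```python
# profiling/recommend_gpu.py: pick a tier from nvidia-smi CSV output
import csv
import io

# (name, VRAM in GB), roughly matching the decision matrix above
GPU_TIERS = [("T4", 16), ("A10G", 24), ("A100 40GB", 40), ("A100 80GB", 80)]

def recommend(csv_text: str, headroom: float = 1.25) -> str:
    """Smallest GPU whose VRAM covers peak usage plus 25% headroom.

    Expects `nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu
    --format=csv` output, i.e. lines like "8012 MiB, 40536 MiB, 31 %".
    """
    peak_mib = 0.0
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    for row in reader:
        peak_mib = max(peak_mib, float(row[0].strip().split()[0]))
    need_gb = peak_mib / 1024 * headroom
    for name, vram in GPU_TIERS:
        if vram >= need_gb:
            return name
    return GPU_TIERS[-1][0]
```

Feed it a week of samples, not a minute: peak VRAM during a traffic spike is what determines whether the smaller card is safe.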

Results Summary

Combining all four strategies on a typical AI startup's infrastructure:

| Strategy | Savings | Effort |
|---|---|---|
| Spot instances for training | 60-70% on training | Medium |
| GPU sharing for small models | 60-80% on serving | Medium |
| Model quantization (INT8) | 50-75% on serving | Low |
| Right-sizing instances | 30-50% across all | Low |
| Combined | 40-60% total GPU spend | |

The 40% in our title is actually conservative. Teams that implement all four strategies often see 50%+ reductions.
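Note that the combined number is a spend-weighted average, not a sum. A sketch of the arithmetic, with an assumed 40/60 training/serving spend split and an assumed net serving reduction of 45% once the overlap between sharing, quantization, and right-sizing is accounted for:

```python
# savings_math.py: back-of-envelope; the split and net reductions are assumptions
def combined_savings(split: dict, savings: dict) -> float:
    """Overall savings as a spend-weighted average of per-category savings."""
    return sum(split[k] * savings[k] for k in split)

split = {"training": 0.40, "serving": 0.60}    # assumed spend mix
savings = {"training": 0.65, "serving": 0.45}  # assumed net reductions
total = combined_savings(split, savings)       # 0.53, i.e. 53%
```

Plug in your own spend mix; a training-heavy shop lands nearer the top of the 40-60% range because spot discounts are the deepest single lever.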

Where to Start

  1. Profile first: Run nvidia-smi monitoring for a week. You'll be surprised how much GPU time is wasted.
  2. Spot for training: This is usually the quickest win — implement checkpointing and switch training nodes to spot.
  3. Quantize serving models: INT8 quantization is often a one-line change with negligible quality impact.
  4. Right-size: Match GPU type to actual VRAM and compute needs.

Don't try to implement everything at once. Pick the strategy that matches your biggest cost center and start there.


Struggling with GPU costs? We help teams optimize their AI infrastructure spend without sacrificing model performance. Book a free audit and we'll identify your top cost-saving opportunities.


Resilio Tech Team

Building AI infrastructure tools and sharing knowledge to help companies deploy ML systems reliably.

Published 3/20/2026