Capacity planning for LLMs is fundamentally a memory problem. If you don't understand your KV cache requirements, you'll either over-provision and waste infrastructure spend, or under-provision and miss your latency SLOs. For unpredictable workloads, you might also consider whether serverless inference is a viable alternative to dedicated capacity.
The Math of GPU Memory
When planning, you must account for:
- Model Weights: ~2 bytes per parameter in FP16, so roughly 2x the parameter count in GB (e.g., Llama 70B needs ~140GB for weights alone).
- KV Cache: The memory that stores attention keys and values for active requests. Per token it is 2 (K and V) x num_layers x num_kv_heads x head_dim x bytes per element, and it grows linearly with both context length and batch size; see the sketch after this list.
- Overhead: CUDA context, activation workspace, and framework buffers; budgeting a ~10% margin on top of weights and cache is a common rule of thumb.
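To make these terms concrete, here is a back-of-envelope estimator in Python. It's a minimal sketch, not a definitive model: the layer count, KV-head count, and head dimension are assumptions chosen to resemble a Llama-70B-class model with grouped-query attention, and real runtimes will differ in their exact overheads.

```python
# Back-of-envelope GPU memory estimate for an LLM serving fleet.
# Shapes below are assumptions resembling a Llama-70B-class model
# with grouped-query attention (80 layers, 8 KV heads, head_dim 128).

GiB = 1024**3

def weights_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Weight memory: parameter count x bytes per parameter (2 for FP16)."""
    return n_params * bytes_per_param / GiB

def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 (K and V) x layers x kv_heads x head_dim bytes per token,
    times the tokens in flight (seq_len x batch_size)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size / GiB

weights = weights_gib(70e9)                        # ~130 GiB in FP16
kv = kv_cache_gib(n_layers=80, n_kv_heads=8, head_dim=128,
                  seq_len=4096, batch_size=32)     # ~40 GiB
overhead = 0.10 * (weights + kv)                   # rough 10% runtime margin
print(f"weights={weights:.0f} GiB  kv_cache={kv:.0f} GiB  "
      f"total={weights + kv + overhead:.0f} GiB")
```

At these assumed shapes the total lands around 190 GiB, i.e., at least three 80GB GPUs. Note that the KV cache is the only term you control at serve time: halving context length or batch size halves it.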
Modeling Throughput and Latency
Once you have the memory footprint, use load testing to find your saturation point: the concurrency level at which tail latency starts climbing steeply. Anchor your autoscaling targets below that knee so you scale out before demand degrades latency, without paying for idle headroom.
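Below is a minimal load-test sketch for finding that knee. It assumes an OpenAI-compatible completions endpoint at a hypothetical localhost URL; the model name and payload are placeholders, and it requires aiohttp.

```python
# Minimal saturation-point load test. Assumptions: an OpenAI-compatible
# completions endpoint at a hypothetical URL; model name and payload are
# placeholders. Requires aiohttp (pip install aiohttp).
import asyncio
import time

import aiohttp

URL = "http://localhost:8000/v1/completions"      # hypothetical endpoint
PAYLOAD = {"model": "llama-70b", "prompt": "Hello", "max_tokens": 128}

async def one_request(session: aiohttp.ClientSession) -> float:
    """Send one request and return end-to-end latency in seconds."""
    start = time.perf_counter()
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.read()                          # wait for the full body
    return time.perf_counter() - start

async def run_level(concurrency: int, n_requests: int = 64) -> float:
    """Run n_requests at a fixed concurrency and return p95 latency."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(session: aiohttp.ClientSession) -> float:
        async with sem:
            return await one_request(session)

    async with aiohttp.ClientSession() as session:
        latencies = await asyncio.gather(
            *(bounded(session) for _ in range(n_requests)))
    return sorted(latencies)[int(0.95 * len(latencies))]

async def main() -> None:
    # The saturation point is where p95 starts climbing steeply as
    # concurrency grows; anchor autoscaling targets below that knee.
    for concurrency in (1, 4, 8, 16, 32, 64):
        p95 = await run_level(concurrency)
        print(f"concurrency={concurrency:3d}  p95={p95:.2f}s")

asyncio.run(main())
```

For streaming backends you'd measure time-to-first-token and tokens per second instead of full-response latency, but the knee shows up either way.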
Final Takeaway
Capacity planning for LLMs is an iterative process. By combining theoretical memory modeling with empirical load testing, you can build a fleet that meets your performance SLOs while staying within your budget.
Struggling to right-size your LLM serving fleet? We help teams model their GPU memory requirements, perform load testing, and design cost-effective scaling strategies. Book a free infrastructure audit and we’ll help you plan your capacity.