Capacity Planning for LLM Inference: GPU Memory, Throughput, and SLA Targets
A practical framework for LLM inference capacity planning, covering token demand forecasting, GPU memory budgets, queueing behavior, batching tradeoffs, and planning against user-facing latency SLAs.