Batching is one of the fastest ways to improve GPU efficiency in LLM inference. It is also one of the fastest ways to make an interactive product feel worse if you apply it badly.
That is the core tradeoff:
- bigger batches usually improve throughput
- bigger batches also increase wait time before execution
The right batching policy depends on the product you are running, not just the GPU you want to optimize.
Why Batching Works
GPUs are built for massively parallel work. If you process one request at a time, most of that capacity sits idle.
Batching helps by:
- amortizing launch overhead
- improving hardware utilization
- increasing tokens-per-second output
- reducing cost per request under concurrency
This is especially valuable when requests are frequent and similar in size.
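As a back-of-the-envelope sketch of the amortization effect (the numbers are illustrative, and it assumes per-step compute grows slowly with batch size, which is often roughly true for memory-bound decode):

```python
def per_request_ms(batch_size, fixed_overhead_ms=5.0, step_ms=20.0):
    # Launch/scheduling overhead and the step time are shared by everyone
    # in the batch, so per-request cost falls as the batch grows.
    return (fixed_overhead_ms + step_ms) / batch_size

# per_request_ms(1) -> 25.0 ms per request
# per_request_ms(8) -> 3.125 ms per request
```

Real curves flatten out as the batch gets compute-bound, but the direction is the same: the fixed costs get split across more requests.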
The First Question: Interactive or Batch?
Do not use one batching rule for every workflow.
Good candidates for aggressive batching:
- offline summarization
- large embedding jobs
- nightly classification pipelines
- internal backfill tasks
More conservative batching for:
- interactive chat
- customer-facing assistants
- real-time copilots
If users expect instant responses, the queue delay matters more than the GPU utilization chart.
Static vs Dynamic Batching
Static batching
You collect a fixed number of requests and process them together.
Pros:
- simple
- predictable batch size
Cons:
- idle time while waiting for the batch to fill
- bad fit for bursty or uneven traffic
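A static batcher is only a few lines, which is part of its appeal. This is a sketch only; a real server would add request handoff and timeouts:

```python
class StaticBatcher:
    def __init__(self, batch_size=16):
        self.batch_size = batch_size
        self.queue = []

    def add(self, request):
        self.queue.append(request)

    def next_batch(self):
        # Nothing runs until the batch fills -- the idle-time drawback above.
        if len(self.queue) < self.batch_size:
            return None
        batch, self.queue = self.queue[:self.batch_size], self.queue[self.batch_size:]
        return batch
```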
Dynamic batching
You collect requests for a short window and run whatever is available.
Pros:
- better for variable traffic
- easier to balance throughput and latency
Cons:
- more tuning complexity
For most online LLM systems, dynamic batching is the more practical default.
Tune by Time Window, Not Just Batch Size
Teams often focus only on max_batch_size. That is not enough.
The real control variables are:
- max batch size
- max waiting time
- max concurrent sequences
- request size constraints
A minimal sketch of that loop, using asyncio (the original helper names are replaced with standard-library calls):

```python
import asyncio
import time

def now_ms():
    return time.monotonic() * 1000

class DynamicBatcher:
    def __init__(self, max_batch_size=16, max_wait_ms=20):
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms
        self.queue = []

    async def next_batch(self):
        start = now_ms()
        while len(self.queue) < self.max_batch_size:
            if now_ms() - start >= self.max_wait_ms:
                break  # window expired; ship whatever is available
            await asyncio.sleep(0.001)
        # Remove the batch from the queue so requests are not served twice.
        batch, self.queue = self.queue[:self.max_batch_size], self.queue[self.max_batch_size:]
        return batch
```
That max_wait_ms is where user experience often lives or dies.
Separate Long and Short Requests
One long request can hold a short request hostage inside the same batch.
A useful pattern is routing requests into lanes:
- short prompts
- medium prompts
- long prompts
That prevents very heavy requests from degrading the entire interactive workload.
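A minimal lane router can bucket requests by prompt length. The cutoffs below are illustrative, not tuned values:

```python
def route_lane(prompt_tokens):
    # Illustrative thresholds; tune per model and workload.
    if prompt_tokens <= 256:
        return "short"
    if prompt_tokens <= 2048:
        return "medium"
    return "long"

# Each lane then gets its own batcher and queue limits, so a 10k-token
# document never sits in front of a 50-token chat turn.
```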
Set Queue Limits Explicitly
Batching without queue limits is how systems look efficient on dashboards while users wait forever.
Define:
- max queue depth
- max wait per class of request
- an admission timeout, so requests are rejected up front rather than queued past their latency budget
If the batcher cannot serve traffic within your latency budget, reject or shed load intentionally rather than pretending the system is fine.
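One sketch of explicit admission control, assuming a single queue and a per-request latency budget (class and method names here are hypothetical):

```python
import time

class BoundedQueue:
    def __init__(self, max_depth=64, max_wait_s=2.0):
        self.max_depth = max_depth    # max queue depth
        self.max_wait_s = max_wait_s  # latency budget while queued
        self.items = []               # (enqueue_time, request) pairs

    def admit(self, request):
        # Reject at the door instead of queueing forever.
        if len(self.items) >= self.max_depth:
            return False
        self.items.append((time.monotonic(), request))
        return True

    def pop_fresh(self):
        # Shed requests that already blew their budget while waiting.
        now = time.monotonic()
        while self.items:
            enqueued_at, request = self.items.pop(0)
            if now - enqueued_at <= self.max_wait_s:
                return request
        return None
```

Rejections from `admit` should surface to callers (e.g. as a 429) so the shed load is visible rather than silent.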
Measure Both Throughput and User Cost
The batching dashboard should show:
- average batch size
- queue wait time
- end-to-end latency
- tokens per second
- GPU utilization
- timeout rate
If you only look at throughput, every batching change looks good. If you only look at latency, every batching change looks bad. You need both views.
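A tiny example of why one number is not enough: compute the mean and a nearest-rank p95 from the same latency samples (the helper is a sketch, fine for dashboards but not exact statistics):

```python
def percentile(samples, pct):
    # Nearest-rank percentile, no interpolation.
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, int(pct / 100 * len(ordered))))
    return ordered[rank]

latencies_ms = [40, 42, 45, 41, 43, 44, 40, 46, 300, 42]  # one stalled request
mean_ms = sum(latencies_ms) / len(latencies_ms)
# mean_ms is ~68 ms, but p95 is 300 ms: the average hides the queue stall.
```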
Good Default Policies
For interactive LLM APIs:
- modest batch size
- very short wait window
- separate lane for large prompts
- fast queue rejection
For offline jobs:
- larger batch size
- longer wait window
- lower priority scheduling
Different products need different operating points.
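Those two operating points can be written down as an explicit per-route policy table. Every name and number below is a hypothetical starting point to tune, not a recommendation:

```python
# Hypothetical per-route batching policies; tune against your own traffic.
BATCH_POLICIES = {
    "interactive": {"max_batch_size": 8,  "max_wait_ms": 5,   "max_queue_depth": 32},
    "offline":     {"max_batch_size": 64, "max_wait_ms": 250, "max_queue_depth": 5000},
}

def policy_for(route):
    return BATCH_POLICIES[route]
```

Making the policy an explicit table also makes review easy: a change to a wait window shows up in a diff, not in a runtime flag someone set by hand.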
Continuous Batching Changes the Tradeoff
Serving runtimes like vLLM make batching more efficient through continuous (iteration-level) batching: they can admit new requests and retire finished ones at each decode step rather than waiting for rigid fixed batches.
That is one reason modern LLM serving systems perform better than naive request queues. But even with continuous batching, request shape and queue policy still matter.
You still need:
- max sequence controls
- prompt length caps
- route-specific policies
Common Mistakes
These show up often:
- batching interactive and offline traffic together
- optimizing for average throughput while p95 latency worsens
- no queue cap
- letting very long prompts share a lane with tiny requests
- tuning only batch size, not wait time
Most batching incidents are policy mistakes, not runtime bugs.
A Practical Starting Point
For many user-facing LLM APIs:
- start with dynamic batching
- keep wait windows very short
- split large prompts into their own lane
- watch p95 latency and queue wait together
For internal or offline systems:
- increase batch size gradually
- accept longer wait windows
- maximize throughput only where latency is not user-visible
Final Takeaway
Batching is not a universal optimization knob. It is a policy decision about how much waiting you are willing to trade for throughput.
The best batching setup is the one that matches the product’s latency expectations, not the one that simply makes the GPU chart look the best.
Need help tuning LLM inference performance? We help teams choose batching, routing, and queueing strategies that improve throughput without quietly wrecking latency. Book a free infrastructure audit and we’ll review your serving setup.


