Lesson 10: Dynamic Batching & Serving — Grouping Requests in Real Time
In production, requests don't arrive in a neat batch; they trickle in one by one from different users. Running each request alone leaves most of the accelerator's parallel capacity unused, so throughput collapses. Inference servers like Triton address this with dynamic batching: wait a tiny window, accumulate the requests that arrived, and run them together. In this
An elevator that leaves instantly with one passenger wastes trips. Dynamic batching is waiting a few seconds for more people to step in — not too long, so no one gets annoyed — then riding with a full elevator.
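The wait-and-group idea can be sketched as a small simulation over request arrival times. This is a minimal illustration, not how any particular server implements it: the function name `form_batches` and its parameters `max_batch_size` and `max_queue_delay` are hypothetical, times are in arbitrary units (say, milliseconds), and the model is assumed to be free whenever a batch is ready.

```python
from typing import List, Tuple


def form_batches(
    arrivals: List[float],
    max_batch_size: int,
    max_queue_delay: float,
) -> List[Tuple[float, List[int]]]:
    """Group requests into batches by arrival time.

    `arrivals` must be sorted in ascending order. A batch opens when its
    first request arrives and is dispatched as soon as it is full, or
    when `max_queue_delay` has elapsed since that first request,
    whichever comes first. Returns (dispatch_time, request_indices).
    Simplifying assumption: model execution time is ignored.
    """
    batches: List[Tuple[float, List[int]]] = []
    i = 0
    while i < len(arrivals):
        deadline = arrivals[i] + max_queue_delay  # the elevator door closes here
        batch = [i]
        j = i + 1
        # Let later requests join while there is room and time left.
        while (
            j < len(arrivals)
            and len(batch) < max_batch_size
            and arrivals[j] <= deadline
        ):
            batch.append(j)
            j += 1
        if len(batch) == max_batch_size:
            dispatch_time = arrivals[batch[-1]]  # full: leave immediately
        else:
            dispatch_time = deadline  # not full: leave when the window ends
        batches.append((dispatch_time, batch))
        i = j
    return batches


# Six requests; wait at most 5 time units, at most 4 requests per batch.
print(form_batches([0.0, 1.0, 2.0, 10.0, 11.0, 30.0], 4, 5.0))
# → [(5.0, [0, 1, 2]), (15.0, [3, 4]), (35.0, [5])]
```

In this sketch, the lone request at time 30 still waits the full window before running. That is the latency cost paid for the chance of a larger batch.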
- Dynamic batching: The server groups separate requests arriving within a short window into one batch, up to a size or time limit.
- Max queue delay: How long the server is willing to wait to accumulate a batch before running it. Tunes the latency/throughput balance.
- Model instances: Multiple copies of the model running in parallel on the same GPU to serve requests concurrently.
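In Triton, these three terms correspond to fields in a model's `config.pbtxt`. The fragment below is a sketch only: the numbers are illustrative assumptions rather than tuned recommendations, and the rest of the configuration (model name, backend, inputs, outputs) is omitted.

```
# config.pbtxt (fragment); values are illustrative, not tuned
max_batch_size: 8                    # upper bound on requests per batch

dynamic_batching {
  max_queue_delay_microseconds: 100  # how long to wait for a batch to fill
}

instance_group [
  {
    count: 2                         # two copies of the model
    kind: KIND_GPU                   # placed on the GPU
  }
]
```

A larger delay tends to produce fuller batches at the cost of added per-request latency. The right value depends on traffic and the latency budget, so it is usually found by measuring rather than guessing.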