Lesson 10: Dynamic Batching & Serving — Grouping Requests in Real Time

In production, requests don't arrive in a neat batch; they trickle in one by one from different users. Run each one alone and the GPU sits mostly idle, so throughput collapses. Inference servers like Triton solve this with dynamic batching: wait a very short window, accumulate the requests that arrived in it, and run them together as one batch. In this

An elevator that leaves instantly with one passenger wastes trips. Dynamic batching is waiting a few seconds for more people to step in — not too long, so no one gets annoyed — then riding with a full elevator.

Dynamic batching: The server groups separate requests that arrive within a short window into one batch, and runs it as soon as either a maximum batch size is reached or the time limit expires.
Max queue delay: The longest the server is willing to hold a request while it accumulates a batch. A longer delay tends to produce fuller batches and higher throughput, but adds up to that much latency to each request.
Model instances: Multiple copies of the model running in parallel, typically on the same GPU, so that several batches can be served concurrently.
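
To make the first two definitions concrete, here is a minimal sketch of a dynamic batcher in plain Python with asyncio. It is a toy illustration, not Triton's implementation or API: the class name DynamicBatcher, the parameters max_batch_size and max_queue_delay (in seconds), and the stand-in model_fn are all hypothetical names chosen for this example.

```python
import asyncio
import time


class DynamicBatcher:
    """Toy dynamic batcher (illustrative only, not Triton's implementation)."""

    def __init__(self, model_fn, max_batch_size=8, max_queue_delay=0.005):
        self.model_fn = model_fn                # hypothetical batched model: list in, list out
        self.max_batch_size = max_batch_size    # size limit for one batch
        self.max_queue_delay = max_queue_delay  # time limit in seconds
        self.queue = asyncio.Queue()
        self.batch_sizes = []                   # recorded so batch formation can be inspected

    async def infer(self, x):
        """Called once per request; returns when the batch containing x has run."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self):
        """Background loop: form a batch, run the model once, hand results back."""
        while True:
            # Block until the first request arrives, then start the delay window.
            items = [await self.queue.get()]
            deadline = time.monotonic() + self.max_queue_delay
            while len(items) < self.max_batch_size:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break  # time limit reached
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break  # no further request arrived inside the window
            outputs = self.model_fn([x for x, _ in items])
            self.batch_sizes.append(len(items))
            for (_, fut), y in zip(items, outputs):
                if not fut.cancelled():
                    fut.set_result(y)


async def demo():
    # The "model" just doubles each input; a real model would run on the GPU.
    batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs],
                             max_batch_size=4, max_queue_delay=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.infer(i) for i in range(10)))
    worker.cancel()
    return results, batcher.batch_sizes


if __name__ == "__main__":
    results, sizes = asyncio.run(demo())
    print(results)
    print(sizes)
```

Each caller sees an ordinary single-request call, while the model only ever runs on groups of requests. Raising max_queue_delay in this sketch gives slow-arriving requests more time to share a batch, at the cost of extra waiting for the first request in each window, which is the latency/throughput balance described above. Running several such batchers side by side, each with its own model copy, would correspond loosely to multiple model instances.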