Lesson 7: Measure Correctly — Warmup, synchronize, and Percentiles
The first number you measure will almost always be a lie. The GPU runs asynchronously, so time.time() captures only the launch, not the compute. The first run pays for allocations and autotuning. And a single average hides the tail users actually feel. In this lesson we learn the three rules of trus
Timing the GPU is like photographing a moving train: press too early and you get an empty frame. You must wait for it to arrive (synchronize), ignore the warm-up runs (warmup), and look at the slow trains too (percentiles), not just the average.
- torch.cuda.synchronize()
- Waits for all GPU work to finish. Required before and after timing, because CUDA calls are asynchronous.
- Warmup
- Running a few iterations before measuring to get past one-time costs (allocations, autotuning, caching).
- Percentiles (p50/p99)
- p99 is the time 99% of requests meet. It reveals the slow tail that an average hides.