Lesson 7: Measure Correctly — Warmup, synchronize, and Percentiles

The first number you measure will almost always be a lie. The GPU runs asynchronously, so time.time() captures only the launch, not the compute. The first run pays for allocations and autotuning. And a single average hides the tail users actually feel. In this lesson we learn the three rules of trus

Timing the GPU is like photographing a moving train: press too early and you get an empty frame. You must wait for it to arrive (synchronize), ignore the warm-up runs (warmup), and look at the slow trains too (percentiles), not just the average.

torch.cuda.synchronize(): Waits for all GPU work to finish. Required before and after timing, because CUDA calls are asynchronous.
Warmup: Running a few iterations before measuring to get past one-time costs (allocations, autotuning, caching).
Percentiles (p50/p99): p99 is the time 99% of requests meet. It reveals the slow tail that an average hides.