Lesson 8: The Bottleneck — Compute-bound vs Memory-bound
Before optimizing, ask what is actually holding the kernel back. To a first approximation, every kernel is either compute-bound (limited by arithmetic throughput in FLOP/s, with the GPU's compute units running at full capacity) or memory-bound (limited by memory bandwidth, with the compute units waiting for data). Optimizing without this diagnosis is guessing. In this lesson we learn to
Think of a kitchen. Compute-bound is a kitchen where every oven is full and the chef has to wait for one to free up: the bottleneck is the cooking itself. Memory-bound is a kitchen where the ovens sit idle because ingredients don't arrive from storage fast enough: the bottleneck is the delivery. Each problem needs a different fix.
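- Compute-bound: The kernel is limited by arithmetic throughput (FLOP/s). The GPU's compute units are fully busy. To speed it up, use lower precision or a more efficient kernel.
- Memory-bound: The kernel is limited by memory bandwidth. The compute units wait for data. To speed it up, fuse operations and reduce memory traffic.
- Arithmetic intensity: The number of FLOPs performed per byte read from or written to memory. Low intensity points to memory-bound; high intensity points to compute-bound. "Low" and "high" are relative to the hardware: the crossover sits at roughly peak FLOP/s divided by peak memory bandwidth.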
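The three definitions above can be turned into a quick back-of-the-envelope check. The sketch below is illustrative only: the `Device` numbers are hypothetical and not taken from any real GPU, and the cost functions assume an idealized kernel in which every input and output element crosses the memory bus exactly once (no cache reuse, no redundant loads). All names (`Device`, `classify`, `vector_add_cost`, `matmul_cost`) are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical hardware limits; replace with your GPU's spec-sheet values."""
    peak_flops: float      # peak arithmetic throughput, FLOP/s
    peak_bandwidth: float  # peak memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        # Intensity (FLOPs per byte) at which the two limits cross.
        return self.peak_flops / self.peak_bandwidth


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte read from or written to memory."""
    return flops / bytes_moved


def classify(intensity: float, device: Device) -> str:
    """Below the ridge point the kernel is memory-bound, otherwise compute-bound."""
    return "compute-bound" if intensity >= device.ridge_point else "memory-bound"


def vector_add_cost(n: int, dtype_bytes: int = 4) -> tuple[float, float]:
    """c = a + b: one add per element; two reads and one write per element."""
    return float(n), float(3 * n * dtype_bytes)


def matmul_cost(n: int, dtype_bytes: int = 4) -> tuple[float, float]:
    """C = A @ B for n x n matrices: 2*n^3 FLOPs; idealized traffic of
    three n x n matrices, each touched once."""
    return float(2 * n**3), float(3 * n * n * dtype_bytes)


# Assumed, made-up device: 100 TFLOP/s and 1 TB/s, so the ridge point is 100 FLOPs/byte.
device = Device(peak_flops=100e12, peak_bandwidth=1e12)

for name, cost in [
    ("vector add, n=1M", vector_add_cost(1_000_000)),
    ("matmul, n=64", matmul_cost(64)),
    ("matmul, n=4096", matmul_cost(4096)),
]:
    intensity = arithmetic_intensity(*cost)
    print(f"{name}: {intensity:.2f} FLOPs/byte -> {classify(intensity, device)}")
```

Under these assumptions, vector addition has an intensity of 1/12 FLOPs per byte regardless of size, so it is memory-bound on essentially any device. Matrix multiplication has an intensity of n/6, so it moves from memory-bound to compute-bound as n grows. Real kernels rarely achieve the idealized traffic, so a profiler measurement should take precedence over this estimate.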