Lesson 36: Capstone — End-to-End Kernel Optimization
This is the capstone lesson of the course. We will not learn a new mechanism; instead, we will connect everything we have already learned into one process: taking a naive, slow kernel and making it fast. The checklist is not magic; it is a sequence of checks we have already seen separately. First, coalescing: ensur
Improving a kernel is like renovating a house: you can't just paint and hope. You check what is truly broken — the plumbing, the wiring, or the roof — and fix that. And there is one pressure gauge you must not ignore: measurement (Nsight). Without measuring, you fix the wrong wall.
- optimization checklist
- A sequence of checks to improve a kernel: coalescing, shared memory, reducing divergence, tuning occupancy, and measuring in Nsight.
- bottleneck
- The real limit slowing the kernel. Fixing something else will not help until you address it specifically.
- measure first
- Use Nsight to identify the bottleneck before changing code, instead of guessing which optimization is needed.
- global-read reduction
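- Staging reused data into shared memory cuts reads from slow global memory. For example, in a tiled matrix multiplication with inner dimension K, each thread's global reads per input matrix drop from K to K/TILE.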
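To make the global-read reduction concrete, here is a minimal CUDA sketch. It assumes the workload is a row-major matrix multiplication C = A * B, which the lesson does not specify. The kernel names, the helper `max_abs_diff_tiled_vs_naive`, and the choice TILE = 16 are all hypothetical and only for illustration. The naive kernel reads K values of A and K values of B per thread from global memory. The tiled kernel reads one value of each per tile, so K/TILE of each (rounded up), and serves the rest from shared memory.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int TILE = 16;  // assumed tile width; blocks must be launched as TILE x TILE

// Naive version: every thread issues K global reads of A and K global reads of B.
__global__ void matmul_naive(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = acc;
}

// Tiled version: each thread loads one element of A and one of B per tile,
// so its global reads per input matrix drop from K to ceil(K / TILE).
__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;  // threadIdx.x walks along a row: coalesced
        int b_row = t * TILE + threadIdx.y;
        // Out-of-range threads store 0 instead of returning early, so every
        // thread in the block still reaches both __syncthreads() calls.
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();  // tile fully staged before anyone reads it
        for (int k = 0; k < TILE; ++k) acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next tile overwrites it
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical helper: runs both kernels on n x n inputs and returns the largest
// absolute difference between their outputs, or -1.0f if a CUDA call failed.
float max_abs_diff_tiled_vs_naive(int n) {
    const size_t count = size_t(n) * n, bytes = count * sizeof(float);
    std::vector<float> hA(count), hB(count), hNaive(count), hTiled(count);
    for (size_t i = 0; i < count; ++i) {
        hA[i] = 0.5f * float(i % 7);
        hB[i] = float(i % 5) - 2.0f;
    }
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess &&
              cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dC, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
        dim3 block(TILE, TILE);
        dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
        matmul_naive<<<grid, block>>>(dA, dB, dC, n, n, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hNaive.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        matmul_tiled<<<grid, block>>>(dA, dB, dC, n, n, n);
        ok = ok && cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hTiled.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (!ok) return -1.0f;
    float worst = 0.0f;
    for (size_t i = 0; i < count; ++i)
        worst = std::fmax(worst, std::fabs(hTiled[i] - hNaive[i]));
    return worst;
}
```

Calling `max_abs_diff_tiled_vs_naive(64)` from a small `main` and checking that the result is non-negative and close to zero confirms the tiled kernel computes the same product. In keeping with "measure first", the read reduction itself should be confirmed by profiling both kernels in Nsight rather than assumed from the code. Whether tiling actually speeds up a given kernel depends on whether global memory traffic was the bottleneck in the first place.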