Lesson 36: Capstone — End-to-End Kernel Optimization

This is the capstone lesson of the course. We will not learn a new mechanism; instead, we will connect everything we have already learned into one process: how to take a naive, slow kernel and make it fast. The checklist is not magic; it is a sequence of checks we have already seen separately. First, coalescing: ensur

Improving a kernel is like renovating a house: you can't just paint and hope. You check what is truly broken — the plumbing, the wiring, or the roof — and fix that. And there is one pressure gauge you must not ignore: measurement (Nsight). Without measuring, you fix the wrong wall.

optimization checklist: A sequence of checks to improve a kernel: coalescing global memory accesses, staging reused data in shared memory, reducing warp divergence, tuning occupancy, and measuring in Nsight.
bottleneck: The real limit slowing the kernel. Fixing something else will not help until you address it specifically.
measure first: Use Nsight to identify the bottleneck before changing code, instead of guessing which optimization is needed.
global-read reduction: Staging reused data into shared memory cuts reads from slow global memory. For example, in a tiled matrix multiply with inner dimension K, each thread's global reads per input matrix drop from K to K/TILE.
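
To make the global-read reduction concrete, here is a minimal CUDA sketch, assuming the kernel being optimized is a matrix multiply C = A * B with A of size n x K and B of size K x n. The names matmul_naive, matmul_tiled, and max_abs_diff, and the choice TILE = 16, are illustrative assumptions and do not come from the lesson. In the naive kernel each thread loads K elements of A and K elements of B from global memory; in the tiled kernel each thread loads one element of each matrix per tile, so roughly K/TILE per matrix.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile width; each block is TILE x TILE threads

// Naive version: every loop iteration issues two global loads per thread,
// so each thread reads K elements of A and K elements of B.
__global__ void matmul_naive(const float* A, const float* B, float* C, int n, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[row * K + k] * B[k * n + col];
    C[row * n + col] = acc;
}

// Tiled version: the block stages one TILE x TILE tile of A and of B in shared
// memory, then every thread reuses those tiles TILE times. Each thread issues
// one global load per matrix per tile, i.e. ceil(K / TILE) loads per matrix.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;  // consecutive threadIdx.x -> consecutive addresses
        int b_row = t * TILE + threadIdx.y;
        // Out-of-range elements are padded with zero so partial tiles stay correct.
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // tiles must be fully loaded before anyone reads them
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone must finish reading before the next overwrite
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper: runs both kernels on the same inputs and returns the
// largest absolute difference between their outputs. CUDA error checking is
// omitted for brevity; real code should check every runtime call.
float max_abs_diff(int n, int K) {
    std::vector<float> hA((size_t)n * K), hB((size_t)K * n);
    std::vector<float> hNaive((size_t)n * n), hTiled((size_t)n * n);
    for (size_t i = 0; i < hA.size(); ++i) hA[i] = float(i % 7) * 0.25f;
    for (size_t i = 0; i < hB.size(); ++i) hB[i] = float(i % 5) * 0.5f;

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dC, hNaive.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_naive<<<grid, block>>>(dA, dB, dC, n, K);
    cudaMemcpy(hNaive.data(), dC, hNaive.size() * sizeof(float), cudaMemcpyDeviceToHost);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n, K);
    cudaMemcpy(hTiled.data(), dC, hTiled.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float worst = 0.0f;
    for (size_t i = 0; i < hNaive.size(); ++i)
        worst = std::fmax(worst, std::fabs(hNaive[i] - hTiled[i]));
    return worst;
}
```

Calling max_abs_diff(256, 256) from a main function is one way to check that the tiled kernel still computes the same result before looking at speed. The sketch says nothing about whether global memory is the actual bottleneck: following the "measure first" rule, both kernels would be profiled in Nsight to see whether the drop in global loads shows up, since caches may already absorb part of the naive kernel's traffic.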