Lesson 28: Occupancy: active warps per SM

Each Streaming Multiprocessor (SM) on the GPU can hold only a limited number of active warps at once. Occupancy is defined as the ratio: occupancy = active warps / maximum possible warps per SM. Why does it matter? When one warp runs out of work or waits on memory, the SM can immediately switch to another warp, so the wait is hidden behind useful work.
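As a quick illustration of the ratio above, here is a minimal Python sketch. The function name `occupancy` and the figure of 64 maximum warps per SM are assumptions chosen for the example; the real maximum depends on the GPU architecture.

```python
def occupancy(active_warps: int, max_warps_per_sm: int) -> float:
    """Return active warps divided by the maximum warps the SM supports."""
    if max_warps_per_sm <= 0:
        raise ValueError("max_warps_per_sm must be positive")
    if not 0 <= active_warps <= max_warps_per_sm:
        raise ValueError("active_warps must be between 0 and max_warps_per_sm")
    return active_warps / max_warps_per_sm


# Assumed example: an SM that supports up to 64 warps, with 32 resident.
print(occupancy(32, 64))  # 0.5
```

An occupancy of 0.5 here means the SM holds half the warps it could, so it has fewer alternatives to switch to when a warp stalls.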

Imagine a waiter handling several tables at once. When one table has not finished choosing from the menu, he moves to another table instead of standing and waiting. The more active tables he has, the less time he wastes waiting. But if he has too many tables, each table gets less attention — so more tables is not always better.

occupancy: The ratio of active warps on an SM to the maximum number of warps the SM supports. High occupancy helps hide memory stalls.
SM: The processing unit on the GPU that runs thread blocks. Its resources (registers, shared memory) have a limited capacity shared among the blocks resident on it.
latency hiding: When a warp waits on memory, the SM switches to another warp. More active warps means more work to hide the wait behind.
resource limit: Registers per thread, shared memory per block, and block size — each can limit how many blocks fit on an SM.
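
The resource limits above can be combined into a rough occupancy estimate: each limit caps how many blocks fit on an SM, and the tightest one wins. The sketch below is a simplified model, not a replacement for a profiler or the vendor's occupancy tools. The `SMLimits` values are hypothetical numbers picked for illustration, and the model ignores the allocation granularity that real hardware applies to registers and shared memory.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


@dataclass(frozen=True)
class SMLimits:
    # Hypothetical per-SM limits for illustration; real values vary by GPU.
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 65536


def estimate_occupancy(threads_per_block: int, regs_per_thread: int,
                       smem_per_block: int, sm: SMLimits = SMLimits()) -> float:
    """Estimate occupancy from block size, registers, and shared memory."""
    if threads_per_block <= 0 or regs_per_thread <= 0 or smem_per_block < 0:
        raise ValueError("invalid kernel resource usage")

    # A partially filled warp still occupies a full warp slot (ceiling division).
    warps_per_block = -(-threads_per_block // WARP_SIZE)

    # How many blocks each resource allows on one SM.
    by_warps = sm.max_warps // warps_per_block
    by_regs = sm.registers // (regs_per_thread * warps_per_block * WARP_SIZE)
    by_smem = (sm.shared_mem_bytes // smem_per_block
               if smem_per_block > 0 else sm.max_blocks)

    # The most restrictive limit decides how many blocks are resident.
    blocks = min(sm.max_blocks, by_warps, by_regs, by_smem)
    return blocks * warps_per_block / sm.max_warps


print(estimate_occupancy(256, 32, 0))      # 1.0, no limit is exceeded
print(estimate_occupancy(256, 64, 0))      # 0.5, limited by registers
print(estimate_occupancy(256, 32, 32768))  # 0.25, limited by shared memory
```

Under these assumed limits, doubling the registers per thread halves the number of resident blocks, and a large shared memory request per block cuts it further. This is the waiter analogy in numbers: giving each table more attention leaves room for fewer tables.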