Lesson 4: Sizing the Grid — Enough Threads for N
Hidden in every GPU launch is a small counting puzzle: you have n items to process, but threads come in fixed-size blocks — so how many blocks cover them all without missing the last few? That is what this lesson is about: making sure there are enough threads. You already know how to declare a kerne
Suppose you need to transport 1000 people in buses that seat 256 each. 1000 divided by 256 is about 3.9, but you cannot order 3.9 buses: you need 4 whole ones, otherwise the last passengers are left behind. The ceiling formula always rounds up to a whole bus. True, the fourth bus has 24 empty seats (4 times 256 is 1024 seats for 1000 people); those are the extra threads that simply sit idle.
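The bus arithmetic can be checked in a few lines of host code. This is a minimal sketch, not a required pattern: `ceilDiv` is a hypothetical helper name chosen for illustration (it is not part of the CUDA API), and it assumes both arguments are positive integers.

```cuda
#include <cstdio>

// Hypothetical helper (not a CUDA API): ceiling division for positive ints.
// Adding d - 1 before the truncating integer division rounds the result up.
constexpr int ceilDiv(int n, int d) {
    return (n + d - 1) / d;
}

int main() {
    const int n = 1000;               // people to transport (elements)
    const int threadsPerBlock = 256;  // seats per bus (threads per block)

    const int truncated = n / threadsPerBlock;          // integer division drops the remainder
    const int numBlocks = ceilDiv(n, threadsPerBlock);  // rounds up to a whole bus
    const int surplus = numBlocks * threadsPerBlock - n;

    std::printf("truncating division: %d blocks\n", truncated);  // 3
    std::printf("ceiling division:    %d blocks\n", numBlocks);  // 4
    std::printf("surplus threads:     %d\n", surplus);           // 24
    return 0;
}
```

Plain integer division would order only 3 buses and strand 232 passengers; the ceiling form orders 4 and leaves 24 seats empty.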
- ceiling division
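- Division that rounds any fractional result up to the next whole number. In integer arithmetic, where / truncates, it is written (n + d - 1) / d for positive n and d, and gives the number of groups of size d needed to cover all n elements.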
- threads per block
- The number of threads in each block (the second number in <<<>>>). A common value is 256. It is the divisor d in the block-count formula.
- grid size
- The number of blocks launched, numBlocks (the first number in <<<>>>). It is chosen so that the total thread count (numBlocks times threadsPerBlock) is at least n.
- bounds guard
- The if (i < n) condition inside the kernel. It makes the surplus threads (those whose index i is n or greater) skip the work, so they never read or write past the end of the array.
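The four terms above fit together in a single launch. The following is a sketch under stated assumptions rather than a definitive implementation: the kernel name `scaleByTwo`, the host helper `runDemo`, and the choice of doubling each element are all hypothetical, picked only to give the threads something to do. Error handling is kept minimal.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical example kernel: doubles each of the n elements in data.
__global__ void scaleByTwo(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // bounds guard
        data[i] *= 2.0f;                            // surplus threads skip this
    }
}

// Hypothetical host helper: fills n elements with 1, runs the kernel, and
// reports whether every element came back as 2.
bool runDemo(int n) {
    const int threadsPerBlock = 256;
    // Grid size by ceiling division: enough blocks to cover all n elements.
    const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    std::vector<float> host(n, 1.0f);
    float* dev = nullptr;
    if (cudaMalloc(reinterpret_cast<void**>(&dev), bytes) != cudaSuccess) {
        return false;
    }
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);

    // First launch parameter: grid size. Second: threads per block.
    scaleByTwo<<<numBlocks, threadsPerBlock>>>(dev, n);

    // This copy runs after the kernel has finished on the default stream.
    cudaError_t err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) {
        return false;
    }

    for (float x : host) {
        if (x != 2.0f) return false;  // an element was missed
    }
    return true;
}

int main() {
    std::printf("n = 1000: %s\n", runDemo(1000) ? "all elements processed" : "failed");
    return 0;
}
```

With n = 1000 this launches 4 blocks of 256 threads, so threads 1000 through 1023 fail the guard and do nothing. Note that the final check only confirms that no element was skipped; it does not by itself detect an out-of-bounds write, which is why the guard belongs in the kernel regardless.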