Lesson 3: Your First Kernel — __global__ and <<<>>>
Until now the GPU has been an idea: thousands of tiny workers waiting for a task. Now we give them their first job. The code each GPU worker runs is called a kernel: one small function that all the threads run together, each on its own data. You mark it with the __global__ keyword, and it always returns void.
A kernel is like a recipe you pin on the board of a giant kitchen. You (the CPU) do not cook; you just pin the recipe and shout <<<2 tables, 4 cooks per table>>>. The eight cooks start working, and you walk off to do something else without waiting for them to finish.
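The kitchen picture can be sketched in a few lines of CUDA. This is a minimal illustration, not the only way to write it: the names cook and pinRecipe are made up for this example, and the code assumes a CUDA-capable GPU and compilation with nvcc.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The recipe: every cook (GPU thread) runs this same function.
// blockIdx.x is the table number, threadIdx.x is the cook's seat at that table.
__global__ void cook()
{
    printf("table %u, cook %u: cooking\n", blockIdx.x, threadIdx.x);
}

// Hypothetical helper for this sketch: pins the recipe for `tables` blocks
// of `cooksPerTable` threads each.
// Returns the number of threads launched, or -1 on a CUDA error.
int pinRecipe(int tables, int cooksPerTable)
{
    cook<<<tables, cooksPerTable>>>();   // returns immediately; cooks work in the background

    if (cudaGetLastError() != cudaSuccess) return -1;       // bad launch configuration, etc.
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;  // wait so the GPU output is flushed

    return tables * cooksPerTable;
}

int main()
{
    int cooks = pinRecipe(2, 4);   // the <<<2 tables, 4 cooks>>> shout from the analogy
    printf("cooks launched: %d\n", cooks);
    return cooks == 8 ? 0 : 1;
}
```

The order of the "cooking" lines is not guaranteed, because the threads run in parallel. Without the cudaDeviceSynchronize() call the program could exit before the cooks have printed anything.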
- __global__
  Marks a kernel: a function that runs on the GPU and is launched from the CPU. Always returns void.
- launch configuration <<<>>>
  The kernel<<<numBlocks, threadsPerBlock>>>(args) syntax that sets how many blocks and how many threads per block.
- grid and block
  A block is a group of threads; a grid is the set of blocks for one launch. Total threads = numBlocks times threadsPerBlock.
- asynchronous launch
  The CPU does not wait for the kernel to finish; it continues immediately. Explicit synchronization is needed before reading results.
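The four terms above can be seen working together in one small program. This is a sketch under some assumptions: writeIds and sumOfIds are hypothetical names, the built-in blockDim.x (threads per block) and cudaMallocManaged (memory visible to both CPU and GPU) go beyond what this lesson has introduced and are used here only to keep the example short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __global__ kernel: each thread computes its own index within the whole grid
// and writes that index into its own slot of the output array.
__global__ void writeIds(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // block offset + position in block
    if (i < n) {
        out[i] = i;
    }
}

// Hypothetical helper: launches numBlocks x threadsPerBlock threads and returns
// the sum of the values they wrote, or -1 on a CUDA error.
int sumOfIds(int numBlocks, int threadsPerBlock)
{
    int n = numBlocks * threadsPerBlock;   // total threads in the grid
    int *out = nullptr;
    if (cudaMallocManaged(&out, n * sizeof(int)) != cudaSuccess) return -1;

    writeIds<<<numBlocks, threadsPerBlock>>>(out, n);
    // Asynchronous launch: the CPU is already here while the GPU may still be working.

    // Synchronize before the CPU touches the results.
    if (cudaGetLastError() != cudaSuccess || cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(out);
        return -1;
    }

    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += out[i];
    }
    cudaFree(out);
    return sum;
}

int main()
{
    int sum = sumOfIds(2, 4);          // 8 threads write 0..7
    printf("sum of ids = %d\n", sum);  // 0 + 1 + ... + 7 = 28 when the launch succeeds
    return sum == 28 ? 0 : 1;
}
```

Reading out before cudaDeviceSynchronize() returns would race with the kernel, which is why the synchronization sits between the launch and the loop.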