Lesson 3: Your First Kernel — global and <<<>>>

Until now the GPU has been an idea — thousands of tiny workers waiting for a task; now we give them their first job. The code each GPU worker runs is called a kernel: one small function that all the threads run together, each on its own data. You mark it with the __global__ keyword, it always return

A kernel is like a recipe you pin on the board of a giant kitchen. You (the CPU) do not cook — you just pin the recipe and shout <<<2 tables, 4 cooks>>>. The eight cooks start working, and you walk off to do something else without waiting for them to finish.

__global__: Marks a kernel: a function that runs on the GPU and is launched from the CPU. Always returns void.
launch configuration <<<>>>: The kernel<<<numBlocks, threadsPerBlock>>>(args) syntax that sets how many blocks and how many threads per block.
grid and block: A block is a group of threads; a grid is the set of blocks for one launch. Total threads = numBlocks times threadsPerBlock.
asynchronous launch: The CPU does not wait for the kernel to finish; it continues immediately. Explicit synchronization is needed before reading results.