Lesson 14: Global Memory Coalescing — the concept

Global memory (device DRAM) is the largest but also the slowest memory on the GPU, and the hardware reads it in transactions, each covering a contiguous address block of typically 128 bytes. When all 32 threads in a warp access consecutive addresses, for example a[i] where i is the thread's global index, a single transaction can serve the whole warp. With 4-byte elements, the 32 accesses span exactly 32 × 4 = 128 bytes, which is one such block.

Imagine a watering can serving 32 seedlings. If the seedlings sit in one tight row, one pour wets them all. If each seedling is 32 steps from the next, you need 32 separate pours for the same 32 seedlings: the same water, 32 times the work.
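
The seedling picture can be checked with a little arithmetic. What follows is a minimal host-side C++ sketch, not GPU code and not a description of any particular architecture: it assumes 128-byte transactions, a warp of 32 threads, an array that starts on a 128-byte boundary, and thread t reading the element at index t * stride. The function name transactions_per_warp is made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Count the distinct 128-byte blocks touched by one warp of 32 threads,
// where thread t reads the element at index t * stride_elems.
// Simplifying assumptions: the array starts on a 128-byte boundary and
// every touched block costs exactly one transaction.
std::size_t transactions_per_warp(std::size_t stride_elems,
                                  std::size_t elem_bytes = 4) {
    assert(elem_bytes > 0);
    constexpr std::size_t kWarpSize = 32;
    constexpr std::size_t kTransactionBytes = 128;

    std::set<std::size_t> blocks;  // indices of the 128-byte blocks touched
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::size_t byte_addr = t * stride_elems * elem_bytes;
        blocks.insert(byte_addr / kTransactionBytes);
    }
    return blocks.size();
}
```

Under these assumptions, transactions_per_warp(1) gives 1 (the tight row) and transactions_per_warp(32) gives 32 (one pour per seedling). Real hardware adds caches and finer-grained transfers, so treat the counts as a model of the trend rather than a measurement.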

coalescing: When consecutive threads in a warp access consecutive addresses, so the hardware serves them all in a single memory transaction.
memory transaction: A contiguous address block (typically 128 bytes) that the hardware reads or writes in one operation against global memory.
stride: The distance, measured in elements, between the addresses accessed by two neighboring threads. A stride of 1 is contiguous (coalesced); a large stride scatters the accesses across many transactions.
bandwidth: The number of useful bytes moved to or from memory per unit time. Scattered access wastes it because each transaction brings in bytes that no thread uses.
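
To put a number on the wasted bandwidth described above, one can compare the bytes the threads asked for with the bytes the transactions actually move. The sketch below is again plain host-side C++ under stated assumptions: 128-byte transactions, elements that never straddle a block boundary, and one transaction per touched block. The names strided_addresses and bandwidth_efficiency are hypothetical helpers for this lesson, not part of any GPU API.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Byte addresses read by a warp of 32 threads when thread t reads
// element t * stride_elems of an array starting at address 0.
std::vector<std::size_t> strided_addresses(std::size_t stride_elems,
                                           std::size_t elem_bytes = 4) {
    std::vector<std::size_t> addrs;
    for (std::size_t t = 0; t < 32; ++t) {
        addrs.push_back(t * stride_elems * elem_bytes);
    }
    return addrs;
}

// Fraction of transferred bytes that the threads actually requested:
// useful bytes / (touched 128-byte blocks * 128).
// Assumes each element lies entirely inside one block.
double bandwidth_efficiency(const std::vector<std::size_t>& addrs,
                            std::size_t elem_bytes = 4) {
    if (addrs.empty()) {
        return 0.0;  // nothing requested, nothing moved
    }
    constexpr std::size_t kTransactionBytes = 128;

    std::set<std::size_t> blocks;  // distinct blocks touched by the warp
    for (std::size_t a : addrs) {
        blocks.insert(a / kTransactionBytes);
    }
    const double useful = static_cast<double>(addrs.size() * elem_bytes);
    const double moved = static_cast<double>(blocks.size() * kTransactionBytes);
    return useful / moved;
}
```

In this model, a stride of 1 yields an efficiency of 1.0, a stride of 2 yields 0.5, and a stride of 32 yields 1/32: the warp still needs only 128 bytes, but 32 blocks of 128 bytes are moved to deliver them.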