Lesson 16: Shared Memory Basics

Shared memory is a small, fast memory that sits on the SM's chip and serves as a scratchpad for all threads in the same block. You declare it with the keyword __shared__, for example __shared__ float tile[256];, and it is roughly 100 times faster than global memory — but small (tens of KB per block,

Global memory is a huge, distant warehouse. Shared memory is a small drawer next to the team's desk: you fetch from the warehouse once, put it in the drawer, and everyone on the team pulls from there quickly. Each team (block) gets its own separate drawer.

shared memory: Fast on-chip memory shared by all threads in the same block, declared with __shared__. Much faster than global but small.
scratchpad: A temporary workspace where the block stores data read again and again, to avoid going to slow global memory each time.
block scope: The scope of shared memory: only for the block that created it, and only while that block runs. Other blocks do not see it.
data reuse: When the same value is needed many times. Then it pays to load it once into shared memory and read it from there cheaply.