Lesson 13: The Memory Hierarchy
A GPU has several kinds of memory that differ in speed, scope, and lifetime. Registers are the fastest memory: a simple local variable in a kernel normally lives in a register, private to a single thread and existing only while the thread runs. Shared memory, declared with __shared__, lives on-chip and is shared by all threads in the same block.
Think of a kitchen. Registers are your hands: the fastest, but only yours and only for a moment. Shared memory is the counter one team works at: fast and reachable by the whole team, but not by any other team. Global memory is the giant fridge in the warehouse: huge and serving everyone, but the walk there is slow. Constant memory is the recipe board on the wall that everyone reads but nobody changes.
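To make the first two levels concrete, here is a minimal sketch of a per-block sum. The kernel name blockSum, the host helper sumOfOnes, and the block size of 256 are illustrative assumptions, not part of the lesson. Each thread keeps its own value in a scalar local (normally a register), the block cooperates through a __shared__ array, and only the final per-block result is written to global memory.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // threads per block; assumed, must be a power of two here

// Each block sums its slice of `in` and writes one partial sum to `out`.
__global__ void blockSum(const float* in, float* out, int n) {
    __shared__ float tile[kThreads];  // shared memory: one copy per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // scalar local, normally a register
    float v = (i < n) ? in[i] : 0.0f;               // private to this thread

    tile[threadIdx.x] = v;
    __syncthreads();  // wait until every thread in the block has written its slot

    // Tree reduction inside the block, using only shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        out[blockIdx.x] = tile[0];  // one result per block goes to global memory
    }
}

// Hypothetical host helper: sums n ones on the GPU; returns -1 on a CUDA error.
float sumOfOnes(int n) {
    int blocks = (n + kThreads - 1) / kThreads;
    std::vector<float> hostIn(n, 1.0f), hostOut(blocks, 0.0f);
    float* devIn = nullptr;
    float* devOut = nullptr;

    if (cudaMalloc(&devIn, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&devOut, blocks * sizeof(float)) != cudaSuccess) {
        cudaFree(devIn);
        return -1.0f;
    }

    cudaMemcpy(devIn, hostIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    blockSum<<<blocks, kThreads>>>(devIn, devOut, n);
    cudaError_t err = cudaMemcpy(hostOut.data(), devOut, blocks * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
    if (err != cudaSuccess) return -1.0f;

    // Blocks cannot see each other's shared memory, so the partial sums
    // are combined on the host.
    float total = 0.0f;
    for (float p : hostOut) total += p;
    return total;
}

int main() {
    float total = sumOfOnes(1000);
    std::printf("total = %.1f\n", total);
    return total == 1000.0f ? 0 : 1;
}
```

The final combination happens on the host because a block has no way to read another block's tile array; that is the "not by any other team" limit from the analogy.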
- registers
- The fastest memory, private to a single thread. A simple local variable in a kernel normally lives here and exists only while the thread runs.
- shared memory
- On-chip memory declared __shared__, shared by all threads in the same block. Far faster than global memory, but not visible to threads in other blocks.
- global memory
- The device DRAM: large and slow compared with on-chip memory, accessible by every thread in the grid, and persistent across kernel launches until it is freed. Data is copied here from the host.
- constant memory
- Memory declared __constant__: read-only inside kernels (the host sets its contents) and cached. Excellent when all threads read the same value.
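The last two entries can be sketched together. The example below is a minimal illustration under assumed names (kScale, scaleInPlace, and scaleTwice are hypothetical): the host writes one value into constant memory with cudaMemcpyToSymbol, then launches the same kernel twice on one global buffer. The second launch sees what the first one wrote, which is what persistence between launches means in practice.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Constant memory: read-only inside kernels, written by the host.
__constant__ float kScale;

// Every thread reads the same kScale and scales its own element in global memory.
__global__ void scaleInPlace(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= kScale;
    }
}

// Hypothetical host helper: fills a buffer with `start`, scales it twice on the
// GPU, and returns the first element; returns -1 on a CUDA error.
float scaleTwice(float start, float scale) {
    const int n = 1024;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    std::vector<float> host(n, start);
    float* dev = nullptr;

    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return -1.0f;

    // Copy the input from the host into global memory.
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    // Set the constant once; both launches below read it.
    cudaMemcpyToSymbol(kScale, &scale, sizeof(float));

    scaleInPlace<<<blocks, threads>>>(dev, n);
    scaleInPlace<<<blocks, threads>>>(dev, n);  // operates on the first launch's output

    cudaError_t err = cudaMemcpy(host.data(), dev, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess ? host[0] : -1.0f;
}

int main() {
    float result = scaleTwice(1.0f, 3.0f);
    std::printf("result = %.1f\n", result);
    return result == 9.0f ? 0 : 1;
}
```

Passing the scale as a kernel argument would work just as well for a single float; constant memory is used here only to show the declaration and the host-side copy.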