Lesson 25: Parallel Scan — the Doubling Kernel
Last lesson we saw what a prefix sum computes and the doubling idea: each round every element adds the neighbor d away, and d doubles (1, 2, 4...). Here we build that as a real CUDA kernel. We work inside a single block with a shared-memory array (__shared__) where one thread owns one position. The
Each thread owns one position in the shared array. On each step it first reads the value d places to its left and sets it aside, waits for everyone to have read (__syncthreads), only then adds it to itself, and again waits for everyone to have written before the next step. The waits are what keep the order and prevent overwriting.
- Hillis-Steele
- The doubling scan kernel in log2(n) steps: on step d, each thread with tid>=d adds the value from tid-d. d doubles each step.
- shared-memory tile
- A __shared__ array for the whole block. The scan loads the input into it, works in place across all steps, and finally writes to out.
- read/write barriers
- __syncthreads() after the read (so all read before any write) and after the write (so writes are visible to the next step).
- step count log2(n)
- The loop for(d=1; d<n; d*=2) runs log2(n) times, because d doubles each step until it reaches n.