Lesson 25: Parallel Scan — the Doubling Kernel

Last lesson we saw what a prefix sum computes and the doubling idea: each round every element adds the neighbor d away, and d doubles (1, 2, 4...). Here we build that as a real CUDA kernel. We work inside a single block with a shared-memory array (__shared__) where one thread owns one position. The

Each thread owns one position in the shared array. On each step it first reads the value d places to its left and sets it aside, waits for everyone to have read (__syncthreads), only then adds it to itself, and again waits for everyone to have written before the next step. The waits are what keep the order and prevent overwriting.

Hillis-Steele: The doubling scan kernel in log2(n) steps: on step d, each thread with tid>=d adds the value from tid-d. d doubles each step.
shared-memory tile: A __shared__ array for the whole block. The scan loads the input into it, works in place across all steps, and finally writes to out.
read/write barriers: __syncthreads() after the read (so all read before any write) and after the write (so writes are visible to the next step).
step count log2(n): The loop for(d=1; d<n; d*=2) runs log2(n) times, because d doubles each step until it reaches n.