Lesson 20: Parallel Reduction: tree-based sum
A reduction collapses an entire array into a single value — for example the sum of all elements. On a CPU you do this in a sequential loop: one accumulator walking over the n elements in n steps. But that wastes the GPU: one thread works while thousands sit idle. The parallel solution is a tree redu
Imagine 8 people who must add 8 numbers. Instead of one person adding them all one by one, each pair adds its two numbers at the same time — 4 remain. Again each pair adds — 2 remain. And again — one remains. Three rounds instead of seven.
- reduction
- Collapsing an array into a single value with an associative operation (sum, max, etc.). On a GPU it is done as a tree.
- tree reduction
- At each step half the threads add pairs; the depth is log2(n) steps instead of n.
- stride
- The distance between the two elements added in a given step. Starts large and halves each step down to 1.
- sync between steps
- __syncthreads() after each step guarantees all writes finished before the next step reads them — without it there is a race condition.