Lesson 21: Optimizing Reduction: sequential addressing

In the previous lesson we built a tree reduction in shared memory: each step has half the threads add a pair of values, shrinking the sum down to a single value. But the way we pick which threads are active and which indices they touch dramatically changes performance. In the naive, interleaved-addr

Imagine 8 kids in a row who must add up numbers. In the bad method you ask kid 0, 2, 4, and 6 to work — one works, one rests, messy. In the good method you ask only the first 4 kids, each adding the number of the kid across from them in the second half of the row. Same sum, but one orderly group works while the other rests — no mess.

interleaved addressing: A reduction scheme where the stride grows (s *= 2) and the active thread is 2*s*tid, so active threads are scattered — causing warp divergence and bank conflicts.
sequential addressing: A reduction scheme with a reversed loop (s >>= 1) where each thread with tid < s does tile[tid] += tile[tid+s], so active threads are contiguous — no divergence, no conflicts.
bank conflict: When threads in the same warp access the same shared-memory bank, the accesses serialize instead of running in parallel — hurting performance.
active threads: The threads doing work in a given reduction step. In sequential addressing they are a contiguous 0..s-1; in interleaved they are scattered.