Lesson 26: Matrix Multiply — the Naive GPU Kernel

Matrix multiplication C = A * B computes every element C[row][col] as a sum of products along the shared dimension K: C[row][col] = sum over k of A[row][k] * B[k][col]. The direct (naive) way to parallelize this on a GPU is to assign one thread to each element of C: the thread computes its row and c

Imagine a huge multiplication grid to fill in. In the naive version each worker is responsible for just one cell, but to fill it they walk to a far warehouse and fetch a whole row and a whole column — and the worker next to them walks to the same warehouse and fetches the very same row again. Lots of wasted trips.

thread-to-element mapping: Each thread owns one element of C. It computes row and col from the 2D indices and produces C[row][col].
the k-loop (dot product): The loop over k that accumulates sum += A[row*K + k] * B[k*N + col]. It is a dot product of a row of A with a column of B.
row-major flat index: A rows x cols matrix is stored row after row, so element [r][c] sits at address r*cols + c.
redundant global reads: In the naive version each element of A and B is read from global memory over and over by different threads, burning bandwidth.