Lesson 27: Tiled Matrix Multiplication

Matrix multiplication C = A * B computes every element C[row][col] as a sum of products along the K dimension: C[row][col] = sum over k of A[row][k] * B[k][col]. In the naive version, each thread responsible for one element of C reads a whole row of A and a whole column of B directly from global mem

Imagine assembling a dish and repeatedly running to the distant warehouse for the very same ingredients. Instead, you bring a whole crate once to the kitchen counter (the nearby shelf), and all the cooks pull from it fast. Tiling is exactly that: you bring a tile of data once into the nearby shared memory, and reuse it there again and again.

tiled matmul: A matmul algorithm that loads tiles of A and B into shared memory, synchronizes, and accumulates from there — saving global memory reads.
tile: A TILE x TILE sub-block of the matrix, loaded at once into shared memory and used by all threads in the block.
shared-memory reuse: Each element loaded into shared memory is read by TILE different threads, so one global read serves many products.
K dimension: The shared dimension in the product: A is M x K and B is K x N. The sum over k runs along K and is computed tile by tile.