Lesson 27: Tiled Matrix Multiplication
Matrix multiplication C = A * B computes every element C[row][col] as a sum of products along the K dimension: C[row][col] = sum over k of A[row][k] * B[k][col]. In the naive version, each thread responsible for one element of C reads a whole row of A and a whole column of B directly from global mem
Imagine assembling a dish and repeatedly running to the distant warehouse for the very same ingredients. Instead, you bring a whole crate once to the kitchen counter (the nearby shelf), and all the cooks pull from it fast. Tiling is exactly that: you bring a tile of data once into the nearby shared memory, and reuse it there again and again.
- tiled matmul
- A matmul algorithm that loads tiles of A and B into shared memory, synchronizes, and accumulates from there — saving global memory reads.
- tile
- A TILE x TILE sub-block of the matrix, loaded at once into shared memory and used by all threads in the block.
- shared-memory reuse
- Each element loaded into shared memory is read by TILE different threads, so one global read serves many products.
- K dimension
- The shared dimension in the product: A is M x K and B is K x N. The sum over k runs along K and is computed tile by tile.