Lesson 32: Streams & Concurrency
A CUDA stream is a queue of operations that execute in the order they were submitted to it. The key rule: operations in the same stream serialize (run in order), but operations in different streams can overlap in time. That is what unlocks the big speedup: while the GPU computes a kernel on one chun
A stream is like a checkout lane: customers in one lane wait in line, one after another. Open two lanes (two streams) and two customers are served at once. If everyone crowds into one lane, there is no concurrency.
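The checkout-lane picture can be sketched in code. This is a minimal illustration, not a tuned implementation: the kernels `add_one` and `times_two`, the stream names `lane_a` and `lane_b`, and the `CHECK` macro are hypothetical names chosen for this example. Within each lane the two kernels run in submission order, so every element ends up as (0 + 1) * 2 = 2. The two lanes are independent, so the GPU is free to overlap them if it has spare resources. Overlap is permitted, not guaranteed.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: bail out of the enclosing function on any CUDA error.
// (A sketch only: it does not free resources on the error path.)
#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Illustrative kernels, not part of any library.
__global__ void add_one(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

__global__ void times_two(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns true if both buffers contain 2.0f everywhere.
bool run_two_lanes() {
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(float);

    float *d_a = nullptr, *d_b = nullptr;
    CHECK(cudaMalloc(&d_a, bytes));
    CHECK(cudaMalloc(&d_b, bytes));
    CHECK(cudaMemset(d_a, 0, bytes));
    CHECK(cudaMemset(d_b, 0, bytes));

    // Two lanes: work inside a lane is ordered, work across lanes is not.
    cudaStream_t lane_a, lane_b;
    CHECK(cudaStreamCreate(&lane_a));
    CHECK(cudaStreamCreate(&lane_b));

    const int block = 256;
    const int grid = (n + block - 1) / block;

    // Same stream: times_two starts only after add_one has finished.
    add_one<<<grid, block, 0, lane_a>>>(d_a, n);
    times_two<<<grid, block, 0, lane_a>>>(d_a, n);

    // Different stream: this pair may run while lane_a is still busy.
    add_one<<<grid, block, 0, lane_b>>>(d_b, n);
    times_two<<<grid, block, 0, lane_b>>>(d_b, n);
    CHECK(cudaGetLastError());

    // Wait for each lane to drain before reading results.
    CHECK(cudaStreamSynchronize(lane_a));
    CHECK(cudaStreamSynchronize(lane_b));

    std::vector<float> h_a(n), h_b(n);
    CHECK(cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost));

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (h_a[i] != 2.0f || h_b[i] != 2.0f) ok = false;
    }

    cudaStreamDestroy(lane_a);
    cudaStreamDestroy(lane_b);
    cudaFree(d_a);
    cudaFree(d_b);
    return ok;
}

int main() {
    std::printf("%s\n", run_two_lanes() ? "ok" : "failed");
    return 0;
}
```

Launching all four kernels into `lane_a` alone would give the same results but force them to run one after another, which is the "everyone crowds into one lane" case.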
- stream
- A queue of GPU operations executed in submission order. Operations in the same stream serialize; in different streams they can overlap.
- default stream (stream 0)
- The stream operations go to when none is specified. Under the legacy default behavior it synchronizes implicitly with other (blocking) streams, so routing everything through it prevents overlap.
- cudaMemcpyAsync
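- A non-blocking memory copy that takes a stream argument and returns control to the host before the copy has finished. It can overlap with compute in another stream when the host buffer is pinned (page-locked) memory; with ordinary pageable memory the copy is generally not truly asynchronous.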
- overlap
- Running different operations at the same time (e.g. a copy in one stream while compute runs in another) to cut total time.
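The terms above fit together in a common chunked pattern, sketched here under stated assumptions. The kernel `scale`, the function `run_chunked_pipeline`, the `CHECK` macro, the chunk size, and the choice of four streams are all hypothetical and picked only for illustration. The host buffer is allocated with `cudaMallocHost` so it is pinned, and each chunk gets its own stream carrying a copy in, a kernel, and a copy out. Within a stream those three steps stay in order; across streams the copy for one chunk may overlap with compute on another, depending on the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: bail out of the enclosing function on any CUDA error.
// (A sketch only: it does not free resources on the error path.)
#define CHECK(call) do { if ((call) != cudaSuccess) return false; } while (0)

// Illustrative kernel: multiply every element by a factor.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns true if every element was scaled from 1.0f to 3.0f.
bool run_chunked_pipeline() {
    const int num_streams = 4;          // assumed value for the example
    const int chunk = 1 << 18;          // elements per chunk (assumed)
    const int n = num_streams * chunk;
    const size_t chunk_bytes = chunk * sizeof(float);

    // Pinned host memory, so async copies can really run asynchronously.
    float* h_data = nullptr;
    CHECK(cudaMallocHost(&h_data, n * sizeof(float)));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float* d_data = nullptr;
    CHECK(cudaMalloc(&d_data, n * sizeof(float)));

    cudaStream_t streams[num_streams];
    for (int s = 0; s < num_streams; ++s) {
        CHECK(cudaStreamCreate(&streams[s]));
    }

    const int block = 256;
    const int grid = (chunk + block - 1) / block;

    // One chunk per stream: copy in, compute, copy out.
    for (int s = 0; s < num_streams; ++s) {
        float* h = h_data + s * chunk;
        float* d = d_data + s * chunk;
        CHECK(cudaMemcpyAsync(d, h, chunk_bytes,
                              cudaMemcpyHostToDevice, streams[s]));
        scale<<<grid, block, 0, streams[s]>>>(d, chunk, 3.0f);
        CHECK(cudaMemcpyAsync(h, d, chunk_bytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaGetLastError());

    // The calls above only enqueue work; wait here for all of it to finish.
    CHECK(cudaDeviceSynchronize());

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (h_data[i] != 3.0f) ok = false;
    }

    for (int s = 0; s < num_streams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return ok;
}

int main() {
    std::printf("%s\n", run_chunked_pipeline() ? "ok" : "failed");
    return 0;
}
```

Passing stream 0 in place of `streams[s]` would still produce correct results, but every operation would queue in the default stream and nothing could overlap.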