Lesson 33: Async Transfers & Pinned Memory

For a copy to overlap compute, it is not enough to use cudaMemcpyAsync and several streams — the host-side memory must also be pinned. Ordinary memory allocated with malloc is pageable: the operating system may move it in physical memory, so the GPU cannot copy from it directly via DMA. The driver m

Pageable memory is a library book that can be moved from shelf to shelf at any moment — a courier cannot grab it directly. Pinned memory is a book placed on a fixed stand by the door, and the courier snatches it without waiting.

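pinned memory (page-locked): Host memory locked at a fixed physical location, so the OS can neither move it nor page it out. The DMA engine reads from it directly, which is what allows a truly asynchronous transfer.
pageable memory: Ordinary host memory (malloc) that the OS may move or page out. A copy from it is staged through a temporary pinned buffer, so even cudaMemcpyAsync may block the host and generally does not overlap with compute.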
cudaMallocHost: Allocates pinned host memory. Free it with cudaFreeHost. An alternative is cudaHostAlloc.
DMA (direct memory access): A mechanism in which a dedicated copy engine moves data between host and device without occupying the SMs. It needs a stable host address, which is why the host buffer must be pinned.
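
The terms above can be tied together in a minimal sketch. It allocates a pinned host buffer with cudaMallocHost, splits the work into chunks, and issues each chunk's copy-in, kernel, and copy-out on its own stream with cudaMemcpyAsync. The CHECK macro, the scale2 kernel, the run_pinned_pipeline helper, and the choice of four streams are illustrative assumptions, not part of the lesson. Whether the copies actually overlap with compute also depends on the GPU's copy engines, so treat this as a pattern to profile rather than a guaranteed speedup.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);  \
            std::exit(EXIT_FAILURE);                                     \
        }                                                                \
    } while (0)

// Illustrative kernel: doubles each element in place.
__global__ void scale2(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: processes n floats in chunks, one stream per chunk.
// Returns true if every element came back doubled.
bool run_pinned_pipeline(int n) {
    constexpr int kStreams = 4;  // assumed stream count, tune per device
    const int chunk = (n + kStreams - 1) / kStreams;

    float *h_buf = nullptr;  // pinned (page-locked) host buffer
    float *d_buf = nullptr;  // device buffer
    CHECK(cudaMallocHost((void **)&h_buf, n * sizeof(float)));
    CHECK(cudaMalloc((void **)&d_buf, n * sizeof(float)));
    for (int i = 0; i < n; ++i) h_buf[i] = (float)i;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    for (int s = 0; s < kStreams; ++s) {
        int offset = s * chunk;
        if (offset >= n) break;
        int count = (n - offset < chunk) ? (n - offset) : chunk;
        size_t bytes = count * sizeof(float);

        // Operations within one stream run in order; different streams
        // are free to overlap because the host buffer is pinned.
        CHECK(cudaMemcpyAsync(d_buf + offset, h_buf + offset, bytes,
                              cudaMemcpyHostToDevice, streams[s]));
        scale2<<<(count + 255) / 256, 256, 0, streams[s]>>>(d_buf + offset, count);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_buf + offset, d_buf + offset, bytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }

    // Wait for all streams before reading the results on the host.
    CHECK(cudaDeviceSynchronize());

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (h_buf[i] != 2.0f * (float)i) { ok = false; break; }
    }

    for (int s = 0; s < kStreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_buf));  // pinned memory is freed with cudaFreeHost
    return ok;
}

int main() {
    bool ok = run_pinned_pipeline(1 << 20);
    std::printf("%s\n", ok ? "results match" : "mismatch");
    return ok ? 0 : 1;
}
```

Replacing cudaMallocHost and cudaFreeHost with malloc and free should still produce correct results, but the copies would then go through the pageable path described above. Comparing the two variants in a profiler such as Nsight Systems is one way to see the difference.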