Lesson 33: Async Transfers & Pinned Memory
For a copy to overlap compute, it is not enough to use cudaMemcpyAsync and several streams — the host-side memory must also be pinned. Ordinary memory allocated with malloc is pageable: the operating system may move it in physical memory, so the GPU cannot copy from it directly via DMA. The driver m
Pageable memory is a library book that can be moved from shelf to shelf at any moment — a courier cannot grab it directly. Pinned memory is a book placed on a fixed stand by the door, and the courier snatches it without waiting.
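The overlap described above can be sketched as a small pipeline. This is a minimal illustration, not a tuned implementation: the kernel `scale2`, the helper `pipelined_double`, the block size of 256, and the assumption that `n` divides evenly by the stream count are all hypothetical choices made for the example. The host buffer comes from cudaMallocHost, so each cudaMemcpyAsync is eligible to overlap with kernels running in other streams.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles every element in place.
__global__ void scale2(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: splits the buffer into one chunk per stream and
// returns true if every element came back doubled.
bool pipelined_double(int n, int numStreams) {
    const int chunk = n / numStreams;                 // assumes n % numStreams == 0
    const size_t chunkBytes = chunk * sizeof(float);

    float* h = nullptr;                               // pinned host buffer
    float* d = nullptr;                               // device buffer
    if (cudaMallocHost((void**)&h, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc((void**)&d, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    std::vector<cudaStream_t> streams(numStreams);
    for (auto& s : streams) cudaStreamCreate(&s);

    // Each stream copies its chunk in, runs the kernel, and copies it back.
    // Work queued in different streams may overlap on the device.
    for (int k = 0; k < numStreams; ++k) {
        float* hp = h + k * chunk;
        float* dp = d + k * chunk;
        cudaMemcpyAsync(dp, hp, chunkBytes, cudaMemcpyHostToDevice, streams[k]);
        scale2<<<(chunk + 255) / 256, 256, 0, streams[k]>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, chunkBytes, cudaMemcpyDeviceToHost, streams[k]);
    }
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    // Verify the round trip on the host.
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 2.0f);

    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}
```

Calling `pipelined_double(1 << 20, 4)` from `main` would process about a million floats in four chunks. Swapping cudaMallocHost for malloc should still produce correct results, but the copies would likely stop overlapping with the kernels.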
- pinned memory (page-locked)
- Host memory locked at a fixed physical location, so the OS cannot move it or page it out. The DMA engine reads from it directly, which is what makes a truly asynchronous transfer possible.
- pageable memory
- Ordinary host memory (malloc) that the OS may move or page out. A copy from it is staged through a temporary pinned buffer, so even cudaMemcpyAsync can block the host instead of overlapping with compute.
- cudaMallocHost
- Allocates pinned host memory. Free it with cudaFreeHost. An alternative is cudaHostAlloc.
- DMA
- Direct memory access. The GPU's dedicated copy engine uses it to move data between host and device without occupying the SMs. It needs a stable host address, which is why the host memory must be pinned.
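The terms above can be checked against the runtime with a short sketch. The helpers `is_pinned` and `check_allocators` are hypothetical names introduced here. The code assumes CUDA 11 or newer, where cudaPointerGetAttributes reports a plain malloc pointer as cudaMemoryTypeUnregistered instead of returning an error. It allocates one pageable buffer and two pinned ones (via cudaMallocHost and cudaHostAlloc), asks the runtime how it classifies each, and frees both pinned buffers with cudaFreeHost.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: true if the runtime classifies ptr as pinned host memory.
bool is_pinned(const void* ptr) {
    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) {
        cudaGetLastError();   // clear the error some older runtimes raise for malloc pointers
        return false;
    }
    return attr.type == cudaMemoryTypeHost;
}

// Hypothetical helper: true if malloc memory is reported as not pinned and
// both CUDA host allocators are reported as pinned.
bool check_allocators(size_t bytes) {
    void* pageable = std::malloc(bytes);
    void* pinnedA = nullptr;
    void* pinnedB = nullptr;
    if (pageable == nullptr) return false;
    if (cudaMallocHost(&pinnedA, bytes) != cudaSuccess) {
        std::free(pageable);
        return false;
    }
    if (cudaHostAlloc(&pinnedB, bytes, cudaHostAllocDefault) != cudaSuccess) {
        cudaFreeHost(pinnedA);
        std::free(pageable);
        return false;
    }

    bool ok = !is_pinned(pageable) && is_pinned(pinnedA) && is_pinned(pinnedB);

    // Pinned memory from either allocator is released with cudaFreeHost.
    cudaFreeHost(pinnedB);
    cudaFreeHost(pinnedA);
    std::free(pageable);
    return ok;
}
```

Pinned allocations reduce the memory the OS can page, so it is usually sensible to pin only the buffers that actually take part in transfers.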