Lesson 7: Host/Device Memory: cudaMalloc, cudaMemcpy

The CPU (host) and the GPU (device) have separate memories. A pointer returned by malloc refers to host memory, and a pointer set by cudaMalloc refers to device memory; host code must not dereference a device pointer. So the full workflow of a GPU computation is always the same five steps: (1) c

The host and the device are like two offices in different cities. You cannot read a document in the other office through the window: you have to mail it over (cudaMemcpy), let them work on it, and mail the result back (cudaMemcpy again, in the opposite direction). cudaMalloc is renting an empty desk in the other office, and cudaFree is clearing it out at the end.
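In code, the trip between the two offices might look like the sketch below. It is only an illustration under stated assumptions: the kernel scaleKernel, the helper scaleOnDevice, and the block size of 256 threads are hypothetical names and choices made for this example, and error checking is left out so the sequence of calls stays visible.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: multiplies each element by `factor` in place.
__global__ void scaleKernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Hypothetical helper: returns a scaled copy of `input`, computed on the GPU.
// Error checking is omitted here to keep the workflow readable.
std::vector<float> scaleOnDevice(const std::vector<float>& input, float factor) {
    std::vector<float> output(input.size());
    const int n = static_cast<int>(input.size());
    if (n == 0) return output;
    const size_t bytes = input.size() * sizeof(float);

    // Rent the desk: allocate device memory.
    float* d_data = nullptr;
    cudaMalloc(&d_data, bytes);

    // Mail the document over: host -> device.
    cudaMemcpy(d_data, input.data(), bytes, cudaMemcpyHostToDevice);

    // Let the other office work on it: launch the kernel.
    const int threads = 256;  // assumed block size for this sketch
    const int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads>>>(d_data, n, factor);

    // Mail the result back: device -> host.
    // This copy waits for the kernel above, since both use the default stream.
    cudaMemcpy(output.data(), d_data, bytes, cudaMemcpyDeviceToHost);

    // Clear the desk: release the device memory.
    cudaFree(d_data);
    return output;
}
```

On a machine with a working CUDA device, calling scaleOnDevice({1.0f, 2.0f, 3.0f}, 2.0f) should give back {2.0f, 4.0f, 6.0f}. Note that the host only ever touches input and output; d_data is passed to CUDA calls and to the kernel but is never dereferenced on the host.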

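cudaMalloc: Allocates memory on the device and writes a device pointer into the pointer variable you pass it (the return value is an error code). Like malloc, but the memory lives on the GPU.
cudaMemcpy: Copies a given number of bytes between host and device. The direction is set by the last argument, for example cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost.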
H2D and D2H: Host-to-Device uploads input to the GPU; Device-to-Host brings results back to the CPU. Two opposite directions.
cudaFree: Frees device memory allocated by cudaMalloc. The counterpart of free for host memory.
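The four terms above can be seen together in a small round trip that uploads a buffer and downloads it again without running any kernel. This is a sketch rather than a reference implementation: the helpers ok and roundTrip are hypothetical names invented for this example. It shows that each call returns a cudaError_t, and that sizes are given in bytes rather than in elements.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical helper: prints the CUDA error message and reports failure.
static bool ok(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical helper: uploads `src` to the device, then downloads it into `dst`.
// Returns true only if every CUDA call succeeded.
bool roundTrip(const std::vector<int>& src, std::vector<int>& dst) {
    dst.assign(src.size(), 0);
    if (src.empty()) return true;
    const size_t bytes = src.size() * sizeof(int);  // size in bytes, not elements

    // cudaMalloc reports status through its return value and
    // hands back the device pointer through its first argument.
    int* d_buf = nullptr;
    if (!ok(cudaMalloc(&d_buf, bytes), "cudaMalloc")) return false;

    // H2D, then D2H: argument order is (destination, source, bytes, direction).
    bool success =
        ok(cudaMemcpy(d_buf, src.data(), bytes, cudaMemcpyHostToDevice), "H2D copy") &&
        ok(cudaMemcpy(dst.data(), d_buf, bytes, cudaMemcpyDeviceToHost), "D2H copy");

    // Free the device buffer whether or not the copies succeeded.
    success = ok(cudaFree(d_buf), "cudaFree") && success;
    return success;
}
```

If everything works, dst ends up equal to src. A common mistake this layout tries to make visible is swapping the first two arguments of cudaMemcpy: the destination always comes first, and the direction flag has to agree with where each pointer actually lives.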