About pinned memory
How pinned memory works is perhaps best described in this post on http://devblogs.nvidia.com, which I quote here:
"Host (CPU) data allocations are pageable by default. The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a temporary page-locked, or “pinned”, host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory."
Here, we start with a simple example of pinned-memory allocation using cudaMallocHost or cudaHostAlloc, which will motivate the use of pinned memory. Let us first introduce cudaMallocHost, the function used to allocate page-locked host memory, through the following simple example:
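A minimal sketch of what such an example might look like follows (the buffer size N and the names host_p and dev_p are illustrative choices):

#include <cuda_runtime.h>

int main(void)
{
    const int N = 1 << 20;   /* illustrative buffer size: 1M floats */
    float *host_p = NULL;    /* page-locked (pinned) host buffer    */
    float *dev_p = NULL;     /* device buffer                       */

    /* Allocate page-locked host memory. */
    cudaMallocHost((void **)&host_p, N * sizeof(float));

    /* Allocate device memory. */
    cudaMalloc((void **)&dev_p, N * sizeof(float));

    for (int i = 0; i < N; i++) host_p[i] = (float)i;

    /* Host-to-device transfer straight from the pinned buffer. */
    cudaMemcpy(dev_p, host_p, N * sizeof(float), cudaMemcpyHostToDevice);

    /* Pinned host memory is freed with cudaFreeHost, device memory with cudaFree. */
    cudaFreeHost(host_p);
    cudaFree(dev_p);
    return 0;
}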
The transfer in the above code is expected to be fast compared to the case where malloc is used for the host-side allocation. Pinned memory, however, cannot be used indiscriminately, since "page-locked memory is a scarce resource", as NVIDIA puts it in the CUDA programming guide. The main take-home message here is that cudaMallocHost allocates page-locked host memory, while cudaMalloc allocates memory on the device. Note also that the pinned host memory is freed with cudaFreeHost, while the device memory is freed using cudaFree.
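To make that comparison concrete, one way to time the two transfers is with CUDA events. The sketch below (the helper time_h2d and the sizes are illustrative) copies the same amount of data once from a malloc'd buffer and once from a pinned one:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Time a host-to-device copy of `bytes` bytes from `src` to `dst`. */
static float time_h2d(float *dst, const float *src, size_t bytes)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void)
{
    const size_t N = 1 << 24;
    const size_t bytes = N * sizeof(float);
    float *pageable = (float *)malloc(bytes);  /* pageable host memory */
    float *pinned = NULL;                      /* pinned host memory   */
    float *dev = NULL;

    cudaMallocHost((void **)&pinned, bytes);
    cudaMalloc((void **)&dev, bytes);

    printf("pageable: %.3f ms\n", time_h2d(dev, pageable, bytes));
    printf("pinned:   %.3f ms\n", time_h2d(dev, pinned, bytes));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}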
A first example
Putting everything together, we have the following very simple example, which involves memory allocation using cudaHostAlloc and cudaMalloc for the host and device variables respectively, a kernel invocation and, finally, freeing of the allocated memory.
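A minimal sketch of such a program might look as follows (the trivial add_one kernel and the launch configuration are illustrative):

#include <stdio.h>
#include <cuda_runtime.h>

/* Trivial kernel: add 1 to every element. */
__global__ void add_one(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main(void)
{
    const int N = 1024;
    float *host_p = NULL, *dev_p = NULL;

    /* Page-locked host allocation (cudaHostAllocDefault behaves like cudaMallocHost). */
    cudaHostAlloc((void **)&host_p, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void **)&dev_p, N * sizeof(float));

    for (int i = 0; i < N; i++) host_p[i] = (float)i;

    cudaMemcpy(dev_p, host_p, N * sizeof(float), cudaMemcpyHostToDevice);
    add_one<<<(N + 255) / 256, 256>>>(dev_p, N);
    cudaMemcpy(host_p, dev_p, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("host_p[0] = %f\n", host_p[0]);  /* expected: 1.0 */

    cudaFreeHost(host_p);
    cudaFree(dev_p);
    return 0;
}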
Forget about cudaMemcpy
There is another interesting feature of pinned memory: although it is allocated on the host, it is accessible from the device! The official documentation says that cudaHostAlloc "Allocates size bytes of host memory that is page-locked and accessible to the device". Let us give an example of how this is done by passing the device address of a variable allocated with cudaHostAlloc directly to a kernel function. To do so, we need to state explicitly that our host allocation should be mapped, using the flag cudaHostAllocMapped. In the example below, notice that there is no cudaMemcpy involved, i.e., there is no explicit data transfer from the host to the device. Variable host_p is allocated as a page-locked variable host-side and data is then written to it directly (as if we had used malloc). The kernel function is launched with the device address of this same variable, which is retrieved using cudaHostGetDevicePointer.
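A sketch of such an example follows (the add_one kernel is again illustrative; the cudaSetDeviceFlags(cudaDeviceMapHost) call is included because some older devices require it before mapped allocations are made):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void add_one(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main(void)
{
    const int N = 1024;
    float *host_p = NULL;  /* mapped, page-locked host buffer   */
    float *dev_p = NULL;   /* device address of the same buffer */

    /* Enable mapping of host memory (must precede other CUDA calls). */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* Page-locked allocation that is mapped into the device address space. */
    cudaHostAlloc((void **)&host_p, N * sizeof(float), cudaHostAllocMapped);

    /* Fill the buffer directly, as if it had come from malloc. */
    for (int i = 0; i < N; i++) host_p[i] = (float)i;

    /* Retrieve the device-side address of the host allocation. */
    cudaHostGetDevicePointer((void **)&dev_p, host_p, 0);

    /* No cudaMemcpy: the kernel reads and writes host memory through dev_p. */
    add_one<<<(N + 255) / 256, 256>>>(dev_p, N);

    /* Make sure the kernel's writes are visible on the host. */
    cudaDeviceSynchronize();

    printf("host_p[0] = %f\n", host_p[0]);  /* expected: 1.0 */

    cudaFreeHost(host_p);
    return 0;
}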
One little detail here is that cudaDeviceSynchronize must be called after the kernel launch to make sure that any changes made to the variable from the device are visible to the host. Finally, we print host_p with standard host-side code.