Saturday, 11 October 2014

Memories from CUDA - Pinned memory (III)

The main motivation for using pinned memory is to perform asynchronous transfers of data from the host to the device, which is accomplished using cudaMemcpyAsync and related functions. Pinned (or page-locked) memory also comes with certain performance benefits of its own, and in some cases further gains can be obtained by using write-combined memory. In this post we give a few examples of how to allocate pinned memory and we investigate its features.

About pinned memory

How pinned memory works is perhaps best described in this post on http://devblogs.nvidia.com, which I quote here:

Host (CPU) data allocations are pageable by default. The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a temporary page-locked, or “pinned”, host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory, as illustrated below.

We start with a simple example of pinned memory allocation using cudaMallocHost or cudaHostAlloc, which will motivate the use of pinned memory. Let us first introduce cudaMallocHost, the function used to allocate page-locked host memory, through the following simple example:
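A minimal sketch of such an example, assuming a float buffer of illustrative size (error checking omitted for brevity):

#include <cuda_runtime.h>

int main(void)
{
    const size_t N = 1 << 24;   /* buffer size chosen for illustration */
    float *host_p = NULL;
    float *dev_p  = NULL;

    /* Page-locked (pinned) allocation on the host... */
    cudaMallocHost((void**)&host_p, N * sizeof(float));
    /* ...and an ordinary allocation on the device. */
    cudaMalloc((void**)&dev_p, N * sizeof(float));

    for (size_t i = 0; i < N; i++) host_p[i] = 1.0f;

    /* Transfers from pinned host memory skip the driver's intermediate
       staging copy, so this is where the speed-up shows up. */
    cudaMemcpy(dev_p, host_p, N * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(dev_p);        /* device memory is freed with cudaFree */
    cudaFreeHost(host_p);   /* pinned host memory with cudaFreeHost */
    return 0;
}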


The above code is expected to run faster than the equivalent code in which malloc is used for the host-side allocation. Pinned memory, however, cannot be used in every single case, since "page-locked memory is a scarce resource", as NVIDIA puts it in the CUDA programming guide. The main take-home message here is that cudaMallocHost allocates page-locked host memory, while cudaMalloc allocates memory on the device. Note also that the host memory is freed with cudaFreeHost, while the device memory is freed using cudaFree.



A first example

Putting everything together, we have the following very simple example: memory is allocated using cudaHostAlloc and cudaMalloc for the host and device variables respectively, a kernel is invoked and, finally, the allocated memory is freed.
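A minimal sketch along these lines; the add_one kernel, the array size and the launch configuration are assumptions made for illustration:

#include <stdio.h>
#include <cuda_runtime.h>

/* Illustrative kernel: add 1 to every element. */
__global__ void add_one(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main(void)
{
    const int N = 1024;   /* size chosen for illustration */
    float *host_p = NULL, *dev_p = NULL;

    /* Pinned host allocation (cudaHostAllocDefault behaves like
       cudaMallocHost) and a device allocation. */
    cudaHostAlloc((void**)&host_p, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc((void**)&dev_p, N * sizeof(float));

    for (int i = 0; i < N; i++) host_p[i] = (float)i;

    cudaMemcpy(dev_p, host_p, N * sizeof(float), cudaMemcpyHostToDevice);
    add_one<<<(N + 255) / 256, 256>>>(dev_p, N);
    cudaMemcpy(host_p, dev_p, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("host_p[0] = %f\n", host_p[0]);   /* expect 1.0 */

    cudaFree(dev_p);
    cudaFreeHost(host_p);
    return 0;
}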


Forget about cudaMemcpy

There is another interesting feature of pinned memory: although it is allocated on the host, it is accessible from the device! The official documentation says that cudaHostAlloc "Allocates size bytes of host memory that is page-locked and accessible to the device". Let us give an example of how this is done by passing the device address of a variable that has been allocated using cudaHostAlloc directly to a kernel function. To do so, we need to explicitly request that our host allocation be mapped into the device address space, using the flag cudaHostAllocMapped. Here is an example of use:
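A sketch of how this might look, reusing the illustrative add_one kernel from above; the essential ingredients are the cudaHostAllocMapped flag and cudaHostGetDevicePointer:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void add_one(float *x, int n)   /* illustrative kernel */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main(void)
{
    const int N = 1024;   /* illustrative size */
    float *host_p = NULL, *dev_p = NULL;

    /* Page-locked host allocation, mapped into the device address space
       (host memory mapping must be enabled; see the last section). */
    cudaHostAlloc((void**)&host_p, N * sizeof(float), cudaHostAllocMapped);

    /* Load data directly into host_p, as if it had come from malloc. */
    for (int i = 0; i < N; i++) host_p[i] = (float)i;

    /* Retrieve the device-side address of the same allocation. */
    cudaHostGetDevicePointer((void**)&dev_p, (void*)host_p, 0);

    /* No cudaMemcpy: the kernel operates on the mapped memory directly. */
    add_one<<<(N + 255) / 256, 256>>>(dev_p, N);
    cudaDeviceSynchronize();   /* make the device's writes visible host-side */

    printf("host_p[0] = %f\n", host_p[0]);   /* expect 1.0 */

    cudaFreeHost(host_p);
    return 0;
}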



Notice that there is no cudaMemcpy involved, i.e., there is no explicit data transfer from the host to the device. The variable host_p is allocated as a page-locked variable on the host side and data is then loaded directly into it (as if we had used malloc). The kernel is launched with the device address of this same variable, which is retrieved using cudaHostGetDevicePointer.

One little detail here is that cudaDeviceSynchronize must be called after the kernel execution to make sure that any changes made to the variable from the device are "synchronized" with the host. Finally, we print host_p with standard host-side code.



Does your device support it?

There's one more thing: does your device support host memory mapping? If yes, then make sure it is activated before you try out the code above. To do so, we first need to query the device (using cudaGetDeviceProperties) and then set the device flags to cudaDeviceMapHost (via cudaSetDeviceFlags). Here is how your main function should start:
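A sketch of such a preamble, assuming we query device 0:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* device 0 assumed */

    if (!prop.canMapHostMemory) {
        fprintf(stderr, "Device does not support mapping host memory\n");
        return 1;
    }

    /* Enable mapped pinned allocations; this must happen before the
       CUDA context is created, i.e. before other runtime calls. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* ...pinned-memory code from the previous example goes here... */
    return 0;
}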


