Saturday, 11 October 2014

Memories from CUDA - Pinned memory (III)

The main motivation for using pinned memory is to perform asynchronous transfers of data from the host to the device, which is accomplished using cudaMemcpyAsync and related functions. Pinned (or page-locked) memory also speeds up ordinary host-to-device transfers, and in certain cases a further benefit can be obtained by allocating it as write-combined memory. In this post we give a few examples of how to allocate pinned memory and we investigate its features.
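As a taste of what pinned memory enables, here is a minimal sketch of such an asynchronous transfer; the buffer and stream names are mine, and the cudaHostAllocWriteCombined flag mentioned in a comment is the write-combined option referred to above:

float *host_buf, *dev_buf;
size_t bytes = 1024 * sizeof(float);
cudaStream_t stream;
checkCudaErrors(cudaStreamCreate(&stream));
/* Pinned allocation; cudaHostAllocWriteCombined could be used instead
 * to speed up host-to-device transfers, at the cost of very slow
 * host-side reads from the buffer. */
checkCudaErrors(cudaHostAlloc((void**)&host_buf, bytes, cudaHostAllocDefault));
checkCudaErrors(cudaMalloc((void**)&dev_buf, bytes));
/* The copy is queued on `stream` and the call returns immediately,
 * so the host can do other work in the meantime. */
checkCudaErrors(cudaMemcpyAsync(dev_buf, host_buf, bytes,
                                cudaMemcpyHostToDevice, stream));
/* ... other host work here ... */
checkCudaErrors(cudaStreamSynchronize(stream));
checkCudaErrors(cudaStreamDestroy(stream));
checkCudaErrors(cudaFree(dev_buf));
checkCudaErrors(cudaFreeHost(host_buf));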

About pinned memory

How pinned memory works is perhaps best described in this post on http://devblogs.nvidia.com, which I quote here:

Host (CPU) data allocations are pageable by default. The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a temporary page-locked, or “pinned”, host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory, as illustrated below.

Here, we start with a simple example of allocating pinned memory using cudaHostAlloc (or cudaMallocHost), which will motivate the use of pinned memory. Let us first introduce cudaHostAlloc, the function used to allocate page-locked host memory, through the following simple example:

#include <stdio.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"

int * host_p;
int * dev_p;

int main(void) {
    int data_size = 4 * sizeof(int);
    checkCudaErrors(
        cudaHostAlloc((void**)&host_p, data_size, cudaHostAllocDefault));
    checkCudaErrors(
        cudaMalloc((void**)&dev_p, data_size));
    /* Transfer data host_p --> dev_p */
    checkCudaErrors(
        cudaMemcpy(dev_p, host_p, data_size, cudaMemcpyHostToDevice));
    checkCudaErrors(cudaFreeHost(host_p));
    checkCudaErrors(cudaFree(dev_p));
    return 0;
}
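As a side note, cudaMallocHost could have been used for the same allocation; as far as I can tell, it behaves like cudaHostAlloc with the cudaHostAllocDefault flag:

/* Equivalent page-locked allocation with cudaMallocHost */
checkCudaErrors(cudaMallocHost((void**)&host_p, data_size));
/* ... */
checkCudaErrors(cudaFreeHost(host_p)); /* freed the same way */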

The host-to-device transfer in the above code is expected to be faster than if malloc had been used for the host-side allocation, since the intermediate copy to a staging buffer is avoided. Pinned memory, however, cannot be used in every single case since "page-locked memory is a scarce resource", as NVIDIA puts it in the CUDA programming guide. The main take-home message here is that cudaHostAlloc (like cudaMallocHost) allocates page-locked host memory, while cudaMalloc allocates memory on the device. Note also that the host memory is freed with cudaFreeHost, while the device memory is freed using cudaFree.
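If you want to verify this on your own hardware, CUDA events can be used to time the transfer. The following sketch is my own addition, not part of the example above; rerunning it with a malloc-based host_p shows the difference:

/* Sketch: time a host-to-device copy with CUDA events */
cudaEvent_t start, stop;
float ms = 0.0f;
checkCudaErrors(cudaEventCreate(&start));
checkCudaErrors(cudaEventCreate(&stop));
checkCudaErrors(cudaEventRecord(start, 0));
checkCudaErrors(
    cudaMemcpy(dev_p, host_p, data_size, cudaMemcpyHostToDevice));
checkCudaErrors(cudaEventRecord(stop, 0));
checkCudaErrors(cudaEventSynchronize(stop));
checkCudaErrors(cudaEventElapsedTime(&ms, start, stop));
printf("Host-to-device transfer took %f ms\n", ms);
/* Re-run with host_p obtained from malloc to compare */
checkCudaErrors(cudaEventDestroy(start));
checkCudaErrors(cudaEventDestroy(stop));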



A first example

Putting everything together, we have the following very simple example: memory is allocated using cudaHostAlloc and cudaMalloc for the host and device variables respectively, a kernel is invoked and, finally, the allocated memory is freed.

#include <stdio.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"

/* A very simple kernel function */
__global__ void kernel(int *d_var) { d_var[threadIdx.x] += 10; }

int * host_p;
int * host_result;
int * dev_p;

int main(void) {
    int ns = 4;
    int data_size = ns * sizeof(int);
    /* Allocate host_p as pinned memory */
    checkCudaErrors(
        cudaHostAlloc((void**)&host_p, data_size,
            cudaHostAllocDefault) );
    /* Allocate host_result as pinned memory */
    checkCudaErrors(
        cudaHostAlloc((void**)&host_result, data_size,
            cudaHostAllocDefault) );
    /* Allocate dev_p in the device global memory */
    checkCudaErrors(
        cudaMalloc((void**)&dev_p, data_size) );
    /* Initialise host_p */
    for (int i = 0; i < ns; i++) {
        host_p[i] = i + 1;
    }
    /* Transfer data to the device: host_p --> dev_p */
    checkCudaErrors(
        cudaMemcpy(dev_p, host_p, data_size, cudaMemcpyHostToDevice) );
    /* Now launch the kernel... */
    kernel<<<1, ns>>>(dev_p);
    getLastCudaError("Kernel error");
    /* Copy the result from the device back to the host */
    checkCudaErrors(
        cudaMemcpy(host_result, dev_p, data_size, cudaMemcpyDeviceToHost) );
    /* ... and print the result */
    for (int i = 0; i < ns; i++) {
        printf("result[%d] = %d\n", i, host_result[i]);
    }
    /* Now free the memory! */
    checkCudaErrors( cudaFree(dev_p) );
    checkCudaErrors( cudaFreeHost(host_p) );
    checkCudaErrors( cudaFreeHost(host_result) );
    return 0;
}

Forget about cudaMemcpy

There is another interesting feature of pinned memory: although it is allocated on the host, it can be made accessible from the device! The official documentation says that cudaHostAlloc "Allocates size bytes of host memory that is page-locked and accessible to the device". Let us give an example of how this is done by passing the device address of a variable that has been allocated with cudaHostAlloc directly to a kernel function. To do so, we need to explicitly request that our host allocation be mapped into the device address space using the flag cudaHostAllocMapped. Here is an example of use:

#include <stdio.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"

/* The same simple kernel as in the previous example */
__global__ void kernel(int *d_var) { d_var[threadIdx.x] += 10; }

int main(void) {
    int * host_p;    /*< Host data allocated as pinned, mapped memory */
    int * dev_ptr_p; /*< device address of host_p (the pointer variable
                          itself resides on the host) */
    int ns = 32;
    int data_size = ns * sizeof(int);
    checkCudaErrors(
        cudaHostAlloc((void**) &host_p, data_size, cudaHostAllocMapped));
    /* host_p = {1, 2, 3, ..., ns} */
    for (int i = 0; i < ns; i++)
        host_p[i] = i + 1;
    /*
     * We pass the device address of `host_p`, namely
     * `dev_ptr_p`, to the kernel. This address is
     * retrieved using cudaHostGetDevicePointer:
     */
    checkCudaErrors(cudaHostGetDevicePointer(&dev_ptr_p, host_p, 0));
    kernel<<<1, ns>>>(dev_ptr_p);
    /*
     * The following line is necessary for the host
     * to be able to "see" the changes that have been
     * made to `host_p` on the device.
     */
    checkCudaErrors(cudaDeviceSynchronize());
    for (int i = 0; i < ns; i++)
        printf("host_p[%d] = %d\n", i, host_p[i]);
    /* Free the page-locked memory */
    checkCudaErrors(cudaFreeHost(host_p));
    return 0;
}


Notice that there is no cudaMemcpy involved, i.e., there is no explicit data transfer from the host to the device. The variable host_p is allocated as page-locked, mapped memory on the host and data is then written directly into it (as if we had used malloc). The kernel is launched with the device address of this same variable, which is retrieved using cudaHostGetDevicePointer.

One little detail here is that cudaDeviceSynchronize must be called after the kernel launch to make sure that any changes made to the variable from the device are "synchronized" with the host. Finally, we print host_p with standard host-side code.



Does your device support it?

There's one more thing: does your device support host memory mapping? If so, make sure it is activated before you try out the code above. To do so, we first need to query the device (using cudaGetDeviceProperties) and then set the device flags to cudaDeviceMapHost. Here is how your main function should start:

cudaDeviceProp prop;
int whichDevice;
checkCudaErrors( cudaGetDevice(&whichDevice) );
checkCudaErrors( cudaGetDeviceProperties(&prop, whichDevice) );
if (prop.canMapHostMemory != 1) {
    fprintf(stderr, "Device cannot map memory!\n");
    return 1;
}
checkCudaErrors( cudaSetDeviceFlags(cudaDeviceMapHost) );
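Note that cudaSetDeviceFlags should be called before the device has been initialised by any other runtime call (a kernel launch or an allocation, for instance); otherwise it may fail with cudaErrorSetOnActiveProcess.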

