
About pinned memory
How pinned memory works is perhaps best described in this post on http://devblogs.nvidia.com, which I quote here: "Host (CPU) data allocations are pageable by default. The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a temporary page-locked, or “pinned”, host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory." In other words, every transfer from pageable memory involves an extra host-side copy into a pinned staging buffer; allocating the host buffer as pinned memory in the first place avoids that copy.
We start with a simple example that allocates pinned memory using cudaMallocHost or cudaHostAlloc, which will motivate the use of pinned memory. Let us first introduce cudaHostAlloc, one of the functions used to allocate page-locked host memory, through the following simple example:
#include <cuda_runtime.h>
#include "helper_cuda.h"

int * host_p;
int * dev_p;

int main(void) {
    int data_size = 4 * sizeof(int);
    /* Allocate page-locked (pinned) host memory */
    checkCudaErrors(
        cudaHostAlloc((void**)&host_p, data_size, cudaHostAllocDefault));
    /* Allocate device memory */
    checkCudaErrors(
        cudaMalloc((void**)&dev_p, data_size));
    /* Transfer data host_p --> dev_p */
    checkCudaErrors(
        cudaMemcpy(dev_p, host_p, data_size, cudaMemcpyHostToDevice));
    checkCudaErrors(cudaFreeHost(host_p));
    checkCudaErrors(cudaFree(dev_p));
    return 0;
}
The transfer above is expected to be faster than when malloc is used for the host-side allocation, since the staging copy into a temporary pinned buffer is avoided. Pinned memory, however, cannot be used in every single case, since "page-locked memory is a scarce resource", as NVIDIA puts it in the CUDA programming guide; over-allocating it can degrade overall system performance. The main take-home message here is that cudaHostAlloc (like cudaMallocHost) allocates page-locked host memory, while cudaMalloc allocates memory on the device. Note also that host memory is freed with cudaFreeHost, while device memory is freed using cudaFree.
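To quantify the difference on your own machine, the same host-to-device copy can be timed from a pageable and from a pinned buffer using CUDA events. The following is a minimal sketch of mine (not from the quoted post); the 64 MiB buffer size is an arbitrary choice and a CUDA-capable device is required:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Time a host-to-device copy of `bytes` bytes from `src` using CUDA events */
static float time_h2d(void *dev, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dev, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void) {
    size_t bytes = 64 * 1024 * 1024;   /* 64 MiB, an arbitrary size */
    void *dev, *pageable, *pinned;
    cudaMalloc(&dev, bytes);
    pageable = malloc(bytes);
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);
    printf("pageable: %.2f ms\n", time_h2d(dev, pageable, bytes));
    printf("pinned  : %.2f ms\n", time_h2d(dev, pinned, bytes));
    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}
```

The pinned transfer should report a noticeably lower time; the exact ratio depends on your chipset and PCIe generation.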
A first example
Putting everything together, we have the following very simple example: memory is allocated with cudaHostAlloc and cudaMalloc for the host and device variables respectively, a kernel is invoked and, finally, the allocated memory is freed.
#include <stdio.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"

/* A very simple kernel function */
__global__ void kernel(int *d_var) { d_var[threadIdx.x] += 10; }

int * host_p;
int * host_result;
int * dev_p;

int main(void) {
    int ns = 4;
    int data_size = ns * sizeof(int);
    /* Allocate host_p as pinned memory */
    checkCudaErrors(
        cudaHostAlloc((void**)&host_p, data_size, cudaHostAllocDefault));
    /* Allocate host_result as pinned memory */
    checkCudaErrors(
        cudaHostAlloc((void**)&host_result, data_size, cudaHostAllocDefault));
    /* Allocate dev_p in the device global memory */
    checkCudaErrors(cudaMalloc((void**)&dev_p, data_size));
    /* Initialise host_p */
    for (int i = 0; i < ns; i++) {
        host_p[i] = i + 1;
    }
    /* Transfer data to the device: host_p --> dev_p */
    checkCudaErrors(
        cudaMemcpy(dev_p, host_p, data_size, cudaMemcpyHostToDevice));
    /* Now launch the kernel... */
    kernel<<<1, ns>>>(dev_p);
    getLastCudaError("Kernel error");
    /* Copy the result from the device back to the host */
    checkCudaErrors(
        cudaMemcpy(host_result, dev_p, data_size, cudaMemcpyDeviceToHost));
    /* ... and print the result */
    for (int i = 0; i < ns; i++) {
        printf("result[%d] = %d\n", i, host_result[i]);
    }
    /* Now free the memory! */
    checkCudaErrors(cudaFree(dev_p));
    checkCudaErrors(cudaFreeHost(host_p));
    checkCudaErrors(cudaFreeHost(host_result));
    return 0;
}
Forget about cudaMemcpy
There is another interesting feature of pinned memory: although it is allocated on the host, it is accessible from the device! The official documentation says that cudaHostAlloc "allocates size bytes of host memory that is page-locked and accessible to the device". Let us give an example of how this is done by passing the device address of a variable that has been allocated using cudaHostAlloc directly to a kernel function. To do so, we need to state explicitly that our host allocation should be mapped into the device address space, using the flag cudaHostAllocMapped. Here is an example of use:
#include <stdio.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"

/* The same kernel as in the previous example */
__global__ void kernel(int *d_var) { d_var[threadIdx.x] += 10; }

int main(void) {
    int * host_p;    /*< host data allocated as pinned, mapped memory */
    int * dev_ptr_p; /*< device address of host_p (the pointer itself lives on the host) */
    int ns = 32;
    int data_size = ns * sizeof(int);
    checkCudaErrors(
        cudaHostAlloc((void**)&host_p, data_size, cudaHostAllocMapped));
    /* host_p = {1, 2, 3, ..., ns} */
    for (int i = 0; i < ns; i++)
        host_p[i] = i + 1;
    /*
     * We can pass the device address of `host_p`,
     * namely `dev_ptr_p`, to the kernel. This address
     * is retrieved using cudaHostGetDevicePointer:
     */
    checkCudaErrors(cudaHostGetDevicePointer(&dev_ptr_p, host_p, 0));
    kernel<<<1, ns>>>(dev_ptr_p);
    /*
     * The following line is necessary for the host
     * to be able to "see" the changes the kernel
     * has made to `host_p`
     */
    checkCudaErrors(cudaDeviceSynchronize());
    for (int i = 0; i < ns; i++)
        printf("host_p[%d] = %d\n", i, host_p[i]);
    /* Free the page-locked memory */
    checkCudaErrors(cudaFreeHost(host_p));
    return 0;
}
Notice that there is no cudaMemcpy involved, i.e., there is no explicit data transfer from the host to the device. The variable host_p is allocated as page-locked, mapped memory on the host, and data is then written directly into it (as if it had been allocated with malloc). The kernel is launched with the device address of this same variable, which is retrieved using cudaHostGetDevicePointer.
One little detail here is that cudaDeviceSynchronize must be called after the kernel launch to make sure that any changes the device has made to the variable are visible to the host: kernel launches are asynchronous, so without it the host could read host_p before the kernel has finished. Finally, we print host_p with standard host-side code.
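As a side note: on 64-bit platforms with unified virtual addressing (reported by the unifiedAddressing device property), the device address returned by cudaHostGetDevicePointer coincides with the host pointer itself. Here is a small sketch of mine (not from the original post) to check this on your own device:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    int dev;
    cudaGetDevice(&dev);
    cudaGetDeviceProperties(&prop, dev);
    printf("unifiedAddressing = %d\n", prop.unifiedAddressing);

    int *host_p, *dev_ptr_p;
    cudaHostAlloc((void**)&host_p, 4 * sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&dev_ptr_p, host_p, 0);
    /* Under unified virtual addressing the two addresses coincide */
    printf("host_p = %p, dev_ptr_p = %p\n", (void*)host_p, (void*)dev_ptr_p);
    cudaFreeHost(host_p);
    return 0;
}
```

When the two addresses coincide you may pass host_p to the kernel directly, but retrieving the device pointer explicitly, as above, remains the portable way.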
Does your device support it?
There is one more thing: does your device support host memory mapping? If it does, make sure the feature is activated before you try out the code above. To do so, we first query the device (using cudaGetDeviceProperties) and then set the device flags to cudaDeviceMapHost. Note that cudaSetDeviceFlags must be called before any CUDA call that initialises the context (such as an allocation), otherwise it fails. Here is how your main function should start:
cudaDeviceProp prop;
int whichDevice;
checkCudaErrors(cudaGetDevice(&whichDevice));
checkCudaErrors(cudaGetDeviceProperties(&prop, whichDevice));
if (prop.canMapHostMemory != 1) {
    fprintf(stderr, "Device cannot map memory!\n");
    return 1;
}
checkCudaErrors(cudaSetDeviceFlags(cudaDeviceMapHost));