Friday, 10 October 2014

Memories from CUDA - Constant memory (I)

This post is for all of us who feel we can't keep track of all the different types of CUDA memory. It is the first in a series of upcoming posts on memory management in CUDA. Here, we present constant memory and explain how it can be accessed from the device through a comprehensive step-by-step example.


Constant Memory

Let us examine a simple example in which we store some data in the constant memory space of the device and then use it in a read-only fashion inside a kernel. Let's walk through the code step by step. First, we include the necessary headers and define the size of our arrays:

 #include <stdio.h>  
 #include <cuda_runtime.h>  
 #include "helper_cuda.h"  
   
 #define S 4  

We then define some variables that we will use in what follows: the host arrays host_const and host_y, the device pointer dev_y (the counterpart of host_y on the device), and dev_const_a (the counterpart of host_const), which is declared with the __constant__ qualifier so that it resides in the device's constant memory.

 int host_const[S] = { 100, 200, 300, 400 };  
 int host_y[S] = { 1, 2, 3, 4 };  
 int *dev_y;  
   
 __constant__ int dev_const_a[4];  

We define the following kernel, which simply increments each element dev_var[i] of a given device vector by dev_const_a[i], where i is the index of the thread. The kernel doesn't do anything fancy, but it shows how constant memory is accessed from device code:

 __global__ void kernel(int *dev_var) {  
   dev_var[threadIdx.x] += dev_const_a[threadIdx.x];  
 }  

It should be noted that this is actually a poor use of constant memory: in this kernel each thread accesses a different constant-memory address. Reads from different addresses within a half-warp are serialized, leading to a considerable slow-down. Constant memory performs best when all threads of a half-warp read the same address, in which case the value is broadcast to them; for scattered accesses like the one above it is better to use some non-constant memory (e.g., global memory or mapped page-locked memory), but this is beyond the scope of this post.
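
For contrast, here is a minimal sketch of the access pattern that constant memory handles well: every thread of the half-warp reads the same constant addresses, so the values are broadcast instead of serialized. The names coeffs and poly_kernel below are purely illustrative and not part of the example above.

 __constant__ float coeffs[3];
 
 __global__ void poly_kernel(float *x, int n) {
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n) {
     /* All threads read coeffs[0..2] at the same addresses,
        so the constant cache broadcasts the values. */
     x[i] = coeffs[2] * x[i] * x[i] + coeffs[1] * x[i] + coeffs[0];
   }
 }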

That said, let's now move to the main function and see how we can use constant memory. The code follows the standard CUDA pattern: copy data from the host to device memory, process it on the device, and copy the results back to the host.

 int main(void) {  
   int data_size = S * sizeof(int);  
   int i;  

First, we need to allocate device memory for dev_y. Device memory for the constant variable dev_const_a has already been allocated statically through its __constant__ declaration.

   checkCudaErrors(cudaMalloc((void**) &dev_y, data_size));  

We now copy data from the host to the device to initialize the __constant__ variable and the device variable dev_y. According to the CUDA C Programming Guide, Section C.2.2.1.3, "Constants are immutable and may not be modified from the device," but we can initialize them from the host using cudaMemcpyToSymbol.

   checkCudaErrors(
     cudaMemcpyToSymbol(dev_const_a, host_const, data_size, 0,
                        cudaMemcpyHostToDevice));
   checkCudaErrors(cudaMemcpy(dev_y, host_y, data_size, cudaMemcpyHostToDevice));
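
As a side note, if you ever want to verify what actually landed in constant memory, the runtime also provides cudaMemcpyFromSymbol for the opposite direction. A minimal sketch, where the check array is just for illustration:

   /* Read the constant symbol back into a host array (illustrative only). */
   int check[S];
   checkCudaErrors(
     cudaMemcpyFromSymbol(check, dev_const_a, data_size, 0,
                          cudaMemcpyDeviceToHost));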

Now we launch the kernel defined previously, passing it the device pointer dev_y. On the device, every element of dev_y (viewed as an array) is increased by the corresponding element of dev_const_a, and the result is stored back in dev_y itself. We launch the kernel with S parallel threads in a single block.

   kernel<<<1,S>>>(dev_y);  

The next little detail can prove tremendously useful when experimenting with CUDA: after you have launched your kernel, check for errors. The function getLastCudaError is defined in helper_cuda.h, but you can write your own little function as well.

   getLastCudaError("Kernel error");   
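
If you prefer not to depend on helper_cuda.h, a hand-rolled check along the following lines does the same job (check_kernel_error is just an illustrative name):

 /* Minimal sketch of a do-it-yourself launch check; requires <stdlib.h> for exit(). */
 static void check_kernel_error(const char *msg) {
   cudaError_t err = cudaGetLastError();
   if (err != cudaSuccess) {
     fprintf(stderr, "%s: %s\n", msg, cudaGetErrorString(err));
     exit(EXIT_FAILURE);
   }
 }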

Next, we copy the data from the device (stored in variable dev_y) to the host (back to host_y) and print the result. 

   checkCudaErrors(cudaMemcpy(host_y, dev_y, data_size, cudaMemcpyDeviceToHost));
   
   for (i = 0; i < S; i++) {  
     printf("%d\n", host_y[i]);  
   }  

Oops! I almost forgot... we need to free the memory we allocated for dev_y:

   if (dev_y)
     checkCudaErrors(cudaFree(dev_y));
   return 0;
 }  

As expected, the result is:

101
202
303
404

The overall source code, assembled from the snippets above for the sake of completeness, reads as follows:
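
 #include <stdio.h>
 #include <cuda_runtime.h>
 #include "helper_cuda.h"
 
 #define S 4
 
 int host_const[S] = { 100, 200, 300, 400 };
 int host_y[S] = { 1, 2, 3, 4 };
 int *dev_y;
 
 __constant__ int dev_const_a[4];
 
 /* Add the constant vector to the given device vector, one element per thread. */
 __global__ void kernel(int *dev_var) {
   dev_var[threadIdx.x] += dev_const_a[threadIdx.x];
 }
 
 int main(void) {
   int data_size = S * sizeof(int);
   int i;
 
   checkCudaErrors(cudaMalloc((void**) &dev_y, data_size));
 
   checkCudaErrors(
     cudaMemcpyToSymbol(dev_const_a, host_const, data_size, 0,
                        cudaMemcpyHostToDevice));
   checkCudaErrors(cudaMemcpy(dev_y, host_y, data_size, cudaMemcpyHostToDevice));
 
   kernel<<<1,S>>>(dev_y);
   getLastCudaError("Kernel error");
 
   checkCudaErrors(cudaMemcpy(host_y, dev_y, data_size, cudaMemcpyDeviceToHost));
 
   for (i = 0; i < S; i++) {
     printf("%d\n", host_y[i]);
   }
 
   if (dev_y)
     checkCudaErrors(cudaFree(dev_y));
   return 0;
 }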

Read more about constant memory on this very interesting, detailed and informative blog post.
