Device-side static allocation
Let us get straight to the point and go through the code, making some remarks along the way. First, device memory for the variable dev_static is allocated statically: no cudaMemcpy was used to transfer the data, and no cudaMalloc was used to allocate space on the device. This of course assumes that we know beforehand the dimension of the array we need to allocate.
__device__ float dev_static[NS] = {10.0, -20.0, 235.0};
Now let's look at the kernel:
__global__ void kernel(float *dx)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    dx[tid] += dev_static[tid];
}
The __device__ variable dev_static is accessible from inside the kernel (but not from the host). It wouldn't be possible to pass the address of dev_static to the kernel directly from host code (although there are ways to obtain that address if necessary, using cudaGetSymbolAddress).
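As a minimal sketch of that approach (the pointer name p_static is introduced here purely for illustration and is not part of the original program):

float *p_static = NULL;
// Ask the CUDA runtime for the device address behind the dev_static symbol.
cudaGetSymbolAddress((void **)&p_static, dev_static);
// p_static now holds a device pointer that can be passed to a kernel
// like any other float *, but it must not be dereferenced on the host.

This is mostly useful when an existing kernel or library function expects an ordinary pointer argument rather than the symbol itself.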
Then, we have the kernel invocation:
kernel<<<1, NS>>>(dev_p);
where we pass the address of a variable that has been allocated in linear device memory using cudaMalloc and that has been initialized using cudaMemcpy (the very standard way). As you may have guessed, the program prints:
host_p[0] = 11
host_p[1] = -18
host_p[2] = 238
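For completeness, a minimal self-contained program along these lines might look as follows; it is a sketch rather than the original listing, and the host input values {1.0f, 2.0f, 3.0f} are an assumption chosen so that the arithmetic matches the output above.

#include <cstdio>
#include <cuda_runtime.h>

#define NS 3

// Statically allocated device array, initialized when the module is loaded.
__device__ float dev_static[NS] = {10.0f, -20.0f, 235.0f};

__global__ void kernel(float *dx)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    dx[tid] += dev_static[tid];
}

int main()
{
    float host_p[NS] = {1.0f, 2.0f, 3.0f};   // assumed input values
    float *dev_p = NULL;

    // Standard linear-memory allocation and host-to-device copy for dev_p.
    cudaMalloc((void **)&dev_p, NS * sizeof(float));
    cudaMemcpy(dev_p, host_p, NS * sizeof(float), cudaMemcpyHostToDevice);

    kernel<<<1, NS>>>(dev_p);

    // Copy the result back and print it.
    cudaMemcpy(host_p, dev_p, NS * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < NS; ++i)
        printf("host_p[%d] = %g\n", i, host_p[i]);

    cudaFree(dev_p);
    return 0;
}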