Say you have a CUDA kernel that operates on several arrays d_A1, d_A2, ..., d_An that are not stored consecutively in device memory, i.e., there is no single array d_A that holds the elements of d_A1, d_A2, ..., d_An in some particular order. To pass all these arrays to the kernel, you can use pointers to pointers. In this post we'll see how to use pointers to pointers in CUDA C.
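As a minimal sketch of the idea (not necessarily the code from the full post), the separately allocated arrays can be reached through a table of device pointers that is itself copied to the device. The names scale_all, h_ptrs and d_arrays, and the sizes, are made up for illustration; error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N_ARRAYS 3   /* illustrative: number of separate arrays */
#define LEN 4        /* illustrative: length of each array */

/* Each block handles one array; each thread doubles one element. */
__global__ void scale_all(float **d_arrays, int len) {
    float *a = d_arrays[blockIdx.x];   /* pick this block's array */
    int i = threadIdx.x;
    if (i < len) a[i] *= 2.0f;
}

int main(void) {
    float *h_ptrs[N_ARRAYS];           /* host array holding DEVICE pointers */
    float h_data[LEN] = {1.0f, 2.0f, 3.0f, 4.0f};

    /* Allocate every d_Ak separately, so they need not be contiguous. */
    for (int k = 0; k < N_ARRAYS; ++k) {
        cudaMalloc((void **)&h_ptrs[k], LEN * sizeof(float));
        cudaMemcpy(h_ptrs[k], h_data, LEN * sizeof(float),
                   cudaMemcpyHostToDevice);
    }

    /* Copy the table of device pointers itself to the device. */
    float **d_arrays;
    cudaMalloc((void **)&d_arrays, N_ARRAYS * sizeof(float *));
    cudaMemcpy(d_arrays, h_ptrs, N_ARRAYS * sizeof(float *),
               cudaMemcpyHostToDevice);

    scale_all<<<N_ARRAYS, LEN>>>(d_arrays, LEN);
    cudaDeviceSynchronize();

    /* Read back the first array to inspect the result. */
    cudaMemcpy(h_data, h_ptrs[0], LEN * sizeof(float),
               cudaMemcpyDeviceToHost);
    for (int i = 0; i < LEN; ++i) printf("%g ", h_data[i]);
    printf("\n");

    for (int k = 0; k < N_ARRAYS; ++k) cudaFree(h_ptrs[k]);
    cudaFree(d_arrays);
    return 0;
}
```

The key point is that h_ptrs lives on the host but stores device addresses, so it must never be dereferenced on the host; only its copy d_arrays is indexed inside the kernel.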
Tuesday, 21 October 2014
Monday, 20 October 2014
High-performance matrix-vector multiplication in CUDA C
Here I present a custom kernel for matrix-vector multiplication written in CUDA C, together with some benchmarking results on a Tegra K1 (on a Jetson TK1 development board) and a comparison to cuBLAS's cublasSgemv function. This is an open-source project hosted on GitHub. This post comes, as I promised, as a sequel to an older post about matrix-vector multiplication in CUDA using shared memory. Further optimisation of the kernel is possible (there are a few ideas), but for the time being I'm presenting some encouraging results...
Sunday, 19 October 2014
Timing CUDA kernels
Are you about to benchmark your CUDA application? Are you looking for the easiest possible way to time your kernels? Then you're in the right place, because in this post we go through some code snippets to help you out. The implementation I present today is inspired by MATLAB's simple tic-toc interface: you start the timer with tic() and get the elapsed time with toc(). This saves a lot of lines of code and keeps your code simple and tidy.
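One way such a tic()/toc() pair could be built is on top of CUDA events. The sketch below is an assumption about the general approach, not the implementation from the full post; the kernel busy and the global event variables are hypothetical, and error checking is left out.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

/* Hypothetical global events shared by tic() and toc(). */
static cudaEvent_t start_evt, stop_evt;

/* Start the timer: create the events and record the start on stream 0. */
void tic(void) {
    cudaEventCreate(&start_evt);
    cudaEventCreate(&stop_evt);
    cudaEventRecord(start_evt, 0);
}

/* Stop the timer and return the elapsed time in milliseconds. */
float toc(void) {
    float ms = 0.0f;
    cudaEventRecord(stop_evt, 0);
    cudaEventSynchronize(stop_evt);    /* wait until the stop event completes */
    cudaEventElapsedTime(&ms, start_evt, stop_evt);
    cudaEventDestroy(start_evt);
    cudaEventDestroy(stop_evt);
    return ms;
}

/* Illustrative kernel to have something to time. */
__global__ void busy(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main(void) {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    tic();
    busy<<<(n + 255) / 256, 256>>>(d_x, n);
    float elapsed_ms = toc();

    printf("Kernel took %f ms\n", elapsed_ms);
    cudaFree(d_x);
    return 0;
}
```

Events are recorded on the GPU timeline, so the measurement covers the kernel's execution rather than just the asynchronous launch call on the host.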
Friday, 17 October 2014
CUDA C - Tutorials and other resources

Tuesday, 14 October 2014
Matrix-vector multiplication using shared memory

Monday, 13 October 2014
Static allocation of __device__ vars

Saturday, 11 October 2014
Memories from CUDA - Pinned memory (III)

Friday, 10 October 2014
Memories from CUDA - Symbol Addresses (II)
In a previous post we gave a simple example of accessing constant memory in CUDA from inside a kernel function. What if we need to access it from the host (i.e., pass it as an argument to a kernel)? In this post we focus on how to use cudaGetSymbolAddress to get the address of a device variable (either a __constant__ or a __device__ one). Modern architectures support the __managed__ keyword, but we're going to do things the old way...
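A minimal sketch of this pattern, assuming a statically allocated __device__ array (the same call also applies to a __constant__ symbol). The names dev_table, copy_out and d_table_ptr are illustrative only, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

/* Hypothetical statically allocated device variable. */
__device__ float dev_table[4];

/* The kernel receives the symbol's address as an ordinary pointer argument. */
__global__ void copy_out(const float *src, float *dst, int n) {
    int i = threadIdx.x;
    if (i < n) dst[i] = src[i];
}

int main(void) {
    float h_in[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float h_out[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    /* Fill the device symbol from the host. */
    cudaMemcpyToSymbol(dev_table, h_in, sizeof(h_in));

    /* Ask the runtime for the device address behind the symbol. */
    float *d_table_ptr = NULL;
    cudaGetSymbolAddress((void **)&d_table_ptr, dev_table);

    float *d_out;
    cudaMalloc((void **)&d_out, sizeof(h_out));

    /* Pass the address to the kernel like any other device pointer. */
    copy_out<<<1, 4>>>(d_table_ptr, d_out, 4);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    for (int i = 0; i < 4; ++i) printf("%g ", h_out[i]);
    printf("\n");

    cudaFree(d_out);
    return 0;
}
```

The pointer returned by cudaGetSymbolAddress is a device address, so it can be passed to kernels or to cudaMemcpy, but it should not be dereferenced in host code.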
Memories from CUDA - Constant memory (I)
This is a post for all of us who feel we can't memorize all the different types of CUDA memory. It is the first in a series of upcoming posts on memory management in CUDA. Here, we present constant memory and explain how it can be accessed from the device through a comprehensive step-by-step example.
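To give a flavour of what accessing constant memory from the device looks like, here is a short sketch under assumed names (d_coeffs, eval_poly); it is not the step-by-step example from the full post, and error checking is omitted. The host fills the __constant__ array with cudaMemcpyToSymbol, and the kernel reads it directly by name.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

/* Hypothetical polynomial coefficients kept in constant memory. */
__constant__ float d_coeffs[3];

/* Evaluate c0 + c1*x + c2*x^2 for each element; all threads read the
   same constant values. */
__global__ void eval_poly(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = d_coeffs[0] + x[i] * (d_coeffs[1] + x[i] * d_coeffs[2]);
}

int main(void) {
    const int n = 4;
    float h_coeffs[3] = {1.0f, 2.0f, 3.0f};
    float h_x[n] = {0.0f, 1.0f, 2.0f, 3.0f};
    float h_y[n];

    /* Constant memory is written from the host through the symbol. */
    cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));

    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    eval_poly<<<1, n>>>(d_x, d_y, n);
    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i) printf("p(%g) = %g\n", h_x[i], h_y[i]);

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

Constant memory is read-only from the kernel's point of view and works best when, as here, all threads of a warp read the same address.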
Thursday, 9 October 2014
Jetson to Arduino over Serial
