Tuesday, 21 October 2014

CUDA and pointers to pointers

Say you have a CUDA kernel that operates on a number of different arrays d_A1, d_A2, ..., d_An which are not stored consecutively in device memory, i.e., there is no single array d_A that stores the elements of d_A1, d_A2, ..., d_An in some particular order. If you need to pass all these arrays to the kernel, pointers to pointers do the job. In this post we'll see how to use pointers to pointers in CUDA C.
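As a taster, here is a minimal sketch of the idea (the names scaleAll, h_ptrs and d_ptrs are mine, not from the post): each array is allocated separately on the device, and the array of device pointers is itself copied to the device so the kernel can index into it.

```cuda
#include <stdio.h>

// Kernel that scales every element of every array by 2.
// d_arrays is a device array of device pointers.
__global__ void scaleAll(float **d_arrays, int n, int len) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len)
        for (int j = 0; j < n; j++)
            d_arrays[j][i] *= 2.0f;
}

int main(void) {
    const int n = 3, len = 256;
    float *h_ptrs[n];          // host copies of the device pointers
    float **d_ptrs;            // device array of device pointers

    // Allocate each array separately on the device
    for (int j = 0; j < n; j++)
        cudaMalloc(&h_ptrs[j], len * sizeof(float));

    // Copy the array of pointers itself to the device
    cudaMalloc(&d_ptrs, n * sizeof(float *));
    cudaMemcpy(d_ptrs, h_ptrs, n * sizeof(float *), cudaMemcpyHostToDevice);

    scaleAll<<<(len + 255) / 256, 256>>>(d_ptrs, n, len);
    cudaDeviceSynchronize();

    for (int j = 0; j < n; j++) cudaFree(h_ptrs[j]);
    cudaFree(d_ptrs);
    return 0;
}
```

Note the two-level bookkeeping: the host keeps its own copy of the device pointers (h_ptrs) so it can still free each array individually.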

Monday, 20 October 2014

High-performance matrix-vector multiplication in CUDA C

Here I present a custom kernel for matrix-vector multiplication written in CUDA C, along with benchmarking results on a Tegra K1 (on a Jetson TK1 development board) and a comparison against cuBLAS's function cublasSgemv. This is an open-source project hosted on GitHub. This post comes, as promised, as a sequel to an older post about matrix-vector multiplication in CUDA using shared memory. Further optimisation of the kernel is possible (there are a few ideas), but for the time being I'm presenting some encouraging results...
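For reference, the cuBLAS baseline used in such a comparison looks roughly like this (a sketch, assuming a column-major m x n matrix with all operands already on the device; the wrapper name gemv_cublas is mine):

```cuda
#include <cublas_v2.h>

// Baseline for the benchmark: y = A*x via cuBLAS.
// d_A is m x n, column-major; d_x and d_y are device vectors.
void gemv_cublas(cublasHandle_t handle, const float *d_A,
                 const float *d_x, float *d_y, int m, int n) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_N, m, n,
                &alpha, d_A, m,      // lda = m for column-major
                d_x, 1, &beta, d_y, 1);
}
```

Any custom kernel has to beat (or at least approach) this call to be worth the trouble.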

Sunday, 19 October 2014

Timing CUDA kernels

Are you up to benchmarking your CUDA application? Are you looking for the easiest possible way on earth to time your kernels? Then you're in the right place, because in this post we're going through some code snippets to help you out. The implementation I present today is inspired by MATLAB's simple tic-toc interface. You start the timer with a tic() and get the elapsed time with a toc(). This saves a lot of lines of code and keeps your code simple and tidy.
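The gist of the approach, sketched with CUDA events (the exact implementation in the post may differ; tic/toc here are minimal stand-ins):

```cuda
#include <stdio.h>

// Minimal tic()/toc() built on CUDA events.
static cudaEvent_t t_start, t_stop;

void tic(void) {
    cudaEventCreate(&t_start);
    cudaEventCreate(&t_stop);
    cudaEventRecord(t_start, 0);
}

float toc(void) {                 // returns elapsed time in milliseconds
    float ms = 0.0f;
    cudaEventRecord(t_stop, 0);
    cudaEventSynchronize(t_stop);  // wait for all preceding work to finish
    cudaEventElapsedTime(&ms, t_start, t_stop);
    cudaEventDestroy(t_start);
    cudaEventDestroy(t_stop);
    return ms;
}

// Usage:
//   tic();
//   myKernel<<<grid, block>>>(...);
//   printf("Elapsed: %.3f ms\n", toc());
```

CUDA events time work on the GPU itself, so, unlike host-side timers, they don't need a cudaDeviceSynchronize() before starting the clock.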

Friday, 17 October 2014

CUDA C - Tutorials and other resources

This is a collection of tutorials, blogs, articles and other resources for CUDA C that I hope you'll find useful. Feel free to contribute, with a comment, anything you think can help people learn CUDA and optimise their code. I am planning to keep this post continuously updated, so stay tuned. Most of the links hosted here point to free resources.

Tuesday, 14 October 2014

Matrix-vector multiplication using shared memory

Matrix-vector multiplications, as well as matrix-matrix ones, are essential for any sort of numerical computation. GPGPUs allow massive parallelisation of such operations. This post is about performing matrix-vector multiplications in CUDA using shared memory: a type of on-chip memory that is much faster than the device's global memory; in fact, shared memory can be up to 100 times faster than global memory, provided there are no bank conflicts between threads. Algebraic operations such as matrix-vector multiplications can benefit a lot from shared memory, as we will show in this post.
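The core idea can be sketched as follows (a simplified row-major variant, not necessarily the exact kernel from the post): each thread accumulates one row of A*x, while the block cooperatively stages tiles of x through shared memory so that every element of x is read from global memory only once per block.

```cuda
#define BLOCK_SIZE 256

// y = A*x with A stored row-major (m x n); one thread per row.
__global__ void matvec_shared(const float *A, const float *x,
                              float *y, int m, int n) {
    __shared__ float x_tile[BLOCK_SIZE];
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (int tile = 0; tile < n; tile += BLOCK_SIZE) {
        // Each thread loads one element of the current tile of x
        if (tile + threadIdx.x < n)
            x_tile[threadIdx.x] = x[tile + threadIdx.x];
        __syncthreads();

        if (row < m) {
            int tile_len = min(BLOCK_SIZE, n - tile);
            for (int j = 0; j < tile_len; j++)
                acc += A[row * n + tile + j] * x_tile[j];
        }
        __syncthreads();   // don't overwrite x_tile while others read it
    }
    if (row < m) y[row] = acc;
}
```

Accesses to x_tile are broadcast to all threads in a warp (same index j), so no bank conflicts arise here.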

Monday, 13 October 2014

Static allocation of __device__ vars

This is a brief post on how to allocate device memory statically using the __device__ keyword. We explain how one can statically allocate memory on the device and initialise it as in standard C code. This memory is then available to the device but cannot be directly referenced from the host (unless cudaGetSymbolAddress is used, as discussed in a previous post).
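In a nutshell (a sketch; the variable names d_coeffs and printCoeffs are hypothetical):

```cuda
#include <stdio.h>

// Statically allocated, initialised device memory
__device__ float d_coeffs[4] = {1.0f, 0.5f, 0.25f, 0.125f};

__global__ void printCoeffs(void) {
    printf("coeff[%d] = %f\n", threadIdx.x, d_coeffs[threadIdx.x]);
}

int main(void) {
    printCoeffs<<<1, 4>>>();
    cudaDeviceSynchronize();

    // The host cannot dereference d_coeffs directly, but it can
    // copy to/from it by symbol:
    float h_coeffs[4];
    cudaMemcpyFromSymbol(h_coeffs, d_coeffs, sizeof(h_coeffs));
    return 0;
}
```

No cudaMalloc, no explicit free: the allocation lives for the lifetime of the CUDA context.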

Saturday, 11 October 2014

Memories from CUDA - Pinned memory (III)

The main motivation for using pinned memory is to perform asynchronous transfers of data from the host to the device, which is accomplished using cudaMemcpyAsync and related functions. Pinned (or page-locked) memory also brings certain performance benefits of its own, and in some cases a further gain can be obtained by using write-combined memory. In this post we give a few examples of how to allocate pinned memory and we investigate its features.
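A minimal sketch of the pattern (buffer names are mine):

```cuda
int main(void) {
    const size_t N = 1 << 20;
    float *h_pinned, *d_data;

    // Page-locked host memory; pass cudaHostAllocWriteCombined
    // instead when the host only ever writes to the buffer.
    cudaHostAlloc((void **)&h_pinned, N * sizeof(float), cudaHostAllocDefault);
    cudaMalloc(&d_data, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // With pinned memory this copy is truly asynchronous and can
    // overlap with host work or with kernels in other streams.
    cudaMemcpyAsync(d_data, h_pinned, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);   // wait before reusing h_pinned

    cudaStreamDestroy(stream);
    cudaFreeHost(h_pinned);          // not free()!
    cudaFree(d_data);
    return 0;
}
```

Had h_pinned been allocated with plain malloc, the "async" copy above would silently fall back to a synchronous transfer.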

Friday, 10 October 2014

Memories from CUDA - Symbol Addresses (II)

In a previous post we gave a simple example of accessing constant memory in CUDA from inside a kernel function. What if we need to access it from the host (i.e., pass it as an argument to a kernel)? In this post we focus on how to use cudaGetSymbolAddress to get the address of a device variable (it can be a __constant__ or a __device__ variable). Nowadays, modern architectures support the keyword __managed__, but we're going to do things the old way...
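The essence of the trick, as a sketch (d_gain is a made-up symbol for illustration):

```cuda
#include <stdio.h>

__constant__ float d_gain;   // device symbol, not addressable from host code

int main(void) {
    float gain = 3.0f;
    cudaMemcpyToSymbol(d_gain, &gain, sizeof(float));

    // Retrieve the device-side address of the symbol on the host:
    void *d_gain_addr = NULL;
    cudaGetSymbolAddress(&d_gain_addr, d_gain);

    // d_gain_addr now behaves like any device pointer: it can be
    // passed to cudaMemcpy (device-to-device) or handed to a kernel.
    printf("Device address of d_gain: %p\n", d_gain_addr);
    return 0;
}
```

Once you hold the address, the symbol stops being special: it is just device memory.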

Memories from CUDA - Constant memory (I)

This is a post for all of us who feel we can't memorize all these different types of CUDA memories. This is the first post of an upcoming series on memory management in CUDA. Here, we present constant memory and we explain how it can be accessed from the device through a step-by-step comprehensive example.
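To give a flavour of what the post covers, here is a minimal constant-memory sketch (the 5-tap filter and all names are hypothetical):

```cuda
__constant__ float c_filter[5];   // lives in constant memory

__global__ void applyFilter(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 2 && i < n - 2) {
        float s = 0.0f;
        for (int k = -2; k <= 2; k++)        // all threads read the same
            s += c_filter[k + 2] * in[i + k]; // c_filter entry: broadcast
        out[i] = s;
    }
}

int main(void) {
    const float h_filter[5] = {0.1f, 0.2f, 0.4f, 0.2f, 0.1f};
    // Constant memory is written from the host by symbol:
    cudaMemcpyToSymbol(c_filter, h_filter, sizeof(h_filter));

    const int n = 1024;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    applyFilter<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Constant memory shines exactly in this access pattern: when all threads of a warp read the same address, the value is broadcast through the constant cache.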

Thursday, 9 October 2014

Jetson to Arduino over Serial

OK, so this is the plan: every time the temperature of my Jetson TK1 GPU reaches 60°C or so, a siren starts howling, a blinding red light blinks in the dark underground lab and reflects on screens that output incomprehensible sequences of numbers and data, a robotic voice goes "evacuate the building", "this is an emergency" and other catchy phrases, etc., etc. But first I need to hook up the Jetson TK1 to an Arduino board which outputs the temperature to an LCD display... Let's see how this is done...