Monday 20 October 2014

High-performance matrix-vector multiplication in CUDA C

Here I present a custom kernel for matrix-vector multiplication written in CUDA C, along with benchmarking results on a Tegra K1 (on a Jetson TK1 development board) and a comparison with cuBLAS's function cublasSgemv. This is an open-source project hosted on GitHub. This post comes, as promised, as a sequel to an older post about matrix-vector multiplication in CUDA using shared memory. Further optimisation of the kernel is possible (there are a few ideas), but for the time being I'm presenting some encouraging results...


The kernel

First of all, I would like to thank the community of Stack Overflow and all the people who offered constructive hints on a question I posted.

In all that follows it is assumed that the values of matrix A are stored in column-major order (Fortran style, that is, column by column). I also tried a row-major variant, but it leads to uncoalesced global memory accesses, which slow the algorithm down considerably. I recommend these slides by P. Micikevicius for more information about memory coalescing for high-performance computing.
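To make the layout concrete, here is the indexing convention in code (a small illustration only; the macro name IDX2C and the helper fill_colmajor are mine, not something the project defines):

    /* Column-major (Fortran-style) storage: element (i, j) of an
       m-by-n matrix sits at offset i + j*m, so threads with
       consecutive row indices i read adjacent addresses and their
       global-memory loads coalesce. */
    #define IDX2C(i, j, m) ((i) + (j) * (m))

    /* Fill an m-by-n matrix column by column (host side). */
    void fill_colmajor(float *A, int m, int n)
    {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++)
                A[IDX2C(i, j, m)] = 0.0f;  /* A(i, j) */
    }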

Here is the kernel function in outline:
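What follows is a minimal sketch of the shared-memory scheme rather than the verbatim repository code; the names matvec_kernel, dA, dx, dy and the compile-time tile size BLOCK_SIZE are illustrative (see the GitHub repository for the exact implementation):

    #define BLOCK_SIZE 32   /* threads per block; tuned below */

    /* y = A * x for a column-major nRows-by-nCols matrix A.
       One thread per row of A; the whole block stages successive
       BLOCK_SIZE-long tiles of x in shared memory. */
    __global__ void matvec_kernel(const float * __restrict__ dA,
                                  const float * __restrict__ dx,
                                  float * __restrict__ dy,
                                  unsigned int nRows,
                                  unsigned int nCols)
    {
        unsigned int row = blockIdx.x * blockDim.x + threadIdx.x;
        __shared__ float x_shared[BLOCK_SIZE];
        float y_val = 0.0f;

        for (unsigned int col0 = 0; col0 < nCols; col0 += BLOCK_SIZE) {
            /* All threads of the block cooperatively load one tile of x;
               the tail of the last tile is padded with zeros. */
            x_shared[threadIdx.x] = (col0 + threadIdx.x < nCols)
                                  ? dx[col0 + threadIdx.x] : 0.0f;
            __syncthreads();

            if (row < nRows) {
                unsigned int len = nCols - col0;
                if (len > BLOCK_SIZE) len = BLOCK_SIZE;
                /* Column-major A: threads row, row+1, ... read adjacent
                   addresses within column col0 + e, so loads coalesce. */
                for (unsigned int e = 0; e < len; e++)
                    y_val += dA[row + (col0 + e) * nRows] * x_shared[e];
            }
            __syncthreads();
        }

        if (row < nRows)
            dy[row] = y_val;
    }

With one thread per row, a launch then looks like:

    matvec_kernel<<<(nRows + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(dA, dx, dy, nRows, nCols);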


Determination of the optimal block size

For this purpose we need to do some benchmarking. The results reported here were obtained on the Tegra K1 GPU of a Jetson TK1; optimal tuning is likely to differ on other hardware. All binaries were compiled with the nvcc flag -O3 (maximum code optimisation).
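Measurements of this kind can be taken with a small cudaEvent-based harness along the following lines (a sketch; the function name time_matvec is illustrative, and BLOCK_SIZE is the compile-time tile size of the kernel sketch above, so each candidate block size means a recompile or a template parameter):

    #include <cuda_runtime.h>

    /* Average kernel time in milliseconds over `runs` launches. */
    float time_matvec(const float *dA, const float *dx, float *dy,
                      unsigned int nRows, unsigned int nCols, int runs)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        dim3 grid((nRows + BLOCK_SIZE - 1) / BLOCK_SIZE);

        cudaEventRecord(start);
        for (int r = 0; r < runs; r++)
            matvec_kernel<<<grid, BLOCK_SIZE>>>(dA, dx, dy, nRows, nCols);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms / runs;   /* mean over all launches */
    }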



If a single block size is to be used regardless of the matrix dimensions, 32 seems a good choice; block sizes of 32, 64 and 128 were found to produce decent results in all cases. Here is a series of benchmarks assessing the effect of the block size on execution speed. First, let's keep the number of columns fixed and see how the execution time varies with the number of rows:

[Figure: execution time versus number of rows, for various block sizes, with the number of columns fixed.]

and let's now fix the number of rows and study the dependence on the number of columns:

[Figure: execution time versus number of columns, for various block sizes, with the number of rows fixed.]

All reported computation times are averages over 50 runs; the variance was low in all cases.

It seems that matrices with more columns are better accommodated by a larger block size. It also makes sense for the block size to be smaller than or equal to the column size of the matrix, to avoid thread divergence.

Comparison to cublasSgemv

Here we compare the performance of the custom kernel with cublasSgemv. Prior to benchmarking, we verified that the two methods return the same result up to an allowed relative error of 0.0001.
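The reference computation and the error check can be set up along these lines (a sketch; sgemv_ref and results_agree are illustrative names). Note that cublasSgemv expects column-major storage, so the same device buffer dA serves both implementations:

    #include <cublas_v2.h>
    #include <math.h>

    /* y = 1*A*x + 0*y via cuBLAS; A is nRows-by-nCols, column-major. */
    void sgemv_ref(cublasHandle_t handle, const float *dA,
                   const float *dx, float *dy, int nRows, int nCols)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemv(handle, CUBLAS_OP_N, nRows, nCols,
                    &alpha, dA, nRows, dx, 1, &beta, dy, 1);
    }

    /* Host-side check: 1 if every entry agrees within relative
       tolerance tol (e.g. 1e-4f), 0 otherwise. */
    int results_agree(const float *y_custom, const float *y_cublas,
                      int nRows, float tol)
    {
        for (int i = 0; i < nRows; i++) {
            float denom = fabsf(y_cublas[i]) > 1e-12f
                        ? fabsf(y_cublas[i]) : 1.0f;
            if (fabsf(y_custom[i] - y_cublas[i]) / denom > tol)
                return 0;
        }
        return 1;
    }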

Next steps

More work needs to be done to properly assess the performance of this custom kernel, so I'll update this post soon with more results. One may also be interested in determining the optimal configuration for the kernel, based on the dimensions of the matrix, before launching it; a sketch of such a heuristic follows.
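As a first cut, such a selection could be a simple lookup driven by the trends above (purely a sketch; the thresholds are only a guess from the measurements reported here, and BLOCK_SIZE would have to become, say, a template parameter so that each value maps to a compiled kernel instance):

    /* Illustrative heuristic: the largest of the well-behaved block
       sizes {32, 64, 128} that does not exceed the number of columns. */
    unsigned int pick_block_size(unsigned int nCols)
    {
        if (nCols >= 128) return 128;
        if (nCols >= 64)  return 64;
        return 32;
    }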


2 comments:

  1. I love you!!!!! I had been searching for this for three months, I had started losing my mind, and suddenly I stumble upon it!!!! You are a god and I adore you!!!!

    1. Dear reader,
      I love you very, very much too!!! I'm glad you liked my article.