Wednesday, November 13, 2013

CUDA-accelerated sparse matrix assembly and solution using CUSP

1. If you use finite element methods for your numerical PDEs, chances are good that at some point you will need to generate one or more large sparse matrices.

2. Of course, chances are also good that you don't. It seems especially prevalent today, in the midst of an explosion of development in computing with massively parallel processors, that "global" variants of finite element methods that require large sparse matrices are shunned in favor of discontinuous, "local" formulations that avoid such structures.

3. Whatever your opinion may be on the matrix question, for the purposes of this post, assume that you do need to construct a global matrix - a global mass, stiffness or damping matrix for a structural dynamics code, say; or maybe a large advection matrix, or a Laplacian operator - whatever; a big matrix.

4. For simulations that are small (fewer than roughly 100 thousand elements) AND for simulations where you only have to construct these matrices once, the usual idiom of looping through elements, integration points and interpolation points works fine - even in MATLAB.

5. On the other hand, if your simulation is a bit larger - though still something that your workstation should be able to handle, say ~10-100 million elements - OR if your simulation involves geometric and/or material non-linearities such that you need to regenerate these global matrices frequently, then slow, serial, loop-driven assembly routines will not suffice.

6. Luckily, the nice, smart folks at NVIDIA have provided some nifty tools to make assembly of large sparse matrices with CUDA an only mildly painful experience.

7. In general, for generating your sparse matrix, the basic steps are:


a. Generate an unordered list of 3-tuples composed of {row #, column #, value}. These are produced by your assembly routine and constitute the data for the sparse matrix, with one tuple for each pass through the element -> integration_point -> interpolation_point nest of loops.

b. Provide these vectors to a routine that generates your sparse matrix in a standard format suitable for your PDE solver.
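To make step (a) concrete, here is a minimal, CPU-only sketch (plain C++, no CUDA) of what the triplet list looks like for a toy problem: a 1D mesh of two-node elements, where each element contributes a 2x2 local matrix to the global system. The local matrix values here are placeholders standing in for real integration-point contributions, and the function and struct names are my own, not part of any library.

```cpp
#include <cstddef>
#include <vector>

// One {row, column, value} entry of the unordered triplet list.
struct Triplet { int row; int col; double val; };

// Toy "assembly" loop: each 1D two-node element e connects global
// nodes e and e+1 and contributes a 2x2 local matrix. The local
// matrix [[1,-1],[-1,1]] is a placeholder 1D stiffness-like stencil,
// standing in for real integration/interpolation-point contributions.
std::vector<Triplet> assemble_triplets(int num_elements)
{
    const double local[2][2] = { { 1.0, -1.0 }, { -1.0, 1.0 } };
    std::vector<Triplet> t;
    t.reserve(4 * static_cast<std::size_t>(num_elements));
    for (int e = 0; e < num_elements; ++e) {
        const int nodes[2] = { e, e + 1 };      // element -> global node map
        for (int a = 0; a < 2; ++a)             // local row
            for (int b = 0; b < 2; ++b)         // local column
                t.push_back({ nodes[a], nodes[b], local[a][b] });
    }
    return t;  // unordered, with duplicate (row,col) pairs at shared nodes
}
```

Note that interior nodes are shared by two elements, so the list contains duplicate (row, col) pairs whose values must be summed - which is exactly the bookkeeping that step (b) handles for you.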

8. In MATLAB, this is easy. You write a routine to generate the tuples (I usually store them as 3 vectors, or a 2D array with 3 columns), then feed them to the built-in function sparse like so:

K = sparse(row_vec,col_vec,val_vec,num_rows,num_cols);


and *POOF* you're done!


9. So how to do this with CUDA?


10. Wouldn't it be nice if there was a CUDA library that would do exactly that? (*cough* hint, hint)

Fortunately, one does almost exactly that.

CUSP.

11. If you haven't tried this already, visit the main CUSP page on Google code here and download the source. Since it's a template library, all you need to do is copy the source folder into your CUDA include directory and you're done. If you do not have permission to write to the CUDA include directory, just put the cusp folder somewhere convenient and add that location to the include path on your compile line.

12. One of the example codes that the CUSP developers provide is a sparse matrix assembly routine (code is here)

13. Health Warning: if you're NOT a seasoned C++ user, you may need to pause a bit after viewing the source code to let your headache clear. Maybe a small glass of Talisker will help. If not, try a larger glass. While you're recovering, take heart and have faith that reading and studying the code, and straining through the online Thrust tutorials and demonstrations, will make you a better person and maybe even convince you to like C++ a bit more.

14. Or maybe not. But either way, it's a super-slick little piece of code. I wish I could say I wrote it.

15. You can use the core of this routine to write your own function to generate a properly formatted sparse matrix in device memory.
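The core idea behind that routine - worth sketching, since it carries over directly to your own function - is: sort the triplets by (row, col), then sum runs of duplicate coordinates into single entries. On the device this is done with Thrust-style sort-by-key and reduce-by-key operations; the CPU-only sketch below shows the same logic in plain C++ (the `Triplet` struct and `coalesce` name are mine, for illustration only).

```cpp
#include <algorithm>
#include <vector>

// One {row, column, value} entry of the unordered triplet list.
struct Triplet { int row; int col; double val; };

// Coalesce an unordered triplet list into sorted, duplicate-free COO
// data: sort lexicographically by (row, col), then accumulate runs of
// entries that share the same coordinates. This mirrors the
// sort/reduce-by-key pattern used on the GPU, but on the host.
std::vector<Triplet> coalesce(std::vector<Triplet> t)
{
    std::sort(t.begin(), t.end(), [](const Triplet& a, const Triplet& b) {
        return a.row != b.row ? a.row < b.row : a.col < b.col;
    });
    std::vector<Triplet> out;
    for (const Triplet& e : t) {
        if (!out.empty() && out.back().row == e.row && out.back().col == e.col)
            out.back().val += e.val;   // duplicate (row,col): accumulate
        else
            out.push_back(e);          // new nonzero entry
    }
    return out;  // valid COO data, ready for conversion to CSR
}
```

The sorted, coalesced result maps straight onto CUSP's COO matrix format, from which conversion to CSR is a one-liner.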

16. This matrix is then ready to be used along with the BLAS-type functions provided by CUSP for time-dependent problems or, if you have a steady-state problem or otherwise one that requires solving this system of equations, you can use the growing collection of preconditioners and iterative solvers provided with CUSP.

17. Performance for the assembly routine is, of course, implementation dependent and - if you're like me - the actual assembly part using the code provided will be far more efficiently implemented than the kernel you write to generate the three input vectors. For constructing and assembling a finite element Laplacian matrix on my GTX-580, the routine required less than 0.01 seconds for a small system with 12 thousand degrees of freedom and 23 thousand elements. For a somewhat larger system with 750K degrees of freedom and 1.5M elements, the routine took about 0.25 seconds; both times were a small fraction of what was required using my usual MATLAB routines.

18. There are other things that I would say about CUSP, but this post is already too long. In my opinion, it is much easier to use than CUSPARSE on which, I assume, it is based. I love the Matrix Market input/output options. Of course I also love that it doesn't cost money; but that's just me.

19. As for the ease-of-use for the CUSP-provided sparse matrix solver, I can only say that it would be hard to ask for anything much easier to use within a C++ environment.

As a quick example, try the following code along with a sparse, positive definite matrix from Matrix Market.

#include <iostream>
#include <ctime>

#include <cusp/csr_matrix.h>
#include <cusp/monitor.h>
#include <cusp/krylov/gmres.h>
#include <cusp/io/matrix_market.h>
#include <cusp/precond/smoothed_aggregation.h>

using namespace std;

int main(void)
{
    // create an empty sparse matrix data structure
    cusp::csr_matrix<int,float,cusp::device_memory> A;

    // import a matrix from Matrix Market format
    cout << "Reading in the matrix..." << endl;
    cusp::io::read_matrix_market_file(A,"boneS01.mtx");

    // initialize the solution and right-hand-side vectors
    cusp::array1d<float,cusp::device_memory> x(A.num_rows,0);
    cusp::array1d<float,cusp::device_memory> b(A.num_rows,1);

    // set up stopping criteria:
    // iteration limit = 5000, relative tolerance = 1e-5
    cusp::verbose_monitor<float> monitor(b,5000,1e-5);
    int restart = 50;

    // set up the preconditioner
    cout << "Forming preconditioner..." << endl;
    cusp::precond::smoothed_aggregation<int,float,cusp::device_memory> M(A);

    cout << "Solving the system..." << endl;
    clock_t begin = clock();
    cusp::krylov::gmres(A,x,b,restart,monitor,M);
    clock_t done = clock();

    cout << "Elapsed time = "
         << (double(done) - double(begin))/double(CLOCKS_PER_SEC)
         << endl;

    return 0;
}



20. So that's it. Easy, huh? In general, I've found that for the sparse-matrix-solving phase, even with a decent card like the GTX-580 or the C2070, it is hard to expect stupendous speedups over a good CPU sparse matrix library - maybe 10x or so (if that) for a big (yet still single-workstation-size) system. Still, I claim that the CUSP library is easily an order of magnitude easier to use than something like the Intel Math Kernel Library; but that also is just my opinion.

21. There are other options; I've given the EM Photonics CULA Sparse a try via their free demonstration, but I cannot even pretend to give a fair comparison. If you know of someone who has, I would be interested in hearing about it.

-- Many thanks (Source: http://noiceinmyscotchplease.blogspot.co.uk) - a great blog to look into for many things