Li-Wen Chang, Ian Wetherbee, John A. Stratton
Sparse Linear Algebra
Sparse matrix-vector multiplication is the core of many iterative solvers. SpMV is memory-bandwidth bound when the matrix is large. Thus, most optimization efforts have focused on improving memory bandwidth for both regular and irregular access.
Sparse matrix data can be stored or transformed into many previously studied data layout patterns, such as compressed sparse row (CSR) format, ELLPACK (ELL) format, and Jagged Diagonal Storage (JDS) format. Each is designed to store non-zero elements efficiently with its own regularization approaches. Computation in these three formats typically results in regular accesses over non-zero elements, both value and index fields, and irregular, data-dependent accesses over the dense vector. Considering which format to use for the GPU-optimized version, we note that JDS and ELL formats both work with finer parallelism granularities more easily than CSR format. Particularly, the JDS format is well-designed for parallel or vector processors, and can be viewed as a modification of ELL format minimizing imbalance among adjacent rows through row permutation. Because of its better load balance characteristics, we chose to base the GPU-optimized version on JDS format.
JDS format naturally results in stride-one access for the sparse matrix elements. Padding may be introduced to align data, but introduces holes in the input that become overhead bandwidth for a bandwidth-limited kernel, and is only applicable on certain architectures where alignment is very crucial. Other optimization efforts focused on the irregular accesses to the dense vector. On a GPU architecture without a general cache, the texture unit's cache can be used to improve the efficiency of irregular accesses to the vector data. Prefetching is also applied to hide more memory latency when high thread-level parallelism is not sufficiently available to hide latency alone.
112.spmv's input consists of a sparse matrix in coordinate format, and a dense vector.
112.spmv outputs the result of the multiplication of the sparse matrix with the vector.
The output file spmv.out contains detailed timing information about the run. It also shows which device was selected along with what devices where available to OpenCL.
C, C++, OpenCL
Last updated: $Date: 2014-02-03 16:05:20 -0500 (Mon, 03 Feb 2014) $