Title A fast triangular solve on GPUs
Abstract The level 2 BLAS operation trsv performs a dense triangular solve and is often used in the solve phase of a direct solver following a matrix factorization. With the advent of manycore architectures, this memory-bound kernel is becoming increasingly important, particularly for sparse direct solvers used in optimization applications. In this paper, a high-performance implementation of the triangular solve is developed through a careful analysis of theoretical and practical bounds on the possible performance. This implementation outperforms CUBLAS by a factor of 5 to 15.
Organisation CSE-NAG , STFC , SCI-COMP
Report RAL Preprints RAL-P-2012-002. 2012.
Journal Article SIAM J Sci Comput 35, no. 3 (2013): C303-C322. doi:10.1137/12088358X
