The open archive for STFC research publications

Full Record Details

Persistent URL http://purl.org/net/epubs/work/62933
Record Status Checked
Record Id 62933
Title Fast triangular solve on GPUs
Abstract The solve phase of a sparse direct solver is memory bound, and in some applications it may be run tens of times for each factorization. Given the continuation of Moore's law, the compute-bound factorization phase can be performed very quickly, so the performance of the solve phase is increasingly important. Modern GPUs have significantly higher memory bandwidth than modern CPUs, and hence provide an attractive target on which to execute the solve phase. The sparse solve phase is typically constructed from dense triangular solve and matrix-vector multiply operations, implemented as the level 2 BLAS routines _trsv and _gemv, respectively. The current NVIDIA CUBLAS implementation of _trsv outperforms MKL on the host only for the largest matrices. In this talk, we describe how to improve performance by an order of magnitude by minimizing memory latency overheads and by using global memory rather than kernel launches for synchronization.
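As an illustration of the decomposition the abstract mentions, the following is a minimal CPU-side sketch of a blocked forward substitution in which each diagonal block is handled by a dense triangular solve (the role of _trsv) and the rows below it are updated by a matrix-vector product (the role of _gemv). It is not the GPU kernel described in the talk; the function names, the block size, and the pure-Python list representation are assumptions made for this example only.

```python
def trsv_lower(L, b):
    """Dense forward substitution: solve L x = b for lower-triangular L."""
    x = list(b)
    for i in range(len(x)):
        for j in range(i):
            x[i] -= L[i][j] * x[j]
        x[i] /= L[i][i]
    return x


def gemv_update(A, x, y):
    """Return y - A x, the update applied to the not-yet-solved rows."""
    return [yi - sum(a * xj for a, xj in zip(row, x)) for row, yi in zip(A, y)]


def blocked_lower_solve(L, b, block=2):
    """Solve L x = b block by block (hypothetical helper, block size assumed)."""
    n = len(b)
    x = list(b)
    for start in range(0, n, block):
        end = min(start + block, n)
        # Triangular solve on the diagonal block (the _trsv step).
        diag = [row[start:end] for row in L[start:end]]
        x[start:end] = trsv_lower(diag, x[start:end])
        # Subtract this block's contribution from the remaining rows (the _gemv step).
        if end < n:
            below = [row[start:end] for row in L[end:]]
            x[end:] = gemv_update(below, x[start:end], x[end:])
    return x


L = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
b = [2.0, 7.0, 32.0]
x = blocked_lower_solve(L, b, block=2)
```

Both steps read each matrix entry once and do only a multiply and a subtract per entry, which is why the abstract characterizes the solve phase as memory bound rather than compute bound.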
Organisation CSE-NAG, STFC, SCI-COMP
Funding Information
Related Research Object(s): 63053
Licence Information:
Language English (EN)
Type Presentation
Details Presented at Parallel Matrix Algorithms and Applications 2012 (PMAA 2012), Birkbeck University, London, UK, 28-30 Jun 2012.
URI(s)
Local file(s) fast_triangular_solve.pdf
Year 2012