Fast triangular solve on GPUs

Persistent URL	http://purl.org/net/epubs/work/62933
Record Status	Checked
Record Id	62933
Title	Fast triangular solve on GPUs
Contributors	J Hogg (STFC Rutherford Appleton Lab.)
Abstract	The solve phase of a sparse direct solver is memory bound, and in some applications it may be run tens of times for each factorization. As Moore's law continues, the compute-bound factorization phase can be performed very quickly, so the performance of the solve phase is increasingly important. Modern GPUs have significantly higher memory bandwidth than modern CPUs and hence provide an attractive target on which to execute the solve phase. The sparse solve phase is typically constructed from dense triangular solve and matrix-vector multiply operations, implemented as the level 2 BLAS routines _trsv and _gemv, respectively. The current NVIDIA CUBLAS implementation of _trsv fails to beat the host MKL performance on all but the largest matrices. In this talk, we describe how to improve performance by an order of magnitude by minimizing memory latency overheads and by using global memory rather than kernel launches for synchronization.
Organisation	CSE-NAG , STFC , SCI-COMP
Keywords
Funding Information
Related Research Object(s):	63053
Licence Information:
Language	English (EN)

Type	Details	URI(s)	Local file(s)	Year
Presentation	Presented at Parallel Matrix Algorithms and Applications 2012 (PMAA 2012), Birkbeck, University of London, London, UK, 28-30 Jun 2012.		fast_triangular_solve.pdf	2012
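
The abstract notes that the solve phase is typically built from dense triangular solve (_trsv) and matrix-vector multiply (_gemv) operations. The following is a minimal CPU-side sketch of that decomposition for a lower-triangular system, written in plain Python for illustration only. It is not the GPU implementation described in the talk: the function names (`trsv_lower`, `gemv_update`, `blocked_lower_solve`) and the `block_size` parameter are hypothetical, and nothing here models kernel launches or global-memory synchronization.

```python
def trsv_lower(L, x, start, end):
    """Forward substitution on the diagonal block L[start:end, start:end].

    Overwrites x[start:end] with the solution of that small triangular system.
    Plays the role of a _trsv call (hypothetical helper, not a BLAS binding).
    """
    for i in range(start, end):
        s = x[i]
        for j in range(start, i):
            s -= L[i][j] * x[j]
        x[i] = s / L[i][i]


def gemv_update(L, x, row_start, row_end, col_start, col_end):
    """Update x[rows] -= L[rows, cols] * x[cols].

    Plays the role of a _gemv call on the block below the diagonal block.
    """
    for i in range(row_start, row_end):
        s = 0.0
        for j in range(col_start, col_end):
            s += L[i][j] * x[j]
        x[i] -= s


def blocked_lower_solve(L, b, block_size=2):
    """Solve L x = b for lower-triangular L, one diagonal block at a time.

    block_size is an illustrative assumption, not a value from the talk.
    """
    n = len(b)
    x = [float(v) for v in b]  # work on a copy of the right-hand side
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        trsv_lower(L, x, start, end)            # solve the diagonal block
        gemv_update(L, x, end, n, start, end)   # propagate to the rows below
    return x


if __name__ == "__main__":
    L = [[2.0, 0.0, 0.0],
         [1.0, 1.0, 0.0],
         [3.0, 2.0, 4.0]]
    b = [2.0, 3.0, 19.0]
    print(blocked_lower_solve(L, b))
```

Each diagonal block must be finished before the rows below it can be updated, so the loop over blocks is inherently sequential. That dependency is the synchronization point the abstract refers to. How the talk resolves it on the GPU is not shown here.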