Published by Bryan O’Brien. Modified over 9 years ago.
1
GPU Sparse LU Factorization and Its Application in Circuit Simulation
Ling Ren
Nano-scale Integrated Circuit and System Lab., Department of Electronic Engineering, Tsinghua University
2
Abstract
3
Outline
- Background
- Sparse LU factorization
- Dense LU factorization
- Summary
4
Background
- SPICE: the most popular circuit simulator
- Simulating a VLSI circuit (~1 billion transistors) takes several days
- Bottleneck: sparse LU factorization
- Other application areas: fluid dynamics, structural analysis, economics, …
5
Outline
- Background
- Sparse LU factorization
- Dense LU factorization
- Summary
6
Sparse LU factorization - related works
- [SuperLU1999]: sequential, multi-thread, and distributed versions; incorporates supernodes, efficient for dense blocks
- [Pardiso2002]: sequential, multi-thread, distributed, and GPU [Christen2007] versions; adopts supernodes
- But supernodes rarely form in circuit matrices
- [KLU2010]: optimized for circuit matrices; sequential only; uses the G/P left-looking algorithm [G/P 1988]; adopts BTF, without supernodes
7
Sparse LU factorization – left-looking
- Sequentially process each column
- When processing column k, use all the columns on the left (1, 2, ..., k-1) to update column k
- Update = vector multiply-and-add (MAD)
- Memory traffic dominates: read + write > arithmetic
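The column-by-column update can be illustrated with a minimal sketch. It assumes dense storage, no pivoting, and 0-based column indices, purely for readability; the function name `left_looking_lu` is hypothetical and this is not the GPU implementation described in the talk.

```python
def left_looking_lu(A):
    """Factor a square matrix A (list of rows) as A = L * U, one column at a time.

    Illustrative only: dense storage, no pivoting, nonzero pivots assumed.
    """
    n = len(A)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for k in range(n):
        x = [float(A[i][k]) for i in range(n)]  # working copy of column k
        for j in range(k):                      # visit the columns on the left
            if x[j] != 0.0:                     # column j matters only if U(j, k) != 0
                for i in range(j + 1, n):
                    x[i] -= L[i][j] * x[j]      # vector multiply-and-add (MAD)
        for i in range(k + 1):                  # upper part becomes column k of U
            U[i][k] = x[i]
        for i in range(k + 1, n):               # lower part, scaled by the pivot, goes to L
            L[i][k] = x[i] / x[k]
    return L, U


A = [[4.0, 3.0, 2.0], [6.0, 3.0, 1.0], [2.0, 5.0, 7.0]]
L, U = left_looking_lu(A)
```

Each pass of the inner loop reads one column of L and reads and writes the working vector while doing a single multiply-add per entry, which is why the slide notes that reads and writes outweigh arithmetic.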
8
Algorithm description – EGraph
- Every column is updated with several columns on its left
- The nonzero structure of U determines the dependency
- Figure: (a) upper triangular matrix U with nonzeros marked; (b) EGraph; each update is a vector MAD
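As a rough illustration of how the nonzero pattern of U yields the dependency graph, the sketch below reads the pattern in CSC form and lists, for each column k, the columns j < k with U(j, k) != 0. The function name `build_egraph` and the 4x4 example pattern are assumptions made for this example, not taken from the slides.

```python
def build_egraph(col_ptr, row_idx):
    """Return deps, where deps[k] lists the columns that column k depends on.

    col_ptr and row_idx describe the nonzero pattern of U in CSC form
    (0-based indices). Diagonal entries are skipped: they are not dependencies.
    """
    n = len(col_ptr) - 1
    deps = []
    for k in range(n):
        rows = row_idx[col_ptr[k]:col_ptr[k + 1]]
        deps.append(sorted(j for j in rows if j < k))
    return deps


# Hypothetical 4x4 pattern: full diagonal plus U(0,2), U(1,2) and U(2,3).
col_ptr = [0, 1, 2, 5, 7]
row_idx = [0, 1, 0, 1, 2, 2, 3]
deps = build_egraph(col_ptr, row_idx)
# deps == [[], [], [0, 1], [2]]
```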
9
Algorithm analysis – two kinds of parallelism
- Divide columns into levels: columns in the same level are independent of each other
- Cluster mode: many columns factorized in parallel
- Pipeline mode: overlap columns from different levels
- Pipeline parallelism follows the timing order: in pipeline mode, the factorization of columns 1, 2, 3, 4, ... is overlapped across threads (thread 1, thread 2, ...)
10
Sparse LU factorization - workflow
11
Sparse LU factorization - preprocessing
Preprocessing: performed only once, on the CPU
- MC64 to ensure numerical stability [MC64]
- Approximate Minimum Degree to reduce fill-ins [AMD]
- Pre-factorization (numeric factorization with partial pivoting) to calculate the symbolic structure of L and U
- Sorting the nonzeros of L and U (introduced later)
12
Sparse LU factorization – on GPU
GPU inputs
- Location and values of nonzeros in A
- Location of nonzeros in L and U
- The Escheduler
GPU outputs
- Values of nonzeros in L and U
CSC (Compressed Sparse Column) format is used for the sparse matrices A, L and U
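For readers unfamiliar with CSC, here is a small sketch of the three arrays it uses. The array names `col_ptr`, `row_idx` and `values` and the helper `dense_to_csc` are naming assumptions for this example; the slides do not specify the exact layout used on the GPU.

```python
def dense_to_csc(A):
    """Convert a dense matrix (list of rows) to CSC arrays with 0-based indices.

    col_ptr[j]:col_ptr[j + 1] is the slice of row_idx and values that
    belongs to column j.
    """
    n_rows, n_cols = len(A), len(A[0])
    col_ptr, row_idx, values = [0], [], []
    for j in range(n_cols):
        for i in range(n_rows):
            if A[i][j] != 0:
                row_idx.append(i)       # location of the nonzero
                values.append(A[i][j])  # value of the nonzero
        col_ptr.append(len(row_idx))
    return col_ptr, row_idx, values


A = [[4, 0, 1],
     [0, 3, 0],
     [2, 0, 5]]
col_ptr, row_idx, values = dense_to_csc(A)
# col_ptr == [0, 2, 3, 5], row_idx == [0, 2, 1, 0, 2], values == [4, 2, 3, 1, 5]
```

In these terms, the "location of nonzeros" is the pair (`col_ptr`, `row_idx`), which is known for L and U before the GPU starts, while `values` for L and U is what the GPU computes.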
13
Sparse LU factorization - avoid deadlock
- In traditional GPU programs, some wavefronts are inactive at the beginning (limited resources, etc.). They wait for other active wavefronts to finish and then become active.
- But in sparse LU, we must ensure that all wavefronts are active from the beginning
14
Sparse LU factorization - data formats
Data formats for intermediate results: dense arrays vs. CSC
- CSC (Compressed Sparse Column)
  - Can be put in local memory
  - Indexed accesses are inconvenient (binary search)
  - Using too much local memory reduces the number of active work-groups, which leads to severe performance loss
- Dense arrays outperform the CSC format by 2.5x
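The trade-off between the two formats can be sketched as follows. Looking up entry (i, j) in CSC needs a binary search within column j, whereas a column scattered into a dense array is indexed directly. Both helper names are hypothetical, and the sketch assumes row indices are sorted within each column.

```python
from bisect import bisect_left


def csc_lookup(col_ptr, row_idx, values, i, j):
    """Return entry (i, j) of a CSC matrix via binary search in column j.

    Assumes the row indices of each column are sorted in increasing order.
    """
    lo, hi = col_ptr[j], col_ptr[j + 1]
    pos = bisect_left(row_idx, i, lo, hi)
    if pos < hi and row_idx[pos] == i:
        return values[pos]
    return 0.0


def scatter_column(col_ptr, row_idx, values, j, n_rows):
    """Expand column j into a dense array so that x[i] is a direct access."""
    x = [0.0] * n_rows
    for p in range(col_ptr[j], col_ptr[j + 1]):
        x[row_idx[p]] = values[p]
    return x


col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 0, 2]
values = [4.0, 2.0, 3.0, 1.0, 5.0]
entry = csc_lookup(col_ptr, row_idx, values, 2, 0)   # binary search
dense = scatter_column(col_ptr, row_idx, values, 2, 3)  # then dense[2] is O(1)
```

The dense array costs memory proportional to the matrix dimension per column in flight, which is the price paid for avoiding the search.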
15
Sparse LU factorization - data locality
- Global memory bandwidth is higher if consecutive work-items access consecutive addresses
- Improve data locality: the nonzeros of L and U are out of order after preprocessing, so sort them according to row indices
- 1.7x speedup, negligible overhead
- Performed only once, incorporated into preprocessing
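A minimal sketch of the sorting step, assuming the factors are stored in CSC form with arrays named `col_ptr`, `row_idx` and `values` (names chosen for this example): within each column, the nonzeros are reordered by increasing row index, and `col_ptr` is unchanged.

```python
def sort_columns_by_row(col_ptr, row_idx, values):
    """Return (row_idx, values) with each column's entries sorted by row index."""
    new_rows, new_vals = [], []
    for j in range(len(col_ptr) - 1):
        lo, hi = col_ptr[j], col_ptr[j + 1]
        entries = sorted(zip(row_idx[lo:hi], values[lo:hi]), key=lambda e: e[0])
        for r, v in entries:
            new_rows.append(r)
            new_vals.append(v)
    return new_rows, new_vals


# Two columns whose entries are out of order after preprocessing.
col_ptr = [0, 3, 5]
row_idx = [2, 0, 1, 1, 0]
values = [30.0, 10.0, 20.0, 50.0, 40.0]
rows_sorted, vals_sorted = sort_columns_by_row(col_ptr, row_idx, values)
# rows_sorted == [0, 1, 2, 0, 1], vals_sorted == [10.0, 20.0, 30.0, 40.0, 50.0]
```

After sorting, neighbouring entries of a column refer to neighbouring rows, so consecutive work-items touch nearby addresses when they read or update a dense working vector.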
16
Experimental setups
- CPU: 2 Xeon E5405 CPUs (8 cores in total), 2x6 MB L2 cache, 16 GB RAM
- GPU: AMD Radeon 5870
- Testing matrices: University of Florida Sparse Matrix Collection [Davis], http://www.cise.ufl.edu/research/sparse/matrices/
17
Sparse LU factorization - Experimental results
- GPU speedups are positively related to the number of floating point operations (flops)
18
Sparse LU factorization - Experimental results
- Matrices are divided into 4 groups
- The first three groups are divided according to Mflops; GPU speedup is positively related to Mflops
- 4th group: matrices that produce denormal floating point numbers
  - Denormals are used to represent extremely small numbers
  - Processed very slowly on the CPU, supported at full speed on the GPU
  - An advantage of the GPU in sparse LU and scientific computing
  - Very high speedups for this group
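To make "denormal" concrete, the sketch below shows where such numbers sit for IEEE 754 double precision, which is what Python floats use on common platforms. It only illustrates the number format; it does not measure the CPU slowdown mentioned above, and the helper `is_denormal` is a hypothetical name.

```python
import sys

smallest_normal = sys.float_info.min  # smallest positive normal double, about 2.2e-308
denormal = smallest_normal / 4        # nonzero, but below the normal range
smallest_denormal = 5e-324            # smallest positive double (2**-1074)


def is_denormal(x):
    """True if x is nonzero and smaller in magnitude than the smallest normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min


print(is_denormal(denormal))   # True
print(is_denormal(1.0))        # False
print(smallest_denormal / 2)   # 0.0: nothing representable lies below the denormal range
```

Values like these can appear during factorization when tiny entries are multiplied together, which is the situation the fourth group of matrices exercises.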
19
Sparse LU factorization - Experimental results
Average speedup of each group

Group   GPU bandwidth (GB/s)   Over 1 CPU   Over 4 CPUs   Over 8 CPUs   Over KLU
1       0.81                   0.41         0.24          0.22          0.58
2       10.97                  2.43         0.85          0.55          3.64
3       52.59                  10.53        3.65          2.58          15.58
4       36.82                  26.86        8.01          4.48          25.61
All     15.91                  4.51         1.64          1.13          6.25
20
Scalability – BBD
- Problem: how to use multiple GPUs?
- Circuit-partition-based simulation algorithm: bordered-block-diagonal (BBD) form
- Diagonal blocks are factorized independently
- But A_n becomes dense, so we need dense LU factorization
21
Outline
- Background
- Sparse LU factorization
- Dense LU factorization
- Summary
22
Dense LU Factorization – blocked algorithm
- Three core operations
  - Dense LU factorization
  - Triangular matrix inversion
  - Matrix multiplication (GEMM)
- Suitable for GPU
  - GEMM is the most frequent operation
  - GEMM is very efficient on GPU: 920 Gflop/s (single), 290 Gflop/s (double)
- Figure labels: finished; LU + inverse; GEMM
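The way the three core operations fit together can be sketched in plain Python. This is a simplified illustration under stated assumptions: no pivoting, list-of-rows storage, and helper names (`matmul`, `lu_unblocked`, `inv_lower`, `inv_upper`, `blocked_lu`) invented for the example. It is not the OpenCL implementation from the talk.

```python
def matmul(X, Y):
    """Plain GEMM on lists of rows: returns X times Y."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(M):
    return [list(r) for r in zip(*M)]


def lu_unblocked(B):
    """LU of a small square block without pivoting; returns (L, U)."""
    n = len(B)
    U = [row[:] for row in B]
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U


def inv_lower(L):
    """Invert a lower triangular matrix by forward substitution, column by column."""
    n = len(L)
    X = [[0.0] * n for _ in range(n)]
    for c in range(n):
        for i in range(c, n):
            s = float(i == c) - sum(L[i][j] * X[j][c] for j in range(c, i))
            X[i][c] = s / L[i][i]
    return X


def inv_upper(U):
    """Invert an upper triangular matrix via the transpose of its lower counterpart."""
    return transpose(inv_lower(transpose(U)))


def blocked_lu(A, b):
    """Blocked LU with block size b: diagonal-block LU, triangular inverses, GEMMs."""
    n = len(A)
    A = [[float(v) for v in row] for row in A]  # work on a copy
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)
        L11, U11 = lu_unblocked([row[k0:k1] for row in A[k0:k1]])
        for i in range(k0, k1):                 # store L11 and U11 in place
            for j in range(k0, k1):
                A[i][j] = U11[i - k0][j - k0] if j >= i else L11[i - k0][j - k0]
        if k1 == n:
            break
        U12 = matmul(inv_lower(L11), [row[k1:] for row in A[k0:k1]])   # GEMM
        L21 = matmul([row[k0:k1] for row in A[k1:]], inv_upper(U11))   # GEMM
        update = matmul(L21, U12)                                      # GEMM
        for i in range(k0, k1):
            A[i][k1:] = U12[i - k0]
        for i in range(k1, n):
            A[i][k0:k1] = L21[i - k1]
            for j in range(k1, n):
                A[i][j] -= update[i - k1][j - k1]  # trailing (Schur complement) update
    L = [[A[i][j] if j < i else float(i == j) for j in range(n)] for i in range(n)]
    U = [[A[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return L, U


A = [[4, 1, 2, 0], [1, 5, 0, 1], [2, 0, 6, 1], [0, 1, 1, 7]]
L, U = blocked_lu(A, 2)
```

Only the small diagonal block goes through the sequential elimination loop; everything else is expressed as matrix multiplication, which is why GEMM throughput dominates the overall performance.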
23
Dense LU Factorization – performance
- 443 Gflop/s (single), 163 Gflop/s (double)
24
Dense LU Factorization – related works
Comparison to previous studies: performance of dense LU factorization (Gflop/s)

Work            Hardware                          Single   Double
[Galoppo2005]   GTX 7800                          10       --
[Volkov2008]    GTX 8800                          179      --
[Tomov2010]     8 Xeon Harpertown                 100      50
[Tomov2010]     GTX 280                           300      --
[Tomov2010]     8 Xeon Harpertown + GTX 280       388      99
Ours            Radeon 5870                       443      163
25
Dense LU Factorization – further improvement
- CPU BLAS for Gaussian elimination
- 100 Gflop/s
- GEMM can be further improved
- Scalability to multiple GPUs
  - Blocked dense LU: independent GEMMs on multiple GPUs
  - Diagonal blocks in BBD on multiple GPUs
  - Linear performance improvement expected
26
Summary
- First work on GPU sparse LU factorization
- Exploit parallelism of left-looking algorithm
- Blocked dense LU factorization
- 443 Gflop/s (single), 163 Gflop/s (double)
- Supplement to OpenCL BLAS
- Accelerate SPICE simulators
27
Reference
[SPICE] L. W. Nagel, “SPICE2: A computer program to simulate semiconductor circuits,” Ph.D. dissertation, University of California, Berkeley, 1975.
[SuperLU1999] J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, and J. W. H. Liu, “A supernodal approach to sparse partial pivoting,” SIAM J. Matrix Analysis and Applications, vol. 20, no. 3, pp. 720–755, 1999.
[Pardiso2002] O. Schenk and K. Gärtner, “Solving unsymmetric sparse systems of linear equations with PARDISO,” Computational Science - ICCS 2002, vol. 2330, pp. 355–363, 2002.
[G/P 1988] J. R. Gilbert and T. Peierls, “Sparse partial pivoting in time proportional to arithmetic operations,” SIAM J. Sci. Statist. Comput., vol. 9, pp. 862–874, 1988.
[KLU2010] T. A. Davis and E. Palamadai Natarajan, “Algorithm 907: KLU, a direct sparse solver for circuit simulation problems,” ACM Trans. Math. Softw., vol. 37, pp. 36:1–36:17, September 2010.
28
Reference
[Christen2007] M. Christen, O. Schenk, and H. Burkhart, “General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform,” 2007.
[Davis] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” to appear in ACM Transactions on Mathematical Software.
[Galoppo2005] N. Galoppo, N. K. Govindaraju, M. Henson, and D. Manocha, “LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware,” SC Conference, vol. 0, p. 3, 2005.
[Volkov2008] V. Volkov and J. Demmel, “LU, QR and Cholesky factorizations using vector capabilities of GPUs,” EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2008-49, May 2008.
[Tomov2010] S. Tomov, J. Dongarra, and M. Baboulin, “Towards dense linear algebra for hybrid GPU accelerated manycore systems,” Parallel Comput., vol. 36, pp. 232–240, June 2010.
29
Reference
[MC64] I. S. Duff and J. Koster, “The design and use of algorithms for permuting large entries to the diagonal of sparse matrices,” SIAM J. Matrix Anal. and Applics, no. 4, pp. 889–901, 1997.
[AMD] P. R. Amestoy, T. A. Davis, and I. S. Duff, “Algorithm 837: AMD, an approximate minimum degree ordering algorithm,” ACM Trans. Math. Softw., vol. 30, pp. 381–388, September 2004.
30
Thank you!
31
Sparse LU factorization – Terminology
- Elimination graph (EGraph) definition: there is an edge from j to k (j < k) iff U(j, k) != 0. In the following context, node = column.
- Level definition: the level of a node is the length of the longest path from any source node to that node. Source nodes have no incoming edges.
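The level definition translates directly into a longest-path computation. The sketch below assumes the dependencies are given as a list `deps`, where `deps[k]` holds the columns j < k with U(j, k) != 0; that input format and both function names are choices made for this example.

```python
def column_levels(deps):
    """Level of each column: longest path length from any source node to it."""
    level = [0] * len(deps)          # source nodes (no incoming edges) stay at level 0
    for k, preds in enumerate(deps):
        # deps[k] only contains columns j < k, so ascending k is a topological order
        if preds:
            level[k] = 1 + max(level[j] for j in preds)
    return level


def group_by_level(level):
    """Group column indices by level; columns within one group are independent."""
    groups = {}
    for k, lv in enumerate(level):
        groups.setdefault(lv, []).append(k)
    return [groups[lv] for lv in sorted(groups)]


# Hypothetical dependencies: column 2 needs columns 0 and 1, column 3 needs column 2.
deps = [[], [], [0, 1], [2]]
levels = column_levels(deps)          # [0, 0, 1, 2]
schedule = group_by_level(levels)     # [[0, 1], [2], [3]]
```

Columns in the same group have no edges between them, so they can be factorized in parallel, while the groups themselves must be respected in order.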
32
Sparse LU factorization - Experimental results
33
Dense LU factorization – Basic algorithm
Blocked LU factorization
34
Dense LU factorization – Basic algorithm