GPU Sparse LU Factorization and Its Application in Circuit Simulation
Ling Ren
Nano-scale Integrated Circuit and System Lab., Department of Electronic Engineering, Tsinghua University

Outline
- Background
- Sparse LU factorization
- Dense LU factorization
- Summary

Background
- SPICE: the most popular circuit simulator
- Simulating a VLSI circuit (~1 billion transistors) takes several days
- The bottleneck is sparse LU factorization
- Sparse LU factorization also arises in fluid dynamics, structural analysis, economics, ...

Sparse LU factorization – related works
- [SuperLU 1999]: sequential, multi-threaded, and distributed versions; incorporates supernodes, which are efficient for dense blocks
- [Pardiso 2002]: sequential, multi-threaded, distributed, and GPU [Christen2007] versions; adopts supernodes
- But supernodes rarely form in circuit matrices
- [KLU 2010]: optimized for circuit matrices; sequential only; uses the G/P left-looking algorithm [G/P 1988]; adopts BTF, without supernodes

Sparse LU factorization – left-looking algorithm
- Columns are processed sequentially
- When processing column k, all the columns on its left (1, 2, ..., k-1) are used to update column k
- Each update is a vector multiply-and-add (MAD); it performs more memory reads and writes than arithmetic, so the updates are memory-bound
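As an illustration, here is a minimal NumPy sketch of the left-looking update pattern on a dense matrix (no pivoting; the actual solver works on sparse CSC data with partial pivoting, and the function name is ours, so treat this as the dependency structure only):

```python
import numpy as np

def left_looking_lu(A):
    """Left-looking LU without pivoting, dense and sequential, purely to
    illustrate the update pattern: column k is modified by every earlier
    column j whose U(j, k) is nonzero."""
    n = A.shape[0]
    LU = A.astype(float).copy()
    for k in range(n):
        x = LU[:, k].copy()                   # gather column k
        for j in range(k):                    # "look left" at columns 0..k-1
            if x[j] != 0.0:                   # U(j, k) nonzero => column j updates k
                x[j+1:] -= x[j] * LU[j+1:, j] # one vector multiply-and-add (MAD)
        x[k+1:] /= x[k]                       # scale below the pivot to form L(:, k)
        LU[:, k] = x                          # scatter back: U above, L below
    return LU

A = np.array([[4., 1., 0.], [2., 5., 1.], [0., 3., 6.]])
LU = left_looking_lu(A)
L, U = np.tril(LU, -1) + np.eye(3), np.triu(LU)
assert np.allclose(L @ U, A)
```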

Algorithm description – EGraph
- Every column is updated with several columns on its left
- The nonzero structure of U determines the dependencies: an update from column j to column k corresponds to a nonzero U(j, k)
(Figure: (a) the upper triangular matrix U; (b) the EGraph, in which each vector MAD is an edge)

Algorithm analysis – two kinds of parallelism
- Divide the columns into levels: columns in the same level are independent of each other
- Cluster mode: the many columns within a level are factorized in parallel
- Pipeline mode: columns from different levels are overlapped, along the timing order (see the level-computation sketch below)
(Figure: columns 1-4 factorized in overlapped pipeline mode on threads 1 and 2)
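A small sketch of how the levels could be computed from the nonzero structure of U, following the level definition in the terminology slide at the end (the function name and the SciPy-based representation are ours):

```python
import numpy as np
import scipy.sparse as sp

def column_levels(U):
    """level(k) = 1 + max level over all j < k with U(j, k) != 0, or 0 if
    column k has no such dependency. Columns sharing a level are mutually
    independent (cluster mode); pipeline mode additionally overlaps columns
    from different levels."""
    Ucsc = sp.csc_matrix(U)
    n = Ucsc.shape[1]
    level = np.zeros(n, dtype=int)
    for k in range(n):
        rows = Ucsc.indices[Ucsc.indptr[k]:Ucsc.indptr[k+1]]
        deps = rows[rows < k]                 # columns that must finish before k
        if deps.size:
            level[k] = level[deps].max() + 1
    return level

U = np.array([[1., 1., 0.], [0., 1., 1.], [0., 0., 1.]])
print(column_levels(U))                       # [0 1 2]: a pure chain, no cluster parallelism
```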

Sparse LU factorization – workflow
(Figure: workflow of the solver)

Sparse LU factorization – preprocessing
- Preprocessing is performed only once, on the CPU:
- MC64, to ensure numerical stability [MC64]
- Approximate Minimum Degree, to reduce fill-ins [AMD]
- Pre-factorization (a numeric factorization with partial pivoting), to calculate the symbolic structure of L and U
- Sorting the nonzeros of L and U (introduced later)
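MC64 and AMD themselves are not bundled with SciPy, but SciPy's SuperLU wrapper shows the same overall shape of this stage: choose a fill-reducing ordering, then factorize once so the structure of L and U is known. A rough analogue, not the authors' actual flow:

```python
import scipy.sparse as sp
from scipy.sparse.linalg import splu

# Build a random sparse test matrix with a safely nonzero diagonal.
A = (sp.random(200, 200, density=0.02) + 10 * sp.eye(200)).tocsc()

# COLAMD plays the fill-reducing role that AMD plays in the slides;
# the factorization fixes the nonzero structure of L and U.
lu = splu(A, permc_spec='COLAMD')
print('nnz(L) =', lu.L.nnz, ', nnz(U) =', lu.U.nnz)
```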

Sparse LU factorization – on GPU
- GPU inputs: the locations and values of the nonzeros in A; the locations of the nonzeros in L and U; the Escheduler
- GPU outputs: the values of the nonzeros in L and U
- A, L and U are stored in CSC (Compressed Sparse Column) format
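For reference, the three arrays that make up the CSC format, shown on a tiny example with scipy.sparse (illustrative only, not the solver's actual data):

```python
import numpy as np
import scipy.sparse as sp

# CSC keeps three arrays: data (nonzero values stored column by column),
# indices (the row of each nonzero), and indptr (where each column starts).
A = np.array([[4., 0., 1.],
              [2., 5., 0.],
              [0., 3., 6.]])
Acsc = sp.csc_matrix(A)
print(Acsc.data)      # [4. 2. 5. 3. 1. 6.]
print(Acsc.indices)   # [0 1 1 2 0 2]
print(Acsc.indptr)    # [0 2 4 6]  -> column k occupies indptr[k]:indptr[k+1]
```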

Sparse LU factorization – avoiding deadlock
- In traditional GPU programs, some wavefronts are inactive at the beginning (because of limited resources, etc.); they wait for active wavefronts to finish and then become active
- In sparse LU, an active wavefront may spin-wait for a column owned by a wavefront that has not started yet, which would deadlock; we must therefore ensure that all wavefronts are active from the beginning

Sparse LU factorization – data formats
- Data format for intermediate results: dense arrays vs. CSC
- CSC (Compressed Sparse Column) can be put in local memory, but indexed accesses are inconvenient (binary search), and using too much local memory reduces the number of active work-groups, which leads to severe performance loss
- Dense arrays outperform the CSC format by 2.5x

Sparse LU factorization – data locality
- Global memory bandwidth is higher when consecutive work-items access consecutive addresses
- To improve data locality, the nonzeros of L and U, which are out of order after preprocessing, are sorted by row index (sketched below)
- This yields a 1.7x speedup with negligible overhead: the sort is performed only once, as part of preprocessing
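A minimal sketch of that sorting pass on raw CSC arrays (the helper name is ours); on the GPU this ordering lets consecutive work-items read consecutive addresses:

```python
import numpy as np

def sort_columns_by_row(indptr, indices, data):
    """In-place: sort the nonzeros of each CSC column by row index, so a
    column's entries are laid out in ascending memory order. Run once on
    the CPU as part of preprocessing."""
    for k in range(len(indptr) - 1):
        lo, hi = indptr[k], indptr[k + 1]
        order = np.argsort(indices[lo:hi], kind='stable')
        indices[lo:hi] = indices[lo:hi][order]
        data[lo:hi] = data[lo:hi][order]
```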

Experimental setup
- CPU: 2 Xeon E5405 CPUs (8 cores in total), 2 x 6 MB L2 cache, 16 GB RAM
- GPU: AMD Radeon 5870
- Test matrices: University of Florida Sparse Matrix Collection [Davis]

Sparse LU factorization – experimental results
- GPU speedups are positively related to the number of floating-point operations (flops)

Sparse LU factorization – experimental results
- The matrices are divided into 4 groups, the first three according to Mflops; the GPU speedup is positively related to Mflops
- The 4th group produces denormal floating-point numbers, which represent extremely small values; they are processed very slowly on the CPU but at full speed on the GPU, so this group shows very high speedups
- Full-speed denormal support is an advantage of the GPU in sparse LU and in scientific computing generally
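A quick NumPy demonstration of what a denormal (subnormal) double looks like:

```python
import numpy as np

print(np.finfo(np.float64).tiny)   # 2.2250738585072014e-308, smallest *normal* double
tiny = np.float64(1e-310)          # below that: a denormal, still representable
print(tiny, tiny * 0.5)            # nonzero results -- handled at full speed on the GPU,
                                   # but often via slow fallback paths on CPUs
```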

Sparse LU factorization – experimental results
- Average speedup of each group
(Table: per group and overall, GPU memory bandwidth in GB/s and speedups over 1 CPU, 4 CPUs, 8 CPUs, and KLU; the numeric values were not preserved in the transcript)

Scalability – BBD
- Problem: how to use multiple GPUs?
- Circuit-partition-based simulation algorithm with a bordered-block-diagonal (BBD) matrix
- The diagonal blocks are factorized independently (e.g., one per GPU)
- But the border block A_n becomes dense, so we need dense LU factorization
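A toy NumPy sketch of the BBD idea under a simplified block view: each diagonal block (with its border strips) is eliminated independently, and the corner block accumulates the Schur-complement updates and fills in, which is why it needs a dense LU. All names and shapes here are illustrative:

```python
import numpy as np

def bbd_corner(D_blocks, B_blocks, C_blocks, S):
    """Eliminate each diagonal block D_i independently (one per GPU in the
    slides' setting) and push its Schur update C_i D_i^{-1} B_i into the
    corner block S, which becomes dense."""
    for D, B, C in zip(D_blocks, B_blocks, C_blocks):
        S = S - C @ np.linalg.solve(D, B)     # independent across blocks
    return S                                   # factorize with dense LU next

rng = np.random.default_rng(0)
Ds = [rng.random((4, 4)) + 4 * np.eye(4) for _ in range(3)]
Bs = [rng.random((4, 2)) for _ in range(3)]
Cs = [rng.random((2, 4)) for _ in range(3)]
S = bbd_corner(Ds, Bs, Cs, rng.random((2, 2)) + 2 * np.eye(2))
```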

Dense LU factorization – blocked algorithm
- Three core operations: dense LU factorization of the diagonal block, triangular matrix inversion, and matrix multiplication (GEMM)
- Suitable for the GPU: GEMM is the most frequent operation and is very efficient on the GPU, reaching 920 Gflop/s (single precision) and 290 Gflop/s (double precision)
(Figure: at each step, the finished part, the diagonal-block LU + triangular inverses, and the GEMM trailing update)
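A compact NumPy/SciPy sketch of a blocked right-looking LU. It differs from the slides in one respect, stated plainly: the slides invert the triangular factors so that the panel updates also become GEMMs, whereas this sketch uses triangular solves and omits pivoting for brevity:

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_lu(A, b=64):
    """Blocked LU without pivoting. Per block step: (1) LU of the diagonal
    block, (2) triangular solves for the block column and row, (3) a single
    GEMM on the trailing submatrix -- the step that dominates the flops and
    runs near peak on the GPU."""
    A, n = A.copy(), A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        for j in range(k, e):                                  # (1) unblocked LU of A[k:e, k:e]
            A[j+1:e, j] /= A[j, j]
            A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
        if e < n:                                              # skip when no trailing matrix remains
            L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            U11 = np.triu(A[k:e, k:e])
            # (2) block column: X @ U11 = A[e:, k:e]; block row: L11 @ Y = A[k:e, e:]
            A[e:, k:e] = solve_triangular(U11, A[e:, k:e].T, trans='T', lower=False).T
            A[k:e, e:] = solve_triangular(L11, A[k:e, e:], lower=True)
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]               # (3) trailing GEMM update
    return A

A = np.random.default_rng(1).random((200, 200)) + 200 * np.eye(200)
LU = blocked_lu(A)
L, U = np.tril(LU, -1) + np.eye(200), np.triu(LU)
assert np.allclose(L @ U, A)
```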

Dense LU factorization – performance
- 443 Gflop/s (single precision), 163 Gflop/s (double precision)
(Figure: performance versus matrix size)

Dense LU factorization – related works
- Comparison with previous studies

Performance of dense LU factorization (Gflop/s); cells left blank were not preserved in the transcript:

Work            Hardware                    Single   Double
[Galoppo2005]   GTX
[Volkov2008]    GTX
[Tomov2010]     8 Xeon Harpertown           100      50
[Tomov2010]     GTX
[Tomov2010]     8 Xeon Harpertown + GTX
Ours            Radeon 5870                 443      163

Dense LU factorization – further improvement
- CPU BLAS for Gaussian elimination: 100 Gflop/s; our GEMM can be further improved
- Scalability to multiple GPUs: blocked dense LU consists of independent GEMMs that can run on multiple GPUs, and the diagonal blocks in BBD can also be distributed across GPUs; linear performance improvement is expected

Summary
- First work on GPU sparse LU factorization; exploits the parallelism of the left-looking algorithm
- Blocked dense LU factorization: 443 Gflop/s (single), 163 Gflop/s (double); a supplement to OpenCL BLAS
- Accelerates SPICE simulators

References
[SPICE] L. W. Nagel, "SPICE2: A computer program to simulate semiconductor circuits," Ph.D. dissertation, University of California, Berkeley, 1975.
[SuperLU1999] J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, and J. W. H. Liu, "A supernodal approach to sparse partial pivoting," SIAM J. Matrix Analysis and Applications, vol. 20, no. 3, pp. 720–755, 1999.
[Pardiso2002] O. Schenk and K. Gartner, "Solving unsymmetric sparse systems of linear equations with PARDISO," Computational Science – ICCS 2002, vol. 2330, pp. 355–363, 2002.
[G/P 1988] J. R. Gilbert and T. Peierls, "Sparse partial pivoting in time proportional to arithmetic operations," SIAM J. Sci. Statist. Comput., vol. 9, pp. 862–874, 1988.
[KLU2010] T. A. Davis and E. Palamadai Natarajan, "Algorithm 907: KLU, a direct sparse solver for circuit simulation problems," ACM Trans. Math. Softw., vol. 37, pp. 36:1–36:17, September 2010.

[Christen2007] M. Christen, O. Schenk, and H. Burkhart, "General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform," 2007.
[Davis] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," to appear in ACM Transactions on Mathematical Software.
[Galoppo2005] N. Galoppo, N. K. Govindaraju, M. Henson, and D. Manocha, "LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware," SC Conference, vol. 0, p. 3, 2005.
[Volkov2008] V. Volkov and J. Demmel, "LU, QR and Cholesky factorizations using vector capabilities of GPUs," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS, May 2008.
[Tomov2010] S. Tomov, J. Dongarra, and M. Baboulin, "Towards dense linear algebra for hybrid GPU accelerated manycore systems," Parallel Comput., vol. 36, pp. 232–240, June 2010.

[MC64] I. S. Duff and J. Koster, "The design and use of algorithms for permuting large entries to the diagonal of sparse matrices," SIAM J. Matrix Anal. and Applics., vol. 20, no. 4, pp. 889–901, 1999.
[AMD] P. R. Amestoy, T. A. Davis, and I. S. Duff, "Algorithm 837: AMD, an approximate minimum degree ordering algorithm," ACM Trans. Math. Softw., vol. 30, pp. 381–388, September 2004.

Thank you!
Nano-scale Integrated Circuit and System Lab., EE Department, Tsinghua University

Sparse LU factorization – terminology
- Elimination Graph (EGraph) definition: there is an edge from j to k iff U(j, k) != 0; in this context, node = column
- Level definition: the level of a node is the length of the longest path to it from any source node (a source node has no incoming edges)

Sparse LU factorization – experimental results
(Figure)

Dense LU factorization – basic algorithm
- Blocked LU factorization
(Figure: the blocked LU factorization steps)
