Sparse Direct Solvers on High Performance Computers X. Sherry Li CS267: Applications of Parallel Computers March 2, 2005

Review of Gaussian Elimination (GE)
• Solving a system of linear equations Ax = b
• First step of GE: write A = [α w^T; v B] = [1 0; v/α I] · [α w^T; 0 C], where C = B − v·w^T/α
• Repeat GE on C
• Results in an LU factorization (A = LU): L lower triangular with unit diagonal, U upper triangular
• Then x is obtained by solving two triangular systems with L and U
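
As a concrete illustration of the last two bullets, here is a minimal dense sketch in Python (NumPy/SciPy), using a made-up 3x3 system: factor A = PLU, then solve Ax = b with one forward and one backward triangular solve.

    import numpy as np
    from scipy.linalg import lu, solve_triangular

    # Toy 3x3 system (values are arbitrary).
    A = np.array([[4.0, 2.0, 1.0],
                  [2.0, 5.0, 3.0],
                  [1.0, 3.0, 6.0]])
    b = np.array([1.0, 2.0, 3.0])

    P, L, U = lu(A)                               # A = P @ L @ U, L has unit diagonal
    y = solve_triangular(L, P.T @ b, lower=True)  # forward substitution: L y = P^T b
    x = solve_triangular(U, y, lower=False)       # back substitution:    U x = y
    print(np.allclose(A @ x, b))                  # True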

Sparse GE
• Sparse systems are ubiquitous in science and engineering
• Example: A of dimension 10^5, with only 10-100 nonzeros per row
• Goal: store only the nonzeros and perform operations only on the nonzeros
• Fill-in: an original zero entry a_ij becomes nonzero in L and U
• (Figure: natural order gives nonzeros = 233; minimum degree order gives nonzeros = 207)
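
The effect of ordering on fill-in can be reproduced with SciPy's splu, which wraps sequential SuperLU. The 2D model Laplacian below is an assumed stand-in for the matrices pictured on the slide; only the trend (fill depends on the ordering) carries over.

    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    # 5-point Laplacian on an n x n grid, a standard sparse test matrix.
    n = 30
    T = sp.diags([-1, 4, -1], [-1, 0, 1], shape=(n, n))
    S = sp.diags([-1, -1], [-1, 1], shape=(n, n))
    A = (sp.kron(sp.identity(n), T) + sp.kron(S, sp.identity(n))).tocsc()

    # Fill-in in L and U depends strongly on the column ordering.
    for order in ("NATURAL", "MMD_AT_PLUS_A", "COLAMD"):
        lu = splu(A, permc_spec=order)
        print(order, "nnz(A) =", A.nnz, " nnz(L)+nnz(U) =", lu.L.nnz + lu.U.nnz)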

Numerical Stability: Need for Pivoting
• One step of GE: A = [α w^T; v B], with update C = B − v·w^T/α
• If α is small, the multipliers v/α are large and some entries of B may be lost in the update
• Pivoting: swap the current diagonal entry with a larger entry from the remaining part of the matrix
• Goal: control element growth in L & U
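
A small NumPy sketch of why pivoting is needed; the 2x2 matrix is a textbook toy example, not one from the slides.

    import numpy as np

    A = np.array([[1e-20, 1.0],
                  [1.0,   1.0]])
    b = np.array([1.0, 2.0])        # exact solution is very close to [1, 1]

    # GE without pivoting: the multiplier v/alpha is huge and B's entries are lost.
    m   = A[1, 0] / A[0, 0]         # 1e20
    u22 = A[1, 1] - m * A[0, 1]     # 1 - 1e20: the original "1" disappears
    x2  = (b[1] - m * b[0]) / u22
    x1  = (b[0] - A[0, 1] * x2) / A[0, 0]
    print("no pivoting:  ", [x1, x2])               # [0.0, 1.0], which is wrong

    # Partial pivoting (swap the rows first) gives the right answer.
    print("with pivoting:", np.linalg.solve(A, b))  # approximately [1.0, 1.0]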

Dense versus Sparse GE
• Dense GE: P_r A P_c = LU
  • P_r and P_c are permutations chosen to maintain stability
  • Partial pivoting suffices in most cases: P_r A = LU
• Sparse GE: P_r A P_c = LU
  • P_r and P_c are chosen to maintain stability and preserve sparsity

Algorithmic Issues in Sparse GE
• Minimize the number of fill-ins, maximize parallelism
  • The sparsity structure of L & U depends on that of A, which can be changed by row/column permutations (vertex relabeling of the underlying graph)
  • Ordering (combinatorial algorithms; NP-complete to find the optimum [Yannakakis ’81]; use heuristics)
• Predict the fill-in positions in L & U
  • Symbolic factorization (combinatorial algorithms)
• Design efficient data structures for storing and quickly retrieving the nonzeros
  • Compressed storage schemes (illustrated below)
• Perform the factorization and triangular solutions
  • Numerical algorithms (floating-point operations only on nonzeros)
  • How and when to pivot?
  • Usually dominates the total runtime
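
For the compressed-storage item, here is what the CSC (compressed sparse column) format looks like for a small made-up matrix, using SciPy; CSC is the input format of SuperLU-style codes.

    import numpy as np
    import scipy.sparse as sp

    A = np.array([[4., 0., 1.],
                  [0., 5., 2.],
                  [1., 0., 6.]])
    Acsc = sp.csc_matrix(A)          # store only the nonzeros, column by column

    print(Acsc.data)     # nonzero values, column-major:  [4. 1. 5. 1. 2. 6.]
    print(Acsc.indices)  # row index of each nonzero:     [0 2 1 0 1 2]
    print(Acsc.indptr)   # where each column starts:      [0 2 3 6]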

Numerical Pivoting
• The goal of pivoting is to control element growth in L & U, for stability
  • For sparse factorizations, the pivoting rule is often relaxed to trade stability for better sparsity and parallelism (e.g., threshold pivoting, static pivoting, ...)
• Partial pivoting used in sequential SuperLU (GEPP)
  • Can force diagonal pivoting (controlled by a diagonal threshold)
  • Hard to implement scalably for sparse factorization
• Static pivoting used in SuperLU_DIST (GESP)
  • Before factoring, scale and permute A to maximize the diagonal: P_r D_r A D_c = A’
  • P_r is found by a weighted bipartite matching algorithm on G(A)
  • During the factorization A’ = LU, replace tiny pivots by √ε · ‖A‖, without changing the data structures for L & U
  • If needed, use a few steps of iterative refinement to improve the first solution
  • Quite stable in practice
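
Sequential SuperLU's threshold-controlled diagonal preference can be tried through SciPy's splu wrapper (the diag_pivot_thresh option); the 3x3 matrix with a small diagonal entry is made up for illustration.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    A = sp.csc_matrix(np.array([[1e-3, 2.0, 0.0],
                                [3.0,  4.0, 1.0],
                                [0.0,  1.0, 5.0]]))

    # threshold 1.0: ordinary partial pivoting (GEPP);
    # threshold 0.0: keep the diagonal pivot whenever it is nonzero.
    lu_pp   = splu(A, permc_spec="NATURAL", diag_pivot_thresh=1.0)
    lu_diag = splu(A, permc_spec="NATURAL", diag_pivot_thresh=0.0)
    print("row permutation, partial pivoting :", lu_pp.perm_r)
    print("row permutation, diagonal pivoting:", lu_diag.perm_r)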

Static Pivoting via Weighted Bipartite Matching
• Maximize the diagonal entries: the sum, or the product (sum of logs)
• Hungarian algorithm or the like (MC64): O(n*(m+n)*log n)
• Auction algorithm (more parallel): O(n*m*log(n*C))
• (Figure: a small example matrix A and its bipartite graph G(A), with row vertices on one side and column vertices on the other)
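
The same objective (maximize the product of the matched entries, i.e. the sum of their logs) can be demonstrated with SciPy's dense assignment solver; this is a toy stand-in for MC64 or an auction code, and the 3x3 matrix is made up.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    A = np.array([[0.1, 2.0, 0.0],
                  [4.0, 0.0, 1.0],
                  [0.0, 0.5, 3.0]])

    BIG  = 1e9                                        # penalty for structural zeros
    cost = np.where(A != 0, -np.log(np.abs(A) + 1e-300), BIG)
    rows, cols = linear_sum_assignment(cost)          # min-cost perfect matching

    Pr = np.zeros_like(A)                             # row permutation P_r
    Pr[cols, rows] = 1.0                              # matched entries move to the diagonal
    print(np.diag(Pr @ A))                            # [4. 2. 3.]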

Numerical Accuracy: GESP versus GEPP

Structural Gaussian Elimination: Symmetric Case
• Work with the undirected graph of A
• After a vertex is eliminated, all its neighbors become a clique
• The edges of the clique are the potential fills (an upper bound!)
• (Figure: eliminating vertex 1, adjacent to i, j, k, connects i, j, k into a clique; the matrix picture shows the corresponding fill entries)
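
A tiny Python sketch of the clique rule, with the graph stored as adjacency sets (vertex labels mirror the slide's picture but are otherwise made up): eliminating a vertex connects its neighbors, and the new edges are the fill.

    from itertools import combinations

    def eliminate(adj, v):
        """Remove v from the graph; return the fill edges its elimination creates."""
        nbrs = adj.pop(v)
        fill = set()
        for a, b in combinations(sorted(nbrs), 2):   # neighbors become a clique
            if b not in adj[a]:
                adj[a].add(b); adj[b].add(a)
                fill.add((a, b))
        for u in nbrs:
            adj[u].discard(v)
        return fill

    # Vertex 1 is adjacent to i, j, k; j and k are already connected.
    adj = {1: {"i", "j", "k"}, "i": {1}, "j": {1, "k"}, "k": {1, "j"}}
    print(eliminate(adj, 1))   # fill edges {('i','j'), ('i','k')} (set order may vary)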

Minimum Degree Ordering
• Greedy approach: do the best locally
• At each step: eliminate the vertex with the smallest degree, then update the degrees of its neighbors
• A straightforward implementation is slow and requires too much memory: the newly added (fill) edges can outnumber the eliminated vertices
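
A naive greedy minimum-degree ordering, written out for a toy 2x3 grid graph (a sketch only; real codes use the quotient-graph machinery of the next slide).

    from itertools import combinations

    def min_degree_order(adj):
        adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
        order = []
        while adj:
            v = min(adj, key=lambda u: len(adj[u]))       # vertex of smallest degree
            nbrs = adj.pop(v)
            for a, b in combinations(nbrs, 2):            # its neighbors become a clique
                adj[a].add(b); adj[b].add(a)
            for u in nbrs:
                adj[u].discard(v)
            order.append(v)
        return order

    # 2x3 grid graph: 0-1-2 over 3-4-5, with vertical edges 0-3, 1-4, 2-5.
    adj = {0: {1, 3}, 1: {0, 2, 4}, 2: {1, 5},
           3: {0, 4}, 4: {1, 3, 5}, 5: {2, 4}}
    print(min_degree_order(adj))   # low-degree (corner) vertices are eliminated first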

Minimum Degree Ordering (cont.)
• Use a quotient graph as a compact representation [George/Liu ’78]
  • The collection of cliques resulting from the eliminated vertices affects the degree of an uneliminated vertex
  • Represent each connected component of the eliminated subgraph by a single “supervertex”
  • The storage required to implement the quotient-graph model is bounded by the size of A
• Large body of literature on implementation variants: Tinney/Walker ’67, George/Liu ’79, Liu ’85, Amestoy/Davis/Duff ’94, Ashcraft ’95, Duff/Reid ’95, et al.

Nested Dissection Ordering
• Global graph-partitioning approach: top-down, divide-and-conquer
• Nested dissection [George ’73, Lipton/Rose/Tarjan ’79]
  • First level: a separator S splits the graph into parts A and B
  • Recurse on A and B
  • Goal: find the smallest possible separator S at each level
• Multilevel schemes [Hendrickson/Leland ’94, Karypis/Kumar ’95]
• Spectral bisection [Simon et al. ’90-’95]
• Geometric and spectral bisection [Chan/Gilbert/Teng ’94]
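
For a regular grid the separators can be chosen geometrically, which makes nested dissection easy to sketch; general graphs need a partitioner such as METIS. The recursion below numbers the two halves first and the separator last.

    def nested_dissection(rows, cols):
        """Grid points (r, c) of a rows x cols grid in nested-dissection order."""
        if len(rows) <= 1 or len(cols) <= 1:
            return [(r, c) for r in rows for c in cols]
        if len(rows) >= len(cols):                    # cut along the longer side
            mid = len(rows) // 2
            sep   = [(rows[mid], c) for c in cols]    # separator S: one grid line
            part1 = nested_dissection(rows[:mid], cols)
            part2 = nested_dissection(rows[mid + 1:], cols)
        else:
            mid = len(cols) // 2
            sep   = [(r, cols[mid]) for r in rows]
            part1 = nested_dissection(rows, cols[:mid])
            part2 = nested_dissection(rows, cols[mid + 1:])
        return part1 + part2 + sep                    # A, then B, then S last

    order = nested_dissection(list(range(7)), list(range(7)))
    print(order[-7:])   # the last 7 points are the top-level separator (grid row 3)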

Ordering Based on Graph Partitioning

Ordering for LU (unsymmetric)
• Can use a symmetric ordering on a symmetrized matrix
• Case of partial pivoting (sequential SuperLU): use an ordering based on A^T A
  • If R^T R = A^T A and PA = LU, then for any row permutation P, struct(L+U) ⊆ struct(R^T + R) [George/Ng ’87]
  • Making R sparse tends to make L & U sparse
• Case of static pivoting (SuperLU_DIST): use an ordering based on A^T + A
  • If R^T R = A^T + A and A = LU, then struct(L+U) ⊆ struct(R^T + R)
  • Making R sparse tends to make L & U sparse
• Can find a better ordering based solely on A, without symmetrization [Amestoy/Li/Ng ’03]
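
The column-ordering choices discussed here are exposed by sequential SuperLU through SciPy's splu (permc_spec): MMD on A^T A, MMD on A^T + A, or COLAMD, which works on A directly. The random unsymmetric matrix below is an assumed stand-in, not a benchmark.

    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    n = 400
    A = (sp.random(n, n, density=0.01, random_state=0)
         + sp.identity(n)).tocsc()                 # unsymmetric, almost surely nonsingular

    for order in ("MMD_ATA", "MMD_AT_PLUS_A", "COLAMD"):
        lu = splu(A, permc_spec=order)
        print(order, "nnz(L)+nnz(U) =", lu.L.nnz + lu.U.nnz)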

Ordering for Unsymmetric Matrices
• Still wide open...
• Simple extension: symmetric ordering using A^T + A
  • Greedy algorithms, graph partitioning, or hybrids
  • Problem: the unsymmetric structure is not respected!
• We developed an unsymmetric variant of the minimum-degree algorithm based solely on A [Amestoy/Li/Ng ’02] (a.k.a. a Markowitz scheme)

Structural Gaussian Elimination: Unsymmetric Case
• Work with the bipartite graph of row and column vertices
• After a vertex is eliminated, all the row and column vertices adjacent to it become fully connected, forming a “bi-clique” (assuming a diagonal pivot)
• The edges of the bi-clique are the potential fills (an upper bound!)
• (Figure: eliminating vertex 1 connects its adjacent row vertices r1, r2 with its adjacent column vertices c1, c2, c3)

Results of Markowitz Ordering
• Extend the quotient-graph model to a bipartite quotient graph
• Same asymptotic complexity as symmetric minimum degree
  • Space is bounded by 2*(m + n)
  • Time is bounded by O(n * m)
• For 50+ unsymmetric matrices, compared with MD on A^T + A:
  • Reduction in fill: average 0.88, best 0.38
  • Reduction in floating-point operations: average 0.77, best 0.01
• How about graph partitioning? Use the directed graph

Techniques to Reduce Memory Access & Communication Cost
• Blocking, to increase the number of floating-point operations performed per memory access
• Aggregating small messages into one larger message, to reduce the cost due to latency
• Both are well done in LAPACK and ScaLAPACK for dense and banded matrices
• Adopted in the new generation of sparse software
  • Performance is much more sensitive to latency in the sparse case

Blocking in Sparse GE
• Supernodes exploit dense submatrices in the L & U factors
• Benefits of supernodes:
  • Permit the use of Level 3 BLAS (e.g., matrix-matrix multiply)
  • Reduce inefficient indirect addressing
  • Reduce symbolic time by traversing the supernodal graph
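
Supernodes can be spotted directly in a computed L factor: maximal runs of consecutive columns whose row structure below the diagonal block is identical, so they can be handled as one dense panel by Level-3 BLAS. A structural sketch only, on an assumed model problem.

    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    # Model matrix: 5-point Laplacian on an n x n grid.
    n = 20
    T = sp.diags([-1, 4, -1], [-1, 0, 1], shape=(n, n))
    S = sp.diags([-1, -1], [-1, 1], shape=(n, n))
    A = (sp.kron(sp.identity(n), T) + sp.kron(S, sp.identity(n))).tocsc()

    L = splu(A, permc_spec="MMD_AT_PLUS_A").L.tocsc()

    def below_diag(j):                       # row indices of L[:, j] strictly below j
        rows = L.indices[L.indptr[j]:L.indptr[j + 1]]
        return set(rows[rows > j])

    supernodes, start = [], 0
    for j in range(1, L.shape[0]):
        if below_diag(j - 1) != below_diag(j) | {j}:   # structure changes: close the run
            supernodes.append((start, j - 1))
            start = j
    supernodes.append((start, L.shape[0] - 1))
    print(len(supernodes), "supernodes covering", L.shape[0], "columns")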

Speedup Over Unblocked Code
• Matrices sorted in increasing #flops/nonzero
• Up to 40% of machine peak on large sparse matrices on the IBM RS6000/590 and MIPS R8000; 25% on the Alpha 21164

Parallel Task Scheduling for SMPs (in SuperLU_MT)
• The elimination tree exhibits both parallelism and dependencies
• Scheduling loop (per worker thread):

    Shared task queue initialized by the leaves
    while ( there are more panels ) do
        panel := GetTask( queue )
        (1) panel_symbolic_factor( panel )
            Skip all BUSY descendant supernodes
        (2) panel_numeric_factor( panel )
            Perform updates from all DONE supernodes
            Wait for BUSY supernodes to become DONE
        (3) inner_factor( panel )
    end while

• Up to 25-30% of machine peak on 20 processors of the Cray C90/J90 and SGI Origin
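
The shared-task-queue idea can be mimicked with a thread pool: leaves of the elimination tree seed the queue, and a parent is enqueued once all of its children are DONE. The tree, the panel names, and the sleep standing in for numeric work are all made up.

    import queue, threading, time

    parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: None}   # toy elimination tree
    pending = {v: 0 for v in parent}                          # unfinished children per node
    for v, p in parent.items():
        if p is not None:
            pending[p] += 1

    ready = queue.Queue()
    for v, c in pending.items():
        if c == 0:
            ready.put(v)                 # shared task queue initialized by the leaves

    NWORKERS = 4
    lock = threading.Lock()

    def worker():
        while True:
            v = ready.get()
            if v is None:                            # sentinel: shut down
                return
            time.sleep(0.01)                         # stands in for panel_numeric_factor(v)
            print("factored panel", v)
            p = parent[v]
            with lock:
                if p is None:                        # root finished: stop everyone
                    for _ in range(NWORKERS):
                        ready.put(None)
                else:
                    pending[p] -= 1
                    if pending[p] == 0:              # all children DONE: parent is ready
                        ready.put(p)

    threads = [threading.Thread(target=worker) for _ in range(NWORKERS)]
    for t in threads: t.start()
    for t in threads: t.join()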

Parallelism from the Separator Tree
• Comes from a graph-partitioning type of ordering

Matrix Distribution on a Large Distributed-Memory Machine
• 2D block cyclic is recommended for many linear algebra algorithms
• Better load balance, less communication, and BLAS-3 friendly
• (Figure: 1D blocked, 1D cyclic, 1D block cyclic, and 2D block cyclic layouts)
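
The mapping itself is one line: block (I, J) lives on process (I mod Pr, J mod Pc) of a Pr x Pc process grid, the convention used by ScaLAPACK and, per supernodal block, by SuperLU_DIST. A small illustration with made-up sizes:

    import numpy as np

    def owner(I, J, Pr, Pc):
        """Process coordinates owning block (I, J) in a 2D block-cyclic layout."""
        return (I % Pr, J % Pc)

    Pr, Pc, nblocks = 2, 3, 6
    grid = np.empty((nblocks, nblocks), dtype=object)
    for I in range(nblocks):
        for J in range(nblocks):
            grid[I, J] = owner(I, J, Pr, Pc)
    print(grid)        # each of the 6 processes recurs in a cyclic 2D pattern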

2D Block Cyclic Layout for Sparse L and U (in SuperLU_DIST)
• Better for GE scalability and load balance

Scalability and Isoefficiency Analysis
• Model problem: matrix from an 11-point Laplacian on a k x k x k (3D) mesh; nested dissection ordering; N = k^3
  • Factor nonzeros: O(N^(4/3))
  • Number of floating-point operations: O(N^2)
  • Total communication overhead: O(N^(4/3) · √P) (assuming the P processors are arranged as a √P x √P grid)
• Isoefficiency function: maintain constant efficiency if the work grows proportionally with the overhead, i.e., N^2 ∝ N^(4/3) · √P
  • This is equivalent to N^(4/3) ∝ P (memory-processor relation): parallel efficiency can be kept constant if the memory per processor is constant, the same as for dense LU in ScaLAPACK
  • Equivalently, N^2 ∝ P^(3/2) (work-processor relation)

Scalability
• 3D K x K x K cubic grids; scale N^2 = K^6 with P for constant work per processor
• Achieved 12.5 and 21.2 Gflops on 128 processors
• Performance is sensitive to communication latency
  • Cray T3E latency: 3 microseconds (~2702 flops)
  • IBM SP latency: 8 microseconds (~ flops)

Irregular Matrices

    Matrix    Source        Symm   N        nz(A)   nz(L+U)   Flops
    BBMAT     Fluid flow    .54    38,…     …M      40.2M     31.2G
    ECL32     Device sim.   .93    51,993   .38M    42.7M     68.4G
    TWOTONE   Circuit sim.  .43    120,…    …M      11.9M     8.0G

Adoptions of SuperLU
• Industrial
  • FEMLAB
  • HP Mathematical Library
  • Mathematica
  • NAG
  • Numerical Python
• Academic/Lab
  • In other ACTS tools: PETSc, hypre
  • M3D, NIMROD (simulation of fusion reactor plasmas)
  • Omega3P (accelerator design, SLAC)
  • OpenSees (earthquake simulation, UCB)
  • DSpice (parallel circuit simulation, SNL)
  • Trilinos (object-oriented framework encompassing various solvers, SNL)
  • NIKE (finite element code for structural mechanics, LLNL)

Summary
• An important kernel for science and engineering applications, used in practice on a regular basis
• A good implementation on high-performance machines needs a large set of tools from computer science and numerical linear algebra
• Performance is more sensitive to latency than in the dense case
• For a survey of other sparse direct solvers (LL^T, LDL^T, LU), see the “Eigentemplates” book

The End

Application 1: Quantum Mechanics
• Scattering in a quantum system of three charged particles
• The simplest example is ionization of a hydrogen atom by collision with an electron: e⁻ + H → H⁺ + 2e⁻
• Seek the particles’ wave functions, represented by the time-independent Schrödinger equation
• First solution to this long-standing unsolved problem [Rescigno, McCurdy, et al., Science, 24 Dec 1999]

Quantum Mechanics (cont.)
• Finite differences lead to complex, unsymmetric, very ill-conditioned systems
• Diagonal blocks have the structure of 2D finite-difference Laplacian matrices; very sparse: at most 13 nonzeros per row
• Each off-diagonal block is a diagonal matrix
• Between 6 and 24 blocks, each of order between 200K and 350K
• Total dimension up to 8.4 M
• Too much fill if a direct method is applied to the whole matrix...
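
A SciPy sketch of that block structure, at a far smaller and made-up size (the real blocks are of order 200K-350K): 2D Laplacian diagonal blocks coupled by diagonal off-diagonal blocks.

    import scipy.sparse as sp

    def laplacian2d(k):                     # 5-point Laplacian on a k x k grid
        T = sp.diags([-1, 4, -1], [-1, 0, 1], shape=(k, k))
        S = sp.diags([-1, -1], [-1, 1], shape=(k, k))
        return sp.kron(sp.identity(k), T) + sp.kron(S, sp.identity(k))

    nblocks, k = 4, 10                      # 4 diagonal blocks, each of order k*k
    D = laplacian2d(k)
    C = 0.1 * sp.identity(k * k)            # diagonal coupling block (made-up weight)

    blocks = [[None] * nblocks for _ in range(nblocks)]
    for i in range(nblocks):
        blocks[i][i] = D
        if i + 1 < nblocks:
            blocks[i][i + 1] = C
            blocks[i + 1][i] = C
    A = sp.bmat(blocks, format="csc")
    print(A.shape, A.nnz)                   # very sparse block-tridiagonal matrix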

SuperLU_DIST as Preconditioner
• SuperLU_DIST as a block-diagonal preconditioner for the CGS iteration: M^-1 A x = M^-1 b, with M = diag(A_11, A_22, A_33, ...)
• Run multiple SuperLU_DIST factorizations simultaneously, one per diagonal block
  • No pivoting, no iterative refinement
• 12 to 35 CGS iterations, 1-2 minutes per iteration using 64 IBM SP processors
• Total time: 0.5 to a few hours
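
A serial SciPy sketch of the same idea (the slide runs SuperLU_DIST in parallel): factor each diagonal block with a sparse LU and use the block-diagonal solve as the preconditioner M^-1 inside CGS. The block matrix is the toy one built in the previous sketch.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu, LinearOperator, cgs

    def laplacian2d(k):
        T = sp.diags([-1, 4, -1], [-1, 0, 1], shape=(k, k))
        S = sp.diags([-1, -1], [-1, 1], shape=(k, k))
        return sp.kron(sp.identity(k), T) + sp.kron(S, sp.identity(k))

    nblocks, k = 4, 10
    bs = k * k
    D, C = laplacian2d(k), 0.1 * sp.identity(bs)
    blocks = [[None] * nblocks for _ in range(nblocks)]
    for i in range(nblocks):
        blocks[i][i] = D
        if i + 1 < nblocks:
            blocks[i][i + 1] = C
            blocks[i + 1][i] = C
    A = sp.bmat(blocks, format="csc")

    # M = diag(A_11, ..., A_kk): factor each diagonal block once.
    facts = [splu(A[i*bs:(i+1)*bs, i*bs:(i+1)*bs].tocsc()) for i in range(nblocks)]

    def apply_Minv(x):                      # block-diagonal solve with the LU factors
        y = np.empty_like(x)
        for i in range(nblocks):
            y[i*bs:(i+1)*bs] = facts[i].solve(x[i*bs:(i+1)*bs])
        return y

    M = LinearOperator(A.shape, matvec=apply_Minv)
    b = np.ones(A.shape[0])
    x, info = cgs(A, b, M=M)
    print(info, np.linalg.norm(A @ x - b))  # info == 0 means CGS converged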

One-Block Timings on IBM SP
• Complex, unsymmetric
• N = 2 M, NNZ = 26 M
• Fill-in using Metis: 1.3 G (50x fill)
• Factorization speed: 10x speedup from 4 to 128 processors, up to 30 Gflops

Application 2: Accelerator Cavity Design
• Calculate cavity mode frequencies and field vectors
• Solve the Maxwell equations for the electromagnetic field
• Omega3P simulation code developed at SLAC
• (Figures: Omega3P model of a 47-cell section of the 206-cell Next Linear Collider accelerator structure; individual cells used in the accelerating structure)

Accelerator (cont.)
• Finite element methods lead to a large sparse generalized eigensystem K x = λ M x
• Real symmetric for lossless cavities; complex symmetric for lossy cavities
• Seek interior eigenvalues (tightly clustered) that are relatively small in magnitude

Accelerator (cont.)
• Speed up Lanczos convergence by shift-invert
  • Seek the largest, well-separated eigenvalues μ of the transformed system (K − σM)^(-1) M x = μ x, where μ = 1 / (λ − σ)
• The filtering algorithm [Y. Sun]: inexact shift-invert Lanczos + JOCC (Jacobi Orthogonal Component Correction)
• We added exact shift-invert Lanczos (ESIL)
  • PARPACK for Lanczos
  • SuperLU_DIST for the shifted linear systems
  • No pivoting, no iterative refinement
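
Exact shift-invert Lanczos in miniature: SciPy's eigsh with sigma set factors (K − σM) with a sparse direct solver and runs ARPACK (the library behind PARPACK) on the transformed operator. K and M below are small symmetric stand-ins, not the Omega3P matrices.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import eigsh

    n = 500
    K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")   # stiffness-like
    M = sp.identity(n, format="csc")                                          # mass-like

    sigma = 0.01                        # shift near the wanted interior eigenvalues
    vals, vecs = eigsh(K, k=5, M=M, sigma=sigma, which="LM")
    print(np.sort(vals))                # the 5 eigenvalues of K x = lambda M x nearest sigma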

DDS47, Linear Elements
• N = 1.3 M, NNZ = 20 M
• (Figure: total eigensolver time)

Largest Eigenproblem Solved So Far
• DDS47, quadratic elements
• N = 7.5 M, NNZ = 304 M
• 6 G fill-ins using Metis
• 24 processors (8x3)
  • Factorization: 3,347 s
  • One solve: 61 s
• Eigensolver: 9,259 s (~2.5 hrs)
  • 10 eigenvalues, 1 shift, 55 solves