Lecture 5 Parallel Sparse Factorization, Triangular Solution

Lecture 5: Parallel Sparse Factorization, Triangular Solution
4th Gene Golub SIAM Summer School, 7/22 – 8/7, 2013, Shanghai
Xiaoye Sherry Li, Lawrence Berkeley National Laboratory, USA
xsli@lbl.gov
crd-legacy.lbl.gov/~xiaoye/G2S3/

Lecture outline
- Shared-memory factorization
- Distributed-memory factorization
- Distributed-memory triangular solve
- Collection of sparse codes, sparse matrices

SuperLU_MT [Li, Demmel, Gilbert]
- Pthreads or OpenMP
- Left-looking: relatively more READs than WRITEs
- Uses a shared task queue to schedule ready columns in the elimination tree (bottom up)
- No notion of fixed processor assignment: any processor may take any ready column, as long as the dependencies are not violated
- Over 12x speedup on conventional 16-CPU SMPs (1999)
[Figure: snapshot of the factorization, showing the DONE / WORKING / NOT TOUCHED regions of L and U for processes P1 and P2]
(Results: VECPAR 2008 paper)
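
To make the scheduling idea concrete, here is a minimal, self-contained sketch (not SuperLU_MT source) of bottom-up scheduling over an elimination tree. The tree, given as a hypothetical parent[] array, and all names are made up for illustration: a column becomes "ready" once all of its children have been factored, and any idle thread may grab any ready column.

```c
/* ready_columns.c : toy bottom-up scheduling over an elimination tree.
 * Compile with: cc -fopenmp ready_columns.c
 * Schematic illustration only, not SuperLU_MT code. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    /* A small made-up elimination tree: parent[i] is the parent of column i,
     * -1 marks the root.  Leaves are ready immediately. */
    int parent[] = { 2, 2, 4, 4, 6, 6, -1 };
    int n = 7;

    int *nchild = calloc(n, sizeof(int));   /* unfinished children per column */
    int *queue  = malloc(n * sizeof(int));  /* shared queue of ready columns  */
    int head = 0, tail = 0;

    for (int i = 0; i < n; i++)
        if (parent[i] >= 0) nchild[parent[i]]++;
    for (int i = 0; i < n; i++)             /* enqueue the leaves */
        if (nchild[i] == 0) queue[tail++] = i;

    #pragma omp parallel
    {
        for (;;) {
            int col = -1, done = 0;
            #pragma omp critical(ready_queue)
            {
                if (head < tail) col = queue[head++];
                else if (head == n) done = 1;   /* every column dequeued */
            }
            if (done) break;
            if (col < 0) continue;              /* nothing ready yet: spin */

            /* "Factor" column col here (numeric work omitted). */
            printf("thread %d factors column %d\n", omp_get_thread_num(), col);

            int p = parent[col];
            if (p >= 0) {
                int remaining;
                #pragma omp atomic capture
                remaining = --nchild[p];
                if (remaining == 0) {           /* parent just became ready */
                    #pragma omp critical(ready_queue)
                    queue[tail++] = p;
                }
            }
        }
    }
    free(nchild);
    free(queue);
    return 0;
}
```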

Benchmark matrices

| Matrix   | Application              | Dimension | nnz(A) | Fill (SLU_MT) | Fill (SLU_DIST) | Avg. supernode |
|----------|--------------------------|-----------|--------|---------------|-----------------|----------------|
| g7jac200 | Economic model           | 59,310    | 0.7 M  | 33.7 M        |                 | 1.9 |
| stomach  | 3D finite diff.          | 213,360   | 3.0 M  | 136.8 M       | 137.4 M         | 4.0 |
| torso3   | 3D finite diff.          | 259,156   | 4.4 M  | 784.7 M       | 785.0 M         | 3.1 |
| twotone  | Nonlinear analog circuit | 120,750   | 1.2 M  | 11.4 M        |                 | 2.3 |

SLU_MT runs in "symmetric mode": ordering on A+A', pivoting on the main diagonal.

Multicore platforms
- Intel Clovertown (Xeon 53xx): 2.33 GHz Xeon, 9.3 Gflops/core; 2 sockets x 4 cores/socket; L2 cache: 4 MB per 2 cores
- Sun Niagara 2 (UltraSPARC T2, a.k.a. VictoriaFalls): 1.4 GHz UltraSPARC T2, 1.4 Gflops/core; 2 sockets x 8 cores/socket x 8 hardware threads/core; shared L2 cache: 4 MB
- Why it matters: left- vs. right-looking is essentially read- vs. write-dominated, and on these platforms write bandwidth is half of read bandwidth

Results on Intel Clovertown and Sun Niagara 2
- Maximum speedup: 4.3 (Intel), 20 (Sun)
- The overhead of the scheduling algorithm and of synchronization with mutexes (locks) is small
- Open question: what tools can analyze resource contention?

Matrix distribution on a large distributed-memory machine
- 2D block cyclic is recommended for many linear algebra algorithms: better load balance, less communication, and BLAS-3 (blocked) operations
- Alternatives: 1D blocked, 1D cyclic, 1D block cyclic, 2D block cyclic
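
The index arithmetic behind a 2D block-cyclic layout is simple enough to show directly. The sketch below uses the generic ScaLAPACK-style mapping (block row modulo the number of process rows, block column modulo the number of process columns); the grid shape and block size are arbitrary example values, not taken from any particular solver.

```c
/* block_cyclic.c : which process owns block (I, J) in a 2D block-cyclic layout?
 * Generic ScaLAPACK-style mapping; a sketch, not any solver's internal code. */
#include <stdio.h>

/* Process grid of nprow x npcol, blocks of size nb x nb.
 * Global entry (i, j) lives in block (i/nb, j/nb), which is owned by
 * process (block_row % nprow, block_col % npcol). */
static void owner_of(int i, int j, int nb, int nprow, int npcol,
                     int *prow, int *pcol)
{
    *prow = (i / nb) % nprow;
    *pcol = (j / nb) % npcol;
}

int main(void)
{
    int nprow = 2, npcol = 3, nb = 4;   /* 2x3 grid, 4x4 blocks (example values) */
    for (int i = 0; i < 16; i += nb)
        for (int j = 0; j < 16; j += nb) {
            int pr, pc;
            owner_of(i, j, nb, nprow, npcol, &pr, &pc);
            printf("block at (%2d,%2d) -> process (%d,%d)\n", i, j, pr, pc);
        }
    return 0;
}
```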

2D block cyclic distribution for sparse L & U
- SuperLU_DIST: C + MPI
- Right-looking: relatively more WRITEs than READs
- 2D block cyclic layout
- Look-ahead to overlap communication & computation
- Scales to 1000s of processors
[Figure: supernodal blocks of the matrix mapped cyclically onto the process mesh, with the ACTIVE panel highlighted]

SuperLU_DIST: GE with static pivoting [Li, Demmel, Grigori, Yamazaki]
- Target: distributed-memory multiprocessors
- Goal: no pivoting during numeric factorization
- Steps:
  - Permute A unsymmetrically to put large elements on the diagonal (weighted bipartite matching)
  - Scale rows and columns to equilibrate
  - Permute A symmetrically for sparsity
  - Factor A = LU with no pivoting, fixing up small pivots: if |a_ii| < ε·||A||, replace a_ii by sqrt(ε)·||A||
  - Solve for x using the triangular factors: Ly = b, Ux = y
  - Improve the solution by iterative refinement
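
The small-pivot fix-up above is just a threshold test, and can be written down directly. The following is a minimal sketch: using DBL_EPSILON for ε and preserving the sign of the original entry are assumptions made here for illustration, and the real code applies the rule inside the sparse diagonal blocks rather than to scalars.

```c
/* pivot_fixup.c : small-pivot replacement used with static pivoting:
 * if |a_ii| < eps * ||A||, replace a_ii by sqrt(eps) * ||A||.
 * Schematic only; sign handling and the choice eps = DBL_EPSILON are
 * illustrative assumptions. */
#include <float.h>
#include <math.h>
#include <stdio.h>

static double fixup_pivot(double aii, double normA)
{
    double eps = DBL_EPSILON;
    if (fabs(aii) < eps * normA) {
        double repl = sqrt(eps) * normA;
        return (aii < 0.0) ? -repl : repl;   /* keep the original sign */
    }
    return aii;
}

int main(void)
{
    double normA = 1.0e3;                    /* made-up matrix norm */
    printf("%g -> %g\n", 1.0e-15, fixup_pivot(1.0e-15, normA));
    printf("%g -> %g\n", 2.5,     fixup_pivot(2.5, normA));
    return 0;
}
```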

Row permutation for a heavy diagonal [Duff, Koster]
- Represent A as a weighted, undirected bipartite graph (one node for each row and one node for each column)
- Find a matching (set of independent edges) with maximum product of weights
- Permute the rows to place the matching on the diagonal
- The matching algorithm also gives a row and column scaling that makes all |diagonal entries| = 1 and all |off-diagonal entries| <= 1
[Figure: a 5x5 example matrix A and the permuted matrix PA with the matching on the diagonal]
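
For intuition only, the sketch below finds the maximum-product matching of a tiny dense matrix by brute force and reports which row should sit on each diagonal position. The matrix values are invented, and exhaustive search is emphatically not the real algorithm: production codes (e.g. MC64-style codes) solve this as a weighted bipartite matching problem on the sparse structure in polynomial time.

```c
/* heavy_diag.c : brute-force illustration of the row-permutation idea.
 * Among all row permutations, pick the one maximizing the product of the
 * magnitudes of the diagonal entries.  Exponential toy code for a tiny
 * dense matrix; not how real matching codes work. */
#include <math.h>
#include <stdio.h>

#define N 4

static double A[N][N] = {        /* made-up example matrix */
    { 0.1, 5.0, 0.0, 0.0 },
    { 4.0, 0.2, 0.0, 1.0 },
    { 0.0, 3.0, 0.1, 0.0 },
    { 0.0, 0.0, 2.0, 6.0 },
};

static int best[N], perm[N];
static double bestprod = -1.0;

/* try all assignments of rows to columns 0..N-1 */
static void search(int col, int used, double prod)
{
    if (col == N) {
        if (prod > bestprod) {
            bestprod = prod;
            for (int j = 0; j < N; j++) best[j] = perm[j];
        }
        return;
    }
    for (int r = 0; r < N; r++)
        if (!(used & (1 << r)) && A[r][col] != 0.0) {
            perm[col] = r;
            search(col + 1, used | (1 << r), prod * fabs(A[r][col]));
        }
}

int main(void)
{
    search(0, 0, 1.0);
    /* for this matrix the best matching places rows 1, 0, 2, 3 on the diagonal */
    printf("product of |diagonal| = %g; row matched to each column:", bestprod);
    for (int j = 0; j < N; j++) printf(" %d", best[j]);
    printf("\n");
    return 0;
}
```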

SuperLU_DIST steps to solution
- Matrix preprocessing: static pivoting / scaling / permutation to improve numerical stability and to preserve sparsity
- Symbolic factorization: compute the e-tree and the structures of L and U; static communication & computation scheduling; find supernodes (6-80 columns) for efficient dense BLAS operations
- Numerical factorization (dominant cost): right-looking, outer-product updates on a 2D block-cyclic MPI process grid
- Triangular solves with forward and back substitution
[Figure: 2x3 process grid]

SuperLU_DIST right-looking factorization

    for j = 1, 2, ..., Ns  (Ns = number of supernodes)
        // panel factorization (row and column)
        factor A(j,j) = L(j,j)*U(j,j), and ISEND to PC(j) and PR(j)
        WAIT for L(j,j), factor row A(j, j+1:Ns), and SEND it right to PC(:)
        WAIT for U(j,j), factor column A(j+1:Ns, j), and SEND it down to PR(:)
        // trailing matrix update
        update A(j+1:Ns, j+1:Ns)
    end for

- Scalability bottleneck: the panel factorization has a sequential flow and limited parallelism; all processes wait for the diagonal and panel factorizations
- The trailing matrix update has high parallelism and good load balance, but is time-consuming
- The implementation uses flexible look-ahead (next slides)
[Figure: 2x3 process grid]
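
The ISEND/WAIT pattern in the panel factorization can be illustrated with plain MPI. The toy program below factors nothing; it only shows the communication skeleton: the owner of the diagonal block posts nonblocking sends along its process row, everyone else posts a nonblocking receive, and the wait is deferred so that independent work can overlap with the transfer. The grid shape, block size, and tag are made-up example values, and this is not SuperLU_DIST source.

```c
/* panel_bcast.c : schematic of the ISEND / WAIT pattern used to send a factored
 * diagonal block along its process row while other work proceeds.
 * Toy example; run with a rank count that is a multiple of nprow, e.g.
 *   mpicc panel_bcast.c && mpirun -np 6 ./a.out */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int nprow = 2;                            /* made-up grid: nprow x npcol */
    if (size % nprow != 0) {
        if (rank == 0) fprintf(stderr, "run with a multiple of %d ranks\n", nprow);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    int npcol = size / nprow;
    int myrow = rank / npcol, mycol = rank % npcol;

    /* Row communicator: the processes sharing my grid row. */
    MPI_Comm rowcomm;
    MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, &rowcomm);

    int nb = 64;                              /* block size (example value) */
    double *block = malloc(nb * nb * sizeof(double));
    MPI_Request *req = malloc(npcol * sizeof(MPI_Request));
    int nreq = 0, diag_col = 0;               /* grid column owning L(j,j) */

    if (mycol == diag_col) {
        for (int i = 0; i < nb * nb; i++) block[i] = 1.0;  /* "factor" L(j,j) */
        for (int c = 0; c < npcol; c++)                    /* ISEND along the row */
            if (c != mycol)
                MPI_Isend(block, nb * nb, MPI_DOUBLE, c, 0, rowcomm, &req[nreq++]);
    } else {
        MPI_Irecv(block, nb * nb, MPI_DOUBLE, diag_col, 0, rowcomm, &req[nreq++]);
    }

    /* ... independent work (e.g. updates from earlier panels) would overlap
     * with the communication posted above ... */

    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);           /* WAIT before use */
    printf("rank %d (row %d, col %d): block[0] = %.1f\n",
           rank, myrow, mycol, block[0]);

    free(block); free(req);
    MPI_Comm_free(&rowcomm);
    MPI_Finalize();
    return 0;
}
```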

SuperLU_DIST 2.5 on Cray XE6
- Profiling with IPM
- Synchronization dominates on a large number of cores, accounting for up to 96% of the factorization time
[Figures: timing breakdown for Accelerator (symmetric), n = 2.7M, fill-ratio = 12, and DNA, n = 445K, fill-ratio = 609]

Look-ahead factorization with window size nw

    for j = 1, 2, ..., Ns  (Ns = number of supernodes)
        // look-ahead row factorization
        for k = j+1 to j+nw do
            if L(k,k) has arrived: factor A(k, (k+1):Ns) and ISEND to PC(:)
        end for
        // synchronization
        factor A(j,j) = L(j,j)*U(j,j), and ISEND to PC(j) and PR(j)
        WAIT for L(j,j) and factor row A(j, j+1:Ns)
        WAIT for L(:,j) and U(j,:)
        // look-ahead column factorization
        for k = j+1 to j+nw do
            update A(:,k); if A(:,k) is ready: factor A(k:Ns, k) and ISEND to PR(:)
        end for
        // trailing matrix update
        update the remaining A(j+nw+1:Ns, j+nw+1:Ns)
    end for

- At each j-th step, factorize all "ready" panels in the window: this reduces idle time, overlaps communication with computation, and exploits more parallelism
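
The test "has L(k,k) arrived?" is what distinguishes look-ahead from the synchronous loop on the previous slide. Below is a minimal two-process sketch of that primitive, using MPI_Test on pre-posted receives; the window size, tags, and data are invented, and the "factoring" is just a print statement.

```c
/* lookahead_probe.c : checking whether a panel "has arrived" without blocking,
 * the primitive behind the look-ahead loop above.  Toy two-process example,
 * not SuperLU_DIST source.  Run with: mpirun -np 2 ./lookahead_probe */
#include <mpi.h>
#include <stdio.h>

#define NW 4   /* look-ahead window size (example value) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) MPI_Abort(MPI_COMM_WORLD, 1);

    double panel[NW];
    MPI_Request req[NW];

    if (rank == 0) {                          /* owner sends the next NW panels */
        for (int k = 0; k < NW; k++) {
            panel[k] = 100.0 + k;             /* stand-in for a factored panel */
            MPI_Isend(&panel[k], 1, MPI_DOUBLE, 1, k, MPI_COMM_WORLD, &req[k]);
        }
        MPI_Waitall(NW, req, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        int done[NW] = {0}, ndone = 0;
        for (int k = 0; k < NW; k++)          /* pre-post receives for the window */
            MPI_Irecv(&panel[k], 1, MPI_DOUBLE, 0, k, MPI_COMM_WORLD, &req[k]);
        while (ndone < NW) {
            for (int k = 0; k < NW; k++) {
                int arrived = 0;
                if (!done[k]) MPI_Test(&req[k], &arrived, MPI_STATUS_IGNORE);
                if (arrived) {                /* panel k is ready: factor it early */
                    printf("look-ahead: factoring panel %d (%.0f)\n", k, panel[k]);
                    done[k] = 1;
                    ndone++;
                }
            }
            /* ... otherwise keep doing trailing-matrix updates while waiting ... */
        }
    }
    MPI_Finalize();
    return 0;
}
```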

Expose more "ready" panels in the window
- Schedule tasks in a better order, as long as task dependencies are respected
- Dependency graphs:
  - LU DAG: all dependencies
  - Transitive reduction of the LU DAG: the smallest such graph, with all redundant edges removed, but expensive to compute
  - Symmetrically pruned LU DAG (rDAG): in between the LU DAG and its transitive reduction; cheap to compute
  - Elimination tree (e-tree):
    - symmetric case: the e-tree is the transitive reduction of the Cholesky DAG; cheap to compute
    - unsymmetric case: use the e-tree of |A|^T + |A|; cheap to compute
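
For reference, the elimination tree of a symmetric matrix can be computed in near-linear time with Liu's algorithm (path compression over ancestors); for the unsymmetric case one would apply the same routine to the structure of |A|^T + |A|, as noted above. The sketch below is a textbook version operating on a made-up CSC pattern, not any library's exact code.

```c
/* etree.c : elimination tree of a symmetric sparse matrix (Liu's algorithm
 * with path compression).  Textbook sketch; only the pattern (row indices)
 * of the compressed-sparse-column (CSC) matrix is used. */
#include <stdio.h>
#include <stdlib.h>

/* colptr[0..n], rowind[0..nnz-1]; writes parent[0..n-1] (-1 = root). */
static void etree(int n, const int *colptr, const int *rowind, int *parent)
{
    int *ancestor = malloc(n * sizeof(int));
    for (int j = 0; j < n; j++) {
        parent[j] = -1;
        ancestor[j] = -1;
        for (int p = colptr[j]; p < colptr[j + 1]; p++) {
            int i = rowind[p];
            /* follow the path from row i toward the current virtual root,
             * compressing it to point at j along the way */
            while (i != -1 && i < j) {
                int next = ancestor[i];
                ancestor[i] = j;
                if (next == -1) parent[i] = j;
                i = next;
            }
        }
    }
    free(ancestor);
}

int main(void)
{
    /* Strict upper-triangular pattern of a tiny 4x4 symmetric matrix:
     * nonzeros at (0,1), (1,2), (0,3).  Expected tree: 0 -> 1 -> 2 -> 3. */
    int colptr[] = { 0, 0, 1, 2, 3 };
    int rowind[] = { 0, 1, 0 };
    int parent[4];
    etree(4, colptr, rowind, parent);
    for (int j = 0; j < 4; j++) printf("parent[%d] = %d\n", j, parent[j]);
    return 0;
}
```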

Example: reordering based on the e-tree
- Window size = 5
- Postordering based on depth-first search
- Bottom-up level-based ordering

SuperLU_DIST 2.5 and 3.0 on Cray XE6
[Figures: performance for Accelerator (symmetric), n = 2.7M, fill-ratio = 12, and DNA, n = 445K, fill-ratio = 609]
- Idle time was significantly reduced (speedup up to 2.6x)
- To further improve performance: more sophisticated scheduling schemes; hybrid programming paradigms

Example matrices

| Name      | Application                            | Data type | N         | nnz(A)/N | nnz(L\U) (10^6) | Fill-ratio |
|-----------|----------------------------------------|-----------|-----------|----------|-----------------|------------|
| g500      | Quantum mechanics (LBL)                | Complex   | 4,235,364 | 13       | 3092.6          | 56.2 |
| matrix181 | Fusion, MHD equations (PPPL)           | Real      | 589,698   | 161      | 888.1           | 9.3  |
| dds15     | Accelerator, shape optimization (SLAC) |           | 834,575   | 16       | 526.6           | 40.2 |
| matick    | Circuit simulation, MNA method (IBM)   |           | 16,019    | 4,005    | 64.3            | 1.0  |

- IBM matick: Modified Nodal Analysis (MNA) technique; very different from SPICE matrices
- Sparsity-preserving ordering: METIS applied to the structure of A' + A

Performance on IBM Power5 (1.9 GHz)
- Factorization rate up to 454 Gflops
- At first sight the solve looks less scalable: solve time is usually < 5% of the total, but it scales poorly

Performance on IBM Power3 (375 MHz)
- Quantum mechanics matrix, complex arithmetic

Distributed triangular solution
- Challenge: higher degree of dependency
- The diagonal process computes the solution
[Figure: block rows of L mapped onto the process mesh]
(Details: VECPAR 2008 talk)
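
The dependencies that make the distributed solve hard are easiest to see in the sequential column-oriented algorithm: x(j) is final only after every earlier column has pushed its update, and everything below row j must then wait for x(j). A small self-contained sketch with an invented 3x3 example follows.

```c
/* lsolve.c : sequential sparse lower-triangular solve L x = b, column-oriented,
 * with L stored in CSC and the diagonal entry first in each column.
 * Textbook sketch to make the dependency structure explicit. */
#include <stdio.h>

/* On input x holds b; on output x holds the solution of L x = b. */
static void lsolve(int n, const int *colptr, const int *rowind,
                   const double *val, double *x)
{
    for (int j = 0; j < n; j++) {
        x[j] /= val[colptr[j]];                  /* divide by L(j,j) */
        for (int p = colptr[j] + 1; p < colptr[j + 1]; p++)
            x[rowind[p]] -= val[p] * x[j];       /* rows below j depend on x[j] */
    }
}

int main(void)
{
    /* L = [2 0 0; 1 1 0; 0 3 4] in CSC, diagonal first in each column. */
    int    colptr[] = { 0, 2, 4, 5 };
    int    rowind[] = { 0, 1, 1, 2, 2 };
    double val[]    = { 2, 1, 1, 3, 4 };
    double x[]      = { 4, 3, 7 };               /* right-hand side b */
    lsolve(3, colptr, rowind, val, x);
    printf("x = [%g, %g, %g]\n", x[0], x[1], x[2]);  /* expect [2, 1, 1] */
    return 0;
}
```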

Parallel triangular solution
- Platforms: Clovertown (8 cores); IBM Power5 (8 CPUs/node)
- OLD code: many MPI_Reduce calls of one integer each, accounting for 75% of the time on 8 cores
- NEW code: a single MPI_Reduce of an array of integers
- Scales better on Power5
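
The OLD-versus-NEW change amounts to batching many latency-bound one-integer reductions into a single reduction over an array. The toy program below times both patterns; the count M and the values are made up, and the code only illustrates the restructuring rather than reproducing the solver's actual communication.

```c
/* reduce_batch.c : many one-integer MPI_Reduce calls vs. one reduce of an array.
 * Toy illustration of the code change described above. */
#include <mpi.h>
#include <stdio.h>

#define M 1000   /* number of counters to combine (example value) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local[M], global[M];
    for (int i = 0; i < M; i++) local[i] = rank + i;

    double t0 = MPI_Wtime();
    for (int i = 0; i < M; i++)          /* OLD: one tiny reduction per value */
        MPI_Reduce(&local[i], &global[i], 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    MPI_Reduce(local, global, M, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);  /* NEW */
    double t2 = MPI_Wtime();

    if (rank == 0)
        printf("per-value reduces: %.3e s, one batched reduce: %.3e s\n",
               t1 - t0, t2 - t1);
    MPI_Finalize();
    return 0;
}
```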

MUMPS: distributed-memory multifrontal [current team: Amestoy, Buttari, Guermouche, L'Excellent, Uçar]
- Symmetric-pattern multifrontal factorization
- Parallelism both from the tree and by sharing dense operations
- Dynamic scheduling of the shared dense operations
- Symmetric preordering
- For nonsymmetric matrices:
  - optional weighted matching for a heavy diagonal
  - expand the nonzero pattern to be symmetric
  - numerical pivoting only within supernodes if possible (does not change the pattern)
  - failed pivots are passed up the tree in the update matrix

Collection of software, test matrices
- Survey of different types of direct solver codes: http://crd.lbl.gov/~xiaoye/SuperLU/SparseDirectSurvey.pdf
  - LL^T (s.p.d.), LDL^T (symmetric indefinite), LU (nonsymmetric), QR (least squares)
  - Sequential, shared-memory, distributed-memory, out-of-core
  - Accelerators (GPU, FPGA) are becoming active; papers exist, but no public code yet
- The University of Florida Sparse Matrix Collection: http://www.cise.ufl.edu/research/sparse/matrices/

References
X.S. Li, "An Overview of SuperLU: Algorithms, Implementation, and User Interface", ACM Transactions on Mathematical Software, Vol. 31, No. 3, 2005, pp. 302-325.
X.S. Li and J. Demmel, "SuperLU_DIST: A Scalable Distributed-memory Sparse Direct Solver for Unsymmetric Linear Systems", ACM Transactions on Mathematical Software, Vol. 29, No. 2, 2003, pp. 110-140.
X.S. Li, "Evaluation of Sparse LU Factorization and Triangular Solution on Multicore Platforms", VECPAR'08, Toulouse, June 24-27, 2008.
I. Yamazaki and X.S. Li, "New Scheduling Strategies for a Parallel Right-looking Sparse LU Factorization Algorithm on Multicore Clusters", IPDPS 2012, Shanghai, China, May 21-25, 2012.
L. Grigori, X.S. Li and J. Demmel, "Parallel Symbolic Factorization for Sparse LU with Static Pivoting", SIAM J. Sci. Comp., Vol. 29, Issue 3, 2007, pp. 1289-1314.
P.R. Amestoy, I.S. Duff, J.-Y. L'Excellent, and J. Koster, "A Fully Asynchronous Multifrontal Solver Using Distributed Dynamic Scheduling", SIAM Journal on Matrix Analysis and Applications, 23(1), 2001, pp. 15-41.
P. Amestoy, I.S. Duff, A. Guermouche, and T. Slavova, "Analysis of the Solution Phase of a Parallel Multifrontal Approach", Parallel Computing, No. 36, 2009, pp. 3-15.
A. Guermouche, J.-Y. L'Excellent, and G. Utard, "Impact of Reordering on the Memory of a Multifrontal Solver", Parallel Computing, 29(9), pp. 1191-1218.
F.-H. Rouet, "Memory and Performance Issues in Parallel Multifrontal Factorization and Triangular Solutions with Sparse Right-hand Sides", PhD Thesis, INPT, 2012.
P. Amestoy, I.S. Duff, J.-Y. L'Excellent, and X.S. Li, "Analysis and Comparison of Two General Sparse Solvers for Distributed Memory Computers", ACM Transactions on Mathematical Software, Vol. 27, No. 4, 2001, pp. 388-421.

Exercises
- Download and install SuperLU_MT on your machine, then run the examples in the EXAMPLE/ directory.
- Run the examples in the SuperLU_DIST_3.3 directory.