Lecture 4 Sparse Factorization: Data-flow Organization


4th Gene Golub SIAM Summer School, 7/22 – 8/7, 2013, Shanghai
Lecture 4 Sparse Factorization: Data-flow Organization
Xiaoye Sherry Li, Lawrence Berkeley National Laboratory, USA
xsli@lbl.gov
crd-legacy.lbl.gov/~xiaoye/G2S3/
July 13, TNLIST, Tsinghua Univ.

Lecture outline
- Data-flow organization: left-looking, right-looking
- Blocking for high performance: supernodal, multifrontal
- Triangular solution

Dense Cholesky

Left-looking Cholesky
for k = 1, …, n do
    for i = k, …, n do
        for j = 1, …, k-1 do
            A(i, k) = A(i, k) - A(i, j) * A(k, j)
        end for
    end for
    A(k, k) = sqrt(A(k, k))
    for i = k+1, …, n do
        A(i, k) = A(i, k) / A(k, k)
    end for
end for

Right-looking Cholesky
for k = 1, …, n do
    A(k, k) = sqrt(A(k, k))
    for i = k+1, …, n do
        A(i, k) = A(i, k) / A(k, k)
    end for
    for i = k+1, …, n do
        for j = k+1, …, i do
            A(i, j) = A(i, j) - A(i, k) * A(j, k)
        end for
    end for
end for
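As a concrete rendering of the two orderings, here is a minimal NumPy sketch (our own illustration, not code from the course materials); both variants overwrite the lower triangle of a copy of A with the Cholesky factor L.

import numpy as np

def cholesky_left(A):
    """Left-looking: column k gathers updates from all previous columns, then is scaled."""
    A = A.copy()
    n = A.shape[0]
    for k in range(n):
        for j in range(k):                      # pull updates from the left
            A[k:, k] -= A[k, j] * A[k:, j]      # cmod(k, j)
        A[k, k] = np.sqrt(A[k, k])              # cdiv(k)
        A[k+1:, k] /= A[k, k]
    return np.tril(A)

def cholesky_right(A):
    """Right-looking: column k is finished first, then scatters updates to the right."""
    A = A.copy()
    n = A.shape[0]
    for k in range(n):
        A[k, k] = np.sqrt(A[k, k])              # cdiv(k)
        A[k+1:, k] /= A[k, k]
        for j in range(k+1, n):                 # push updates to the right
            A[j:, j] -= A[j, k] * A[j:, k]      # cmod(j, k)
    return np.tril(A)

# Quick check on a random symmetric positive definite matrix
B = np.random.rand(5, 5)
A = B @ B.T + 5 * np.eye(5)
assert np.allclose(cholesky_left(A), np.linalg.cholesky(A))
assert np.allclose(cholesky_right(A), np.linalg.cholesky(A))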

Sparse Cholesky

Reference case: a regular 3 x 3 grid ordered by nested dissection, so that nodes in the separators are ordered last.

[Figure: the matrix A, its graph G(A), and its elimination tree T(A) for the 3 x 3 grid]

Notation:
- cdiv(j): divide column j by a scalar
- cmod(j, k): update column j with column k
- struct(L(1:k, j)): the structure of the submatrix L(1:k, j)

Sparse left-looking Cholesky

for j = 1 to n do
    for k in struct(L(j, 1:j-1)) do
        cmod(j, k)
    end for
    cdiv(j)
end for

Before variable j is eliminated, column j is updated with all the columns that have a nonzero on row j. In the example above, struct(L(7, 1:6)) = {1, 3, 4, 6}. This corresponds to receiving updates from nodes lower in the subtree rooted at j. The filled graph is necessary to determine the structure of each row.

Sparse right-looking Cholesky

for k = 1 to n do
    cdiv(k)
    for j in struct(L(k+1:n, k)) do
        cmod(j, k)
    end for
end for

After variable k is eliminated, column k is used to update all the columns corresponding to nonzeros in column k. In the example above, struct(L(4:9, 3)) = {7, 8, 9}. This corresponds to sending updates to nodes higher in the elimination tree. The filled graph is necessary to determine the structure of each column.
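To see that the two sparse orderings perform exactly the same set of cmod updates, just scheduled differently, here is a small Python sketch on a made-up 5-column filled pattern (not the 3 x 3 grid example); the pattern and the helper names struct_row and struct_col are ours.

pattern = {          # pattern[k] = sorted row indices of nonzeros in L(:, k)
    0: [0, 2, 4],
    1: [1, 2],
    2: [2, 3],
    3: [3, 4],
    4: [4],
}

def struct_row(j):
    """struct(L(j, 1:j-1)): columns k < j with a nonzero on row j (pull)."""
    return [k for k in range(j) if j in pattern[k]]

def struct_col(k):
    """struct(L(k+1:n, k)): rows j > k with a nonzero in column k (push)."""
    return [j for j in pattern[k] if j > k]

left  = [(j, k) for j in pattern for k in struct_row(j)]   # left-looking order
right = [(j, k) for k in pattern for j in struct_col(k)]   # right-looking order
assert sorted(left) == sorted(right)     # same cmod(j, k) set either way
print(sorted(left))                      # [(2, 0), (2, 1), (3, 2), (4, 0), (4, 3)]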

Sparse LU

Left-looking: many more reads than writes.

for j = 1 to n do
    // the inner loop computes U(1:j-1, j) = L(1:j-1, 1:j-1) \ A(1:j-1, j)
    for k in struct(U(1:j-1, j)) do
        cmod(j, k)
    end for
    cdiv(j)
end for

[Figure: at step j, the columns of L and U to the left of j are DONE, column j is ACTIVE, and the trailing part of A is NOT TOUCHED]

Right-looking: many more writes than reads.

for k = 1 to n do
    cdiv(k)
    for j in struct(U(k, k+1:n)) do
        cmod(j, k)
    end for
end for

[Figure: at step k, the leading columns of L and rows of U are DONE; the trailing submatrix of A is ACTIVE]
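A dense NumPy sketch of the left-looking LU column step (our illustration: no pivoting, and it assumes all leading principal minors are nonsingular). The triangular solve for U's column only reads finished columns of L and writes a single column, which is the read-heavy character noted above.

import numpy as np

def lu_left_looking(A):
    """Doolittle LU computed one column of L and U at a time (no pivoting)."""
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for j in range(n):
        if j:   # U(1:j-1, j) = L(1:j-1, 1:j-1) \ A(1:j-1, j)
            U[:j, j] = np.linalg.solve(L[:j, :j], A[:j, j])
        v = A[j:, j] - L[j:, :j] @ U[:j, j]   # apply the cmod(j, k) updates
        U[j, j] = v[0]                        # cdiv(j): pivot, then scale
        L[j+1:, j] = v[1:] / v[0]
    return L, U

A = np.random.rand(5, 5) + 5 * np.eye(5)    # diagonally dominant, safe to skip pivoting
L, U = lu_left_looking(A)
assert np.allclose(L @ U, A)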

High Performance Issues: Reduce Cost of Memory Access & Communication
- Blocking to increase the number of floating-point operations performed per memory access
- Aggregate small messages into one larger message: reduces cost due to latency
- Well done in LAPACK and ScaLAPACK for dense and banded matrices
- Adopted in the new generation of sparse software; performance is much more sensitive to latency in the sparse case

Blocking: supernode
- Use (blocked) CRS or CCS, and any ordering method
- Leave room for fill-ins! (symbolic factorization)
- Exploit "supernodal" (dense) structures in the factors:
  - Can use Level 3 BLAS
  - Reduce inefficient indirect addressing (scatter/gather)
  - Reduce graph traversal time using a coarser graph
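To illustrate how supernodes can be identified, the following sketch partitions columns into maximal runs with matching below-diagonal structure, one common ("fundamental supernode") definition; the pattern format and function name are ours, not from any particular package.

def find_supernodes(pattern, n):
    """pattern[k] = sorted row indices of L(:, k), diagonal included.
    Column k joins column k-1's supernode iff struct(L(:, k)) equals
    struct(L(:, k-1)) minus the eliminated row k-1."""
    sn, start = [], 0
    for k in range(1, n):
        prev = [i for i in pattern[k-1] if i > k-1]   # below-diagonal of col k-1
        cur = pattern[k]                              # col k incl. diagonal
        if prev != cur:             # structure no longer matches: close supernode
            sn.append((start, k-1))
            start = k
    sn.append((start, n-1))
    return sn

# Columns 0-1 share structure below the diagonal and fuse into one supernode.
pattern = {0: [0, 1, 3], 1: [1, 3], 2: [2, 3], 3: [3]}
print(find_supernodes(pattern, 4))   # [(0, 1), (2, 3)]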

Nonsymmetric supernodes

[Figure: the original matrix A and the factors L+U, with columns 1-10 grouped into supernodes]

SuperLU speedup over unblocked code
- Matrices sorted in increasing "reuse ratio" = #flops / #nonzeros, a measure of arithmetic intensity
- Up to 40% of machine peak on large sparse matrices on the IBM RS6000/590 and MIPS R8000

Symmetric-pattern multifrontal factorization [John Gilbert's lecture]

[Figure: the matrix A, its graph G(A), and its elimination tree T(A) for the 3 x 3 grid example]

Symmetric-pattern multifrontal factorization

For each node of T(A), from leaves to root:
- Sum its own row/column of A with its children's update matrices into the frontal matrix
- Eliminate the current variable from the frontal matrix, to get the update matrix
- Pass the update matrix to its parent

Symmetric-pattern multifrontal factorization

Node 1: frontal matrix F1 = A1 on indices {1, 3, 7}; eliminating variable 1 yields update matrix U1 on {3, 7}.

Symmetric-pattern multifrontal factorization

Node 2: frontal matrix F2 = A2 on indices {2, 3, 9}; eliminating variable 2 yields update matrix U2 on {3, 9}.

Symmetric-pattern multifrontal factorization

Node 3: frontal matrix F3 = A3 + U1 + U2 on indices {3, 7, 8, 9}; eliminating variable 3 yields update matrix U3 on {7, 8, 9}.

Symmetric-pattern multifrontal factorization

[Figure: at completion, the filled graph G+(A), the factors L+U, and the elimination tree T(A)]

Symmetric-pattern multifrontal factorization: a variant of right-looking
- Really uses supernodes, not nodes
- All arithmetic happens on dense square matrices
- Needs extra memory for a stack of pending update matrices
- Potential parallelism:
  - between independent tree branches
  - parallel dense operations on the frontal matrix
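A minimal sketch of the extend-add (assembly) step at the heart of the method, applied to node 3 of the running example; the dense frontal and update matrices carry placeholder values, since the slides do not give numeric entries.

import numpy as np

def extend_add(F, F_idx, U, U_idx):
    """Sum update matrix U (global indices U_idx) into frontal matrix F
    (global indices F_idx) by matching index positions."""
    pos = {g: i for i, g in enumerate(F_idx)}   # global index -> local position in F
    loc = [pos[g] for g in U_idx]               # every index of U must appear in F
    F[np.ix_(loc, loc)] += U
    return F

# Node 3 of the example: F3 = A3 (+) U1 (+) U2 on indices {3, 7, 8, 9}.
F3, F3_idx = np.zeros((4, 4)), [3, 7, 8, 9]   # would first hold A's entries for node 3
U1, U1_idx = np.full((2, 2), 1.0), [3, 7]     # from child 1 (F1 on {1, 3, 7})
U2, U2_idx = np.full((2, 2), 2.0), [3, 9]     # from child 2 (F2 on {2, 3, 9})
extend_add(F3, F3_idx, U1, U1_idx)
extend_add(F3, F3_idx, U2, U2_idx)
print(F3[0, 0])   # 3.0: row/column 3 receives contributions from both children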

Sparse triangular solution

Forward substitution for x = L \ b (back substitution for x = U \ b).

Row-oriented = dot product = left-looking:

for i = 1 to n do
    x(i) = b(i)
    // dot product
    for j = 1 to i-1 do
        x(i) = x(i) - L(i, j) * x(j)
    end for
    x(i) = x(i) / L(i, i)
end for

Sparse triangular solution: x = L \ b

Column-oriented = saxpy = right-looking:

x(1:n) = b(1:n)
for j = 1 to n do
    x(j) = x(j) / L(j, j)
    // saxpy
    x(j+1:n) = x(j+1:n) - L(j+1:n, j) * x(j)
end for

Either way works in O(nnz(L)) time.
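The two loops translate directly into NumPy (dense here for clarity; traversing only the stored columns of a sparse L gives the O(nnz(L)) version):

import numpy as np

def solve_row_oriented(L, b):
    """Dot-product / left-looking forward substitution."""
    n = len(b)
    x = np.array(b, dtype=float)
    for i in range(n):
        x[i] -= L[i, :i] @ x[:i]      # dot product with already-computed entries
        x[i] /= L[i, i]
    return x

def solve_col_oriented(L, b):
    """Saxpy / right-looking forward substitution."""
    n = len(b)
    x = np.array(b, dtype=float)
    for j in range(n):
        x[j] /= L[j, j]
        x[j+1:] -= L[j+1:, j] * x[j]  # saxpy update to the trailing entries
    return x

L = np.tril(np.random.rand(6, 6)) + np.eye(6)
b = np.random.rand(6)
assert np.allclose(solve_row_oriented(L, b), np.linalg.solve(L, b))
assert np.allclose(solve_col_oriented(L, b), np.linalg.solve(L, b))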

Sparse right-hand side: x = L \ b, b sparse

Use a directed acyclic graph (DAG).

[Figure: a triangular matrix A and its directed graph G(A) on nodes 1-7]

- If A is triangular, G(A) has no cycles
- Lower triangular => edges directed from higher- to lower-numbered nodes
- Upper triangular => edges directed from lower- to higher-numbered nodes

Sparse right-hand side: x = L \ b, b sparse

b is sparse; x is also sparse, but may have fill-in.

[Figure: the system L x = b with sparse b, and the graph G(L^T) used to predict the structure of x]

- Symbolic: predict the structure of x by depth-first search from the nonzeros of b
- Numeric: compute the values of x in topological order
- Time = O(flops)
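A compact Python sketch of the two phases, under two assumptions of ours: L is stored in CSC with the diagonal entry first in each column, and b is given as a dictionary of its nonzeros. Production codes replace the recursion with an explicit stack.

def sparse_lower_solve(Lp, Li, Lx, n, b):
    """x = L \ b with sparse b. L in CSC (Lp: column pointers, Li: row
    indices, Lx: values, diagonal first in each column); b: {row: value}."""
    # Symbolic phase: DFS in G(L^T) from each nonzero of b predicts struct(x).
    post, seen = [], [False] * n
    def dfs(j):
        seen[j] = True
        for p in range(Lp[j] + 1, Lp[j + 1]):   # edges j -> i for L(i, j) != 0
            if not seen[Li[p]]:
                dfs(Li[p])
        post.append(j)                          # postorder
    for j in b:
        if not seen[j]:
            dfs(j)
    # Numeric phase: column-oriented solve restricted to struct(x),
    # visiting columns in topological order (= reverse postorder).
    x = dict(b)
    for j in reversed(post):
        xj = x.get(j, 0.0) / Lx[Lp[j]]          # cdiv: diagonal stored first
        x[j] = xj
        for p in range(Lp[j] + 1, Lp[j + 1]):
            x[Li[p]] = x.get(Li[p], 0.0) - Lx[p] * xj   # scatter saxpy

    return x

# L = [[1,0,0],[2,1,0],[0,3,1]], b = e_1: struct(x) = {1, 2, 3} by reachability.
Lp, Li, Lx = [0, 2, 4, 5], [0, 1, 1, 2, 2], [1.0, 2.0, 1.0, 3.0, 1.0]
print(sparse_lower_solve(Lp, Li, Lx, 3, {0: 1.0}))   # {0: 1.0, 1: -2.0, 2: 6.0}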

Recall: left-looking sparse LU with a sparse right-hand side

for j = 1 to n do
    // a sparse right-hand side triangular solve: U(1:j-1, j) = L(1:j-1, 1:j-1) \ A(1:j-1, j)
    for k in struct(U(1:j-1, j)) do
        cmod(j, k)
    end for
    cdiv(j)
end for

This sparse right-hand side solve is what symbolic factorization uses to find the nonzeros in column j.

References
- M.T. Heath, E. Ng, and B.W. Peyton, "Parallel Algorithms for Sparse Linear Systems", SIAM Review, Vol. 33 (3), pp. 420-460, 1991.
- E. Rothberg and A. Gupta, "Efficient Sparse Matrix Factorization on High-Performance Workstations: Exploiting the Memory Hierarchy", ACM Trans. Math. Software, Vol. 17 (3), pp. 313-334, 1991.
- E. Rothberg and A. Gupta, "An Efficient Block-Oriented Approach to Parallel Sparse Cholesky Factorization", SIAM J. Sci. Comput., Vol. 15 (6), pp. 1413-1439, 1994.
- J.W. Demmel, S.C. Eisenstat, J.R. Gilbert, X.S. Li, and J.W.H. Liu, "A Supernodal Approach to Sparse Partial Pivoting", SIAM J. Matrix Analysis and Applications, Vol. 20 (3), pp. 720-755, 1999.
- J.W.H. Liu, "The Multifrontal Method for Sparse Matrix Solution: Theory and Practice", SIAM Review, Vol. 34 (1), pp. 82-109, 1992.

Exercises
- Study and run the OpenMP code for dense LU factorization in the Hands-On-Exercises/ directory.