
1 Sparse LA Sathish Vadhiyar

2 Motivation  Sparse computations are much more challenging than dense ones due to complex data structures and irregular memory references  Many physical systems produce sparse matrices  Most of the research, and the base case here, concerns sparse symmetric positive definite matrices

3 Sparse Cholesky  To solve Ax = b:  A = LL^T; Ly = b; L^T x = y  Cholesky factorization introduces fill-in

4 Column-oriented left-looking Cholesky

5 Fill-in [figure: a 10-vertex graph before and after elimination, with fill edges marked] Fill: new nonzeros in the factor

6 Permutation Matrix or Ordering  Ordering is used to reduce fill or to enhance numerical stability  Choose a permutation matrix P so that the Cholesky factor L’ of PAP^T has less fill than L  Triangular solves become: L’y = Pb; L’^T z = y; x = P^T z  The fill can be predicted in advance  Hence a static data structure can be used – symbolic factorization

7 Steps  Ordering: find a permutation P of matrix A  Symbolic factorization: set up a data structure for the Cholesky factor L of PAP^T  Numerical factorization: decompose PAP^T into LL^T  Triangular system solution: Ly = Pb; L^T z = y; x = P^T z
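
As a rough illustration of these four steps (not of the parallel algorithms discussed later), here is a minimal Python sketch. It uses SciPy's reverse Cuthill-McKee ordering as a stand-in for a fill-reducing ordering and a dense Cholesky factor, so the symbolic factorization step is implicit; the toy matrix and the function name are illustrative only.

# Sketch of the four solve steps: ordering, (implicit) symbolic factorization,
# numerical factorization, and triangular solves. Uses reverse Cuthill-McKee
# as a stand-in fill-reducing ordering and a dense Cholesky for simplicity.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee
from scipy.linalg import solve_triangular

def sparse_spd_solve(A, b):
    # Step 1: ordering -- permutation P of A (here: RCM on the sparsity graph)
    perm = reverse_cuthill_mckee(sp.csr_matrix(A), symmetric_mode=True)
    PAPt = A[np.ix_(perm, perm)]
    # Steps 2-3: symbolic + numerical factorization (dense here, so the
    # structure prediction is skipped): PAP^T = L L^T
    L = np.linalg.cholesky(PAPt)
    # Step 4: triangular solves  L y = P b,  L^T z = y,  x = P^T z
    y = solve_triangular(L, b[perm], lower=True)
    z = solve_triangular(L.T, y, lower=False)
    x = np.empty_like(z)
    x[perm] = z
    return x

# Toy example: SPD matrix from a 1-D Laplacian
n = 6
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1) + 0.1 * np.eye(n)
b = np.ones(n)
print(np.allclose(A @ sparse_spd_solve(A, b), b))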

8 Sparse Matrices and Graph Theory [figure: nonzero structure of a 7x7 matrix at successive elimination steps, together with its graph G(A)]

9 Sparse and Graph [figure: the remaining elimination steps and the filled graph F(A)]

10 Ordering  The above order of elimination is the “natural” one  The first heuristic is minimum degree ordering  Simple and effective  But efficiency depends on the tie-breaking strategy

11 Minimum degree ordering for the previous matrix [figure: the 7-vertex graph] Ordering – {2,4,5,7,3,1,6} No fill-in!
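
A minimal sketch of the greedy minimum degree rule (eliminate a vertex of smallest current degree, then add fill edges among its neighbours). The tie-breaking rule and the example graph below are arbitrary choices, not the slide's.

# Minimal minimum-degree ordering sketch: repeatedly eliminate a vertex of
# smallest current degree, adding fill edges (a clique) among its neighbours.
# Ties are broken by smallest vertex label -- just one possible strategy.
def minimum_degree_order(adj):
    # adj: dict mapping vertex -> set of neighbour vertices (undirected graph)
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    order = []
    while adj:
        v = min(adj, key=lambda u: (len(adj[u]), u))   # min degree, tie-break on label
        nbrs = adj.pop(v)
        for u in nbrs:                                  # connect neighbours: fill edges
            adj[u].discard(v)
            adj[u] |= (nbrs - {u})
        order.append(v)
    return order

# 7-vertex example graph (illustrative, not necessarily the slide's matrix)
g = {1: {3, 6}, 2: {3}, 3: {1, 2, 7}, 4: {7}, 5: {6}, 6: {1, 5}, 7: {3, 4}}
print(minimum_degree_order(g))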

12 Ordering  Another ordering is nested dissection (divide-and-conquer)  Find a separator S of nodes whose removal (along with incident edges) divides the graph into 2 disjoint pieces  Variables in each piece are numbered contiguously and variables in S are numbered last  Leads to a bordered block-diagonal non-zero pattern  Can be applied recursively

13 Nested Dissection Illustration [figure: 5x5 grid graph, nodes 1–25, with separator S]

14 Nested Dissection Illustration [figure: the same grid renumbered by nested dissection – nodes 1–10 in one half, 11–20 in the other, separator S numbered 21–25 last]
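
A toy sketch of the recursive nested dissection numbering from slide 12, assuming the graph is a regular grid so the middle grid line can serve as the separator; a general implementation would use a graph partitioner instead.

# Toy nested dissection numbering for an m x m grid: recursively pick the
# middle grid line as separator S, number the two halves first (recursively),
# and number S last.
def nested_dissection_grid(rows, cols):
    # rows, cols: lists of grid indices defining the current subgrid
    if len(rows) <= 1 or len(cols) <= 1:
        return [(r, c) for r in rows for c in cols]
    if len(cols) >= len(rows):                 # split along the longer dimension
        mid = len(cols) // 2
        left, sep, right = cols[:mid], [cols[mid]], cols[mid + 1:]
        return (nested_dissection_grid(rows, left)
                + nested_dissection_grid(rows, right)
                + [(r, c) for r in rows for c in sep])
    else:
        mid = len(rows) // 2
        top, sep, bottom = rows[:mid], [rows[mid]], rows[mid + 1:]
        return (nested_dissection_grid(top, cols)
                + nested_dissection_grid(bottom, cols)
                + [(r, c) for r in sep for c in cols])

# 5x5 grid: prints grid points in nested-dissection elimination order
order = nested_dissection_grid(list(range(5)), list(range(5)))
numbering = {pt: i + 1 for i, pt in enumerate(order)}
for r in range(5):
    print(" ".join(f"{numbering[(r, c)]:2d}" for c in range(5)))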

15 Symbolic factorization  Can simulate the numerical factorization  Struct(M_i*) := {k < i | m_ik ≠ 0}, Struct(M_*j) := {k > j | m_kj ≠ 0}  p(j) := min{i ∈ Struct(L_*j)} if Struct(L_*j) ≠ ∅, p(j) := j otherwise  Struct(L_*j) ⊆ Struct(L_*p(j)) ∪ {p(j)}  Struct(L_*j) := Struct(A_*j) ∪ (∪_{i<j, p(i)=j} Struct(L_*i)) – {j}

16 Symbolic Factorization
for j := 1 to n do
    R_j := ∅
for j := 1 to n do
    S := Struct(A_*j)
    for i ∈ R_j do
        S := S ∪ Struct(L_*i) – {j}
    Struct(L_*j) := S
    if Struct(L_*j) ≠ ∅ then
        p(j) := min{i ∈ Struct(L_*j)}
        R_p(j) := R_p(j) ∪ {j}
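
The loop above transcribes directly into Python if the column structures of A are available as sets; the small example structure below is made up for illustration.

# Transcription of the symbolic factorization loop on the slide.
# struct_A[j] is Struct(A_*j): the set of row indices i > j with a_ij != 0.
# Returns struct_L[j] = Struct(L_*j) and the parent function p(j).
def symbolic_factorization(struct_A, n):
    R = {j: set() for j in range(n)}         # R[j]: columns whose parent is j
    struct_L = {}
    p = {}
    for j in range(n):
        S = set(struct_A[j])
        for i in R[j]:
            S |= struct_L[i] - {j}
        struct_L[j] = S
        if S:
            p[j] = min(S)
            R[p[j]].add(j)
        else:
            p[j] = j
    return struct_L, p

# Small example (0-based columns); structure chosen only for illustration
struct_A = {0: {3}, 1: {2, 4}, 2: {4}, 3: {4}, 4: set()}
L_struct, parent = symbolic_factorization(struct_A, 5)
print(L_struct, parent)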

17 Numerical Factorization  cmod(j, k): modification of column j by column k, k < j  cdiv(j): division of column j by a scalar
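
A dense-matrix sketch of cmod and cdiv, and of the column-oriented left-looking Cholesky (slide 4) built from them; a true sparse code would touch only the nonzero entries.

import numpy as np

# cmod(j, k): subtract from column j the contribution of the already-computed
# factor column k (k < j). cdiv(j): scale column j by the square root of its
# diagonal. Dense sketch; a sparse code visits only nonzero entries.
def cmod(L, j, k):
    L[j:, j] -= L[j, k] * L[j:, k]

def cdiv(L, j):
    L[j, j] = np.sqrt(L[j, j])
    L[j + 1:, j] /= L[j, j]

def left_looking_cholesky(A):
    n = A.shape[0]
    L = np.tril(A).astype(float)        # work on the lower triangle
    for j in range(n):                  # left-looking: gather all updates, then divide
        for k in range(j):
            cmod(L, j, k)
        cdiv(L, j)
    return L

A = np.array([[4., 2., 0.], [2., 5., 1.], [0., 1., 3.]])
L = left_looking_cholesky(A)
print(np.allclose(L @ L.T, A))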

18 Algorithms

19 Elimination Tree  T(A) has an edge between two vertices i and j, with i > j, if i = p(j); i is the parent of j [figure: elimination tree on 10 vertices]
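
Equivalently, the parent of column j is the smallest below-diagonal row index in column j of L. A small sketch, assuming the factor structure is given as sets (e.g. from the symbolic factorization above):

# Elimination tree from the structure of L: parent(j) = min{i > j : l_ij != 0},
# matching the slide's definition i = p(j). struct_L[j] holds the below-diagonal
# row indices of column j.
def elimination_tree(struct_L, n):
    parent = {}
    for j in range(n):
        below = [i for i in struct_L[j] if i > j]
        parent[j] = min(below) if below else None    # None marks a root
    return parent

struct_L = {0: {3}, 1: {2, 4}, 2: {4}, 3: {4}, 4: set()}
print(elimination_tree(struct_L, 5))   # {0: 3, 1: 2, 2: 4, 3: 4, 4: None}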

20 Supernode  A set of contiguous columns in the Cholesky factor L that share essentially the same sparsity structure  The set of contiguous columns j, j+1, …, j+t constitutes a supernode if Struct(L_*k) = Struct(L_*(k+1)) ∪ {k+1} for j <= k <= j+t-1  Columns in the same supernode can be treated as a unit  Used for enhancing the efficiency of minimum degree ordering and symbolic factorization
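
A small sketch that groups consecutive columns into supernodes by checking the condition above; the example structure is illustrative.

# Group consecutive columns into supernodes: column k+1 can join the supernode
# of column k when Struct(L_*k) = Struct(L_*(k+1)) U {k+1}.
def find_supernodes(struct_L, n):
    supernodes = [[0]]
    for k in range(n - 1):
        if struct_L[k] == struct_L[k + 1] | {k + 1}:
            supernodes[-1].append(k + 1)      # extend current supernode
        else:
            supernodes.append([k + 1])        # start a new supernode
    return supernodes

struct_L = {0: {1, 3, 4}, 1: {3, 4}, 2: {3, 4}, 3: {4}, 4: set()}
print(find_supernodes(struct_L, 5))   # [[0, 1], [2, 3, 4]]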

21 Parallelization of Sparse Cholesky  Most of the parallel algorithms are based on elimination trees  Work associated with two disjoint subtrees can proceed independently  Same steps as sequential sparse factorization  One additional step: assignment of tasks to processors

22 Ordering  2 issues: ordering in parallel, and finding an ordering that will help parallelization in the subsequent steps

23 Ordering in Parallel – Nested dissection  Nested dissection can be carried out in parallel  Also leads to elimination trees that can be parallelized during subsequent factorizations  But parallelism appears only in the later levels of dissection  Can be applied to only a limited class of problems  More later….

24 Ordering for Parallel Factorization [figure: 7x7 matrix in natural order and its elimination tree, which is a chain] No fill, but no scope for parallelization. No agreed objective for ordering for parallel factorization: not all orderings that reduce fill-in provide scope for parallelization

25 Example (Contd..) [figure: the same matrix in nested dissection order and its elimination tree, now balanced] Fill, but scope for parallelization

26 Ordering for parallel factorization – Tree restructuring  Decouple fill-reducing ordering and ordering for parallel elimination  Determine a fill-reducing ordering P of G(A)  Form the elimination tree T(PAP^T)  Transform this tree T(PAP^T) to one with smaller height and record the corresponding equivalent reordering, P’

27 Ordering for parallel factorization – Tree restructuring  Efficiency depends on whether such an equivalent reordering can be found  Also on the limitations of the initial ordering, P  Only minor modifications to the initial ordering, hence only limited improvement in parallelism  Algorithm by Liu (1989) based on elimination tree rotations to reduce the height  Algorithm by Jess and Kees (1982) based on chordal graphs to reduce the height

28 Height and Parallel Completion Time  Not all elimination trees with minimum height give rise to small parallel completion times  Associate each node v of the elimination tree with a pair (time[v], level[v])  time[v] – time for factorization of column v  level[v] = time[v] if v is the root of the elimination tree, time[v] + level[parent of v] otherwise  level[v] represents the minimum time to completion starting at node v  Parallel completion time – maximum level value among all nodes

29 Height and Parallel Completion Time [figure: two elimination trees on nodes a–i with (time, level) pairs at each node, comparing parallel completion times]

30 Minimization of Cost  Thus some recent algorithms (Weng-Yang Lin, J. of Supercomputing, 2003) pick at each step the nodes with the minimum cost (greedy approach)

31 Nested Dissection Algorithms  Use a graph partitioning heuristic to obtain a small edge separator of the graph  Transform the small edge separator into a small node separator  Number nodes of the separator last and recursively apply

32 Algorithm 1 - Level Structures  d(x, y) – distance between x and y  Eccentricity: ε(x) = max_{y ∈ X} d(x, y)  Diameter: δ(G) = maximum of the eccentricities  Peripheral node: x in X with ε(x) = δ(G)  Level structure: a partitioning L = {L_0, …, L_l} such that Adj(L_i) ⊆ L_{i-1} ∪ L_{i+1}

33 Example [figure: an 8-vertex graph and the level structure rooted at node 6]

34 Breadth First Search  One way of finding level structures is by BFS starting with a peripheral node  Finding a peripheral node is expensive; hence settle for a pseudo-peripheral node

35 Pseudo-peripheral node
1. Pick an arbitrary node r in X
2. Generate a level structure rooted at r with ε(r) levels
3. Choose a node x in the last level with minimum degree
4. Generate a level structure rooted at x
5. If ε(x) > ε(r), set r = x and go to step 3; else x is the pseudo-peripheral node
[figure: successive level structures generated during the search]
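
A sketch of the level-structure BFS and the pseudo-peripheral search following these steps; the graph is represented as a dict of adjacency sets, and the path-graph example is illustrative.

# BFS level structure rooted at r: returns a list of levels L_0, L_1, ...
def level_structure(adj, r):
    levels, seen, frontier = [], {r}, [r]
    while frontier:
        levels.append(frontier)
        nxt = []
        for v in frontier:
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    nxt.append(w)
        frontier = nxt
    return levels

# Pseudo-peripheral node (the slide's steps): restart BFS from a minimum-degree
# node of the deepest level until the eccentricity stops growing.
def pseudo_peripheral(adj, r):
    levels = level_structure(adj, r)
    while True:
        x = min(levels[-1], key=lambda v: len(adj[v]))   # min degree in last level
        x_levels = level_structure(adj, x)
        if len(x_levels) > len(levels):
            levels = x_levels
        else:
            return x

g = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}   # a path graph
print(pseudo_peripheral(g, 3))                           # one end of the path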

36 ND Heuristic based on BFS 1. Construct a level structure with l levels 2. Form separator S of the nodes in level (l+1)/2 3. Recursively apply

37

38 Example [figure: graph partitioned into pieces A and B, with boundary vertices a and b]

39

40

41 K-L for ND  Form a random initial partition  Form an edge separator by applying K-L to form partitions P1 and P2  Let V1 be the set of nodes in P1 incident on at least one edge in the separator set; similarly V2  V1 ∪ V2 (wide node separator)  V1 or V2 (narrow node separator) – Gilbert and Zmijewski (1987)

42 Step 2: Mapping Problems onto processors  Based on elimination trees  But elimination trees are determined from the structure of L, which is produced by symbolic factorization (step 3) – a bootstrapping problem!  Efficient algorithms exist to find elimination trees from the structure of A  Parallel calculation of the elimination tree by Zmijewski and Gilbert, where each processor computes a “local” version of the elimination tree and then the local versions are combined  Various strategies to map columns to processors based on elimination trees  Strategy 1: Successive levels in the elimination tree are wrap-mapped onto processors  Strategy 2: Subtree-to-Subcube  Strategy 3: Bin-Pack by Geist and Ng

43 Strategy 1 [figure: elimination tree with successive levels wrap-mapped onto processors 0–3]

44 Strategy 2 – Subtree-to-subcube mapping  Select an appropriate set of P subtrees of the elimination tree, say T0, T1, …  Assign the columns corresponding to Ti to processor Pi  Where two subtrees merge into a single subtree, their processor sets are merged together and wrap-mapped onto the nodes/columns of the separator that begins at that point  The root separator is wrap-mapped onto the set of all processors

45 Strategy 2 [figure: elimination tree with subtrees assigned to processors 0–3 and separator columns wrap-mapped onto merged processor sets]

46 Strategy 3: Bin-Pack (Geist and Ng)  Try to find disjoint subtrees  Map the subtrees to p bins based on a first-fit-decreasing bin-packing heuristic: subtrees are processed in decreasing order of workload, and a subtree is packed into the currently lightest bin  Weight imbalance α – ratio between lightest and heaviest bin  If α >= a user-specified tolerance γ, stop  Else split the heaviest subtree into its subtrees and repack the p bins using bin-packing again  Repeat until α >= γ or the largest subtree cannot be split further  Load balance is thus based on the user-specified tolerance  The remaining nodes, from the roots of the subtrees to the root of the tree, are wrap-mapped
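
A sketch of the initial packing step under the stated rule (largest remaining subtree into the currently lightest bin); the refinement loop that splits the heaviest subtree and repacks is omitted, and the workloads below are made up.

# First-fit-decreasing packing of subtree workloads into p bins, as described
# on the slide. Returns the bin assignment and the imbalance ratio alpha
# (lightest / heaviest).
def bin_pack_subtrees(workloads, p):
    # workloads: dict subtree_id -> work estimate
    bins = [{"load": 0.0, "subtrees": []} for _ in range(p)]
    for tid, w in sorted(workloads.items(), key=lambda kv: -kv[1]):  # decreasing work
        lightest = min(bins, key=lambda b: b["load"])                # current lightest bin
        lightest["subtrees"].append(tid)
        lightest["load"] += w
    loads = [b["load"] for b in bins]
    alpha = min(loads) / max(loads) if max(loads) > 0 else 1.0
    return bins, alpha

subtrees = {"T0": 40, "T1": 25, "T2": 20, "T3": 10, "T4": 5}
bins, alpha = bin_pack_subtrees(subtrees, 2)
print([b["subtrees"] for b in bins], round(alpha, 2))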

47 Parallel symbolic factorization  Sequential symbolic factorization is very efficient  Not able to achieve good speedups with parallel versions – limited parallelism, small task sizes, high communication overhead  Mapping strategies typically wrap-map processors to the columns of the same supernode – hence more storage and work than the sequential version  For the supernodal structure, only the processor holding the 1st column of a supernode calculates the structure  The other processors holding columns of the supernode simply retrieve the structure from the processor holding the first column

48 Parallel Numerical Factorization – Submatrix Cholesky  Tsub(k) – the set of column modifications cmod(j, k) made by column k  Tsub(k) is partitioned into subtasks Tsub(k,1), …, Tsub(k,P) where Tsub(k,p) := {cmod(j,k) | j ∈ Struct(L_*k) ∩ mycols(p)}

49 Definitions  mycols(p) – set of columns owned by processor p  map[k] – processor containing column k  procs(L_*k) = {map[j] | j ∈ Struct(L_*k)}

50 Parallel Submatrix Cholesky
for j ∈ mycols(p) do
    if j is a leaf node in T(A) do
        cdiv(j)
        send L_*j to the processors in procs(L_*j)
        mycols(p) := mycols(p) – {j}
while mycols(p) ≠ ∅ do
    receive any column of L, say L_*k
    for j ∈ Struct(L_*k) ∩ mycols(p) do
        cmod(j, k)
        if column j requires no more cmod's do
            cdiv(j)
            send L_*j to the processors in procs(L_*j)
            mycols(p) := mycols(p) – {j}
Disadvantage: communication is not localized

51 Parallel Numerical Factorization – Subcolumn Cholesky  Tcol(j) is partitioned into subtasks Tcol(j,1), …, Tcol(j,P), where Tcol(j,p) aggregates into a single update vector every update vector u(j,k) for which k ∈ Struct(L_j*) ∩ mycols(p)

52 Definitions  mycols(p) – set of columns owned by processor p  map[k] – processor containing column k  procs(L_j*) = {map[k] | k ∈ Struct(L_j*)}  u(j, k) – scaled column accumulated into the factor column by cmod(j, k)

53 Parallel Subcolumn Cholesky
for j := 1 to n do
    if j ∈ mycols(p) or Struct(L_j*) ∩ mycols(p) ≠ ∅ do
        u := 0
        for k ∈ Struct(L_j*) ∩ mycols(p) do
            u := u + u(j,k)
        if map[j] ≠ p do
            send u to processor q = map[j]
        else
            incorporate u into the factor column j
            while any aggregated update column for column j remains unreceived do
                receive in u another aggregated update column for column j
                incorporate u into the factor column j
            cdiv(j)
Has more uniform and less communication than the submatrix version
The difference is due to the access patterns of Struct(L_j*) and Struct(L_*j)

54 A refined version – compute-ahead fan-in  The previous version can lead to processor idling while waiting for the aggregates for updating column j  Updating column j can be interleaved with compute-ahead tasks: 1. Aggregate u(i, k) for i > j for each completed column k in Struct(L_i*) ∩ mycols(p) 2. Receive an aggregated update column for i > j and incorporate it into factor column i

55 Triangular Solve: Parallel Forward and Back Substitution (Anshul Gupta, Vipin Kumar – Supercomputing ’95)

56 Forward Substitution  Computation starts with the leaf supernodes of the elimination tree  The portion of L corresponding to a supernode is a dense trapezoid of width t and height n, where t is the number of nodes/columns in the supernode and n is the number of non-zeros in the leftmost column of the supernode

57 Forward Substitution - Example

58 Steps at a Supernode
1. Initial processing: a vector rhs of size n is formed
   1. The first t elements correspond to the elements of the RHS vector with the same indices as the nodes of the supernode
   2. The remaining n-t elements are filled with 0's
2. Computation
   1. Solve the dense triangular system at the top of the trapezoid in the supernode
   2. Form updates corresponding to the remaining n-t rows of the supernode
      1. Vector x – product of the bottom (n-t)×t submatrix of L with the size-t vector of solutions from the triangular solve
      2. Subtract x from the bottom n-t elements of rhs
   3. Add the bottom n-t elements of rhs to the corresponding (same index) entries of rhs at the parent supernode
Step 2.1 at any supernode can begin only after contributions from all its children
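
A NumPy sketch of one supernode's share of the forward substitution, assuming its dense trapezoidal panel and its (already child-updated) rhs vector are given; the toy panel is illustrative.

import numpy as np
from scipy.linalg import solve_triangular

# One supernode's share of forward substitution (the slide's steps):
# L_panel is the dense trapezoid for the supernode, an n x t block whose top
# t x t part is lower triangular. rhs is the length-n vector described above
# (first t entries from b, remaining n-t entries zero plus updates already
# received from children).
def supernode_forward(L_panel, rhs):
    t = L_panel.shape[1]
    y_top = solve_triangular(L_panel[:t, :], rhs[:t], lower=True)  # dense triangular solve
    update = rhs[t:] - L_panel[t:, :] @ y_top                      # contribution to ancestors
    return y_top, update   # 'update' is added into the parent supernode's rhs entries

# Toy supernode: t = 2 columns, n = 4 nonzero rows
L_panel = np.array([[2.0, 0.0],
                    [1.0, 3.0],
                    [0.5, 1.0],
                    [0.2, 0.4]])
y, upd = supernode_forward(L_panel, np.array([4.0, 7.0, 0.0, 0.0]))
print(y, upd)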

59 Parallelization  For levels >= log P, the above steps are performed sequentially on a single processor  For a supernode at level l with 0 <= l < log P, the above computation steps are performed in parallel on p/2^l processors  A pipelined or wavefront algorithm is used

60 Partitioning 1. Assuming unlimited parallelism 2. At a single time step, only t processors are used 3. At a single time step, only one block per row and one block per column are active 4. Might as well use 1-D block cyclic

61 1-D block cyclic along rows

62 Sparse Iterative Methods

63 Iterative & Direct Methods – Pros and Cons  Iterative methods give only approximate results  Convergence cannot be predicted  But absolutely no fill

64 Parallel Jacobi, Gauss-Seidel, SOR  For problems with grid structure (1-D, 2-D, etc.), Jacobi is easily parallelizable  Gauss-Seidel and SOR need the most recent values, hence an ordering of updates and sequencing among processors  But Gauss-Seidel and SOR can be parallelized using red-black (checkerboard) ordering

65 2D Grid example [figure: 4×4 grid with natural ordering, nodes numbered 1–16 column by column]

66 Red-Black Ordering  Color alternate nodes in each dimension red and black  Number red nodes first and then black nodes  Red nodes can be updated simultaneously followed by simultaneous black nodes updates

67 2D Grid example – Red-Black Ordering [figure: the same 4×4 grid with red nodes numbered 1–8 and black nodes 9–16]  In general, reordering can affect convergence
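
A sketch of one red-black Gauss-Seidel sweep for the 5-point Poisson problem on a square grid; each colour's half-sweep is written as a vectorised update to emphasise that all points of one colour can be updated simultaneously. Grid size and right-hand side are illustrative.

import numpy as np

# One red-black Gauss-Seidel sweep for the 2-D 5-point Poisson problem
# -Laplace(u) = f on an m x m grid with zero boundary values. All points of
# one colour depend only on the other colour, so each half-sweep is parallel.
def red_black_sweep(u, f, h):
    m = u.shape[0]
    ii, jj = np.meshgrid(np.arange(m), np.arange(m), indexing="ij")
    interior = (ii > 0) & (ii < m - 1) & (jj > 0) & (jj < m - 1)
    for colour in (0, 1):                      # 0 = red points, 1 = black points
        mask = interior & (((ii + jj) % 2) == colour)
        nbr_sum = np.zeros_like(u)
        nbr_sum[1:-1, 1:-1] = (u[:-2, 1:-1] + u[2:, 1:-1] +
                               u[1:-1, :-2] + u[1:-1, 2:])
        u[mask] = (nbr_sum[mask] + h * h * f[mask]) / 4.0
    return u

m, h = 17, 1.0 / 16
u = np.zeros((m, m))
f = np.ones((m, m))
for _ in range(200):
    u = red_black_sweep(u, f, h)
print(u[m // 2, m // 2])   # approximate value at the grid centre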

68 Multi-Color orderings  In general, multi-color orderings are used for an arbitrary graph  The ordering can reduce the convergence rate, but it exposes more parallelism  Need to strike a balance  Multi-color orderings can also be used for pre-conditioned CG

69 Pre-conditioned CG  Instead of solving Ax = b  Solve A’x’ = b’ where A’ = C^-1 A C^-1, x’ = Cx, b’ = C^-1 b to improve convergence  M = C^2 is called the pre-conditioner

70 Incomplete Cholesky Preconditioner  M = HH^T where H is the “incomplete” Cholesky factor of A  One way of forming an incomplete Cholesky factor – set h_ij = 0 wherever a_ij = 0

71 Pre-Conditioned CG
k = 0; r_0 = b – Ax_0
while (r_k ≠ 0)
    Solve M z_k = r_k   (2 triangular solves – parallelization is not straightforward)
    k = k+1
    if k = 1
        p_1 = z_0
    else
        β_k = r_{k-1}^T z_{k-1} / r_{k-2}^T z_{k-2}
        p_k = z_{k-1} + β_k p_{k-1}
    end
    α_k = r_{k-1}^T z_{k-1} / p_k^T A p_k
    x_k = x_{k-1} + α_k p_k
    r_k = r_{k-1} – α_k A p_k
end
x = x_k
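
The loop transcribes into Python as follows; the preconditioner solve M z = r is passed in as a callable, and a simple diagonal (Jacobi) preconditioner stands in for incomplete Cholesky in the toy example.

import numpy as np

# Transcription of the preconditioned CG loop above. 'apply_A' multiplies by A,
# 'solve_M' applies the preconditioner (solves M z = r, e.g. two triangular
# solves with the incomplete Cholesky factor H).
def pcg(apply_A, solve_M, b, x0, tol=1e-10, max_iter=500):
    x = x0.copy()
    r = b - apply_A(x)
    p = None
    rz_old = None
    for k in range(1, max_iter + 1):
        if np.linalg.norm(r) < tol:
            break
        z = solve_M(r)
        rz = r @ z
        p = z if k == 1 else z + (rz / rz_old) * p     # beta_k = r'z / previous r'z
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rz_old = rz
    return x

# Toy use: Jacobi (diagonal) preconditioner standing in for incomplete Cholesky
A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])
b = np.array([1., 2., 3.])
x = pcg(lambda v: A @ v, lambda r: r / np.diag(A), b, np.zeros(3))
print(np.allclose(A @ x, b))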

72 Graph Coloring  Graph-colored ordering for parallel computation of Gauss-Seidel and for applying incomplete Cholesky preconditioners  It was shown (Schreiber and Tang) that the minimum number of parallel steps in the triangular solve is given by the chromatic number of the graph of the symmetric matrix  Thus a permutation matrix P is formed from the graph-color ordering  Incomplete Cholesky is applied to PAP^T  Unknowns corresponding to nodes of the same color are solved in parallel; computation proceeds in steps

73 Parallel Triangular Solve based on Multi-Coloring  Triangular solve Ly = b has 2 kinds of steps  b_w = b_w – L_wv y_v (corresponds to traversing the edge (v, w))  y_w = b_w / L_ww (corresponds to visiting vertex w)  The steps can be done in parallel for all v with the same color  Thus the parallel triangular solve proceeds in a number of steps equal to the number of colors [figure: 10-vertex graph with each vertex's original order and new color-based order]

74 Graph Coloring Problem  Given G(A) = (V, E)  σ: V → {1, 2, …, s} is an s-coloring of G if σ(i) ≠ σ(j) for every edge (i, j) in E  The minimum possible value of s is the chromatic number of G  The graph coloring problem is to color the nodes with the chromatic number of colors  NP-complete problem

75 Heuristics – Greedy Heuristic 1. Compute a vertex ordering {v_1, …, v_n} for V 2. For i = 1 to n, set σ(v_i) equal to the smallest available consistent color  How to do step 1? [figure: two orderings of a 5-vertex graph – one is non-optimal and leads to more colors; hence step 1 is important]

76 Heuristics – Saturation Degree Ordering  Suppose {v_1, …, v_{i-1}} have been chosen  Choose v_i such that v_i is adjacent to the maximum number of different colors in {v_1, …, v_{i-1}}
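
A sketch of the greedy coloring (slide 75) together with a saturation-degree variant that chooses the next vertex on the fly; the example graph is arbitrary.

# Greedy colouring: colour vertices in a given order with the smallest colour
# not used by an already-coloured neighbour (slide 75, step 2).
def greedy_color(adj, order):
    colour = {}
    for v in order:
        used = {colour[w] for w in adj[v] if w in colour}
        c = 0
        while c in used:
            c += 1
        colour[v] = c
    return colour

# Saturation-degree ordering (slide 76): the next vertex is the one adjacent to
# the largest number of distinct colours among already-coloured vertices.
def saturation_order_color(adj):
    colour = {}
    while len(colour) < len(adj):
        v = max((u for u in adj if u not in colour),
                key=lambda u: len({colour[w] for w in adj[u] if w in colour}))
        used = {colour[w] for w in adj[v] if w in colour}
        c = 0
        while c in used:
            c += 1
        colour[v] = c
    return colour

g = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 5}, 4: {2, 5}, 5: {3, 4}}
print(greedy_color(g, sorted(g)))
print(saturation_order_color(g))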

77 Parallel graph Coloring – General algorithm

78 Parallel Graph Coloring – Finding Maximal Independent Sets – Luby (1986)
I = ∅; V’ = V; G’ = G
while G’ ≠ empty
    Choose an independent set I’ in G’
    I = I ∪ I’; X = I’ ∪ N(I’)   (N(I’) – vertices adjacent to I’)
    V’ = V’ \ X; G’ = G(V’)
end
For choosing the independent set I’ (Monte Carlo heuristic):
1. For each vertex v in V’, determine a distinct random number p(v)
2. v ∈ I’ iff p(v) > p(w) for every w in adj(v)
Color each MIS a different color
Disadvantage: each new choice of random numbers requires a global synchronization of the processors
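
A sequential simulation of this scheme: each round draws random numbers, extracts a maximal independent set, and gives it a fresh colour. In a real parallel code the random draws and comparisons happen concurrently, with the global synchronization noted above at the end of each round.

import random

# Sequential simulation of Luby's Monte Carlo step: each remaining vertex draws
# a random number; a vertex joins the independent set I' if its number beats
# all remaining neighbours. Each maximal independent set receives a new colour.
def luby_coloring(adj, seed=0):
    rng = random.Random(seed)
    remaining = set(adj)
    colour, current = {}, 0
    while remaining:
        # build one maximal independent set I for the remaining graph
        I, active = set(), set(remaining)
        while active:
            p = {v: rng.random() for v in active}
            winners = {v for v in active
                       if all(p[v] > p[w] for w in adj[v] if w in active)}
            I |= winners
            # remove winners and their neighbours, then repeat on what is left
            removed = winners | {w for v in winners for w in adj[v]}
            active -= removed
        for v in I:
            colour[v] = current
        remaining -= I
        current += 1
    return colour

g = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 5}, 4: {2, 5}, 5: {3, 4}}
print(luby_coloring(g))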

79 Parallel Graph Coloring – Gebremedhin and Manne (2003) Pseudo-Coloring

80 References in Graph Coloring  M. Luby. A simple parallel algorithm for the maximal independent set problem. SIAM Journal on Computing, 15(4):1036-1054, 1986.  M.T. Jones, P.E. Plassmann. A parallel graph coloring heuristic. SIAM Journal on Scientific Computing, 14(3):654-669, May 1993.  L.V. Kale, B.H. Richards, T.D. Allen. Efficient Parallel Graph Coloring with Prioritization. Lecture Notes in Computer Science, vol. 1068, August 1995, pp. 190-208. Springer-Verlag.  A.H. Gebremedhin, F. Manne. Scalable parallel graph coloring algorithms. Concurrency: Practice and Experience, 12 (2000) 1131-1146.  A.H. Gebremedhin, I.G. Lassous, J. Gustedt, J.A. Telle. Graph coloring on coarse grained multicomputers. Discrete Applied Mathematics, 131(1):179-198, 6 September 2003.

81 References  M.T. Heath, E. Ng, B.W. Peyton. Parallel Algorithms for Sparse Linear Systems. SIAM Review. Vol. 33, No. 3, pp. 420-460, September 1991.  A. George, J.W.H. Liu. The Evolution of the Minimum Degree Ordering Algorithm. SIAM Review. Vol. 31, No. 1, pp. 1-19, March 1989.  J. W. H. Liu. Reordering sparse matrices for parallel elimination. Parallel Computing 11 (1989) 73-91

82 References  Anshul Gupta, Vipin Kumar. Parallel algorithms for forward and back substitution in direct solution of sparse linear systems. Conference on High Performance Networking and Computing. Proceedings of the 1995 ACM/IEEE conference on Supercomputing (CDROM).  P. Raghavan. Efficient Parallel Triangular Solution Using Selective Inversion. Parallel Processing Letters, Vol. 8, No. 1, pp. 29-40, 1998

83 References  Joseph W. H. Liu. The Multifrontal Method for Sparse Matrix Factorization. SIAM Review. Vol. 34, No. 1, pp. 82-109, March 1992.  Gupta, Karypis and Kumar. Highly Scalable Parallel Algorithms for Sparse Matrix Factorization. TPDS. 1997.

