Sparse LA Sathish Vadhiyar

Motivation  Sparse computations much more challenging than dense due to complex data structures and memory references  Many physical systems produce sparse matrices  Most of the research and the base case in sparse symmetric positive definite matrices

Sparse Cholesky
- To solve Ax = b: factor A = LL^T, then solve Ly = b and L^T x = y.
- Cholesky factorization introduces fill-in.

Column-oriented left-looking Cholesky
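A minimal dense sketch of the column-oriented (left-looking) variant, assuming NumPy is available; a real sparse implementation would operate only on the nonzero structure:

import numpy as np

def left_looking_cholesky(A):
    # Column-oriented (left-looking) Cholesky: returns lower-triangular L with A = L L^T.
    n = A.shape[0]
    L = np.tril(np.array(A, dtype=float))   # work in place on the lower triangle of A
    for j in range(n):
        # cmod(j, k): update column j with every previously computed column k < j
        for k in range(j):
            L[j:, j] -= L[j, k] * L[j:, k]
        # cdiv(j): scale column j by the square root of its diagonal entry
        L[j, j] = np.sqrt(L[j, j])
        L[j+1:, j] /= L[j, j]
    return L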

Fill-in
- Fill: new nonzeros that appear in the factor L but are not present in A.
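A small illustration of fill, assuming NumPy/SciPy: an "arrow" matrix with a dense first row and column fills in the whole lower triangle when factored in the natural order (eliminating the dense node last instead would produce no fill).

import numpy as np
import scipy.sparse as sp

n = 6
A = sp.lil_matrix((n, n))
A.setdiag(4.0)
A[0, 1:] = 1.0          # dense first row and column ("arrow" pattern), SPD
A[1:, 0] = 1.0
A = A.tocsr()

L = np.linalg.cholesky(A.toarray())
print("nonzeros in tril(A):", sp.tril(A).nnz)        # 2n - 1 = 11
print("nonzeros in L      :", np.count_nonzero(L))   # n(n+1)/2 = 21, i.e. 10 fill entries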

Permutation Matrix or Ordering
- Reorder (permute) A to reduce fill or to enhance numerical stability.
- Choose a permutation matrix P so that the Cholesky factor L' of PAP^T has less fill than L.
- Triangular solves become: L'y = Pb; L'^T z = y; x = P^T z.
- The fill can be predicted in advance, so a static data structure can be set up before the numerical phase - symbolic factorization.

Steps  Ordering: Find a permutation P of matrix A,  Symbolic factorization: Set up a data structure for the Cholesky factor L of PAP T,  Numerical factorization: Decompose PAP T into LL T,  Triangular system solution: Ly = Pb; L T z = y; x = P T z.

Sparse Matrices and Graph Theory (figure: a sparse matrix A and its associated graph G(A))

Sparse Matrices and Graphs (figure: the filled graph F(A))

Ordering  The above order of elimination is “natural”  The first heuristic is minimum degree ordering  Simple and effective  But efficiency depends on tie breaking strategy

Minimum degree ordering for the previous matrix
- Ordering: {2, 4, 5, 7, 3, 1, 6} - no fill-in!

Ordering  Another ordering is nested dissection (divide-and-conquer)  Find separator S of nodes whose removal (along with edges) divides the graph into 2 disjoint pieces  Variables in each pieces are numbered contiguously and variables in S are numbered last  Leads to bordered block diagonal non-zero pattern  Can be applied recursively

Nested Dissection Illustration (figure: separator S splitting the graph into two pieces)

Nested Dissection Illustration (figure: recursive application with separator S)

Symbolic factorization  Can simulate the numerical factorization  Struct(M i* ) := {k<i | m ik ≠ 0}, Struct(M *j ) := {k>j | m kj ≠ 0}, p(j) := {min{i Є Struct(L *j )}, if Struct(L *j ) ≠ 0, j otherwise}  Struct(L *j ) C Struct(L *p(j) )) U {p(j}}, Struct(L *j ) := Struct(A *j ) U (U i<j {Struct(L *i )|p(i) = j}) – {j}

Symbolic Factorization
  for j := 1 to n do
      R_j := ∅
  for j := 1 to n do
      S := Struct(A_*j)
      for i ∈ R_j do
          S := S ∪ Struct(L_*i) - {j}
      Struct(L_*j) := S
      if Struct(L_*j) ≠ ∅ then
          p(j) := min{i ∈ Struct(L_*j)}
          R_p(j) := R_p(j) ∪ {j}
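A direct transcription of this pseudocode into Python (a sketch; A_struct[j] is assumed to hold the below-diagonal row indices of column j of A):

def symbolic_factorization(A_struct):
    # A_struct[j] = set of row indices i > j with a[i, j] != 0 (lower triangle of A).
    # Returns the below-diagonal structure of each column of L and the parent array p().
    n = len(A_struct)
    L_struct = [set() for _ in range(n)]
    parent = [None] * n
    children = [[] for _ in range(n)]      # R_j: columns whose parent is j
    for j in range(n):
        S = set(A_struct[j])
        for i in children[j]:
            S |= L_struct[i] - {j}
        L_struct[j] = S
        if S:
            parent[j] = min(S)
            children[parent[j]].append(j)
    return L_struct, parent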

Numerical Factorization
- cmod(j, k): modification of column j by column k, k < j
- cdiv(j): division of column j by a scalar

Algorithms

Elimination Tree
- T(A) has an edge between two vertices i and j, with i > j, if i = p(j); i is the parent of j.
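The elimination tree can also be computed directly from the structure of A, without forming L; the sketch below follows Liu's path-compression idea, with adj[j] assumed to hold the off-diagonal nonzero row indices of column j of the symmetric matrix:

def elimination_tree(adj):
    # Elimination tree from the structure of A (sketch). adj[j] = indices i != j with
    # a[i, j] != 0 for a symmetric matrix A. Returns parent[], with None marking roots.
    n = len(adj)
    parent = [None] * n
    ancestor = [None] * n          # path-compressed shortcut pointers
    for j in range(n):
        for i in sorted(k for k in adj[j] if k < j):
            # walk from i towards the root of its current subtree, compressing the path
            while i is not None and i < j:
                nxt = ancestor[i]
                ancestor[i] = j
                if nxt is None:
                    parent[i] = j
                i = nxt
    return parent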

Supernode
- A set of contiguous columns in the Cholesky factor L that share essentially the same sparsity structure.
- The contiguous columns j, j+1, ..., j+t constitute a supernode if Struct(L_*k) = Struct(L_*k+1) ∪ {k+1} for j <= k <= j+t-1.
- Columns in the same supernode can be treated as a unit.
- Used for enhancing the efficiency of minimum degree ordering and symbolic factorization. A detection sketch follows.
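A sketch of detecting such runs of columns from the column structures, using the definition above (L_struct[k] holds the below-diagonal row indices of column k of L):

def find_supernodes(L_struct):
    # Group contiguous columns into supernodes: columns j..j+t form a supernode when
    # Struct(L_*k) == Struct(L_*k+1) ∪ {k+1} for every k in the run.
    n = len(L_struct)
    supernodes, start = [], 0
    for k in range(n - 1):
        if L_struct[k] != (L_struct[k + 1] | {k + 1}):
            supernodes.append(list(range(start, k + 1)))
            start = k + 1
    supernodes.append(list(range(start, n)))
    return supernodes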

Parallelization of Sparse Cholesky
- Most parallel algorithms are based on elimination trees.
- Work associated with two disjoint subtrees can proceed independently.
- Same steps as in sequential sparse factorization, plus one additional step: assignment of tasks to processors.

Ordering  2 issues: - ordering in parallel - find an ordering that will help in parallelization in the subsequent steps

Ordering in Parallel - Nested Dissection
- Nested dissection can be carried out in parallel.
- It also leads to elimination trees that can be parallelized during the subsequent factorizations.
- But parallelism appears only in the later levels of the dissection.
- Can be applied to only a limited class of problems.
- More later...

Ordering for Parallel Factorization
(figure: natural order and its elimination tree - no fill, but no scope for parallelization)
- There is no agreed objective for ordering for parallel factorization: not all orderings that reduce fill-in also provide scope for parallelization.

Example (contd.)
(figure: nested dissection order and its elimination tree - some fill, but scope for parallelization)

Ordering for Parallel Factorization - Tree Restructuring
- Decouple the fill-reducing ordering from the ordering for parallel elimination:
- Determine a fill-reducing ordering P of G(A).
- Form the elimination tree T(PAP^T).
- Transform this tree T(PAP^T) into one with smaller height and record the corresponding equivalent reordering P'.

Ordering for Parallel Factorization - Tree Restructuring
- Efficiency depends on whether such an equivalent reordering can be found.
- Also on the limitations of the initial ordering P: only minor modifications are made to the initial ordering, hence only limited improvement in parallelism.
- Algorithm by Liu (1989), based on elimination tree rotations to reduce the height.
- Algorithm by Jess and Kees (1982), based on chordal graphs to reduce the height.

Height and Parallel Completion Time
- Not all elimination trees of minimum height give rise to small parallel completion times.
- Associate with each node v of the elimination tree a pair (time[v], level[v]):
  time[v] - time for factorization of column v
  level[v] = time[v] if v is the root of the elimination tree, time[v] + level[parent of v] otherwise.
  level[v] represents the minimum time to completion starting at node v.
- Parallel completion time = maximum level value among all nodes. A small sketch follows.
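A small sketch computing level[] and the parallel completion time from per-column factorization times and the elimination-tree parent array (hypothetical inputs):

def parallel_completion_time(parent, time):
    # level[v] = time[v] at a root, time[v] + level[parent[v]] otherwise;
    # the parallel completion time is the maximum level over all nodes.
    n = len(parent)
    level = [None] * n
    def lev(v):
        if level[v] is None:
            level[v] = time[v] if parent[v] is None else time[v] + lev(parent[v])
        return level[v]
    return max(lev(v) for v in range(n))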

Height and Parallel Completion Time - example (figure: elimination trees of the same matrix annotated with (time, level) pairs at each node)

Minimization of Cost
- Some recent algorithms (Weng-Yang Lin, J. of Supercomputing, 2003) therefore pick, at each step, the nodes with minimum cost (a greedy approach).

Nested Dissection Algorithms
- Use a graph partitioning heuristic to obtain a small edge separator of the graph.
- Transform the small edge separator into a small node separator.
- Number the nodes of the separator last and apply recursively.

Algorithm 1 - Level Structures
- d(x, y): distance between x and y
- Eccentricity: ε(x) = max_{y ∈ X} d(x, y)
- Diameter: δ(G) = maximum of the eccentricities
- Peripheral node: a node x in X with ε(x) = δ(G)
- Level structure: a partitioning L = {L_0, ..., L_l} such that Adj(L_i) ⊆ L_{i-1} ∪ L_{i+1}

Example Level structure rooted at 6

Breadth-First Search
- One way of finding level structures is BFS starting at a peripheral node.
- Finding a peripheral node is expensive, so settle for a pseudo-peripheral node.

Pseudo-peripheral node
1. Pick an arbitrary node r in X.
2. Generate a level structure rooted at r, with ε(r) levels.
3. Choose a node x in the last level with minimum degree.
4. Generate a level structure rooted at x.
5. If ε(x) > ε(r), set r = x and go to step 3; else x is the pseudo-peripheral node.
A BFS-based sketch follows.
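A BFS-based sketch of both the level structure and the pseudo-peripheral heuristic above, assuming adj is an adjacency-list (dict of neighbor lists) representation of the graph:

def level_structure(adj, root):
    # BFS level structure rooted at `root`: returns the list of levels L0, L1, ...
    levels, seen, frontier = [], {root}, [root]
    while frontier:
        levels.append(frontier)
        nxt = [w for v in frontier for w in adj[v] if w not in seen]
        frontier = list(dict.fromkeys(nxt))      # dedupe while keeping order
        seen.update(frontier)
    return levels

def pseudo_peripheral(adj, r=0):
    # Re-root at a minimum-degree node of the last level until the eccentricity
    # (number of levels) stops growing, as in the procedure above.
    levels = level_structure(adj, r)
    while True:
        x = min(levels[-1], key=lambda v: len(adj[v]))
        new_levels = level_structure(adj, x)
        if len(new_levels) > len(levels):
            r, levels = x, new_levels
        else:
            return x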

ND Heuristic Based on BFS
1. Construct a level structure with l levels.
2. Form the separator S from the nodes in level (l+1)/2.
3. Apply recursively.

Example (figure: partition into parts A and B)

K-L for ND
- Form a random initial partition.
- Form an edge separator by applying Kernighan-Lin (K-L) to obtain partitions P1 and P2.
- Let V1 be the set of nodes in P1 incident on at least one edge of the separator set; define V2 similarly.
- V1 ∪ V2 is a wide node separator; V1 or V2 alone is a narrow node separator (Gilbert and Zmijewski, 1987).

Step 2: Mapping Problems onto Processors
- Based on elimination trees.
- But elimination trees are determined from the structure of L, which is produced by symbolic factorization (step 3) - a bootstrapping problem!
- Efficient algorithms exist to find elimination trees from the structure of A alone.
- Parallel calculation of the elimination tree (Zmijewski and Gilbert): each processor computes a "local" version of the elimination tree, and the local versions are then combined.
- Various strategies map columns to processors based on the elimination tree:
  Strategy 1: successive levels in the elimination tree are wrap-mapped onto processors.
  Strategy 2: subtree-to-subcube mapping.
  Strategy 3: bin-packing (Geist and Ng).

Strategy 1 (figure)

Strategy 2 - Subtree-to-Subcube Mapping
- Select an appropriate set of P subtrees of the elimination tree, say T0, T1, ...
- Assign the columns corresponding to Ti to processor Pi.
- Where two subtrees merge into a single subtree, their processor sets are merged and wrap-mapped onto the nodes/columns of the separator that begins at that point.
- The root separator is wrap-mapped onto the set of all processors.

Strategy 2 (figure)

Strategy 3: Bin-Packing (Geist and Ng)
- Try to find disjoint subtrees.
- Map the subtrees to p bins using the first-fit-decreasing bin-packing heuristic: subtrees are processed in decreasing order of workload, and each subtree is packed into the currently lightest bin.
- Weight imbalance α: ratio between the lightest and heaviest bin.
- If α >= a user-specified tolerance γ, stop; else split the heaviest subtree into its subtrees and repack the p bins using bin-packing again.
- Repeat until α >= γ or the largest subtree cannot be split further.
- Load balance is thus controlled by the user-specified tolerance.
- The remaining nodes, from the roots of the subtrees up to the root of the tree, are wrap-mapped.
- A sketch of the packing step follows.
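A sketch of the initial packing step under the slide's description (process subtrees by decreasing workload, place each in the currently lightest bin); the subtree weights and bin count are hypothetical inputs:

def pack_subtrees(subtree_weights, p):
    # subtree_weights: dict mapping subtree id -> workload; p: number of bins/processors.
    bins = [[] for _ in range(p)]
    loads = [0.0] * p
    for tree, w in sorted(subtree_weights.items(), key=lambda kv: -kv[1]):
        b = loads.index(min(loads))        # current lightest bin
        bins[b].append(tree)
        loads[b] += w
    imbalance = min(loads) / max(loads) if max(loads) > 0 else 1.0   # the α of the slide
    return bins, imbalance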

Parallel Symbolic Factorization
- Sequential symbolic factorization is very efficient; parallel versions struggle to achieve good speedups - limited parallelism, small task sizes, high communication overhead.
- Mapping strategies typically wrap-map processors onto the columns of the same supernode, hence more storage and work than the sequential version.
- With a supernodal structure, only the processor holding the first column of a supernode calculates the structure.
- The other processors holding columns of that supernode simply retrieve the structure from the processor holding the first column.

Parallel Numerical Factorization - Submatrix Cholesky
Tsub(k) is partitioned into subtasks Tsub(k,1), ..., Tsub(k,P), where
Tsub(k,p) := {cmod(j,k) | j ∈ Struct(L_*k) ∩ mycols(p)}

Definitions  mycols(p) – set of columns owned by p  map[k] – processor containing column k  procs(L *k ) = {map[j] | j in Struct(L *k )}

Parallel Submatrix Cholesky
  for j ∈ mycols(p) do
      if j is a leaf node in T(A) then
          cdiv(j)
          send L_*j to the processors in procs(L_*j)
          mycols(p) := mycols(p) - {j}
  while mycols(p) ≠ ∅ do
      receive any column of L, say L_*k
      for j ∈ Struct(L_*k) ∩ mycols(p) do
          cmod(j, k)
          if column j requires no more cmod's then
              cdiv(j)
              send L_*j to the processors in procs(L_*j)
              mycols(p) := mycols(p) - {j}
Disadvantage: communication is not localized.

Parallel Numerical Factorization - Subcolumn Cholesky
Tcol(j) is partitioned into subtasks Tcol(j,1), ..., Tcol(j,P), where
Tcol(j,p) aggregates into a single update vector every update vector u(j,k) for which k ∈ Struct(L_j*) ∩ mycols(p)

Definitions  mycols(p) – set of columns owned by p  map[k] – processor containing column k  procs(L j* ) = {map[k] | k in Struct(L j* )}  u(j, k) – scaled column accumulated into the factor column by cmod(j, k)

Parallel Subcolumn Cholesky
  for j := 1 to n do
      if j ∈ mycols(p) or Struct(L_j*) ∩ mycols(p) ≠ ∅ then
          u := 0
          for k ∈ Struct(L_j*) ∩ mycols(p) do
              u := u + u(j,k)
          if map[j] ≠ p then
              send u to processor q = map[j]
          else
              incorporate u into the factor column j
              while any aggregated update column for column j remains unreceived do
                  receive in u another aggregated update column for column j
                  incorporate u into the factor column j
              cdiv(j)
Has more uniform and less communication than the submatrix version.
The difference is due to the access patterns of Struct(L_j*) versus Struct(L_*j).

A Refined Version - Compute-Ahead Fan-In
- The previous version can lead to processor idling while waiting for the aggregates needed to update column j.
- Updating column j can be interleaved with compute-ahead tasks:
  1. Aggregate u(i, k) for i > j for each completed column k in Struct(L_i*) ∩ mycols(p).
  2. Receive an aggregated update column for i > j and incorporate it into factor column i.

Triangular Solve: Parallel Forward and Back Substitution (Anshul Gupta, Vipin Kumar – Supercomputing ’95)

Forward Substitution
- Computation starts with the leaf supernodes of the elimination tree.
- The portion of L corresponding to a supernode is a dense trapezoid of width t and height n, where
  t = number of nodes/columns in the supernode
  n = number of non-zeros in the leftmost column of the supernode

Forward Substitution - Example

Steps at a Supernode
1. Initial processing: a vector rhs of size n is formed.
   1.1 The first t elements correspond to the elements of the RHS vector with the same indices as the nodes of the supernode.
   1.2 The remaining n-t elements are filled with 0's.
2. Computation:
   2.1 Solve the dense triangular system at the top of the trapezoid of the supernode.
   2.2 Form the updates corresponding to the remaining n-t rows of the supernode:
       - x = product of the bottom (n-t) x t submatrix of L with the length-t solution vector from step 2.1
       - subtract x from the bottom n-t elements of rhs
   2.3 Add the bottom n-t elements of rhs to the corresponding (same-index) entries of rhs at the parent supernode.
Step 2.1 at any supernode can begin only after the contributions from all its children have arrived. A dense sketch of steps 2.1-2.2 follows.
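A dense sketch of steps 2.1-2.2 for one supernode, assuming NumPy/SciPy; L_panel is the n x t trapezoid of L for the supernode and rhs is the assembled length-n vector:

import numpy as np
from scipy.linalg import solve_triangular

def supernode_forward_step(L_panel, rhs):
    t = L_panel.shape[1]
    # Step 2.1: solve the dense t x t lower-triangular system at the top of the trapezoid.
    y = solve_triangular(L_panel[:t, :], rhs[:t], lower=True)
    # Step 2.2: form the updates for the remaining n - t rows and subtract them from rhs.
    rhs[t:] -= L_panel[t:, :] @ y
    return y, rhs[t:]   # rhs[t:] is added into the parent supernode's rhs entries (step 2.3)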

Parallelization  For levels >= logP, the above steps are performed sequentially on a single processor  For supernode with 0 <= l < logP, the above computation steps are performed in parallel on p/2 l processors.  Pipelined or wavefront algorithm is used.

Partitioning
1. Assume unlimited parallelism.
2. At a single time step, only t processors are used.
3. At a single time step, only one block per row and one block per column are active.
4. Might as well use a 1-D block-cyclic distribution.

1-D block cyclic along rows

Sparse Iterative Methods

Iterative and Direct Methods - Pros and Cons
- Iterative methods give approximate solutions; the achievable accuracy depends on the problem and the stopping criterion.
- Convergence cannot always be predicted.
- But there is absolutely no fill-in.

Parallel Jacobi, Gauss-Seidel, SOR
- For problems with grid structure (1-D, 2-D, etc.), Jacobi is easily parallelizable.
- Gauss-Seidel and SOR need the most recent values, hence an ordering of updates and sequencing among processors.
- But Gauss-Seidel and SOR can be parallelized using red-black (checkerboard) ordering.

2D Grid example

Red-Black Ordering
- Color alternate nodes in each dimension red and black.
- Number the red nodes first, then the black nodes.
- All red nodes can be updated simultaneously, followed by simultaneous updates of the black nodes. A sweep sketch follows.
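A sketch of one red-black Gauss-Seidel sweep for the 5-point Poisson stencil (-Δu = f) on a square grid with spacing h, boundary entries of u held fixed; in a parallel setting, all points of one color could be updated simultaneously:

import numpy as np

def red_black_gauss_seidel_sweep(u, f, h):
    for parity in (0, 1):                       # 0 = "red" points, 1 = "black" points
        for i in range(1, u.shape[0] - 1):
            for j in range(1, u.shape[1] - 1):
                if (i + j) % 2 == parity:
                    u[i, j] = 0.25 * (u[i-1, j] + u[i+1, j] +
                                      u[i, j-1] + u[i, j+1] + h * h * f[i, j])
    return u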

2D Grid example – Red Black Ordering  In general, reordering can affect convergence

Multi-Color Orderings
- In general, multi-color orderings can be used for an arbitrary graph.
- The ordering can reduce the convergence rate but exposes more parallelism, so a balance must be struck.
- Multi-color orderings can also be used for preconditioned CG.

Preconditioned CG
- Instead of solving Ax = b, solve A'x' = b' where A' = C^-1 A C^-1, x' = Cx, b' = C^-1 b, to improve convergence.
- M = C^2 is called the preconditioner.

Incomplete Cholesky Preconditioner
- M = HH^T, where H is the "incomplete" Cholesky factor of A.
- One form of incomplete Cholesky: force h_ij = 0 wherever a_ij = 0.

Preconditioned CG
  k = 0
  r_0 = b - A x_0
  while r_k ≠ 0
      solve M z_k = r_k        (two triangular solves - parallelization is not straightforward)
      k = k + 1
      if k = 1
          p_1 = z_0
      else
          β_k = (r_{k-1}^T z_{k-1}) / (r_{k-2}^T z_{k-2})
          p_k = z_{k-1} + β_k p_{k-1}
      end
      α_k = (r_{k-1}^T z_{k-1}) / (p_k^T A p_k)
      x_k = x_{k-1} + α_k p_k
      r_k = r_{k-1} - α_k A p_k
  end
  x = x_k
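A NumPy sketch mirroring the pseudocode above; solve_M(r) is a user-supplied routine applying the preconditioner, e.g. the two triangular solves for M = HH^T:

import numpy as np

def preconditioned_cg(A, b, solve_M, x0, tol=1e-10, maxit=1000):
    x = x0.copy()
    r = b - A @ x
    p = None
    rz_old = None
    for k in range(1, maxit + 1):
        if np.linalg.norm(r) <= tol:            # plays the role of "while r_k != 0"
            break
        z = solve_M(r)                          # M z = r
        rz = r @ z
        p = z if k == 1 else z + (rz / rz_old) * p
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rz_old = rz
    return x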

Graph Coloring  Graph Colored Ordering for parallel computing of Gauss-Seidel and applying incomplete Cholesky preconditioners  It was shown (Schreiber and Tang) that minimum number of parallel steps in triangular solve is given by the chromatic number of symmetric graph  Thus permutation matrix, P based on graph color ordering  Incomplete Cholesky applied to PAP T  Unknowns corresponding to nodes of same color are solved in parallel; computation proceeds in steps

Parallel Triangular Solve Based on Multi-Coloring
- The triangular solve Ly = b consists of two kinds of steps:
  b_w = b_w - L_wv y_v   (corresponds to traversing the edge (v, w))
  y_w = b_w / L_ww       (corresponds to visiting vertex w)
- The steps can be done in parallel for all vertices v with the same color.
- Thus the parallel triangular solve proceeds in a number of steps equal to the number of colors.
(figure: a graph with its original vertex ordering and the new color-based ordering)

Graph Coloring Problem
- Given G(A) = (V, E), σ: V → {1, 2, ..., s} is an s-coloring of G if σ(i) ≠ σ(j) for every edge (i, j) in E.
- The minimum possible value of s is the chromatic number of G.
- The graph coloring problem is to color the nodes with the chromatic number of colors.
- NP-complete problem.

Heuristics - Greedy Heuristic
1. Compute a vertex ordering {v_1, ..., v_n} of V.
2. For i = 1 to n, set σ(v_i) to the smallest consistent color (one not used by an already-colored neighbor).
How to do step 1? The result is not optimal and may use more colors than necessary, so step 1 is important. A sketch of step 2 follows.
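A sketch of step 2, the greedy sweep, for a given vertex order (adjacency given as a dict of neighbor sets):

def greedy_coloring(adj, order):
    # Visit vertices in `order`; give each the smallest color not used by a colored neighbor.
    color = {}
    for v in order:
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color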

Heuristics - Saturation Degree Ordering
- Let {v_1, ..., v_{i-1}} have already been chosen.
- Choose v_i such that v_i is adjacent to the maximum number of different colors in {v_1, ..., v_{i-1}}.

Parallel Graph Coloring - General Algorithm

Parallel Graph Coloring - Finding Maximal Independent Sets - Luby (1986)
  I = ∅
  V' = V
  G' = G
  while G' ≠ ∅
      choose an independent set I' in G'
      I = I ∪ I'
      X = I' ∪ N(I')          (N(I') - vertices adjacent to I')
      V' = V' \ X
      G' = G(V')
  end
For choosing the independent set I' (Monte Carlo heuristic):
  1. For each vertex v in V', determine a distinct random number p(v).
  2. v ∈ I' iff p(v) > p(w) for every w in adj(v).
Color each MIS a different color.
Disadvantage: each new choice of random numbers requires a global synchronization of the processors.
A sequential sketch of the idea follows.
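A sequential sketch of the Monte Carlo MIS extraction (in the parallel setting, each round's random numbers and membership tests would be evaluated concurrently, with one global synchronization per round):

import random

def maximal_independent_set(adj, vertices):
    I, V = set(), set(vertices)
    while V:
        p = {v: random.random() for v in V}
        # v joins the independent set if its random value beats all remaining neighbours
        I_new = {v for v in V if all(p[v] > p[w] for w in adj[v] if w in V)}
        I |= I_new
        # remove the new independent-set vertices and their neighbourhood
        V -= I_new | {w for v in I_new for w in adj[v]}
    return I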

Parallel Graph Coloring – Gebremedhin and Manne (2003) Pseudo-Coloring

References in Graph Coloring
- M. Luby. A simple parallel algorithm for the maximal independent set problem. SIAM Journal on Computing, 15(4), 1986.
- M. T. Jones, P. E. Plassmann. A parallel graph coloring heuristic. SIAM Journal on Scientific Computing, 14(3), May 1993.
- L. V. Kale, B. H. Richards and T. D. Allen. Efficient Parallel Graph Coloring with Prioritization. Lecture Notes in Computer Science, vol. 1068, August 1995, Springer-Verlag.
- A. H. Gebremedhin, F. Manne. Scalable parallel graph coloring algorithms. Concurrency: Practice and Experience, 12 (2000).
- A. H. Gebremedhin, I. G. Lassous, J. Gustedt, J. A. Telle. Graph coloring on coarse grained multicomputers. Discrete Applied Mathematics, 131(1), 6 September 2003.

References  M.T. Heath, E. Ng, B.W. Peyton. Parallel Algorithms for Sparse Linear Systems. SIAM Review. Vol. 33, No. 3, pp , September  A. George, J.W.H. Liu. The Evolution of the Minimum Degree Ordering Algorithm. SIAM Review. Vol. 31, No. 1, pp. 1-19, March  J. W. H. Liu. Reordering sparse matrices for parallel elimination. Parallel Computing 11 (1989) 73-91

References  Anshul Gupta, Vipin Kumar. Parallel algorithms for forward and back substitution in direct solution of sparse linear systems. Conference on High Performance Networking and Computing. Proceedings of the 1995 ACM/IEEE conference on Supercomputing (CDROM).  P. Raghavan. Efficient Parallel Triangular Solution Using Selective Inversion. Parallel Processing Letters, Vol. 8, No. 1, pp , 1998

References  Joseph W. H. Liu. The Multifrontal Method for Sparse Matrix Factorization. SIAM Review. Vol. 34, No. 1, pp , March  Gupta, Karypis and Kumar. Highly Scalable Parallel Algorithms for Sparse Matrix Factorization. TPDS