A Scalable Parallel Preconditioned Sparse Linear System Solver Murat ManguoğluMiddle East Technical University, Turkey Joint work with: Ahmed Sameh Purdue University, USA Madan Sathe University of Basel, Switzerland Olaf Schenk University of Basel, Switzerland Precond11, May Bordeaux, France
Design principles Reducing memory references and interprocessor communications at the cost of extra arithmetic operations. Allowing multiple levels of parallelism. Creating a polyalgorithm – versions vary depending on architectural characteristics of the underlying computing platform.
Outline 1. Solve (on parallel computers): Ax = f ; where A is large and sparse 2. Reorder a coefficient matrix A such that large elements are clustered around the main diagonal
PART I Parallel Solution of Sparse Linear Systems Ax = f
Target Computational Loop Integration Newton iteration Linear system solvers kk kk tt Most time consuming part in large simulations
Solving Ax=f Solve Mz = r using DS fact. (M is “banded”) BiCGstab (or any other Krylov subspace method) Step 1: weighted matrix reordering (symmetric/nonsymmetric) Step 2: extract a “banded” preconditioner,M, and solve the system: Manguoglu M, Koyuturk M, Sameh A, Grama A, SIAM Journal on Scientific Computing, 32(3), pp , 2010
reordering with weights and extracting the preconditioner A1A1 A2A2 A3A3 A4A4 C2C2 C3C3 C4C4 B1B1 B2B2 B3B3 PSPIKE Manguoglu M, Sameh A, and Schenk O, PSPIKE: A Parallel Hybrid Sparse Linear System Solver, Lecture Notes in Computer Science(proceedings of EuroPar09), Volume 5704, pp , 2009
Parallel solution of banded linear systems
Robustness of the narrow banded preconditioner 17 linear systems from UF sparse matrix collection. N: ranging from 7,000 to 2 M. NNZ: ranging from 42,000 to 20 M. Stopping criterion: relres ≤
Robustness 12
g3_circuit: ~1.5M unknowns, speed improvement compared to Pardiso on one core (75.3 seconds) Platform: a cluster where each node has two Intel Xeon / 8 cores per node Stopping tolarence: ||f-Ax|| ∞ /||f|| ∞ ≤ 10 -6
Sparsity structure of The stiffness and mass matrices: N ~ 1.5 Million Pardiso version 4.1.1; 1.5 M problem Intel Westmere processors, 2.67 GHz, 4 sockets w/ 8 cores each
2 nd approach for solving Ax=f Solve Mz = r using DS factorization BiCGstab (or any other Krylov subspace method) Step 1: extract a “sparse” preconditioner,M, and solve the system:
torso3: ~260K unknowns, speed improvement compared to Pardiso on one core (49.4 seconds)
PART II Parallel Spectral Matrix Reordering (Graph Partitioning)
Eigenvalues of the Laplacian of a Graph Case 1: – A is a symmetric matrix of order n – B = A – The weighted Laplacian matrix L is given by: L(i,i) = ∑ |B(i,k)| ; for k = 1,2,…,n; k ≠ i L(i,j) = - |B(i,j)| ; for i ≠ j Case 2: – A is nonsymmetric – B= (|A|+|A T |)/2 – L is obtained as in Case 1.
The Fiedler vector Obtain the eigenvector of the second smallest eigenvalue of: L x = λ x ; λ:= {0 = λ 1 ≤ λ 2 ≤ ……≤ λ n } Minimize σ = ∑ |B(i,j)| (x(i) – x(j)) 2 for all i, j for which B(i,j) ≠ 0, such that: e T x = 0, and x T x = 1. In other words: σ = x T Lx Miroslav Fiedler, Algebraic connectivity of graphs(English), Czechoslovak Mathematical Journal, vol. 23 (1973), issue 2, pp
Trace Minimization (TraceMin) A x = λ B x – A :=symmetric; B:= symmetric positive definite Minimize tr(Y T A Y) s.t. (Y T BY)=I p Solution: min tr(Y T AY) = Σ λ j (j=1,2,..,p) λ 1 ≤ λ 2 ≤ λ 3 ≤ …≤ λ p < λ p+1 ≤ ……≤ λ n A. Sameh & J. Wisniewski: SIAM J. Num. Analysis, A. Sameh & Z. Tong: J. Comp. Appl. Math., 2000.
TraceMin iteration
A Parallel Algorithm for Computing the Fiedler Vector: TraceMin-Fiedler Based on the Trace Minimization algorithm computes λ 2 and v 2 of L x = λ x – Forms the small Schur complement explicitly – Solution of the linear systems involving L via CG with diagonal preconditioning – L is not spd (it is spsd) Comparison against the best sequential scheme (Harwell Subroutine MC73) Manguoglu M, Cox E, Saied F, Sameh A: in review ACM TOMS 2010 ( Source code(GPL):
Most time consuming part
f2: structral mechanics N: 71,505 NNZ: 5,294,285 Original matrixAfter MC73After TraceMin-Fiedler
Platform: a cluster where each node has two Intel (Westmere) / 12 cores per Number of cores (using at most 8 cores per node) Stopping tolarence: ||Lx 2 -λ 2 x 2 || ∞ /||L|| ∞ ≤ 10 -5
Conclusions ● New, robust and scalable hybrid solvers on distributed memory clusters ● Tracemin-Fiedler is an effective parallel alternative for computing the fiedler vector.
Thank you