A Scalable Parallel Preconditioned Sparse Linear System Solver
Murat Manguoğlu, Middle East Technical University, Turkey
Joint work with: Ahmed Sameh (Purdue University, USA), Madan Sathe (University of Basel, Switzerland), Olaf Schenk (University of Basel, Switzerland)
Precond11, May 16-18, Bordeaux, France

Design principles
● Reducing memory references and interprocessor communications at the cost of extra arithmetic operations.
● Allowing multiple levels of parallelism.
● Creating a polyalgorithm: versions vary depending on the architectural characteristics of the underlying computing platform.

Outline
1. Solve (on parallel computers) Ax = f, where A is large and sparse
2. Reorder the coefficient matrix A so that large elements are clustered around the main diagonal

PART I Parallel Solution of Sparse Linear Systems Ax = f

Target Computational Loop
[Figure: loop diagram with stages Integration, Newton iteration, Linear system solvers; annotation: most time-consuming part in large simulations]

Solving Ax = f
Step 1: weighted matrix reordering (symmetric/nonsymmetric)
Step 2: extract a “banded” preconditioner, M, and solve the system with BiCGstab (or any other Krylov subspace method); each preconditioning step solves Mz = r using the DS factorization (M is “banded”)
Manguoglu M, Koyuturk M, Sameh A, Grama A, SIAM Journal on Scientific Computing, 32(3), pp. 1201-1216, 2010
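A minimal sequential sketch of Step 2, assuming SciPy is available. The helper names banded_part and solve_with_banded_preconditioner are hypothetical, and a sparse LU factorization (scipy.sparse.linalg.splu) stands in for the parallel DS factorization used in the talk; the weighted reordering of Step 1 is assumed to have been applied already.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla


def banded_part(A, k):
    """Keep only the entries of A with |i - j| <= k (the "banded" preconditioner M)."""
    A = sp.coo_matrix(A)
    keep = np.abs(A.row - A.col) <= k
    return sp.csc_matrix((A.data[keep], (A.row[keep], A.col[keep])), shape=A.shape)


def solve_with_banded_preconditioner(A, f, k):
    """Solve Ax = f with BiCGstab, preconditioned by the band of half-width k."""
    M = banded_part(A, k)
    # Assumption: sparse LU replaces the parallel DS factorization of the talk.
    lu = spla.splu(M)
    # Each application of the preconditioner solves M z = r.
    M_inv = spla.LinearOperator(A.shape, matvec=lu.solve, dtype=A.dtype)
    x, info = spla.bicgstab(A, f, M=M_inv)  # info == 0 means convergence
    return x, info
```

The half-bandwidth k trades preconditioner quality against factorization cost: a wider band captures more of A but is more expensive to factor and solve.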

Reordering with weights and extracting the preconditioner
[Figure: block labels A1-A4, B1-B3, C2-C4]
PSPIKE
Manguoglu M, Sameh A, and Schenk O, PSPIKE: A Parallel Hybrid Sparse Linear System Solver, Lecture Notes in Computer Science (proceedings of EuroPar09), Volume 5704, pp. 797-808, 2009

Parallel solution of banded linear systems

Robustness of the narrow banded preconditioner
17 linear systems from the UF sparse matrix collection.
N: ranging from 7,000 to 2 M.
NNZ: ranging from 42,000 to 20 M.
Stopping criterion: relres ≤ $10^{-5}$

Robustness

g3_circuit: ~1.5M unknowns, speed improvement compared to Pardiso on one core (75.3 seconds)
Platform: a cluster where each node has two Intel Xeon (X5560@2.8GHz) / 8 cores per node
Stopping tolerance: $\|f - Ax\|_\infty / \|f\|_\infty \le 10^{-6}$

Sparsity structure of the stiffness and mass matrices: N ~ 1.5 million
Pardiso version 4.1.1; 1.5 M problem
Intel Westmere processors, 2.67 GHz, 4 sockets w/ 8 cores each
http://www.pardiso-project.org/

Second approach for solving Ax = f
Step 1: extract a “sparse” preconditioner, M, and solve the system with BiCGstab (or any other Krylov subspace method); each preconditioning step solves Mz = r using the DS factorization

torso3: ~260K unknowns, speed improvement compared to Pardiso on one core (49.4 seconds)

PART II Parallel Spectral Matrix Reordering (Graph Partitioning)

Eigenvalues of the Laplacian of a Graph
Case 1:
– A is a symmetric matrix of order n
– B = A
– The weighted Laplacian matrix L is given by:
  $L(i,i) = \sum_{k=1,\, k \neq i}^{n} |B(i,k)|$
  $L(i,j) = -|B(i,j)|$ for $i \neq j$
Case 2:
– A is nonsymmetric
– $B = (|A| + |A^T|)/2$
– L is obtained as in Case 1.
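The two cases can be sketched in a few lines, assuming SciPy. The function name weighted_laplacian is hypothetical. Because L only uses |B|, and (|A| + |A^T|)/2 equals |A| when A is symmetric, one code path covers both cases.

```python
import numpy as np
import scipy.sparse as sp


def weighted_laplacian(A):
    """Weighted graph Laplacian L of a square sparse matrix A (Case 1 and Case 2)."""
    A = sp.csr_matrix(A)
    # B = (|A| + |A^T|) / 2; this reduces to |A| when A is symmetric.
    B = (abs(A) + abs(A.T)) / 2
    # Only off-diagonal entries of B contribute to L.
    B = sp.csr_matrix(B - sp.diags(B.diagonal()))
    # L(i,i) = sum over k != i of |B(i,k)|, L(i,j) = -|B(i,j)|.
    d = np.asarray(B.sum(axis=1)).ravel()
    return sp.csr_matrix(sp.diags(d) - B)
```

Every row of L sums to zero, so the vector of all ones is an eigenvector with eigenvalue 0.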

The Fiedler vector
Obtain the eigenvector of the second smallest eigenvalue of:
$L x = \lambda x$, with $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n$
Minimize $\sigma = \sum |B(i,j)| \, (x(i) - x(j))^2$ over all pairs $i < j$ for which $B(i,j) \neq 0$, such that $e^T x = 0$ and $x^T x = 1$.
In other words: $\sigma = x^T L x$
Miroslav Fiedler, Algebraic connectivity of graphs (English), Czechoslovak Mathematical Journal, vol. 23 (1973), issue 2, pp. 298-305
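For a small graph, the Fiedler vector and the reordering it induces can be illustrated with a dense eigensolver. This is only a sketch under that assumption: numpy.linalg.eigh replaces the parallel eigensolver discussed later in the talk, the function names are hypothetical, and the graph is assumed to be connected so that λ_2 > 0.

```python
import numpy as np


def fiedler_vector(L):
    """Return (lambda_2, x_2) for a dense symmetric Laplacian L."""
    # eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
    w, V = np.linalg.eigh(L)
    return w[1], V[:, 1]


def spectral_permutation(L):
    """Permutation that sorts the vertices by their Fiedler vector entries."""
    _, x2 = fiedler_vector(L)
    return np.argsort(x2)
```

Applying the permutation symmetrically to the matrix places strongly connected vertices next to each other, which is what clusters the large elements around the main diagonal.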

Trace Minimization (TraceMin)
$A x = \lambda B x$, where A is symmetric and B is symmetric positive definite
Minimize $\mathrm{tr}(Y^T A Y)$ subject to $Y^T B Y = I_p$
Solution: $\min \mathrm{tr}(Y^T A Y) = \sum_{j=1}^{p} \lambda_j$, where $\lambda_1 \le \lambda_2 \le \dots \le \lambda_p < \lambda_{p+1} \le \dots \le \lambda_n$
A. Sameh & J. Wisniewski, SIAM J. Num. Analysis, 1982; A. Sameh & Z. Tong, J. Comp. Appl. Math., 2000.
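The trace-minimization property can be checked numerically on a small dense problem. The sketch below is not the TraceMin iteration itself: it uses scipy.linalg.eigh on the generalized problem to obtain the minimizer directly, and the function name trace_min_value is hypothetical.

```python
import numpy as np
from scipy.linalg import eigh


def trace_min_value(A, B, p):
    """Return (tr(Y^T A Y), sum of the p smallest eigenvalues, Y) for A x = lambda B x."""
    # eigh(A, B) returns ascending eigenvalues and eigenvectors with V^T B V = I.
    w, V = eigh(A, B)
    Y = V[:, :p]  # B-orthonormal basis of the p smallest eigenpairs
    return np.trace(Y.T @ A @ Y), w[:p].sum(), Y
```

Any other Y with Y^T B Y = I_p gives a trace at least as large, which is the property TraceMin exploits to converge toward the smallest eigenpairs.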

TraceMin iteration

A Parallel Algorithm for Computing the Fiedler Vector: TraceMin-Fiedler
Based on the Trace Minimization algorithm; computes $\lambda_2$ and $v_2$ of $L x = \lambda x$
– Forms the small Schur complement explicitly
– Solves the linear systems involving L via CG with diagonal preconditioning
– L is not spd (it is spsd)
Comparison against the best sequential scheme (Harwell Subroutine MC73)
Manguoglu M, Cox E, Saied F, Sameh A: in review, ACM TOMS, 2010 (http://arxiv.org/abs/1003.3689v1)
Source code (GPL): www.cs.purdue.edu/homes/mmanguog/fiedler.html
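The inner solves can be illustrated with SciPy's CG and a diagonal (Jacobi) preconditioner. This is a sketch under stated assumptions, not the authors' implementation: the function name solve_laplacian_cg is hypothetical, the graph is assumed connected with no isolated vertices, and the right-hand side is first projected orthogonal to the all-ones vector so that the singular system is consistent.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla


def solve_laplacian_cg(L, b):
    """Solve L x = b - mean(b) with diagonally preconditioned CG; L is spsd."""
    L = sp.csr_matrix(L)
    b = np.asarray(b, dtype=float)
    # Assumption: project b onto the range of L (vectors orthogonal to e).
    b = b - b.mean()
    d = L.diagonal()  # positive when no vertex is isolated
    M = spla.LinearOperator(L.shape, matvec=lambda r: r / d, dtype=L.dtype)
    x, info = spla.cg(L, b, M=M)  # info == 0 means convergence
    # Remove the null-space component so the solution is unique.
    return x - x.mean(), info
```

Because L e = 0, subtracting the mean from the solution leaves the residual unchanged.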

Most time consuming part

f2: structural mechanics, N: 71,505, NNZ: 5,294,285
[Figure panels: Original matrix, After MC73, After TraceMin-Fiedler]

Platform: a cluster where each node has two Intel (Westmere) X5670@2.93GHz / 12 cores per node
[Figure axis: number of cores (using at most 8 cores per node)]
Stopping tolerance: $\|L x_2 - \lambda_2 x_2\|_\infty / \|L\|_\infty \le 10^{-5}$

Conclusions
● New, robust, and scalable hybrid solvers on distributed memory clusters
● TraceMin-Fiedler is an effective parallel alternative for computing the Fiedler vector.

Thank you
