A Scalable Parallel Preconditioned Sparse Linear System Solver Murat ManguoğluMiddle East Technical University, Turkey Joint work with: Ahmed Sameh Purdue.

Slides:

Advertisements

Similar presentations

05/11/2005 Carnegie Mellon School of Computer Science Aladdin Lamps 05 Combinatorial and algebraic tools for multigrid Yiannis Koutis Computer Science.

Advertisements

Partitional Algorithms to Detect Complex Clusters

A NOVEL APPROACH TO SOLVING LARGE-SCALE LINEAR SYSTEMS Ken Habgood, Itamar Arel Department of Electrical Engineering & Computer Science GABRIEL CRAMER.

1 High performance Computing Applied to a Saltwater Intrusion Numerical Model E. Canot IRISA/CNRS J. Erhel IRISA/INRIA Rennes C. de Dieuleveult IRISA/INRIA.

Weighted Matrix Reordering and Parallel Banded Preconditioners for Nonsymmetric Linear Systems Murat Manguoğlu*, Mehmet Koyutürk**, Ananth Grama* and Ahmed.

CS 240A: Solving Ax = b in parallel Dense A: Gaussian elimination with partial pivoting (LU) Same flavor as matrix * matrix, but more complicated Sparse.

Applied Linear Algebra - in honor of Hans SchneiderMay 25, 2010 A Look-Back Technique of Restart for the GMRES(m) Method Akira IMAKURA † Tomohiro SOGABE.

Online Social Networks and Media. Graph partitioning The general problem – Input: a graph G=(V,E) edge (u,v) denotes similarity between u and v weighted.

Eigenvalue and eigenvectors  A x = λ x  Quantum mechanics (Schrödinger equation)  Quantum chemistry  Principal component analysis (in data mining)

Numerical Parallel Algorithms for Large-Scale Nanoelectronics Simulations using NESSIE Eric Polizzi, Ahmed Sameh Department of Computer Sciences, Purdue.

10/11/2001Random walks and spectral segmentation1 CSE 291 Fall 2001 Marina Meila and Jianbo Shi: Learning Segmentation by Random Walks/A Random Walks View.

1 On The Use Of MA47 Solution Procedure In The Finite Element Steady State Heat Analysis Ana Žiberna and Dubravka Mijuca Faculty of Mathematics Department.

Stochastic Optimization of Complex Energy Systems on High-Performance Computers Cosmin G. Petra Mathematics and Computer Science Division Argonne National.

Multilevel Incomplete Factorizations for Non-Linear FE problems in Geomechanics DMMMSA – University of Padova Department of Mathematical Methods and Models.

Scalable Multi-Stage Stochastic Programming Cosmin Petra and Mihai Anitescu Mathematics and Computer Science Division Argonne National Laboratory DOE Applied.

Scientific Computing Seminar AMLS, Spectral Schur Complements and Iterative Computation of Eigenvalues * C. BekasY. Saad Comp. Science & Engineering Dept.

Avoiding Communication in Sparse Iterative Solvers Erin Carson Nick Knight CS294, Fall 2011.

Evaluating Sparse Linear System Solvers on Scalable Parallel Architectures Ananth Grama and Ahmed Sameh Department of Computer Science Purdue University.

Sparse Matrix Methods Day 1: Overview Day 2: Direct methods

The Landscape of Ax=b Solvers Direct A = LU Iterative y’ = Ay Non- symmetric Symmetric positive definite More RobustLess Storage (if sparse) More Robust.

CS240A: Conjugate Gradients and the Model Problem.

Monica Garika Chandana Guduru. METHODS TO SOLVE LINEAR SYSTEMS Direct methods Gaussian elimination method LU method for factorization Simplex method of.

1 Parallel Simulations of Underground Flow in Porous and Fractured Media H. Mustapha 1,2, A. Beaudoin 1, J. Erhel 1 and J.R. De Dreuzy IRISA – INRIA.

Parallel Performance of Hierarchical Multipole Algorithms for Inductance Extraction Ananth Grama, Purdue University Vivek Sarin, Texas A&M University Hemant.

An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.

Scalable Multi-Stage Stochastic Programming

Scotch + HAMD –Hybrid algorithm based on incomplete Nested Dissection, the resulting subgraphs being ordered with an Approximate Minimun Degree method.

ParCFD Parallel computation of pollutant dispersion in industrial sites Julien Montagnier Marc Buffat David Guibert.

CS 219: Sparse Matrix Algorithms

Efficient Integration of Large Stiff Systems of ODEs Using Exponential Integrators M. Tokman, M. Tokman, University of California, Merced 2 hrs 1.5 hrs.

PDCS 2007 November 20, 2007 Accelerating the Complex Hessenberg QR Algorithm with the CSX600 Floating-Point Coprocessor Yusaku Yamamoto 1 Takafumi Miyata.

© 2011 Autodesk Freely licensed for use by educational institutions. Reuse and changes require a note indicating that content has been modified from the.

On the Use of Sparse Direct Solver in a Projection Method for Generalized Eigenvalue Problems Using Numerical Integration Takamitsu Watanabe and Yusaku.

Computational Aspects of Multi-scale Modeling Ahmed Sameh, Ananth Grama Computing Research Institute Purdue University.

CS240A: Conjugate Gradients and the Model Problem.

Case Study in Computational Science & Engineering - Lecture 5 1 Iterative Solution of Linear Systems Jacobi Method while not converged do { }

The Fiedler Vector and Graph Partitioning

October 2008 Integrated Predictive Simulation System for Earthquake and Tsunami Disaster CREST/Japan Science and Technology Agency (JST)

Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Performance Analysis of Parallel Sparse LU Factorization.

Scotch + HAMD –Hybrid algorithm based on incomplete Nested Dissection, the resulting subgraphs being ordered with an Approximate Minimun Degree method.

CS 484. Iterative Methods n Gaussian elimination is considered to be a direct method to solve a system. n An indirect method produces a sequence of values.

HAMA: An Efficient Matrix Computation with the MapReduce Framework Sangwon Seo, Edward J. Woon, Jaehong Kim, Seongwook Jin, Jin-soo Kim, Seungryoul Maeng.

Data Structures and Algorithms in Parallel Computing Lecture 7.

CS 290H Administrivia: May 14, 2008 Course project progress reports due next Wed 21 May. Reading in Saad (second edition): Sections

Consider Preconditioning – Basic Principles Basic Idea: is to use Krylov subspace method (CG, GMRES, MINRES …) on a modified system such as The matrix.

Krylov-Subspace Methods - I Lecture 6 Alessandra Nardi Thanks to Prof. Jacob White, Deepak Ramaswamy, Michal Rewienski, and Karen Veroy.

F. Fairag, H Tawfiq and M. Al-Shahrani Department of Math & Stat Department of Mathematics and Statistics, KFUPM. Nov 6, 2013 Preconditioning Technique.

Conjugate gradient iteration One matrix-vector multiplication per iteration Two vector dot products per iteration Four n-vectors of working storage x 0.

The Landscape of Sparse Ax=b Solvers Direct A = LU Iterative y’ = Ay Non- symmetric Symmetric positive definite More RobustLess Storage More Robust More.

High Performance Computing Seminar

Parallel Algorithms for Solution of Large Sparse Linear Systems with Applications Murat Manguoğlu Middle East Technical University, Ankara, Turkey Prace.

Hui Liu University of Calgary

A survey of Exascale Linear Algebra Libraries for Data Assimilation

Xing Cai University of Oslo

Department of Mathematics

Model Problem: Solving Poisson’s equation for temperature

A computational loop k k Integration Newton Iteration

Solving Linear Systems Ax=b

Parallel Linear Solver Using Hierarchical Matrix

Ananth Grama and Ahmed Sameh Department of Computer Science

Michael Overton Scientific Computing Group Broad Interests

3.3 Network-Centric Community Detection

Administrivia: November 9, 2009

Ph.D. Thesis Numerical Solution of PDEs and Their Object-oriented Parallel Implementations Xing Cai October 26, 1998.

Linear Algebra Lecture 16.

RKPACK A numerical package for solving large eigenproblems

A computational loop k k Integration Newton Iteration

Linear Algebra Lecture 28.

Presentation transcript:

A Scalable Parallel Preconditioned Sparse Linear System Solver Murat ManguoğluMiddle East Technical University, Turkey Joint work with: Ahmed Sameh Purdue University, USA Madan Sathe University of Basel, Switzerland Olaf Schenk University of Basel, Switzerland Precond11, May Bordeaux, France

Design principles Reducing memory references and interprocessor communications at the cost of extra arithmetic operations. Allowing multiple levels of parallelism. Creating a polyalgorithm – versions vary depending on architectural characteristics of the underlying computing platform.

Outline 1. Solve (on parallel computers): Ax = f ; where A is large and sparse 2. Reorder a coefficient matrix A such that large elements are clustered around the main diagonal

PART I Parallel Solution of Sparse Linear Systems Ax = f

Target Computational Loop Integration Newton iteration Linear system solvers kk kk tt Most time consuming part in large simulations

Solving Ax=f Solve Mz = r using DS fact. (M is “banded”) BiCGstab (or any other Krylov subspace method) Step 1: weighted matrix reordering (symmetric/nonsymmetric) Step 2: extract a “banded” preconditioner,M, and solve the system: Manguoglu M, Koyuturk M, Sameh A, Grama A, SIAM Journal on Scientific Computing, 32(3), pp , 2010

reordering with weights and extracting the preconditioner A1A1 A2A2 A3A3 A4A4 C2C2 C3C3 C4C4 B1B1 B2B2 B3B3 PSPIKE Manguoglu M, Sameh A, and Schenk O, PSPIKE: A Parallel Hybrid Sparse Linear System Solver, Lecture Notes in Computer Science(proceedings of EuroPar09), Volume 5704, pp , 2009

Parallel solution of banded linear systems

Robustness of the narrow banded preconditioner 17 linear systems from UF sparse matrix collection. N: ranging from 7,000 to 2 M. NNZ: ranging from 42,000 to 20 M. Stopping criterion: relres ≤

Robustness 12

g3_circuit: ~1.5M unknowns, speed improvement compared to Pardiso on one core (75.3 seconds) Platform: a cluster where each node has two Intel Xeon / 8 cores per node Stopping tolarence: ||f-Ax|| ∞ /||f|| ∞ ≤ 10 -6

Sparsity structure of The stiffness and mass matrices: N ~ 1.5 Million Pardiso version 4.1.1; 1.5 M problem Intel Westmere processors, 2.67 GHz, 4 sockets w/ 8 cores each

2 nd approach for solving Ax=f Solve Mz = r using DS factorization BiCGstab (or any other Krylov subspace method) Step 1: extract a “sparse” preconditioner,M, and solve the system:

torso3: ~260K unknowns, speed improvement compared to Pardiso on one core (49.4 seconds)

PART II Parallel Spectral Matrix Reordering (Graph Partitioning)

Eigenvalues of the Laplacian of a Graph Case 1: – A is a symmetric matrix of order n – B = A – The weighted Laplacian matrix L is given by: L(i,i) = ∑ |B(i,k)| ; for k = 1,2,…,n; k ≠ i L(i,j) = - |B(i,j)| ; for i ≠ j Case 2: – A is nonsymmetric – B= (|A|+|A T |)/2 – L is obtained as in Case 1.

The Fiedler vector Obtain the eigenvector of the second smallest eigenvalue of: L x = λ x ; λ:= {0 = λ 1 ≤ λ 2 ≤ ……≤ λ n } Minimize σ = ∑ |B(i,j)| (x(i) – x(j)) 2 for all i, j for which B(i,j) ≠ 0, such that: e T x = 0, and x T x = 1. In other words: σ = x T Lx Miroslav Fiedler, Algebraic connectivity of graphs(English), Czechoslovak Mathematical Journal, vol. 23 (1973), issue 2, pp

Trace Minimization (TraceMin) A x = λ B x – A :=symmetric; B:= symmetric positive definite Minimize tr(Y T A Y) s.t. (Y T BY)=I p Solution: min tr(Y T AY) = Σ λ j (j=1,2,..,p) λ 1 ≤ λ 2 ≤ λ 3 ≤ …≤ λ p < λ p+1 ≤ ……≤ λ n A. Sameh & J. Wisniewski: SIAM J. Num. Analysis, A. Sameh & Z. Tong: J. Comp. Appl. Math., 2000.

TraceMin iteration

A Parallel Algorithm for Computing the Fiedler Vector: TraceMin-Fiedler Based on the Trace Minimization algorithm computes λ 2 and v 2 of L x = λ x – Forms the small Schur complement explicitly – Solution of the linear systems involving L via CG with diagonal preconditioning – L is not spd (it is spsd) Comparison against the best sequential scheme (Harwell Subroutine MC73) Manguoglu M, Cox E, Saied F, Sameh A: in review ACM TOMS 2010 ( Source code(GPL):

Most time consuming part

f2: structral mechanics N: 71,505 NNZ: 5,294,285 Original matrixAfter MC73After TraceMin-Fiedler

Platform: a cluster where each node has two Intel (Westmere) / 12 cores per Number of cores (using at most 8 cores per node) Stopping tolarence: ||Lx 2 -λ 2 x 2 || ∞ /||L|| ∞ ≤ 10 -5

Conclusions ● New, robust and scalable hybrid solvers on distributed memory clusters ● Tracemin-Fiedler is an effective parallel alternative for computing the fiedler vector.

Thank you