Parallel Linear Solver Using Hierarchical Matrix

C. Chen, L. Cambier, E. Darve (Stanford); S. Rajamanickam, R. Tuminaro, E. Boman (Sandia)

Motivations (1/2)
Increasingly large linear systems arise in many applications: ice sheet simulation, structural mechanics, solid mechanics, and electromagnetics [NY Times, 2015; Liu, 2017; Aminfar, 2015; Takahashi, 2017]

Motivations (2/2)
Heterogeneous machine architectures: the NERSC Cori supercomputer, NVIDIA GPUs, Intel KNL

Solve Ax = b
Existing linear solvers:
- sparse direct solvers (e.g., SuperLU)
- iterative methods
- preconditioners (e.g., ILU, multigrid)
Hierarchical solvers/preconditioners:
- tunable accuracy
- fast and robust
- high arithmetic intensity
- reduced communication and synchronization

Key tool: hierarchical matrices [Hackbusch, Bebendorf, Börm, Gu, etc.] [Aminfar, 2015]
- Off-diagonal blocks of typical differential and integral operators are effectively low-rank.
- Theoretically optimal complexity in operation count and memory footprint.
- Data-sparse representations: H, H², HSS, and HODLR.
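To make the low-rank claim concrete, the sketch below (an illustration, not the authors' code) builds the off-diagonal block of the kernel $1/|x-y|$ between two well-separated point clusters and counts the singular values above a relative tolerance. The point sets, the kernel, and the helper name `numerical_rank` are assumptions chosen for the example.

```python
import numpy as np

def numerical_rank(block, tol=1e-8):
    """Number of singular values above tol * (largest singular value)."""
    s = np.linalg.svd(block, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

# Two well-separated clusters of points on a line (hypothetical example).
n = 200
x = np.linspace(0.0, 1.0, n)   # source cluster
y = np.linspace(2.0, 3.0, n)   # target cluster, separated by a gap of 1

# Off-diagonal block of the kernel 1/|x - y|, a typical integral-operator kernel.
block = 1.0 / np.abs(y[:, None] - x[None, :])

r = numerical_rank(block, tol=1e-8)
print(f"block size {n}x{n}, numerical rank at 1e-8: {r}")
```

Although the block is 200 × 200, only a small number of singular values exceed the tolerance, which is what lets formats such as HODLR or H² store such blocks as thin factors.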

Key idea: compressing fill-in [Aminfar, 2015]
Exploit the low-rank structure of fill-in:
- singular-value-based truncation, in contrast to the level-based and threshold-based dropping rules of ILU or IChol
- compression of fill-in in a hierarchical fashion
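The difference between the two dropping philosophies shows up on a small example. In this sketch (assumed data, hypothetical helper names `truncate_svd` and `threshold_drop`), the fill-in block is dense with no small entries, so an ILU-style magnitude threshold drops nothing, while singular-value truncation represents it with a few columns at a controlled relative error.

```python
import numpy as np

def truncate_svd(F, eps):
    """Factor F ~= L @ R, keeping singular values above eps * s_max."""
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    k = int(np.sum(s > eps * s[0]))
    return U[:, :k] * s[:k], Vt[:k, :]   # scale columns of U by singular values

def threshold_drop(F, tau):
    """ILU-style dropping rule: zero entries with |F_ij| < tau * max|F|."""
    return np.where(np.abs(F) >= tau * np.abs(F).max(), F, 0.0)

# Hypothetical fill-in block: dense, no small entries, but smooth (hence low-rank).
m = 100
s_pts = np.linspace(0.0, 1.0, m)
F = 1.0 / (1.0 + s_pts[:, None] + s_pts[None, :])

L, R = truncate_svd(F, eps=1e-6)
k = L.shape[1]
err = np.linalg.norm(F - L @ R, 2) / np.linalg.norm(F, 2)
kept = np.count_nonzero(threshold_drop(F, tau=1e-2))

print(f"SVD truncation: rank {k}, stores {2 * m * k} numbers, rel. error {err:.1e}")
print(f"threshold dropping keeps {kept} of {m * m} entries")
```

For a truncated SVD the relative 2-norm error equals the first discarded singular value divided by the largest one, so `eps` directly controls accuracy.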

Algorithm of our solver
- Partition the sparse matrix (mesh or graph).
- Compress fill-in and create coarse vertices (red in figure).
- Eliminate fine vertices (green in figure).
- Recurse on the coarse system (level-1 fill-in).
Figures: partition of mesh grid/unknowns [Pouransari, 2017]; far-field (fill-in) compression.
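The elimination-and-recursion structure can be illustrated with one exact level of block elimination. This is a sketch under simplifying assumptions: a 1D Laplacian, a hypothetical split of unknowns into alternating fine and coarse sets, and no compression of fill-in. It shows only where fill-in appears (in the Schur complement) and where the coarse solve fits, not the authors' partitioning or compression.

```python
import numpy as np

def laplacian_1d(n):
    """Tridiagonal 1D Laplacian with Dirichlet ends: a small SPD model matrix."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def two_level_solve(A, b, fine, coarse):
    """Eliminate the fine unknowns, solve the coarse (Schur complement) system,
    then back-substitute. Exact: no low-rank compression is applied."""
    Aff = A[np.ix_(fine, fine)]
    Afc = A[np.ix_(fine, coarse)]
    Acf = A[np.ix_(coarse, fine)]
    Acc = A[np.ix_(coarse, coarse)]
    # Eliminating the fine unknowns produces fill-in in the Schur complement S.
    S = Acc - Acf @ np.linalg.solve(Aff, Afc)
    bc = b[coarse] - Acf @ np.linalg.solve(Aff, b[fine])
    x = np.zeros_like(b)
    x[coarse] = np.linalg.solve(S, bc)   # a hierarchical solver would recurse here
    x[fine] = np.linalg.solve(Aff, b[fine] - Afc @ x[coarse])
    return x, S

n = 9
A = laplacian_1d(n)
b = np.ones(n)
fine, coarse = np.arange(1, n, 2), np.arange(0, n, 2)   # hypothetical split
x, S = two_level_solve(A, b, fine, coarse)
print(np.allclose(A @ x, b))
```

`S[0, 1]` is nonzero even though `A[0, 2]` is zero: eliminating fine node 1 couples coarse nodes 0 and 2. In 1D this fill-in stays sparse; in 2D and 3D it becomes dense between neighboring clusters, which is what motivates compressing it.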

Algorithm illustration (figure panels): (1) original partition; (2) one coarse node; (3) two coarse nodes; (4) three coarse nodes; (5) many coarse nodes; (6) coarse level. The coarse level carries all information of the fill-in (up to approximation error).

Parallel algorithm

Parallel algorithm: concurrency
- Fill-in occurs only between clusters at distance at most 2 from each other.
- Fill-in pattern: ILU(1) [Yang, 2017]
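Level-1 fill can be read off the cluster graph: eliminating a cluster couples every pair of its neighbors, so new connections appear only between clusters that share a common neighbor, i.e., that are at distance 2. The sketch below computes that pattern for a hypothetical chain of five clusters; the graph and the function name `level1_fill` are illustrative assumptions.

```python
from itertools import combinations

def level1_fill(adj):
    """Return pairs (v, w), v < w, that are not adjacent but share a neighbor u.
    Eliminating u couples v and w; these are the level-1 (ILU(1)-style) fill entries."""
    fill = set()
    for u, nbrs in adj.items():
        for v, w in combinations(sorted(nbrs), 2):
            if w not in adj[v]:
                fill.add((v, w))
    return fill

# Hypothetical cluster graph: a 1D chain of 5 clusters.
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(sorted(level1_fill(chain)))  # [(0, 2), (1, 3), (2, 4)]
```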

Parallel algorithm: data dependency
- d1: boundary nodes; on the critical path; requires coloring
- d2: lower priority than d1
- d3: interior nodes; lowest priority; no communication needed
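The slide does not spell out how nodes are assigned to d1, d2, and d3. The sketch below assumes, for illustration only, that d1 clusters have a neighbor owned by another process, d2 clusters are the remaining clusters adjacent to a d1 cluster, and d3 is everything else; it then orders the work by the stated priorities. The chain graph, the two-process ownership map, and `classify_clusters` are hypothetical.

```python
def classify_clusters(adj, owner):
    """Split clusters into d1/d2/d3 under the assumed definitions above."""
    # Assumed d1: touches a cluster owned by a different process.
    d1 = {u for u in adj if any(owner[v] != owner[u] for v in adj[u])}
    # Assumed d2: not d1, but adjacent to a d1 cluster.
    d2 = {u for u in adj if u not in d1 and any(v in d1 for v in adj[u])}
    # Assumed d3: interior clusters, no communication needed.
    d3 = set(adj) - d1 - d2
    return d1, d2, d3

# Hypothetical example: a chain of 8 clusters, clusters 0-3 on P0 and 4-7 on P1.
chain = {i: {j for j in (i - 1, i + 1) if 0 <= j < 8} for i in range(8)}
owner = {i: 0 if i < 4 else 1 for i in range(8)}
d1, d2, d3 = classify_clusters(chain, owner)

# Priority order: critical-path d1 first, then d2, interior d3 last.
order = sorted(chain, key=lambda u: (0 if u in d1 else 1 if u in d2 else 2, u))
print(d1, d2, d3, order)
```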

d1: relaxed distance-2 coloring; four colors for regular grids (figure: processes P0, P1, P2, P3)
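A coloring is what lets boundary clusters be eliminated concurrently: if two clusters are at distance at least 3, their neighborhoods are disjoint, so eliminating them updates disjoint sets of neighbors. The sketch below is a standard greedy distance-2 coloring, not the relaxed variant named on the slide (whose relaxation is not described here); the grid layout and all function names are assumptions.

```python
def grid_graph(nx, ny):
    """4-neighbor adjacency of an nx-by-ny grid of clusters (hypothetical layout)."""
    adj = {}
    for i in range(nx):
        for j in range(ny):
            adj[(i, j)] = {(i + di, j + dj)
                           for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                           if 0 <= i + di < nx and 0 <= j + dj < ny}
    return adj

def greedy_distance2_coloring(adj):
    """Give each vertex the smallest color unused within distance 2."""
    color = {}
    for u in sorted(adj):
        forbidden = set()
        for v in adj[u]:
            if v in color:
                forbidden.add(color[v])          # distance-1 neighbor
            for w in adj[v]:
                if w != u and w in color:
                    forbidden.add(color[w])      # distance-2 neighbor
        c = 0
        while c in forbidden:
            c += 1
        color[u] = c
    return color

def is_distance2_coloring(adj, color):
    """Check that no two vertices within distance 2 share a color."""
    for u in adj:
        near = set(adj[u])
        for v in adj[u]:
            near |= adj[v]
        near.discard(u)
        if any(color[v] == color[u] for v in near):
            return False
    return True

grid = grid_graph(4, 4)
colors = greedy_distance2_coloring(grid)
print(max(colors.values()) + 1, "colors used")
```

Clusters sharing a color can be processed in the same step. Greedy coloring is valid by construction but does not guarantee the minimum number of colors.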

Parallel algorithm
Asynchronous algorithm: d3 computation overlaps with d1 and d2 communication (figure legend: d1 nodes, d2 nodes, d3 nodes with no communication).
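The overlap pattern can be mimicked with a background task: start the exchange of d1/d2 data, do the d3 work that needs no communication, then wait for the exchange before touching d1/d2. In this sketch the "communication" is a stand-in function that sleeps (a real implementation would use non-blocking MPI messages); `exchange_boundary` and `eliminate_interior` are hypothetical placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def exchange_boundary(data):
    """Hypothetical stand-in for a non-blocking exchange of d1/d2 data."""
    time.sleep(0.2)                      # pretend network latency
    return [2 * v for v in data]         # pretend the neighbor sent back these values

def eliminate_interior(values):
    """Hypothetical stand-in for local d3 work that needs no communication."""
    time.sleep(0.2)                      # pretend computation
    return sum(values)

boundary = [1, 2, 3]
interior = list(range(10))

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(exchange_boundary, boundary)   # post "communication"
    interior_result = eliminate_interior(interior)       # overlap with d3 work
    received = pending.result()                          # wait before d1/d2 work
elapsed = time.perf_counter() - start
print(f"elapsed {elapsed:.2f} s, received {received}, interior {interior_result}")
```

Because `time.sleep` releases the GIL, the two 0.2 s tasks overlap, so the elapsed time should be close to 0.2 s rather than 0.4 s.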

Communication & computation
- Local communication with neighbors at every level
- Batched dense linear algebra, well suited to many-core processors (e.g., GPU, MIC)
- Assume problem size $N$, $p$ processors, rank $r$, and $M = Nr^2/p$:
  - computation: $O(M)$
  - communication: $O(M^{2/3})$
  - number of messages: $O(\log(N/(rp)) + \log p)$
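Plugging numbers into the model gives a feel for the scaling. The sketch evaluates only the leading-order terms (constants dropped), assumes base-2 logarithms, and reads the slide's `log(N/rp)` as $\log(N/(rp))$; the sample values of `N`, `p`, and `r` are arbitrary.

```python
import math

def cost_model(N, p, r):
    """Leading-order terms of the per-process cost model (constants omitted)."""
    M = N * r**2 / p
    return {
        "computation": M,                                      # O(M)
        "communication": M ** (2.0 / 3.0),                     # O(M^(2/3))
        "messages": math.log2(N / (r * p)) + math.log2(p),     # assumed base 2
    }

c = cost_model(N=2**20, p=16, r=8)
for name, value in c.items():
    print(f"{name:>13}: {value:,.1f}")
```

With $N = 2^{20}$, $p = 16$, $r = 8$: $M = 2^{20} \cdot 64 / 16 = 2^{22}$, and the message count is $\log_2 2^{13} + \log_2 16 = 17$.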

Comparison with SuperLU_DIST (16 cores on NERSC Edison)

Matrices from the UFL (University of Florida) Sparse Matrix Collection

Application: ice sheet modeling

Solve for ice sheet velocity [NASA, 2017; Tuminaro, 2016]
- Larsen C ice shelf
- extruded (thin) mesh in the vertical direction
- Neumann and Robin boundary conditions

Improved convergence [Yang, 2017]
- Model problem on a cube with strong vertical coupling (anisotropic)
- Preserve the near-null space (piecewise constant) in the low-rank approximation
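One generic way to keep a prescribed vector in a compressed basis (similar in spirit to how algebraic multigrid keeps constants in the coarse space) is to append it to the truncated singular vectors and re-orthonormalize. The sketch below does this for the constant vector on one cluster. It illustrates the idea of preserving the near-null space, not the specific construction of [Yang, 2017]; the kernel block and the name `basis_preserving` are assumptions.

```python
import numpy as np

def basis_preserving(B, v, k):
    """Orthonormal basis spanning the top-k left singular vectors of B plus v.
    The projection Q @ Q.T then reproduces v exactly."""
    U, _, _ = np.linalg.svd(B, full_matrices=False)
    Q, _ = np.linalg.qr(np.column_stack([U[:, :k], v]))
    return Q

# Hypothetical off-diagonal block: rows are unknowns of one cluster,
# columns are well-separated unknowns (kernel 1/|x - y|).
x = np.linspace(0.0, 1.0, 50)
y = np.linspace(2.0, 3.0, 60)
B = 1.0 / np.abs(x[:, None] - y[None, :])
ones = np.ones(len(x))          # piecewise-constant near-null vector on this cluster

k = 3
Q = basis_preserving(B, ones, k)
s = np.linalg.svd(B, compute_uv=False)
proj_err = np.linalg.norm(B - Q @ (Q.T @ B), 2)   # compression error with augmented basis
print(f"rank-{k} truncation error {s[k]:.2e}, augmented-basis error {proj_err:.2e}")
```

Projecting onto a basis that contains the top-k singular vectors can only lower the 2-norm compression error relative to plain rank-k truncation, so the extra column costs one rank but no accuracy.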

Summary
Hierarchical solver:
- data-sparse representation
- fast and robust
Parallel algorithm:
- local communication
- high computational intensity
- reduced global synchronization
Various applications:
- symmetric and nonsymmetric
- positive definite or indefinite

Acknowledgements

References
- NY Times, 2015: https://www.nytimes.com/2015/09/12/science/climate-study-predicts-huge-sea-level-rise-if-all-fossil-fuels-are-burned.html
- NASA, 2017: https://www.nasa.gov/feature/goddard/2017/massive-iceberg-breaks-off-from-antarctica
- Liu, Yinjian, and Jinyou Xiao. "An Enhanced Hierarchical Solver for Preconditioning of Large-Scale Finite Element." Working manuscript.
- Aminfar, AmirHossein. "A fast and memory efficient sparse solver with applications to finite element matrices." PhD thesis, Stanford University, 2015.
- Takahashi, Toru, Pieter Coulier, and Eric Darve. "Application of the inverse fast multipole method as a preconditioner in a 3D Helmholtz boundary element method." Journal of Computational Physics 341 (2017): 406-428.
- Pouransari, Hadi, Pieter Coulier, and Eric Darve. "Fast hierarchical solvers for sparse matrices using extended sparsification and low-rank approximation." SIAM Journal on Scientific Computing 39, no. 3 (2017): A797-A830.
- Yang, Kai, Hadi Pouransari, and Eric Darve. "Sparse hierarchical solvers with guaranteed convergence." arXiv preprint arXiv:1611.03189 (2016).
- Tuminaro, Raymond, Mauro Perego, Irina Tezaur, Andrew Salinger, and Stephen Price. "A matrix dependent/algebraic multigrid approach for extruded meshes with applications to ice sheet modeling." SIAM Journal on Scientific Computing 38, no. 5 (2016): C504-C532.
- Chen, Chao, Hadi Pouransari, Siva Rajamanickam, Erik G. Boman, and Eric Darve. "A distributed-memory hierarchical solver for general sparse linear systems." Parallel Computing, accepted subject to minor changes.