MUMPS: A Multifrontal Massively Parallel Solver

Presentation transcript:

IMPLEMENTATION
Distributed multifrontal solver, based on MPI and Fortran 90 (a C user interface is also available).
- Stability based on partial pivoting.
- Dynamic distributed scheduling to accommodate both numerical fill-in and multi-user environments.
- Use of BLAS, LAPACK and ScaLAPACK.

Main features: MUMPS solves large systems of linear equations of the form Ax = b by factorizing A into A = LU or A = LDL^T.
- Symmetric or unsymmetric matrices (partial pivoting).
- Parallel factorization and solve phases (a uniprocessor version is also available).
- Iterative refinement and backward error analysis.
- Various matrix input formats: assembled, distributed assembled, and sum of elemental matrices.
- Null-space functionalities (experimental): rank detection and null-space basis.
- Partial factorization and Schur complement matrix.
- A version for complex arithmetic.

Figure: a fully asynchronous distributed solver (VAMPIR trace, 8 processors).

AVAILABILITY
MUMPS is available free of charge for non-commercial use. It has been used on a number of platforms (Cray T3E, Origin 2000, IBM SP, Linux clusters, ...) by a few hundred current users (finite elements, chemistry, simulation, aeronautics, ...). If you are interested in obtaining MUMPS for your own use, please refer to the MUMPS home page.

Figure: BMW car body (MSC.Software).

COMPETITIVE PERFORMANCE
The MUMPS package performs well relative to other parallel sparse solvers; for example, the table below compares it with the SuperLU code of Demmel and Li. These results are taken from "Analysis and comparison of two general solvers for distributed memory computers", ACM TOMS, 27.

CURRENT RESEARCH: ACTIVE RESEARCH IS FEEDING THE MUMPS SOFTWARE PLATFORM.
The MUMPS package has been partially supported by the Esprit IV Project PARASOL and by CERFACS, ENSEEIHT-IRIT, INRIA Rhône-Alpes, LBNL-NERSC, PARALLAB and RAL. The authors are Patrick Amestoy, Jean-Yves L'Excellent, Iain Duff and Jacko Koster.
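Among the features listed above, iterative refinement with backward error analysis is easy to illustrate. The sketch below is a toy dense example in plain Python (not MUMPS code, which handles sparse matrices through its Fortran/C interface): it solves Ax = b by Gaussian elimination with partial pivoting, applies refinement steps x <- x + A^{-1}(b - Ax), and reports the componentwise backward error commonly used as a stopping criterion.

```python
# Toy sketch of LU solve + iterative refinement with a backward-error
# measure; illustrative only, not MUMPS code (which reuses the sparse
# factors of A for each refinement step instead of refactorizing).

def lu_solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # pivot row
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def refine(A, b, x, steps=2):
    """x <- x + A^{-1}(b - Ax); return x and the backward error."""
    n = len(A)
    for _ in range(steps):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = lu_solve(A, r)
        x = [x[i] + d[i] for i in range(n)]
    # componentwise backward error: max_i |b - Ax|_i / (|A||x| + |b|)_i
    r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    scale = [sum(abs(A[i][j]) * abs(x[j]) for j in range(n)) + abs(b[i])
             for i in range(n)]
    berr = max(abs(ri) / si for ri, si in zip(r, scale))
    return x, berr

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, berr = refine(A, b, lu_solve(A, b))
```

The exact solution here is x = (1/11, 7/11); after refinement the backward error is at the level of machine precision.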
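The "partial factorization and Schur complement" feature stops elimination after a first block of variables and returns the Schur complement S = A22 - A21 A11^{-1} A12 of the remaining block. A minimal dense sketch (toy code, no pivoting, not the MUMPS interface):

```python
# Toy sketch of partial factorization: eliminate the first k variables
# of a dense matrix; the trailing block left over is the Schur
# complement S = A22 - A21 A11^{-1} A12. No pivoting; not MUMPS code.

def schur_complement(A, k):
    """Return the (n-k)x(n-k) Schur complement after eliminating k vars."""
    n = len(A)
    M = [row[:] for row in A]
    for p in range(k):                      # eliminate variable p
        for i in range(p + 1, n):
            f = M[i][p] / M[p][p]
            for j in range(p, n):
                M[i][j] -= f * M[p][j]
    return [row[k:] for row in M[k:]]       # trailing block

A = [[2.0, 0.0, 1.0],
     [0.0, 2.0, 1.0],
     [1.0, 1.0, 3.0]]
S = schur_complement(A, 2)   # 1x1 Schur complement: 3 - 1/2 - 1/2 = 2
```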
Functionalities related to rank revealing were first implemented by M. Tuma (Institute of Computer Science, Academy of Sciences of the Czech Republic) while he was at CERFACS. We are also grateful to C. Bousquet, C. Daniel, A. Guermouche, G. Richard, S. Pralet and C. Vömel, who have worked on specific parts of this software.

Table: factorization time in seconds of large matrices on the CRAY T3E (1 proc = not enough memory).

Reorderings and optimization of the memory usage
MUMPS uses state-of-the-art reordering techniques (AMD, AMF, ND, SCOTCH, PORD, METIS). These techniques have a strong impact on the parallelism and on the number of operations, and we are currently studying their impact on the dynamic memory usage of MUMPS. In particular, we have designed algorithms to optimize the memory occupation of the multifrontal stack. Future work includes dynamic memory load balancing and the design of an out-of-core version.

Table: best decrease in stack memory obtained using our algorithm, for each reordering technique; results obtained by A. Guermouche (PhD student in the INRIA ReMaP project). The percentages were not preserved in this transcript.

  Matrix    Reordering
  thermal   PORD
  THREAD    METIS
  af23560   AMF
  xenon2    SCOTCH
  rma10     AMD

Mixing dynamic and static scheduling strategies
MUMPS uses a completely dynamic approach with distributed scheduling and scales well up to around 100 processors. Introducing more static information helps reduce the cost of the dynamic decisions and makes MUMPS more scalable.

Table: factorization times of MUMPS and SuperLU on the matrices bbmat and ecl32, with AMD and ND(METIS) orderings, for varying numbers of processors; the timings were not preserved in this transcript.

Platforms with heterogeneous networks (clusters of SMPs)
In the MUMPS scheduling, work is given to processors according to their load. Penalizing the load of processors on a distant node helps keep tasks with heavy communication on the same node, and improves performance, as shown in the table below.
This poster was prepared by Jean-Yves L'Excellent.

Mixing MPI and OpenMP on clusters of SMPs
We report below on a preliminary experiment with hybrid parallelism on one node (16 processors) of an IBM SP. The best results are obtained with 8 MPI processes of 2 OpenMP threads each. Regular problem from an 11-point discretization (cubic 64x64x64 grid), ND ordering. Results obtained by S. Pralet (PhD, CERFACS).

                                           Standard MUMPS   Modified MUMPS
  Time for factorization                   49.2 seconds     44.0 seconds
  Total volume of communication            3957 MB          3600 MB
  Total volume of internode communication  2017 MB          1004 MB

Table: effect of taking the hybrid network into account. Matrix PRE2, SCOTCH ordering, 2 nodes of 16 processors of an IBM SP. Results obtained by S. Pralet (PhD, CERFACS).

Table: effect of injecting more static information into the dynamic scheduling of MUMPS. Rectangular grids of increasing size, ND ordering. Results obtained by C. Vömel (PhD, CERFACS) on a CRAY T3E.
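The multifrontal stack discussed in the memory-optimization section holds each node's contribution block between the child's factorization and the parent's assembly, so the peak stack size depends on the traversal order of the assembly tree. The toy model below (plain Python, made-up tree sizes, not MUMPS code) shows how reordering the children of a node can lower the peak; sorting children by decreasing (peak - contribution block size) is the classical heuristic.

```python
# Toy model of multifrontal stack memory. A node is a tuple
# (front_size, cb_size, children): processing it needs `front_size`
# on top of the contribution blocks of already-processed siblings,
# and it leaves `cb_size` on the stack for its parent. Not MUMPS code.

def stack_peak(node, order=None):
    """Return (peak stack, contribution block) for a subtree."""
    front, cb, children = node
    results = [stack_peak(c, order) for c in children]
    if order == "best":
        # classical heuristic: largest (peak - cb) first
        results.sort(key=lambda pc: pc[0] - pc[1], reverse=True)
    held = 0   # contribution blocks already sitting on the stack
    peak = 0
    for p, c in results:
        peak = max(peak, held + p)
        held += c
    peak = max(peak, held + front)   # assembling the front itself
    return peak, cb

# Unbalanced toy tree: a root with a small and a large subtree.
tree = (4, 0, [(2, 1, []), (10, 2, [])])
worst = stack_peak(tree)[0]           # children in the given order
best = stack_peak(tree, "best")[0]    # children reordered
```

On this tiny tree, processing the large child first already lowers the peak (10 instead of 11); on real assembly trees the savings can be substantial, which is what the table of memory decreases above measures.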
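The node-penalty scheduling idea for clusters of SMPs can be sketched as follows. This is a hypothetical reconstruction in Python: the loads, the node map and the penalty factor are invented for illustration and are not the actual MUMPS heuristic.

```python
# Sketch of penalized load-based scheduling on a cluster of SMPs:
# pick the least-loaded processor, but inflate the apparent load of
# processors on a different node than the task's data, so that
# communication-heavy tasks tend to stay on the same node.
# All values below are made up for illustration; not MUMPS code.

def pick_processor(loads, node_of, task_node, penalty=1.5):
    """Return the processor index minimizing the penalized load."""
    def effective(p):
        w = loads[p]
        return w * penalty if node_of[p] != task_node else w
    return min(range(len(loads)), key=effective)

loads = [5.0, 6.0, 4.0, 4.5]   # current load per processor
node_of = [0, 0, 1, 1]         # SMP node of each processor
# For a task whose data lives on node 0, the slightly busier local
# processor 0 beats the less-loaded remote processor 2.
chosen = pick_processor(loads, node_of, task_node=0)
```

Without the penalty, processor 2 (load 4.0) would be chosen and the task's communication would cross the internode network, which is what the communication-volume table above penalizes.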