A Parallel Direct Solver for Very Large Sparse SPD Systems

Solving large sparse symmetric positive definite systems of linear equations Ax = b is a crucial and time-consuming step arising in many scientific and engineering applications. This work falls within the scope of the new INRIA ScAlApplix project (UR Futurs).

Software Overview

PaStiX is a scientific library that provides a high-performance solver for very large sparse linear systems, based on direct and ILU(k) iterative methods.
– Several factorization algorithms are implemented, in single or double precision (real or complex): LLt (Cholesky), LDLt (Crout), and LU with static pivoting (for unsymmetric matrices with a symmetric structure).
– The library uses the Scotch graph partitioning and sparse matrix block ordering package.
– It relies on efficient static scheduling and memory management to solve problems with more than 10 million unknowns.
– A publicly available version of PaStiX is currently being developed.

The solver pipeline:
– Ordering (Scotch + HAMD): a hybrid algorithm based on incomplete Nested Dissection, the resulting subgraphs being ordered with an Approximate Minimum Degree method with constraints (tight coupling).
– Symbolic block factorization: linear time and space complexities.
– Static scheduling: logical simulation of the computations of the block solver, cost modeling for the target machine, task scheduling and communication scheme.
– Parallel supernodal factorization: total/partial aggregation of contributions under memory constraints.

Mapping and Scheduling: Crucial Issues

Exploiting three levels of parallelism:
– Manage the parallelism induced by sparsity (block elimination tree).
– Split and distribute the dense blocks in order to exploit the potential parallelism induced by dense computations.
– Use an optimal block size for pipelined BLAS3 operations.

Partitioning and mapping problems (see the elimination-tree sketch after this overview):
– Computation of the precedence constraints laid down by the factorization algorithm (elimination tree).
– Workload estimation that must take into account BLAS effects and communication latency.
– Locality of communications.
– Concurrent task ordering for solver scheduling.
– Taking into account the extra workload due to the aggregation approach of the solver.
– Heterogeneous architectures (SMP nodes).

Mapping and scheduling strategy:
– Partitioning (step 1): a variant of the proportional mapping technique (a sketch of classic proportional mapping follows this overview).
– Mapping (step 2): a bottom-up mapping of the new elimination tree induced by a logical simulation of the computations of the block solver.
– This yields 1D and 2D block distributions: BLAS efficiency on compacted small supernodes → 1D; scalability on larger supernodes → 2D.

[Diagram: the matrix partitioning and the block symbolic matrix define the task graph; together with BLAS and MPI cost models and the number of processors, the mapping and scheduling phase produces the local data, the task scheduling and the communication scheme of the parallel factorization, under memory constraints (reduction of the memory overhead, new communication scheme).]

[Diagram: irregular (sparse) problems — partitioning, scheduling, mapping — on HPC resources (clusters of SMP nodes, homogeneous or heterogeneous networks, in-core or out-of-core); scalable problems with up to 10^8 3D unknowns; architecture complexity handled through partial aggregation and a hybrid iterative-direct block solver; applications: industrial (OSSAU, ARLAS, fluid dynamics, molecular chemistry) and academic.]
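The precedence constraints above are given by the elimination tree of the factored matrix. As an illustration only — the textbook scalar algorithm, not the block elimination tree built by the solver — the following C sketch computes the elimination tree from the CSC pattern of a symmetric matrix; the array names (colptr, rowind, parent) are assumed for the example and are not the library's API.

```c
/* Sketch: elimination tree of a symmetric matrix given by the CSC pattern of
 * its upper triangle (column j holds row indices i; entries with i >= j are
 * ignored). parent[j] is the parent of column j, or -1 for a root.
 * Classic algorithm with path compression through an ancestor[] array. */
#include <stdlib.h>

void elimination_tree(int n, const int *colptr, const int *rowind, int *parent)
{
    int *ancestor = malloc(n * sizeof(int));
    for (int j = 0; j < n; j++) {
        parent[j]   = -1;
        ancestor[j] = -1;
        for (int p = colptr[j]; p < colptr[j + 1]; p++) {
            int i = rowind[p];                 /* nonzero a(i,j)            */
            while (i != -1 && i < j) {         /* climb towards the root    */
                int next = ancestor[i];
                ancestor[i] = j;               /* path compression          */
                if (next == -1) parent[i] = j; /* i had no parent yet       */
                i = next;
            }
        }
    }
    free(ancestor);
}
```

Once the tree is known, a column can only be eliminated after all of its children, which is exactly the precedence relation used to build the task graph.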
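Step 1 of the strategy is described as a variant of the proportional mapping technique. The sketch below shows the classic form of that technique under assumed data structures (TreeNode, subcost, child lists): the processors assigned to a node of the elimination tree are split among its children in proportion to the children's subtree costs. The solver's actual variant and its bottom-up second step are not reproduced here.

```c
/* Sketch of classic proportional mapping on an elimination tree.
 * The TreeNode structure and the precomputed subtree costs are assumptions
 * made for this example. */
typedef struct TreeNode {
    int     nchild;        /* number of children                         */
    int    *child;         /* indices of the children in the node array  */
    double  subcost;       /* factorization cost of the whole subtree    */
    int     pbeg, pend;    /* half-open range of processors mapped here  */
} TreeNode;

void proportional_map(TreeNode *node, int root, int pbeg, int pend)
{
    node[root].pbeg = pbeg;
    node[root].pend = pend;

    int nproc = pend - pbeg;
    if (node[root].nchild == 0 || nproc <= 1)
        return;                          /* whole subtree stays where it is */

    double total = 0.0;
    for (int c = 0; c < node[root].nchild; c++)
        total += node[node[root].child[c]].subcost;
    if (total <= 0.0)
        total = 1.0;                     /* avoid a division by zero        */

    double before = 0.0;
    for (int c = 0; c < node[root].nchild; c++) {
        int    ch    = node[root].child[c];
        double after = before + node[ch].subcost;
        /* processor range proportional to the child's share of the cost   */
        int lo = pbeg + (int)(nproc * before / total);
        int hi = pbeg + (int)(nproc * after  / total);
        if (lo >= pend) lo = pend - 1;   /* rounding guard                  */
        if (hi <= lo)   hi = lo + 1;     /* give every child at least one   */
        if (hi > pend)  hi = pend;
        proportional_map(node, ch, lo, hi);
        before = after;
    }
}
```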
Partial aggregation to reduce the memory overhead

– The memory overhead due to aggregations is limited to a user-defined value (a sketch of this policy follows the publication list below).
– The volume of additional communications is minimized.
– The additional messages are given an optimal priority order in the initial communication scheme.
– A reduction of about 50% of the memory overhead induces less than 20% time penalty on many test problems.
– The AUDI matrix (PARASOL collection, n= , nnzl= , 5.3 Teraflops) has been factorized in 188 s on 64 Power3 processors with a reduction of about 50% of the memory overhead (28 Gigaflop/s).

Out-of-Core technique compatible with the scheduling strategy

– Computation/IO overlap is managed with an asynchronous IO library (AIO).
– General algorithm based on the knowledge of the data accesses.
– Algorithmic minimization of the IO volume as a function of a user memory limit.
– Work in progress; preliminary experiments show a moderate increase in the number of disk requests.

Toward a compromise between memory saving and numerical robustness

– ILU(k) block preconditioner obtained by an incomplete block symbolic factorization (a scalar sketch of the level-of-fill rule closes this section).
– NSF/INRIA collaboration: P. Amestoy (ENSEEIHT-IRIT), S. Li and E. Ng (Berkeley), Y. Saad (Minneapolis).
– Experiments on an IBM SP3 (CINES): 28 NH2 SMP nodes (16 Power3 processors each), 16 GB of shared memory per node.

[Figures: level-of-fill values for a 3D finite element mesh; allocated memory vs. memory accessed during factorization; % reduction of the memory overhead vs. % time penalty; 1D and 2D block distributions.]

Industrial Applications (CEA/CESTA)

Structural engineering 2D/3D problems (OSSAU)
– Computes the response of the structure to various physical constraints.
– Non-linear when plasticity occurs.
– The system is not well conditioned: not an M-matrix, not diagonally dominant.
– Highly scalable parallel assembly for irregular meshes (a generic step of the library).
– COUPOL40000 (> unknowns, > 10 Teraflops) has been factorized in 20 s on 768 EV68 processors → 500 Gigaflop/s (about 35% of peak performance).

Electromagnetism problems (ARLAS)
– 3D finite element code on the internal domain.
– Integral equation code on the separation frontier.
– Schur complement to realize the coupling between the dense and sparse parts.
– unknowns for the sparse system and unknowns for the dense system on 256 EV68 processors → 8 min for the sparse factorization and 200 min for the Schur complement (1.5 s per forward/backward substitution).

Publications

Journal articles
– P. Hénon, P. Ramet, J. Roman. Parallel Computing, 28(2).
– D. Goudin, P. Hénon, F. Pellegrini, P. Ramet, J. Roman, J.-J. Pesqué. Numerical Algorithms, 24, Baltzer Science Publishers.
– F. Pellegrini, J. Roman, P. Amestoy. Concurrency: Practice and Experience, 12:69-84.

Conference articles
– P. Hénon, P. Ramet, J. Roman. Tenth SIAM Conference on Parallel Processing for Scientific Computing (PPSC'2001), Portsmouth, Virginia, USA, March 2001.
– P. Hénon, P. Ramet, J. Roman. Irregular'2000, Cancun, Mexico, LNCS 1800, Springer Verlag, May 2000.
– P. Hénon, P. Ramet, J. Roman. EuroPar'99, Toulouse, France, LNCS 1685, Springer Verlag, September 1999.
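To make the partial-aggregation policy concrete, here is a minimal, hypothetical sketch of the bookkeeping in C with MPI: updates destined to remote blocks are accumulated in per-block buffers (total aggregation) as long as the memory they occupy stays below the user limit, and are sent immediately otherwise, which bounds the memory overhead at the price of extra, suitably ordered messages. The AggBuf structure, the agg_used counter and the message tags are assumptions made for the example, not the solver's internal data structures.

```c
#include <mpi.h>
#include <stdlib.h>

typedef struct {
    double *buf;       /* aggregated update for one remote block          */
    int     len;       /* number of doubles in the update                 */
    int     dest;      /* MPI rank owning the target block                */
    int     tag;       /* identifies the target block on the receiver     */
    int     active;    /* is an aggregation buffer currently allocated?   */
} AggBuf;

static size_t agg_used;                /* bytes held by pending buffers    */

/* Add one contribution; fall back to an immediate send when aggregating it
 * would exceed the user-defined memory limit (partial aggregation). */
void post_contribution(AggBuf *a, const double *upd, int len,
                       size_t user_limit, MPI_Comm comm)
{
    size_t need = (size_t)len * sizeof(double);

    if (!a->active && agg_used + need > user_limit) {
        MPI_Send(upd, len, MPI_DOUBLE, a->dest, a->tag, comm);
        return;
    }
    if (!a->active) {                  /* first update for this block      */
        a->buf    = calloc(len, sizeof(double));
        a->len    = len;
        a->active = 1;
        agg_used += need;
    }
    for (int i = 0; i < len; i++)      /* accumulate into the local buffer */
        a->buf[i] += upd[i];
}

/* Send an aggregated buffer when the communication scheme schedules it. */
void flush_contribution(AggBuf *a, MPI_Comm comm)
{
    if (!a->active) return;
    MPI_Send(a->buf, a->len, MPI_DOUBLE, a->dest, a->tag, comm);
    agg_used -= (size_t)a->len * sizeof(double);
    free(a->buf);
    a->active = 0;
}
```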
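The ILU(k) preconditioner above is obtained through an incomplete block symbolic factorization. As a much simplified illustration of the underlying level-of-fill rule (scalar and dense here, unlike the block and sparse computation used in practice), the sketch below propagates levels lev(i,j): original nonzeros start at level 0, a fill-in created through pivot p gets level lev(i,p) + lev(p,j) + 1, and only entries with level at most k are kept.

```c
#include <limits.h>

#define LEV_INF (INT_MAX / 4)   /* "no entry" marker, safe against overflow */

/* lev[] is an n x n row-major level matrix: 0 for structural nonzeros of A
 * (and the diagonal), LEV_INF elsewhere. On return, entries with level <= k
 * form the pattern of the ILU(k) factors. */
void iluk_symbolic(int n, int k, int *lev)
{
    for (int p = 0; p < n; p++)                     /* elimination step p   */
        for (int i = p + 1; i < n; i++) {
            if (lev[i * n + p] > k) continue;       /* entry dropped        */
            for (int j = p + 1; j < n; j++) {
                if (lev[p * n + j] > k) continue;   /* entry dropped        */
                int through = lev[i * n + p] + lev[p * n + j] + 1;
                if (through < lev[i * n + j])
                    lev[i * n + j] = through;       /* new or cheaper fill  */
            }
        }
}
```

The block variant described on the poster applies an analogous rule to the blocks of the symbolic structure rather than to individual entries.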