
PaStiX: how to reduce memory overhead
ASTER meeting, Bordeaux, Nov 12-14, 2007
PaStiX team, LaBRI, UMR CNRS 5800, Université Bordeaux I
Projet ScAlApplix, INRIA Futurs

PaStiX solver
Current development team (SOLSTICE):
P. Hénon (researcher, INRIA)
F. Pellegrini (assistant professor, LaBRI/INRIA)
P. Ramet (assistant professor, LaBRI/INRIA)
J. Roman (professor, leader of the ScAlApplix INRIA project)
PhD student & engineer:
M. Faverge (NUMASIS project)
X. Lacoste (INRIA)
Other contributors since 1998:
D. Goudin (CEA-DAM)
D. Lecas (CEA-DAM)
Main users:
Electromagnetism & structural mechanics codes at CEA-DAM CESTA
MHD plasma instabilities for ITER at CEA-Cadarache (ASTER)
Fluid mechanics at MAB Bordeaux

PaStiX solver
Functionalities:
LLt, LDLt, LU factorization (symmetric pattern) with a supernodal implementation
Static pivoting (Max. Weight Matching) + iterative refinement / CG / GMRES
1D/2D block distribution + full BLAS3
Supports external ordering libraries (Scotch ordering provided)
MPI/Threads implementation (SMP node / cluster / multi-core / NUMA)
Single/double precision + real/complex arithmetic
Requires only C + MPI + POSIX threads
Multiple RHS (direct factorization)
Incomplete factorization ILU(k) preconditioner
Available on INRIA GForge:
All-in-one source code
Easy to install on Linux or AIX systems
Simple API (WSMP-like); a call sketch is given after this list
Thread safe (can be called from multiple threads in multiple MPI communicators)
Current work:
Use of parallel ordering (PT-Scotch) and parallel symbolic factorization
Dynamic scheduling inside SMP nodes (static mapping)
Out-of-core implementation
Generic finite element assembly (domain decomposition associated with the matrix distribution)
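To illustrate the "simple API" point, here is a minimal call sketch in the spirit of the PaStiX 5-era C interface. The exact prototype, the IPARM/DPARM constants and the CSC arrays below are assumptions recalled from documentation of that period, not an excerpt of the solver's headers; the pastix.h of the installed release remains the authoritative reference.

```c
/* Minimal sketch of a PaStiX call on a tiny SPD matrix stored in 1-based CSC.
 * Prototype and IPARM/DPARM constants follow the PaStiX 5-era API as recalled
 * by the editor; treat them as assumptions and check pastix.h. */
#include <mpi.h>
#include "pastix.h"

int main(int argc, char **argv)
{
    pastix_data_t *pastix_data = NULL;
    pastix_int_t   n = 3;
    /* lower triangular part of the 3x3 matrix [4 -1 0; -1 4 -1; 0 -1 4] */
    pastix_int_t   colptr[4] = {1, 3, 5, 6};
    pastix_int_t   rows[5]   = {1, 2, 2, 3, 3};
    pastix_float_t avals[5]  = {4.0, -1.0, 4.0, -1.0, 4.0};
    pastix_float_t rhs[3]    = {1.0, 1.0, 1.0};
    pastix_int_t   perm[3], invp[3];
    pastix_int_t   iparm[IPARM_SIZE];
    double         dparm[DPARM_SIZE];

    MPI_Init(&argc, &argv);

    /* first call: fill iparm/dparm with default parameters */
    iparm[IPARM_MODIFY_PARAMETER] = API_NO;
    iparm[IPARM_START_TASK]       = API_TASK_INIT;
    iparm[IPARM_END_TASK]         = API_TASK_INIT;
    pastix(&pastix_data, MPI_COMM_WORLD, n, colptr, rows, avals,
           perm, invp, rhs, 1, iparm, dparm);

    /* second call: ordering through factorization, solve and clean-up */
    iparm[IPARM_THREAD_NBR] = 2;                 /* threads per MPI process */
    iparm[IPARM_START_TASK] = API_TASK_ORDERING;
    iparm[IPARM_END_TASK]   = API_TASK_CLEAN;
    pastix(&pastix_data, MPI_COMM_WORLD, n, colptr, rows, avals,
           perm, invp, rhs, 1, iparm, dparm);    /* rhs now holds the solution */

    MPI_Finalize();
    return 0;
}
```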

pastix.gforge.inria.fr
Latest publication (to appear in Parallel Computing): "On finding approximate supernodes for an efficient ILU(k) factorization".
For more publications, see:

Direct solver chain (in PaStiX)
Scotch (ordering & amalgamation) → Fax (block symbolic factorization) → Blend (refinement & mapping) → Sopalin (factorizing & solving)
Data flow: graph → partition → symbol matrix → distributed solver matrix → distributed factorized solver matrix → distributed solution
Analysis (sequential steps: Scotch, Fax, Blend) // parallel factorization and solve (Sopalin)

Direct solver chain (in PaStiX)
Scotch (ordering & amalgamation) → Fax (block symbolic factorization) → Blend (refinement & mapping) → Sopalin (factorizing & solving)
Data flow: graph → partition → symbol matrix → distributed solver matrix → distributed factorized solver matrix → distributed solution
Sparse matrix ordering (minimizes fill-in). Scotch uses a hybrid algorithm: incomplete Nested Dissection, the resulting subgraphs being ordered with an Approximate Minimum Degree method under constraints (HAMD). A toy ordering sketch follows this slide.
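To make the Nested Dissection idea concrete, here is a toy, self-contained sketch (not Scotch code): it orders a 1D chain graph by numbering the two halves first and the middle separator vertex last, recursively, and falls back to a plain ordering on small subgraphs where a real ordering would switch to (H)AMD.

```c
/* Toy nested dissection ordering of a 1D chain graph (vertices 0..n-1,
 * edges i -- i+1).  Illustration of the principle only: Scotch works on
 * general graphs and orders the small subgraphs with HAMD. */
#include <stdio.h>

static void nd_order(int lo, int hi, int *next, int *perm)
{
    if (lo > hi)
        return;
    if (hi - lo + 1 <= 3) {               /* small subgraph: plain ordering,  */
        for (int i = lo; i <= hi; i++)    /* stands in for the HAMD fallback  */
            perm[i] = (*next)++;
        return;
    }
    int mid = (lo + hi) / 2;              /* one-vertex separator of a chain  */
    nd_order(lo, mid - 1, next, perm);    /* number the left part first       */
    nd_order(mid + 1, hi, next, perm);    /* then the right part              */
    perm[mid] = (*next)++;                /* separator vertices come last     */
}

int main(void)
{
    enum { N = 15 };
    int perm[N], next = 0;

    nd_order(0, N - 1, &next, perm);
    for (int i = 0; i < N; i++)
        printf("vertex %2d -> elimination rank %2d\n", i, perm[i]);
    return 0;
}
```

Eliminating the two halves before the separator creates no fill between them, which is exactly what keeps fill-in low in the factored matrix.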

Partition and Block Elimination Tree → domain decomposition

Direct solver chain (in PaStiX)
Scotch (ordering & amalgamation) → Fax (block symbolic factorization) → Blend (refinement & mapping) → Sopalin (factorizing & solving)
Data flow: graph → partition → symbol matrix → distributed solver matrix → distributed factorized solver matrix → distributed solution
The symbolic block factorization: Q(G,P) → Q(G,P)* = Q(G*,P), so the computation is linear in the number of blocks!
Dense block structures → only a few extra pointers are needed to store the matrix
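The "few extra pointers" point can be pictured with a small C sketch of a block-wise description of the factors. The structure and field names below are illustrative assumptions, not PaStiX's internal data structures: each column block records its column range and a list of dense off-diagonal blocks, each block keeping only its row range and the column block it faces.

```c
/* Illustrative block storage for a supernodal factor: the symbolic overhead
 * is a handful of integers per block, while the numerical entries live in
 * one contiguous dense array per column block. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int fcolnum, lcolnum;   /* first and last column of the column block     */
    int fblok, lblok;       /* range of its blocks in the global block list  */
    int coefind;            /* offset of its coefficients in the value array */
} ColumnBlock;

typedef struct {
    int frownum, lrownum;   /* first and last row of the dense block         */
    int fcblk;              /* index of the facing column block              */
} Block;

int main(void)
{
    /* two supernodes: columns 0-2 and 3-5, with one off-diagonal block      */
    ColumnBlock cblk[2] = { {0, 2, 0, 1, 0}, {3, 5, 2, 2, 18} };
    Block       blok[3] = { {0, 2, 0}, {3, 5, 1}, {3, 5, 1} };
    double     *coefs   = calloc(27, sizeof(double)); /* dense block entries */

    for (int c = 0; c < 2; c++)
        printf("cblk %d: columns %d-%d, blocks %d-%d\n",
               c, cblk[c].fcolnum, cblk[c].lcolnum, cblk[c].fblok, cblk[c].lblok);
    free(coefs);
    (void)blok;
    return 0;
}
```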

Direct solver chain (in PaStiX)
Scotch (ordering & amalgamation) → Fax (block symbolic factorization) → Blend (refinement & mapping) → Sopalin (factorizing & solving)
Data flow: graph → partition → symbol matrix → distributed solver matrix → distributed factorized solver matrix → distributed solution
The mapping and scheduling step gives a CPU time prediction and the exact memory resources needed.
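The kind of cost model behind such predictions can be pictured with a rough per-supernode estimator like the one below. The formulas are standard dense Cholesky panel counts and the block sizes are illustrative assumptions; this is not the model actually implemented in Blend.

```c
/* Rough cost estimate for one supernode of a block Cholesky factorization:
 * a column block of width w with h rows in total costs about w^3/3 flops
 * for the diagonal factorization, w^2*(h-w) for the triangular solves and
 * roughly w*(h-w)^2 for the trailing update; it stores h*w entries. */
#include <stdio.h>

static double supernode_flops(double w, double h)
{
    double r = h - w;                      /* rows below the diagonal block */
    return w * w * w / 3.0 + w * w * r + w * r * r;
}

static double supernode_bytes(double w, double h)
{
    return h * w * sizeof(double);
}

int main(void)
{
    double w = 64.0, h = 1024.0;           /* illustrative block sizes */
    printf("width %.0f, height %.0f: %.2e flops, %.1f MB\n",
           w, h, supernode_flops(w, h), supernode_bytes(w, h) / 1e6);
    return 0;
}
```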

Direct solver chain (in PaStiX)
Scotch (ordering & amalgamation) → Fax (block symbolic factorization) → Blend (refinement & mapping) → Sopalin (factorizing & solving)
Data flow: graph → partition → symbol matrix → distributed solver matrix → distributed factorized solver matrix → distributed solution
Modern architecture management (SMP nodes): hybrid Threads/MPI implementation (all processors in the same SMP node work directly in shared memory)
→ less MPI communication and a lower parallel memory overhead

Ways to reduce memory overhead
Partial aggregation:
Aims at reducing the memory overhead due to the aggregations
Additional step at the end of the pre-processing, using a given memory upper bound
→ anticipated emission of partially aggregated blocks
On most test cases: about 50% less memory at the cost of roughly 30% more factorization time
Need more? Out-of-core implementation:
Manage the overlap of disk I/O and computation
The static scheduling → a good I/O prefetch scheme
First needs an efficient SMP implementation to share and schedule I/O requests

MPI/Threads implementation for SMP clusters
Mapping by processor:
Static scheduling by processor
Each processor owns its local part of the matrix (private user space)
Message passing (MPI or MPI shared memory) between any processors
Aggregation of all contributions is done per processor
Data coherency ensured by MPI semantics
Mapping by SMP node:
Static scheduling by thread
All the processors of a same SMP node share a local part of the matrix (shared user space)
Message passing (MPI) between processors on different SMP nodes
Direct access to shared memory (pthreads) between processors on the same SMP node
Aggregation of non-local contributions is done per node
Data coherency ensured by explicit mutexes

MPI only
Processors 1 and 2 belong to the same SMP node.
Data exchanges when only MPI processes are used in the parallelization.

MPI/Threads
Threads 1 and 2 are created by one MPI process.
Data exchanges when there is one MPI process per SMP node and one thread per processor:
→ much less MPI communication (only between SMP nodes)
→ no aggregation for modifications of blocks belonging to processors inside an SMP node
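A minimal skeleton of this hybrid setup, assuming one MPI process per SMP node and one POSIX thread per processor; since the threads may call MPI concurrently (as the next slide notes), MPI_THREAD_MULTIPLE is requested. This is a generic sketch, not PaStiX source code.

```c
/* Hybrid MPI + pthreads skeleton: one MPI process per SMP node, one worker
 * thread per processor.  MPI_THREAD_MULTIPLE lets any thread post MPI
 * messages towards other nodes while sharing the node-local part of the
 * matrix through plain shared memory. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define THREADS_PER_NODE 4   /* illustrative: one per processor of the node */

static void *worker(void *arg)
{
    int id = *(int *)arg;
    /* ... factorization tasks: shared-memory updates inside the node,
     * MPI messages (aggregated contributions) towards other nodes only. */
    printf("thread %d running\n", id);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    pthread_t tid[THREADS_PER_NODE];
    int       id[THREADS_PER_NODE];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < THREADS_PER_NODE; i++) {
        id[i] = i;
        pthread_create(&tid[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < THREADS_PER_NODE; i++)
        pthread_join(tid[i], NULL);

    MPI_Finalize();
    return 0;
}
```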

AUDI (symmetric): requires MPI_THREAD_MULTIPLE!

AUDI (symmetric): times for varying numbers of processors and of threads per MPI process, on an SP4 (32-way Power4+ nodes, 64 GB) and an SP5 (16-way Power5 nodes, 32 GB); some configurations fail ("pb. alloc.") or have no meaning.

MHD1 (unsymmetric): times for varying numbers of processors and of threads per MPI process, on an SP4 (32-way Power4+ nodes, 64 GB) and an SP5 (16-way Power5 nodes, 32 GB).

How to reduce memory resources
Goal: adapt a (supernodal) parallel direct solver (PaStiX) to build an incomplete block factorization and benefit from all the features it provides:
algorithms based on dense linear algebra kernels (BLAS)
load balancing and task scheduling based on a fine modeling of computation and communication
modern architecture management (SMP nodes): hybrid Threads/MPI implementation

Main steps of the incomplete factorization algorithm (a toy level-of-fill sketch follows this list):
1. find the partition P0 induced by the supernodes of A
2. compute the block symbolic incomplete factorization Q(G,P0)^k = Q(G^k,P0)
3. find the exact supernode partition in Q(G,P0)^k
4. given an extra fill-in tolerance α, construct an approximated supernode partition Pα to improve the block structure of the incomplete factors
5. apply a block incomplete factorization using the parallelization techniques developed for our direct solver PaStiX
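Step 2 relies on the usual level-of-fill rule behind ILU(k). The toy program below (not PaStiX code, and scalar rather than block-wise) computes that symbolic pattern for a small matrix with a dense level array, just to make the rule concrete: an entry of A has level 0, a fill entry (i,j) created by pivot t gets level lev(i,t) + lev(t,j) + 1, and only entries with level <= k are kept.

```c
/* Toy level-of-fill computation for ILU(k) on a small symmetric pattern.
 * A dense n x n level array is fine for illustration, not for large problems. */
#include <stdio.h>
#include <limits.h>

#define N   6
#define INF (INT_MAX / 4)

int main(void)
{
    /* pattern of a small symmetric matrix (1 = nonzero) -- illustrative only */
    int a[N][N] = {
        {1,1,0,0,1,0},
        {1,1,1,0,0,0},
        {0,1,1,1,0,0},
        {0,0,1,1,1,0},
        {1,0,0,1,1,1},
        {0,0,0,0,1,1},
    };
    int k = 1;                        /* fill level threshold of ILU(k) */
    int lev[N][N];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            lev[i][j] = a[i][j] ? 0 : INF;

    /* symbolic elimination: propagate fill levels pivot by pivot */
    for (int t = 0; t < N; t++)
        for (int i = t + 1; i < N; i++)
            if (lev[i][t] <= k)
                for (int j = t + 1; j < N; j++) {
                    int l = lev[i][t] + lev[t][j] + 1;
                    if (l < lev[i][j])
                        lev[i][j] = l;
                }

    printf("ILU(%d) pattern (x = kept entry):\n", k);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            putchar(lev[i][j] <= k ? 'x' : '.');
        putchar('\n');
    }
    return 0;
}
```

In the actual algorithm the same rule is applied block-wise on the quotient graph Q(G,P0), which is what makes the symbolic step cheap.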

Parallel time: AUDI (Power5). Table columns: level of fill K, fill tolerance α, and, for 1 processor and for 16 processors, the factorization time, triangular solve time and total time.

ASTER project (ANR CIS 2006): Adaptive MHD Simulation of Tokamak ELMs for ITER (ASTER)
The ELM (Edge-Localized Mode) is an MHD instability localised at the boundary of the plasma.
The energy losses induced by the ELMs within several hundred microseconds are a real concern for ITER.
The non-linear MHD simulation code JOREK is under development at the CEA to study the evolution of the ELM instability.
To simulate the complete cycle of the ELM instability, a large range of time scales needs to be resolved:
the evolution of the equilibrium pressure gradient (~seconds)
the ELM instability itself (~hundred microseconds)
→ a fully implicit time evolution scheme is used in the JOREK code.
This leads to a large sparse matrix system to be solved at every time step. This scheme is possible thanks to the recent advances made in the parallel direct solution of general sparse matrices (MUMPS, PaStiX, SuperLU, ...).