
1 Bubbles in PaStiX (Des bulles dans le PaStiX), NUMASIS meeting, Mathieu Faverge, ScAlApplix project, INRIA Futurs Bordeaux, 29 November 2006

2 Introduction
Objectives:
- Scheduling for NUMA architectures
- Use with the direct and incomplete solvers
- Relax the static scheduling on each SMP node
- Integrate the MPICH-Madeleine library
- Work on the out-of-core aspects
Application: PaStiX

3 Direct factorization techniques: PaStiX key points
- Load balancing and scheduling are based on a fine modeling of computation and communication.
- Modern architecture management (SMP nodes): hybrid threads/MPI implementation.
- Control of the memory overhead due to the aggregation of contributions in the supernodal block solver.
Processing chain: Scotch (ordering & amalgamation) -> Fax (block symbolic factorization) -> Blend (refinement & mapping) -> Sopalin (factorization & solve); the data flows from the graph to the partition, the block symbolic matrix, the distributed solver matrix, the distributed factorized matrix and finally the distributed solution.

4 Matrix partitioning and mapping
- Manage the parallelism induced by sparsity (block elimination tree).
- Split and distribute the dense blocks in order to exploit the potential parallelism induced by dense computations.
- Use the optimal block size for pipelined BLAS-3 operations (see the splitting sketch below).
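
As an illustration of the last point, here is a minimal Python sketch of a splitting rule that cuts a wide supernode into column chunks close to an optimal BLAS-3 block size. The rounding rule and the parameter names are assumptions for illustration, not the rule used by Blend.

```python
def split_supernode(width, opt_block):
    """Cut a supernode of `width` columns into chunks whose sizes stay close
    to the optimal BLAS-3 block size `opt_block` (illustrative rule only)."""
    nparts = max(1, round(width / opt_block))
    base, extra = divmod(width, nparts)
    # distribute the remainder so chunk sizes differ by at most one column
    return [base + (1 if p < extra else 0) for p in range(nparts)]

# example: split_supernode(1000, 128) -> [125, 125, 125, 125, 125, 125, 125, 125]
```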

5 Supernodal factorization algorithm
FACTOR(k): factorize diagonal block k
  Factorize A_kk into L_kk L_kk^T;
BDIV(j,k): update L_jk (BLAS 3)
  Solve L_kk L_jk^T = A_jk^T;
BMOD(i,j,k): compute the contribution of L_ik and L_jk to block L_ij (BLAS 3)
  A_ij = A_ij - L_ik L_jk^T;
(Figure: blocks L_ik, L_jk and A_ij within column blocks k and j.)
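
To make the three kernels concrete, here is a minimal dense Python/NumPy sketch of the same right-looking scheme: FACTOR on the diagonal block, BDIV (triangular solve) on the blocks below it, BMOD (GEMM-like update) on the trailing blocks. It assumes a dense SPD matrix cut into equal blocks; the real solver applies these kernels only to the non-zero blocks given by the block symbolic factorization.

```python
import numpy as np

def block_cholesky(A, nb):
    """Right-looking blocked Cholesky, L L^T = A.
    A: dense SPD matrix (overwritten; its lower triangle ends up holding L).
    nb: block size. A dense stand-in for the sparse supernodal kernels."""
    n = A.shape[0]
    for k in range(0, n, nb):
        ke = min(k + nb, n)
        # FACTOR(k): factorize the diagonal block, A_kk = L_kk L_kk^T
        A[k:ke, k:ke] = np.linalg.cholesky(A[k:ke, k:ke])
        L_kk = A[k:ke, k:ke]
        # BDIV(j,k): solve L_kk L_jk^T = A_jk^T for every block row j below k
        for j in range(ke, n, nb):
            je = min(j + nb, n)
            A[j:je, k:ke] = np.linalg.solve(L_kk, A[j:je, k:ke].T).T
        # BMOD(i,j,k): A_ij -= L_ik L_jk^T for the trailing blocks i >= j > k
        for j in range(ke, n, nb):
            je = min(j + nb, n)
            for i in range(j, n, nb):
                ie = min(i + nb, n)
                A[i:ie, j:je] -= A[i:ie, k:ke] @ A[j:je, k:ke].T
    return np.tril(A)

# quick check on a random SPD matrix
rng = np.random.default_rng(0)
M = rng.standard_normal((9, 9))
A = M @ M.T + 9 * np.eye(9)
L = block_cholesky(A.copy(), nb=3)
assert np.allclose(L @ L.T, A)
```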

6 Parallel factorization algorithm
COMP1D(k): factorize column block k and compute all contributions to the column blocks in BCol(k)
  Factorize A_kk into L_kk L_kk^T;
  Solve L_kk L_*k^T = A_*k^T;
  For j in BCol(k) do
    Compute C_[j] = L_[j]k L_jk^T;
    If map([j], j) == p then
      A_[j]j = A_[j]j - C_[j];
    Else
      AUB_[j]j = AUB_[j]j + C_[j];
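
Below is a minimal Python sketch of COMP1D's decision between a direct local update and accumulation into an aggregate update block (AUB). The data layout (`L` as a dict of dense blocks, `col_blocks`, `map_of`, `aub`) and the helper names are assumptions for illustration, not the PaStiX data structures.

```python
import numpy as np

def comp1d(k, L, col_blocks, map_of, my_rank, aub):
    """COMP1D(k) sketch: factorize column block k, then either apply each
    contribution locally or sum it into an aggregate update block (AUB)
    that will later be sent to the owning processor in a single message.
    L: dict (i, j) -> dense block; col_blocks[k]: block rows facing block k;
    map_of[(i, j)]: rank owning block (i, j); aub: dict of pending AUBs."""
    # FACTOR(k): A_kk = L_kk L_kk^T
    L[k, k] = np.linalg.cholesky(L[k, k])
    # BDIV(j,k): solve L_jk L_kk^T = A_jk for every facing block row j
    for j in col_blocks[k]:
        L[j, k] = np.linalg.solve(L[k, k], L[j, k].T).T
    # BMOD: contribution C = L_ik L_jk^T toward block (i, j)
    for j in col_blocks[k]:
        for i in (b for b in col_blocks[k] if b >= j):
            contrib = L[i, k] @ L[j, k].T
            if map_of[(i, j)] == my_rank:
                L[i, j] -= contrib                    # local: update in place
            else:                                     # remote: aggregate locally,
                key = (map_of[(i, j)], i, j)          # one AUB per target block;
                aub[key] = aub.get(key, 0) + contrib  # the receiver subtracts it
```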

7 Local aggregation of block updates
- Column blocks k1 and k2 are mapped on processor P1; column block j is mapped on processor P2.
- Contributions from processor P1 to block A_ij of processor P2 are summed locally in AUB_ij.
- Processors communicate using aggregate update blocks only.
- Critical memory overhead, in particular for 3D problems.

8 Mapping and scheduling (figure): the inputs are the matrix partitioning, the tasks graph, the block symbolic matrix, a cost model of computation and communication, and the number of processors. The mapping and scheduling step produces the local data, the tasks scheduling and the communication scheme used by the parallel factorization and solver, together with an estimate of the computation time and of the memory allocation during factorization.

9 Matrix partitioning and mapping

10 Hybrid 1D/2D block distribution
- Yields both 1D and 2D block distributions.
- BLAS efficiency on small supernodes -> 1D.
- Scalability on larger supernodes -> 2D.
- A switching criterion decides between the two.
(Figure: 1D block distribution vs. 2D block distribution.)
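
A minimal sketch of such a switching rule, assuming a simple width threshold and a square-ish process grid (the actual criterion used by Blend is not reproduced here): narrow supernodes stay in a 1D column-block distribution, wide ones are mapped onto a 2D grid.

```python
import math

def choose_distribution(supernode_width, nprocs, width_threshold=128):
    """Return a 1D or 2D distribution for one supernode.
    The threshold and the grid shape are illustrative assumptions."""
    if nprocs == 1 or supernode_width < width_threshold:
        # small supernode: keep the column block together (BLAS-3 efficiency)
        return ("1D", (1, 1))
    # large supernode: 2D grid of blocks so the trailing updates scale
    p = int(math.sqrt(nprocs))
    return ("2D", (p, nprocs // p))
```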

11 Matrix partitioning and mapping
- "2D" to "2D" communication scheme.
- A dynamic technique is used to improve the "1D" to "1D/2D" communication scheme.

12 MPI/thread SMP implementation
Mapping by processor:
- Static scheduling by processor.
- Each processor owns its local part of the matrix (private user space).
- Message passing (MPI or MPI shared memory) between any two processors.
- Aggregation of all contributions is done per processor.
- Data coherency ensured by MPI semantics.
Mapping by SMP node:
- Static scheduling by thread.
- All the processors of a same SMP node share a local part of the matrix (shared user space).
- Message passing (MPI) between processors on different SMP nodes.
- Direct access to shared memory (pthreads) between processors on the same SMP node.
- Aggregation of non-local contributions is done per node.
- Data coherency ensured by explicit mutexes.
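
A minimal sketch of the node-level scheme described in the second list, using Python threads as a stand-in for the pthreads of the real implementation: the threads of one SMP node share the local part of the matrix, non-local contributions are aggregated once per node, and an explicit mutex keeps the shared structures coherent. The MPI exchange of the aggregated blocks between nodes is only indicated by a comment.

```python
import threading
import numpy as np

# Shared state of one SMP node (one MPI process, one thread per processor).
local_blocks = {}              # the node's part of the matrix (shared user space)
node_aub = {}                  # aggregate update blocks bound for other nodes
state_lock = threading.Lock()  # data coherency ensured by an explicit mutex

def worker(task_list, my_node):
    """Static list of tasks for one thread; each task yields a contribution
    for a destination block owned either by this node or by a remote one."""
    for dest_node, block_id, contrib in task_list:
        with state_lock:
            if dest_node == my_node:
                local_blocks[block_id] -= contrib        # direct shared-memory update
            else:                                        # aggregation per node
                node_aub.setdefault((dest_node, block_id),
                                    np.zeros_like(contrib))
                node_aub[(dest_node, block_id)] += contrib

def run_node(task_lists, my_node):
    threads = [threading.Thread(target=worker, args=(tasks, my_node))
               for tasks in task_lists]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Here one MPI message per destination node would carry its AUBs
    # (omitted: this sketch only covers the intra-node part).
    return node_aub
```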

13 MPI only
Processors 1 and 2 belong to the same SMP node. (Figure: data exchanges when only MPI processes are used in the parallelization.)

14 MPI/threads
Threads 1 and 2 are created by one MPI process. (Figure: data exchanges when there is one MPI process per SMP node and one thread per processor.)

15 MPICH-Madeleine
- A communication support for clusters and multi-clusters.
- Multiple network protocols: MPI, TCP, SCI, Myrinet, ...
- Priority management.
- Dynamic aggregation of transfers.
- Packet reordering.
- Non-deterministic.
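
To illustrate the aggregation and priority ideas listed above, here is a toy Python sketch, not the MPICH-Madeleine API: pending packets for the same destination are coalesced into one transfer, and higher-priority packets are flushed first, so the delivery order may differ from the submission order.

```python
import heapq
from collections import defaultdict

class AggregatingOutbox:
    """Toy outbox: aggregate pending packets per destination and flush them
    by priority (illustrative of the concepts only, not the library)."""
    def __init__(self):
        self.pending = defaultdict(list)   # dest -> heap of (-priority, seq, payload)
        self.seq = 0

    def post(self, dest, payload, priority=0):
        self.seq += 1
        heapq.heappush(self.pending[dest], (-priority, self.seq, payload))

    def flush(self, dest, send):
        """Send every pending packet for `dest` as one aggregated transfer."""
        packets = []
        while self.pending[dest]:
            _, _, payload = heapq.heappop(self.pending[dest])
            packets.append(payload)
        if packets:
            send(dest, b"".join(packets))  # one message instead of many

# usage: outbox.post(1, b"urgent", priority=5); outbox.post(1, b"bulk")
#        outbox.flush(1, my_send_function)
```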

16 Marcel library
- User-level thread library from the PM2 execution environment, developed by the Runtime project.
- Portable, modular, efficient, extensible, monitorable.
- Based on a user-level scheduler driven by criteria defined by the application (memory affinity, load, task priority, ...).

17 Future SMP implementation
Current scheme:
- Mapping by SMP node.
- Static scheduling by thread.
- No localisation of threads on the SMP nodes.
- One task = reception + computation + emission.
- Aggregation of non-local contributions is done per node.
Planned scheme:
- Mapping by SMP node.
- Dynamic scheduling by Marcel threads.
- Threads spread over the SMP node by memory affinity or other criteria.
- Separation of communication and computation tasks.
- Aggregation made at the MPICH-Madeleine level.
- Use of an adapted scheduling.
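
As a small illustration of scheduling by affinity criteria, here is a hedged Python sketch of a placement rule, not Marcel's scheduler: prefer a worker on the NUMA node that holds the task's data, break ties by current load, and fall back to the least loaded worker overall.

```python
def choose_worker(task, workers):
    """Affinity-first placement sketch.
    task: {"data_node": int}; workers: [{"numa_node": int, "load": float}, ...]
    (field names are illustrative assumptions)."""
    local = [w for w in workers if w["numa_node"] == task["data_node"]]
    pool = local if local else workers
    return min(pool, key=lambda w: w["load"])
```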

18 Matrix partitioning and mapping (figure: 1 SMP node, 1 thread)

19 Threads Mapping

20 Perspectives for the use of Madeleine
- Aggregation made by Madeleine for a same destination.
- Asynchronous management.
- Adaptive packet size.
- Management of communication priorities.
- Separation of communication and computation tasks.

21 Out-of-core sparse direct solvers
- Sparse direct methods have large memory requirements; memory becomes the bottleneck.
- Use out-of-core techniques to treat larger problems: write currently unneeded data to disk and prefetch it when needed.
- Design efficient I/O mechanisms and sophisticated prefetching schemes.
(Figure: sparse solver on top of a prefetching layer and an I/O abstraction layer (fread/fwrite, read/write, asynchronous I/O), with a computation thread and an I/O thread coupled through a synchronization layer and an I/O request queue.)
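
A minimal Python sketch of the structure pictured on the slide: the computation side posts requests to an I/O request queue that a dedicated I/O thread serves, so disk latency overlaps with factorization work. The file layout and the method names are assumptions for illustration, not the PaStiX out-of-core layer.

```python
import queue
import threading
import numpy as np

class OutOfCoreStore:
    """Toy asynchronous block store with an I/O request queue, a dedicated
    I/O thread and a synchronization layer for completed prefetches."""
    def __init__(self, path_prefix):
        self.prefix = path_prefix
        self.requests = queue.Queue()          # I/O request queue
        self.ready = {}                        # block id -> prefetched array
        self.cond = threading.Condition()      # synchronization layer
        threading.Thread(target=self._io_thread, daemon=True).start()

    def _io_thread(self):
        while True:
            op, block_id, data = self.requests.get()
            fname = f"{self.prefix}_{block_id}.npy"
            if op == "write":                  # evict: write unneeded data to disk
                np.save(fname, data)
            elif op == "prefetch":             # load a block ahead of its use
                block = np.load(fname)
                with self.cond:
                    self.ready[block_id] = block
                    self.cond.notify_all()
            self.requests.task_done()

    def evict(self, block_id, data):
        self.requests.put(("write", block_id, data))

    def prefetch(self, block_id):
        self.requests.put(("prefetch", block_id, None))

    def wait_for(self, block_id):
        with self.cond:
            while block_id not in self.ready:
                self.cond.wait()
            return self.ready.pop(block_id)
```

A computation thread would typically evict a column block once its updates are done, prefetch it again a few tasks before it is reused, and call wait_for only at the last moment, so the read overlaps with computation.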

22 Links
Scotch:
PaStiX:
MUMPS:
ScAlApplix:
RUNTIME:
ANR CIGC Numasis; ANR CIS Solstice & Aster