MPI on a Million Processors
Pavan Balaji,1 Darius Buntinas,1 David Goodell,1 William Gropp,2 Sameer Kumar,3 Ewing Lusk,1 Rajeev Thakur,1 Jesper Larsson Träff4
1 Argonne National Laboratory  2 University of Illinois  3 IBM Watson Research Center  4 NEC Laboratories Europe

2 Introduction
Systems with the largest core counts in the June 2009 Top500 list:
–Juelich BG/P: 294,912 cores
–LLNL BG/L: 212,992 cores
–Argonne BG/P: 163,840 cores
–Oak Ridge Cray XT5: 150,152 cores
–LLNL BG/P (Dawn): 147,456 cores
In a few years, we will have systems with a million cores or more
For example, in 2012, the Sequoia machine at Livermore will be an IBM Blue Gene/Q with 1,572,864 cores (~1.6 million cores)

3 MPI on Million-Core Systems
The vast majority of parallel scientific applications today use MPI
Some researchers and users wonder (and perhaps even doubt) whether MPI will scale to large processor counts
In this paper, we examine how scalable MPI is
–What is needed in the MPI specification
–What is needed from implementations
We ran experiments on up to 131,072 processes on Argonne's IBM BG/P (80% of the full machine)
–Tuned the MPI implementation to reduce memory requirements
We also consider application algorithmic scalability and ways of using MPI that improve scalability in applications

4 Factors Affecting Scalability
Performance and memory consumption
A nonscalable MPI function is one whose time or memory consumption per process increases linearly (or worse) with the total number of processes (all else being equal)
For example
–If the time taken by MPI_Comm_spawn increases linearly or worse with the number of processes being spawned, it indicates a nonscalable implementation of the function
–If the memory consumption of MPI_Comm_dup increases linearly with the number of processes, it is not scalable
Such examples need to be identified and fixed (in the specification and in implementations)
The goal should be to use constructs that require only constant space per process

5 Scalability Issues in the MPI Specification
Some function parameters are of size O(nprocs)
–e.g., the irregular (or "v") versions of collectives such as MPI_Gatherv
–Extreme case: MPI_Alltoallw takes six such arrays; on a million processes, that requires about 24 MB on each process (see the sketch below)
–On low-frequency cores, even scanning through such large arrays takes time (see next slide)
–The MPI Forum is working to address this issue (proposal by Jesper and Torsten)
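The cost is visible in the call signature alone. Below is a minimal sketch of a zero-byte MPI_Alltoallw call: no data moves, yet every process must allocate and fill six arrays of length nprocs (roughly 24 MB in total at a million processes, assuming 4-byte count and datatype-handle entries).

```c
#include <mpi.h>
#include <stdlib.h>

/* Sketch only: shows why MPI_Alltoallw's argument lists are O(nprocs).
   With nprocs = 1,000,000 and 4-byte entries, the six arrays below occupy
   roughly 24 MB per process, even though no data is transferred. */
int main(int argc, char **argv)
{
    int nprocs;
    char sbuf[1], rbuf[1];                      /* dummy buffers; all counts are 0 */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int *sendcounts = calloc(nprocs, sizeof(int));
    int *sdispls    = calloc(nprocs, sizeof(int));
    int *recvcounts = calloc(nprocs, sizeof(int));
    int *rdispls    = calloc(nprocs, sizeof(int));
    MPI_Datatype *sendtypes = malloc(nprocs * sizeof(MPI_Datatype));
    MPI_Datatype *recvtypes = malloc(nprocs * sizeof(MPI_Datatype));
    for (int i = 0; i < nprocs; i++)
        sendtypes[i] = recvtypes[i] = MPI_BYTE;

    /* Zero-byte exchange: the implementation must still scan all six arrays. */
    MPI_Alltoallw(sbuf, sendcounts, sdispls, sendtypes,
                  rbuf, recvcounts, rdispls, recvtypes, MPI_COMM_WORLD);

    free(sendcounts); free(sdispls); free(recvcounts); free(rdispls);
    free(sendtypes);  free(recvtypes);
    MPI_Finalize();
    return 0;
}
```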

6 Zero-byte MPI_Alltoallv time on BG/P
This is just the time to scan the parameter arrays and determine that all counts are 0 bytes; no communication is performed (the kind of scan involved is sketched below)
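For illustration (this is not MPICH's actual code), the check an implementation must perform before it can skip a zero-byte irregular collective is an O(nprocs) loop, which already takes noticeable time on a slow core at a million processes:

```c
/* Illustrative only: the kind of O(nprocs) scan an MPI implementation must
   do over an irregular collective's count array before it can decide that
   no data needs to move. A million iterations is visible on a slow core. */
static int all_counts_zero(const int *counts, int nprocs)
{
    for (int i = 0; i < nprocs; i++)
        if (counts[i] != 0)
            return 0;   /* at least one nonzero count: real work to do */
    return 1;           /* every count is zero: nothing to communicate */
}
```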

7 Scalability Issues in the MPI Specification
Graph topology
–In MPI 2.1 and earlier, requires the entire graph to be specified on each process
–Fixed in MPI 2.2 with the distributed graph topology (see the sketch below)
One-sided communication
–Synchronization functions turn out to be expensive
–Being addressed by the RMA working group of MPI-3
Representation of process ranks
–Some functions, such as MPI_Group_incl and MPI_Group_excl, take explicit arrays of process ranks
–More concise representations should be considered
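A minimal sketch of the MPI 2.2 distributed graph interface: each process declares only its own neighbors (a 1-D ring is assumed here purely for illustration), so per-process memory is proportional to its degree rather than to the size of the whole graph.

```c
#include <mpi.h>

/* Sketch: MPI 2.2 distributed graph topology. Each process declares only its
   own neighbors (a 1-D ring here), so memory per process is O(degree) instead
   of O(total edges) as with the pre-2.2 MPI_Graph_create interface. */
MPI_Comm make_ring_topology(MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int neighbors[2] = { (rank - 1 + nprocs) % nprocs,   /* left neighbor  */
                         (rank + 1) % nprocs };          /* right neighbor */
    MPI_Comm ring;
    MPI_Dist_graph_create_adjacent(comm,
                                   2, neighbors, MPI_UNWEIGHTED,  /* in-edges  */
                                   2, neighbors, MPI_UNWEIGHTED,  /* out-edges */
                                   MPI_INFO_NULL, 0 /* no reorder */, &ring);
    return ring;
}
```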

8 Scalability Issues in the MPI Specification
All-to-all communication
–Not a scalable communication pattern
–Applications may need to consider newer algorithms that do not require all-to-all
Fault tolerance
–Large component counts will result in frequent failures
–Greater resilience is needed from all components of the software stack
–MPI can return error codes, but more support than that is needed
–Being addressed by the fault tolerance working group of MPI-3

9 MPI Implementation Scalability
In terms of scalability, MPI implementations must pay attention to two aspects as the number of processes increases:
–the memory consumption of every function, and
–the performance of all collective functions
Not just the collective communication functions that are commonly optimized
Also functions such as MPI_Init and MPI_Comm_split

10 Process Mappings
MPI communicators maintain a mapping from ranks to processor ids
This mapping is often stored as a table of O(nprocs) size in the communicator
Need to explore more memory-efficient mappings, at least for common cases (one possibility is sketched below)
More systematic approaches to compact representations of permutations remain a research problem
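One possible compact representation, offered here as an assumption rather than the paper's proposal: many communicators derived from MPI_COMM_WORLD map ranks by a simple offset and stride, which needs only O(1) storage, with a full table kept only as a fallback for irregular mappings.

```c
/* Sketch of a compact rank-to-process mapping (illustrative, not any actual
   MPI implementation's data structure). Communicators whose ranks are an
   affine function of the world ranks (offset + stride, e.g. those produced
   by simple MPI_Comm_split color patterns) need only O(1) storage; only
   irregular communicators fall back to an explicit O(nprocs) table. */
typedef struct {
    enum { MAP_AFFINE, MAP_TABLE } kind;
    int offset, stride;   /* used when kind == MAP_AFFINE */
    int *table;           /* used when kind == MAP_TABLE  */
} rank_map;

static inline int rank_to_world(const rank_map *m, int rank)
{
    return (m->kind == MAP_AFFINE) ? m->offset + rank * m->stride
                                   : m->table[rank];
}
```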

11 NEK5000: Communicator Memory Consumption
The NEK5000 code failed on BG/P at large scale because MPI ran out of communicator memory. We fixed the problem by using a fixed buffer pool within MPI and provided a patch to IBM.
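A minimal reproducer of this class of failure, assuming nothing about NEK5000 itself: duplicate a communicator repeatedly and watch for the point at which the implementation can no longer allocate its per-communicator memory.

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: stress-test communicator memory by duplicating MPI_COMM_WORLD
   repeatedly without freeing. On implementations that allocate per-
   communicator buffers for collective optimizations, this eventually fails
   (or the per-process memory footprint grows visibly with each call). */
#define NDUPS 128

int main(int argc, char **argv)
{
    MPI_Comm dups[NDUPS];
    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    for (int i = 0; i < NDUPS; i++) {
        if (MPI_Comm_dup(MPI_COMM_WORLD, &dups[i]) != MPI_SUCCESS) {
            printf("MPI_Comm_dup failed after %d duplicates\n", i);
            break;
        }
    }
    MPI_Finalize();
    return 0;
}
```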

12 MPI Memory Usage on BG/P after 32 Calls to MPI_Comm_dup
Using a buffer pool enables all collective optimizations while taking up only a small amount of memory
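The general shape of such a fixed pool, as our own illustration rather than the actual patch provided to IBM: a small, statically sized set of collective scratch buffers shared among all communicators, with a fallback to unoptimized collectives when the pool is exhausted (locking omitted for brevity).

```c
/* Illustrative fixed buffer pool (not the actual MPICH/BG patch): a small,
   statically sized set of scratch buffers for collective optimizations is
   shared by all communicators, so memory no longer grows with the number
   of MPI_Comm_dup calls. A communicator that cannot obtain a buffer simply
   falls back to the unoptimized collective path. No locking is shown. */
#include <stddef.h>

#define POOL_SLOTS 8
#define SLOT_BYTES (64 * 1024)

static char pool[POOL_SLOTS][SLOT_BYTES];
static int  in_use[POOL_SLOTS];

void *coll_buf_acquire(void)
{
    for (int i = 0; i < POOL_SLOTS; i++)
        if (!in_use[i]) { in_use[i] = 1; return pool[i]; }
    return NULL;              /* pool exhausted: caller uses fallback path */
}

void coll_buf_release(void *buf)
{
    for (int i = 0; i < POOL_SLOTS; i++)
        if (buf == pool[i]) { in_use[i] = 0; return; }
}
```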

13 Scalability of MPI_Init
Cluster with 8 cores per node; TCP/IP across nodes
Setting up all connections at Init time is too expensive at large scale; connections must instead be established on demand, as needed

14 Scalable Algorithms for Collective Communication
MPI implementations typically use
–O(lg p) algorithms for short messages (e.g., binomial tree)
–O(m) algorithms, where m = message size, for large messages
–E.g., bcast implemented as scatter + allgather (sketched below)
O(lg p) algorithms can still be used on a million processors for short messages
However, O(m) algorithms for large messages may not scale, as the message size in the allgather phase can get very small
–E.g., for a 1 MB bcast on a million processes, the allgather phase involves 1-byte messages
Hybrid algorithms that do a logarithmic bcast to a subset of nodes, followed by scatter/allgather, may be needed
Topology-aware, pipelined algorithms may be needed
Use network hardware for broadcast/combine
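For reference, the long-message broadcast mentioned above can be written directly in terms of MPI collectives; a simplified sketch, assuming the buffer size divides evenly by the number of processes and ignoring the short-message binomial-tree path:

```c
#include <mpi.h>

/* Simplified sketch of the classic long-message broadcast: the root scatters
   its buffer, then an allgather reassembles the full buffer everywhere.
   Total data moved per process is O(m), but with p = 1,000,000 and m = 1 MB
   each allgather piece is only 1 byte. Assumes count is divisible by nprocs. */
int bcast_scatter_allgather(char *buf, int count, int root, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);
    int piece = count / nprocs;

    /* Scatter phase: process i ends up holding bytes [i*piece, (i+1)*piece). */
    if (rank == root)
        MPI_Scatter(buf, piece, MPI_BYTE,
                    MPI_IN_PLACE, piece, MPI_BYTE, root, comm);   /* root keeps its piece */
    else
        MPI_Scatter(NULL, piece, MPI_BYTE,
                    buf + rank * piece, piece, MPI_BYTE, root, comm);

    /* Allgather phase: every process collects all pieces into the full buffer. */
    return MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                         buf, piece, MPI_BYTE, comm);
}
```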

15 Enabling Application Scalability
Applications face the challenge of scaling up to large numbers of processors
A basic question is
–Is the parallel algorithm used by the application itself scalable (independent of MPI)?
–That needs to be addressed by the application
Some features of MPI that an application may not currently use could play an important role in enabling it to run effectively on more processors
In many cases, the application code may not require much change

16 Higher-Dimensional Decompositions with MPI
Many applications use 2D or 3D meshes, but the parallel decomposition is only along one dimension
–Results in contiguous buffers for MPI sends and receives
–Simple, but not the most efficient for large numbers of processors
2D or 3D decompositions are more efficient
–Result in noncontiguous communication buffers for sending and receiving edge or face data (see the sketch below)
MPI (or a library) can help by providing functions for assembling MPI datatypes that describe these noncontiguous regions
Efficient support for derived datatypes is needed
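A minimal sketch of describing one such noncontiguous region with a derived datatype: in a 2D decomposition of a row-major local block, the west edge is one element per row, so a strided vector type lets a single send/receive move the whole edge without manual packing (the array sizes and halo layout below are illustrative).

```c
#include <mpi.h>

/* Sketch: with a 2D decomposition, the east/west edges of a local block
   stored in row-major order are noncontiguous (one element per row,
   separated by the row length). A derived datatype lets MPI move the whole
   edge in one call. Column 0 is assumed to be a ghost column and column 1
   the first interior column; sizes are illustrative. */
#define NX 128          /* local rows    */
#define NY 256          /* local columns */

void exchange_west_edge(double block[NX][NY], int west_rank, MPI_Comm comm)
{
    MPI_Datatype column;
    /* NX blocks of 1 double each, NY doubles apart: one column of the block. */
    MPI_Type_vector(NX, 1, NY, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* Send our first interior column west; receive the neighbor's edge
       into our ghost column (tag 0). */
    MPI_Sendrecv(&block[0][1], 1, column, west_rank, 0,
                 &block[0][0], 1, column, west_rank, 0,
                 comm, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
}
```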

17 Enabling Hybrid Programming
A million processors need not mean a million MPI processes
On future machines, as the amount of memory per core decreases, applications may want to use a shared-memory programming model within a multicore node and MPI across nodes
MPI supports this transition by having clear semantics for interoperation with threads
–Four levels of thread safety that can be requested by an application and provided by an implementation
–Works as MPI+OpenMP, MPI+Pthreads, or other approaches (see the sketch below)
The hybrid programming working group in the MPI-3 Forum is exploring further enhancements to support efficient hybrid programming
–See Marc Snir's proposal on "endpoints"
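A minimal MPI+OpenMP sketch of requesting one of the four thread-support levels; MPI_THREAD_FUNNELED (only the main thread makes MPI calls) is assumed here, and the computed sum is just a placeholder for node-local threaded work.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Sketch of the hybrid MPI+OpenMP pattern: one MPI process per node,
   OpenMP threads within the node. MPI_THREAD_FUNNELED (only the main
   thread calls MPI) is assumed; an application whose threads make MPI
   calls concurrently would request MPI_THREAD_MULTIPLE instead. */
int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);   /* implementation too restrictive */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel for reduction(+:local)   /* threaded node-local work */
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (i + 1.0);

    double global;                      /* MPI call from the main thread only */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```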

18 Use of MPI-Based Libraries to Hide Complexity
MPI allows you to build higher-level libraries that provide extremely simple programming models that are both useful and scalable
Example:
–Asynchronous Dynamic Load Balancing Library (ADLB)
–Used in the GFMC (Green's Function Monte Carlo) code in the UNEDF SciDAC project
–GFMC used a nontrivial master-worker model with a single master; it didn't scale beyond 2000 processes on Blue Gene
[Diagram: a single master coordinating workers through a shared work queue]

19 ADLB Library
Provides a scalable distributed work queue, with no single master
Application processes simply put work units into the queue and get work units from it
Implemented on top of MPI, hence portable
Enables the GFMC application to scale beyond 30,000 processes
[Diagram: workers interacting directly with a distributed shared work queue]
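The programming model this describes boils down to two operations. The sketch below is a deliberately hypothetical, single-process stand-in (the wq_* names and work_t struct are our own, not ADLB's actual API); in the real library the queue is distributed across processes on top of MPI.

```c
#include <stdio.h>

/* Hypothetical work-queue interface, named wq_* to avoid implying ADLB's
   real API; it only illustrates the put/get programming model the slide
   describes. A trivial in-process array stands in for the distributed,
   MPI-based queue so the sketch is self-contained. */
typedef struct { int type; char payload[64]; } work_t;

static work_t queue[1024];
static int    head, tail;

static int wq_put(const work_t *unit)          /* add a work unit            */
{
    if (tail >= 1024) return 0;
    queue[tail++] = *unit;
    return 1;
}

static int wq_get(work_t *unit)                /* take a unit; 0 when empty  */
{
    if (head >= tail) return 0;
    *unit = queue[head++];
    return 1;
}

/* A typical worker loop: pull a unit, process it, possibly put new units
   back. With a distributed queue and no single master, every process runs
   this same loop. */
int main(void)
{
    work_t seed = { .type = 0, .payload = "initial task" };
    wq_put(&seed);

    work_t unit;
    while (wq_get(&unit))
        printf("processing type %d: %s\n", unit.type, unit.payload);
    return 0;
}
```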

20 Conclusions
MPI is ready for scaling to a million processors, barring a few issues that can be (and are being) fixed
Nonscalable parts of the MPI standard include the irregular collectives and the virtual graph topology
Systematic approaches to compact, adaptive representations of process groups need to be investigated
MPI implementations must pay careful attention to memory requirements and eliminate data structures whose size grows linearly with the number of processes
For collectives, MPI implementations may need to become more topology aware or rely on global collective acceleration support
MPI's support for building libraries, and its clear semantics for interoperation with threads, enable applications to use other techniques to scale when limited by memory or data size