PMI: A Scalable Process-Management Interface for Extreme-Scale Systems
Pavan Balaji, Darius Buntinas, David Goodell, William Gropp, Jayesh Krishna, Ewing Lusk, and Rajeev Thakur

Introduction
• Process management is an integral part of HPC
• Scalability and performance are critical
• Close interaction between process management and the parallel library (e.g., MPI) is important
 – The two need not be integrated
• Separation allows
 – Independent development and improvement
 – Parallel libraries that are portable to different environments

PMI  Generic process management interface for parallel applications  PMI-1 is widely used –MVAPICH, Intel MPI, Microsoft MPI –SLURM, OSC mpiexec, OSU mpirun  Introducing PMI-2 –Improved scalability –Better interaction with hybrid MPI+threads 3

PMI Functionality
• Process management
 – Launch and monitoring
   · Initial job
   · Dynamic processes
 – Process control
• Information exchange
 – Contact information
 – Environmental attributes

System Model

[System-model diagram: MPI libraries (MPICH2, MVAPICH2, Intel MPI, SCX-MPI, Microsoft MPI) call the PMI API; PMI library implementations (Simple PMI, SMPD PMI, SLURM PMI, BG/L PMI) talk over a wire protocol to process managers (Hydra, MPD, SLURM, SMPD, OSC mpiexec, OSU mpirun); the MPI library also has its own communication subsystem.]

Process Manager
• Handles
 – Process launch
   · Starting and stopping processes
   · Forwarding I/O and signals
 – Information exchange
   · Contact information to set up communication
• Implementation
 – May consist of separate components
 – May be distributed
 – E.g., PBS, Sun Grid Engine, SSH

PMI Library
• Provides the interface between the parallel library and the process manager
• Can be system specific
 – E.g., BG/L uses system-specific features
• Wire protocol between the PMI library and the process manager
 – PMI-1 and PMI-2 have specified wire protocols
 – Allows a PMI library to be used with different process managers
 – Note: the wire protocol and the PMI API are separate entities
   · A PMI implementation need not use the specified wire protocol

PMI API
• PMI-1 and PMI-2
• Functions associated with
 – Initialization and finalization
   · Init, Finalize, Abort
 – Information exchange
   · Put, Get, Fence
 – Process creation
   · Spawn
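For reference, a minimal C sketch of the shape of the PMI-1 side of this API, modeled on the pmi.h header shipped with MPICH; exact prototypes and return-code conventions may differ across PMI implementations, so treat this as illustrative rather than normative.

```c
/* Sketch of the PMI-1 client API as seen by a parallel library
 * (modeled on MPICH's pmi.h; prototypes may differ elsewhere). */

int PMI_Init(int *spawned);             /* connect to the process manager   */
int PMI_Finalize(void);                 /* detach from the process manager  */
int PMI_Abort(int exit_code, const char error_msg[]);

int PMI_Get_rank(int *rank);            /* this process's rank              */
int PMI_Get_size(int *size);            /* number of processes in the job   */

/* Key-value space (KVS) used for information exchange */
int PMI_KVS_Get_my_name(char kvsname[], int length);
int PMI_KVS_Put(const char kvsname[], const char key[], const char value[]);
int PMI_KVS_Commit(const char kvsname[]);  /* make local Puts visible        */
int PMI_Barrier(void);                     /* "Fence": collective sync point */
int PMI_KVS_Get(const char kvsname[], const char key[],
                char value[], int length);

/* plus PMI_Spawn_multiple(...) for dynamic process creation */
```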

Information Exchange
• Processes need to exchange connection information
• PMI uses a key-value database (KVS)
• At initialization, processes Put their contact information
 – E.g., IP address and port
• Processes Get contact information when establishing connections
• A collective Fence operation allows optimizations

Connection Data Exchange Example
• At initialization
 – Proc 0 Puts (key="bc-p0", value="<IP of proc 0>;3893")
 – Proc 1 Puts (key="bc-p1", value="<IP of proc 1>;2897")
 – Proc 0 and Proc 1 call Fence
   · The process manager can collectively distribute the database
• Later, Proc 0 wants to send a message to Proc 1
 – Proc 0 does a Get of key "bc-p1"
   · Receives value "<IP of proc 1>;2897"
 – Proc 0 can now connect to Proc 1
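A minimal C sketch of the same exchange as a parallel library might code it against PMI-1, using the MPICH-style calls sketched earlier; the key names, business-card format, and omitted error handling here are illustrative, not any library's actual code.

```c
#include <stdio.h>
#include "pmi.h"   /* PMI-1 client header as shipped with MPICH */

/* Publish our "business card" and fetch a peer's.  The key names and
 * card format ("<ip>;<port>") are illustrative only. */
int exchange_business_cards(int my_rank, int peer_rank,
                            const char *my_ip, int my_port,
                            char *peer_card, int peer_card_len)
{
    char kvsname[256], key[64], card[256];

    PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));

    /* Put: key = "bc-<rank>", value = "<ip>;<port>" */
    snprintf(key, sizeof(key), "bc-p%d", my_rank);
    snprintf(card, sizeof(card), "%s;%d", my_ip, my_port);
    PMI_KVS_Put(kvsname, key, card);
    PMI_KVS_Commit(kvsname);

    /* Fence: after the barrier, everyone's Puts are visible */
    PMI_Barrier();

    /* Get: look up the peer's card when we need to connect */
    snprintf(key, sizeof(key), "bc-p%d", peer_rank);
    return PMI_KVS_Get(kvsname, key, peer_card, peer_card_len);
}
```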

Implementation Considerations
• Allow use of the "native" process manager with low overhead
 – Systems often have an existing process manager
   · E.g., integrated with the resource manager
 – Minimize asynchronous processing and interrupts
• Scalable data exchange
 – Distributed process manager
 – The collective Fence provides an opportunity for scalable collective exchange

Second-Generation PMI

New PMI-2 Features
• Attribute query functionality
• Database scope
• Thread safety
• Dynamic processes
• Fault tolerance

PMI-2 Attribute Query Functionality
• Process and resource managers have system-specific information
 – Node topology, network topology, etc.
• Without this, processes need to determine it themselves
 – Each process gets every other process's contact information just to discover which processes are local
 – O(p²) queries
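A hedged C sketch of how an MPI library could ask the process manager for such an attribute through PMI-2's job-attribute query; the function name follows MPICH's pmi2.h, and the attribute key "PMI_process_mapping" is the node-topology key MPICH uses — treat both as assumptions for other implementations.

```c
#include <stdio.h>
#include "pmi2.h"   /* PMI-2 client header as shipped with MPICH */

/* Ask the process manager for the node-topology mapping instead of
 * doing O(p^2) KVS lookups. */
int query_process_mapping(void)
{
    char value[1024];
    int found = 0;

    PMI2_Info_GetJobAttr("PMI_process_mapping",
                         value, (int)sizeof(value), &found);
    if (!found)
        return -1;   /* attribute not provided: fall back to pairwise discovery */

    /* Value format is implementation-defined, e.g. a vector mapping string */
    printf("process mapping: %s\n", value);
    return 0;
}
```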

PMI-2 Database Scope
• Previously the KVS had only global scope
• PMI-2 adds node-level scoping
 – E.g., keys for shared-memory segments
• Allows optimized storage and retrieval of values
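A sketch of node-scoped puts and gets for a shared-memory key, with call names modeled on MPICH's pmi2.h; the key and value strings, and the use of the wait-for-publication flag, are illustrative assumptions.

```c
#include "pmi2.h"   /* PMI-2 client header as shipped with MPICH */

/* Node-scoped keys: publish a shared-memory segment name so that only
 * processes on the same node can look it up. */
void publish_shm_segment(const char *segname)
{
    PMI2_Info_PutNodeAttr("shm-segment", segname);
}

int lookup_shm_segment(char *segname, int len)
{
    int found = 0;
    /* Last argument (assumption, following MPICH usage): wait until some
     * process on this node has published the attribute. */
    PMI2_Info_GetNodeAttr("shm-segment", segname, len, &found, 1);
    return found;
}
```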

PMI-2 Thread Safety
• PMI-1 is not thread safe
 – All PMI calls must be serialized
   · Each call waits for its request and response
 – Can affect multithreaded programs
• PMI-2 adds thread safety
 – Multiple threads can call PMI functions
 – One call cannot block the completion of another
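To make the contrast concrete, a sketch (not MPICH's actual locking code) of the external serialization an MPI library has to add around PMI-1 calls; with PMI-2 the PMI library itself is expected to handle concurrent callers.

```c
#include <pthread.h>
#include "pmi.h"    /* PMI-1: not thread safe */

/* With PMI-1, the parallel library must serialize all PMI traffic
 * itself, e.g. with one global lock around every call. */
static pthread_mutex_t pmi_mutex = PTHREAD_MUTEX_INITIALIZER;

static int locked_kvs_get(const char *kvsname, const char *key,
                          char *value, int len)
{
    int rc;
    pthread_mutex_lock(&pmi_mutex);   /* one request/response at a time */
    rc = PMI_KVS_Get(kvsname, key, value, len);
    pthread_mutex_unlock(&pmi_mutex);
    return rc;
}

/* With PMI-2, threads may call the PMI library directly; the
 * implementation (e.g. Hydra's) does its own fine-grained locking so
 * one outstanding call does not block the completion of another. */
```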

PMI-2 Dynamic Processes
• In PMI-1, a separate database is maintained for each MPI_COMM_WORLD (process group)
 – Queries are not allowed across databases
 – Requires out-of-band exchange of database contents
• PMI-2 allows cross-database queries
 – Spawned or connected process groups can query each other's databases
 – Only process-group IDs need to be exchanged
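A hedged sketch of a cross-database lookup keyed by job id, with call names modeled on MPICH's pmi2.h; the source-rank hint, key format, and helper names are illustrative assumptions.

```c
#include <stdio.h>
#include "pmi2.h"   /* PMI-2 client header as shipped with MPICH */

/* Our own job id, exchanged out of band (e.g., during connect/accept)
 * so the other process group can query our database. */
int get_my_jobid(char *jobid, int len)
{
    return PMI2_Job_GetId(jobid, len);
}

/* Cross-database Get: supplying the peer group's job id selects that
 * group's KVS.  The rank hint and "bc-p<rank>" key are illustrative. */
int get_peer_group_card(const char *peer_jobid, int peer_rank,
                        char *value, int maxlen)
{
    char key[64];
    int vallen = 0;

    snprintf(key, sizeof(key), "bc-p%d", peer_rank);
    return PMI2_KVS_Get(peer_jobid, peer_rank /* src hint, assumption */,
                        key, value, maxlen, &vallen);
}
```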

PMI-2 Fault Tolerance
• PMI-1 provides no mechanism for respawning a failed process
 – New processes can be spawned, but they get a new rank and process group
• Respawn is critical for supporting fault tolerance
 – Not just for MPI, but for other programming models as well

Evaluation and Analysis

Evaluation and Analysis
• PMI-2 is implemented in the Hydra process manager
• Evaluation
 – System-information query performance
 – Impact of added PMI functionality over the native process manager
 – Multithreaded performance

System Information Query Performance
• PMI-1 provides no attribute query functionality
 – Processes must discover local processes themselves
 – O(p²) queries
• PMI-2 has a node-topology attribute
• Benchmark (5760 cores on SiCortex)
 – MPI_Init(); MPI_Finalize();
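For contrast, a sketch of the PMI-1-only discovery pattern this benchmark exercises during MPI_Init: every process publishes its hostname and then reads every other process's entry to find node-local peers. Key names are illustrative, not MPICH's exact code.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include "pmi.h"

/* PMI-1-style local-process discovery: every rank Puts its hostname,
 * then Gets all p hostnames -- p lookups per process, O(p^2) PMI
 * requests job-wide. */
int count_local_peers(int rank, int size)
{
    char kvsname[256], key[64], myhost[256], host[256];
    int i, local = 0;

    PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));
    gethostname(myhost, sizeof(myhost));

    snprintf(key, sizeof(key), "host-%d", rank);
    PMI_KVS_Put(kvsname, key, myhost);
    PMI_KVS_Commit(kvsname);
    PMI_Barrier();

    for (i = 0; i < size; i++) {                 /* p Gets per process */
        snprintf(key, sizeof(key), "host-%d", i);
        PMI_KVS_Get(kvsname, key, host, sizeof(host));
        if (strcmp(host, myhost) == 0)
            local++;
    }
    return local;
}
```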

Process Launch (5760-core SiCortex)
[Figures: launch time; PMI request count]

Impact of PMI Functionality
• Systems often have integrated process managers
 – Not all provide PMI functionality
• An efficient PMI implementation must make effective use of the native process manager
 – Minimizing overhead
• Benchmark (1600 cores on SiCortex)
 – NAS benchmarks: Class C BT, EP, and SP
 – Using SLURM (which provides PMI-1)
 – Using Hydra over SLURM (for launch and management) plus a PMI daemon

Runtime Impact of Separate PMI Daemons (1600 cores, SiCortex)
[Figures: absolute performance; percentage variation]

Multithreaded Performance
• PMI-1 is not thread safe
 – External locking is needed
• PMI-2 is thread safe
 – Multiple threads can communicate with the process manager
 – Hydra locks only for internal communication
• Benchmark (8-core x86_64 node)
 – Multiple threads calling MPI_Publish_name(); MPI_Unpublish_name()
 – Total work is fixed as the number of threads increases
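A sketch of that benchmark pattern in C with MPI and pthreads; the service-name format, operation count, and timing harness are illustrative assumptions, not the paper's exact benchmark code.

```c
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Fixed total work (TOTAL_OPS publish/unpublish pairs) split across a
 * varying number of threads; each operation drives the PMI path. */
#define TOTAL_OPS 1000

static int nthreads;

static void *worker(void *arg)
{
    long id = (long)arg;
    char service[64], port[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, port);

    for (int i = 0; i < TOTAL_OPS / nthreads; i++) {
        snprintf(service, sizeof(service), "svc-%ld-%d", id, i);
        MPI_Publish_name(service, MPI_INFO_NULL, port);
        MPI_Unpublish_name(service, MPI_INFO_NULL, port);
    }
    MPI_Close_port(port);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    nthreads = (argc > 1) ? atoi(argv[1]) : 4;
    if (nthreads > 64) nthreads = 64;            /* fits the array below */

    pthread_t t[64];
    double t0 = MPI_Wtime();
    for (long i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
    printf("%d threads: %.3f s\n", nthreads, MPI_Wtime() - t0);

    MPI_Finalize();
    return 0;
}
```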

Multithreaded Performance
[Figure: multithreaded benchmark results]

Conclusion

Conclusion  We presented a generic process management interface PMI  PMI-2: second generation eliminates PMI-1 shortcomings –Scalability issues for multicore systems –Issues for hybrid MPI-with-threads –Fault tolerance  Performance evaluation –PMI-2 allows better implementations over PMI-1 –Low overhead implementation over existing PM 29