Presentation transcript:

Slide 1: Fault Tolerance and MPI - Can They Coexist? Richard L. Graham, Computer Science and Mathematics Division, National Center for Computational Sciences, Oak Ridge National Laboratory (managed by UT-Battelle for the Department of Energy). Graham_OpenMPI_SC08.

Slide 2: Outline
 Problem definition
 General approach for making MPI fault tolerant
 Current status
 Is this all?
Goal: Let MPI survive partial system failure.

Slide 3: Problem definition (diagram: Node 1 with Disk A and Node 2 with Disk B, connected by a network)

Slide 4: Problem definition - a bit more realistic (diagram: Node 1, Node 2, Disk B, and the network)

Slide 5: Failure example - node failure (same diagram, with one node failed)

Slide 6:
 Problem: a component affecting a running MPI job is compromised (hardware or software)
 Question: can the MPI application continue to run correctly?
– Does the job have to abort?
– If not, can the job continue to communicate?
– Can there be a change in the resources available to the job?

Slide 7: The Challenge
 We need approaches to dealing with fault tolerance at scale that will:
– Allow applications to harness full system capabilities
– Work well at scale

Slide 8: Technical Guiding Principles
 End goal: increase application MTBF (mean time between failures)
– Applications must be willing to use the solution
 No one solution fits all; the right approach depends on:
– Hardware characteristics
– Software characteristics
– System complexity
– System resources available for fault recovery
– Performance impact on the application
– Fault characteristics of the application
 The standard should not be constrained by current practice

Slide 9: Why MPI?
 The ubiquitous standard parallel programming model used in scientific computing today
 Minimize disruption of the running application

Slide 10: What role should MPI play in recovery?
 MPI does NOT provide fault tolerance
 MPI should enable the survivability of MPI upon failure

Slide 11: What role should MPI play in recovery? (cont'd)
 MPI provides:
– Communication primitives
– Management of groups of processes
– Access to the file system

Slide 12: What role should MPI play in recovery? (cont'd)
 Therefore, upon failure MPI should (limited by system state):
– Restore the MPI communication infrastructure to a correct and consistent state
– Restore process groups to a well-defined state
– Be able to reconnect to the file system
– Provide hooks related to MPI communications needed by other protocols built on top of MPI, such as:
 Flush the message system
 Quiesce the network
 Send "piggyback" data
 ?

Slide 13: What role should MPI play in recovery? (cont'd) - Layered Approach
 MPI is responsible for making the internal state of MPI consistent and usable by the application
 The application is responsible for restoring application state

Slide 14: CURRENT WORK

Slide 15: Active Members in the Current MPI Forum
Argonne NL, Bull, Cisco, Cray, Fujitsu, HDF Group, HLRS, HP, IBM, INRIA, Indiana U., Intel, Lawrence Berkeley NL, Livermore NL, Los Alamos NL, Mathworks, Microsoft, NCSA/UIUC, NEC, Oak Ridge NL, Ohio State U., Pacific NW NL, Qlogic, Sandia NL, SiCortex, Sun Microsystems, Tokyo Institute of Technology, U. Alabama Birmingham, U. Houston, U. Tennessee Knoxville, U. Tokyo

Slide 16: Motivating Examples

Slide 17: Collecting specific use case scenarios
 Process failure - client/server, with the client a member of an inter-communicator:
– Client process fails
– Server is notified of the failure
– Server disconnects from the client inter-communicator and continues to run (sketched below)
– Client processes are terminated
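A minimal server-side sketch of this use case, using only standard MPI-2 dynamic-process calls (MPI_Open_port, MPI_Comm_accept, MPI_Comm_disconnect). How the server is actually notified of the client failure is exactly what the working group has to define, so that step appears only as a placeholder comment.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char     port[MPI_MAX_PORT_NAME];
    MPI_Comm client;                   /* inter-communicator to the client job */

    MPI_Init(&argc, &argv);

    MPI_Open_port(MPI_INFO_NULL, port);
    printf("server listening on %s\n", port);

    /* Blocks until a client job calls MPI_Comm_connect with this port name. */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client);

    /* ... service client requests over the inter-communicator ...
     * At some point a client process fails and the server is notified;
     * the notification mechanism is what the proposal must specify.         */

    /* Drop the inter-communicator and keep running. */
    MPI_Comm_disconnect(&client);
    MPI_Close_port(port);

    MPI_Finalize();
    return 0;
}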

Slide 18: Collecting specific use case scenarios
 Process failure - client/server, with the client a member of an intra-communicator:
– Client process fails
– Processes communicating with the failed process are notified of the failure
– The application specifies the response to the failure:
 Abort
 Continue with a reduced process count, with the missing process labeled MPI_PROC_NULL in the communicator (see the sketch below)
 Replace the failed process (why not also allow increasing the size of the communicator?)
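A small sketch of the "continue with reduced process count" option. The idea that the library substitutes MPI_PROC_NULL for the failed rank is part of the proposal, not standard MPI; here the substitution is simulated by hand (rank 1 plays the failed process) to show why the surviving ranks' exchange loop would not need to change: point-to-point operations addressed to MPI_PROC_NULL complete immediately as no-ops.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, right, left;
    double halo_out, halo_in = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    halo_out = (double)rank;

    right = (rank + 1) % size;             /* ring neighbors                  */
    left  = (rank - 1 + size) % size;

    /* Simulate the proposal: rank 1 has "failed", so every surviving rank
     * sees its slot as MPI_PROC_NULL and rank 1 itself does nothing.         */
    if (right == 1) right = MPI_PROC_NULL;
    if (left  == 1) left  = MPI_PROC_NULL;

    if (rank != 1) {
        /* Sends/receives to MPI_PROC_NULL return immediately with no effect,
         * so the same halo exchange works with or without the failed rank.   */
        MPI_Sendrecv(&halo_out, 1, MPI_DOUBLE, right, 0,
                     &halo_in,  1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, &status);
        printf("rank %d: received %.0f from left neighbor %d\n",
               rank, halo_in, left);
    }

    MPI_Finalize();
    return 0;
}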

Slide 19: Collecting specific use case scenarios
 Process failure - tightly coupled simulation, with independent layered libraries:
– A process fails
 Example application: POP (ocean simulation code) using a conjugate gradient solver
– The application specifies MPI's response to the failure

Slide 20: Design Details
 Allow for local recovery when global recovery is not needed (scalability)
– Collective communications are global in nature; therefore global recovery is required for continued use of collective communications
 Delay the recovery response as much as possible
 Allow for rapid and fully coordinated recovery

Slide 21: Design Details (cont'd)
 Key component: the communicator
– A functioning communicator is required to continue MPI communications after a failure
 Disposition of active communicators:
– Application specified
– MPI_COMM_WORLD must be functional to continue
 Handling of surviving processes:
– MPI_Comm_rank does not change
– MPI_Comm_size does not change

Slide 22: Design Details (cont'd)
 Handling of failed processes:
– Replace
– Discard (MPI_PROC_NULL)
 Disposition of active communications:
– With the failed process: discard
– With other surviving processes: application defined, on a per-communicator basis

Slide 23: Error Reporting Mechanisms - current status
– Errors are associated with communicators
– By default, errors are returned from the affected MPI calls and are returned synchronously
Example (pseudocode):
Ret1 = MPI_Isend(comm=MPI_COMM_WORLD, dest=3, ..., request=request3)
Link to rank 3 fails
Ret2 = MPI_Isend(comm=MPI_COMM_WORLD, dest=4, ..., request=request4)
Ret3 = MPI_Wait(request=request4)  // success
Ret4 = MPI_Wait(request=request3)  // error returned in Ret4
– The application can ask for more information about the failure
(a C rendering of this example is sketched below)
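A C rendering of the slide's scenario, assuming the error handler on MPI_COMM_WORLD has been switched to MPI_ERRORS_RETURN so errors come back to the caller instead of aborting the job. The failure of the link to rank 3 is hypothetical; on a healthy system both waits simply succeed. The program needs at least five ranks.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int data3 = 3, data4 = 4, ret1, ret2, ret3, ret4, rank;
    MPI_Request req3, req4;
    MPI_Status  st3, st4;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Return errors to the caller instead of aborting the job (the default). */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == 0) {
        ret1 = MPI_Isend(&data3, 1, MPI_INT, 3, 0, MPI_COMM_WORLD, &req3);
        /* ... the link to rank 3 fails here (hypothetical) ...               */
        ret2 = MPI_Isend(&data4, 1, MPI_INT, 4, 0, MPI_COMM_WORLD, &req4);

        ret3 = MPI_Wait(&req4, &st4);   /* success: rank 4 is unaffected      */
        ret4 = MPI_Wait(&req3, &st3);   /* the error surfaces here, in ret4   */

        if (ret4 != MPI_SUCCESS) {      /* ask for more information           */
            char msg[MPI_MAX_ERROR_STRING]; int len;
            MPI_Error_string(ret4, msg, &len);
            fprintf(stderr, "send to rank 3 failed: %s\n", msg);
        }
    } else if (rank == 3 || rank == 4) {
        int buf;
        MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}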

Slide 24:
 May request global event notification
– Several open questions remain here

Slide 25:
 A collective call has been proposed to check on the status of the communicator
 No extra cost is incurred for consistency checks unless they are requested (may be relevant for collective operations)
 Provides the global communicator state just before the call is made

Slide 26: Examples of proposed API modifications - Surviving Processes
 MPI_COMM_IRECOVER(ranks_to_restore, request, return_code)
– IN ranks_to_restore: array of ranks to restore (struct)
– OUT request: request object (handle)
– OUT return_code: return error code (integer)
 MPI_COMM_IRECOVER_COLLECTIVE(ranks_to_restore, request, return_code)
– IN ranks_to_restore: array of ranks to restore (struct)
– OUT request: request object (handle)
– OUT return_code: return error code (integer)
(a hypothetical C usage sketch follows below)
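A hypothetical sketch only: MPI_COMM_IRECOVER is a proposed call from this slide, not part of any MPI standard, so this cannot be compiled against a real MPI library. The C prototype, the rank-descriptor struct, and its field names are guesses inferred from the parameter list above.

#include <mpi.h>

typedef struct {             /* assumed layout of "array of ranks to restore" */
    int       nranks;        /* number of entries below                       */
    MPI_Comm *comms;         /* communicator of each failed rank              */
    int      *ranks;         /* rank to restore within that communicator      */
} MPI_Ranks_to_restore;      /* name and fields are illustrative guesses      */

/* Assumed C prototype, inferred from the slide; the error code is returned
 * as the function's return value, as usual for MPI in C.                    */
int MPI_Comm_irecover(MPI_Ranks_to_restore *ranks_to_restore,
                      MPI_Request *request);

/* A surviving process asks MPI to restore one failed rank, overlaps the
 * recovery with local cleanup, then completes it with a normal MPI_Wait.    */
void recover_failed_rank(MPI_Comm comm, int failed_rank)
{
    MPI_Comm comms[1] = { comm };
    int      ranks[1] = { failed_rank };
    MPI_Ranks_to_restore lost = { 1, comms, ranks };
    MPI_Request req;
    MPI_Status  status;

    MPI_Comm_irecover(&lost, &req);      /* start nonblocking recovery        */

    /* ... roll back / repair local application state here ...                */

    MPI_Wait(&req, &status);             /* recovery complete (or failed)     */
}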

Slide 27: Examples of proposed API modifications - Restored Processes
 MPI_RESTORED_PROCESS(generation, return_code)
– OUT generation: process generation (integer)
– OUT return_code: return error code (integer)
 MPI_GET_LOST_COMMUNICATORS(comm_names, count, return_code)
– OUT comm_names: array of communicators that may be restored (strings)
– OUT count: number of communicators that may be restored (integer)
– OUT return_code: return error code (integer)

Slide 28: Examples of proposed API modifications - Restored Processes (cont'd)
 MPI_COMM_REJOIN(comm_names, comm, return_code)
– IN comm_names: communicator name (string)
– OUT comm: communicator (handle)
– OUT return_code: return error code (integer)
(a hypothetical C sketch of the restored-process sequence follows below)
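A hypothetical sketch of how a restarted replacement process might use the proposed calls from slides 27 and 28 together. None of these functions exist in any MPI standard; the C prototypes are inferred from the parameter lists, and the buffer sizes are invented placeholders.

#include <mpi.h>
#include <stdio.h>

#define MAX_LOST_COMMS 16    /* placeholder limits; not given by the proposal */
#define MAX_COMM_NAME  64

/* Assumed C prototypes, inferred from the slides. */
int MPI_Restored_process(int *generation);
int MPI_Get_lost_communicators(char comm_names[][MAX_COMM_NAME], int *count);
int MPI_Comm_rejoin(const char *comm_name, MPI_Comm *comm);

/* Called by a process that was restarted to replace a failed rank:
 * find out which communicators it lost and rejoin each of them by name.     */
void rejoin_lost_communicators(void)
{
    char comm_names[MAX_LOST_COMMS][MAX_COMM_NAME];
    int  generation, count, i;
    MPI_Comm comm;

    MPI_Restored_process(&generation);          /* which restart is this?     */
    MPI_Get_lost_communicators(comm_names, &count);

    for (i = 0; i < count; i++) {
        MPI_Comm_rejoin(comm_names[i], &comm);  /* get a usable handle back   */
        printf("generation %d rejoined %s\n", generation, comm_names[i]);
        /* Application-level state tied to this communicator would be
         * restored here (the layered approach from slide 13).               */
    }
}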

Slide 29: Open Questions
 How heavy-handed do we need to be at the standard-specification level to recover from a failure in the middle of a collective operation? Is this more than an implementation issue? (Performance is the fly in the ointment.)
 What is the impact of "repairing" a communicator on the implementation of collective algorithms (do we have to pay the cost all the time?)

Slide 30: Is this All?
 Other aspects of fault tolerance:
– Network
– Checkpoint/restart
– File I/O
 End-to-end solution

Slide 31: For involvement in the process, see: meetings.mpi-forum.org

Slide 32: Backup slides