MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
Thomas Hérault, joint work with A. Bouteiller, F. Cappello, G. Krawezik, P. Lemarinier, F. Magniette
Parallelism team, Grand Large Project
SC 2003

MPICH-V2
- Computing nodes of clusters are subject to failure.
- Many applications use MPI as their communication library, hence the need to design a fault-tolerant MPI library.
- MPICH-V1 is a fault-tolerant MPI implementation, but it requires many stable components to provide high performance.
- MPICH-V2 removes this requirement and provides higher performance.

Outline: Introduction, Architecture, Performance, Perspectives & Conclusion

Large Scale Parallel and Distributed Systems and Node Volatility
Industry and academia are building larger and larger computing facilities for technical computing (research and production). Platforms with 1000s of nodes are becoming common: Tera Scale machines (US ASCI, French Tera), large scale clusters (Score III, etc.), Grids, and PC Grids (XtremWeb, Entropia, UD, BOINC).
These large scale systems have frequent failures/disconnections:
- The ASCI-Q full-system MTBF is estimated (analytically) at a few hours (Petrini, LANL); a 5-hour job on 4096 processors has less than a 50% chance of terminating.
- PC Grid nodes are volatile: disconnections/interruptions are expected to be very frequent (several per hour).
When failures/disconnections cannot be avoided, they become a characteristic of the system called volatility. We need a volatility-tolerant message passing library.

Goal: execute existing or new MPI applications
Programmer's view unchanged: one PC client calls MPI_Send(), the other calls MPI_Recv().
Objective summary:
1) Automatic fault tolerance
2) Transparency for the programmer & user
3) Tolerate n faults (n being the number of MPI processes)
4) Scalable infrastructure/protocols
5) Avoid global synchronizations (ckpt/restart)
6) Theoretical verification of protocols
Problems:
1) Volatile nodes (any number, at any time)
2) Non-named receptions (they must be replayed in the same order as in the previous, failed execution)
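For illustration (this program is not from the slides), an ordinary MPI 1.1 program such as the following needs no source change: it is only recompiled and relinked against the fault-tolerant library (libmpichv, the library named later in the deck) instead of a standard MPICH build.

```c
/* ping.c: a plain, unmodified MPI program. Under MPICH-V2 the same
 * source is recompiled/relinked against the fault-tolerant library;
 * the calls below keep their usual semantics. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* The library transparently keeps a copy of the payload
         * (sender-based logging); the call itself is unchanged. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The reception event is logged before the process can emit
         * new messages, but the programmer still sees a plain MPI_Recv. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```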

Related works
A classification of fault tolerant message passing environments, considering A) the level in the software stack where fault tolerance is managed (framework, API, communication library) and B) the fault tolerance technique (checkpoint based vs. log based with optimistic, causal or pessimistic logging; automatic vs. non automatic):
- Cocheck, independent of MPI [Ste96]
- Starfish, enrichment of MPI [AF99]
- Clip, semi-transparent checkpoint [CLP97]
- Optimistic recovery in distributed systems, n faults with coherent checkpoint [SY85]
- Sender Based Message Logging, 1 fault, sender based [JZ87]
- Pruitt 98, 2 faults, sender based [PRU98]
- Manetho, n faults [EZ92]
- Egida [RAV99]
- MPI/FT, redundancy of tasks [BNC01]
- MPI-FT, n faults, centralized server [LNLE00]
- FT-MPI, modification of MPI routines, user fault treatment [FD00] (non automatic)
- MPICH-V2, n faults, distributed logging (pessimistic log)
- Coordinated checkpoint (Chandy/Lamport)

Checkpoint techniques
- Coordinated checkpoint (Chandy/Lamport): the objective is to checkpoint the application when there are no in-transit messages between any two nodes. This requires a global synchronization and a network flush, so it does not scale. On a failure, detection triggers a global stop and every node restarts from its checkpoint.
- Uncoordinated checkpoint: no global synchronization (scalable); nodes may checkpoint at any time, independently of the others. On a failure only the failed node restarts, but non-deterministic events (in-transit messages) must be logged.

Outline: Introduction, Architecture, Performance, Perspectives & Conclusion

MPICH-V1 (figure): computing nodes connected through the network to a Dispatcher, Channel Memories and Checkpoint Servers; every message transits through a Channel Memory (Put on send, Get on receive).

MPICH-V2 protocol
Definition 3 (Pessimistic logging protocol). Let P be a communication protocol, and E an execution of P with at most f concurrent failures. Let M_C denote the set of messages transmitted between the initial configuration and configuration C of E. P is a pessimistic message logging protocol if and only if ∀C ∈ E, ∀m ∈ M_C, (|Depend_C(m)| > 1) ⇒ Re-Executable(m).
MPICH-V2 uses a new protocol (never published before) based on 1) splitting message logging and event logging, 2) sender based message logging, and 3) a pessimistic approach (reliable event logger).
Theorem 2. The protocol of MPICH-V2 is a pessimistic message logging protocol.
Key points of the proof:
A. Every non-deterministic event has its logical clock logged on reliable media.
B. Every message reception logged on reliable media is re-executable: the message payload is saved on the sender, and the sender will produce the message again with the same unique logical clock.
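To make the split between payload logging and event logging concrete, here is a self-contained toy sketch in C (not the MPICH-V2 source; all names and data structures are invented): the sender keeps a copy of every payload it emits, and the receiver logs the reception event on a simulated reliable event logger before the message can influence anything else.

```c
/* Toy illustration (not MPICH-V2's code) of the split between
 * sender-based payload logging and pessimistic event logging.
 * Everything runs in one process; the "network" and the "event logger"
 * are plain arrays, only the ordering of the steps matters here. */
#include <stdio.h>
#include <string.h>

#define MAX_LOG 16

/* Sender side: payload copies kept in the sender's memory for replay. */
struct payload { int dest; char data[32]; };
static struct payload sender_log[MAX_LOG];
static int sender_log_size = 0;

/* Reliable event logger: one (source, logical clock) entry per reception. */
struct event { int src; long clock; };
static struct event event_log[MAX_LOG];
static int event_log_size = 0;
static long next_clock = 0;

/* Receiver: the reception event is logged *before* delivery; that
 * synchronous step is the pessimistic part of the protocol. */
static void receive(int src, const char *data)
{
    long clock = ++next_clock;
    event_log[event_log_size++] = (struct event){ src, clock };      /* 1. log event */
    printf("delivered \"%s\" from %d, clock %ld\n", data, src, clock); /* 2. deliver  */
}

/* Sender: keep a copy of the payload, then send it across the "network". */
static void send_msg(int src, int dest, const char *data)
{
    struct payload *p = &sender_log[sender_log_size++];
    p->dest = dest;
    strncpy(p->data, data, sizeof p->data - 1);
    p->data[sizeof p->data - 1] = '\0';
    receive(src, p->data);
}

int main(void)
{
    send_msg(0, 1, "hello");
    send_msg(0, 1, "world");
    /* After a crash of the receiver, the event log gives the order and the
     * clocks to reproduce; the sender log provides the payloads to re-send. */
    printf("%d events logged, %d payloads kept by the sender\n",
           event_log_size, sender_log_size);
    return 0;
}
```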

Message logger and event logger (figure): space-time diagram with processes q, p and r and the event logger for p. After p crashes and restarts, the re-execution phase replays the messages p had received, such as m, identified by its id and logical clock (id, l), in their original order.

Computing node (figure): the MPI process and the V2 daemon handle Send/Receive. For each reception, the reception event is sent to the Event Logger and acknowledged, while the payload is kept on the sender. CSAC produces a checkpoint image that is sent to the Checkpoint Server, under the control of the checkpoint control node.

Impact of uncoordinated checkpoint + sender based message logging (figure: event logger EL, processes P0 and P1, P1's message logger ML, checkpoint server CS): because message payloads are logged on the computing nodes, the message loggers must be checkpointed together with the processes. A garbage collector is therefore required to reduce the message-logger part of the checkpoint size.

Garbage collection (figure: event logger EL, processes P0 and P1, P1's message logger ML, checkpoint server CS): the completion of the receiver's checkpoint triggers the garbage collector on the senders; the logged messages already covered by that checkpoint image (messages 1 and 2 in the figure) can be deleted.
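A minimal sketch of that rule (illustrative only, not MPICH-V2 code; the function and structure names are made up): once a receiver reports the logical clock covered by its completed checkpoint, a sender can drop every logged payload addressed to that receiver with a smaller or equal clock.

```c
/* Toy garbage collector for a sender-based message log (illustrative). */
#include <stdio.h>

struct logged { int dest; long clock; };   /* payload omitted for brevity */

/* Keep only entries that are NOT covered by the receiver's checkpoint:
 * an entry for 'dest' with clock <= covered_clock can never be needed
 * again, because the receiver will never roll back before that point. */
int collect(struct logged *log, int n, int dest, long covered_clock)
{
    int kept = 0;
    for (int i = 0; i < n; i++)
        if (log[i].dest != dest || log[i].clock > covered_clock)
            log[kept++] = log[i];
    return kept;                           /* new log size */
}

int main(void)
{
    struct logged log[] = { {1, 1}, {1, 2}, {2, 1}, {1, 3} };
    /* P1 reports that its checkpoint covers receptions up to clock 2. */
    int n = collect(log, 4, 1, 2);
    printf("%d entries remain after garbage collection\n", n);  /* prints 2 */
    return 0;
}
```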

Scheduling checkpoints (figure)
- Uncoordinated checkpointing leads to logging in-transit messages, and scheduling all checkpoints simultaneously would create bursts in the network traffic.
- Checkpoint size can be reduced by removing message logs, as coordinated checkpointing (Chandy/Lamport) does, but that requires a global synchronization.
- Instead, checkpoint traffic should be flattened, and checkpoint scheduling should evaluate the cost and benefit of each checkpoint: favour checkpoints that let senders garbage collect logged messages (1 and 2, then 1, 2 and 3 in the figure), and skip checkpoints when no message has been exchanged.
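The slide does not state the actual scheduling policy, so the following is only an illustrative cost/benefit heuristic in the spirit of that remark; every name, field and formula here is an assumption.

```c
/* Illustrative checkpoint-scheduling heuristic (not MPICH-V2's policy):
 * checkpoint the node whose expected benefit (re-execution time avoided
 * plus message log that can be garbage collected) most exceeds the cost
 * of transferring its checkpoint image over the network. */
#include <stdio.h>

struct node_state {
    double seconds_since_ckpt;   /* work that would be lost on a crash   */
    double log_bytes;            /* sender-based log accumulated         */
    double image_bytes;          /* size of a new checkpoint image       */
};

/* Return the index of the node whose checkpoint is most worthwhile,
 * or -1 if no checkpoint is worth taking right now. */
int pick_node_to_checkpoint(const struct node_state *n, int count,
                            double net_bytes_per_sec)
{
    int best = -1;
    double best_score = 0.0;
    for (int i = 0; i < count; i++) {
        double cost    = n[i].image_bytes / net_bytes_per_sec;  /* transfer time        */
        double benefit = n[i].seconds_since_ckpt                /* re-execution avoided */
                       + n[i].log_bytes / net_bytes_per_sec;    /* log freed on senders */
        if (benefit - cost > best_score) {
            best_score = benefit - cost;
            best = i;
        }
    }
    return best;
}

int main(void)
{
    struct node_state nodes[2] = {
        { 10.0,   1e6, 200e6 },   /* little to gain, large image          */
        { 300.0, 50e6, 200e6 },   /* long since last ckpt, large log      */
    };
    printf("checkpoint node %d next\n",
           pick_node_to_checkpoint(nodes, 2, 12.5e6 /* about 100 Mb/s */));
    return 0;
}
```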

Node (volatile): checkpointing
- User-level checkpoint: Condor Stand Alone Checkpointing (CSAC), used by libmpichv.
- Clone checkpointing + non-blocking checkpoint: on a checkpoint order, (1) fork, (2) terminate ongoing communications, (3) close sockets, (4) call ckpt_and_exit().
- The checkpoint image is sent to the checkpoint server on the fly (it is not stored locally).
- On restart, execution resumes through CSAC just after step (4), reopens the sockets and returns.
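The clone-checkpointing idea can be sketched as follows (a simplified illustration, not the CSAC/libmpichv code): the parent keeps computing while the forked clone, which holds a copy-on-write snapshot of the address space, produces the checkpoint image; here the clone just writes a toy state file instead of streaming a real process image to the checkpoint server.

```c
/* Simplified sketch of clone (fork-based) non-blocking checkpointing.
 * Not the CSAC/libmpichv implementation: the clone writes a toy "state"
 * to a local file instead of streaming a process image to the server. */
#include <stdio.h>
#include <unistd.h>

static long iterations_done = 0;          /* application state to save */

static void checkpoint_and_exit(void)
{
    FILE *f = fopen("ckpt.img", "w");     /* stand-in for the ckpt server */
    if (f) {
        fprintf(f, "iterations_done=%ld\n", iterations_done);
        fclose(f);
    }
    _exit(0);                             /* the clone never returns */
}

static void take_checkpoint(void)
{
    pid_t pid = fork();                   /* (1) clone the process */
    if (pid == 0) {
        /* Child: sees a copy-on-write snapshot of the parent's memory.
         * In MPICH-V2 it would terminate ongoing communications (2),
         * close sockets (3) and stream the image to the server (4). */
        checkpoint_and_exit();
    }
    /* Parent: continues computing immediately (non-blocking checkpoint). */
}

int main(void)
{
    for (long i = 0; i < 1000000; i++) {
        iterations_done = i;
        if (i == 500000)
            take_checkpoint();            /* checkpoint order arrives */
    }
    printf("done after %ld iterations\n", iterations_done);
    return 0;
}
```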

Library: MPICH binding
Based on MPICH: MPI_Send goes through the abstract device interface (MPID_SendControl, MPID_SendChannel), the Channel Interface and the Chameleon Interface down to the V2 device interface.
- A new device: the 'ch_v2' device.
- All ch_v2 device functions are blocking communication functions built over the TCP layer: _v2bsend (blocking send), _v2brecv (blocking receive), _v2probe (check for any message available), _v2from (get the source of the last message), _v2Init (initialize the client), _v2Finalize (finalize the client).
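As an illustration of what a blocking device primitive built over TCP involves (this is not the real _v2bsend, whose exact signature is not shown on the slide), a blocking send must loop until the whole buffer has been written to the socket, since write() may transmit only part of it:

```c
/* Illustrative blocking send over a TCP socket (not the actual _v2bsend). */
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Returns 0 on success, -1 on error (e.g. the peer failed). */
int blocking_tcp_send(int sockfd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(sockfd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;            /* interrupted by a signal: retry */
            return -1;               /* real error                     */
        }
        p   += n;
        len -= (size_t)n;            /* block until everything is out  */
    }
    return 0;
}
```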

Outline: Introduction, Architecture, Performance, Perspectives & Conclusion

Performance evaluation
Cluster: Athlon CPUs, 1 GB RAM, IDE disc, plus 16 dual Pentium III (500 MHz, 512 MB, IDE disc), connected by a 48-port 100 Mb/s Ethernet switch. Linux, GCC 2.96 (-O3), PGI Fortran <5 (-O3, -tp=athlonxp).
A single reliable node runs the Checkpoint Server + Event Logger + Checkpoint Scheduler + Dispatcher; the computing nodes reach it through the network (figure).

Bandwidth and latency
Latency for a 0-byte MPI message: MPICH-P4 (77 µs), MPICH-V1 (154 µs), MPICH-V2 (277 µs).
MPICH-V2's latency is high because of event logging: a receiving process may send a new message only once the reception event has been successfully logged, so a communication costs 3 TCP messages. Bandwidth remains high because event messages are short.

NAS Benchmarks, Class A and B (figure): Megaflops per benchmark; annotations point at latency effects and at memory capacity limits (logging on disc).

Breakdown of the execution time (figure).

Faulty execution performance (figure): with one fault every 45 seconds, the execution time increases by 190 s (+80%).

Outline: Introduction, Architecture, Performance, Perspectives & Conclusion

Perspectives
- Compare to coordinated techniques: find the threshold of fault frequency below which logging techniques are more valuable (MPICH-V/CL, Cluster 2003).
- Hierarchical logging for Grids, tolerating node failures & cluster failures (MPICH-V3, SC 2003 poster session).
- Address the latency of MPICH-V2: use causal logging techniques?

Conclusion
- MPICH-V2 is a completely new protocol replacing MPICH-V1 and removing the channel memories; the new protocol is pessimistic and sender based.
- MPICH-V2 reaches a ping-pong bandwidth close to that of MPICH-P4. It cannot compete with MPICH-P4 on latency, but for applications with large messages its performance is close to P4's.
- In addition, MPICH-V2 withstands up to one fault every 45 seconds.
- Main conclusion: MPICH-V2 requires far fewer stable nodes than MPICH-V1, with better performance.
Come and see the MPICH-V demos at the INRIA booth (3315).

Re-execution performance (1) (figure): time to re-execute a token ring on 8 nodes after a crash, according to the token size and the number of restarted nodes.

Re-execution performance (2) (figure).

Logging techniques
After a checkpoint and a crash, the replayed execution of the failed process starts from its last checkpoint; the system must provide the messages to be replayed and discard the re-emissions.
Main problems:
- Discarding re-emissions (technical).
- Ensuring that messages are replayed in a consistent order.
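A toy sketch of these two problems (illustrative, not the MPICH-V2 implementation; the ids, arrays and names are invented): during replay, a message already covered by the checkpoint is dropped as a re-emission, and the order recorded by the event logger dictates which message may be delivered next.

```c
/* Toy replay filter (illustrative): discards re-emissions and enforces
 * the reception order recorded by the event logger before the crash. */
#include <stdbool.h>
#include <stdio.h>

/* Message ids already delivered before the checkpoint: re-emissions. */
static const int already_delivered[] = { 4, 5 };
/* Reception order recorded by the event logger for the lost interval. */
static const int logged_order[] = { 7, 3, 9 };
static int next_to_replay = 0;

static bool is_re_emission(int id)
{
    for (unsigned i = 0; i < sizeof already_delivered / sizeof *already_delivered; i++)
        if (already_delivered[i] == id)
            return true;
    return false;
}

/* Decide what to do with a message arriving during the replay phase. */
const char *on_incoming(int id)
{
    if (is_re_emission(id))
        return "drop (re-emission already covered by the checkpoint)";
    if (next_to_replay < (int)(sizeof logged_order / sizeof *logged_order)
        && id == logged_order[next_to_replay]) {
        next_to_replay++;
        return "deliver (matches the logged order)";
    }
    return "buffer (must wait for its logged turn)";
}

int main(void)
{
    int incoming[] = { 5, 3, 7, 3, 9 };   /* 5 is a re-emission, 3 arrives early */
    for (int i = 0; i < 5; i++)
        printf("msg %d -> %s\n", incoming[i], on_incoming(incoming[i]));
    return 0;
}
```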

Large Scale Parallel and Distributed Systems and programming
Many HPC applications use the message passing paradigm, and message passing means MPI. We therefore need a volatility-tolerant Message Passing Interface implementation, based on MPICH, which implements the MPI 1.1 standard.

Checkpoint Server (stable) (figure)
A multiprocess server: one process polls, treats events and dispatches the jobs to the other processes. Open sockets: one per attached node and one per home Channel Memory of the attached nodes. Incoming messages carry Put-checkpoint transactions, outgoing messages carry Get-checkpoint transactions and control. Checkpoint images are stored on a reliable medium (disc), one file per node, with the file name given by the node.
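A minimal sketch of the "one file per node, name given by the node" storage rule (illustrative, not the checkpoint server code; the function name and file naming scheme are assumptions):

```c
/* Illustrative storage rule for a checkpoint server: each Put transaction
 * streams chunks of a node's image into a file named after that node. */
#include <stdio.h>
#include <string.h>

/* Store one chunk of the named node's image; returns 0 on success. */
int store_ckpt_chunk(const char *node_name, const void *data, size_t len,
                     int first_chunk)
{
    char path[256];
    snprintf(path, sizeof path, "ckpt_%s.img", node_name);  /* one file per node   */
    FILE *f = fopen(path, first_chunk ? "wb" : "ab");       /* truncate, then append */
    if (!f)
        return -1;
    size_t written = fwrite(data, 1, len, f);
    fclose(f);
    return written == len ? 0 : -1;
}

int main(void)
{
    const char part1[] = "header...", part2[] = "...payload";
    store_ckpt_chunk("node17", part1, strlen(part1), 1);
    store_ckpt_chunk("node17", part2, strlen(part2), 0);
    return 0;
}
```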

NAS Benchmarks, Class A and B (figure): latency and memory capacity (logging on disc) annotations.