MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
Thomas Hérault, Grand Large Project, Parallelism team
Joint work with A. Bouteiller, F. Cappello, G. Krawezik, P. Lemarinier, F. Magniette
SC 2003
MPICH-V2
Computing nodes of clusters are subject to failures, and many applications use MPI as their communication library
– hence the need to design a fault-tolerant MPI library
MPICH-V1 is a fault-tolerant MPI implementation
– but it requires many stable components to provide high performance
MPICH-V2 addresses this requirement
– and provides higher performance
Outline: Introduction – Architecture – Performance – Perspective & Conclusion
Large Scale Parallel and Distributed Systems and node Volatility
Industry and academia are building larger and larger computing facilities for technical computing (research and production). Platforms with 1000s of nodes are becoming common: Tera-scale machines (US ASCI, French Tera), large-scale clusters (Score III, etc.), Grids, PC-Grids (XtremWeb, Entropia, UD, BOINC).
These large-scale systems have frequent failures/disconnections:
– the ASCI-Q full-system MTBF is estimated (analytically) at a few hours (Petrini, LANL); a 5-hour job on 4096 processors has less than a 50% chance of terminating,
– PC-Grid nodes are volatile: disconnections/interruptions are expected to be very frequent (several per hour).
When failures/disconnections cannot be avoided, they become a characteristic of the system, called volatility.
We need a volatility-tolerant message passing library.
Goal: execute existing or new MPI applications
Programmer's view unchanged: one PC client calls MPI_send(), another calls MPI_recv().
Objective summary:
1) automatic fault tolerance,
2) transparency for the programmer & user,
3) tolerate n faults (n being the number of MPI processes),
4) scalable infrastructure/protocols,
5) avoid global synchronizations (ckpt/restart),
6) theoretical verification of protocols.
Problems:
1) volatile nodes (any number, at any time),
2) non-named receptions (they should be replayed in the same order as in the previous, failed execution).
Related works
A classification of fault tolerant message passing environments, considering (A) the level in the software stack where fault tolerance is managed (framework, API, communication library) and (B) the fault tolerance technique:
– Checkpoint based, coordinated checkpoint: Cocheck, independent of MPI [Ste96]; Starfish, enrichment of MPI [AF99]; Clip, semi-transparent checkpoint [CLP97]
– Log based, optimistic log (sender based): optimistic recovery in distributed systems, n faults with coherent checkpoint [SY85]; Sender Based Message Logging, 1 fault [JZ87]; Pruitt 98, 2 faults sender based [PRU98]
– Log based, causal log: Manetho, n faults [EZ92]; Egida [RAV99]
– Log based, pessimistic log: MPI-FT, n faults, centralized server [LNLE00]; MPICH-V2, n faults, distributed logging
– Non automatic approaches: MPI/FT, redundance of tasks [BNC01]; FT-MPI, modification of MPI routines with user fault treatment [FD00]
Checkpoint techniques
Coordinated checkpoint (Chandy/Lamport): the objective is to checkpoint the application when there is no in-transit message between any two nodes; this requires a global synchronization (network flush) and is not scalable. On failure detection, all nodes are stopped and restarted from their checkpoints.
Uncoordinated checkpoint: no global synchronization (scalable); nodes may checkpoint at any time (independently of the others), and on failure detection only the failed node is restarted. It requires logging non-deterministic events, in particular in-transit messages.
Outline: Introduction – Architecture – Performance – Perspective & Conclusion
MPICH-V1 architecture (figure): computing nodes connected through the network to a Dispatcher, Channel Memories and Checkpoint Servers; every communication transits through a Channel Memory (the sender Puts the message, the receiver Gets it).
MPICH-V2 protocol
A new protocol (not published before), based on:
1) splitting message logging and event logging,
2) sender-based message logging,
3) a pessimistic approach (reliable event logger).
Definition 3 (Pessimistic logging protocol). Let P be a communication protocol and E an execution of P with at most f concurrent failures. Let M_C denote the set of messages transmitted between the initial configuration and the configuration C of E. P is a pessimistic message logging protocol if and only if
∀C ∈ E, ∀m ∈ M_C, (|Depend_C(m)| > 1) ⇒ Re-Executable(m)
Theorem 2. The protocol of MPICH-V2 is a pessimistic message logging protocol.
Key points of the proof:
A. every non-deterministic event has its logical clock logged on reliable media;
B. every message reception logged on reliable media is re-executable: the message payload is saved on the sender, and the sender will produce the message again and associate the same unique logical clock.
(A minimal sketch of the two halves of the protocol is given below.)
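The following is a minimal, illustrative sketch of the idea (in C; this is not the actual MPICH-V2 source, and all names and data structures are invented for the example): the sender keeps every payload in a local message log, and the receiver logs the reception event (message id, delivery logical clock) on the reliable event logger before acting on the message.

    /* Pessimistic sender-based message logging, toy version.
     * Sender side: keep the payload so the message can be produced again.
     * Receiver side: log (id, logical clock) synchronously before delivery. */
    #include <stdio.h>
    #include <string.h>

    #define LOG_MAX 128

    typedef struct { int msg_id; char payload[64]; } sender_log_entry;
    typedef struct { int msg_id; int logical_clock; } reception_event;

    static sender_log_entry sender_log[LOG_MAX];  /* kept on the (volatile) sender */
    static int sender_log_len = 0;

    static reception_event event_log[LOG_MAX];    /* stands for the reliable event logger */
    static int event_log_len = 0;
    static int clock_now = 0;

    /* Sender: log the payload locally, then transmit (transmission omitted). */
    static void v2_send(int msg_id, const char *payload) {
        sender_log[sender_log_len].msg_id = msg_id;
        strncpy(sender_log[sender_log_len].payload, payload, 63);
        sender_log[sender_log_len].payload[63] = '\0';
        sender_log_len++;
    }

    /* Receiver: the reception event is logged *before* the message is delivered. */
    static void v2_deliver(int msg_id, const char *payload) {
        event_log[event_log_len].msg_id = msg_id;
        event_log[event_log_len].logical_clock = ++clock_now;
        event_log_len++;   /* in the real protocol this is an acknowledged exchange */
        printf("delivered msg %d (clock %d): %s\n", msg_id, clock_now, payload);
    }

    int main(void) {
        v2_send(1, "hello");
        v2_deliver(1, "hello");
        /* After a crash of the receiver, the event log gives the delivery order and
         * the sender log supplies the payload, so the reception is re-executable. */
        return 0;
    }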
Message logger and event logger (figure): processes p, q and r exchange messages A, B, C, D; each reception on p is reported as (id, logical clock l) to the event logger for p. When p crashes and restarts, the re-execution phase replays the receptions in the logged order from the senders' message logs.
Computing node (figure): the MPI process is coupled with a V2 daemon that performs the actual Send/Receive operations; each reception event is sent to the Event Logger and acknowledged, the payload is kept on the sender, and the checkpoint image produced by CSAC is sent to the Ckpt Server under the control of the Ckpt Control Node.
Impact of uncoordinated checkpoint + sender-based message logging (figure: P0, P1, P1's message logger ML, the event logger EL and the checkpoint server CS; messages 1 and 2 are logged in P1's ML when the checkpoint is taken)
– Obligation to checkpoint the message loggers on the computing nodes: the logged payloads must survive a crash of the sender.
– A garbage collector is required to reduce the ML part of the checkpoint size.
Garbage collection (figure: P0, P1, P1's ML and the checkpoint server CS)
The completion of the receiver's checkpoint triggers the garbage collector of the senders: once P0's checkpoint image covering the deliveries of messages 1 and 2 is stored on CS, messages 1 and 2 can be deleted from P1's ML, while message 3 is kept. (A sketch of this rule is given below.)
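A toy illustration of the garbage-collection rule (assumption: the rule is expressed here with delivery logical clocks, and all names are invented for the example): once the receiver's checkpoint covers deliveries up to clock C, the sender may discard every logged payload delivered at or before C.

    /* Garbage collection of the sender-based message log, toy version. */
    #include <stdio.h>

    typedef struct { int msg_id; int delivery_clock; int live; } logged_msg;

    /* ckpt_clock: highest delivery clock covered by the receiver's completed checkpoint. */
    static void garbage_collect(logged_msg *ml, int n, int ckpt_clock) {
        for (int i = 0; i < n; i++)
            if (ml[i].live && ml[i].delivery_clock <= ckpt_clock) {
                ml[i].live = 0;                      /* payload no longer needed for replay */
                printf("freed payload of message %d\n", ml[i].msg_id);
            }
    }

    int main(void) {
        logged_msg ml[] = { {1, 1, 1}, {2, 2, 1}, {3, 5, 1} };
        garbage_collect(ml, 3, 2);   /* receiver checkpointed after delivering clock 2 */
        return 0;                    /* messages 1 and 2 are freed, message 3 is kept */
    }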
Scheduling checkpoints (figure: P0, P1, their message loggers and the checkpoint server CS)
Uncoordinated checkpointing leads to logging in-transit messages; the checkpoint size can be reduced by removing message logs first (once the garbage collector has deleted messages 1, 2 and 3, no message log needs to be checkpointed).
Scheduling all checkpoints simultaneously would lead to bursts in the network traffic, while coordinated checkpoint (Chandy/Lamport) requires a global synchronization.
Hence the checkpoint traffic should be flattened, and checkpoint scheduling should evaluate the cost and benefit of each checkpoint.
Node (volatile): checkpointing
– User-level checkpoint: Condor Stand Alone Checkpointing (CSAC)
– Clone checkpointing + non-blocking checkpoint code: on a checkpoint order, (1) fork; the child then (2) terminates the ongoing communications, (3) closes the sockets and (4) calls ckpt_and_exit()
– The checkpoint image is sent to the CS on the fly (not stored locally)
– On restart, execution resumes using CSAC just after (4), reopens the sockets and returns
(A minimal sketch of the fork-based checkpoint is given below.)
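A minimal sketch of the clone/fork checkpoint described above (illustrative only: terminate_ongoing_comms, close_all_sockets and the ckpt_and_exit stub are invented stand-ins, not the CSAC API): the parent resumes computation immediately, while the child drains communications, closes its sockets and produces the image that is streamed to the checkpoint server.

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Stand-ins for the real routines. */
    static void terminate_ongoing_comms(void) { /* (2) flush in-flight sends/receives */ }
    static void close_all_sockets(void)       { /* (3) close the TCP connections      */ }
    static void ckpt_and_exit(void) {          /* (4) image streamed to the CS        */
        printf("child: checkpoint image sent to the checkpoint server\n");
        fflush(stdout);
        _exit(0);
    }

    static void take_checkpoint(void) {
        pid_t pid = fork();                    /* (1) clone the MPI process            */
        if (pid == 0) {                        /* child builds the checkpoint image    */
            terminate_ongoing_comms();
            close_all_sockets();
            ckpt_and_exit();
        }
        /* parent: keeps computing; on restart, execution resumes just after (4). */
    }

    int main(void) {
        take_checkpoint();
        printf("parent: computation continues while the child checkpoints\n");
        return 0;
    }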
Library: based on MPICH
– A new device: the 'ch_v2' device, plugged under the ADI / Channel Interface / Chameleon Interface; MPI_Send is bound through MPID_SendControl / MPID_SendChannel to the v2 device interface
– All ch_v2 device functions are blocking communication functions built over the TCP layer:
  _v2bsend: blocking send
  _v2brecv: blocking receive
  _v2probe: check for any message available
  _v2from: get the source of the last message
  _v2Init: initialize the client
  _v2Finalize: finalize the client
(An illustrative sketch of the binding is given below.)
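The sketch below shows how such a binding could look (the function names come from the slide, but their signatures and the stub bodies are assumptions; the real ch_v2 device is built over TCP inside MPICH):

    #include <stdio.h>
    #include <stddef.h>

    /* Blocking primitives of the v2 device interface (stubbed here). */
    static int _v2bsend(int dest, const void *buf, size_t len) {   /* blocking send    */
        (void)buf;
        printf("blocking send of %zu bytes to %d\n", len, dest);
        return 0;
    }
    static int _v2brecv(int src, void *buf, size_t len) {          /* blocking receive */
        (void)buf;
        printf("blocking recv of %zu bytes from %d\n", len, src);
        return 0;
    }

    /* One of the Channel Interface entry points the ADI routes MPI_Send through;
     * in this sketch it simply forwards to the device primitive. */
    static int MPID_SendChannel(int dest, const void *buf, size_t len) {
        return _v2bsend(dest, buf, len);
    }

    int main(void) {
        char out[] = "payload", in[64];
        MPID_SendChannel(1, out, sizeof out);
        _v2brecv(1, in, sizeof in);
        return 0;
    }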
Outline: Introduction – Architecture – Performance – Perspective & Conclusion
Performance evaluation
Cluster: Athlon CPUs (1 GB, IDE disc) + 16 dual Pentium III, 500 MHz (512 MB, IDE disc) + a 48-port 100 Mb/s Ethernet switch; Linux, GCC 2.96 (-O3), PGI Fortran <5 (-O3, -tp=athlonxp).
A single reliable node hosts the Checkpoint Server, the Event Logger, the Checkpoint Scheduler and the Dispatcher.
Bandwidth and latency
Latency for a 0-byte MPI message: MPICH-P4 (77 µs), MPICH-V1 (154 µs), MPICH-V2 (277 µs).
The latency is high due to event logging: a receiving process can send a new message only once the reception event has been successfully logged (3 TCP messages per communication).
The bandwidth is high because event messages are short.
NAS Benchmarks, Class A and B (figure: Megaflops; plot annotations: latency, memory capacity (logging on disc)).
Breakdown of the execution time
Faulty execution performance (figure): with one fault every 45 seconds, the execution time increases by about 190 s (+80%).
Outline: Introduction – Architecture – Performance – Perspective & Conclusion
Perspectives
Compare to coordinated techniques
– threshold of fault frequency where logging techniques become more valuable
– MPICH-V/CL (Cluster 2003)
Hierarchical logging for Grids
– tolerate node failures & cluster failures
– MPICH-V3 (SC 2003 poster session)
Address the latency of MPICH-V2
– use causal logging techniques?
Conclusion
MPICH-V2 is a completely new protocol replacing MPICH-V1 and removing the channel memories; the new protocol is pessimistic and sender based.
MPICH-V2 reaches a ping-pong bandwidth close to that of MPICH-P4. It cannot compete with MPICH-P4 on latency; however, for applications with large messages, performance is close to that of P4. In addition, MPICH-V2 resists up to one fault every 45 seconds.
Main conclusion: MPICH-V2 requires far fewer stable nodes than MPICH-V1, with better performance.
Come see the MPICH-V demos at the INRIA booth (3315).
Re-execution performance (1) (figure): time for the re-execution of a token ring on 8 nodes after a crash, according to the token size and the number of restarted nodes.
Re-execution performance (2)
Logging techniques (figure: the initial execution takes a checkpoint and then crashes; the replayed execution restarts this process from its last checkpoint)
The system must provide the messages to be replayed and discard the re-emissions.
Main problems:
– discarding re-emissions (technical),
– ensuring that messages are replayed in a consistent order.
(A toy illustration of discarding re-emissions is given below.)
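A toy illustration of the re-emission problem (assumption: messages carry a unique id, as in the sender-based log sketch above; everything else is invented for the example): the restarted process follows the order recorded by the event logger and silently drops any message it has already delivered.

    #include <stdio.h>

    #define MAX_ID 100
    static int already_delivered[MAX_ID];    /* per-message-id delivery flag */

    static void replay_receive(int msg_id) {
        if (already_delivered[msg_id]) {
            printf("msg %d is a re-emission: discarded\n", msg_id);
            return;
        }
        already_delivered[msg_id] = 1;
        printf("msg %d delivered in its logged order\n", msg_id);
    }

    int main(void) {
        int logged_order[] = {1, 2, 2, 3};    /* message 2 is re-emitted after the restart */
        for (int i = 0; i < 4; i++)
            replay_receive(logged_order[i]);
        return 0;
    }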
Large Scale Parallel and Distributed Systems and programming
Many HPC applications use the message passing paradigm; message passing means MPI.
We need a volatility-tolerant Message Passing Interface implementation, based on MPICH, which implements the MPI standard 1.1.
Checkpoint Server (stable)
A multiprocess server: it polls its open sockets (one per attached node, plus one per home Channel Memory of the attached nodes), treats events and dispatches the jobs to other processes.
Incoming messages carry Put-checkpoint transactions; outgoing messages carry Get-checkpoint transactions and control.
Checkpoint images are stored on a reliable medium (disc), one file per node (name given by the node).
(A sketch of the dispatch loop is given below.)
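A sketch of the server's dispatch logic (illustrative: the request structure, the file-naming scheme and the absence of real sockets are assumptions; only "one image file per node" and the poll/treat/dispatch structure come from the slide):

    #include <stdio.h>

    typedef enum { CKPT_PUT, CKPT_GET } ckpt_op;
    typedef struct { ckpt_op op; int node_id; } ckpt_request;

    /* In the real server each request arrives on a per-node socket and is
     * handed to a worker process; here requests are simply iterated. */
    static void handle_request(const ckpt_request *req) {
        char path[64];
        snprintf(path, sizeof path, "ckpt_node_%d.img", req->node_id); /* one file per node */
        if (req->op == CKPT_PUT)
            printf("storing checkpoint image %s on disc\n", path);
        else
            printf("sending checkpoint image %s back for restart\n", path);
    }

    int main(void) {
        ckpt_request incoming[] = { {CKPT_PUT, 3}, {CKPT_GET, 3} };
        for (int i = 0; i < 2; i++)
            handle_request(&incoming[i]);
        return 0;
    }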
NAS Benchmarks, Class A and B (figure; plot annotations: latency, memory capacity (logging on disc)).