
Improved Message Logging versus Improved Coordinated Checkpointing for Fault Tolerant MPI
Pierre Lemarinier, joint work with F. Cappello, A. Bouteiller, T. Herault, G. Krawezik
Grand Large Project

Fault tolerance: why?
- Current trend: increasing number of processors; more components increase the fault probability
- Many numerical applications use the MPI (Message Passing Interface) library
- Need for an automatic fault tolerant MPI

Outline  Protocols and Related works  Comparison framework  Performance  Conclusion and future works

Fault tolerant protocols: the problem of inconsistent states
- Uncoordinated checkpoint: the problem of inconsistent states
- The order of message receptions is a non-deterministic event; messages received but not sent make a global state inconsistent
- The domino effect can force a rollback to the beginning of the execution even for a single fault: possible loss of the whole execution and unpredictable fault cost
[Figure: space-time diagram with processes P0, P1, P2, checkpoints C1, C2, C3 and messages m1, m2, m3]

Fault tolerant protocols: global checkpoint
- Coordinated checkpoint: all processes coordinate their checkpoints so that the global system state is consistent (Chandy & Lamport algorithm); negligible overhead on fault-free execution (see the sketch below)
- Requires a global synchronization (taking a checkpoint may take a long time because of checkpoint server stress)
- In the case of even a single fault, all processes have to roll back to their checkpoints: high fault recovery cost
- Efficient when the fault frequency is low
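As an illustration only, a minimal sketch of the coordination idea in C using plain MPI calls. The take_local_checkpoint() helper is hypothetical, and the two barriers stand in for the real Chandy & Lamport marker algorithm, which also captures in-flight messages; MPICH-Vcl does not work this way internally.

    /* Hedged sketch: assumes the application has quiesced its own
     * point-to-point traffic before calling; the real protocol uses
     * Chandy & Lamport markers to capture in-flight messages.        */
    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical helper: dump this process's state to stable storage. */
    static void take_local_checkpoint(int rank)
    {
        char path[64];
        snprintf(path, sizeof(path), "ckpt_rank%d.img", rank);
        FILE *f = fopen(path, "w");
        if (f) { fprintf(f, "state of rank %d\n", rank); fclose(f); }
    }

    void coordinated_checkpoint(MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        MPI_Barrier(comm);            /* nobody checkpoints before everyone is here     */
        take_local_checkpoint(rank);  /* the set of local images is globally consistent */
        MPI_Barrier(comm);            /* nobody resumes sending before everyone is done */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        coordinated_checkpoint(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }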

Fault tolerant protocols: message log (1/2)
- Pessimistic log: every message received by a process is logged on a reliable medium before the process can causally influence the rest of the system (see the sketch below); non-negligible overhead on network performance during fault-free execution
- No global synchronization needed; does not stress the checkpoint servers
- No need to roll back non-failed processes: fault recovery overhead is limited
- Efficient when the fault frequency is high
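A minimal sketch of the pessimistic rule, as an illustration rather than the MPICH-V2 implementation: the sender keeps a copy of each payload, and the receiver blocks until the reception event is acknowledged by a reliable event logger before it is allowed to send again. log_payload_locally() and log_event_on_reliable_server() are hypothetical placeholders, not MPICH-V APIs.

    #include <mpi.h>

    /* Hypothetical: keep a copy of the outgoing payload in sender memory
     * so it can be replayed to a restarted receiver (sender-based log).  */
    static void log_payload_locally(const void *buf, int count, int dest, int tag)
    {
        (void)buf; (void)count; (void)dest; (void)tag;
    }

    /* Hypothetical: synchronously record the reception determinant
     * (source, tag, reception order) on a reliable node and wait for
     * its acknowledgement before returning.                             */
    static void log_event_on_reliable_server(int src, int tag)
    {
        (void)src; (void)tag;
    }

    int logged_send(void *buf, int count, MPI_Datatype type,
                    int dest, int tag, MPI_Comm comm)
    {
        log_payload_locally(buf, count, dest, tag);
        return MPI_Send(buf, count, type, dest, tag, comm);
    }

    int logged_recv(void *buf, int count, MPI_Datatype type,
                    int src, int tag, MPI_Comm comm, MPI_Status *status)
    {
        int rc = MPI_Recv(buf, count, type, src, tag, comm, status);
        /* Pessimistic rule: do not let this process influence anyone
         * else until the reception event is safely logged.              */
        log_event_on_reliable_server(status->MPI_SOURCE, status->MPI_TAG);
        return rc;
    }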

Fault tolerant protocols: message log (2/2)
- Causal log: designed to improve the fault-free performance of the pessimistic log
- Messages are logged locally and causal dependencies are piggybacked on outgoing messages (see the sketch below); non-negligible overhead on fault-free execution, but slightly better than the pessimistic log
- No global synchronization; does not stress the checkpoint server
- Only failed processes are rolled back; a failed process retrieves what it needs to rebuild its state from the processes that depend on it, or no process depends on it
- Fault recovery overhead is limited, but greater than with the pessimistic log
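For illustration, a sketch of the piggybacking idea, not the MPICH-Vcausal wire format: determinants recorded on reception are kept locally and prepended to every outgoing message until they are known to be stable. The determinant_t layout and the two-message framing are assumptions made for brevity.

    #include <mpi.h>

    #define MAX_DETS 1024

    typedef struct { int src, tag, order; } determinant_t;  /* hypothetical layout */

    static determinant_t pending[MAX_DETS];  /* determinants not yet known stable */
    static int n_pending = 0;

    /* On reception: record the determinant locally instead of waiting for
     * a remote acknowledgement as the pessimistic protocol does.          */
    static void record_determinant(int src, int tag, int order)
    {
        if (n_pending < MAX_DETS)
            pending[n_pending++] = (determinant_t){ src, tag, order };
    }

    /* On emission: the causal history travels with the message, so any
     * process that depends on this one also carries the determinants
     * needed to replay it after a crash.                                  */
    static int causal_send(void *buf, int count, int dest, int tag, MPI_Comm comm)
    {
        MPI_Send(&n_pending, 1, MPI_INT, dest, tag, comm);
        MPI_Send(pending, n_pending * (int)sizeof(determinant_t), MPI_BYTE,
                 dest, tag, comm);
        return MPI_Send(buf, count, MPI_BYTE, dest, tag, comm);
    }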

Fault tolerant MPI
A classification of fault tolerant message passing environments considering (A) the level in the software stack where fault tolerance is managed and (B) the fault tolerance technique.
[Figure: classification chart covering, among others, Cocheck (independent of MPI), Starfish (enrichment of MPI), Clip (semi-transparent checkpoint), FT-MPI (modification of MPI routines, user fault treatment), MPI/FT (redundance of tasks), MPI-FT (N faults, centralized server), sender-based message logging (1 fault), Pruitt 98 (2 faults, sender based), optimistic recovery in distributed systems (n faults with coherent checkpoint), Manetho (n faults), Egida, LAM/MPI and MPICH-V (N faults), organized along the axes automatic vs. non-automatic, level (API, framework, communication library) and technique (coordinated checkpoint, optimistic, pessimistic or causal log)]
Several protocols provide fault tolerance in MPI applications with N faults and automatic recovery: global checkpointing and pessimistic/causal message log. The goal is to compare these fault tolerant protocols within a single MPI implementation.

Comparison: related works
- Egida compared log-based techniques (Sriram Rao, Lorenzo Alvisi, Harrick M. Vin: The Cost of Recovery in Message Logging Protocols. In 17th Symposium on Reliable Distributed Systems (SRDS), pages 10-18, IEEE Press, October 1998): causal log is better for single node faults, pessimistic log is better for concurrent faults
- A first comparison of coordinated checkpointing and message logging was realized last year (Cluster 2003): high fault recovery overhead of coordinated checkpoint, high overhead of message logging on fault-free performance; the fault frequency implies a tradeoff
- Several protocols perform automatic fault tolerance in MPI applications (coordinated checkpoint, causal message log, pessimistic message log); all of them have been studied theoretically but not compared

Outline  Protocols and Related works  Comparison framework  Performance  Conclusion and future works

Generic device: based on MPICH
- A new device: the 'ch_v' device
- All ch_v device functions are blocking communication functions built over the TCP layer (a sketch of such a blocking send follows)
[Figure: binding of MPI_Send to the ch_v device: MPI_Send maps to MPID_SendControl / MPID_SendChannel, which rely on _bsend (blocking send), _brecv (blocking receive), _probe (check for any message available), _from (get the source of the last message), _Init (initialize the client) and _Finalize (finalize the client)]
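As an illustration of what a blocking ch_v-style function reduces to over TCP (the _bsend and related names above belong to the MPICH abstract device interface; this helper is a hypothetical sketch, not MPICH-V code):

    #include <unistd.h>
    #include <errno.h>
    #include <stddef.h>

    /* Blocking send over an already connected TCP socket: loop on write()
     * until the whole buffer has been handed to the kernel.               */
    static int blocking_tcp_send(int sock, const char *buf, size_t len)
    {
        size_t sent = 0;
        while (sent < len) {
            ssize_t n = write(sock, buf + sent, len - sent);
            if (n < 0) {
                if (errno == EINTR)
                    continue;          /* interrupted by a signal: retry */
                return -1;             /* real error                     */
            }
            sent += (size_t)n;
        }
        return 0;
    }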

Architectures
- MPICH-Vcl: coordinated checkpoint (Chandy & Lamport algorithm)
- MPICH-V for message logging protocols
We designed MPICH-V to perform a fair comparison of coordinated checkpoint and pessimistic message log.

Checkpointing method
- User-level checkpoint: Condor Stand Alone Checkpointing (CSAC)
- Clone checkpointing + non-blocking checkpoint: on checkpoint order, (1) fork, (2) terminate ongoing communications, (3) close sockets, (4) call ckpt_and_exit()
- The checkpoint image is sent to the reliable checkpoint server on the fly
- Execution resumes using CSAC just after (4): reopen sockets and return
[Figure: checkpoint sequence within a process composed of the application code, CSAC and libmpichv]
(a minimal sketch of the fork-based scheme follows)
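A minimal sketch of the fork-based scheme described above, for illustration only: csac_checkpoint_and_exit() is a hypothetical stand-in for the Condor Stand Alone Checkpointing entry point, and the socket teardown and transfer to the checkpoint server are omitted.

    #include <sys/types.h>
    #include <unistd.h>
    #include <stdlib.h>

    /* Hypothetical stand-in for CSAC's checkpoint-and-exit call. */
    static void csac_checkpoint_and_exit(const char *image_path)
    {
        (void)image_path;  /* a real implementation would dump the address space */
        _exit(0);
    }

    static void nonblocking_checkpoint(void)
    {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: a frozen copy of the parent at fork() time.
             * Terminate ongoing communications, close sockets (omitted),
             * then dump the image and exit.                              */
            csac_checkpoint_and_exit("ckpt.img");
        } else if (pid > 0) {
            /* Parent: resumes the computation immediately; the checkpoint
             * proceeds concurrently in the child.                         */
        }
        /* pid < 0: fork failed; a real implementation would report it.    */
    }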

Pessimistic message logging example

Main limitations of the two implementations
- Global checkpointing: checkpoint server overloaded when restarting, even after a single fault
- Pessimistic message logging: high latency during fault-free execution due to the wait for the acknowledgement

Improvement of the checkpointing
- Add local checkpointing on disk: the checkpoint server is less stressed (see the sketch below)
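As an illustration of the idea (not the actual MPICH-Vcl code), a sketch where each process also keeps its image on local disk, so that after a fault only the restarted process has to fetch an image from the remote checkpoint server; all names below are hypothetical.

    #include <stdio.h>
    #include <stdbool.h>

    static bool local_image_exists(int rank)
    {
        char path[64];
        snprintf(path, sizeof(path), "/tmp/ckpt_rank%d.img", rank);
        FILE *f = fopen(path, "r");
        if (f) { fclose(f); return true; }
        return false;
    }

    /* Hypothetical: download the image from the remote checkpoint server. */
    static void fetch_image_from_server(int rank) { (void)rank; }

    static void restart_from_checkpoint(int rank)
    {
        if (!local_image_exists(rank))      /* only the process restarted on a */
            fetch_image_from_server(rank);  /* new node hits the remote server */
        /* ...load the image and resume execution (omitted)...                 */
    }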

Causal message logging

Outline  Protocols and Related works  Comparison framework  Performance  Conclusion and future works

Experimental conditions
- Ethernet experiments: AthlonXP CPU, 1 GB DDR SDRAM, 70 GB ATA100 IDE disk, 100 Mbit/s Ethernet card, connected by a single Fast Ethernet switch
- Myrinet experiments: AthlonXP-MP CPU, 1 GB DDR SDRAM, 70 GB ATA100 IDE disk, Myrinet2000 connected by a single 8-port Myrinet switch
- SCI experiments: AthlonXP CPU, 1 GB DDR SDRAM, 70 GB ATA100 IDE disk, 2D-torus SCI topology
- Software: Linux, GCC 2.96 (-O3), PGI Fortran <5 (-O3, -tp=athlonxp)
[Figure: compute nodes connected through the network, plus a single reliable node hosting the Checkpoint Server, the Event Logger (message logging only), the Checkpoint Scheduler and the Dispatcher]

Bandwidth and latency (Ethernet)
Latency for a 1-byte MPI message: TCP (75us), MPICH-P4 (100us), MPICH-V (135us), MPICH-Vcl (138us), MPICH-Vcausal (157us), MPICH-V2 (291us)
- Latency is higher in MPICH-Vcl than in MPICH-P4 due to additional memory copies
- Latency is even higher with the pessimistic log (MPICH-V2) due to the event logging: a receiving process can send a new message only once the reception event has been successfully logged (3 TCP messages per communication)

Bandwidth and latency (high speed networks)
Latency for a 1-byte MPI message (one set per network):
- TCP (43us), MPICH-P4 (53us), MPICH-V (94us), MPICH-Vcl (99us), MPICH-Vcausal (112us), MPICH-V2 (183us)
- TCP (23us), MPICH-P4 (34us), MPICH-V (76us), MPICH-Vcl (81us), MPICH-Vcausal (116us), MPICH-V2 (355us)

Benchmark applications

The high number of messages impacts causal performance for the CG and MG benchmarks.

Benchmark applications: the causal protocol is unable to successfully finish LU on 32 nodes due to the sender-based payload it has to store.

Fault impact on performance (BT, class B, 25 nodes)
- 20% overhead on fault-free execution for Vcausal (40% for the pessimistic implementation)
- Crosspoint between Vcl and Vcausal at … faults per second (0.002 for the crosspoint between the pessimistic log and remote-checkpoint Vcl)
- If we consider a 1 GB memory occupation for every process, an extrapolation places the crosspoint around one fault every 9 hours
- The message logging implementations can tolerate a high fault rate; MPICH-Vcl cannot ensure termination of the execution at a high fault rate

Outline  Protocols and Related works  Comparison framework  Performance  Conclusion and future works

Conclusion  MPICH-Vcl and MPICH-Vcausal are two comparable implementations of fault tolerant MPI from the MPICH-1.2.5, one using coordinated checkpoint, the other a causal message logging  We have compared the overhead of these two techniques according to fault frequency  The recovery overhead is the main factor differentiating performance  We have found a crosspoint from which message logging becomes better than coordinated checkpoint. On our test application this crosspoint appears near 1 per 3 minutes. The crosspoint for a 1GB dataset application should be around 9 hours. Considering MTBF of cluster greater than 9 hours, the coordinated checkpoint appear to be appropriate.

Perspectives  Larger scale experiments  Use more nodes and applications with realistic amount of memory  Ameliorate causal protocol with new piggyback techniques  Zero copy to address high performance networks