Fault Tolerance in Charm++
Sayantan Chakravorty, Parallel Programming Laboratory

Overview
- Motivation
- Research Goals
- Basic Scheme
- Problems
- Solutions
- Status

Motivation
- As machines grow in size, the MTTF drops.
- Plausible figures: with on the order of 100,000 nodes and a per-node MTTF of 4 years (99.997% reliability), the MTTF of the system as a whole is about 20 minutes (worked out below).
- Checkpointing time is higher, and restart time is higher.
- If the MTBF is lower than the restart time, checkpoint/restart becomes impossible.
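The arithmetic behind those figures, assuming independent node failures (the 100,000-node count is inferred from the "99,999 nodes" remark on the next slide):

\[
\text{MTTF}_{\text{system}} \approx \frac{\text{MTTF}_{\text{node}}}{N}
= \frac{4\ \text{years}}{100{,}000}
\approx \frac{2.1 \times 10^{6}\ \text{min}}{10^{5}}
\approx 21\ \text{min}
\]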

Costly Checkpoint/Restart
- Synchronous checkpoints are too costly.
- Asynchronous checkpoints might cause cascading rollbacks.
- All nodes have to be restarted after a crash: it is an inefficient use of resources to restart 99,999 nodes just because 1 crashed.
- With a low MTBF, a large amount of computation time is wasted in rolling back and re-computing.
- Even nodes that are independent of the crashed node are restarted.

Idea
[Figure: application progress vs. execution time around a crash, comparing plain checkpoint-and-restart against our scheme; T_c denotes the checkpoint cost and T_p the restart cost.]

Research Goals
- Asynchronous checkpoints: each process takes a checkpoint independently, with no synchronization overhead; processes need not stop while checkpointing; cascading rollbacks are prevented.
- Restart of the crashed processor: only the crashed processor is rolled back to its previous checkpoint, and restart is fast.

Research Goals (contd.)
- Low runtime cost: while the system is fault-free, the cost of fault tolerance should be low.
- Implemented in Charm++: virtualization and the message-driven paradigm; Charm++ is latency tolerant; migration of objects is available.
- Extend to Adaptive MPI.

Basic Scheme
- Each object takes its checkpoint asynchronously.
- An object logs every message it sends to a different object (a sketch of such a sender-side log follows below).
- When a process crashes, another is restarted in its place: either from a pool of extra processors, or as another process on the same processor. Objects can later be migrated away from the extra process, and the residual process cleaned up.
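A minimal sketch of the sender-side log just described. The names (LogEntry, MessageLog) are illustrative, not the actual Charm++ data structures; the ticket field anticipates the ordering mechanism introduced on the later slides.

#include <cstdint>
#include <utility>
#include <vector>

// One logged message: everything needed to resend it after a crash.
struct LogEntry {
    int receiverId;              // destination object
    uint64_t ticket;             // receiver-assigned ticket number
    std::vector<char> payload;   // copy of the message body
};

// Per-sender message log, kept in the sender's memory alongside its
// own asynchronous checkpoint.
struct MessageLog {
    std::vector<LogEntry> entries;

    void record(int receiverId, uint64_t ticket, std::vector<char> payload) {
        entries.push_back({receiverId, ticket, std::move(payload)});
    }

    // After `receiverId` crashes and restarts from a checkpoint taken at
    // ticket `checkpointTicket`, resend everything newer than that.
    template <typename ResendFn>
    void replayFor(int receiverId, uint64_t checkpointTicket, ResendFn resend) const {
        for (const LogEntry& e : entries)
            if (e.receiverId == receiverId && e.ticket > checkpointTicket)
                resend(e);
    }
};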

The Rest of the Scheme
- When an object is restarted, it restarts from its last checkpoint.
- All objects that sent messages to the restarted object must resend every message it received since its last checkpoint.
- Duplicate messages generated by reprocessing the resent messages must be ignored; sequence-number-based windowing can be used (a sketch follows below).
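One plausible reading of the windowing idea, as a sketch with illustrative names: because every message carries a receiver-assigned, monotonically increasing ticket and is executed in ticket order, remembering the highest ticket already executed is enough to recognize replays.

#include <cstdint>

// Receiver-side duplicate filter. Assumes messages are delivered in
// ticket order (see the ordering slides below).
struct DuplicateFilter {
    uint64_t highestExecuted = 0;

    // Returns true if the message is new; false if it is a duplicate
    // regenerated while reprocessing resent messages.
    bool accept(uint64_t ticket) {
        if (ticket <= highestExecuted) return false;  // already processed
        highestExecuted = ticket;                     // advance the window
        return true;
    }
};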

[Figure: example with objects Blue, Pink, Red, and Green distributed across PE 0, PE 1, and PE 2; each object's sent messages are logged with their ticket numbers, and checkpoints are written to checkpoint storage.]

Correctness
- The result of the program should be unchanged by the crash.
- The state of the restarted chare, after it has received the resent messages, should be the same as before the crash.

State of a Chare
- A chare's state is modified only by messages, so the resent messages carry all the data needed to bring the chare up to date.
- The order of message processing matters: the same messages processed in a different order might lead to a different chare state.
- Messages must therefore be processed in the same order after the restart as they were originally.
- This order needn't be a specific order known to the user; any order selected by the system will do, as long as the system can repeat it after a crash.

Without Ordering
[Figure: objects A, C, and D send messages to B; without ordering, B may pass through different states (S1, S2, S3) depending on the interleaving, so a restart can diverge from the original run.]

Solution to the Ordering Problem
- Who decides the order? The best place to decide is the chare that is going to receive the messages.
- The sender gets a ticket number from the receiver and labels each message with it.
- All messages are processed in increasing order of ticket numbers.
- A copy of the message, along with its ticket number, is stored on the sender side (see the sketch below).
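A sketch of the ticket mechanism with illustrative names only (not the Charm++ implementation): the receiver issues tickets on request, executes messages in strictly increasing ticket order, and buffers anything that arrives early.

#include <cstdint>
#include <functional>
#include <map>
#include <utility>
#include <vector>

struct Message {
    uint64_t ticket;
    std::vector<char> payload;
};

struct OrderedReceiver {
    uint64_t nextTicketToIssue = 1;    // handed out to senders on request
    uint64_t nextTicketToExecute = 1;  // defines the repeatable order
    std::map<uint64_t, Message> pending;  // early arrivals, keyed by ticket

    // Step 1: the sender asks for a ticket before sending the message.
    uint64_t issueTicket() { return nextTicketToIssue++; }

    // Step 2: execute in ticket order; out-of-order arrivals wait.
    void deliver(Message m, const std::function<void(const Message&)>& execute) {
        uint64_t t = m.ticket;
        pending.emplace(t, std::move(m));
        while (!pending.empty() && pending.begin()->first == nextTicketToExecute) {
            execute(pending.begin()->second);
            pending.erase(pending.begin());
            ++nextTicketToExecute;
        }
    }
};

After a restart, resent messages arrive with their original tickets, so deliver() reproduces exactly the pre-crash execution order, which is what the correctness requirement on the earlier slide demands.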

With Ordering
[Figure: the same exchange between A, C, D, and B, now with tickets (T1, T2); the tickets fix the order in which B processes the messages, so its states (S2, S3) are reproduced after a restart.]

Pros and Cons
- Advantages: defines an order among messages on the receiving side that can be repeated after a crash and restore.
- Disadvantages: increases the latency of communication, and the per-message overhead increases.

Logging Local Messages
- When a processor crashes, both the receiving object and the message log of a local message disappear with it.
- The obvious solution (sketched below): get a ticket from the receiving object using a function call; send a copy of the message to a "buddy" processor and wait for the ack; only then deliver the message to the receiver.
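The three-step buddy protocol as a sketch, reusing the Message and OrderedReceiver types from the ordering sketch; sendToBuddy, deliverLocally, and onBuddyAck are hypothetical hooks, not real Charm++ calls.

#include <cstdint>
#include <map>
#include <utility>

// Hypothetical transport hooks; in a real runtime these would be
// asynchronous sends and callbacks.
void sendToBuddy(int buddyPE, const Message& m);  // store a log copy remotely
void deliverLocally(const Message& m);            // hand to the co-located receiver

std::map<uint64_t, Message> pendingLocal;  // held back until the buddy acks

// Steps 1 and 2: take a ticket, then ship a copy of the message to the
// buddy processor so the log survives a crash of this processor.
void sendLocalMessage(OrderedReceiver& receiver, Message m, int buddyPE) {
    m.ticket = receiver.issueTicket();
    sendToBuddy(buddyPE, m);
    pendingLocal.emplace(m.ticket, std::move(m));
}

// Step 3: only once the buddy has acknowledged the copy is it safe to
// deliver the message to the local receiver.
void onBuddyAck(uint64_t ticket) {
    auto it = pendingLocal.find(ticket);
    if (it != pendingLocal.end()) {
        deliverLocally(it->second);
        pendingLocal.erase(it);
    }
}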

Implementation Issues
- Migration makes things much more complicated.
- As a first cut, we are implementing the scheme for groups: they don't migrate, and are much simpler than arrays.

Status
- Ongoing project: a fault-tolerant version of Charm++.
- Currently aimed at small clusters.
- The present implementation is limited to non-migrating objects.
- Testing on simple test cases such as Jacobi.

Future Work
Immediate aims:
- Extend the implementation to cover migratable objects.
- A failure-detection scheme suitable for BlueGene/L, tested on the BlueGene simulator.
- Implement a fault-tolerant version of Adaptive MPI.
- Optimize performance to reduce runtime overhead.
- Test a full-scale application such as NAMD on fault-tolerant Charm++ running on the BlueGene simulator.