Global States and Checkpoints

Slides:

Advertisements

Similar presentations

Visual Formalisms Message Sequence Charts Book: Chapter 10.

Advertisements

Replicated Dictionary and Log

Determining Global States of Distributed Systems Presented by Sanjeev R. Kulkarni.

Distributed Snapshots: Non-blocking checkpoint coordination protocol Next: Uncoordinated Chkpnt.

Distributed Snapshots: Determining Global States of Distributed Systems - K. Mani Chandy and Leslie Lamport.

Global States in a Distributed System By John Kor and Yvonne Cheng.

Scalable Algorithms for Global Snapshots in Distributed Systems

Lecture 8: Asynchronous Network Algorithms

Uncoordinated Checkpointing The Global State Recording Algorithm.

Uncoordinated Checkpointing The Global State Recording Algorithm Cristian Solano.

Time and Global States Part 3 ECEN5053 Software Engineering of Distributed Systems University of Colorado, Boulder.

Computer Science Lecture 18, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.

CS 582 / CMPE 481 Distributed Systems

Causality & Global States. P1 P2 P Physical Time 4 6 Include(obj1 ) obj1.method() P2 has obj1 Causality violation occurs when order.

Computer Science Lecture 17, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.

Slides for Chapter 10: Time and Global State

Computer Science Lecture 11, page 1 CS677: Distributed OS Last Class: Clock Synchronization Logical clocks Vector clocks Global state.

EEC 688/788 Secure and Dependable Computing Lecture 14 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

EEC-681/781 Distributed Computing Systems Lecture 11 Wenbing Zhao Cleveland State University.

EEC-681/781 Distributed Computing Systems Lecture 11 Wenbing Zhao Cleveland State University.

Computer Science Lecture 10, page 1 CS677: Distributed OS Last Class: Clock Synchronization Physical clocks Clock synchronization algorithms –Cristian’s.

A Survey of Rollback-Recovery Protocols in Message-Passing Systems M. Elnozahy, L. Alvisi, Y. Wang, D. Johnson Carnegie Mellon University Presented by:

CIS 720 Distributed algorithms. “Paint on the forehead” problem Each of you can see other’s forehead but not your own. I announce “some of you have paint.

Checkpointing and Recovery. Purpose Consider a long running application –Regularly checkpoint the application Expensive task –In case of failure, restore.

A Survey of Rollback-Recovery Protocols in Message-Passing Systems.

Chapter 9 Global Snapshot. Global state  A set of local states that are concurrent with each other Concurrent states: no two states have a happened before.

EEC 688/788 Secure and Dependable Computing Lecture 7 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

Distributed Snapshot. Think about these -- How many messages are in transit on the internet? --What is the global state of a distributed system of N processes?

Checkpointing and Recovery. Purpose Consider a long running application –Regularly checkpoint the application Expensive task –In case of failure, restore.

Chapter 2 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University Building Dependable Distributed Systems.

CSE 486/586, Spring 2013 CSE 486/586 Distributed Systems Global States Steve Ko Computer Sciences and Engineering University at Buffalo.

Distributed Snapshot. One-dollar bank Let a $1 coin circulate in a network of a million banks. How can someone count the total $ in circulation? If not.

Hwajung Lee. -- How many messages are in transit on the internet? --What is the global state of a distributed system of N processes? How do we compute.

EEC 688/788 Secure and Dependable Computing Lecture 6 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

CSE 486/586 CSE 486/586 Distributed Systems Global States Steve Ko Computer Sciences and Engineering University at Buffalo.

Hwajung Lee. -- How many messages are in transit on the internet? --What is the global state of a distributed system of N processes? How do we compute.

Global state and snapshot

Consistent cut A cut is a set of events.

Prepared by Ertuğrul Kuzan

Global state and snapshot

EEC 688/788 Secure and Dependable Computing

Lecture 3: State, Detection

Operating System Reliability

Distributed Snapshot.

EECS 498 Introduction to Distributed Systems Fall 2017

Operating System Reliability

Operating System Reliability

EEC 688/788 Secure and Dependable Computing

Distributed Snapshot.

EEC 688/788 Secure and Dependable Computing

Distributed Snapshot Distributed Systems.

EEC 688/788 Secure and Dependable Computing

Uncoordinated Checkpointing

Slides for Chapter 11: Time and Global State

EEC 688/788 Secure and Dependable Computing

ITEC452 Distributed Computing Lecture 8 Distributed Snapshot

EEC 688/788 Secure and Dependable Computing

Distributed Snapshot.

EEC 688/788 Secure and Dependable Computing

Jenhui Chen Office number:

Distributed algorithms

CIS825 Lecture 5 1.

Consistent cut If this is not true, then the cut is inconsistent

Slides for Chapter 14: Time and Global States

Last Class: Fault Tolerance

Operating System Reliability

Chandy-Lamport Example

Operating System Reliability

Distributed Snapshot.

Presentation transcript:

Global States and Checkpoints CS 271

Distributed Checkpoints and Rollback Recovery Fault tolerance is achieved by periodically using stable storage to save the processes’ states during the failure-free execution. Upon a failure, a failed process rolls back from one of its saved states, thereby reducing the amount of lost computation. Each of the saved states is called a checkpoint CS 271

Checkpoint based Recovery Uncoordinated checkpointing: Each process takes its checkpoints independently Coordinated checkpointing: Process coordinate their checkpoints in order to save a system-wide consistent state. Communication-induced checkpointing: It forces each process to take checkpoints based on information piggybacked on the application messages it receives from other processes. CS 271

Domino effect: example Recovery Line P0 m7 m2 m5 m0 m3 P1 m6 m4 m1 P2 Domino Effect: Cascaded rollback which causes the system to roll back to too far in the computation (even to the beginning), in spite of all the checkpoints CS 271

Global State Chandy and Lamport—TOCS 1985 Global state of a distributed system Local state of each process Messages sent but not received Many applications need the state of the system Failure recovery, distributed deadlock detection Detect stable properties. Problem: how can you figure out the state of a distributed system? Each process is independent Network does not have any processing power. Distributed snapshot: a consistent global state CS 271

Global State A consistent cut An inconsistent cut CS 271

Distributed Snapshot Algorithm Assume each process communicates with another process using unidirectional FIFO point-to-point channels (e.g, TCP connections) Any process can initiate the algorithm Checkpoint local state Send MARKER on every outgoing channel On receiving a first marker on a channel: Process checkpoints local state and Send markers on all outgoing channels, and save messages on all other channels. On receiving subsequent marker on a channel: stop saving messages for that channel Saved messages are the state of the channel CS 271

Distributed Snapshot A process finishes when It receives a marker on each incoming channel and processes them all State: local state plus state of all channels Send state to initiator Any process can initiate snapshot Multiple snapshots may be in progress Each is distinguished by tagging the marker with the initiator ID (and sequence number) CS 271

Snapshot Algorithm Example Organization of a process and channels for a distributed snapshot CS 271

Snapshot Algorithm Example Process Q receives a marker for first time and records local state Q records all incoming message Q receives a marker for its incoming channel and finishes recording the state of the incoming channel CS 271

Execution Example Sp0 Sp1 Sp2 Sp3 p m1 m2 m3 q Sq0 Sq1 Sq2 Sq3 CS 271

Execution Example q records state as Sq1 , sends marker to p Sp0 Sp1 CS 271

Execution Example p records state as Sp2, channel state as empty Sp0 q Sq0 Sq1 Sq2 Sq3 CS 271

Execution Example q records channel state as m3 Sp0 Sp1 Sp2 Sp3 p m1 Sq0 Sq1 Sq2 Sq3 CS 271

Execution Example Recorded Global State = ((Sp2, Sq1), (0,m3) ) Sp0 CS 271

Take home Message (Snapshot and global states) General solution for global state detection. Causality based detection of stable properties. Simple efficient protocol, uses Markers and FIFO properties. Don’t forget Channel States. Foundation for Distributed Checkpointing and Rollback Recovery CS 271