Distributed Snapshots: Determining Global States of Distributed Systems Joshua Eberhardt Research Paper: Kanianthra Mani Chandy and Leslie Lamport.

Presentation transcript:

Background. What is a distributed system? A set of autonomous computers, a communication network connecting them, and software that integrates them into a single entity.

Figure 1

Overview Introduction Model of a Distributed System Global-state Detection Algorithm Motivation Termination Stability Detection

Processes in Distributed Systems. A process is an instance of a computer program being executed. Processes in a distributed system communicate by sending and receiving messages. A process can record its own state and the messages it sends and receives; it can record nothing else.

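Global States and Processes. To determine a global state, a process p must enlist the cooperation of the other processes, which record their own local states and send them to p. The main problem is to devise an algorithm that records a meaningful global state even though the processes cannot all record their states at the same instant.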

Global State Detection Problems. Let y be a predicate function defined over the global states of a distributed system D; that is, y(S) is true or false for each global state S of D. The predicate y is a stable property of D if y(S) implies y(S') for all global states S' of D reachable from global state S of D. In other words, once a stable property becomes true, it remains true.
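The definition of a stable property can be illustrated with a small Python sketch. It checks the property only along one given sequence of global states, not over every reachable state, so it is a sanity check and not a proof of stability. The function name `is_stable_along` and the representation of a global state as a count of active processes are assumptions made for this illustration.

```python
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")

def is_stable_along(y: Callable[[S], bool], states: Sequence[S]) -> bool:
    """Return True if, once y holds in some state of the sequence,
    it also holds in every later state of the same sequence."""
    seen_true = False
    for state in states:
        if y(state):
            seen_true = True
        elif seen_true:
            # y held earlier but fails now, so y is not stable here.
            return False
    return True

# Hypothetical global state: the number of processes that are still active.
def terminated(active_count: int) -> bool:
    return active_count == 0

computation = [3, 2, 0, 0]
print(is_stable_along(terminated, computation))            # termination stays true
print(is_stable_along(lambda k: k == 2, computation))      # "exactly 2 active" does not
```

Termination behaves like a stable property in this sequence, while "exactly two processes are active" becomes true and then false again, so it is not stable.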

Going Further. Many distributed system problems can be formulated as the general problem of devising an algorithm by which a process in a distributed system can determine whether a stable property y holds. Examples: deadlock detection and termination detection.

Structure of Distributed Algorithms. Many distributed algorithms are structured as a sequence of phases, each consisting of a transient part followed by a stable part. Stability needs to be detected so that one phase can be terminated and the next initiated. Note that the termination of a computational phase is not the same as the termination of a computation.

Termination Phase. The overall problem can be partitioned into the problems of detecting the termination of one phase and initiating a new phase. Example of a stable property: "the kth computational phase has terminated", for k = 1, 2, 3, … Thus we can determine the termination of the kth phase for any given k.

Channels. A distributed system consists of a finite set of processes and a finite set of channels. Channels are assumed to have infinite buffers, to be error-free, and to deliver messages in the order sent.

Linking the Terms. State of a channel: the sequence of messages sent along the channel, excluding the messages already received along it. Process: defined by a set of states, an initial state, and a set of events. Event: an atomic action that may change the state of a process and the state of at most one channel incident on that process.

Figure 2 Distributed system with processes p, q, r and channels C1, C2, C3, C4.

Events. An event e is defined by: (1) the process p in which the event occurs; (2) the state s of p immediately before the event; (3) the state s' of p immediately after the event; (4) the channel c (if any) whose state is altered by the event; (5) the message M (if any) sent or received along c. Based on these definitions, we define event e as the 5-tuple e = <p, s, s', M, c>.

Expanding to Global States. A global state of a distributed system is a set of component process and channel states. Initially, every process is in its initial state and every channel holds the empty sequence. The occurrence of an event may change the global state.

Events and Global States. Remember that e = <p, s, s', M, c>. We say that e can occur in a global state S if: (1) the state of p in S is s; and (2) if c is directed towards p, then the state of c in S is a sequence of messages with M at the head. If c is directed away from p, no condition on c is needed; after the event, the state of c is the same sequence with M appended at the tail.

Going Further. Define a function next, where next(S, e) is the global state immediately after the occurrence of event e in global state S. The value of next(S, e) is defined only if event e can occur in global state S. In next(S, e), the state of p is s'; if c is directed towards p, M is removed from the head of c; if c is directed away from p, M is appended to the tail of c; all other process and channel states are unchanged.

Computational Model. Let $seq = (e_i : 0 \le i \le n)$ be a sequence of events in component processes of a distributed system. We say that seq is a computation of the system iff event $e_i$ can occur in global state $S_i$ for $0 \le i \le n$, where $S_0$ is the initial global state and $S_{i+1} = next(S_i, e_i)$ for $0 \le i \le n$.
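The event model, the "can occur" test, the next function, and the definition of a computation can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names `Event`, `GlobalState`, `CHANNELS`, `can_occur`, `next_state`, and `is_computation`, the two-process topology, and the state labels "s0" and "s1" are all illustrative and do not come from the paper.

```python
from typing import NamedTuple, Optional

class Event(NamedTuple):
    p: str                 # process in which the event occurs
    s: str                 # state of p immediately before the event
    s_next: str            # state of p immediately after the event
    M: Optional[str]       # message sent or received, if any
    c: Optional[str]       # channel whose state is altered, if any

class GlobalState(NamedTuple):
    procs: dict            # process name -> process state
    chans: dict            # channel name -> tuple of messages in transit

# Assumed topology: channel name -> (sender, receiver).
CHANNELS = {"c": ("p", "q"), "c_prime": ("q", "p")}

def can_occur(e: Event, S: GlobalState) -> bool:
    if S.procs[e.p] != e.s:
        return False
    if e.c is not None and CHANNELS[e.c][1] == e.p:   # c is directed towards p
        msgs = S.chans[e.c]
        return len(msgs) > 0 and msgs[0] == e.M       # M must be at the head
    return True

def next_state(S: GlobalState, e: Event) -> GlobalState:
    if not can_occur(e, S):
        raise ValueError("next(S, e) is undefined: e cannot occur in S")
    procs = {**S.procs, e.p: e.s_next}
    chans = dict(S.chans)
    if e.c is not None:
        if CHANNELS[e.c][1] == e.p:
            chans[e.c] = chans[e.c][1:]               # receive: drop M from the head
        else:
            chans[e.c] = chans[e.c] + (e.M,)          # send: append M to the tail
    return GlobalState(procs, chans)

def is_computation(seq, S0: GlobalState) -> bool:
    S = S0
    for e in seq:
        if not can_occur(e, S):
            return False
        S = next_state(S, e)
    return True

# Single-token system: p holds the token ("s1") and sends it to q along c.
S0 = GlobalState({"p": "s1", "q": "s0"}, {"c": (), "c_prime": ()})
send = Event("p", "s1", "s0", "token", "c")
recv = Event("q", "s0", "s1", "token", "c")
print(is_computation([send, recv], S0))   # send then receive is allowed
print(is_computation([recv, send], S0))   # receive before send is not
```

The receive event cannot occur in the initial global state because channel c is empty, so only the first ordering is a computation.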

Example: Single Token Conversation (Deterministic) Simple distributed system State Transition Diagram of a Process

Example: Single Token Conversation (Deterministic)

Example: Message Passing (Nondeterministic) New State Transition Diagrams

Example: Message Passing (Nondeterministic). More than one event can occur in the initial global state, and each choice leads to a different sequence of subsequent global states.

Motivation. How the global-state recording algorithm works: each process records its own state, and the two processes on which a channel is incident cooperate in recording the channel state. The algorithm is superimposed on the underlying computation and must not alter it. The next example assumes, purely to build intuition, that the state of a channel can be recorded instantaneously. Let c be a channel from p to q.

Single Token Example. Assume the state of process p is recorded in global state "in p", and the system then moves to global state "in c". Suppose the states of c, c', and q are recorded in global state "in c". The recorded global state then shows two tokens: one in p and one in c. The inconsistency arises because the state of p was recorded before p sent the message along c, while the state of c was recorded after p sent it.

Notation. Let n be the number of messages sent along c before p's state is recorded. Let n' be the number of messages sent along c before c's state is recorded. In the example above, the inconsistency arises because n < n' (0 < 1).

Another Scenario. Suppose the state of c is recorded in global state "in p". The system then moves to global state "in c", and the states of c', p, and q are recorded in global state "in c". The recorded global state shows no tokens in the system. Here the inconsistency arises because the state of c was recorded before p sent a message along c and the state of p was recorded after p sent it; in other words, n > n' (1 > 0). To maintain consistency, we need n = n'.
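The two inconsistent recordings in the single-token example can be reproduced with a short Python sketch. It assumes, as the slides do for intuition, that each component can be recorded instantaneously in a chosen global state. The names `TIMELINE`, `recorded_tokens`, and `c_prime` (standing for channel c') are illustrative assumptions.

```python
# Successive global states of the single-token system.
# Each maps a component (process or channel) to the number of tokens it holds.
TIMELINE = [
    {"p": 1, "c": 0, "q": 0, "c_prime": 0},   # index 0: global state "in p"
    {"p": 0, "c": 1, "q": 0, "c_prime": 0},   # index 1: global state "in c"
]

def recorded_tokens(record_at: dict) -> int:
    """record_at maps each component to the index of the global state
    in which that component's state is recorded."""
    return sum(TIMELINE[index][component] for component, index in record_at.items())

# p recorded in "in p", everything else in "in c": n < n', two tokens appear.
two = recorded_tokens({"p": 0, "c": 1, "q": 1, "c_prime": 1})

# c recorded in "in p", everything else in "in c": n > n', the token vanishes.
zero = recorded_tokens({"p": 1, "c": 0, "q": 1, "c_prime": 1})

# All components recorded in the same global state: n = n', exactly one token.
one = recorded_tokens({"p": 1, "c": 1, "q": 1, "c_prime": 1})

print(two, zero, one)
```

Only the recording with n = n' yields the single token that the system actually contains.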

In Relation to Messages Received. Let m be the number of messages received along c before q's state is recorded. Let m' be the number of messages received along c before c's state is recorded. Consistency requires m = m'. In every state, the number of messages received along a channel cannot exceed the number of messages sent along that channel, so $n' \ge m'$ and hence $n \ge m$.

Bank Example

Important Details to Note. The state of channel c that is recorded must be the sequence of messages sent along the channel before the sender's state is recorded, excluding the messages received along the channel before the receiver's state is recorded. If n' = m', the recorded state of c must be the empty sequence. If n' > m', the recorded state of c must be the (m' + 1)st, …, n'th messages sent by p along c.
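The rule for the recorded channel state amounts to a slice of the sequence of messages sent along c. The sketch below assumes the consistency conditions n = n' and m = m' already hold, so it takes only n and m; the function name `recorded_channel_state` and the sample messages are hypothetical.

```python
def recorded_channel_state(sent: list, n: int, m: int) -> list:
    """Return the state of channel c that must be recorded.

    sent: all messages sent by p along c, in order.
    n:    number of messages sent along c before p's state is recorded.
    m:    number of messages received along c before q's state is recorded.
    """
    if not 0 <= m <= n <= len(sent):
        raise ValueError("require 0 <= m <= n <= number of messages sent")
    # Messages (m + 1) through n in 1-based numbering are sent[m:n] in Python.
    return sent[m:n]

messages = ["a", "b", "c", "d"]
print(recorded_channel_state(messages, n=3, m=1))   # the 2nd and 3rd messages
print(recorded_channel_state(messages, n=2, m=2))   # empty sequence when n = m
```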

Markers. From these conditions we can devise an algorithm by which q can record the state of channel c. Process p sends a special message, called a marker, after the nth message it sends along c and before sending any further messages along c. The state of c is the sequence of messages received by q after q records its own state and before q receives the marker along c. To ensure $n \ge m$, q must record its state, if it has not already done so, after receiving a marker along c and before receiving further messages along c.

Algorithm Outline. Marker-Sending Rule for a process p: for each channel c incident on and directed away from p, p sends one marker along c after p records its state and before p sends further messages along c.

Algorithm Outline. Marker-Receiving Rule for a process q: on receiving a marker along a channel c, if q has not recorded its state, then q records its state and records the state of c as the empty sequence; otherwise, q records the state of c as the sequence of messages received along c after q's state was recorded and before q received the marker along c.
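The marker-sending and marker-receiving rules can be sketched as a small single-threaded Python simulation. This is an illustrative sketch, not the paper's formulation: channels are modeled as FIFO deques, the application is a two-process money transfer, and the names `Process`, `send`, `deliver`, `pq`, and `qp` are assumptions made for the example.

```python
from collections import deque

MARKER = "MARKER"

class Process:
    def __init__(self, name, balance):
        self.name = name
        self.state = balance          # application state: an account balance
        self.recorded_state = None    # None until the process records its state
        self.channel_state = {}       # incoming channel name -> recorded messages
        self.recording = set()        # incoming channels still being recorded
        self.outgoing = {}            # outgoing channel name -> FIFO deque
        self.incoming = []            # incoming channel names

    def record_state(self):
        """Marker-sending rule: record own state, then send one marker on
        every outgoing channel before sending anything else."""
        self.recorded_state = self.state
        self.recording = set(self.incoming)
        for channel in self.outgoing.values():
            channel.append(MARKER)

    def receive(self, channel_name, message):
        if message == MARKER:
            # Marker-receiving rule.
            if self.recorded_state is None:
                self.record_state()
                self.channel_state[channel_name] = []    # empty sequence
            else:
                # Messages seen since recording are already in channel_state.
                self.channel_state.setdefault(channel_name, [])
            self.recording.discard(channel_name)
        else:
            if channel_name in self.recording:
                self.channel_state.setdefault(channel_name, []).append(message)
            self.state += message                        # application behaviour

def send(process, channel_name, amount):
    process.state -= amount
    process.outgoing[channel_name].append(amount)

def deliver(channel, channel_name, receiver):
    receiver.receive(channel_name, channel.popleft())

# Two processes connected by one channel in each direction.
pq, qp = deque(), deque()
p, q = Process("p", 10), Process("q", 5)
p.outgoing, p.incoming = {"pq": pq}, ["qp"]
q.outgoing, q.incoming = {"qp": qp}, ["pq"]

send(p, "pq", 3)        # p transfers 3 to q
p.record_state()        # p initiates the snapshot after that send
send(q, "qp", 2)        # q transfers 2 to p before it sees the marker
deliver(pq, "pq", q)    # q receives 3
deliver(pq, "pq", q)    # q receives the marker and records its state
deliver(qp, "qp", p)    # p receives 2 while recording channel qp
deliver(qp, "qp", p)    # p receives q's marker; recording of qp ends

in_channels = sum(sum(msgs) for proc in (p, q) for msgs in proc.channel_state.values())
total = p.recorded_state + q.recorded_state + in_channels
print(p.recorded_state, q.recorded_state, p.channel_state, q.channel_state, total)
```

In this run the recorded global state accounts for all the money in the system: the recorded balances plus the amount recorded in transit equal the initial total of 15, even though the processes recorded their states at different times.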

Termination of the Algorithm The marker receiving and sending rules guarantee that if a marker is received along every channel, then each process will record its state and the states of all incoming channels.

Finite Time. To ensure that the global-state recording algorithm terminates in finite time, each process must ensure that (1) no marker remains forever in an incident input channel, and (2) it records its state within finite time of the initiation of the algorithm.

Finite Time. If process p records its state and there is a channel from p to q, then q will record its state in finite time. Termination in finite time is therefore ensured if, for every process q, either q spontaneously records its state or there is a path to q from some process p that spontaneously records its state.

Motivation. Stability detection is a paradigm for many practical problems, such as distributed deadlock detection. The problem is defined as follows. Input: a stable property y. Output: a Boolean value definite with the property $(y(S_\iota) \rightarrow definite)$ and $(definite \rightarrow y(S_\phi))$, where $S_\iota$ is the global state when the algorithm is initiated and $S_\phi$ is the global state when it terminates.

What This Means. The input of the algorithm is the definition of the function y. During execution of the algorithm, the value y(S) for any global state S may be determined by a process in the system by applying y to S. Saying that the output is the Boolean value definite means that some process p enters and thereafter remains in one special state to signal that definite = true, or in another special state to signal that definite = false.

Definite Value. definite = true implies that the stable property holds when the algorithm terminates. definite = false implies that the stable property does not hold when the algorithm is initiated.

Solution. begin record a global state S*; definite := y(S*); end. Correctness of the stability detection algorithm follows from these facts: S* is reachable from $S_\iota$ and $S_\phi$ is reachable from S* (Theorem); $y(S) \rightarrow y(S')$ for all S' reachable from S (definition of a stable property).
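The solution above can be written as a few lines of Python. The sketch assumes a hypothetical callable `record_global_state` that runs a snapshot algorithm and returns the recorded state S*; here it is replaced by stubs returning fixed dictionaries. The state layout and the predicate `terminated` are also assumptions for illustration.

```python
from typing import Callable

def stability_detection(record_global_state: Callable[[], dict],
                        y: Callable[[dict], bool]) -> bool:
    """Record a global state S* and return definite := y(S*)."""
    s_star = record_global_state()
    definite = y(s_star)
    return definite

# Example stable property: every process is idle and every channel is empty.
def terminated(S: dict) -> bool:
    all_idle = all(state == "idle" for state in S["procs"].values())
    no_messages = all(len(msgs) == 0 for msgs in S["chans"].values())
    return all_idle and no_messages

# Stub snapshots standing in for the output of a real recording algorithm.
def snapshot_running() -> dict:
    return {"procs": {"p": "idle", "q": "idle"}, "chans": {"c": ["work"], "c_prime": []}}

def snapshot_done() -> dict:
    return {"procs": {"p": "idle", "q": "idle"}, "chans": {"c": [], "c_prime": []}}

print(stability_detection(snapshot_running, terminated))   # a message is still in transit
print(stability_detection(snapshot_done, terminated))      # nothing left to do
```

Checking the channels as well as the processes matters: in the first snapshot both processes are idle, but a message in transit can still reactivate one of them.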

Conclusion. Distributed systems underlie many applications in use today, especially database applications. It is important to know how the processes interact with one another and to be able to determine a global state of the system to ensure that it is consistent.

References
Chandy, K. M. and Lamport, L. Distributed Snapshots: Determining Global States of Distributed Systems. 0-Spring2011/Literature/ChandyAndLamport.pdf
Llewellyn, M. Intro to OS: Distributed Process Management. distributed%20process%20management%20- %20part%202%20(12).pdf