Global States in a Distributed System By John Kor and Yvonne Cheng.



Initial Problem Example: Garbage Collector
- Frees up memory that is no longer in use
- Checks whether a reference to memory still exists
- What about in a distributed system?

Initial Problem Example (cont’d)
- A distributed system consists of multiple processes
- Each process is located on a different computer
- No sharing of processor or memory

Initial Problem Example (cont’d)
- Each process can only determine its own “state”
- Problem: How do we determine when to garbage collect in a distributed system?
- How do we check whether a reference to memory still exists?

System Model
- A distributed system consists of multiple processes
- Each process is located on a different computer
- Each process consists of “events”
- An event is either sending a message, receiving a message, or changing the value of some variable
- Each process has a communication channel in and out

Our Garbage Collection Problem
- To test whether a certain property of our system is true, we cannot just look at each process individually
- A “snapshot” of the entire system must be taken to test whether the property holds
- This “snapshot” is called a Global State

Definition
The global state of a distributed system is the set of local states of all the individual processes involved in the system, plus the states of the communication channels.
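As an illustration only, the definition can be written down as a small data structure. This is a sketch, not something from the slides: the class name `GlobalState`, its field names, and the example values are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalState:
    """Hypothetical container: local process states plus channel states."""
    # Local state of each process, keyed by process name.
    process_states: dict
    # Messages in transit on each channel, keyed by (sender, receiver).
    channel_states: dict = field(default_factory=dict)

    def in_transit(self):
        """Total number of messages recorded as still in the channels."""
        return sum(len(msgs) for msgs in self.channel_states.values())


# Illustrative values only: two processes and one message still in flight.
snapshot = GlobalState(
    process_states={"p": {"x": 1}, "q": {"y": 2}},
    channel_states={("p", "q"): ["m"], ("q", "p"): []},
)
print(snapshot.in_transit())
```

The point of the second field is that process states alone are not enough: a message that has been sent but not yet received lives only in a channel.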

Determinism
- Deterministic computation: at any point in the computation there is at most one event that can happen next.
- Non-deterministic computation: at any point in the computation there can be more than one event that can happen next.

Deterministic Computation

Non-Deterministic Computation

Determinism
- Deterministic computation: a local event would reveal everything about the global state! The process will know the other processes’ states.
- Non-deterministic computation: because of branching, a local event cannot reveal what the next step will be.

Simple Algorithm
- Create a new process that collects the states of every other process
- Every process saves its state at an arbitrary time and sends it to this new process

Advantages
- Very simple
- Easy to implement

Problems?
- Based on the assumption that all processes work on a synchronized global clock
- Wrong assumption!

Problems (cont’d)
[Figures: processes p and q exchanging message m, shown in four steps: state recorded by p; state recorded by q; global state recorded; another view]

- Process p has no record of sending m
- Process q HAS a record of receiving m
- Problem? The global state does not show p sending m, so there is confusion as to where m came from
- Breaks the consistency concept

Consistency
- A global state is consistent if it could have been observed by an external observer
- If e → e’ (e happened before e’) and e’ is included in the global state, then e must be included as well
- For a successful Global State, all states must be consistent
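A minimal sketch of how the consistency condition could be checked for the p, q, m example. The encoding is an assumption made for illustration: a recorded state is described by how many events of each process it includes, and each message is a tuple `(sender, send_index, receiver, recv_index)` with 1-based event indices. The function name `is_consistent` is hypothetical.

```python
def is_consistent(cut, messages):
    """Return True if no message is recorded as received but not sent.

    cut: dict mapping process name -> number of its events included.
    messages: list of (sender, send_index, receiver, recv_index).
    """
    for sender, send_index, receiver, recv_index in messages:
        received = recv_index <= cut[receiver]
        sent = send_index <= cut[sender]
        if received and not sent:
            return False
    return True


# p's first event sends m; q's first event receives m.
messages = [("p", 1, "q", 1)]

# p recorded before sending, q recorded after receiving: inconsistent.
print(is_consistent({"p": 0, "q": 1}, messages))
# p recorded after sending, q before receiving: consistent, m is in transit.
print(is_consistent({"p": 1, "q": 0}, messages))
```

The first call corresponds to the problem case from the earlier slides, where q has a record of receiving m but p has no record of sending it.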

Solution
- Need to develop an asynchronous algorithm
- Cannot depend on a clock
- Must ensure consistency in all global states

Assumptions
- Distributed system: finite set of processes and channels; described by a graph
- Processes: set of states, initial state, set of events
- Channels: FIFO, error-free, infinite buffers, arbitrary but finite delay

PART 2 Presented By: Yvonne

Idea of a global state recording algorithm
- Each process records its own state
- The two processes incident on a channel cooperate in recording the channel state

Challenge
- No global clock
- Need a meaningful result
- Superimposed on the underlying computation

Meaningful: The notion of Consistency
- It could have been observed by an external observer
- All feasible states are consistent

An Example
[Figure: timelines of processes p and q with states Sp0 to Sp3 and Sq0 to Sq3 and messages m1, m2, m3]

A Consistent State?
- States Sp1 and Sq1: Yes

A Consistent State?
- States Sp2 and Sq3, message m3: Yes

An inconsistent State
- States Sp1 and Sq3

Conducting algorithm: Using An Example
- Processes: p and q
- Channels: c and c’
- Token: t
[Figure: p and q connected by channels c and c’, with token t]

An Example
- p records its state

An Example
- q, c, and c’ record their states

An Example
- The composite global state!
[Figure: p, q, c, and c’, with token t shown twice]

An Example
- n: number of messages sent along c before p’s state is recorded
- n’: number of messages sent along c before c’s state is recorded

An Example
- Reason for the inconsistency: n < n’ (here n = 0, n’ = 1)

Similar scenario
- c is recorded when the token is at process p.
- p sends the token through channel c, and the states of c’, p, and q are recorded.
- The recorded global state: no tokens in the system.
- The reason for the inconsistency: n > n’

Conclusion from the example
- A consistent global state requires n = n’

Similar Conclusion
- m: number of messages received along c before q’s state is recorded
- m’: number of messages received along c before c’s state is recorded
- For consistency: m = m’

Some other equations
- m’: number of messages received along c before c’s state is recorded
- n’: number of messages sent along c before c’s state is recorded
- m: number of messages received along c before q’s state is recorded
- n: number of messages sent along c before p’s state is recorded
- n = n’
- m = m’
- n’ >= m’
- n >= m

Other Fact
The state of channel c that is recorded must be the sequence of messages sent along the channel before the sender’s state is recorded, excluding the sequence of messages received along the channel before the receiver’s state is recorded.
Two cases:
- n’ = m’: c is empty
- n’ > m’: c must be the (m’+1)st, …, n’th messages sent by p along c
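The two cases can be illustrated with a short sketch. It assumes the messages sent by p along c are available as an ordered list; the function name `channel_state` and the sample message names are hypothetical.

```python
def channel_state(sent, n_prime, m_prime):
    """Messages (m'+1)st .. n'th sent along c, i.e. sent but not yet received.

    sent: messages sent by p along c, in sending order.
    n_prime: number of messages sent along c before c's state is recorded.
    m_prime: number of messages received along c before c's state is recorded.
    """
    if n_prime < m_prime:
        raise ValueError("inconsistent: more messages received than sent")
    # 1-based positions m'+1 .. n' correspond to the 0-based slice [m':n'].
    return sent[m_prime:n_prime]


sent_on_c = ["a", "b", "c", "d"]
print(channel_state(sent_on_c, 3, 1))  # n' > m': messages 2 and 3
print(channel_state(sent_on_c, 2, 2))  # n' = m': channel is empty
```

The `ValueError` branch corresponds to violating n’ >= m’, which no real channel can do.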

Put All Together: A brief sketch of the algorithm
- p sends a marker message along all its outgoing channels after it records its state and before it sends any other messages.
- On receipt of a marker message from channel c:
  if p has not recorded its state: record the state; state(c) = EMPTY
  else: state(c) = messages received on c since p recorded its state, excluding the marker.
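The marker rules can be tried out in a small single-threaded simulation. This is a sketch under stated assumptions, not the presenters' code: the `Process` class, the integer "balance" used as process state, the `MARKER` value, and the hand-scheduled delivery order are all invented for illustration. Channels are modelled as FIFO queues, matching the assumptions slide.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical marker value; ordinary messages are integers


class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance        # application state
        self.inbox = {}               # sender name -> FIFO channel
        self.outbox = {}              # receiver name -> FIFO channel
        self.recorded = False
        self.recorded_state = None
        self.channel_state = {}       # sender name -> recorded messages
        self.open_channels = set()    # incoming channels still being recorded

    def connect(self, other):
        channel = deque()
        self.outbox[other.name] = channel
        other.inbox[self.name] = channel

    def send(self, receiver, amount):
        self.balance -= amount
        self.outbox[receiver].append(amount)

    def record(self):
        # Record own state, then send a marker on every outgoing channel
        # before any other message.
        self.recorded = True
        self.recorded_state = self.balance
        self.channel_state = {sender: [] for sender in self.inbox}
        self.open_channels = set(self.inbox)
        for channel in self.outbox.values():
            channel.append(MARKER)

    def receive(self, sender):
        message = self.inbox[sender].popleft()
        if message == MARKER:
            if not self.recorded:
                self.record()         # state(c) stays EMPTY for this channel
            self.open_channels.discard(sender)
        else:
            self.balance += message
            if sender in self.open_channels:
                self.channel_state[sender].append(message)


p, q = Process("p", 10), Process("q", 5)
p.connect(q)
q.connect(p)

p.send("q", 3)    # 3 is in transit on the channel p -> q
q.record()        # q starts the snapshot and sends a marker to p
q.receive("p")    # q receives 3 after recording: part of channel p -> q
p.receive("q")    # p receives the marker: records, channel q -> p is empty
q.receive("p")    # q receives p's marker: recording of p -> q ends

total = (p.recorded_state + q.recorded_state
         + sum(q.channel_state["p"]) + sum(p.channel_state["q"]))
print(p.recorded_state, q.recorded_state, q.channel_state, total)
```

In this run the recorded total equals the 15 units the system started with, which is the kind of invariant a consistent snapshot is expected to preserve.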

Chandy and Lamport Algorithm
Features:
- Does not promise to give us exactly what is there
- But gives us a consistent state!!

Algorithm in Action
[Figure: timelines of processes p and q with states Sp0 to Sp3 and Sq0 to Sq3 and messages m1, m2, m3]
- q records its state as Sq1 and sends a marker to p
- p records its state as Sp2 and the channel state as empty
- q records the channel state as m3
- Recorded Global State = ((Sp2, Sq1), (∅, m3))
- The computation may not even have passed through the recorded state!

What have we recorded?
- The recorded consistent state can be anything!

Properties of the recorded global state
- Si: global state when the algorithm starts
- Sj: global state when the algorithm finishes
- S*: state recorded by the algorithm
Then:
- S* is reachable from Si
- Sj is reachable from S*

S* is reachable from Si
[Figure: Si and Sj]

Sj is reachable from S*
[Figure: Si and Sj]

Still, what good is it? Stable Properties
A property Y is called a stable property iff, for all states S’ reachable from S, Y(S) -> Y(S’)

Detection of Stable Properties
outcome = false;
while (outcome == false) {
    determine Global State S;
    outcome = Y(S);
}
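The detection loop translates almost directly into runnable form. In this sketch `take_snapshot` stands in for a run of the snapshot algorithm and `holds` for the property Y; both names, the `max_rounds` bound, and the sample "active process count" states are assumptions added for illustration.

```python
def detect_stable_property(take_snapshot, holds, max_rounds=100):
    """Repeat snapshots until the property holds; return the round number.

    Returns None if the property was not observed within max_rounds.
    """
    for round_number in range(1, max_rounds + 1):
        state = take_snapshot()
        if holds(state):
            return round_number
    return None


# Illustrative stand-in for successive recorded global states.
snapshots = iter([{"active": 2}, {"active": 1}, {"active": 0}])
rounds = detect_stable_property(
    take_snapshot=lambda: next(snapshots),
    holds=lambda state: state["active"] == 0,   # e.g. "computation terminated"
)
print(rounds)
```

Because Y is stable and the final state Sj is reachable from the recorded state S*, a positive answer on S* still holds when the algorithm finishes.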

Checkpoint
- S* serves as a checkpoint
- On a failure, restart the computation from S*
- Problem! Not able to restore to Sj
[Figure: Si, S*, and Sj]

Solution: Publishing
- A broadcast medium
- A central recorder process records all the messages received by each process
- Processes record their states at their own time and send them to the recorder

Determining Global State
The recorder can construct the global state from:
- Checkpointed states of all processes
- Plus messages recorded since the last checkpoint

Problems
- Publishing keeps track of all messages received by each process
- Expensive!
- Solution: when the recorder takes a checkpoint of process p at time t, it deletes all messages received by p before t.
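A rough sketch of the recorder idea, including the log-trimming fix. Everything here is assumed for illustration: the `Recorder` class, its method names, the per-process local timestamps, and the `apply` function that replays a message onto a state (which presumes a process reacts deterministically to the messages it receives).

```python
class Recorder:
    """Hypothetical central recorder for the publishing scheme."""

    def __init__(self):
        self.checkpoints = {}   # process -> (time, state)
        self.log = {}           # process -> list of (time, message)

    def record_message(self, process, time, message):
        self.log.setdefault(process, []).append((time, message))

    def checkpoint(self, process, time, state):
        self.checkpoints[process] = (time, state)
        # Messages received before the checkpoint are no longer needed.
        self.log[process] = [
            (t, m) for t, m in self.log.get(process, []) if t >= time
        ]

    def rebuild(self, process, apply):
        """Checkpointed state plus replay of the messages logged since."""
        _, state = self.checkpoints[process]
        for _, message in self.log.get(process, []):
            state = apply(state, message)
        return state


recorder = Recorder()
recorder.record_message("p", 1, 5)
recorder.record_message("p", 2, 7)
recorder.checkpoint("p", 3, 12)        # p's state after receiving 5 and 7
recorder.record_message("p", 4, 1)

current = recorder.rebuild("p", apply=lambda state, message: state + message)
print(current, len(recorder.log["p"]))
```

Trimming keeps the log bounded by the checkpoint interval, but the recorder still sees every message, which is where the cost concern comes from.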

Comparison

Conclusion
- Global state detection is difficult in distributed systems
- The snapshot algorithm may not give an actual state, but it is very helpful in detecting stable properties
- Publishing gives an asynchronous way of determining global states, but it does not scale