1 Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms Author: Ozalp Babaoglu and Keith Marzullo Distributed Systems: 526.

Presentation transcript:

1 Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms
Authors: Ozalp Babaoglu and Keith Marzullo
Distributed Systems: 526 U1580
Professor: Ching-Chi Hsu

2 Introduction
- Many problems in distributed computing can be cast as executing some notification or reaction when the state of the system satisfies a particular condition.
- Global Predicate Evaluation (GPE): establishing the truth of a Boolean expression whose variables may refer to the global system state.
- A global state may not be consistent.
- Asynchronous system:
  - no bounds on the relative speeds of processes or on message delays
  - impossible to maintain synchronized local clocks
  - communication remains the only possible mechanism for synchronization
  - channels are reliable but may deliver messages out of order

3 Outline
Two classes of solutions to the GPE problem:
- A reactive architecture: each process, when executing an event, notifies the monitor $p_0$ by sending it a message describing the event.
- A snapshot architecture: the monitor $p_0$ sends each process a 'state enquiry' message.

4 Definitions (1)
- distributed system: a collection of sequential processes $p_1, p_2, \ldots, p_n$ connected by unidirectional communication channels
- events: the activity of each sequential process; an event is either internal or a communication with another process, send(m) or receive(m)
- local history of process $p_i$: $h_i = e_i^1 e_i^2 \ldots$
- global history: $H = h_1 \cup h_2 \cup \cdots \cup h_n$
- cause-effect relation $\to$:
  - If $e_i^k, e_i^l \in h_i$ and $k < l$, then $e_i^k \to e_i^l$
  - If $e_i = send(m)$ and $e_j = receive(m)$, then $e_i \to e_j$
  - If $e \to e'$ and $e' \to e''$, then $e \to e''$
- Concurrent, $e \parallel e'$: neither $e \to e'$ nor $e' \to e$
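The three rules of the cause-effect relation can be turned into a small computation. The following is a minimal sketch, assuming events are given as hypothetical string names, local histories as lists, and messages as (send event, receive event) pairs; none of these names come from the slides.

```python
from itertools import product


def happened_before(histories, messages):
    """Return the set of pairs (e, f) such that e -> f.

    histories: dict mapping a process name to its events in local order.
    messages:  list of (send_event, receive_event) pairs.
    """
    events = [e for h in histories.values() for e in h]
    order = set()
    # Rule 1: events of the same process are ordered by their position.
    for h in histories.values():
        for k in range(len(h)):
            for l in range(k + 1, len(h)):
                order.add((h[k], h[l]))
    # Rule 2: a send precedes the matching receive.
    order.update(messages)
    # Rule 3: transitive closure (Warshall style: the intermediate
    # event is the outermost loop variable of the product).
    for mid, a, b in product(events, repeat=3):
        if (a, mid) in order and (mid, b) in order:
            order.add((a, b))
    return order


def concurrent(order, e, f):
    """e || f: neither e -> f nor f -> e."""
    return e != f and (e, f) not in order and (f, e) not in order
```

The closure is cubic in the number of events, which is acceptable for illustrating the definition but not intended for monitoring real systems.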

5 Definitions (2)
- distributed computation: a partially ordered set defined by the pair $(H, \to)$
- space-time diagram: a graphical representation of a distributed computation
[Figure: space-time diagram of processes $p_1$, $p_2$, $p_3$ with events $e_1^1 \ldots e_1^6$, $e_2^1 \ldots e_2^3$, $e_3^1 \ldots e_3^6$]

6 Definitions (3)
- The local state of $p_i$ immediately after executing event $e_i^k$ is denoted by $\sigma_i^k$.
- global state: $\Sigma = (\sigma_1, \ldots, \sigma_n)$
- A cut $C = (c_1, \ldots, c_n)$ is a subset of the global history $H$ that contains an initial prefix of each of the local histories, i.e. $C = h_1^{c_1} \cup \cdots \cup h_n^{c_n}$.
- A run $R$ is a total ordering of all events in $H$ that is consistent with each local history.
  - Example: pp6
- Note that a single distributed computation may have many runs.

7 Example: Inconsistent cut and phantom deadlock
[Figure: space-time diagram of processes $p_1$, $p_2$, $p_3$ showing two cuts, $C$ and $C'$, and the req/resp messages exchanged between the processes]

8 Consistency
- A consistent cut $C$ is one such that, for all events $e$ and $e'$: $(e \in C) \land (e' \to e) \Rightarrow e' \in C$
- A consistent global state is one corresponding to a consistent cut.
- A consistent run $R$ is one such that, for all events $e$ and $e'$: $(e \to e') \Rightarrow e$ appears before $e'$ in $R$
  - Example: pp6
- If the run is consistent, then all the global states in the sequence will be consistent as well.
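Because a cut already contains a prefix of every local history, checking the consistency condition reduces to checking messages: no receive may be inside the cut while its send is outside. The sketch below assumes a hypothetical encoding in which a cut maps each process name to its prefix length $c_i$, and each message is given by the 1-based positions of its send and receive events.

```python
def is_consistent_cut(cut, messages):
    """Check that a cut is closed under the cause-effect relation.

    cut:      dict mapping a process name to the number of its events
              included in the cut (the prefix length c_i).
    messages: list of ((sender, k), (receiver, l)) pairs, meaning the k-th
              event of sender is send(m) and the l-th event of receiver is
              receive(m); positions are 1-based.
    """
    for (sender, k), (receiver, l) in messages:
        received = l <= cut[receiver]
        sent = k <= cut[sender]
        if received and not sent:
            # The cut contains an effect without its cause.
            return False
    return True
```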

9 Observing Distributed Computations
- A monitor $p_0$ assumes a passive role: it does not send any messages of its own.
- The application processes notify $p_0$ by sending it a message whenever they execute an event.
- The monitor $p_0$ constructs an observation of the underlying distributed computation from the events in the order in which their notifications arrive.
- Due to the variability of message delays, an observation can correspond to a consistent run, an inconsistent run, or no run at all:
  - $O_1 = e_2^1 e_1^1 e_3^1 e_3^2 e_3^4 e_1^2 e_2^2 e_3^3 e_1^3 e_1^4 \ldots$ => not a run ($e_3^4$ appears before $e_3^3$)
  - $O_2 = e_1^1 e_3^1 e_2^1 e_3^2 e_1^2 e_3^3 e_3^4 e_1^3 e_2^2 e_3^5 \ldots$ => inconsistent run
  - $O_3 = e_3^1 e_2^1 e_1^1 e_1^2 e_3^2 e_3^3 e_1^3 e_3^4 e_1^4 e_2^2 \ldots$ => consistent run
- Message order can be restored by defining a delivery rule that decides when received messages are presented to the application process.

10 First-In-First-Out (FIFO) delivery
- For all messages $m$ and $m'$ from $p_i$ to $p_j$: $send_i(m) \to send_i(m') \Rightarrow deliver_j(m) \to deliver_j(m')$
- FIFO delivery can be implemented by adding sequence numbers to messages.
- While FIFO delivery is sufficient to guarantee that observations correspond to runs, it is not sufficient to guarantee consistent observations.
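The sequence-number idea can be sketched as a receiver that buffers out-of-order messages until the gap is filled. This is an illustrative sketch only; the class name, the per-sender numbering starting at 0, and the payload strings are assumptions, not part of the slides.

```python
class FifoReceiver:
    """Delivers messages from each sender in sequence-number order."""

    def __init__(self):
        self.next_seq = {}  # sender -> next sequence number to deliver
        self.pending = {}   # sender -> {sequence number: payload}

    def receive(self, sender, seq, payload):
        """Buffer a message and return the payloads that become deliverable."""
        self.pending.setdefault(sender, {})[seq] = payload
        expected = self.next_seq.get(sender, 0)
        delivered = []
        # Deliver the longest run of consecutive sequence numbers.
        while expected in self.pending[sender]:
            delivered.append(self.pending[sender].pop(expected))
            expected += 1
        self.next_seq[sender] = expected
        return delivered
```

A message that arrives early is held back, and a later arrival that fills the gap releases everything buffered behind it.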

11 Observing Distributed Computations with Real-Time Clocks
Environment:
- message delays are bounded by $\delta$
- channels are FIFO
- a global real-time clock exists
- each message includes $RC(e)$, the value of the global real-time clock when event $e$ occurs, as its timestamp
DR1:
- At time $t$, deliver all received messages with timestamps up to $t - \delta$ in increasing timestamp order.
The observation is consistent iff the following is satisfied:
- Clock condition: $e \to e' \Rightarrow RC(e) < RC(e')$

12 Observing Distributed Computations with Logical Clocks
Environment:
- channels are FIFO
- communication is asynchronous
- logical clocks are implemented
- each message includes $LC(e)$, the value of the logical clock when event $e$ occurs, as its timestamp
DR2:
- Deliver all messages that are stable at $p_0$ in increasing timestamp order.
Note:
- A message $m$ is stable at $p$ if no future messages with timestamps smaller than $TS(m)$ can be received by $p$.
- Given FIFO channels, $m$ is stable at $p_0$ when $p_0$ has received at least one message with a timestamp greater than $TS(m)$ from every other process.
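The stability test of DR2 can be sketched as a monitor that remembers the largest timestamp seen from each process and releases a buffered message once every other process has sent something with a larger timestamp. The class and method names are hypothetical, and the sketch assumes FIFO channels and logical-clock timestamps as stated on the slide.

```python
class StableDeliveryMonitor:
    """Sketch of delivery rule DR2 at the monitor p0."""

    def __init__(self, processes):
        self.latest = {p: 0 for p in processes}  # largest timestamp seen per process
        self.buffer = []                         # (timestamp, sender, event), undelivered

    def _is_stable(self, msg):
        ts, sender, _ = msg
        # Stable: every other process has already sent a larger timestamp.
        return all(self.latest[q] > ts for q in self.latest if q != sender)

    def receive(self, sender, ts, event):
        """Record a notification and return the events delivered as a result."""
        self.latest[sender] = max(self.latest[sender], ts)
        self.buffer.append((ts, sender, event))
        stable = sorted(m for m in self.buffer if self._is_stable(m))
        for m in stable:
            self.buffer.remove(m)
        return [event for _, _, event in stable]
```

A process that stops sending blocks delivery of everyone else's messages, which is the usual liveness cost of a stability-based rule.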

13 Logical Clocks
[Figure: space-time diagram of processes $p_1$, $p_2$, $p_3$ annotated with logical clock values]
Logical clock:
- Each process $p_i$ maintains a local variable $LC_i$.
- When a new event $e_i$ occurs, $p_i$ updates $LC_i$ to:
  - $LC_i + 1$ if $e_i$ is an internal or send event
  - $\max\{LC_i, TS(m)\} + 1$ if $e_i = receive(m)$
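The two update rules translate almost directly into code. A minimal sketch follows; the class and method names are illustrative, and the clock is assumed to start at 0.

```python
class LamportClock:
    """Scalar logical clock LC_i of a single process."""

    def __init__(self):
        self.value = 0

    def tick(self):
        """Internal or send event: LC_i := LC_i + 1."""
        self.value += 1
        return self.value

    def receive(self, ts):
        """Receive event: LC_i := max(LC_i, TS(m)) + 1."""
        self.value = max(self.value, ts) + 1
        return self.value
```

The value returned by `tick` on a send event is the timestamp TS(m) carried by the message.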

14 Observing Distributed Computations with Causal Delivery
- Causal Delivery (CD): $send_i(m) \to send_j(m') \Rightarrow deliver_k(m) \to deliver_k(m')$
- If $p_0$ uses a delivery rule satisfying CD, then all of its observations will be consistent.

15 Efficient Delivery
- For implementing causal delivery, what is really needed is an effective procedure for deciding the following: given events $e$, $e'$ that are causally related and their clock values, does there exist some other event $e''$ such that $e \to e'' \to e'$?
- Given $RC(e) < RC(e')$ (or $LC(e) < LC(e')$), it may be that $e \to e'$ or $e \parallel e'$; that is, we can only conclude $\lnot(e' \to e)$.
- The above observations suggest a timing mechanism $TC$ whereby causal precedence relations between events can be deduced from their timestamps.
- Strong Clock Condition: $e \to e' \equiv TC(e) < TC(e')$

16 Causal History (1)
[Figure: space-time diagram of processes $p_1$, $p_2$, $p_3$ highlighting the causal history of event $e_1^4$]
- Causal history of event $e$: $\theta(e) = \{ e' \in H \mid e' \to e \} \cup \{ e \}$
- That is, $\theta(e)$ is the smallest consistent cut that includes $e$.

17 Causal Histories (2)
Maintaining causal histories:
- Each process $p_i$ initializes a local variable $\theta$ to $\emptyset$.
- Each message $m$ contains a timestamp $TS(m)$, which is the causal history of its send event.
- Scheme:
  - If $e_i$ is an internal or send event, then $\theta(e_i) = \{e_i\} \cup$ the causal history of the previous local event.
  - If $e_i$ is the receipt of message $m$ by process $p_i$ from $p_j$, then $\theta(e_i) = \{e_i\} \cup$ the causal history of the previous local event of $p_i$ $\cup$ the causal history of the corresponding send event at $p_j$.
- The strong clock condition is satisfied if clock comparison is interpreted as set inclusion: $e \to e' \equiv \theta(e) \subset \theta(e')$, or equivalently $e \to e' \equiv e \in \theta(e')$ if $e \neq e'$.
- Problem: the causal histories grow rapidly.
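The scheme can be sketched with immutable sets as timestamps, so that the strong clock condition becomes a proper-subset test. The class name and the encoding of an event as a (process name, index) pair are assumptions made for illustration.

```python
class CausalHistoryProcess:
    """Maintains the causal history of the latest local event."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.theta = frozenset()  # causal history of the previous local event

    def _execute(self, received=frozenset()):
        self.count += 1
        event = (self.name, self.count)
        # New history = {event} + previous local history + sender's history.
        self.theta = self.theta | received | {event}
        return event, self.theta

    def local_or_send(self):
        """Internal or send event; the returned set is also TS(m) for a send."""
        return self._execute()

    def receive(self, timestamp):
        """Receive event for a message carrying the sender's causal history."""
        return self._execute(timestamp)


def happened_before(theta_e, theta_f):
    """Strong clock condition: e -> f iff theta(e) is a proper subset of theta(f)."""
    return theta_e < theta_f
```

Each timestamp holds one entry per preceding event, which makes the growth problem noted on the slide visible in the size of the sets.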

18 Vector Clocks
- The causal history of an event can be represented as a fixed-dimensional vector $VC(e)[1..n]$ rather than a set, where $VC(e)[i] = k$ iff $\theta_i(e) = h_i^k$ for $i = 1, 2, \ldots, n$, with $\theta_i(e) = \theta(e) \cap h_i$.
[Figure: space-time diagram annotated with vector timestamps. $p_1$: (1,0,0) (2,1,0) (3,1,3) (4,1,3) (5,1,3) (6,1,3); $p_2$: (0,1,0) (1,2,4) (4,3,4); $p_3$: (0,0,1) (1,0,2) (1,0,3) (1,0,4) (1,0,5) (1,0,6)]

19 Maintaining Vector Clocks
- Each process $p_i$ maintains a local vector $VC_i[1..n]$.
- Each message $m$ contains a timestamp $TS(m)$, which is the vector clock value $VC(e)$ of its send event $e$.
- Scheme:
  - If $e_i$ is an internal or send event: $VC_i[i] = VC_i[i] + 1$, and $VC(e_i) = VC_i$
  - If $e_i = receive(m)$: $VC_i = \max\{VC_i, TS(m)\}$ (component-wise), then $VC_i[i] = VC_i[i] + 1$, and $VC(e_i) = VC_i$
- For $j \neq i$, $VC(e_i)[j]$ = number of events of $p_j$ that causally precede event $e_i$ of $p_i$.
- $V < V' \equiv (V \neq V') \land (\forall k : 1 \le k \le n : V[k] \le V'[k])$
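The maintenance scheme can be sketched as follows. Process indices are 0-based here (the slides use 1-based indices), and the class and method names are illustrative assumptions.

```python
class VectorClockProcess:
    """Maintains the vector clock VC_i of process p_i."""

    def __init__(self, index, n):
        self.i = index      # 0-based position of this process
        self.vc = [0] * n   # VC_i, initially the zero vector

    def local_or_send(self):
        """Internal or send event: increment the own component."""
        self.vc[self.i] += 1
        return tuple(self.vc)  # VC(e_i); also TS(m) for a send

    def receive(self, ts):
        """Receive event: component-wise maximum, then increment own component."""
        self.vc = [max(a, b) for a, b in zip(self.vc, ts)]
        self.vc[self.i] += 1
        return tuple(self.vc)
```

For example, with three processes, a process at index 2 that has executed one local event and then receives a message stamped (1,0,0) ends up with the clock (1,0,2).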

20 Properties of Vector Clocks
- Strong Clock Condition: $e \to e' \equiv VC(e) < VC(e')$
- Simple Strong Clock Condition (for $i \neq j$): $e_i \to e_j \equiv VC(e_i)[i] \le VC(e_j)[i]$
- Concurrent: $e_i \parallel e_j \equiv (VC(e_i)[i] > VC(e_j)[i]) \land (VC(e_j)[j] > VC(e_i)[j])$
- Pairwise Inconsistent (for $i \neq j$): $(VC(e_i)[i] < VC(e_j)[i]) \lor (VC(e_j)[j] < VC(e_i)[j])$
- Consistent Cut: $(c_1, c_2, \ldots, c_n)$ is consistent iff $\forall i, j : 1 \le i, j \le n : VC(e_i^{c_i})[i] \ge VC(e_j^{c_j})[i]$
- Counting: the number of events that causally precede $e$ is given by $\#(e) = \left(\sum_{j=1}^{n} VC(e)[j]\right) - 1$
- Weak Gap-Detection: given $e_i$ and $e_j$, if $VC(e_i)[k] < VC(e_j)[k]$ for some $k \neq j$, then there exists an event $e_k$ such that $\lnot(e_k \to e_i) \land (e_k \to e_j)$
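Several of these properties are simple predicates over timestamps. The sketch below assumes timestamps are tuples of equal length and that a cut is described by its frontier, a hypothetical list in which entry i is the vector clock of the last event of process i in the cut (the zero vector if the cut contains no event of that process).

```python
def vc_less(v, w):
    """V < V': V differs from V' and is component-wise at most V'."""
    return v != w and all(a <= b for a, b in zip(v, w))


def concurrent(v, w):
    """Two distinct events are concurrent if neither timestamp is less."""
    return v != w and not vc_less(v, w) and not vc_less(w, v)


def is_consistent_frontier(frontier):
    """Consistent cut test: VC(e_i)[i] >= VC(e_j)[i] for all i, j."""
    n = len(frontier)
    return all(frontier[i][i] >= frontier[j][i]
               for i in range(n) for j in range(n))


def events_before(v):
    """Counting property: #(e) = sum of the components of VC(e), minus 1."""
    return sum(v) - 1
```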

21 Implementing Causal Delivery with Vector Clocks
Babaoglu & Marzullo:
- The monitor $p_0$ maintains an array $D[1..n]$, where $D[i]$ contains $TS(m_i)[i]$ and $m_i$ is the last message delivered from process $p_i$.
DR3:
- Deliver message $m$ from process $p_j$ when both of the following are satisfied:
  - $D[j] = TS(m)[j] - 1$ => guarantees FIFO order
  - $D[k] \ge TS(m)[k], \forall k \neq j$ => guarantees the causal relation
DR4:
- The monitor $p_0$ maintains a counter $D$.
- Deliver message $m$ of event $e_i$ as soon as $D = \#(e_i) - 1$.
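Rule DR3 can be sketched as a monitor that buffers notifications and re-examines the buffer after every delivery, since one delivery may unblock others. Indices are 0-based, and the class and method names are assumptions made for illustration.

```python
class CausalMonitor:
    """Sketch of delivery rule DR3 at the monitor p0."""

    def __init__(self, n):
        self.D = [0] * n   # D[i]: own component of the last message delivered from p_i
        self.buffer = []   # (sender index, timestamp, event), received but undelivered

    def _deliverable(self, j, ts):
        fifo = self.D[j] == ts[j] - 1
        causal = all(self.D[k] >= ts[k] for k in range(len(ts)) if k != j)
        return fifo and causal

    def receive(self, sender, ts, event):
        """Buffer a notification and return the events delivered as a result."""
        self.buffer.append((sender, ts, event))
        delivered = []
        progress = True
        while progress:
            progress = False
            for msg in list(self.buffer):
                j, stamp, ev = msg
                if self._deliverable(j, stamp):
                    self.D[j] = stamp[j]
                    self.buffer.remove(msg)
                    delivered.append(ev)
                    progress = True
        return delivered
```

In the test scenario, the notification stamped (1,1) arrives before the one stamped (1,0) and is held back until its causal predecessor has been delivered.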

22 Causal Delivery with Vector Clocks: Examples
[Figure: space-time diagram of the monitor $p_0$ and processes $p_1$, $p_2$, annotated with vector timestamps such as (1,0), (1,1), (1,2), (2,2), (3,2)]

23 Distributed Snapshots
- In this strategy, $p_0$ requests the states of the other processes and then combines them into a global state.
Definitions:
- channel state: for each channel from $p_i$ to $p_j$, $\chi_{i,j}$ is the set difference between the messages $p_i$ has sent on the channel (recorded in $\sigma_i$) and the messages $p_j$ has received from it (recorded in $\sigma_j$)
- incoming channels of process $p_i$: $IN_i$
- outgoing channels of process $p_i$: $OUT_i$
Snapshot protocols:
- Chandy and Lamport [1985]
- Morgan [1985]

24 Snapshot Protocol 1
Assumptions:
- a global real-time clock $RC$ exists
- each message carries a timestamp
- message delays are bounded
Global clock algorithm:
- $p_0$ sends [take snapshot at $t_{ss}$] to all processes.
- When the clock $RC$ reads $t_{ss}$, each process $p_i$ does the following:
  - records its local state $\sigma_i$,
  - sends an empty message over all its outgoing channels,
  - and starts recording all messages received over each of its incoming channels.
- The first time $p_i$ receives a message from $p_j$ with a timestamp greater than or equal to $t_{ss}$, $p_i$ stops recording messages for that channel.

25 Snapshot Protocol 2
Assumptions:
- Bounded message delays
- Channels are FIFO
Chandy & Lamport:
- $p_0$ sends [take snapshot] to itself.
- Each process $p_i$ receiving [take snapshot] does the following:
  - If it is the first time:
    - records its local state $\sigma_i$
    - sends [take snapshot] over each of its outgoing channels
    - starts recording messages from its other incoming channels
  - If it is not the first time:
    - stops recording messages from that incoming channel
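The marker rules can be tried out in a small single-threaded simulation. This is a sketch under several assumptions that are not in the slides: channels are modelled as FIFO queues, the application is a hypothetical money-transfer system whose local state is an integer balance, the initiator starts by recording directly instead of sending a marker to itself, and all names are illustrative.

```python
from collections import deque

MARKER = "take snapshot"


class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = state          # application state (an integer balance)
        self.recorded_state = None  # local state captured by the snapshot
        self.recording = {}         # incoming channel -> messages recorded so far
        self.channel_state = {}     # incoming channel -> final recorded contents


class Network:
    def __init__(self, initial):
        names = list(initial)
        self.procs = {n: Process(n, initial[n]) for n in names}
        self.channels = {(a, b): deque() for a in names for b in names if a != b}

    def send(self, src, dst, amount):
        self.procs[src].state -= amount
        self.channels[(src, dst)].append(amount)

    def start_snapshot(self, name):
        self._record(self.procs[name], first_channel=None)

    def _record(self, p, first_channel):
        p.recorded_state = p.state
        for (src, dst), queue in self.channels.items():
            if src == p.name:
                queue.append(MARKER)          # marker on every outgoing channel
            elif dst == p.name and (src, dst) != first_channel:
                p.recording[(src, dst)] = []  # record the other incoming channels
        if first_channel is not None:
            p.channel_state[first_channel] = []  # marker's channel is recorded as empty

    def deliver(self, src, dst):
        msg = self.channels[(src, dst)].popleft()
        p = self.procs[dst]
        if msg == MARKER:
            if p.recorded_state is None:
                self._record(p, first_channel=(src, dst))
            else:
                p.channel_state[(src, dst)] = p.recording.pop((src, dst))
        else:
            p.state += msg
            if (src, dst) in p.recording:
                p.recording[(src, dst)].append(msg)
```

In a run where the total balance is 150, the recorded local states plus the recorded channel contents should also add up to 150, even though no single instant of the run may have looked exactly like the snapshot.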

26 Chandy & Lamport (1985)
[Figure: space-time diagram of $p_0$, $p_1$, $p_2$ with events $e_1^1 \ldots e_1^6$ and $e_2^1 \ldots e_2^5$, and the recording events $e_1^*$ and $e_2^*$]
- Real computation: $R = e_2^1 e_1^1 e_1^2 e_1^3 e_2^2 e_1^4 e_2^3 e_2^4 e_1^5 e_2^5 e_1^6$
- In terms of global states: $\Sigma^{00} \Sigma^{01} \Sigma^{11} \Sigma^{21} \Sigma^{31} \Sigma^{32} \Sigma^{42} \Sigma^{43} \Sigma^{44} \Sigma^{54} \Sigma^{55} \Sigma^{65}$

27 Properties of Snapshots
Definitions:
- $\Sigma^a$: the global state in which the snapshot protocol is initiated
- $\Sigma^f$: the global state in which the protocol terminates
- $\Sigma^S$: the global state constructed by the protocol
- $e_i^*$ denotes the event in which $p_i$ receives [take snapshot] for the first time, causing $p_i$ to record its state; let $t_i$ be the time at which $e_i^*$ occurs.
- $e_i$ is a pre-recording event if $e_i \to e_i^*$; otherwise it is a post-recording event.
Properties:
- There exists a run $R'$ such that $\Sigma^a \rightsquigarrow \Sigma^S \rightsquigarrow \Sigma^f$ in $R'$, where $\rightsquigarrow$ denotes reachability.
- That is to say, $\Sigma^S$ could have happened.

28 Argumentation (1)
Chandy & Lamport (1985):
- Consider any adjacent (post-recording, pre-recording) pair of events $(e, e')$ in the run; then $\lnot(e \to e')$.
- Swapping all such pairs results in another consistent run $R'$:
  - swap($e_1^3$, $e_2^2$): $r_1 = e_2^1 e_1^1 e_1^2 e_2^2 e_1^3 e_1^4 e_2^3 e_2^4 e_1^5 e_2^5 e_1^6$
  - swap($e_1^4$, $e_2^3$): $r_2 = e_2^1 e_1^1 e_1^2 e_2^2 e_1^3 e_2^3 e_1^4 e_2^4 e_1^5 e_2^5 e_1^6$
  - swap($e_1^3$, $e_2^3$): $R' = e_2^1 e_1^1 e_1^2 e_2^2 e_2^3 e_1^3 e_1^4 e_2^4 e_1^5 e_2^5 e_1^6$
- The global state after executing the last pre-recording event ($e_2^3$) in $R'$ is $\Sigma^S$ ($= \Sigma^{23}$), the constructed global state.
- Had the computation followed this run, $\Sigma^S$ would have occurred.

29 Argumentation (2)
Lai & Yang (1987):
- Let $GSN(t_i : p_i \in P)$ be a snapshot taken between $\tau_1$ and $\tau_2$ during the computation $R$.
- Let $\delta = \tau_2 - \tau_1$ and construct $R'$ as follows: $R'$ is the same as $R$, except that every post-recording event in $R$ is postponed by $\delta$ units of time, that is:
  - $R'(t) = R(t)$ if $R(t)$ is an event at $p_i$ and $t \le t_i$
  - $R'(t) = R(t - \delta)$ if $R(t - \delta)$ is an event at $p_i$ and $t - \delta > t_i$
  - $R'(t)$ contains no event otherwise
- Example

30 Properties of Global Predicates
Stable predicates:
- Many system properties one wishes to detect have the characteristic that once they become true, they remain true.
- If $\Phi$ is a stable predicate, then since $\Sigma^a \rightsquigarrow \Sigma^S \rightsquigarrow \Sigma^f$:
  - ($\Phi$ is true in $\Sigma^S$) => ($\Phi$ is true in $\Sigma^f$)
  - ($\Phi$ is false in $\Sigma^S$) => ($\Phi$ is false in $\Sigma^a$)
Nonstable predicates:
- The condition encoded by the predicate may not persist long enough for it to be true when the predicate is evaluated.
- If a predicate $\Phi$ is found to be true by the monitor, we do not know whether $\Phi$ ever held during the actual run.

31 Nonstable Predicates
Two problems:
- The condition encoded by the predicate may not persist long enough for it to be true when the predicate is evaluated.
- If a predicate $\Phi$ is found to be true by the monitor, we do not know whether $\Phi$ ever held during the actual run.
The predicate may have held even if it is not detected, and even if it is detected it may never have held.
Extended nonstable global predicates apply to the entire distributed computation:
- Possibly($\Phi$): there exists a consistent observation of the computation in which $\Phi$ holds in some global state.
- Definitely($\Phi$): in every consistent observation of the computation, $\Phi$ holds in some global state.

32 Detecting Possibly and Definitely
- $\Sigma_{min}(\sigma_i^k)$: the global state with the smallest level in the lattice containing $\sigma_i^k$
- $\Sigma_{max}(\sigma_i^k)$: the global state with the largest level in the lattice containing $\sigma_i^k$
- Examples: $\Sigma_{min}(\sigma_1^3) = \Sigma^{31}$, $\Sigma_{max}(\sigma_1^3) = \Sigma^{33}$
- $\Sigma_{min}(\sigma_i^k) = (\sigma_1^{c_1}, \sigma_2^{c_2}, \ldots, \sigma_n^{c_n})$ such that $\forall j : VC(\sigma_j^{c_j})[j] = VC(\sigma_i^k)[j]$
- $\Sigma_{max}(\sigma_i^k) = (\sigma_1^{c_1}, \sigma_2^{c_2}, \ldots, \sigma_n^{c_n})$ such that $\forall j : VC(\sigma_j^{c_j})[i] \le VC(\sigma_i^k)[i]$, where each $c_j$ is the largest index satisfying this condition
- The minimum level containing $\sigma_i^k$ is the sum of the components of the vector timestamp $VC(\sigma_i^k)$.
- An algorithm for detecting Definitely($\Phi$) runs in $O(k^n)$, where $k$ is the maximum number of events a monitored process has executed.
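Both modalities can be sketched as searches over the lattice of consistent global states. The sketch below is not the algorithm from the slides; it is a straightforward graph search that assumes a global state is a tuple of per-process event counts, that `clocks[i][c-1]` is the vector timestamp of the c-th event of process i (0-based process indices), and that the predicate is a hypothetical function of such a tuple.

```python
def consistent(cut, clocks):
    """Consistent cut test on event counts using vector timestamps."""
    n = len(cut)

    def vc(j):
        return clocks[j][cut[j] - 1] if cut[j] > 0 else (0,) * n

    return all(cut[i] >= vc(j)[i] for i in range(n) for j in range(n))


def successors(cut, clocks):
    """Consistent global states reachable by executing one more event."""
    for i in range(len(cut)):
        if cut[i] < len(clocks[i]):
            nxt = cut[:i] + (cut[i] + 1,) + cut[i + 1:]
            if consistent(nxt, clocks):
                yield nxt


def possibly(phi, clocks):
    """True if phi holds in at least one consistent global state."""
    start = (0,) * len(clocks)
    seen, frontier = {start}, [start]
    while frontier:
        cut = frontier.pop()
        if phi(cut):
            return True
        for nxt in successors(cut, clocks):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False


def definitely(phi, clocks):
    """True if every path from the initial to the final state meets phi."""
    start = (0,) * len(clocks)
    final = tuple(len(c) for c in clocks)
    if phi(start):
        return True
    seen, frontier = {start}, [start]
    while frontier:
        cut = frontier.pop()
        if cut == final:
            return False  # reached the end through states where phi is false
        for nxt in successors(cut, clocks):
            if nxt not in seen and not phi(nxt):
                seen.add(nxt)
                frontier.append(nxt)
    return True
```

The search may visit every consistent global state, so its cost grows with the size of the lattice, in line with the exponential bound quoted on the slide.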

33 Example