1 Distributed Algorithms
Luc J. B. Onana Alima; Seif Haridi
2 Introduction
What is a distributed system? A set of autonomous processors interconnected in some way.
What is a distributed algorithm (protocol)? Concurrently executing components, each on a separate processor.
Distributed algorithms can be extremely complex: many components run concurrently; locality; failure; non-determinism; independent inputs; no global clock; uncertain message delivery; uncertain message ordering. Can we understand everything about their executions?
3 Ch9: Models of Distributed Computation
Preliminaries: Notations; Assumptions
Causality: Lamport Timestamps; Vector Timestamps; Causal Communication
Distributed Snapshots
Modeling a Distributed Computation: Execution DAG; Predicates
Failures in Distributed Systems
4 Ch9 Models: Preliminaries
Assumptions
A1: No shared variables among processors.
A2: On each processor there are a number of executing threads.
A3: Communication by sending and receiving messages; send(dest, action, param) is non-blocking.
A4: Event-driven algorithms: reaction upon receipt of a declared event. Events: sending or receiving a message; etc. An event is buffered until it is handled; a dedicated thread is available to handle events at any time.
5 Ch9 Models: Preliminaries
Notations
Waiting for events:
  wait for A1, A2, …, An
    on Ai(source; param) do
      code to handle Ai, 1 <= i <= n
  end
Waiting for an event from p up to T seconds:
  wait until p sends (event; param), timeout = T
    on timeout do
      timeout action
    end
    on event(param) from p do
      successful response actions
  end
6 Ch9 Models: Preliminaries
Notations
Waiting for events:
  wait for A1, A2, …, An
    on Ai(source; param) do
      code to handle Ai, 1 <= i <= n
  end
Waiting for an event from p up to T seconds:
  wait for p
    on timeout do
      time-out action
    end
    on Ai(param) from p do
      action
    end
  end
7 Ch9 Models: Preliminaries
Notations
Waiting for responses from a set of processors up to T seconds:
  wait up to T seconds for (event; param) messages
Event: to be considered if necessary.
8 Ch9 Models: Preliminaries
Concurrency control within an instance of a protocol
Definition: Let P be a protocol. If the instance of P at processor q consists of threads T1, T2, T3, …, Tn, we say that T1, T2, …, Tn are in the same family.
Threads of the same family access the same set of variables, hence the need for concurrency control.
Assumption used:
A5: Once a thread gains control of the processor, it does not release control to a thread of the same family until it is blocked.
9 Ch9 Models: Causality
There is no global time in a distributed system: processors cannot make simultaneous observations of global states.
Causality serves as a supporting property. Provided traveling backward in time is excluded, distributed systems are causal: the cause precedes the effect; the sending of a message precedes the receipt of that message.
10 Ch9 Models: Causality
System composition
We assume a distributed system composed of a set of processors P = {p1, …, pM}. Each processor reacts upon receipt of an event.
Two classes of events:
External/communication events: sending a message; receiving a message.
Internal events: local input/output; raising of a signal; deciding on a commit point (database); etc.
11 Ch9 Models: Causality
Notations:
E: the set of all possible events in our system.
Ep: the set of all events in E that occur at processor p.
We are interested in defining orders between events. Why? In many cases, orders are necessary for coordinating distributed activities (e.g. many concurrency control algorithms use an ordering of events; we'll see this later).
12 Ch9 Models: Causality
Orders between events
1) On the same processor p. Order: <_p. e <_p e' means "e occurs before e' on p". If e and e' occur on the same processor p then either e <_p e' or e' <_p e, i.e. on the same processor events are totally ordered.
13 Ch9 Models: Causality
Orders between events
2) Of sending message m and receiving message m. Order: <_m. If e is the sending of message m and e' the receipt of message m, then e <_m e'.
14 Ch9 Models: Causality
Orders between events
3) In general (i.e. all events in E are considered). Order: <_H, "happens-before" or "can causally affect".
Definition: <_H is the union of <_p and <_m (for all p, m), closed under transitivity (i.e. if e1 <_H e2 and e2 <_H e3 then e1 <_H e3).
Definition: a causal path from e to e' is a sequence of events e1, e2, …, en such that 1) e = e1 and e' = en; 2) for each i in {1,..,n-1}, ei <_H ei+1.
Thus, e <_H e' if and only if there is a causal path from e to e'.
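As a sanity check, the causal-path definition can be sketched as plain graph reachability over the direct <_p and <_m edges; the event names and the edge dictionary below are hypothetical, not part of the lecture's notation.

```python
# Sketch: happens-before as reachability in the DAG of direct <_p / <_m edges.
def happens_before(edges, e, e_prime):
    """True iff there is a causal path from e to e_prime."""
    stack, seen = list(edges.get(e, ())), set()
    while stack:
        cur = stack.pop()
        if cur == e_prime:
            return True
        if cur not in seen:
            seen.add(cur)
            stack.extend(edges.get(cur, ()))
    return False

def concurrent(edges, e, e_prime):
    # e || e': neither happens before the other
    return (not happens_before(edges, e, e_prime)
            and not happens_before(edges, e_prime, e))

# e1 -> e3 -> e4 is one causal chain; e2 -> e5 is another, unrelated one
edges = {"e1": ["e3"], "e3": ["e4"], "e2": ["e5"]}
```

With these edges, e1 <_H e4 holds, while e1 and e2 are concurrent.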
15 Ch9 Models: Causality
Happens-before is a partial order
It is possible to have two events e and e' (e ≠ e') such that neither e <_H e' nor e' <_H e.
If two events e and e' are such that neither e <_H e' nor e' <_H e, then e and e' are concurrent and we write e || e'.
The possibility of concurrent events implies that the happens-before relation (<_H) is a partial order.
16 Ch9 Models: Causality
Space-time diagram: happens-before DAG
[Diagram: events e1–e8 on processors p1, p2, p3; dependencies must point forward in time.]
No causal path from e1 to e2 or from e2 to e1: e1 and e2 are concurrent.
No causal path from e1 to e6 or from e6 to e1: e1 and e6 are concurrent.
No causal path from e2 to e6 or from e6 to e2: e2 and e6 are concurrent.
17 Ch9 Models: Causality
Space-time diagram: happens-before DAG
[Diagram: events e1–e8 on processors p1, p2, p3.]
Exercise: compare e1 and e7; e1 and e8; e5 and e2; e4 and e6.
18 Ch9 Models: Causality
Global logical clock (timestamps)
Although there is no global time in a distributed system, a global logical clock (GLC) that assigns a total order to the events in a distributed system is very useful. Such a global logical clock can be used to arbitrate requests for resources in a fair manner, break deadlocks, etc.
A GLC should assign a timestamp t(e) to each event e such that t(e) < t(e') or t(e') < t(e) for e ≠ e'. Furthermore, the order imposed by the GLC should be consistent with <_H, that is, if e <_H e' then t(e) < t(e').
19 Ch9 Models: Causality
Lamport's algorithm
Gives a global logical clock consistent with <_H.
Each event e receives an integer e.TS such that e <_H e' ⇒ e.TS < e'.TS.
Concurrent events (unrelated by <_H) are ordered according to the processor address (assume these are integers).
Timestamps: t(e) = (e.TS, p) when e occurs at processor p.
Ordering of timestamps: (e.TS, p) < (e'.TS, q) iff e.TS < e'.TS, or e.TS = e'.TS and p < q.
20 Ch9 Models: Causality
Lamport's algorithm (cont.)
Each processor p maintains a local timestamp my_TS.
Each processor attaches its timestamp to all messages that it sends.
21 Ch9 Models: Causality
Lamport's timestamp algorithm
Initially, my_TS = 0
wait for any event e
on e do
  if e is the receipt of message m then
    my_TS := max(m.TS, my_TS) + 1; e.TS := my_TS
  elseif e is an internal event then
    my_TS := my_TS + 1; e.TS := my_TS
  elseif e is the sending of message m then
    my_TS := my_TS + 1; e.TS := my_TS; m.TS := my_TS
end
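The pseudocode above can be sketched in Python; the class and method names are illustrative, not part of the lecture notes. Tie-breaking by processor address falls out of Python's lexicographic tuple comparison.

```python
# Minimal sketch of Lamport's timestamp algorithm (illustrative names).
class LamportClock:
    def __init__(self, proc_id):
        self.proc_id = proc_id   # used to break ties between concurrent events
        self.my_ts = 0

    def internal_event(self):
        self.my_ts += 1
        return (self.my_ts, self.proc_id)

    def send_event(self):
        # the scalar part of the returned timestamp is attached to the message
        self.my_ts += 1
        return (self.my_ts, self.proc_id)

    def receive_event(self, msg_ts):
        # msg_ts is the scalar timestamp carried by the incoming message
        self.my_ts = max(msg_ts, self.my_ts) + 1
        return (self.my_ts, self.proc_id)

# (e.TS, p) < (e'.TS, q) iff e.TS < e'.TS, or e.TS = e'.TS and p < q:
# exactly Python's tuple comparison.
p1, p2 = LamportClock(1), LamportClock(2)
ts_send = p1.send_event()               # (1, 1), attached to message m
ts_recv = p2.receive_event(ts_send[0])  # (2, 2): max(1, 0) + 1
```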
22 Ch9 Models: Causality
Lamport's algorithm (cont.)
Lamport's algorithm ensures that e <_H e' ⇒ e.TS < e'.TS. Reason: if e1 <_p e2 or e1 <_m e2, then e2 is assigned a higher timestamp than e1.
Note: it is easy to see that the algorithm presented does not by itself assign a total order to the events in the system; processor addresses are used to break the ties.
23 Ch9 Models: Causality
Lamport's timestamps illustrated
[Diagram: events e1–e8 on p1, p2, p3, labeled (1,1), (2,1), (3,1), (1,2), (2,2), (3,2), (1,3), (4,3).]
Why is e7 labeled (3,1)? Why is e8 labeled (4,3)?
24 Ch9 Models: Causality
Lamport's timestamp algorithm has the following properties: completely distributed; simple; fault tolerant; minimal overhead; many applications.
25 Ch9 Models: Causality
Vector timestamps
Lamport timestamps guarantee that if e <_H e' then e.TS < e'.TS, but there is no guarantee that if e.TS < e'.TS then e <_H e'.
Problem: given two arbitrary events e and e' in E, we want to determine whether they are causally related. Why is this problem interesting?
26 Ch9 Models: Causality
Knowing when two events are causally related is useful. To see this, consider the following H-DAG, in which O is a mobile object.
[Diagram: p1 sends m1 "Migrate O on p2" to p2; p3 asks p1 "Where is O?"; p1 replies m2 "On p2"; p3 then asks p2 with m3 "Where is O?"; m3 arrives at p2 before m1, so p2 answers "I don't know". Error!]
When you debug the system after this point, you will find that the object is at p2. So why does p2 not know where the object is?
27 Ch9 Models: Causality
Causally-precedes relation <_c between messages
Let s(m) be the event of sending message m and r(m) the event of receiving message m.
Definition: m1 <_c m2 if s(m1) <_H s(m2).
A causality violation occurs when there are messages m1 and m2 and a processor p such that s(m1) <_H s(m2) and r(m2) <_p r(m1).
[Diagram: the simplest form of causality violation: the sending events are on the same processor p1, the receiving events are on the same processor p2, and m2 overtakes m1.]
28 Ch9 Models: Causality
Causality violation (ex.: distributed object system)
When p3 receives the "I don't know" message from p2, p3 has inconsistent information: from p1, p3 knows O is on p2, but from p2, p3 knows O is not on p2!
The source of the problem: m1 <_c m3 but r(m3) <_p2 r(m1), i.e. there is a causality violation.
Thus, for two events e and e', if we know exactly whether e <_H e', then we can detect causality violations. Vector timestamps give us this.
29 Ch9 Models: Causality
Vector timestamps
Idea: each event e indicates, for each processor p, all events at p that are causally before e.
30 Ch9 Models: Causality
The idea illustrated
[Space-time diagram: an event e on p3, together with the numbered events at p1, p2, p3, p4 that causally precede it.]
31 Ch9 Models: Causality
Vector timestamps
Idea: each event e indicates which events at each processor p causally precede e.
Each event e has a vector timestamp e.VT such that e.VT <_V e'.VT ⇔ e <_H e'.
e.VT is an array with an entry for every processor p. For any processor p, e.VT[p] is an integer, and e.VT[p] = k means that e causally follows the first k events that occur at p (one assumes that each event follows itself).
32 Ch9 Models: Causality
The meaning of e.VT[p] illustrated
[Diagram: for an event e on p3: e.VT[p1] = 3, e.VT[p2] = 6, e.VT[p3] = 4, e.VT[p4] = 2.]
33 Ch9 Models: Causality
Vector timestamps
The ordering <_V on vector timestamps is defined by: e.VT <_V e'.VT iff
a) e.VT[i] ≤ e'.VT[i] for all i in {1,..,M}, and
b) there is a j in {1,..,M} such that e.VT[j] < e'.VT[j].
Examples: (1,0,3) <_V (2,0,5); (1,1,3) <_V (2,1,3); (1,1,3) ≮_V (1,0,3); (1,1,3) ≮_V (1,1,3).
Property: e.VT <_V e'.VT only if e' causally follows every event that e causally follows.
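A minimal sketch of the <_V test and of the concurrency check; the helper names are made up for illustration.

```python
# Sketch of the <_V ordering on vector timestamps (illustrative names).
def vt_less(u, v):
    """u <_V v iff u[i] <= v[i] for all i and u[j] < v[j] for some j."""
    return (all(a <= b for a, b in zip(u, v))
            and any(a < b for a, b in zip(u, v)))

def vt_concurrent(u, v):
    # neither vector dominates the other: the events are causally unrelated
    return not vt_less(u, v) and not vt_less(v, u) and u != v
```

Applied to the examples on this slide: (1,0,3) <_V (2,0,5) and (1,1,3) <_V (2,1,3) hold, while (1,1,3) ≮_V (1,0,3) and (1,1,3) ≮_V (1,1,3).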
34 Ch9 Models: Causality
Comparison of vector timestamps illustrated
[Diagram with events e1, e2, e3 on p1–p4:] e1.VT = (5,4,1,3); e2.VT = (3,6,4,2); e3.VT = (0,0,1,3).
e3.VT <_V e1.VT.
No causal path from e1 to e2 or from e2 to e1: e1 and e2 are concurrent.
35 Ch9 Models: Causality
The property illustrated
[Diagram:] e.VT = (0,1,4,2) and e'.VT = (3,6,4,2), so e.VT <_V e'.VT: e' causally follows every event that e causally follows.
36 Ch9 Models: Causality
Vector timestamps algorithm
Initially, my_VT = [0,…,0]
wait for any event e
on e do
  if e is the receipt of message m then
    for i := 1 to M do my_VT[i] := max(m.VT[i], my_VT[i]);
    my_VT[self] := my_VT[self] + 1;
    e.VT := my_VT
  elseif e is an internal event then
    my_VT[self] := my_VT[self] + 1; e.VT := my_VT
  elseif e is the sending of message m then
    my_VT[self] := my_VT[self] + 1; e.VT := my_VT; m.VT := my_VT
end
Here we assume that each processor knows the names of all the processors in the system. How can we achieve this? We'll see later.
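The update rules can be sketched as a small class; the names are illustrative. On a receipt, the component-wise max is taken first and only the receiver's own entry is incremented.

```python
# Minimal sketch of a vector clock for one processor (illustrative names).
class VectorClock:
    def __init__(self, self_id, m):
        self.self_id = self_id        # index of this processor (0 .. m-1)
        self.my_vt = [0] * m

    def local_or_send(self):
        # internal events and sends: tick own entry; a copy is attached to
        # the outgoing message in the send case
        self.my_vt[self.self_id] += 1
        return list(self.my_vt)

    def receive(self, msg_vt):
        # component-wise max with the message's vector, then tick own entry
        self.my_vt = [max(a, b) for a, b in zip(self.my_vt, msg_vt)]
        self.my_vt[self.self_id] += 1
        return list(self.my_vt)

p1 = VectorClock(0, 3)       # processor p1 as index 0 in a 3-processor system
p2 = VectorClock(1, 3)
m_vt = p1.local_or_send()    # p1 sends: VT becomes [1, 0, 0]
e_vt = p2.receive(m_vt)      # p2 receives: [1, 1, 0]
```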
37 Ch9 Models: Causality
Vector timestamp algorithm
Ensures: e <_H e' ⇒ e.VT <_V e'.VT.
Reason:
1) e <_p e' (internal events at processor p): e.VT <_V e'.VT.
2) e <_m e' (the receipt of message m): e.VT <_V e'.VT.
38 Ch9 Models: Causality
Vector timestamp algorithm
Ensures: e.VT <_V e'.VT ⇒ e <_H e'.
Reason: assume e ≮_H e'; two cases are to be considered.
1) If e' <_H e, then e'.VT <_V e.VT (from the previous slide). In particular, if e is the l-th event at its processor p and e' causally follows only the first k < l events of p, then e'.VT[p] = k < l = e.VT[p], which implies e.VT ≮_V e'.VT.
39 Ch9 Models: Causality
Vector timestamp algorithm
Ensures (cont.): e.VT <_V e'.VT ⇒ e <_H e'.
Reason: assume e ≮_H e'; two cases are to be considered.
2) If e || e' (e and e' are concurrent), then again e'.VT[p] < e.VT[p] for the processor p at which e occurs, hence e.VT ≮_V e'.VT.
40 Ch9 Models: Causality
Detecting causality violation in the distributed object system example
If we know, for every pair of events, whether they are causally related, we can detect causality violations in the distributed object system example by installing a causality violation detector at every processor.
[Diagram: the object migration scenario annotated with vector timestamps (1,0,0), (0,0,1), (2,0,1), (3,0,1), (3,0,2), (3,0,3), (3,1,3), (3,2,3), (3,2,4), (3,3,3).]
If we attach a vector timestamp to each event (and message) of the distributed object system example, then each processor can detect a causality violation; e.g. p2 can detect that a causality violation has occurred when it receives m1: m1 <_c m3 but r(m3) <_p2 r(m1).
41 Ch9 Models: Causality
Causal communication
Causality violations can lead to undesirable situations. A processor usually cannot choose the order in which messages arrive, but it can decide the order in which messages are delivered to the applications executing on it.
This leads to the need for communication subsystems with specified properties; e.g. one may require a communication subsystem that delivers messages in causal order.
Advantage: the design of many distributed algorithms becomes easier (e.g. a simple object migration protocol).
42 Ch9 Models: Causality
Causal communication
Can we build a communication subsystem that guarantees delivery of messages in causal order? No for unicast message sending; yes for multicast.
43 Ch9 Models: Causality
Causal communication (an attempted solution)
Idea: hold back messages that arrive "too soon"; deliver a held-back message m only when you are assured that you will not receive an m' such that m' causally precedes m.
The implementation of this idea is similar to the implementation of FIFO communication.
[Diagram: applications on top of a communication subsystem (CSS) on top of the network.]
44 Ch9 Models: Causality
FIFO communication (TCP): the problem
Assume: 1) p and q are connected by an oriented communication line from p to q that satisfies: messages sent are eventually received; messages sent by p can arrive at q in any order. 2) q delivers messages received from p to an application A running at q.
The problem is to devise a distributed algorithm that enables processor q to deliver to A the messages received from p in the order p sent them.
45 Ch9 Models: Causality
FIFO communication: implementation (idea)
The solution consists of one algorithm for p and one for q.
Algorithm for p: p sequentially numbers each message it sends to q; q knows that messages are sequentially numbered.
Algorithm for q (idea): upon receipt of a message m with sequence number x, if q has not yet received the message with sequence number x-1, q delays the delivery of m until m can be delivered in sequence.
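The receiver side can be sketched as a hold-back buffer keyed by sequence number; the names below are illustrative.

```python
# Sketch of q's reordering buffer for FIFO delivery (illustrative names).
class FifoReceiver:
    def __init__(self):
        self.next_seq = 1        # sequence number of the next message to deliver
        self.buffer = {}         # held-back, out-of-order messages

    def receive(self, seq, msg):
        """Return the list of messages that become deliverable, in order."""
        self.buffer[seq] = msg
        delivered = []
        while self.next_seq in self.buffer:   # no hole: deliver
            delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1
        return delivered                      # hole remains: rest stays buffered

q = FifoReceiver()
held = q.receive(2, "b")      # out of order: buffered, nothing delivered
flushed = q.receive(1, "a")   # hole filled: both delivered, in order
```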
46 Ch9 Models: Causality
FIFO communication: implementation (idea)
Algorithm for q (idea, cont.)
[Diagram: on receiving message number x: if there is no hole below x, deliver; if there is a hole, buffer.]
47 Ch9 Models: Causality
Causal communication: implementation (idea)
Assumption (PTP): all point-to-point messages are delivered in the order sent.
Instead of using sequence numbers (as in the FIFO implementation), we use timestamps; Lamport timestamps or vector timestamps can be used.
Idea: whenever processor q receives a message m from processor p, q holds back m until it is assured that no message m' <_c m will be delivered from any other processor.
48 Ch9 Models: Causality
Causal communication: implementation (idea, variables used)
At processor self:
blocked[i] = queue of blocked messages received from pi;
earliest[i] = (head(blocked[i])).timestamp, or 1_i (the unit vector with 1 in entry i) if blocked[i] is empty;
delivery_list: the messages in delivery_list are causally ordered.
49 Ch9 Models: Causality
Causal communication: implementation (idea, variables update)
When processor self receives a message m from p, it performs the following steps in order:
Step 1: if blocked[p] is empty, then earliest[p] is set to m.timestamp /* because assumption (PTP) guarantees that no earlier message can be received from p */
Step 2: enqueue message m on blocked[p];
Step 3: unblock, one after another, all blocked messages that can be unblocked; add each unblocked message to delivery_list; update earliest if necessary. How do we determine when a message can be unblocked?
Step 4: deliver the messages in delivery_list.
50 Ch9 Models: Causality
Causal communication: implementation (idea, variables update)
Step 3 detailed: assume we use vector timestamps.
Step 3 refined: unblock, one after another, all blocked messages that can be unblocked. The message m at the head of the holding queue for processor k can be unblocked only if the "time" of processor k according to message m is smaller than the "time" of processor k according to any other message m', if any, at the head of a holding queue.
More precisely, blocked[k] can be unblocked only if (∀ i ∈ {1,..,M}, i ≠ k, i ≠ self : earliest[k][i] < earliest[i][i]).
Thus, the details of Step 3 are:
51 Ch9 Models: Causality
Causal communication: implementation (idea, variables update)
Step 3 detailed (cont.): blocked[k] can be unblocked only if (∀ i ∈ {1,..,M}, i ≠ k, i ≠ self : earliest[k][i] < earliest[i][i]).
Combining the above condition with the fact that messages are unblocked one after another, we obtain a while loop:
while (∃ k ∈ {1,..,M} : blocked[k] ≠ empty ∧ (∀ i ∈ {1,..,M}, i ≠ k, i ≠ self : earliest[k][i] < earliest[i][i])) do
  remove the first message of blocked[k] and add this message to delivery_list;
  if blocked[k] ≠ empty then
    earliest[k] := (head(blocked[k])).timestamp /* vector timestamp */
  else
    earliest[k] := earliest[k] + 1_k
end
Deliver the messages in delivery_list.
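The hold-back idea can be sketched with the classic vector-timestamp delivery condition for causal broadcast: deliver a message from p_k once it is the next message from p_k and everything it causally depends on has already been delivered. This simplifies the earliest[]/blocked[] bookkeeping above, and all names are illustrative.

```python
# Sketch of a causal delivery rule using vector timestamps (illustrative).
class CausalReceiver:
    def __init__(self, m):
        self.delivered = [0] * m   # delivered[k] = #messages delivered from p_k
        self.held = []             # held-back (sender, vt, msg) triples

    def _deliverable(self, sender, vt):
        # next message from this sender, and all its dependencies delivered
        return (vt[sender] == self.delivered[sender] + 1
                and all(vt[k] <= self.delivered[k]
                        for k in range(len(vt)) if k != sender))

    def receive(self, sender, vt, msg):
        """Return the messages that become deliverable, in causal order."""
        self.held.append((sender, vt, msg))
        out, progress = [], True
        while progress:            # unblock messages one after another
            progress = False
            for item in list(self.held):
                s, v, text = item
                if self._deliverable(s, v):
                    self.held.remove(item)
                    self.delivered[s] += 1
                    out.append(text)
                    progress = True
        return out

r = CausalReceiver(3)
held = r.receive(1, [1, 1, 0], "m2")   # depends on p0's first message: held
both = r.receive(0, [1, 0, 0], "m1")   # unblocks m1, then m2
```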
52 Ch9 Models: Causality
Causal communication: implementation (the complete scheme)
Initially, for each k in {1,..,M}: earliest[k] := 1_k; blocked[k] := empty.
Wait for a message from any processor
on the receipt of message m from processor p do
  delivery_list := empty; Step 1; Step 2; Step 3; Step 4
end
53 Ch9 Models: Causality
Detecting causality violation in the distributed object system example (recap)
If we know, for every pair of events, whether they are causally related, we can detect causality violations in the distributed object system example by installing a causality violation detector at every processor.
[Diagram: the object migration scenario annotated with vector timestamps, as on slide 40.]
54 Ch9 Models: Causality
A problem with the causal communication implementation given above: the communication subsystem at processor self might never deliver some messages.
55 Ch9 Models: Causality
Causal communication: the problem illustrated
[Diagram: four processors p1–p4; a message M from p3 arrives at p2; self is processor p2.]
Message M is never delivered by the communication subsystem running at processor p2:
blocked[p3] ≠ empty with M = head(blocked[p3]) and earliest[p3][p1] = 3, but blocked[p1] = empty with earliest[p1][p1] = 1, and blocked[p4] = empty with earliest[p4] = 1_p4. The unblocking condition earliest[p3][p1] < earliest[p1][p1] can therefore never hold if p1 sends no further message.
56 Ch9 Models: Distributed Snapshots
Assumptions/definitions
The system is connected, that is, there is a path between every pair of processors.
Ci,j: channel from pi to pj.
Communication channels are reliable and FIFO: messages sent are eventually received, in order.
The state of Ci,j is the ordered list of messages sent by pi but not yet received at pj (we will soon make this definition precise).
The state of a processor (at an instant) is the assignment of a value to each variable of that processor.
57 Ch9 Models: Distributed Snapshots
Assumptions (cont.)
Global state of the system: (S, L), where S = (s1,.., sM) are the processor states and L the channel states.
A global state cannot be taken instantaneously; it must be computed in a distributed manner.
The problem: devise a distributed algorithm that computes a consistent global state. What do we mean by a consistent global state?
58 Ch9 Models: Distributed Snapshots
Meaning of consistent global state: Example 1
[Diagram: processors p and q connected by channels Cp,q and Cq,p.]
Two possible states for each processor: s0 and s1. In s0 the processor does not have the token; in s1 the processor has the token.
The system contains exactly one token, which moves back and forth between p and q. Initially, p has the token. Events: sending/receiving the token.
59 Ch9 Models: Distributed Snapshots
Meaning of consistent global state
[Diagram: the global states of the system of Example 1: the token is at p, in Cp,q, at q, or in Cq,p.]
60 Ch9 Models: Distributed Snapshots
Meaning of consistent global state (informal)
A global state G is consistent if it is one that could have occurred.
[Diagram: a system with two possible runs (non-determinism); the actual transitions need not pass through G, yet the output of the snapshot algorithm can be G!]
61 Ch9 Models: Distributed Snapshots
Consistent global state (formal)
S = {s1,..,sM}; oi: the event of observing si at pi; O(S) = {o1,..,oM}.
Definition: S is a consistent cut iff {o1,..,oM} is consistent with causality.
Definition: {o1,..,oM} is consistent with causality iff (∀ e, oi : e ∈ Ei ∧ e <_H oi : (∀ e' : e' ∈ Ej ∧ e' <_H e : e' <_H oj)).
Notation: s(m) = event of sending m; r(m) = event of receiving m.
[Diagram: intuition: if e is observed before oi and e' causally precedes e, then e' must be observed before oj.]
62 Ch9 Models: Distributed Snapshots
Precision about "message sent but not yet received"
Definition: given O(S) = {o1,..,oM} and a message m, if s(m) <_pi oi and oj <_pj r(m), then m is sent but not yet received (relative to O).
[Diagram: p2 observes its state (o2), then asks p1 to do the same (o1); the global state resulting from o1 and o2 must contain m1, m2, m3.]
63 Ch9 Models: Distributed Snapshots
Meaning of consistent global state (cont.)
Definition: a global state (S, L) is consistent if S is a consistent cut and L contains all messages sent but not yet received (relative to O(S)).
64 Ch9 Models: Distributed Snapshots
Examples of global states (questions)
[Diagram 1: observations o1, o2, o3 on p1, p2, p3.] Is O = {o1, o2, o3} consistent with causality?
[Diagram 2: observations o'1, o'2, o'3 on p1, p2, p3.] Is O' = {o'1, o'2, o'3} consistent with causality?
65 Ch9 Models: Distributed Snapshots
Why is a consistent global state useful (an example)?
Processors p1 and p2 make use of resources r1 and r2.
A deadlocked global state of a distributed system is one in which there is a cycle in the wait-for graph.
Deadlock property: once a distributed system enters a deadlocked state, all subsequent global states are deadlocked states.
Assume we have a "tough guy" called the deadlock detector, whose goal is to observe the processors and the resources at some points of their processing and then check whether there is a cycle in the wait-for graph; if so, he claims that there is a deadlock.
[Diagram: Req/Ok/Rel exchanges between p1, p2 and r1, r2; our guy observes the processors and the resources at the points marked 1 through 4.]
66 Ch9 Models: Distributed Snapshots
Why is a consistent global state useful (ex., cont.)?
The deadlock detector observes the processors and the resources at the points marked 1 through 4 and finds a cycle in the wait-for graph involving p1, p2, r1 and r2 (where x → y means x is waiting for y).
To see why, note that a correct transaction for using a resource consists of three steps: Req, Ok, Rel.
67 Ch9 Models: Distributed Snapshots
Why is a consistent global state useful (ex., cont.)?
The deadlock detector observes the processors and the resources at the points marked 1 through 4 and finds a cycle in the wait-for graph (where x → y means x is waiting for y).
Is there actually a deadlock in the system? The answer is no; there is only a phantom deadlock. The claim of our guy is due to the fact that he made an inconsistent observation, which led to a wrong result!
68 Ch9 Models: Distributed Snapshots
The snapshot algorithm (informal)
Uses special messages: snapshot tokens (stok).
There are two types of participating processors: the initiating processor and the others.
The algorithm for the initiating processor:
Records its state;
Sends a stok on each outgoing channel;
Starts to record the state of its incoming channels.
Recording of the state of an incoming channel c is finished when a stok is received along it.
69 Ch9 Models: Distributed Snapshots
The snapshot algorithm (informal, cont.)
The algorithm for any other processor:
Records its state on receipt of a stok for the first time (assume the first stok is received along c);
Records the state of c as empty;
Sends one stok on each outgoing channel;
Starts to record the state of all other incoming channels;
Recording of the state of an incoming channel c' ≠ c is finished when a stok is received along it.
70 Ch9 Models: Distributed Snapshots
The snapshot algorithm (idea, cont.)
Notation: T(p, state): the time at p when p records its state; T(p, stok, c): the time at p when p receives a stok along c.
The state of an incoming channel c of p is the sequence of messages that p receives in the interval ]T(p, state), T(p, stok, c)[.
Recall that the state of c is recorded by p.
71 Ch9 Models: Distributed Snapshots
The snapshot algorithm illustrated: taking a snapshot of a token-passing system
[Diagram, read left to right:]
p records its state s0 and sends a stok on Cp,q;
q receives the stok: q records its state s0 and the state of Cp,q (Lpq = {}), then sends a stok on Cq,p;
p receives the token; when the stok then arrives, p finishes recording the state of Cq,p.
Recorded global state: S = {s0, s0}, L = {Lpq, Lqp}.
72 Ch9 Models: Distributed Snapshots
Applications of snapshots
Detecting stable state predicates (or properties).
A state predicate P is said to be stable if P(G) ⇒ P(G') for every G' that is reachable from G.
Examples: deadlock; termination; loss of the token; etc.
73 Ch9 Models: Distributed Snapshots
The snapshot algorithm (in the book)
Accounts for the possibility of different concurrent snapshots. To achieve this, each snapshot is identified by the name of the initiating processor.
A processor might initiate a new snapshot while the first is still being collected. To achieve this, version numbers are used (for simplicity, when a processor r requests a new version of the snapshot, the old snapshot is cancelled).
Diffusing computation: one useful technique for designing distributed algorithms.
74 Ch9 Models: Diffusing computation
Diffusing computation
Assume a connected network (i.e. for each pair of processors in the system, there is a path connecting them) and that messages sent are eventually received.
The problem: a processor p has a piece of information, Info, that it wants to send to all other processors.
Processors that are directly connected are called neighbors; each processor knows its neighbors.
75 Ch9 Models: Diffusing computation
Diffusing computation (a solution)
The algorithm for the initiator i:
  for each neighbor k: send(k, Info)
The algorithm for any other processor:
  wait for a message from any neighbor
  on receipt of Info from some neighbor p do
    for each neighbor k ≠ p: send(k, Info)
  end
There are two problems with this algorithm:
Problem 1: there might be unprocessed messages left in some channels.
Problem 2: processor p does not know if and when all other processors have received Info.
76 Ch9 Models: Diffusing computation
Diffusing computation (a solution, cont.)
Solution to problems 1 and 2: we want the initiator to be informed of the fact that all the processors have received Info.
Variables used: my_neighbors: the set of identities of all my neighbors; my_wlist: the list of neighbors from which I am waiting for a message containing Info.
The algorithm for the initiator i:
Step 1: for each k in my_neighbors: send(k, Info)
Step 2: my_wlist := my_neighbors;
  while my_wlist is not empty do
    wait for a message from any k in my_wlist
    on receipt of Info from k in my_wlist do
      my_wlist := my_wlist \ {k}
  end
77 Ch9 Models: Diffusing computation
(A solution, cont.: the algorithm for a non-initiating processor consists of three steps, Step 1, Step 2 and Step 3, in that order.)
Step 1: wait for a message from any k in my_neighbors
  on receipt of Info from k do
    my_parent := k;
    for each j in my_neighbors \ {k}: send(j, Info)
  end
Step 2: my_wlist := my_neighbors \ {my_parent};
  while my_wlist is not empty do
    wait for a message from any k in my_wlist
    on receipt of Info from k in my_wlist do
      my_wlist := my_wlist \ {k}
  end
Step 3: send(my_parent, Info)
Why is this distributed algorithm correct (i.e. each processor receives Info, the initiator eventually learns that each processor has received Info, and there is no deadlock)?
78 Ch9 Models: Diffusing computation
Spanning tree construction
[Diagram: the channels along which processors received Info for the first time form a tree rooted at p.]
A spanning tree of a graph is a tree whose nodes are all those of the graph and whose edges are a subset of those of the graph.
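Ignoring asynchrony, the "first receipt sets my_parent" rule can be sketched as a toy round-based simulation that returns the spanning-tree parents; the graph encoding and function name are illustrative.

```python
# Toy synchronous simulation of diffusing computation (illustrative names).
def diffuse(graph, initiator):
    """Return each node's parent in the spanning tree formed by the channels
    along which Info was received for the first time.
    graph: node -> set of neighbor nodes (connected, undirected)."""
    parent = {initiator: None}
    frontier = [initiator]
    while frontier:
        nxt = []
        for node in frontier:
            for nb in graph[node]:
                if nb not in parent:    # nb receives Info for the first time
                    parent[nb] = node   # my_parent := sender
                    nxt.append(nb)
        frontier = nxt
    return parent

graph = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2}}
parent = diffuse(graph, 1)   # spanning tree rooted at processor 1
```

A real run is asynchronous, so which neighbor becomes a node's parent can vary; the result is always some spanning tree rooted at the initiator.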
79 Ch9 Models: Distributed computation
Formal models (non-deterministic interleaving): understand how distributed computations actually occur.
Intuition: a distributed system has:
Global states: (S, L), see Snapshots. Initially, each processor is in an initial local state and each communication channel is empty.
Events: the occurrence of an event causes a transition of the system from the current global state to a new global state.
Computations: sequences of events from initial global states.
80 Ch9 Models: Distributed computation
More precisely: an event is e = (p, s, s', m, c), with p ∈ P; s, s' local states of p; m ∈ M ∪ {NULL} (M = the set of all possible messages); c ∈ C ∪ {NULL} (C = the set of all channels).
Interpretation of e = (p, s, s', m, c): e takes p from s to s' and possibly sends or receives m on c.
If m (and c) is NULL, then e is an internal event; no channel is affected by the occurrence of e. Otherwise, if c is an incoming channel then m is removed from c; if c is an outgoing channel then m is added to c.
81 Ch9 Models: Distributed computation
Occurrence of an event (execution of an event): an event e = (p, s, s', m, c) can occur in a global state G only if some condition, termed the enabling condition of e, is satisfied in G. The enabling condition of e is a condition on the state of p and the channels attached to p; example: the program counter has a specific value.
Transition of the system: if e = (p, s, s', m, c) can occur in G, then the execution of e by p changes the global state by changing only the state of p and possibly the state of one channel attached to p.
82 Ch9 Models: Distributed computation
More precisely (cont.): two functions. Let G be a global state and e an event.
Ready(G) = the set of all events that can occur in G;
Next(G, e) = the global state just after the occurrence of e.
Assume: G0 = the initial global state; Gi = the global state in which event ei occurs; seq = a sequence of events e0, …, en.
Definition: seq is a computation of the system if
1) (∀ i ∈ {0,…,n} : ei ∈ Ready(Gi)), and
2) (∀ i ∈ {0,…,n} : Gi+1 = Next(Gi, ei)).
Note: the selection from Ready(Gi) is non-deterministic.
83 Ch9 Models: Distributed computation
Correctness:
State predicate: an assertion on global states. Correctness property: an assertion on computations.
Definition: a distributed algorithm is correct if each of its computations satisfies the correctness property.
Proving correctness: show that each global state reachable from the initial global state satisfies some well-defined state predicate. In general, one uses invariant assertions.
84 Ch9 Models: Distributed computation
"Eventually" and "always" properties: let G0 be an initial global state; R(G0) = all computations that start in G0; A a state predicate; Q an assertion on computations.
eventually(A, G0, Q) means: starting from G0, for any computation for which Q holds, there is a global state that satisfies A (from now on, something good will happen).
always(A, G0, Q) means: A is always true starting from G0, for any computation for which Q holds.
85 Ch9 Models: Distributed computation
Failures in a distributed system
In a distributed system, failures occur; this is an additional complication in designing distributed algorithms. For a distributed system to be dependable, fault tolerance must be incorporated. A fault-tolerant algorithm is one which minimizes the impact of certain faults on the service provided by the system.
Fault classification: fail-stop; timing faults; Byzantine; transient faults; etc.