November 2005 — Distributed Systems: Distributed algorithms
Overview of chapters
– Introduction
– Co-ordination models and languages
– General services
– Distributed algorithms: Ch 10 Time and global states, 11.4–11.5; Ch 11 Coordination and agreement, 12.1–12.5
– Shared data
– Building distributed services
This chapter: overview
– Introduction
– Logical clocks
– Global states
– Failure detectors
– Mutual exclusion
– Elections
– Multicast communication
– Consensus and related problems
Logical clocks
Problem: ordering of events
– required by many algorithms
– physical clocks cannot be used
Use causality instead:
– within a single process: events are ordered by observation
– between different processes: sending a message happens before receiving that same message
Logical clocks (cont.)
Formalization: the happens-before relation →
Rules:
– if x happens before y in any single process p, then x → y
– for any message m: send(m) → receive(m)
– if x → y and y → z, then x → z (transitivity)
Implementation: logical clocks
Logical clocks (cont.)
Logical clock
– one counter per process
– counter incremented according to fixed rules
Physical clock
– counts oscillations of a crystal at a definite frequency
Logical clocks (cont.)
Rules for incrementing the local logical clock:
1. for each event (including send) in process p: Cp := Cp + 1
2. when a process sends a message m, it piggybacks the value of Cp on m
3. on receiving (m, t), process q computes Cq := max(Cq, t) and then applies rule 1: Cq := Cq + 1; Cq is the logical time of the event receive(m)
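The three rules can be sketched in a few lines; the sketch below is illustrative (class and method names are not from the deck), assuming message transport is handled elsewhere.

```python
# Lamport logical clock implementing the three rules above.
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):            # rule 1: any event (including send) increments C_p
        self.time += 1
        return self.time

    def send(self):            # rule 2: the returned value is piggybacked on m
        return self.tick()

    def receive(self, t):      # rule 3: C_q := max(C_q, t), then apply rule 1
        self.time = max(self.time, t)
        return self.tick()

p, q = LamportClock(), LamportClock()
t = p.send()                   # send event at p: C_p = 1
q.tick(); q.tick()             # two local events at q: C_q = 2
r = q.receive(t)               # C_q := max(2, 1) + 1 = 3
```

The receive timestamp (3) is guaranteed to exceed the send timestamp (1), as the happens-before relation requires.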
Logical clocks (cont.)
Logical timestamps: example
(Figure: three processes P1, P2, P3 exchanging messages; events a–g annotated with their Lamport timestamps.)
Logical clocks (cont.)
C(x) = logical clock value for event x
Correct use: if x → y then C(x) < C(y)
Incorrect use: C(x) < C(y) does not imply x → y
Solution: logical vector clocks
Logical clocks (cont.)
Vector clocks for N processes:
– at process Pi: a vector Vi[j], for j = 1, 2, …, N
– properties:
  if x → y then V(x) < V(y)
  if V(x) < V(y) then x → y
Logical clocks (cont.)
Rules for incrementing a logical vector clock:
1. for each event (including send) in process Pi: Vi[i] := Vi[i] + 1
2. when a process Pi sends a message m, it piggybacks the value of Vi on m
3. on receiving (m, t), process Pi applies rule 1 and then sets Vi[j] := max(Vi[j], t[j]) for j = 1, 2, …, N
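A vector-clock sketch of the same rules (function names are illustrative); note that, unlike Lamport clocks, comparing vector timestamps also decides the converse direction:

```python
# Vector clocks for N processes, as plain Python lists.
def tick(V, i):
    V = V[:]                   # rule 1: increment own entry V_i[i]
    V[i] += 1
    return V

def merge(V, t, i):            # rule 3: elementwise max with piggybacked t,
    V = [max(a, b) for a, b in zip(V, t)]
    return tick(V, i)          # then rule 1 for the receive event itself

def happened_before(Vx, Vy):   # V(x) < V(y): <= in every entry, < somewhere
    return all(a <= b for a, b in zip(Vx, Vy)) and Vx != Vy

V1 = tick([0, 0], 0)           # event at P1 -> [1, 0]
V2 = merge([0, 0], V1, 1)      # P2 receives P1's timestamp -> [1, 1]
```

happened_before(V1, V2) is true, while two concurrent timestamps such as [1, 0] and [0, 1] compare false in both directions.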
Logical clocks (cont.)
Logical vector clocks: example
(Figure: the same event diagram, annotated with vector timestamps.)
This chapter: overview — next: Global states
Global states
Goal: detect global properties of a distributed computation
Global states (cont.)
Local states & events
– process Pi: events e_i^k; s_i^k = state before event k
– history of Pi: h_i = <e_i^0, e_i^1, e_i^2, …>
– finite prefix of the history of Pi: h_i^k = <e_i^0, e_i^1, …, e_i^k>
Global states (cont.)
Global states & events
– global history: H = h_1 ∪ h_2 ∪ … ∪ h_n
– global state (when?): S = (s_1^p, s_2^q, …, s_n^u) — consistent?
– cut of the system's execution: C = h_1^c1 ∪ h_2^c2 ∪ … ∪ h_n^cn
Global states (cont.)
Example of cuts:
(Figure: consistent and inconsistent cuts across process timelines.)
Global states (cont.)
Finite prefix of the history of Pi: h_i^k = <e_i^0, e_i^1, …, e_i^k>
Cut of the system's execution: C = h_1^c1 ∪ h_2^c2 ∪ … ∪ h_n^cn
Consistent cut: for every event e ∈ C, f → e implies f ∈ C
A consistent global state is one that corresponds to a consistent cut
Global states (cont.)
Model the execution of a (distributed) system as S_0 → S_1 → S_2 → S_3 → …
– a series of transitions between consistent global states
– each transition corresponds to one single event:
  an internal event
  sending a message
  receiving a message
– simultaneous events are placed in some (arbitrary) order
Global states (cont.)
Definitions:
– run = an ordering of all events (in a global history) consistent with each local history's ordering
– linearization = a consistent run that is also consistent with →
– S' reachable from S ⇔ there is a linearization … S … S' …
Global states (cont.)
Kinds of global state predicates φ:
– stable: once true, φ remains true — if φ(S) = true and S → … → S', then φ(S') = true
– safety (with respect to an undesirable property α): for S_0 the initial state of the system, α evaluates to false for every state S reachable from S_0
– liveness (with respect to a desirable property β): for S_0 the initial state of the system, β evaluates to true for some state S reachable from S_0
Global states (cont.)
Snapshot algorithm of Chandy & Lamport
– records a consistent global state
– assumptions:
  neither channels nor processes fail
  channels are unidirectional and provide FIFO-ordered message delivery
  the graph of channels and processes is strongly connected
  any process may initiate a global snapshot
  processes may continue their execution during the snapshot
Global states (cont.)
Snapshot algorithm of Chandy & Lamport
– elements of the algorithm:
  players: processes Pi, each with incoming channels and outgoing channels
  marker messages
  two rules: the marker receiving rule and the marker sending rule
– start of the algorithm: a process acts as if it had received a marker message
Global states (cont.)
Marker receiving rule for process pi — on pi's receipt of a marker message over channel c:
  if (pi has not yet recorded its state)
    it records its process state now;
    records the state of c as the empty set;
    turns on recording of messages arriving over other incoming channels;
  else
    pi records the state of c as the set of messages it has received over c since it saved its state.
  end if
Marker sending rule for process pi — after pi has recorded its state, for each outgoing channel c:
  pi sends one marker message over c (before it sends any other message over c).
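The two rules can be sketched for a single process as below; all names are illustrative, and plain lists stand in for the FIFO channels (a real implementation would send the markers over the network):

```python
# Sketch of the Chandy-Lamport marker rules at one process.
class SnapshotProcess:
    def __init__(self, in_channels, out_channels):
        self.recorded_state = None
        self.chan_state = {}                       # channel -> recorded messages
        self.recording = {c: False for c in in_channels}
        self.buffers = {c: [] for c in in_channels}
        self.out = {c: [] for c in out_channels}   # what we put on each wire

    def record_state(self, local_state):           # marker sending rule
        self.recorded_state = local_state
        for c in self.out:                         # marker goes out first on
            self.out[c].append("MARKER")           # every outgoing channel
        for c in self.recording:
            self.recording[c] = True

    def on_marker(self, channel, local_state):     # marker receiving rule
        if self.recorded_state is None:            # first marker seen:
            self.record_state(local_state)         # record own state now
        self.chan_state[channel] = self.buffers[channel][:]
        self.recording[channel] = False            # channel state is now fixed

    def on_message(self, channel, m):              # ordinary message arrives
        if self.recording[channel]:                # in flight during snapshot
            self.buffers[channel].append(m)

p2 = SnapshotProcess(in_channels=["c2"], out_channels=["c1"])
p2.on_message("c2", "(Order 10, $100)")   # not recording yet: not in snapshot
p2.on_marker("c2", local_state="$50")     # records state; c2's state is empty
```

An initiator simply calls record_state() directly, which is exactly "acts as if it had received a marker".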
Global states (cont.)
Example:
Global states (cont.)
(Figure: processes p1 and p2 connected by channels c1 and c2; M = marker message.)
1. Global state S0: c1 empty, c2 empty
2. Global state S1: c2 = <(Order 10, $100), M>; c1 empty
3. Global state S2: c2 = <(Order 10, $100), M>; c1 = <(five widgets)>
4. Global state S3: c2 = <(Order 10, $100)>; c1 empty; recorded: C2 = <>, C1 =
Global states (cont.)
(continued; M = marker message)
1. Global state S0: c1 empty, c2 empty
4. Global state S3: c2 = <(Order 10, $100)>; c1 empty; recorded: C2 = <>, C1 =
5. Global state S4: c2 = <(Order 10, $100)>; c1 = <M>; recorded: C2 = <>, C1 =
6. Global state S5: c2 = <(Order 10, $100)>; c1 empty; recorded: C2 = <>, C1 =
Global states (cont.)
Observed state
– corresponds to a consistent cut
– is reachable!
This chapter: overview — next: Failure detectors
Failure detectors
Properties
– an unreliable failure detector answers with Suspected or Unsuspected
– a reliable failure detector answers with Failed or Unsuspected
Implementation
– every T seconds: each process P multicasts "P is here"
– maximum message transmission time:
  asynchronous system: an estimate E — no "P is here" within T + E seconds → P Suspected
  synchronous system: an absolute bound A — no "P is here" within T + A seconds → P Failed
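The timeout scheme above can be sketched as follows (the class name and explicit time parameters are illustrative assumptions):

```python
# Unreliable failure detector: suspect P when no "P is here" heartbeat
# has arrived within T + E seconds (E = estimated max transmission delay).
class FailureDetector:
    def __init__(self, T, E):
        self.timeout = T + E
        self.last_seen = {}

    def heartbeat(self, pid, now):       # a "P is here" message arrived
        self.last_seen[pid] = now

    def status(self, pid, now):
        last = self.last_seen.get(pid)
        if last is None or now - last > self.timeout:
            return "Suspected"           # unreliable: the message may just be slow
        return "Unsuspected"

fd = FailureDetector(T=5, E=2)
fd.heartbeat("P", now=0)
```

In a synchronous system the same check, run with the absolute bound A instead of the estimate E, could safely answer Failed instead of Suspected.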
This chapter: overview — next: Mutual exclusion
Mutual exclusion
Problem: how to temporarily grant a single process a privilege?
– privilege = the right to access a (shared) resource
– resource = file, device, window, …
Assumptions
– clients execute the mutual exclusion algorithm
– the resource itself might be managed by a server
– reliable communication
Mutual exclusion (cont.)
Basic requirements:
– ME1 (safety): at most one process may execute in the shared resource at any time
– ME2 (liveness): a process requesting access to the shared resource is eventually granted it
– ME3 (ordering/fairness): access to the shared resource is granted in happened-before order of the requests
Mutual exclusion (cont.)
Solutions:
– central server algorithm
– distributed algorithm using logical clocks
– ring-based algorithm
– voting algorithm
Evaluation criteria:
– bandwidth (number of messages to enter and exit)
– client delay (incurred by a process at each enter and exit)
– synchronization delay (delay between one process's exit and the next process's enter)
Mutual exclusion (cont.) — central server algorithm
Central server offering 2 operations:
– enter(): if the resource is free, the operation returns without delay; else the request is queued and the return from the operation is delayed
– exit(): if the request queue is empty, the resource is marked free; else the return for one selected queued request is executed
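The enter()/exit() pair maps directly onto a queue at the server; a minimal sketch (names are illustrative, and message passing is replaced by method calls):

```python
from collections import deque

# Central lock server: grants immediately when free, otherwise queues.
class LockServer:
    def __init__(self):
        self.holder = None           # id of the process using the resource
        self.queue = deque()         # pending enter() requests

    def enter(self, pid):
        if self.holder is None:      # resource free: grant (reply) at once
            self.holder = pid
            return True
        self.queue.append(pid)       # busy: queue the request, delay the reply
        return False

    def exit(self):
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder           # next holder, or None if queue was empty

s = LockServer()
s.enter(3)                           # free -> granted to P3
s.enter(4); s.enter(2)               # busy -> queued: 4, 2
```

After s.exit(), the grant passes to P4 while P2 stays queued, mirroring the example run.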
Mutual exclusion (cont.) — central server algorithm
Example (processes P1–P4; the server keeps the id of the current user and a request queue):
1. P3 calls enter(); the resource is free, so the grant returns immediately (user = 3, queue empty);
2. P4 calls enter(); the resource is busy, so its request is queued (user = 3, queue: 4);
3. P2 calls enter(); its request is queued as well (user = 3, queue: 4, 2);
4. P3 calls exit(); the server grants the resource to P4 (user = 4, queue: 2).
Mutual exclusion (cont.) — central server algorithm
Evaluation:
– ME3 is not satisfied!
– performance: the single server is a performance bottleneck
  enter: 2 messages
  synchronization: 2 messages between the exit of one process and the enter of the next
– failure:
  the central server is a single point of failure
  what if a client holding the resource fails?
  reliable communication required
Mutual exclusion (cont.) — ring-based algorithm
All processes are arranged in a unidirectional logical ring; a token is passed around the ring, and the process holding the token has access to the resource.
Mutual exclusion (cont.) — ring-based algorithm
Example (processes P1–P6 in a ring):
1. the token circulates around the ring; when it reaches P2, P2 can use the resource;
2. when P2 has stopped using the resource, it forwards the token;
3. P3 does not need the resource and forwards the token immediately;
4. the token continues around the ring.
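Token passing can be simulated in a few lines (an illustrative sketch; each step forwards the token one hop):

```python
# Ring-based mutual exclusion: the token visits processes in ring order,
# and only the current holder may enter the critical section.
def ring_schedule(ring, token_at, wants, steps):
    i = ring.index(token_at)
    entered = []                     # order in which processes use the resource
    for _ in range(steps):
        p = ring[i]
        if p in wants:               # holder needs the resource: use it...
            entered.append(p)
            wants.discard(p)
        i = (i + 1) % len(ring)      # ...then forward the token to the neighbour
    return entered

order = ring_schedule([1, 2, 3, 4, 5, 6], token_at=2, wants={2, 5}, steps=6)
```

Here P2 enters first (it holds the token), the token then passes P3 and P4 unused, and P5 enters; the token keeps circulating even when nobody wants the resource, which is the overhead noted in the evaluation.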
Mutual exclusion (cont.) — ring-based algorithm
Evaluation:
– ME3 is not satisfied
– efficiency: high when the resource is heavily used; high overhead when usage is very low (the token circulates anyway)
– failure: a process failure breaks the ring; reliable communication required
Mutual exclusion (cont.) — distributed algorithm using logical clocks
Distributed agreement algorithm:
– multicast requests to all participating processes
– use the resource only when all other participants agree (i.e. a reply has been received from each)
Processes:
– keep a logical clock, whose value is included in all request messages
– behave as a finite state machine: released → wanted → held
Mutual exclusion (cont.) — distributed algorithm using logical clocks
Ricart and Agrawala's algorithm: process Pj
– on initialization: state := released
– to obtain the resource:
  state := wanted;
  T := logical clock value for the next event;
  multicast the request <T, Pj> to the other processes;
  wait for n − 1 replies;
  state := held
Mutual exclusion (cont.) — distributed algorithm using logical clocks
Ricart and Agrawala's algorithm: process Pj
– on receipt of a request <Ti, Pi>:
  if (state = held) or (state = wanted and (T, Pj) < (Ti, Pi))
  then queue the request from Pi
  else reply immediately to Pi
– to release the resource:
  state := released;
  reply to any queued requests
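The request-handling rule is the heart of the algorithm; below is a sketch of just that decision (names are illustrative; requests are (timestamp, pid) pairs compared lexicographically):

```python
# Ricart-Agrawala: decide whether P_j queues or answers an incoming request.
def on_request(state, own_req, incoming_req):
    """state is 'released', 'wanted' or 'held'; requests are (T, pid) tuples."""
    if state == "held":
        return "queue"                   # we are using the resource
    if state == "wanted" and own_req < incoming_req:
        return "queue"                   # our own request has priority
    return "reply"                       # released, or the other request wins
```

Python's tuple comparison gives exactly the (T, P) ordering on the slide: timestamps are compared first and process ids break ties, so two concurrent requests are never both granted.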
Mutual exclusion (cont.) — distributed algorithm using logical clocks
Ricart and Agrawala's algorithm: example
– 3 processes
– P1 and P2 request the resource concurrently
– P3 is not interested in using the resource
Mutual exclusion (cont.) — distributed algorithm using logical clocks
Example run (each process shown with its state and request queue):
1. initially P1, P2 and P3 are all released, with empty queues;
2. P1 and P2 become wanted and multicast timestamped requests; P3 stays released;
3. P3 is not interested, so it replies immediately to both P1 and P2;
4. the concurrent requests are ordered by (timestamp, process id); the winning request is served first — here P2 queues P1's request, receives both replies, and becomes held;
5. when P2 releases the resource, it replies to the queued request, and P1 can enter.
Mutual exclusion (cont.) — distributed algorithm using logical clocks
Evaluation:
– performance: an expensive algorithm
  2 × (n − 1) messages to get the resource
  client delay: one round trip
  synchronization delay: 1 message to pass the resource to another process
  does not solve the performance bottleneck
– failure:
  each process must know all other processes
  reliable communication required
Mutual exclusion (cont.) — voting algorithm
Approach of Maekawa:
– communication with a subset of the partners should suffice
– a candidate collects sufficient votes
Voting sets: Vi = voting set of pi
– ∀ i, j: Vi ∩ Vj ≠ ∅
– |Vi| = K
– each Pj is contained in M voting sets
– optimal solution: K ≈ √N, with M = K
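One classical construction achieving K ≈ √N (an assumption here — the slide does not spell out how the voting sets are built) arranges N = K × K processes in a grid and takes Vi = row(i) ∪ column(i); any row intersects any column:

```python
# Grid construction of Maekawa voting sets for N = k*k processes, ids 0..N-1.
def voting_set(i, k):
    row, col = divmod(i, k)
    return ({row * k + c for c in range(k)}     # all of i's row
            | {r * k + col for r in range(k)})  # all of i's column

k = 3                                           # N = 9 processes
sets = [voting_set(i, k) for i in range(k * k)]
```

Every pair of sets overlaps (the basis for ME1), and |Vi| = 2k − 1, i.e. about 2√N rather than N.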
Mutual exclusion (cont.) — voting algorithm
Maekawa's algorithm:
On initialization:
  state := RELEASED; voted := FALSE;
For pi to enter the critical section:
  state := WANTED;
  multicast request to all processes in Vi − {pi};
  wait until (number of replies received = K − 1);
  state := HELD;
On receipt of a request from pi at pj (i ≠ j):
  if (state = HELD or voted = TRUE)
  then queue the request from pi without replying;
  else send a reply to pi; voted := TRUE;
  end if
Mutual exclusion (cont.) — voting algorithm
Maekawa's algorithm (cont.):
For pi to exit the critical section:
  state := RELEASED;
  multicast release to all processes in Vi − {pi};
On receipt of a release from pi at pj (i ≠ j):
  if (queue of requests is non-empty)
  then remove the head of the queue — from pk, say; send a reply to pk; voted := TRUE;
  else voted := FALSE;
  end if
Mutual exclusion (cont.) — voting algorithm
Evaluation:
– properties:
  ME1: OK
  ME2: not OK — deadlock is possible; solution: process requests in order
  ME3: OK
– performance:
  bandwidth: 2√N messages per enter + √N messages per exit
  client delay: one round trip
  synchronization delay: one round trip
– failure:
  the crash of a process in another voting set can be tolerated
  reliable communication required
Mutual exclusion (cont.)
Discussion:
– these algorithms are expensive and not practical
– they become extremely complex in the presence of failures
– a better solution in most cases: let the server managing the resource perform concurrency control; this gives more transparency for the clients
This chapter: overview — next: Elections
Elections
Problem statement:
– select a process from a group of processes
– several processes may start an election concurrently
Main requirement:
– a unique choice: select the process with the highest id
Elections (cont.)
Basic requirements:
– E1 (safety): every participant pi has elected_i = ⊥ (not yet defined) or elected_i = P, where P is the process with the highest id
– E2 (liveness): all processes pi participate and eventually set elected_i ≠ ⊥, or crash
Elections (cont.)
Solutions:
– bully election algorithm
– ring-based election algorithm
Evaluation criteria:
– bandwidth (proportional to the total number of messages)
– turnaround time (the number of serialized message transmission times between initiation and termination of a single run)
Elections (cont.) — bully election
Assumptions:
– each process has an identifier
– processes can fail during an election
– communication is reliable
Goal:
– the surviving member with the largest identifier is elected as coordinator
Elections (cont.) — bully election
Roles for processes:
– coordinator: the elected process; has the highest identifier at the time of the election
– initiator: a process starting the election, for whatever reason
Elections (cont.) — bully election
Three types of messages:
– election message: sent by an initiator of the election to all processes with a higher identifier
– answer message: a reply sent by the receiver of an election message
– coordinator message: sent by the process becoming coordinator to all processes with lower identifiers
Elections (cont.) — bully election
Algorithm:
– sending an election message:
  the process doing so is called the initiator
  any process may do it at any time
  when a failed process is restarted, it starts an election, even if the current coordinator is still functioning (hence "bully")
– a process receiving an election message:
  replies with an answer message
  starts an election itself (why? it may have the highest identifier among the survivors)
Elections (cont.) — bully election
Algorithm:
– actions of an initiator:
  when not receiving an answer message within a certain time (2 Ttrans + Tprocess): it becomes coordinator
  when an answer message was received (a process with a higher identifier is active) but no coordinator message arrives (after x time units): it restarts the election
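A sketch of the bully algorithm's outcome (a deliberate simplification: direct calls replace messages and timeouts, so "no answer" is modeled as "no live higher process"):

```python
# Bully election over the set of currently live process ids.
def bully_election(initiator, alive):
    higher = sorted(p for p in alive if p > initiator)
    if not higher:                   # no answer within 2*T_trans + T_process:
        return initiator             # the initiator becomes coordinator
    # some higher process answered; the highest of them runs its own election
    return bully_election(higher[-1], alive)

coordinator = bully_election(1, alive={1, 2})   # P3 and P4 have failed
```

With P3 and P4 down, an election started by P1 ends with P2 as coordinator, matching the example that follows.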
Elections (cont.) — bully election
Example: election of P2 after the failure of P3 and P4 (processes P1–P4):
1. P1, as initiator, sends election messages to P2, P3 and P4;
2. P2 and P3 reply with answer messages and start elections of their own; P2's election messages arrive at P3 and P4;
3. P3 replies to P2 — but then P3 fails as well;
4. P1 times out without seeing a coordinator message: the election starts again;
5. P2 replies to P1 and starts a new election; this time P2 receives no replies (P3 and P4 are down);
6. P2 therefore becomes coordinator and announces this with a coordinator message.
Elections (cont.) — bully election
Evaluation:
– correctness: E1 & E2 hold, if
  communication is reliable
  no process replaces a crashed process
– no guarantee for E1, if
  a crashed process is replaced by a process with the same id
  the assumed timeout values are inaccurate (i.e. the failure detector is unreliable)
– performance:
  worst case: O(n²) messages
  optimal case: bandwidth n − 2 messages, turnaround 1 message
Elections (cont.) — ring-based election
Assumptions:
– processes are arranged in a logical ring
– each process has an identifier: i for Pi
– processes remain functional and reachable during the algorithm
Elections (cont.) — ring-based election
Messages:
– forwarded over the logical ring
– 2 types:
  election: used during the election; contains an identifier
  elected: used to announce the new coordinator
Process states:
– participant
– non-participant
Elections (cont.) — ring-based election
Algorithm:
– a process initiating an election
  becomes a participant
  sends an election message containing its identifier to its neighbour
Elections (cont.) — ring-based election
(Figure: ring of processes P1, P21, P8, P11, P14, P5; the initiator P11 sends an election message carrying id 11.)
Elections (cont.) — ring-based election
Algorithm:
– upon receiving an election message, a process compares identifiers:
  received: the identifier in the message
  own: the identifier of the process itself
– 3 cases:
  received > own
  received < own
  received = own
Elections (cont.) — ring-based election
Algorithm — receive election message:
– received > own:
  the message is forwarded unchanged
  the process becomes a participant
Elections (cont.) — ring-based election
(Figure: the election message carrying 11 is forwarded along the ring.)
Elections (cont.) — ring-based election
Algorithm — receive election message:
– received > own:
  the message is forwarded
  the process becomes a participant
– received < own, and the process is a non-participant:
  it substitutes its own identifier in the message
  the message is forwarded
  the process becomes a participant
Elections (cont.) — ring-based election
(Figure: P21 receives the message, substitutes its own id 21, and the message carrying 21 is forwarded around the ring.)
Elections (cont.) — ring-based election
Algorithm — receive election message:
– received > own: …
– received < own, non-participant: …
– received = own:
  this identifier must be the greatest
  the process becomes the coordinator
  its new state is non-participant
  it sends an elected message to its neighbour
Elections (cont.) — ring-based election
(Figure: the message carrying 21 has returned to P21, which becomes coordinator.)
Elections (cont.) — ring-based election
Algorithm — receive election message (the three cases together):
– received > own: forward the message; become a participant
– received < own and non-participant: substitute own identifier in the message; forward it; become a participant
– received = own: own identifier must be the greatest; become coordinator; new state: non-participant; send an elected message to the neighbour
Elections (cont.) — ring-based election
Algorithm — receive elected message:
– a participant:
  new state: non-participant
  forwards the message
– the coordinator:
  the election process is completed
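The election phase can be condensed into a short simulation (illustrative; the participant flag and concurrent elections are omitted, and the elected announcement is implicit in the return value):

```python
# Ring-based election: forward max(received, own) until received == own.
def ring_election(ring, start):
    token = ring[start]              # the initiator sends its own identifier
    i = (start + 1) % len(ring)
    while ring[i] != token:          # "received = own" ends the election
        token = max(token, ring[i])  # substitute own id if it is larger
        i = (i + 1) % len(ring)
    return token                     # this process becomes the coordinator

winner = ring_election([1, 21, 8, 11, 14, 5], start=3)   # initiator P11
```

On the slides' ring the id 21 survives every substitution and eventually returns to P21, which becomes coordinator.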
Elections (cont.) — ring-based election
(Figure: the elected message announcing coordinator 21 circulates once around the ring, turning each participant back into a non-participant, until it returns to P21.)
Elections (cont.) — ring-based election
Evaluation:
– why is the condition "received < own and the process is a non-participant" necessary? (see the next slide for the full algorithm)
– number of messages:
  worst case: 3n − 1
  best case: 2n
– concurrent elections: election messages are extinguished as soon as possible, before the winning result is announced
Elections (cont.) — ring-based election
Algorithm — receive election message (full):
– received > own: forward the message; become a participant
– received < own and non-participant: substitute own identifier in the message; forward it; become a participant
– received = own: own identifier must be the greatest; become coordinator; new state: non-participant; send an elected message to the neighbour
This chapter: overview — next: Multicast communication
Multicast communication
Essential property:
– one multicast operation ≠ multiple sends:
  higher efficiency
  stronger delivery guarantees
Operations (g = group, m = message):
– X-multicast(g, m)
– X-deliver(m), as distinct from receive(m)
– X denotes the additional property: Basic, Reliable, FIFO, …
Multicast communication (cont.) — IP multicast
Datagram operations with a multicast IP address
Failure model: cf. UDP
– omission failures
– no ordering or reliability guarantees
Multicast communication (cont.) — basic multicast
Basic multicast = IP multicast + a delivery guarantee, provided the multicasting process does not crash
Straightforward algorithm (with reliable one-to-one send):
  To B-multicast(g, m): for each p ∈ g: send(p, m)
  On receive(m) at p: B-deliver(m)
Exercise: a practical algorithm using IP-multicast
Multicast communication (cont.) — reliable multicast
Properties:
– integrity (safety): a correct process delivers a message at most once
– validity (liveness): if a correct process p multicasts m, then p delivers m
– agreement (liveness): if a correct process delivers m, then all correct processes will deliver m
– uniform agreement (liveness): if any process p (correct or failing) delivers m, then all correct processes will deliver m
Multicast communication (cont.)
Reliable multicast: 2 algorithms
1. using B-multicast
2. using IP-multicast + piggybacked acks
Multicast communication (cont.)
Reliable multicast: algorithm 1 with B-multicast
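The slide's figure (the algorithm itself) did not survive extraction. In the usual textbook version, each process re-B-multicasts a message to the group the first time it B-delivers it, before R-delivering; a sketch under that assumption, with the sender's crash simulated as a B-multicast cut short:

```python
# Reliable multicast, algorithm 1: echo every message on first B-delivery.
class RProcess:
    def __init__(self, name, group):
        self.name = name
        self.group = group        # shared list of all RProcess objects
        self.received = set()     # message ids already seen
        self.r_delivered = []
        self.crashed = False

    def b_deliver(self, sender, mid, m):
        if self.crashed or mid in self.received:
            return
        self.received.add(mid)
        if sender is not self:    # echo to the group before R-delivering
            for q in self.group:
                q.b_deliver(self, mid, m)
        self.r_delivered.append(m)

    def r_multicast(self, mid, m, crash_after=None):
        """B-multicast(g, m); crash_after limits how many sends complete."""
        self.received.add(mid)
        for i, q in enumerate(self.group):
            if crash_after is not None and i >= crash_after:
                self.crashed = True          # sender crashes mid-multicast
                return
            q.b_deliver(self, mid, m)
        self.r_delivered.append(m)

group = []
procs = [RProcess(n, group) for n in ("p", "q", "r")]
group.extend(procs)
# Sender p crashes after its sends reached only itself and q;
# q's echo still carries the message to r, so agreement holds.
procs[0].r_multicast("m1", "data", crash_after=2)
```

Because every process echoes every message, each message crosses the network about |g| times per member, which is the inefficiency the next slide points out.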
Multicast communication (cont.)
Reliable multicast: algorithm 1 with B-multicast
Correct?
– Integrity, Validity, Agreement: yes
Efficient?
– NO: each message is transmitted |g| times
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast
Data structures at process p:
– S_g^p: sequence number for group g
– R_g^q: sequence number of the latest message it has delivered from q
On initialization: S_g^p = 0
For process p to R-multicast message m to group g:
– IP-multicast(g, <m, S_g^p, acks>), piggybacking the acknowledgements <q, R_g^q>
– S_g^p++
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast
On IP-deliver(<m, S, acks>) at q from p:
– if S = R_g^p + 1 then R-deliver(m); R_g^p++; check hold-back queue
– else if S > R_g^p + 1 then store m in hold-back queue; request missing messages
– endif
– if a piggybacked R exceeds the corresponding local R_g^q then request missing messages endif
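The sequence-number check with its hold-back queue can be sketched as follows (retransmission requests and piggybacked acks are omitted; the `Receiver` class is illustrative):

```python
# Deliver <m, S> from p only when S = R_p + 1; otherwise hold it back.
class Receiver:
    def __init__(self):
        self.R = {}           # R[p]: latest sequence number delivered from p
        self.holdback = {}    # holdback[(p, S)] = m
        self.delivered = []

    def ip_deliver(self, p, S, m):
        expected = self.R.get(p, -1) + 1
        if S == expected:
            self.delivered.append(m)
            self.R[p] = S
            self._drain(p)                # check hold-back queue
        elif S > expected:
            self.holdback[(p, S)] = m     # store; would request missing msgs

    def _drain(self, p):
        nxt = self.R[p] + 1
        while (p, nxt) in self.holdback:  # release now-deliverable messages
            self.delivered.append(self.holdback.pop((p, nxt)))
            self.R[p] = nxt
            nxt += 1

q = Receiver()
q.ip_deliver("p", 1, "m1")   # out of order: held back
q.ip_deliver("p", 0, "m0")   # delivers m0, then drains m1
```

The same per-sender rule reappears later in the FIFO-multicast algorithm, which is this check minus the retransmission machinery.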
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast
Correct?
– Integrity: sequence numbers + checksums
– Validity: holds if missing messages are detected
– Agreement: holds if a copy of every message remains available for retransmission
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
3 processes in the group: P, Q, R
State of a process:
– S: next sequence number
– R_q: latest sequence number delivered from q
– stored messages
Presentation of the state, e.g. at P: "P: 2, Q: 3, R: 5" = own S followed by R_Q and R_R
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
Initial state:
– P: S = 0; R_Q = -1; R_R = -1
– Q: S = 0; R_P = -1; R_R = -1
– R: S = 0; R_P = -1; R_Q = -1
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
First multicast by P: P IP-multicasts <m_p0, 0, ...> and increments S
– P: S = 1; R_Q = -1; R_R = -1; stores <m_p0, 0, ...>
– Q: S = 0; R_P = -1; R_R = -1
– R: S = 0; R_P = -1; R_Q = -1
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
Arrival of P's multicast at Q: Q delivers m_p0
– P: S = 1; R_Q = -1; R_R = -1; stores <m_p0, 0, ...>
– Q: S = 0; R_P = 0; R_R = -1
– R: S = 0; R_P = -1; R_Q = -1
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
New state:
– P: S = 1; R_Q = -1; R_R = -1
– Q: S = 0; R_P = 0; R_R = -1
– R: S = 0; R_P = -1; R_Q = -1
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
Multicast by Q: Q IP-multicasts <m_q0, 0, ...> and increments S
– P: S = 1; R_Q = -1; R_R = -1
– Q: S = 1; R_P = 0; R_R = -1; stores <m_q0, 0, ...>
– R: S = 0; R_P = -1; R_Q = -1
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
Arrival of Q's multicast: P and R deliver m_q0; the piggybacked ack R_P = 0 reveals to R that it has missed a message from P
– P: S = 1; R_Q = 0; R_R = -1
– Q: S = 1; R_P = 0; R_R = -1; stores <m_q0, 0, ...>
– R: S = 0; R_P = -1; R_Q = 0
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
When to delete stored messages?
– P: S = 1; R_Q = 0; R_R = -1
– Q: S = 1; R_P = 0; R_R = -1
– R: S = 0; R_P = -1; R_Q = 0
Multicast communication (cont.)
Ordered multicast: FIFO, Causal, Total
– FIFO: if a correct process P performs multicast(g, m) and then multicast(g, m'), then every correct process that delivers m' delivers m before m'
– Causal: if multicast(g, m) happens before multicast(g, m'), then every correct process that delivers m' delivers m before m'
Multicast communication (cont.)
Ordered multicast
– Total: if some correct process delivers m before m', then every correct process that delivers m' delivers m before m'
– FIFO-Total = FIFO + Total
– Causal-Total = Causal + Total
– Atomic = Reliable + Total
Multicast communication (cont.)
Ordered multicast
(figure caption) Notice the consistent ordering of the totally ordered messages T_1 and T_2, the FIFO-related messages F_1 and F_2 and the causally related messages C_1 and C_3, and the otherwise arbitrary delivery ordering of messages.
Multicast communication (cont.)
FIFO multicast
Alg. 1: the R-multicast algorithm using IP-multicast is already FIFO-ordered
– correct, because the sender assigns S_g^p and receivers deliver in that order
Alg. 2: on top of any B-multicast
Multicast communication (cont.)
FIFO multicast: algorithm 2 on top of any B-multicast
Data structures at process p:
– S_g^p: sequence number
– R_g^q: sequence number of the latest message it has delivered from q
On initialization: S_g^p = 0; R_g^q = -1
For process p to FO-multicast message m to group g:
– B-multicast(g, <m, S_g^p>)
– S_g^p++
Multicast communication (cont.)
FIFO multicast: algorithm 2 on top of any B-multicast
On B-deliver(<m, S>) at q from p:
– if S = R_g^p + 1 then FO-deliver(m); R_g^p++; check hold-back queue
– else if S > R_g^p + 1 then store m in hold-back queue
– endif
Multicast communication (cont.)
TOTAL multicast
Basic approach:
– sender: assign totally ordered identifiers instead of per-sender ids
– receiver: deliver as for FIFO ordering
Alg. 1: use a (single) sequencer process
Alg. 2: participants collectively agree on the assignment of sequence numbers
Multicast communication (cont.)
TOTAL multicast: sequencer process
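The sequencer figure is missing from the extracted slide. In the usual scheme, senders B-multicast the message to the group and the sequencer, the sequencer B-multicasts an order message (message id, sequence number), and members deliver data in sequence order. A sketch under that assumption (class names and message shapes are illustrative):

```python
# Sequencer-based total ordering: one process assigns all sequence numbers.
class Sequencer:
    def __init__(self):
        self.next_seq = 0

    def on_message(self, mid):
        s, self.next_seq = self.next_seq, self.next_seq + 1
        return (mid, s)            # order message, B-multicast to the group

class Member:
    def __init__(self):
        self.next_expected = 0
        self.pending = {}          # seq -> message id
        self.messages = {}         # message id -> body
        self.delivered = []

    def on_data(self, mid, m):     # data message from a sender
        self.messages[mid] = m
        self._drain()

    def on_order(self, mid, seq):  # order message from the sequencer
        self.pending[seq] = mid
        self._drain()

    def _drain(self):              # deliver in the sequencer's order only
        while (self.next_expected in self.pending
               and self.pending[self.next_expected] in self.messages):
            mid = self.pending.pop(self.next_expected)
            self.delivered.append(self.messages[mid])
            self.next_expected += 1

seq = Sequencer()
a, b = Member(), Member()
for mem in (a, b):                 # two concurrent senders' data arrives
    mem.on_data("x", "from p")
    mem.on_data("y", "from q")
ox = seq.on_message("x")
oy = seq.on_message("y")
a.on_order(*ox); a.on_order(*oy)   # order messages may interleave differently
b.on_order(*oy); b.on_order(*ox)   # at each member; delivery order is the same
```

Both members deliver in the sequencer's order regardless of arrival order, which is exactly why the sequencer is also the bottleneck and single point of failure noted on the next slide.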
Multicast communication (cont.)
TOTAL multicast: sequencer process
Correct? yes
Problems?
– the single sequencer process is a bottleneck and a single point of failure
Multicast communication (cont.)
TOTAL multicast: ISIS algorithm
Approach:
– sender: B-multicasts the message
– receivers: propose sequence numbers to the sender
– sender: uses the returned proposals to generate the agreed sequence number
Multicast communication (cont.)
TOTAL multicast: ISIS algorithm
(figure: P_1 B-multicasts a message to P_2, P_3 and P_4; each member replies with its proposed sequence number; P_1 then multicasts the agreed sequence number, the largest of the proposals)
Multicast communication (cont.)
TOTAL multicast: ISIS algorithm
Data structures at process p:
– A_g^p: largest agreed sequence number observed so far
– P_g^p: largest sequence number proposed by p
On initialization: P_g^p = 0
For process p to TO-multicast message m to group g:
– B-multicast(g, <m, i>)  (i = unique id for m)
On B-deliver(<m, i>) at q from p:
– P_g^q := max(A_g^q, P_g^q) + 1
– send(p, <i, P_g^q>)
– store <m, i, P_g^q> in hold-back queue
Multicast communication (cont.)
TOTAL multicast: ISIS algorithm
On receive(<i, P>) at p from q:
– wait for all replies; let a be the largest reply
– B-multicast(g, <i, a>)
On B-deliver(<i, a>) at q from p:
– A_g^q := max(A_g^q, a)
– attach a to message i in the hold-back queue
– reorder the hold-back queue to increasing sequence numbers
– while the message m at the front of the hold-back queue has been assigned an agreed sequence number do: remove m from the hold-back queue; TO-deliver(m)
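The sequence-number agreement at the heart of the algorithm (each receiver proposes max(A, P) + 1, the sender agrees on the largest proposal, receivers update A) can be sketched in isolation; the hold-back queue and delivery are as on the previous slides, and the class names here are illustrative:

```python
# ISIS agreed-sequence-number negotiation, numbering logic only.
class IsisMember:
    def __init__(self):
        self.A = 0     # largest agreed sequence number seen (A_g^q)
        self.P = 0     # largest sequence number proposed by this member (P_g^q)

    def propose(self):
        self.P = max(self.A, self.P) + 1
        return self.P

    def on_agreed(self, a):
        self.A = max(self.A, a)

def to_multicast(members):
    """Sender side: collect all proposals, agree on the largest."""
    agreed = max(m.propose() for m in members)
    for m in members:
        m.on_agreed(agreed)
    return agreed

group = [IsisMember() for _ in range(3)]
s1 = to_multicast(group)    # first message
s2 = to_multicast(group)    # second message: strictly larger everywhere
```

Taking the maximum of all proposals makes every member's next proposal exceed every agreed number it has seen, which is what keeps the agreed sequence numbers monotonically increasing.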
Multicast communication (cont.)
TOTAL multicast: ISIS algorithm
Correct?
– processes will agree on the sequence number for a message
– sequence numbers are monotonically increasing
– no process can prematurely deliver a message
Performance:
– 3 serial messages per multicast!
Total ordering alone is neither FIFO nor causal
Multicast communication (cont.)
Causal multicast
Limitations:
– causal order induced only by multicast operations
– non-overlapping, closed groups
Approach:
– use vector timestamps
– timestamp = count of multicast messages
Multicast communication (cont.)
Causal multicast: vector timestamps
Meaning?
Multicast communication (cont.)
Causal multicast: vector timestamps
Correct?
– message m carries timestamp V, m' carries timestamp V'
– given multicast(g, m) happens before multicast(g, m'), prove that V < V'
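The algorithm's figure is missing from the extracted slide; the standard delivery condition is to deliver a message from sender j with timestamp V once V[j] = local[j] + 1 and V[k] <= local[k] for every other k, i.e. once all causal predecessors have been delivered. A sketch under that assumption (the receiver class is illustrative):

```python
# Causal multicast delivery with vector timestamps of multicast counts.
class CausalReceiver:
    def __init__(self, n):
        self.vc = [0] * n          # delivered-multicast count per sender
        self.holdback = []         # (sender, V, m) awaiting predecessors
        self.delivered = []

    def co_deliver(self, j, V, m):
        self.holdback.append((j, V, m))
        progress = True
        while progress:            # re-scan until no message becomes deliverable
            progress = False
            for entry in list(self.holdback):
                sj, sV, sm = entry
                ok = sV[sj] == self.vc[sj] + 1 and all(
                    sV[k] <= self.vc[k]
                    for k in range(len(self.vc)) if k != sj)
                if ok:
                    self.holdback.remove(entry)
                    self.delivered.append(sm)
                    self.vc[sj] += 1
                    progress = True

r = CausalReceiver(3)
# p0 multicasts m0 (timestamp [1,0,0]); p1, having delivered m0, multicasts
# m1 (timestamp [1,1,0]). m1 arrives first and is held back until m0 arrives.
r.co_deliver(1, [1, 1, 0], "m1")
r.co_deliver(0, [1, 0, 0], "m0")
```

The hold-back of m1 until m0 arrives is precisely the causal guarantee: a message is never delivered before the multicasts that happened before it.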
This chapter: overview
– Introduction
– Logical clocks
– Global states
– Failure detectors
– Mutual exclusion
– Elections
– Multicast communication
– Consensus and related problems
Consensus & related problems
System model:
– N processes p_i
– communication is reliable
– processes may fail: crash or Byzantine failures
– no message signing (signing limits the harm a faulty process can do)
Problems:
– consensus
– Byzantine generals
– interactive consistency
Consensus
Problem statement:
– each p_i starts undecided and proposes a value v_i
– message exchanges
– finally each p_i sets a decision variable d_i, enters the decided state and may no longer change d_i
Requirements:
– Termination: eventually each correct process p_i sets d_i
– Agreement: if p_i and p_j are correct and in the decided state, then d_i = d_j
– Integrity: if the correct processes all propose the same value d, then any correct process in the decided state has chosen d
Consensus
Simple algorithm (no process failures):
– collect the processes in a group g
– each process p_i: R-multicast(g, v_i); collect the N values; d = majority(v_1, v_2, ..., v_N)
Problems with failures:
– crash: detectable? not in an asynchronous system
– Byzantine: a faulty process can send different values to different processes
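The failure-free algorithm is small enough to write out. Reliable multicast guarantees every process collects the same multiset, so applying the same majority function yields the same decision everywhere (returning None when no strict majority exists is an illustrative choice for this sketch):

```python
# Failure-free consensus: everyone multicasts, collects N values, takes majority.
from collections import Counter

def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None   # None: no majority

# Values each process R-multicasts and therefore collects identically:
proposals = ["attack", "retreat", "attack"]
d = majority(proposals)    # every correct process computes the same d
```

With no failures this satisfies Termination, Agreement and Integrity trivially; the rest of the section is about what breaks once crashes or Byzantine processes enter the picture.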
Byzantine generals
Problem statement:
– informal: agree to attack or to retreat; the commander issues the order; the lieutenants are to decide to attack or retreat; all can be 'treacherous'
– formal: one process proposes a value; the others are to agree
Requirements:
– Termination: each correct process eventually decides
– Agreement: all correct processes select the same value
– Integrity: if the commander is correct, the other correct processes select the commander's value
Interactive consistency
Problem statement:
– the correct processes agree upon a vector of values (one value for each process)
Requirements:
– Termination: each correct process eventually decides
– Agreement: all correct processes decide on the same vector
– Integrity: if p_i is correct, then all correct processes decide on v_i as the i-th component of their vector
Related problems & solutions
Basic solutions:
– consensus: C_i(v_1, v_2, ..., v_N) = decision value of p_i
– Byzantine generals (j is the commander, proposing v): BG_i(j, v) = decision value of p_i
– interactive consistency: IC_i(v_1, v_2, ..., v_N)[j] = j-th value in the decision vector of p_i, with v_1, v_2, ..., v_N the values proposed by the processes
Related problems & solutions
Derived solutions:
– IC from BG: run BG N times, once with each process as commander; IC_i(v_1, v_2, ..., v_N)[j] = BG_i(j, v_j)
– C from IC: run IC to produce a vector of values; derive a single value with an appropriate function
– BG from C: (next slide)
Related problems & solutions
Derived solutions:
– BG from C: commander p_j sends its value v to itself and the other processes; each process runs C with the values v_1, v_2, ..., v_N it received; BG_i(j, v) = C_i(v_1, v_2, ..., v_N)
Consensus in a synchronous system
Assumptions:
– use B-multicast
– at most f of the N processes may fail
Approach:
– proceed in f + 1 rounds; in each round the correct processes B-multicast values
Variables:
– Values_i^r = set of proposed values known to process p_i at the beginning of round r
Consensus in a synchronous system
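The figure with the algorithm did not survive extraction; a sketch of the f + 1-round exchange, with a crash simulated as a B-multicast that reaches only a prefix of the group (the function and its crash model are illustrative assumptions, and min is used as the shared decision function):

```python
# Synchronous consensus in f + 1 rounds of B-multicasts of known values.
def run_consensus(proposals, f, crashes):
    """crashes maps round -> {pid: number of sends completed before crashing}."""
    n = len(proposals)
    values = [{v} for v in proposals]        # Values_i at the start
    crashed = set()
    for rnd in range(1, f + 2):              # f + 1 rounds
        incoming = [set() for _ in range(n)]
        for i in range(n):
            if i in crashed:
                continue
            limit = crashes.get(rnd, {}).get(i, n)
            for j in range(limit):           # a crash cuts the multicast short
                incoming[j] |= values[i]
            if limit < n:
                crashed.add(i)
        for j in range(n):
            if j not in crashed:
                values[j] |= incoming[j]     # merge values learned this round
    # surviving processes apply the same decision function to their sets
    return [min(values[j]) if j not in crashed else None for j in range(n)]

# p0 proposes 1 but crashes in round 1 after its sends reached only itself and
# p1; with f = 1 there is a second round in which p1 relays the value 1 to p2.
decisions = run_consensus([1, 2, 3], f=1, crashes={1: {0: 2}})
```

The extra round is what lets p1 pass on the value it alone received, so the surviving processes end the final round with identical sets.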
Consensus in a synchronous system
Termination?
– guaranteed by the synchronous system
Correct?
– each correct process arrives at the same set of values at the end of the final round
– proof sketch: assume the sets differ, i.e. p_i has a value v that p_j lacks; then in the final round some p_k managed to send v to p_i but not to p_j, so p_k crashed, and the same reasoning forces a crash in every earlier round; that requires f + 1 failures, contradicting the assumption of at most f
Agreement & Integrity?
– all processes apply the same decision function to the same set
Byzantine generals in a synchronous system
Assumptions:
– arbitrary (Byzantine) failures
– at most f of the N processes may be faulty
– channels are reliable: no message injections
– unsigned messages
Byzantine generals in a synchronous system
Impossibility with 3 processes (f = 1); in general, impossibility with N <= 3f processes
(figure: left, the commander p_1 sends 1:v to both lieutenants but faulty p_3 relays 3:1:u to p_2; right, a faulty commander p_1 sends different values w and x to p_2 and p_3; p_2 cannot distinguish the two scenarios; faulty processes are shown shaded)
Byzantine generals in a synchronous system
Solution with one faulty process (N = 4, f = 1)
2 rounds of messages:
– the commander sends its value to each of the lieutenants
– each lieutenant sends the value it received to its peers
Each lieutenant thus receives the value of the commander plus N - 2 peer values and applies the majority function
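The two-round exchange for N = 4 can be sketched directly (the concrete values relayed by the faulty lieutenant are arbitrary; the strict-majority helper is an illustrative choice):

```python
# Byzantine generals, N = 4, f = 1: majority over commander + relayed values.
from collections import Counter

def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None

# Round 1: a correct commander p1 sends "v" to lieutenants p2, p3, p4.
commander_value = "v"
# Round 2: each lieutenant relays what it received; faulty p3 lies to both.
relayed_to_p2 = {"p3": "u", "p4": "v"}
relayed_to_p4 = {"p3": "w", "p2": "v"}

decision_p2 = majority([commander_value] + list(relayed_to_p2.values()))
decision_p4 = majority([commander_value] + list(relayed_to_p4.values()))
```

Each correct lieutenant sees two copies of v against one lie, so the majority is v; with N = 3 there would be only one relayed value and no majority to fall back on, which is the impossibility on the previous slide.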
Byzantine generals in a synchronous system
Solution with one faulty process
(figures: left, the commander p_1 sends v to all lieutenants; faulty p_3 relays wrong values, yet p_2 and p_4 each see two copies of v and decide v by majority; right, a faulty commander sends three different values u, v, w; the lieutenants each collect {u, v, w}, find no majority and agree on a default; faulty processes are shown shaded)
Consensus & related problems
Impossibility in asynchronous systems: it is proved that no algorithm exists that is guaranteed to reach consensus; hence there is no guaranteed solution to
– the Byzantine generals problem
– interactive consistency
– reliable & totally ordered multicast
Approaches to work around the impossibility:
– masking faults: restart crashed processes and use persistent storage
– using failure detectors: make failures fail-silent by discarding messages from suspected processes
This chapter: overview
– Introduction
– Logical clocks
– Global states
– Failure detectors
– Mutual exclusion
– Elections
– Multicast communication
– Consensus and related problems
Distributed Systems: Distributed algorithms