November 2005 — Distributed Systems: Distributed algorithms
Overview of chapters
– Introduction
– Co-ordination models and languages
– General services
– Distributed algorithms: Ch 10 Time and global states, 11.4–11.5; Ch 11 Coordination and agreement, 12.1–12.5
– Shared data
– Building distributed services
This chapter: overview
– Introduction
– Logical clocks
– Global states
– Failure detectors
– Mutual exclusion
– Elections
– Multicast communication
– Consensus and related problems
Logical clocks
Problem: ordering of events
– required by many algorithms
– physical clocks cannot be used
Use causality instead:
– within a single process: events are ordered by observation
– between different processes: sending a message happens before receiving that same message
Logical clocks (cont.)
Formalization: the happens-before relation →
Rules:
– if x happens before y in any single process p, then x → y
– for any message m: send(m) → receive(m)
– if x → y and y → z, then x → z (transitivity)
Implementation: logical clocks
Logical clocks (cont.)
Logical clock
– one counter per process
– counter incremented according to fixed rules
Physical clock
– counts oscillations of a crystal at a definite frequency
Logical clocks (cont.)
Rules for incrementing the local logical clock:
1. for each event (including send) in process p: Cp := Cp + 1
2. when a process sends a message m, it piggybacks the value of Cp on m
3. on receiving (m, t), process q computes Cq := max(Cq, t) and then applies rule 1: Cq := Cq + 1; Cq is the logical time of the event receive(m)
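The three rules can be sketched in a few lines; the sketch below is illustrative (class and method names are not from the deck), assuming message transport is handled elsewhere.

```python
# Lamport logical clock implementing the three rules above.
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):            # rule 1: any event (including send) increments C_p
        self.time += 1
        return self.time

    def send(self):            # rule 2: the returned value is piggybacked on m
        return self.tick()

    def receive(self, t):      # rule 3: C_q := max(C_q, t), then apply rule 1
        self.time = max(self.time, t)
        return self.tick()

p, q = LamportClock(), LamportClock()
t = p.send()                   # send event at p: C_p = 1
q.tick(); q.tick()             # two local events at q: C_q = 2
r = q.receive(t)               # C_q := max(2, 1) + 1 = 3
```

The receive timestamp (3) is guaranteed to exceed the send timestamp (1), as the happens-before relation requires.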
Logical clocks (cont.)
Logical timestamps: example
(Figure: three processes P1, P2, P3 exchanging messages; events a–g annotated with their Lamport timestamps.)
Logical clocks (cont.)
C(x) = logical clock value for event x
Correct use: if x → y then C(x) < C(y)
Incorrect use: C(x) < C(y) does not imply x → y
Solution: logical vector clocks
Logical clocks (cont.)
Vector clocks for N processes:
– at process Pi: a vector Vi[j], for j = 1, 2, …, N
– properties:
  if x → y then V(x) < V(y)
  if V(x) < V(y) then x → y
Logical clocks (cont.)
Rules for incrementing a logical vector clock:
1. for each event (including send) in process Pi: Vi[i] := Vi[i] + 1
2. when a process Pi sends a message m, it piggybacks the value of Vi on m
3. on receiving (m, t), process Pi applies rule 1 and then sets Vi[j] := max(Vi[j], t[j]) for j = 1, 2, …, N
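A vector-clock sketch of the same rules (function names are illustrative); note that, unlike Lamport clocks, comparing vector timestamps also decides the converse direction:

```python
# Vector clocks for N processes, as plain Python lists.
def tick(V, i):
    V = V[:]                   # rule 1: increment own entry V_i[i]
    V[i] += 1
    return V

def merge(V, t, i):            # rule 3: elementwise max with piggybacked t,
    V = [max(a, b) for a, b in zip(V, t)]
    return tick(V, i)          # then rule 1 for the receive event itself

def happened_before(Vx, Vy):   # V(x) < V(y): <= in every entry, < somewhere
    return all(a <= b for a, b in zip(Vx, Vy)) and Vx != Vy

V1 = tick([0, 0], 0)           # event at P1 -> [1, 0]
V2 = merge([0, 0], V1, 1)      # P2 receives P1's timestamp -> [1, 1]
```

happened_before(V1, V2) is true, while two concurrent timestamps such as [1, 0] and [0, 1] compare false in both directions.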
Logical clocks (cont.)
Logical vector clocks: example
(Figure: the same event diagram, annotated with vector timestamps.)
This chapter: overview — next: Global states
Global states
Goal: detect global properties of a distributed computation
Global states (cont.)
Local states & events
– process Pi: events e_i^k; s_i^k = state before event k
– history of Pi: h_i = <e_i^0, e_i^1, e_i^2, …>
– finite prefix of the history of Pi: h_i^k = <e_i^0, e_i^1, …, e_i^k>
Global states (cont.)
Global states & events
– global history: H = h_1 ∪ h_2 ∪ … ∪ h_n
– global state (when?): S = (s_1^p, s_2^q, …, s_n^u) — consistent?
– cut of the system's execution: C = h_1^c1 ∪ h_2^c2 ∪ … ∪ h_n^cn
Global states (cont.)
Example of cuts:
(Figure: consistent and inconsistent cuts across process timelines.)
Global states (cont.)
Finite prefix of the history of Pi: h_i^k = <e_i^0, e_i^1, …, e_i^k>
Cut of the system's execution: C = h_1^c1 ∪ h_2^c2 ∪ … ∪ h_n^cn
Consistent cut: for every event e ∈ C, f → e implies f ∈ C
A consistent global state is one that corresponds to a consistent cut
Global states (cont.)
Model the execution of a (distributed) system as S_0 → S_1 → S_2 → S_3 → …
– a series of transitions between consistent global states
– each transition corresponds to one single event:
  an internal event
  sending a message
  receiving a message
– simultaneous events are placed in some (arbitrary) order
Global states (cont.)
Definitions:
– run = an ordering of all events (in a global history) consistent with each local history's ordering
– linearization = a consistent run that is also consistent with →
– S' reachable from S ⇔ there is a linearization … S … S' …
Global states (cont.)
Kinds of global state predicates φ:
– stable: once true, φ remains true — if φ(S) = true and S → … → S', then φ(S') = true
– safety (with respect to an undesirable property α): for S_0 the initial state of the system, α evaluates to false for every state S reachable from S_0
– liveness (with respect to a desirable property β): for S_0 the initial state of the system, β evaluates to true for some state S reachable from S_0
Global states (cont.)
Snapshot algorithm of Chandy & Lamport
– records a consistent global state
– assumptions:
  neither channels nor processes fail
  channels are unidirectional and provide FIFO-ordered message delivery
  the graph of channels and processes is strongly connected
  any process may initiate a global snapshot
  processes may continue their execution during the snapshot
Global states (cont.)
Snapshot algorithm of Chandy & Lamport
– elements of the algorithm:
  players: processes Pi, each with incoming channels and outgoing channels
  marker messages
  two rules: the marker receiving rule and the marker sending rule
– start of the algorithm: a process acts as if it had received a marker message
Global states (cont.)
Marker receiving rule for process pi — on pi's receipt of a marker message over channel c:
  if (pi has not yet recorded its state)
    it records its process state now;
    records the state of c as the empty set;
    turns on recording of messages arriving over other incoming channels;
  else
    pi records the state of c as the set of messages it has received over c since it saved its state.
  end if
Marker sending rule for process pi — after pi has recorded its state, for each outgoing channel c:
  pi sends one marker message over c (before it sends any other message over c).
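The two rules can be sketched for a single process as below; all names are illustrative, and plain lists stand in for the FIFO channels (a real implementation would send the markers over the network):

```python
# Sketch of the Chandy-Lamport marker rules at one process.
class SnapshotProcess:
    def __init__(self, in_channels, out_channels):
        self.recorded_state = None
        self.chan_state = {}                       # channel -> recorded messages
        self.recording = {c: False for c in in_channels}
        self.buffers = {c: [] for c in in_channels}
        self.out = {c: [] for c in out_channels}   # what we put on each wire

    def record_state(self, local_state):           # marker sending rule
        self.recorded_state = local_state
        for c in self.out:                         # marker goes out first on
            self.out[c].append("MARKER")           # every outgoing channel
        for c in self.recording:
            self.recording[c] = True

    def on_marker(self, channel, local_state):     # marker receiving rule
        if self.recorded_state is None:            # first marker seen:
            self.record_state(local_state)         # record own state now
        self.chan_state[channel] = self.buffers[channel][:]
        self.recording[channel] = False            # channel state is now fixed

    def on_message(self, channel, m):              # ordinary message arrives
        if self.recording[channel]:                # in flight during snapshot
            self.buffers[channel].append(m)

p2 = SnapshotProcess(in_channels=["c2"], out_channels=["c1"])
p2.on_message("c2", "(Order 10, $100)")   # not recording yet: not in snapshot
p2.on_marker("c2", local_state="$50")     # records state; c2's state is empty
```

An initiator simply calls record_state() directly, which is exactly "acts as if it had received a marker".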
Global states (cont.)
Example:
Global states (cont.)
(Figure: processes p1 and p2 connected by channels c1 and c2; M = marker message.)
1. Global state S0: c1 empty, c2 empty
2. Global state S1: c2 = <(Order 10, $100), M>; c1 empty
3. Global state S2: c2 = <(Order 10, $100), M>; c1 = <(five widgets)>
4. Global state S3: c2 = <(Order 10, $100)>; c1 empty; recorded: C2 = <>, C1 =
Global states (cont.)
(continued; M = marker message)
1. Global state S0: c1 empty, c2 empty
4. Global state S3: c2 = <(Order 10, $100)>; c1 empty; recorded: C2 = <>, C1 =
5. Global state S4: c2 = <(Order 10, $100)>; c1 = <M>; recorded: C2 = <>, C1 =
6. Global state S5: c2 = <(Order 10, $100)>; c1 empty; recorded: C2 = <>, C1 =
Global states (cont.)
Observed state
– corresponds to a consistent cut
– is reachable!
This chapter: overview — next: Failure detectors
Failure detectors
Properties
– an unreliable failure detector answers with Suspected or Unsuspected
– a reliable failure detector answers with Failed or Unsuspected
Implementation
– every T seconds: each process P multicasts "P is here"
– maximum message transmission time:
  asynchronous system: an estimate E — no "P is here" within T + E seconds → P Suspected
  synchronous system: an absolute bound A — no "P is here" within T + A seconds → P Failed
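The timeout scheme above can be sketched as follows (the class name and explicit time parameters are illustrative assumptions):

```python
# Unreliable failure detector: suspect P when no "P is here" heartbeat
# has arrived within T + E seconds (E = estimated max transmission delay).
class FailureDetector:
    def __init__(self, T, E):
        self.timeout = T + E
        self.last_seen = {}

    def heartbeat(self, pid, now):       # a "P is here" message arrived
        self.last_seen[pid] = now

    def status(self, pid, now):
        last = self.last_seen.get(pid)
        if last is None or now - last > self.timeout:
            return "Suspected"           # unreliable: the message may just be slow
        return "Unsuspected"

fd = FailureDetector(T=5, E=2)
fd.heartbeat("P", now=0)
```

In a synchronous system the same check, run with the absolute bound A instead of the estimate E, could safely answer Failed instead of Suspected.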
This chapter: overview — next: Mutual exclusion
Mutual exclusion
Problem: how to temporarily grant a single process a privilege?
– privilege = the right to access a (shared) resource
– resource = file, device, window, …
Assumptions
– clients execute the mutual exclusion algorithm
– the resource itself might be managed by a server
– reliable communication
Mutual exclusion (cont.)
Basic requirements:
– ME1 (safety): at most one process may execute in the shared resource at any time
– ME2 (liveness): a process requesting access to the shared resource is eventually granted it
– ME3 (ordering/fairness): access to the shared resource is granted in happened-before order of the requests
Mutual exclusion (cont.)
Solutions:
– central server algorithm
– distributed algorithm using logical clocks
– ring-based algorithm
– voting algorithm
Evaluation criteria:
– bandwidth (number of messages to enter and exit)
– client delay (incurred by a process at each enter and exit)
– synchronization delay (delay between one process's exit and the next process's enter)
Mutual exclusion (cont.) — central server algorithm
Central server offering 2 operations:
– enter(): if the resource is free, the operation returns without delay; else the request is queued and the return from the operation is delayed
– exit(): if the request queue is empty, the resource is marked free; else the return for one selected queued request is executed
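The enter()/exit() pair maps directly onto a queue at the server; a minimal sketch (names are illustrative, and message passing is replaced by method calls):

```python
from collections import deque

# Central lock server: grants immediately when free, otherwise queues.
class LockServer:
    def __init__(self):
        self.holder = None           # id of the process using the resource
        self.queue = deque()         # pending enter() requests

    def enter(self, pid):
        if self.holder is None:      # resource free: grant (reply) at once
            self.holder = pid
            return True
        self.queue.append(pid)       # busy: queue the request, delay the reply
        return False

    def exit(self):
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder           # next holder, or None if queue was empty

s = LockServer()
s.enter(3)                           # free -> granted to P3
s.enter(4); s.enter(2)               # busy -> queued: 4, 2
```

After s.exit(), the grant passes to P4 while P2 stays queued, mirroring the example run.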
Mutual exclusion (cont.) — central server algorithm
Example (processes P1–P4; the server keeps the id of the current user and a request queue):
1. P3 calls enter(); the resource is free, so the grant returns immediately (user = 3, queue empty);
2. P4 calls enter(); the resource is busy, so its request is queued (user = 3, queue: 4);
3. P2 calls enter(); its request is queued as well (user = 3, queue: 4, 2);
4. P3 calls exit(); the server grants the resource to P4 (user = 4, queue: 2).
Mutual exclusion (cont.) — central server algorithm
Evaluation:
– ME3 is not satisfied!
– performance: the single server is a performance bottleneck
  enter: 2 messages
  synchronization: 2 messages between the exit of one process and the enter of the next
– failure:
  the central server is a single point of failure
  what if a client holding the resource fails?
  reliable communication required
Mutual exclusion (cont.) — ring-based algorithm
All processes are arranged in a unidirectional logical ring; a token is passed around the ring, and the process holding the token has access to the resource.
Mutual exclusion (cont.) — ring-based algorithm
Example (processes P1–P6 in a ring):
1. the token circulates around the ring; when it reaches P2, P2 can use the resource;
2. when P2 has stopped using the resource, it forwards the token;
3. P3 does not need the resource and forwards the token immediately;
4. the token continues around the ring.
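Token passing can be simulated in a few lines (an illustrative sketch; each step forwards the token one hop):

```python
# Ring-based mutual exclusion: the token visits processes in ring order,
# and only the current holder may enter the critical section.
def ring_schedule(ring, token_at, wants, steps):
    i = ring.index(token_at)
    entered = []                     # order in which processes use the resource
    for _ in range(steps):
        p = ring[i]
        if p in wants:               # holder needs the resource: use it...
            entered.append(p)
            wants.discard(p)
        i = (i + 1) % len(ring)      # ...then forward the token to the neighbour
    return entered

order = ring_schedule([1, 2, 3, 4, 5, 6], token_at=2, wants={2, 5}, steps=6)
```

Here P2 enters first (it holds the token), the token then passes P3 and P4 unused, and P5 enters; the token keeps circulating even when nobody wants the resource, which is the overhead noted in the evaluation.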
Mutual exclusion (cont.) — ring-based algorithm
Evaluation:
– ME3 is not satisfied
– efficiency: high when the resource is heavily used; high overhead when usage is very low (the token circulates anyway)
– failure: a process failure breaks the ring; reliable communication required
Mutual exclusion (cont.) — distributed algorithm using logical clocks
Distributed agreement algorithm:
– multicast requests to all participating processes
– use the resource only when all other participants agree (i.e. a reply has been received from each)
Processes:
– keep a logical clock, whose value is included in all request messages
– behave as a finite state machine: released → wanted → held
Mutual exclusion (cont.) — distributed algorithm using logical clocks
Ricart and Agrawala's algorithm: process Pj
– on initialization: state := released
– to obtain the resource:
  state := wanted;
  T := logical clock value for the next event;
  multicast the request <T, Pj> to the other processes;
  wait for n − 1 replies;
  state := held
Mutual exclusion (cont.) — distributed algorithm using logical clocks
Ricart and Agrawala's algorithm: process Pj
– on receipt of a request <Ti, Pi>:
  if (state = held) or (state = wanted and (T, Pj) < (Ti, Pi))
  then queue the request from Pi
  else reply immediately to Pi
– to release the resource:
  state := released;
  reply to any queued requests
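The request-handling rule is the heart of the algorithm; below is a sketch of just that decision (names are illustrative; requests are (timestamp, pid) pairs compared lexicographically):

```python
# Ricart-Agrawala: decide whether P_j queues or answers an incoming request.
def on_request(state, own_req, incoming_req):
    """state is 'released', 'wanted' or 'held'; requests are (T, pid) tuples."""
    if state == "held":
        return "queue"                   # we are using the resource
    if state == "wanted" and own_req < incoming_req:
        return "queue"                   # our own request has priority
    return "reply"                       # released, or the other request wins
```

Python's tuple comparison gives exactly the (T, P) ordering on the slide: timestamps are compared first and process ids break ties, so two concurrent requests are never both granted.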
Mutual exclusion (cont.) — distributed algorithm using logical clocks
Ricart and Agrawala's algorithm: example
– 3 processes
– P1 and P2 request the resource concurrently
– P3 is not interested in using the resource
Mutual exclusion (cont.) — distributed algorithm using logical clocks
Example run (each process shown with its state and request queue):
1. initially P1, P2 and P3 are all released, with empty queues;
2. P1 and P2 become wanted and multicast timestamped requests; P3 stays released;
3. P3 is not interested, so it replies immediately to both P1 and P2;
4. the concurrent requests are ordered by (timestamp, process id); the winning request is served first — here P2 queues P1's request, receives both replies, and becomes held;
5. when P2 releases the resource, it replies to the queued request, and P1 can enter.
Mutual exclusion (cont.) — distributed algorithm using logical clocks
Evaluation:
– performance: an expensive algorithm
  2 × (n − 1) messages to get the resource
  client delay: one round trip
  synchronization delay: 1 message to pass the resource to another process
  does not solve the performance bottleneck
– failure:
  each process must know all other processes
  reliable communication required
Mutual exclusion (cont.) — voting algorithm
Approach of Maekawa:
– communication with a subset of the partners should suffice
– a candidate collects sufficient votes
Voting sets: Vi = voting set of pi
– ∀ i, j: Vi ∩ Vj ≠ ∅
– |Vi| = K
– each Pj is contained in M voting sets
– optimal solution: K ≈ √N, with M = K
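One classical construction achieving K ≈ √N (an assumption here — the slide does not spell out how the voting sets are built) arranges N = K × K processes in a grid and takes Vi = row(i) ∪ column(i); any row intersects any column:

```python
# Grid construction of Maekawa voting sets for N = k*k processes, ids 0..N-1.
def voting_set(i, k):
    row, col = divmod(i, k)
    return ({row * k + c for c in range(k)}     # all of i's row
            | {r * k + col for r in range(k)})  # all of i's column

k = 3                                           # N = 9 processes
sets = [voting_set(i, k) for i in range(k * k)]
```

Every pair of sets overlaps (the basis for ME1), and |Vi| = 2k − 1, i.e. about 2√N rather than N.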
Mutual exclusion (cont.) — voting algorithm
Maekawa's algorithm:
On initialization:
  state := RELEASED; voted := FALSE;
For pi to enter the critical section:
  state := WANTED;
  multicast request to all processes in Vi − {pi};
  wait until (number of replies received = K − 1);
  state := HELD;
On receipt of a request from pi at pj (i ≠ j):
  if (state = HELD or voted = TRUE)
  then queue the request from pi without replying;
  else send a reply to pi; voted := TRUE;
  end if
Mutual exclusion (cont.) — voting algorithm
Maekawa's algorithm (cont.):
For pi to exit the critical section:
  state := RELEASED;
  multicast release to all processes in Vi − {pi};
On receipt of a release from pi at pj (i ≠ j):
  if (queue of requests is non-empty)
  then remove the head of the queue — from pk, say; send a reply to pk; voted := TRUE;
  else voted := FALSE;
  end if
Mutual exclusion (cont.) — voting algorithm
Evaluation:
– properties:
  ME1: OK
  ME2: not OK — deadlock is possible; solution: process requests in order
  ME3: OK
– performance:
  bandwidth: 2√N messages per enter + √N messages per exit
  client delay: one round trip
  synchronization delay: one round trip
– failure:
  the crash of a process in another voting set can be tolerated
  reliable communication required
Mutual exclusion (cont.)
Discussion:
– these algorithms are expensive and not practical
– they become extremely complex in the presence of failures
– a better solution in most cases: let the server managing the resource perform concurrency control; this gives more transparency for the clients
This chapter: overview — next: Elections
Elections
Problem statement:
– select a process from a group of processes
– several processes may start an election concurrently
Main requirement:
– a unique choice: select the process with the highest id
Elections (cont.)
Basic requirements:
– E1 (safety): every participant pi has elected_i = ⊥ (not yet defined) or elected_i = P, where P is the process with the highest id
– E2 (liveness): all processes pi participate and eventually set elected_i ≠ ⊥, or crash
Elections (cont.)
Solutions:
– bully election algorithm
– ring-based election algorithm
Evaluation criteria:
– bandwidth (proportional to the total number of messages)
– turnaround time (the number of serialized message transmission times between initiation and termination of a single run)
Elections (cont.) — bully election
Assumptions:
– each process has an identifier
– processes can fail during an election
– communication is reliable
Goal:
– the surviving member with the largest identifier is elected as coordinator
Elections (cont.) — bully election
Roles for processes:
– coordinator: the elected process; has the highest identifier at the time of the election
– initiator: a process starting the election, for whatever reason
Elections (cont.) — bully election
Three types of messages:
– election message: sent by an initiator of the election to all processes with a higher identifier
– answer message: a reply sent by the receiver of an election message
– coordinator message: sent by the process becoming coordinator to all processes with lower identifiers
Elections (cont.) — bully election
Algorithm:
– sending an election message:
  the process doing so is called the initiator
  any process may do it at any time
  when a failed process is restarted, it starts an election, even if the current coordinator is still functioning (hence "bully")
– a process receiving an election message:
  replies with an answer message
  starts an election itself (why? it may have the highest identifier among the survivors)
Elections (cont.) — bully election
Algorithm:
– actions of an initiator:
  when not receiving an answer message within a certain time (2 Ttrans + Tprocess): it becomes coordinator
  when an answer message was received (a process with a higher identifier is active) but no coordinator message arrives (after x time units): it restarts the election
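A sketch of the bully algorithm's outcome (a deliberate simplification: direct calls replace messages and timeouts, so "no answer" is modeled as "no live higher process"):

```python
# Bully election over the set of currently live process ids.
def bully_election(initiator, alive):
    higher = sorted(p for p in alive if p > initiator)
    if not higher:                   # no answer within 2*T_trans + T_process:
        return initiator             # the initiator becomes coordinator
    # some higher process answered; the highest of them runs its own election
    return bully_election(higher[-1], alive)

coordinator = bully_election(1, alive={1, 2})   # P3 and P4 have failed
```

With P3 and P4 down, an election started by P1 ends with P2 as coordinator, matching the example that follows.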
Elections (cont.) — bully election
Example: election of P2 after the failure of P3 and P4 (processes P1–P4):
1. P1, as initiator, sends election messages to P2, P3 and P4;
2. P2 and P3 reply with answer messages and start elections of their own; P2's election messages arrive at P3 and P4;
3. P3 replies to P2 — but then P3 fails as well;
4. P1 times out without seeing a coordinator message: the election starts again;
5. P2 replies to P1 and starts a new election; this time P2 receives no replies (P3 and P4 are down);
6. P2 therefore becomes coordinator and announces this with a coordinator message.
Elections (cont.) — bully election
Evaluation:
– correctness: E1 & E2 hold, if
  communication is reliable
  no process replaces a crashed process
– no guarantee for E1, if
  a crashed process is replaced by a process with the same id
  the assumed timeout values are inaccurate (i.e. the failure detector is unreliable)
– performance:
  worst case: O(n²) messages
  optimal case: bandwidth n − 2 messages, turnaround 1 message
Elections (cont.) — ring-based election
Assumptions:
– processes are arranged in a logical ring
– each process has an identifier: i for Pi
– processes remain functional and reachable during the algorithm
Elections (cont.) — ring-based election
Messages:
– forwarded over the logical ring
– 2 types:
  election: used during the election; contains an identifier
  elected: used to announce the new coordinator
Process states:
– participant
– non-participant
Elections (cont.) — ring-based election
Algorithm:
– a process initiating an election
  becomes a participant
  sends an election message containing its identifier to its neighbour
Elections (cont.) — ring-based election
(Figure: ring of processes P1, P21, P8, P11, P14, P5; the initiator P11 sends an election message carrying id 11.)
Elections (cont.) — ring-based election
Algorithm:
– upon receiving an election message, a process compares identifiers:
  received: the identifier in the message
  own: the identifier of the process itself
– 3 cases:
  received > own
  received < own
  received = own
Elections (cont.) — ring-based election
Algorithm — receive election message:
– received > own:
  the message is forwarded unchanged
  the process becomes a participant
Elections (cont.) — ring-based election
(Figure: the election message carrying 11 is forwarded along the ring.)
Elections (cont.) — ring-based election
Algorithm — receive election message:
– received > own:
  the message is forwarded
  the process becomes a participant
– received < own, and the process is a non-participant:
  it substitutes its own identifier in the message
  the message is forwarded
  the process becomes a participant
Elections (cont.) — ring-based election
(Figure: P21 receives the message, substitutes its own id 21, and the message carrying 21 is forwarded around the ring.)
Elections (cont.) — ring-based election
Algorithm — receive election message:
– received > own: …
– received < own, non-participant: …
– received = own:
  this identifier must be the greatest
  the process becomes the coordinator
  its new state is non-participant
  it sends an elected message to its neighbour
Elections (cont.) — ring-based election
(Figure: the message carrying 21 has returned to P21, which becomes coordinator.)
Elections (cont.) — ring-based election
Algorithm — receive election message (the three cases together):
– received > own: forward the message; become a participant
– received < own and non-participant: substitute own identifier in the message; forward it; become a participant
– received = own: own identifier must be the greatest; become coordinator; new state: non-participant; send an elected message to the neighbour
Elections (cont.) — ring-based election
Algorithm — receive elected message:
– a participant:
  new state: non-participant
  forwards the message
– the coordinator:
  the election process is completed
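The election phase can be condensed into a short simulation (illustrative; the participant flag and concurrent elections are omitted, and the elected announcement is implicit in the return value):

```python
# Ring-based election: forward max(received, own) until received == own.
def ring_election(ring, start):
    token = ring[start]              # the initiator sends its own identifier
    i = (start + 1) % len(ring)
    while ring[i] != token:          # "received = own" ends the election
        token = max(token, ring[i])  # substitute own id if it is larger
        i = (i + 1) % len(ring)
    return token                     # this process becomes the coordinator

winner = ring_election([1, 21, 8, 11, 14, 5], start=3)   # initiator P11
```

On the slides' ring the id 21 survives every substitution and eventually returns to P21, which becomes coordinator.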
Elections (cont.) — ring-based election
(Figure: the elected message announcing coordinator 21 circulates once around the ring, turning each participant back into a non-participant, until it returns to P21.)
Elections (cont.) — ring-based election
Evaluation:
– why is the condition "received < own and the process is a non-participant" necessary? (see the next slide for the full algorithm)
– number of messages:
  worst case: 3n − 1
  best case: 2n
– concurrent elections: election messages are extinguished as soon as possible, before the winning result is announced
Elections (cont.) — ring-based election
Algorithm — receive election message (full):
– received > own: forward the message; become a participant
– received < own and non-participant: substitute own identifier in the message; forward it; become a participant
– received = own: own identifier must be the greatest; become coordinator; new state: non-participant; send an elected message to the neighbour
This chapter: overview — next: Multicast communication
Multicast communication
Essential property:
– one multicast operation ≠ multiple sends:
  higher efficiency
  stronger delivery guarantees
Operations (g = group, m = message):
– X-multicast(g, m)
– X-deliver(m), as distinct from receive(m)
– X denotes the additional property: Basic, Reliable, FIFO, …
Multicast communication (cont.) — IP multicast
Datagram operations with a multicast IP address
Failure model: cf. UDP
– omission failures
– no ordering or reliability guarantees
Multicast communication (cont.) — basic multicast
Basic multicast = IP multicast + a delivery guarantee, provided the multicasting process does not crash
Straightforward algorithm (with reliable one-to-one send):
  To B-multicast(g, m): for each p ∈ g: send(p, m)
  On receive(m) at p: B-deliver(m)
Exercise: a practical algorithm using IP-multicast
Multicast communication (cont.) — reliable multicast
Properties:
– integrity (safety): a correct process delivers a message at most once
– validity (liveness): if a correct process p multicasts m, then p delivers m
– agreement (liveness): if a correct process delivers m, then all correct processes will deliver m
– uniform agreement (liveness): if any process p (correct or failing) delivers m, then all correct processes will deliver m
Multicast communication (cont.)
Reliable multicast: 2 algorithms
1. using B-multicast
2. using IP-multicast + piggybacked acks
Multicast communication (cont.)
Reliable multicast: algorithm 1 with B-multicast
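The slide's figure (the algorithm itself) did not survive extraction. In the usual textbook version, each process re-B-multicasts a message to the group the first time it B-delivers it, before R-delivering; a sketch under that assumption, with the sender's crash simulated as a B-multicast cut short:

```python
# Reliable multicast, algorithm 1: echo every message on first B-delivery.
class RProcess:
    def __init__(self, name, group):
        self.name = name
        self.group = group        # shared list of all RProcess objects
        self.received = set()     # message ids already seen
        self.r_delivered = []
        self.crashed = False

    def b_deliver(self, sender, mid, m):
        if self.crashed or mid in self.received:
            return
        self.received.add(mid)
        if sender is not self:    # echo to the group before R-delivering
            for q in self.group:
                q.b_deliver(self, mid, m)
        self.r_delivered.append(m)

    def r_multicast(self, mid, m, crash_after=None):
        """B-multicast(g, m); crash_after limits how many sends complete."""
        self.received.add(mid)
        for i, q in enumerate(self.group):
            if crash_after is not None and i >= crash_after:
                self.crashed = True          # sender crashes mid-multicast
                return
            q.b_deliver(self, mid, m)
        self.r_delivered.append(m)

group = []
procs = [RProcess(n, group) for n in ("p", "q", "r")]
group.extend(procs)
# Sender p crashes after its sends reached only itself and q;
# q's echo still carries the message to r, so agreement holds.
procs[0].r_multicast("m1", "data", crash_after=2)
```

Because every process echoes every message, each message crosses the network about |g| times per member, which is the inefficiency the next slide points out.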
Multicast communication (cont.)
Reliable multicast: algorithm 1 with B-multicast
Correct?
– Integrity, Validity, Agreement: yes
Efficient?
– NO: each message is transmitted |g| times
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast
Data structures at process p:
– S_g^p: sequence number for group g
– R_g^q: sequence number of the latest message it has delivered from q
On initialization: S_g^p = 0
For process p to R-multicast message m to group g:
– IP-multicast(g, <m, S_g^p, acks>), piggybacking the acknowledgements <q, R_g^q>
– S_g^p++
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast
On IP-deliver(<m, S, acks>) at q from p:
– if S = R_g^p + 1 then R-deliver(m); R_g^p++; check hold-back queue
– else if S > R_g^p + 1 then store m in hold-back queue; request missing messages
– endif
– if a piggybacked R exceeds the corresponding local R_g^q then request missing messages endif
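The sequence-number check with its hold-back queue can be sketched as follows (retransmission requests and piggybacked acks are omitted; the `Receiver` class is illustrative):

```python
# Deliver <m, S> from p only when S = R_p + 1; otherwise hold it back.
class Receiver:
    def __init__(self):
        self.R = {}           # R[p]: latest sequence number delivered from p
        self.holdback = {}    # holdback[(p, S)] = m
        self.delivered = []

    def ip_deliver(self, p, S, m):
        expected = self.R.get(p, -1) + 1
        if S == expected:
            self.delivered.append(m)
            self.R[p] = S
            self._drain(p)                # check hold-back queue
        elif S > expected:
            self.holdback[(p, S)] = m     # store; would request missing msgs

    def _drain(self, p):
        nxt = self.R[p] + 1
        while (p, nxt) in self.holdback:  # release now-deliverable messages
            self.delivered.append(self.holdback.pop((p, nxt)))
            self.R[p] = nxt
            nxt += 1

q = Receiver()
q.ip_deliver("p", 1, "m1")   # out of order: held back
q.ip_deliver("p", 0, "m0")   # delivers m0, then drains m1
```

The same per-sender rule reappears later in the FIFO-multicast algorithm, which is this check minus the retransmission machinery.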
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast
Correct?
– Integrity: sequence numbers + checksums
– Validity: holds if missing messages are detected
– Agreement: holds if a copy of every message remains available for retransmission
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
3 processes in the group: P, Q, R
State of a process:
– S: next sequence number
– R_q: latest sequence number delivered from q
– stored messages
Presentation of the state, e.g. at P: "P: 2, Q: 3, R: 5" = own S followed by R_Q and R_R
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
Initial state:
– P: S = 0; R_Q = -1; R_R = -1
– Q: S = 0; R_P = -1; R_R = -1
– R: S = 0; R_P = -1; R_Q = -1
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
First multicast by P: P IP-multicasts <m_p0, 0, ...> and increments S
– P: S = 1; R_Q = -1; R_R = -1; stores <m_p0, 0, ...>
– Q: S = 0; R_P = -1; R_R = -1
– R: S = 0; R_P = -1; R_Q = -1
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
Arrival of P's multicast at Q: Q delivers m_p0
– P: S = 1; R_Q = -1; R_R = -1; stores <m_p0, 0, ...>
– Q: S = 0; R_P = 0; R_R = -1
– R: S = 0; R_P = -1; R_Q = -1
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
New state:
– P: S = 1; R_Q = -1; R_R = -1
– Q: S = 0; R_P = 0; R_R = -1
– R: S = 0; R_P = -1; R_Q = -1
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
Multicast by Q: Q IP-multicasts <m_q0, 0, ...> and increments S
– P: S = 1; R_Q = -1; R_R = -1
– Q: S = 1; R_P = 0; R_R = -1; stores <m_q0, 0, ...>
– R: S = 0; R_P = -1; R_Q = -1
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
Arrival of Q's multicast: P and R deliver m_q0; the piggybacked ack R_P = 0 reveals to R that it has missed a message from P
– P: S = 1; R_Q = 0; R_R = -1
– Q: S = 1; R_P = 0; R_R = -1; stores <m_q0, 0, ...>
– R: S = 0; R_P = -1; R_Q = 0
Multicast communication (cont.)
Reliable multicast: algorithm 2 with IP-multicast: example
When to delete stored messages?
– P: S = 1; R_Q = 0; R_R = -1
– Q: S = 1; R_P = 0; R_R = -1
– R: S = 0; R_P = -1; R_Q = 0
Multicast communication (cont.)
Ordered multicast: FIFO, Causal, Total
– FIFO: if a correct process P performs multicast(g, m) and then multicast(g, m'), then every correct process that delivers m' delivers m before m'
– Causal: if multicast(g, m) happens before multicast(g, m'), then every correct process that delivers m' delivers m before m'
Multicast communication (cont.)
Ordered multicast
– Total: if some correct process delivers m before m', then every correct process that delivers m' delivers m before m'
– FIFO-Total = FIFO + Total
– Causal-Total = Causal + Total
– Atomic = Reliable + Total
Multicast communication (cont.)
Ordered multicast
(figure caption) Notice the consistent ordering of the totally ordered messages T_1 and T_2, the FIFO-related messages F_1 and F_2 and the causally related messages C_1 and C_3, and the otherwise arbitrary delivery ordering of messages.
Multicast communication (cont.)
FIFO multicast
Alg. 1: the R-multicast algorithm using IP-multicast is already FIFO-ordered
– correct, because the sender assigns S_g^p and receivers deliver in that order
Alg. 2: on top of any B-multicast
Multicast communication (cont.)
FIFO multicast: algorithm 2 on top of any B-multicast
Data structures at process p:
– S_g^p: sequence number
– R_g^q: sequence number of the latest message it has delivered from q
On initialization: S_g^p = 0; R_g^q = -1
For process p to FO-multicast message m to group g:
– B-multicast(g, <m, S_g^p>)
– S_g^p++
Multicast communication (cont.)
FIFO multicast: algorithm 2 on top of any B-multicast
On B-deliver(<m, S>) at q from p:
– if S = R_g^p + 1 then FO-deliver(m); R_g^p++; check hold-back queue
– else if S > R_g^p + 1 then store m in hold-back queue
– endif
Multicast communication (cont.)
TOTAL multicast
Basic approach:
– sender: assign totally ordered identifiers instead of per-sender ids
– receiver: deliver as for FIFO ordering
Alg. 1: use a (single) sequencer process
Alg. 2: participants collectively agree on the assignment of sequence numbers
Multicast communication (cont.)
TOTAL multicast: sequencer process
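The sequencer figure is missing from the extracted slide. In the usual scheme, senders B-multicast the message to the group and the sequencer, the sequencer B-multicasts an order message (message id, sequence number), and members deliver data in sequence order. A sketch under that assumption (class names and message shapes are illustrative):

```python
# Sequencer-based total ordering: one process assigns all sequence numbers.
class Sequencer:
    def __init__(self):
        self.next_seq = 0

    def on_message(self, mid):
        s, self.next_seq = self.next_seq, self.next_seq + 1
        return (mid, s)            # order message, B-multicast to the group

class Member:
    def __init__(self):
        self.next_expected = 0
        self.pending = {}          # seq -> message id
        self.messages = {}         # message id -> body
        self.delivered = []

    def on_data(self, mid, m):     # data message from a sender
        self.messages[mid] = m
        self._drain()

    def on_order(self, mid, seq):  # order message from the sequencer
        self.pending[seq] = mid
        self._drain()

    def _drain(self):              # deliver in the sequencer's order only
        while (self.next_expected in self.pending
               and self.pending[self.next_expected] in self.messages):
            mid = self.pending.pop(self.next_expected)
            self.delivered.append(self.messages[mid])
            self.next_expected += 1

seq = Sequencer()
a, b = Member(), Member()
for mem in (a, b):                 # two concurrent senders' data arrives
    mem.on_data("x", "from p")
    mem.on_data("y", "from q")
ox = seq.on_message("x")
oy = seq.on_message("y")
a.on_order(*ox); a.on_order(*oy)   # order messages may interleave differently
b.on_order(*oy); b.on_order(*ox)   # at each member; delivery order is the same
```

Both members deliver in the sequencer's order regardless of arrival order, which is exactly why the sequencer is also the bottleneck and single point of failure noted on the next slide.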
Multicast communication (cont.)
TOTAL multicast: sequencer process
Correct? yes
Problems?
– the single sequencer process is a bottleneck and a single point of failure
Multicast communication (cont.)
TOTAL multicast: ISIS algorithm
Approach:
– sender: B-multicasts the message
– receivers: propose sequence numbers to the sender
– sender: uses the returned proposals to generate the agreed sequence number
Multicast communication (cont.)
TOTAL multicast: ISIS algorithm
(figure: P_1 B-multicasts a message to P_2, P_3 and P_4; each member replies with its proposed sequence number; P_1 then multicasts the agreed sequence number, the largest of the proposals)
Multicast communication (cont.)
TOTAL multicast: ISIS algorithm
Data structures at process p:
– A_g^p: largest agreed sequence number observed so far
– P_g^p: largest sequence number proposed by p
On initialization: P_g^p = 0
For process p to TO-multicast message m to group g:
– B-multicast(g, <m, i>)  (i = unique id for m)
On B-deliver(<m, i>) at q from p:
– P_g^q := max(A_g^q, P_g^q) + 1
– send(p, <i, P_g^q>)
– store <m, i, P_g^q> in hold-back queue
Multicast communication (cont.)
TOTAL multicast: ISIS algorithm
On receive(<i, P>) at p from q:
– wait for all replies; let a be the largest reply
– B-multicast(g, <i, a>)
On B-deliver(<i, a>) at q from p:
– A_g^q := max(A_g^q, a)
– attach a to message i in the hold-back queue
– reorder the hold-back queue to increasing sequence numbers
– while the message m at the front of the hold-back queue has been assigned an agreed sequence number do: remove m from the hold-back queue; TO-deliver(m)
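The sequence-number agreement at the heart of the algorithm (each receiver proposes max(A, P) + 1, the sender agrees on the largest proposal, receivers update A) can be sketched in isolation; the hold-back queue and delivery are as on the previous slides, and the class names here are illustrative:

```python
# ISIS agreed-sequence-number negotiation, numbering logic only.
class IsisMember:
    def __init__(self):
        self.A = 0     # largest agreed sequence number seen (A_g^q)
        self.P = 0     # largest sequence number proposed by this member (P_g^q)

    def propose(self):
        self.P = max(self.A, self.P) + 1
        return self.P

    def on_agreed(self, a):
        self.A = max(self.A, a)

def to_multicast(members):
    """Sender side: collect all proposals, agree on the largest."""
    agreed = max(m.propose() for m in members)
    for m in members:
        m.on_agreed(agreed)
    return agreed

group = [IsisMember() for _ in range(3)]
s1 = to_multicast(group)    # first message
s2 = to_multicast(group)    # second message: strictly larger everywhere
```

Taking the maximum of all proposals makes every member's next proposal exceed every agreed number it has seen, which is what keeps the agreed sequence numbers monotonically increasing.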
Multicast communication (cont.)
TOTAL multicast: ISIS algorithm
Correct?
– processes will agree on the sequence number for a message
– sequence numbers are monotonically increasing
– no process can prematurely deliver a message
Performance:
– 3 serial messages per multicast!
Total ordering alone is neither FIFO nor causal
Multicast communication (cont.)
Causal multicast
Limitations:
– causal order induced only by multicast operations
– non-overlapping, closed groups
Approach:
– use vector timestamps
– timestamp = count of multicast messages
Multicast communication (cont.)
Causal multicast: vector timestamps
Meaning?
Multicast communication (cont.)
Causal multicast: vector timestamps
Correct?
– message m carries timestamp V, m' carries timestamp V'
– given multicast(g, m) happens before multicast(g, m'), prove that V < V'
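The algorithm's figure is missing from the extracted slide; the standard delivery condition is to deliver a message from sender j with timestamp V once V[j] = local[j] + 1 and V[k] <= local[k] for every other k, i.e. once all causal predecessors have been delivered. A sketch under that assumption (the receiver class is illustrative):

```python
# Causal multicast delivery with vector timestamps of multicast counts.
class CausalReceiver:
    def __init__(self, n):
        self.vc = [0] * n          # delivered-multicast count per sender
        self.holdback = []         # (sender, V, m) awaiting predecessors
        self.delivered = []

    def co_deliver(self, j, V, m):
        self.holdback.append((j, V, m))
        progress = True
        while progress:            # re-scan until no message becomes deliverable
            progress = False
            for entry in list(self.holdback):
                sj, sV, sm = entry
                ok = sV[sj] == self.vc[sj] + 1 and all(
                    sV[k] <= self.vc[k]
                    for k in range(len(self.vc)) if k != sj)
                if ok:
                    self.holdback.remove(entry)
                    self.delivered.append(sm)
                    self.vc[sj] += 1
                    progress = True

r = CausalReceiver(3)
# p0 multicasts m0 (timestamp [1,0,0]); p1, having delivered m0, multicasts
# m1 (timestamp [1,1,0]). m1 arrives first and is held back until m0 arrives.
r.co_deliver(1, [1, 1, 0], "m1")
r.co_deliver(0, [1, 0, 0], "m0")
```

The hold-back of m1 until m0 arrives is precisely the causal guarantee: a message is never delivered before the multicasts that happened before it.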
This chapter: overview
– Introduction
– Logical clocks
– Global states
– Failure detectors
– Mutual exclusion
– Elections
– Multicast communication
– Consensus and related problems
Consensus & related problems
System model:
– N processes p_i
– communication is reliable
– processes may fail: crash or Byzantine failures
– no message signing (signing limits the harm a faulty process can do)
Problems:
– consensus
– Byzantine generals
– interactive consistency
Consensus
Problem statement:
– each p_i starts undecided and proposes a value v_i
– message exchanges
– finally each p_i sets a decision variable d_i, enters the decided state and may no longer change d_i
Requirements:
– Termination: eventually each correct process p_i sets d_i
– Agreement: if p_i and p_j are correct and in the decided state, then d_i = d_j
– Integrity: if the correct processes all propose the same value d, then any correct process in the decided state has chosen d
Consensus
Simple algorithm (no process failures):
– collect the processes in a group g
– each process p_i: R-multicast(g, v_i); collect the N values; d = majority(v_1, v_2, ..., v_N)
Problems with failures:
– crash: detectable? not in an asynchronous system
– Byzantine: a faulty process can send different values to different processes
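The failure-free algorithm is small enough to write out. Reliable multicast guarantees every process collects the same multiset, so applying the same majority function yields the same decision everywhere (returning None when no strict majority exists is an illustrative choice for this sketch):

```python
# Failure-free consensus: everyone multicasts, collects N values, takes majority.
from collections import Counter

def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None   # None: no majority

# Values each process R-multicasts and therefore collects identically:
proposals = ["attack", "retreat", "attack"]
d = majority(proposals)    # every correct process computes the same d
```

With no failures this satisfies Termination, Agreement and Integrity trivially; the rest of the section is about what breaks once crashes or Byzantine processes enter the picture.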
Byzantine generals
Problem statement:
– informal: agree to attack or to retreat; the commander issues the order; the lieutenants are to decide to attack or retreat; all can be 'treacherous'
– formal: one process proposes a value; the others are to agree
Requirements:
– Termination: each correct process eventually decides
– Agreement: all correct processes select the same value
– Integrity: if the commander is correct, the other correct processes select the commander's value
Interactive consistency
Problem statement:
– the correct processes agree upon a vector of values (one value for each process)
Requirements:
– Termination: each correct process eventually decides
– Agreement: all correct processes decide on the same vector
– Integrity: if p_i is correct, then all correct processes decide on v_i as the i-th component of their vector
Related problems & solutions
Basic solutions:
– consensus: C_i(v_1, v_2, ..., v_N) = decision value of p_i
– Byzantine generals (j is the commander, proposing v): BG_i(j, v) = decision value of p_i
– interactive consistency: IC_i(v_1, v_2, ..., v_N)[j] = j-th value in the decision vector of p_i, with v_1, v_2, ..., v_N the values proposed by the processes
Related problems & solutions
Derived solutions:
– IC from BG: run BG N times, once with each process as commander; IC_i(v_1, v_2, ..., v_N)[j] = BG_i(j, v_j)
– C from IC: run IC to produce a vector of values; derive a single value with an appropriate function
– BG from C: (next slide)
Related problems & solutions
Derived solutions:
– BG from C: commander p_j sends its value v to itself and the other processes; each process runs C with the values v_1, v_2, ..., v_N it received; BG_i(j, v) = C_i(v_1, v_2, ..., v_N)
Consensus in a synchronous system
Assumptions:
– use B-multicast
– at most f of the N processes may fail
Approach:
– proceed in f + 1 rounds; in each round the correct processes B-multicast values
Variables:
– Values_i^r = set of proposed values known to process p_i at the beginning of round r
Consensus in a synchronous system
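The figure with the algorithm did not survive extraction; a sketch of the f + 1-round exchange, with a crash simulated as a B-multicast that reaches only a prefix of the group (the function and its crash model are illustrative assumptions, and min is used as the shared decision function):

```python
# Synchronous consensus in f + 1 rounds of B-multicasts of known values.
def run_consensus(proposals, f, crashes):
    """crashes maps round -> {pid: number of sends completed before crashing}."""
    n = len(proposals)
    values = [{v} for v in proposals]        # Values_i at the start
    crashed = set()
    for rnd in range(1, f + 2):              # f + 1 rounds
        incoming = [set() for _ in range(n)]
        for i in range(n):
            if i in crashed:
                continue
            limit = crashes.get(rnd, {}).get(i, n)
            for j in range(limit):           # a crash cuts the multicast short
                incoming[j] |= values[i]
            if limit < n:
                crashed.add(i)
        for j in range(n):
            if j not in crashed:
                values[j] |= incoming[j]     # merge values learned this round
    # surviving processes apply the same decision function to their sets
    return [min(values[j]) if j not in crashed else None for j in range(n)]

# p0 proposes 1 but crashes in round 1 after its sends reached only itself and
# p1; with f = 1 there is a second round in which p1 relays the value 1 to p2.
decisions = run_consensus([1, 2, 3], f=1, crashes={1: {0: 2}})
```

The extra round is what lets p1 pass on the value it alone received, so the surviving processes end the final round with identical sets.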
Consensus in a synchronous system
Termination?
– guaranteed by the synchronous system
Correct?
– each correct process arrives at the same set of values at the end of the final round
– proof sketch: assume the sets differ, i.e. p_i has a value v that p_j lacks; then in the final round some p_k managed to send v to p_i but not to p_j, so p_k crashed, and the same reasoning forces a crash in every earlier round; that requires f + 1 failures, contradicting the assumption of at most f
Agreement & Integrity?
– all processes apply the same decision function to the same set
Byzantine generals in a synchronous system
Assumptions:
– arbitrary (Byzantine) failures
– at most f of the N processes may be faulty
– channels are reliable: no message injections
– unsigned messages
Byzantine generals in a synchronous system
Impossibility with 3 processes (f = 1); in general, impossibility with N <= 3f processes
(figure: left, the commander p_1 sends 1:v to both lieutenants but faulty p_3 relays 3:1:u to p_2; right, a faulty commander p_1 sends different values w and x to p_2 and p_3; p_2 cannot distinguish the two scenarios; faulty processes are shown shaded)
Byzantine generals in a synchronous system
Solution with one faulty process (N = 4, f = 1)
2 rounds of messages:
– the commander sends its value to each of the lieutenants
– each lieutenant sends the value it received to its peers
Each lieutenant thus receives the value of the commander plus N - 2 peer values and applies the majority function
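The two-round exchange for N = 4 can be sketched directly (the concrete values relayed by the faulty lieutenant are arbitrary; the strict-majority helper is an illustrative choice):

```python
# Byzantine generals, N = 4, f = 1: majority over commander + relayed values.
from collections import Counter

def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None

# Round 1: a correct commander p1 sends "v" to lieutenants p2, p3, p4.
commander_value = "v"
# Round 2: each lieutenant relays what it received; faulty p3 lies to both.
relayed_to_p2 = {"p3": "u", "p4": "v"}
relayed_to_p4 = {"p3": "w", "p2": "v"}

decision_p2 = majority([commander_value] + list(relayed_to_p2.values()))
decision_p4 = majority([commander_value] + list(relayed_to_p4.values()))
```

Each correct lieutenant sees two copies of v against one lie, so the majority is v; with N = 3 there would be only one relayed value and no majority to fall back on, which is the impossibility on the previous slide.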
Byzantine generals in a synchronous system
Solution with one faulty process
(figures: left, the commander p_1 sends v to all lieutenants; faulty p_3 relays wrong values, yet p_2 and p_4 each see two copies of v and decide v by majority; right, a faulty commander sends three different values u, v, w; the lieutenants each collect {u, v, w}, find no majority and agree on a default; faulty processes are shown shaded)
Consensus & related problems
Impossibility in asynchronous systems: it is proved that no algorithm exists that is guaranteed to reach consensus; hence there is no guaranteed solution to
– the Byzantine generals problem
– interactive consistency
– reliable & totally ordered multicast
Approaches to work around the impossibility:
– masking faults: restart crashed processes and use persistent storage
– using failure detectors: make failures fail-silent by discarding messages from suspected processes
This chapter: overview
– Introduction
– Logical clocks
– Global states
– Failure detectors
– Mutual exclusion
– Elections
– Multicast communication
– Consensus and related problems
Distributed Systems: Distributed algorithms