Published by Conrad Fisher. Modified over 9 years ago.
1
Fault tolerance and related issues in distributed computing
Shmuel Zaks, zaks@cs.technion.ac.il
GSSI - Feb 2016
2
Part 0: An overview
Part 1: Lower bounds
Part 2: Computing in spite of faults
Part 3: Detecting faults
Part 4: Self-stabilization
3
The snapshot algorithm (Chandy and Lamport)
6
Goal: design a snapshot (= global-state-detection) algorithm that:
will record a collection of states of all system components (which forms a global system state),
will not change the underlying computation,
will not freeze the underlying computation.
7
A Process Can…
record its own state,
send and receive messages,
record messages it sends and receives,
cooperate with other processes.
Processes do not share clocks or memory.
Processes cannot record their state precisely at the same instant.
8
Motivation
Many problems in distributed systems can be stated in terms of the problem of detecting global states.
Stable property detection problems: termination detection, deadlock detection, etc.
9
Stable Property Detection Problem
D - a distributed system
y - a predicate function defined on the set of global states of D
S, S' - global states of D
y is stable if y(S) implies y(S') for all S' reachable from S
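The definition of a stable predicate can be checked mechanically on a small finite system. The following is a minimal sketch, assuming the global states and their successor relation are given explicitly as a dictionary; the names `reachable`, `is_stable`, `succ`, `terminated` and `in_state_1` are hypothetical and do not come from the slides.

```python
from collections import deque

def reachable(succ, start):
    """Return the set of states reachable from start (including start itself)."""
    seen = {start}
    queue = deque([start])
    while queue:
        s = queue.popleft()
        for t in succ.get(s, ()):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

def is_stable(succ, y):
    """y is stable iff y(S) implies y(S') for every S' reachable from S."""
    states = set(succ) | {t for ts in succ.values() for t in ts}
    return all(y(t) for s in states if y(s) for t in reachable(succ, s))

# Hypothetical toy system: global states 0..3, where state 3 has no successors.
succ = {0: [1, 2], 1: [3], 2: [3], 3: []}
terminated = lambda s: s == 3   # once true, it stays true
in_state_1 = lambda s: s == 1   # true in state 1 but false in its successor 3

print(is_stable(succ, terminated))
print(is_stable(succ, in_state_1))
```

In this toy system "terminated" is stable, while "the system is in state 1" is not, because state 3 is reachable from state 1.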
10
Many distributed algorithms are structured as a sequence of phases.
A phase: a transient part, then a stable part.
Phase termination vs. computation termination.
Our view on the problem:
i. detect the termination of a phase
ii. initiate a new phase
Notice that "the kth phase has terminated" is a stable property.
11
Model
Distributed system D is a finite, labeled, directed graph.
[Figure: processes p and q connected by channels C1 and C2]
Channels have infinite buffers, are error-free and preserve FIFO order.
Message delay is bounded, but unknown.
12
State of a Channel
[Figure: processes p and q with channels C1 and C2; messages 1, 2, 3 sent along C1]
[1, 2, 3] - the sequence X of messages that were sent
[1] - the sequence Y of received messages (a prefix of X)
[2, 3] - the state of C1: X \ Y
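The channel state X \ Y can be computed directly from the two sequences. This is a small sketch under the FIFO assumption of the model; the function name `channel_state` is hypothetical.

```python
def channel_state(sent, received):
    """Return X minus Y: the messages sent but not yet received on a FIFO channel."""
    # On a FIFO, error-free channel the received sequence is a prefix of the sent one.
    if sent[:len(received)] != received:
        raise ValueError("received must be a prefix of sent on a FIFO channel")
    return sent[len(received):]

X = [1, 2, 3]   # messages sent along C1
Y = [1]         # messages already received from C1
print(channel_state(X, Y))  # [2, 3]
```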
13
Example: System
Distributed system: processes p and q connected by channels C1 and C2.
Initial global state: the two processes are in states A and B, and both channels are empty (Ø).
State transitions (same for p and q): A → B on send, B → A on receive.
14
Global state transition diagram (deterministic)
[Figure: global states of the example system, connected by the transitions p sends, q receives, q sends, p receives]
A computation corresponds to a path in the diagram.
15
Example: System
Distributed system: processes p and q connected by channels C1 and C2.
State transitions:
p: A → B on send, B → A on receive
q: C → D on send, D → C on receive
16
Global state transition diagram (non-deterministic)
[Figure: global states of the example system, connected by the transitions p sends, q sends, p receives, q receives]
17
In the snapshot algorithm:
each process records its own state;
p and q cooperate to record the state of the channel C between them.
18
Example: System
[Figure: p, C and q are recorded at different moments (Record p, Record C, Record q); the recorded state contains no token]
19
Example: System
[Figure: p, C and q are recorded at different moments (Record p, Record C, Record q); the recorded state contains two tokens]
20
q will record the state of C.
q starts recording C after it records its own state.
p and q have to coordinate, using a special marker.
q stops recording when it receives the marker from p.
But: how does q know when to record its state?
21
The snapshot algorithm
Who starts? We assume one process.
Homework: extend the discussion and the proof to any number of starters.
22
(Intuition for the Algorithm)
Who will record the state of channel C (from p to q)? q.
How does q know when to stop recording? p sends a marker right after it records its state, and before sending any other message.
q starts recording after it records its own state.
23
The snapshot algorithm: channel recording (channel C from p to q)
Starts when q records its own state.
Ends when q receives a marker along C.
Note: for any q ≠ p_0, the channel along which the marker arrived first is recorded as empty (Ø).
24
The snapshot algorithm: state recording
p_0 starts.
p_0 records its state, and then broadcasts a marker.
Shout-algorithm = PI (Propagation-of-information) = hot potato = …
When q receives a marker for the first time, it records its own state.
25
The snapshot algorithm
Marker-Sending Rule for a process p:
1. record the state of p
2. send a marker along each outgoing channel c before sending any other message along c
Marker-Receiving Rule for a process q, on receiving a marker along channel c:
if q's state is not recorded:
1. record q's state (following the Marker-Sending Rule);
2. record c's state = Ø;
else: c's state is the sequence of messages received along c since q recorded its state
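The two marker rules can be exercised on the two-process token system from the earlier example. This is an illustrative sketch, not the slides' own code: the class `Snapshot`, the constants `MARKER` and `TOKEN`, and the convention that state "A" means "holds the token" are all assumptions made here for the example.

```python
from collections import deque

MARKER = "MARKER"   # hypothetical marker value, distinct from application messages
TOKEN = "token"     # the only application message in this toy system

class Snapshot:
    """Token system (A = holds the token, B = does not) with the marker rules."""

    def __init__(self, states, channels):
        self.state = dict(states)                    # process -> current state
        self.chan = {c: deque() for c in channels}   # (src, dst) -> FIFO queue
        self.rec_state = {}                          # process -> recorded state
        self.rec_chan = {}                           # channel -> recorded messages
        self.open = set()                            # channels still being recorded

    def send_token(self, src, dst):
        assert self.state[src] == "A"
        self.state[src] = "B"
        self.chan[(src, dst)].append(TOKEN)

    def record(self, p):
        """Marker-sending rule: record p's state, then put a marker on each outgoing channel."""
        self.rec_state[p] = self.state[p]
        for c in self.chan:
            if c[0] == p:
                self.chan[c].append(MARKER)   # sent before any later message of p
            if c[1] == p:
                self.rec_chan[c] = []         # start recording each incoming channel
                self.open.add(c)

    def receive(self, src, dst):
        """Deliver the head of channel (src, dst), applying the marker-receiving rule."""
        c = (src, dst)
        msg = self.chan[c].popleft()
        if msg == MARKER:
            if dst not in self.rec_state:
                self.record(dst)              # first marker: c ends up recorded as empty
            self.open.discard(c)              # recording of c ends with the marker
        else:
            if c in self.open:
                self.rec_chan[c].append(msg)  # arrived after dst recorded its state
            self.state[dst] = "A"

# q initiates the snapshot while the token is in transit from p to q.
s = Snapshot({"p": "A", "q": "B"}, [("p", "q"), ("q", "p")])
s.send_token("p", "q")
s.record("q")
s.receive("p", "q")   # the token, recorded as part of channel (p, q)
s.receive("q", "p")   # the marker: p records its state and sends its own marker
s.receive("p", "q")   # the marker: q stops recording channel (p, q)
print(s.rec_state, s.rec_chan)
```

In this run both processes are recorded in state B and the token is recorded inside the channel from p to q, so the recorded global state contains exactly one token.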
26
Termination
Assumption: no marker remains forever in an input channel.
Claim: if the graph is strongly connected and at least one process records its state, then all processes will record their state in finite time.
Proof: by induction.
27
The Recorded Global State
Ex: System
Processes p and q connected by channels C1 and C2.
State transitions:
p: A → B on send, B → A on receive
q: C → D on send, D → C on receive
28
[Figure: an execution of the example system (p sends, q sends, p receives) with the global states along it]
29
What did we get?
30
Event e in process p is an atomic action: it can change the state of p and the state of at most one channel c incident on p (by sending/receiving a message M along c).
e is defined by e = <p, s, s', M, c>.
e may occur in global state S if
1. the state of p in S is s.
2. a. if c is directed towards p: c's state has M at its head, and M is deleted after applying e.
   b. if c is directed from p: c's state has M at its tail after applying e.
3. the state of p after applying e is s'.
31
Process State and Global State
A process: a set of states, an initial state, and a set of events.
A global state S: a collection of process states and channel states.
Initially, each process is in its initial state and all channels are empty.
next(S, e) is the global state after event e is applied to global state S.
32
Process State and Global State
seq = (e_i : i = 0…n) is a computation of the system iff
e_i may occur in S_i,
S_{i+1} = next(S_i, e_i)
(S_0 is the initial global state)
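The definitions of an event, of next(S, e) and of a computation translate almost literally into code. The sketch below is one possible encoding, assuming a global state is a pair (process states, channel contents) and an event is a 5-tuple (process, state before, state after, message, channel); the names `Event`, `may_occur`, `next_state` and `is_computation` are hypothetical.

```python
from typing import NamedTuple, Optional

class Event(NamedTuple):
    p: str               # process in which the event occurs
    s: str               # state of p before the event
    s2: str              # state of p after the event (s')
    M: Optional[str]     # message sent or received (None if no message)
    c: Optional[tuple]   # affected channel as (src, dst) (None if no channel)

def may_occur(S, e):
    """Check whether event e may occur in global state S = (process states, channels)."""
    procs, chans = S
    if procs[e.p] != e.s:
        return False
    if e.c is not None and e.c[1] == e.p:   # c directed towards p: M must be at the head
        return len(chans[e.c]) > 0 and chans[e.c][0] == e.M
    return True

def next_state(S, e):
    """Global state after applying e to S (assumes may_occur(S, e) holds)."""
    procs, chans = S
    procs = {**procs, e.p: e.s2}
    chans = dict(chans)
    if e.c is not None:
        if e.c[1] == e.p:
            chans[e.c] = chans[e.c][1:]        # receive: delete M from the head
        else:
            chans[e.c] = chans[e.c] + (e.M,)   # send: append M at the tail
    return procs, chans

def is_computation(S0, seq):
    """seq is a computation iff each e_i may occur in S_i, with S_{i+1} = next(S_i, e_i)."""
    S = S0
    for e in seq:
        if not may_occur(S, e):
            return False
        S = next_state(S, e)
    return True

C1, C2 = ("p", "q"), ("q", "p")
S0 = ({"p": "A", "q": "B"}, {C1: (), C2: ()})
send = Event("p", "A", "B", "token", C1)   # p sends the token along C1
recv = Event("q", "B", "A", "token", C1)   # q receives it from C1

print(is_computation(S0, [send, recv]))
print(is_computation(S0, [recv, send]))
```

The second sequence is rejected because the receive event cannot occur while C1 is still empty.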
33
The Recorded Global State
seq = (e_i : i ≥ 0) - a distributed computation
S_i - the state of the system right before e_i occurs
S_0 - the initial state of the system
S_t - the state of the system at the termination of the algorithm
S* - the recorded global state
34
Definition
Event e_j is called pre-recording if e_j is in a process p and p records its state after e_j in seq.
Event e_j is called post-recording if e_j is in a process p and p records its state before e_j in seq.
Assume that e_{j-1} is a post-recording event that occurs before a pre-recording event e_j in seq.
35
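Lemma: the events e_{j-1} and e_j can be interchanged in seq, and the resulting sequence is also a computation that reaches the same global state.
Proof: e_{j-1} occurs in p and e_j in q, and q ≠ p (since e_{j-1} is post-recording and e_j is pre-recording).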
36
The only scenario that might prevent interchanging the two events is that a message M is sent at e_{j-1} and received at e_j. But this is not possible: if M is sent at e_{j-1}, then M is sent after p recorded its state (e_{j-1} is post-recording), so a marker was sent to q before M. So when M is received at e_j, q has already recorded its state, and e_j is post-recording, a contradiction!
37
Hence, event e_j can occur in global state S_{j-1}. The state of process p is not altered by e_j, hence e_{j-1} can occur after e_j.
38
We have to show that the states of all processes and channels are the same in S_2 and S_4. This clearly holds for processes and channels that do not take part in e_{j-1} and e_j.
39
States: the states of p and q in S_2 and in S_4 are the same.
Channels: whether e_{j-1} or e_j sends or receives a message along a channel (or does neither), the same is done in both scenarios, so the states of the channels in S_2 and S_4 are the same.
(End of proof.)
40
(The Recorded Global State)
41
Proof
Using the lemma, swap the events till all post-recording events appear after all pre-recording events. The acquired computation is seq'.
All that is left to show: S* is the global state after all pre-recording events and before all post-recording events.
1. Process states
2. Channel states
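The swapping step of the proof can be pictured as a bubble sort that only ever interchanges an adjacent (post-recording, pre-recording) pair. The sketch below assumes each event is given as a triple (name, process, kind) with kind "pre" or "post"; the function `reorder` and the sample sequence are hypothetical illustrations.

```python
def reorder(seq):
    """Move all pre-recording events before all post-recording ones by adjacent swaps."""
    seq = list(seq)
    swapped = True
    while swapped:
        swapped = False
        for j in range(1, len(seq)):
            if seq[j - 1][2] == "post" and seq[j][2] == "pre":
                # As in the lemma, such a pair always lies in two different processes.
                assert seq[j - 1][1] != seq[j][1]
                seq[j - 1], seq[j] = seq[j], seq[j - 1]
                swapped = True
    return seq

seq = [
    ("e0", "p", "pre"),
    ("e1", "p", "post"),
    ("e2", "q", "pre"),
    ("e3", "p", "post"),
    ("e4", "q", "pre"),
    ("e5", "q", "post"),
]
seq2 = reorder(seq)
print([name for name, _, _ in seq2])  # ['e0', 'e2', 'e4', 'e1', 'e3', 'e5']
```

Because only adjacent pairs from different processes are swapped, the order of events inside each process is unchanged, which is what lets the lemma be applied at every step.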
42
Claim: the state of a channel in S* is (the sequence of messages corresponding to pre-recording sends) minus (the sequence of messages corresponding to pre-recording receives).
Proof: the state of channel c from process p to process q recorded in S* is the sequence of messages received on c by q after q records its state and before q receives a marker on c. The sequence of messages sent by p before the marker is the sequence corresponding to pre-recording sends on c.
43
[Figure: the example execution (p sends, q sends, p receives) with its events labeled pre-recording and post-recording]
44
(Another execution)
[Figure: the execution q sends, p sends, p receives, with its events labeled pre-recording and post-recording]
45
What did we get?
A configuration that could have happened.
46
seq = (e_i : i ≥ 0) - a distributed computation
S_i - the state of the system right before e_i occurs
S_0 - the initial state of the system
S_t - the state of the system at the termination of the algorithm
S* - the recorded global state
47
Stable Detection
D - a distributed system
y - a predicate function defined on the set of global states of D
S, S' - global states of D
y is a stable property of D if y(S) implies y(S') for all S' reachable from S
48
Algorithm
Input: a stable property y
Output: a boolean value b with the property: y(S_0) → b and b → y(S_t)
Algorithm:
begin
  record a global state S*
  b := y(S*)
end
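The detection algorithm is a thin wrapper around the snapshot. The sketch below assumes some snapshot routine is supplied as a function returning S* as a pair (process states, channel states); `detect`, `terminated` and the two sample snapshots are hypothetical, and the termination predicate (all processes passive, all channels empty) is only one common example of a stable property.

```python
def detect(record_global_state, y):
    """Record a global state S* and return b := y(S*)."""
    s_star = record_global_state()
    return y(s_star)

def terminated(S):
    """Example predicate: every process is passive and every channel is empty."""
    procs, chans = S
    return all(st == "passive" for st in procs.values()) and all(
        len(msgs) == 0 for msgs in chans.values()
    )

# Two hypothetical recorded states standing in for the output of a snapshot run.
quiet = lambda: ({"p": "passive", "q": "passive"}, {("p", "q"): [], ("q", "p"): []})
busy = lambda: ({"p": "passive", "q": "passive"}, {("p", "q"): ["work"], ("q", "p"): []})

print(detect(quiet, terminated))
print(detect(busy, terminated))
```

The second snapshot yields b = false because a message is still in transit, even though both processes are passive.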
49
Correctness
1. S* is reachable from S_0
2. S_t is reachable from S*
3. y(S) → y(S') for all S' reachable from S
S_0 → S* → S_t
y(S*) = true implies y(S_t) = true
y(S*) = false implies y(S_0) = false
50
References
K. M. Chandy and L. Lamport, Distributed Snapshots: Determining Global States of Distributed Systems, ACM Trans. on Computer Systems, 1985.