Fault tolerance and related issues in distributed computing
Shmuel Zaks, zaks@cs.technion.ac.il
GSSI - Feb 2016

Part 0: An overview
Part 1: Lower bounds
Part 2: Computing in spite of faults
Part 3: Detecting faults
Part 4: Self-stabilization

A. An example: Unison (clock synchronization)
Model: shared memory, synchronous.
Figure: five processors whose clocks advance together: 6 6 6 6 6, then 7 7 7 7 7, then 8 8 8 8 8.

Program at each processor p:
  choose a neighbor x of p;
  if t(p) = t(x) then t(p) := t(p)+1;
Stabilizing if all start with the same time, however …

Figure: five processors with clocks 6 7 6 4 7.
  if t(p) = t(x) then t(p) := t(p)+1;
Is it stabilizing? No! (It may even deadlock.)
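A minimal Python sketch of the deadlock. It assumes the five clocks 6 7 6 4 7 sit on a ring, which the transcript does not state; the helper name `enabled` is hypothetical. No processor has a neighbor with an equal clock, so the rule can never fire.

```python
def enabled(clocks):
    """Indices of processors that have at least one neighbor with an equal clock.

    Assumption: processors are arranged on a ring.
    """
    n = len(clocks)
    return [p for p in range(n)
            if clocks[p] in (clocks[(p - 1) % n], clocks[(p + 1) % n])]


clocks = [6, 7, 6, 4, 7]
# An empty list means nobody can apply "if t(p) = t(x) then t(p) := t(p)+1".
print(enabled(clocks))
```

When all clocks are equal, every processor is enabled, which matches the remark that the program works if all start with the same time.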

Program at each processor p:
  choose a neighbor x of p;
  if t(p) = t(x) then t(p) := t(p)+1;
  if t(p) < t(x) then t(p) := t(x);
Is it stabilizing?

Figure: clocks 6 7 6 4 7, then 7 7 7 7 7, then 8 8 8 8 8.
  if t(p) = t(x) then t(p) := t(p)+1;
  if t(p) < t(x) then t(p) := t(x);
Is it stabilizing? Yes!

Figure: clocks 6 7 6 4 7, later 8 8 7 7 7, later 9 9 8 8 8.
  if t(p) = t(x) then t(p) := t(p)+1;
  if t(p) < t(x) then t(p) := t(x);
Is it stabilizing? No!

Program at each processor p:
  choose a neighbor x of p fairly;
  if t(p) = t(x) then t(p) := t(p)+1;
  if t(p) < t(x) then t(p) := t(x);
Is it stabilizing? Yes!
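The fair program can be sketched in Python as follows. This is one illustration under stated assumptions, not the lecture's own code: the processors form a ring, all processors move in synchronous rounds reading the old clocks, and "fairly" is modelled by alternating between the right and the left neighbor. The names `unison_round` and `run` are hypothetical.

```python
def unison_round(t, r):
    """One synchronous round on a ring; every processor reads the old clocks.

    Assumption: fairness is modelled by alternating right / left neighbor.
    """
    n = len(t)
    step = 1 if r % 2 == 0 else -1
    new = []
    for p in range(n):
        x = t[(p + step) % n]          # clock of the chosen neighbor
        if t[p] == x:
            new.append(t[p] + 1)       # rule 1: equal clocks, advance
        elif t[p] < x:
            new.append(x)              # rule 2: behind, catch up
        else:
            new.append(t[p])           # ahead: wait
    return new


def run(t, rounds):
    for r in range(rounds):
        t = unison_round(t, r)
    return t


print(run([6, 7, 6, 4, 7], 5))  # [10, 10, 10, 10, 10]
```

Starting from 6 7 6 4 7, this schedule passes through 9 8 8 8 9 and reaches unison after five rounds; from then on all clocks advance together.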

Proof

For a configuration S of n processors: range(S) = largest clock value minus smallest clock value; top(S) = number of processors holding the largest clock value.

S: 6 7 6 4 7. n=5, range(S) = 7-4 = 3, top(S) = 2

S1: 7 7 7 7 7. n=5, range(S1) = 7-7 = 0, top(S1) = 5

S2: 7 7 7 8 8. n=5, range(S2) = 8-7 = 1, top(S2) = 2

S3: 9 8 8 8 9. n=5, range(S3) = 9-8 = 1, top(S3) = 2

Use for: correctness (especially termination) + complexity

b. Self-stabilization
A special kind of fault tolerance: it guarantees an automatic recovery from failures.
Started by Dijkstra. Introduced with the problem of mutual exclusion on a ring of processors.
Simple case: a distributed system in which steps are taken one at a time, determined by some underlying mechanism.

Bidirectional ring. Shared memory: x_i. Configuration: (x_0, …, x_{n–1}).
Figure: ring of processors p_0, p_1, …, p_{n–1}, where p_i holds register x_i.

Not legitimate configuration = many tokens

Legitimate configuration = one token. Stabilized!

Program of a processor p: IF condition THEN change_state END.
p is privileged ↔ condition = TRUE ↔ p holds the token

A new approach to fault tolerance.
A unified approach to formally incorporate failures into the design model.

Some general comments on self-stabilization
Applications: distributed data structures, digital and analog circuits, genetic algorithms, network protocols, sensor networks.

Figure: a self-stabilizing system. Noise may leave the system in any configuration; a valid execution then brings it to a legitimate configuration with correct output. Stabilized!

E W Dijkstra. Self-stabilization in Spite of Distributed Control, 1973:
"Again I beg my intrigued readers to stop reading here and to try to solve the stated problem themselves, for only then they will (slowly!) build up some sympathy with my difficulties: the problem has been with me for many months, while I was oscillating between trying to find a solution (and many an at first sight plausible construction turned out to be wrong!) and trying to prove the non-existence of a solution. And all the time I had no indication in which of the two directions to aim, nor of the simplicity or complexity of argument (if any!) that would settle the question."

E W Dijkstra. Self-stabilizing Systems in Spite of Distributed Control. Communications of the ACM, 1974:
"For brevity's sake most of the heuristics that led me to find them and the proofs that they satisfy the requirements have been omitted and, to quote Douglas T. Ross's comment on an earlier draft, 'the appreciation is left as an exercise for the reader'."

E W Dijkstra. From My Life, EWD1166, 1993:
"That same summer I designed the first so-called 'self-stabilizing systems'. It was a new problem of which no one knew whether it could be solved at all, and the solutions were at the time a tour de force. I submitted a short article to the Communications of the ACM, which published it in 1974; it was ignored until Leslie Lamport drew attention to it in 1984, and since then it has added a new dimension to ways of dealing with error recovery."

E W Dijkstra. A Belated Proof of Self-stabilization. Distributed Computing, 1986.

Dijkstra's 1st algorithm
N processors 0, 1, …, N-1
Solution with K-state machines:
  machine P0: if x[0] = x[N-1] then x[0] := (x[0]+1) mod K
  machine Pi (i = 1, 2, …, N-1): if x[i] ≠ x[i-1] then x[i] := x[i-1]
Q: Do we need distinct machines?

Uniform protocol: same program at each processor.
Theorem: There is no uniform protocol that solves the mutual exclusion problem (under either a centralized or a distributed scheduler). However, if n is a prime, it is possible to solve the problem under a centralized scheduler.
Figure: rings of processors in which all processors hold the same state.

c. Mutual exclusion – Dijkstra's 1st algorithm
Distributed control:
  ring of (finite-state machine) processes
  for each machine, define privileges (a machine is privileged when a transition is enabled)
  nodes are influenced only by neighbors
Legitimate states:
  in each legitimate state there is at least one privileged machine
  in each legitimate state, each possible move will bring the system again to a legitimate state

A system is self-stabilizing if, regardless of the initial state and regardless of the privilege selected each time for the next move, at least one privilege will always be present and the system is guaranteed to find itself in a legitimate state after a finite number of moves.

Figure: non-stable states lead to stable states after a finite number of transitions.

In Dijkstra's model:
Token ring. N machines, denoted 0 through N-1.
A machine is privileged ("has the token") or not privileged.
A daemon/scheduler determines which of the privileged machines will make the next move.
Legitimate states: all states with exactly one privileged machine.

Notes to consider:
Scheduler (daemon): points at each time to one of the privileged machines, or points at each time to one of the machines.
Centralized scheduler, distributed scheduler.
Fairness: do we need fairness, and of what kind?
Unconditional fairness: each machine gets pointed to infinitely often.
Type of fairness: each machine is pointed at infinitely often, round robin, or other.

Dijkstra's 1st algorithm
N processors 0, 1, …, N-1
Solution with K-state machines:
  machine P0: if x[0] = x[N-1] then x[0] := (x[0]+1) mod K
  machine Pi (i = 1, 2, …, N-1): if x[i] ≠ x[i-1] then x[i] := x[i-1]
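The two rules can be sketched in Python as follows. The function names (`privileged`, `move`, `run`) and the explicit schedule argument, which plays the role of a centralized scheduler, are assumptions made for illustration.

```python
def privileged(x):
    """Indices of the machines whose guard is true in configuration x."""
    n = len(x)
    priv = [0] if x[0] == x[n - 1] else []
    priv += [i for i in range(1, n) if x[i] != x[i - 1]]
    return priv


def move(x, i, K):
    """Let privileged machine i make its move; returns the new configuration."""
    assert i in privileged(x), "the scheduler may only pick a privileged machine"
    y = list(x)
    y[i] = (x[0] + 1) % K if i == 0 else x[i - 1]
    return y


def run(x, K, schedule):
    """Apply the moves chosen by a (centralized) scheduler, one at a time."""
    for i in schedule:
        x = move(x, i, K)
    return x


x = run([0, 2, 4, 1, 0], K=5, schedule=[0, 4, 0, 2])
print(x, privileged(x))  # [2, 2, 2, 1, 1] [3]
```

With N=5 and K=5, this schedule takes the configuration 0 2 4 1 0, in which every machine is privileged, to 2 2 2 1 1, in which only P3 is privileged.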

Figure: ring p0, p1, p2, p3 with N=4, K=3. States: 0, 1, 2.

So far:
  machine P0: if x[0] = x[N-1] then x[0] := (x[0]+1) mod K
  machine Pi: if x[i] ≠ x[i-1] then x[i] := x[i-1]

Example 1 (K=3): figure of an execution on the ring. Etc…

Example 2 (K=2): figure of an execution on the ring. Etc…

Example 3 (K=2): figure of an execution on the ring. Etc…

Note. Informally: "If a machine is privileged, then make it non-privileged".

Legitimate states: exactly one privileged machine (figure with K=2).

Legitimate States
0 0 0 0 0
1 0 0 0 0
1 1 0 0 0
……
1 1 1 1 1
2 1 1 1 1
…
2 2 2 2 2
…
K–1 K–1 K–1 K–1 K–1
0 K–1 K–1 K–1 K–1
0 0 K–1 K–1 K–1
……
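As a sketch, the states listed above can be generated and checked in Python. The helper names are hypothetical and K ≥ 2 is assumed. The check covers two things: every listed state has exactly one privileged machine, and the single possible move leads to another listed state.

```python
def privileged(x):
    n = len(x)
    priv = [0] if x[0] == x[n - 1] else []
    return priv + [i for i in range(1, n) if x[i] != x[i - 1]]


def move(x, i, K):
    y = list(x)
    y[i] = (x[0] + 1) % K if i == 0 else x[i - 1]
    return y


def listed_states(N, K):
    """States of the form (a+1, ..., a+1, a, ..., a), including the all-equal ones."""
    return [[(a + 1) % K] * j + [a] * (N - j)
            for a in range(K) for j in range(1, N + 1)]


N, K = 5, 3
states = listed_states(N, K)
for s in states:
    priv = privileged(s)
    assert len(priv) == 1                    # exactly one token
    assert move(s, priv[0], K) in states     # the token just moves on
print(len(states))  # N * K states in the cycle
```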

Goal: Show that from an arbitrary state, the system stabilizes to a legitimate state in a finite number of steps.
Intuition (columns are machines 0 1 2 3 4):
N=5, K=5:
  0 2 4 1 0
  1 2 4 1 0
  1 2 4 1 1
  2 2 4 1 1
  2 2 2 1 1
N=5, K=4:
  0 3 2 1 0
  1 3 2 1 0
  1 1 2 1 0
  1 1 2 2 0
  1 1 2 2 2

N=5, K=3 (columns are machines 0 1 2 3 4):
  0 1 2 1 0
  0 0 2 1 0
  1 0 2 1 0
  1 0 2 1 1
  1 0 2 2 1
  1 0 0 2 1
  1 1 0 2 1
  2 1 0 2 1
  2 1 0 2 2
  2 1 0 0 2
  2 1 1 0 2
  2 2 1 0 2
  0 2 1 0 2
  0 2 1 0 0
  0 2 1 1 0
  0 2 2 1 0
  0 0 2 1 0
hw: can you get to this configuration?
hw: which configurations have the same property? (Check also for all other examples.)

Theorem: Assuming a centralized scheduler: starting with any initial configuration, Dijkstra's 1st algorithm stabilizes iff K > N-2.
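For small N and K the statement can be explored by brute force. The Python sketch below is only an illustration, not a proof; `stabilizes` and its helpers are hypothetical names. It takes "legitimate" to mean "exactly one privileged machine" and relies on that set being closed under moves. It then checks whether a centralized scheduler can keep the system among the other configurations forever, by repeatedly discarding configurations whose every move leaves the remaining set.

```python
from itertools import product


def privileged(x):
    n = len(x)
    priv = [0] if x[0] == x[n - 1] else []
    return priv + [i for i in range(1, n) if x[i] != x[i - 1]]


def successors(x, K):
    """All configurations reachable from x by one move of one privileged machine."""
    out = []
    for i in privileged(x):
        y = list(x)
        y[i] = (x[0] + 1) % K if i == 0 else x[i - 1]
        out.append(tuple(y))
    return out


def stabilizes(N, K):
    """True iff no execution can stay forever outside the legitimate states.

    Every configuration has at least one privileged machine, so a configuration
    is discarded only when all of its moves lead out of the remaining set.
    """
    bad = {x for x in product(range(K), repeat=N) if len(privileged(x)) != 1}
    changed = True
    while changed:
        changed = False
        for x in list(bad):
            if all(y not in bad for y in successors(x, K)):
                bad.remove(x)
                changed = True
    return not bad   # leftovers can be revisited forever by an adversarial scheduler


print(stabilizes(4, 2))  # False: K < N-1
print(stabilizes(3, 2))  # True:  K = N-1
```

Calling `stabilizes(N, K)` for other small values, for example N=4 with K=3, gives a quick way to compare the theorem with an exhaustive search.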

machine P0: if x[0] = x[N-1] then x[0] := (x[0]+1) mod K
machine Pi: if x[i] ≠ x[i-1] then x[i] := x[i-1]
Lemma 1: If all states are equal then the system is stabilized.
Lemma 2 (liveness): At every configuration, at least one machine is privileged.
Lemma 3: Starting from any configuration, P0 will eventually make a move.
Corollary: Starting at any configuration, P0 will make an infinite number of steps. (This still does not guarantee stabilization.)

machine P0: if x[0] = x[N-1] then x[0] := (x[0]+1) mod K
machine Pi: if x[i] ≠ x[i-1] then x[i] := x[i-1]
Definition: Pi is called special in a configuration if its state differs from the states of all other machines.
Lemma 4: If P0 is special in a configuration, then the system will eventually stabilize.
Lemma 5: If in a configuration one of the states is missing, then the system will eventually stabilize.

Proof of the theorem (centralized scheduler, stabilizes iff K > N-2), by cases:
Case 1: K < N-1
Case 2: K = N-1
Case 3: K = N
Case 4: K > N

Case 1. K < N-1: there are executions that never stabilize (e.g., N=4, K=2).

Case 4. K > N: there are more states than machines, so one state is missing, and we are done by Lemma 5.

Case 3. K = N: either one state is missing (done by Lemma 5), or all states are distinct, so P0 is special (done by Lemma 4).

Case 2. K = N-1: either one state is missing (done by Lemma 5), or one state occurs twice.

Case 2 (cont.): So, one state occurs twice. Either P0 is special (done by Lemma 4), or x[0] = x[i], with
  a. i = 1
  b. 1 < i < N-1
  c. i = N-1
Example, N=7, K=6:
  a. 5 5 2 0 1 4 3
  b. 5 2 0 5 1 4 3
  c. 5 2 0 1 4 3 5

Case a. i = 1
  a1. P0, P1, PN-1 cannot make a move.
  a2. After any other move of Pj, 1 < j < N-1, there is a missing state, and we are done by Lemma 5.
Case b. 1 < i < N-1
  b1. If P0 makes a move we get to b2 or b3.
  b2. If PN-1 makes a move then there is a missing state, and we are done by Lemma 5.
  b3. If Pj, 0 < j < N-1, makes a move, then there is a missing state, and we are done by Lemma 5.

Case c. i = N-1
  c1. PN-1 cannot make a move.
  c2. If P0 makes a move (in case x[0] = x[1]-1) then we get to case b.
  c3. If Pj makes a move, 0 < j < N-1, then one state is missing, and we are done by Lemma 5.

Theorem: Assuming a distributed scheduler: starting with any initial configuration, Dijkstra's 1st algorithm stabilizes iff K > N-1.
hw: prove

Use of a potential function
S = a configuration (a system state)
r(S) = number of maximal runs of equal states
f(S) = r(S) + 1 if first(S) = last(S)
f(S) = r(S) otherwise
Example: S = 2222011000 has r(S) = 4 and first(S) ≠ last(S), so f(S) = 4
Note: S is legitimate iff f(S) = 2.
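A short Python sketch of this potential function, with a configuration given as a string or list of states. The function names simply mirror the slide's r and f.

```python
from itertools import groupby


def r(S):
    """Number of maximal runs of equal states in S."""
    return sum(1 for _ in groupby(S))


def f(S):
    """Potential: r(S), plus 1 when the first and last states are equal."""
    return r(S) + 1 if S[0] == S[-1] else r(S)


print(f("2222011000"))  # 4
print(f("11000"))       # 2: one token, at the boundary between the two runs
print(f("00000"))       # 2: all equal, P0 holds the token
```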

Show: If K > N and f(S) > 2, then f(S) never increases and is eventually decreased.
Recall: a common technique to prove termination and correctness.
Hw: 1. finish the proof. 2. use f to analyze complexity.

d. Mutual exclusion – Dijkstra's 3rd algorithm
N processors 0, 1, …, N-1
Solution with 3-state machines (states 0, 1, 2; arithmetic is mod 3):
  P0: if x[0]+1 = x[1] then x[0] := x[0]-1
  PN-1: if (x[N-2] = x[0]) & (x[N-1] ≠ x[0]+1) then x[N-1] := x[0]+1
  Pi (i = 1, 2, …, N-2): if (x[i]+1 = x[i-1]) ∨ (x[i]+1 = x[i+1]) then x[i] := x[i]+1
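The three rules can be sketched in Python as follows, assuming that all arithmetic is modulo 3 (the states are 0, 1, 2). The names `privileged3` and `move3` are hypothetical.

```python
def privileged3(x):
    """Indices of privileged machines in the 3-state algorithm (arithmetic mod 3)."""
    n = len(x)
    priv = []
    if (x[0] + 1) % 3 == x[1]:
        priv.append(0)                                   # P0
    for i in range(1, n - 1):
        if (x[i] + 1) % 3 in (x[i - 1], x[i + 1]):
            priv.append(i)                               # Pi, 0 < i < N-1
    if x[n - 2] == x[0] and x[n - 1] != (x[0] + 1) % 3:
        priv.append(n - 1)                               # PN-1
    return priv


def move3(x, i):
    """Let machine i make its move; returns the new configuration."""
    n = len(x)
    y = list(x)
    if i == 0:
        y[0] = (x[0] - 1) % 3
    elif i == n - 1:
        y[n - 1] = (x[0] + 1) % 3
    else:
        y[i] = (x[i] + 1) % 3
    return y


x = [1, 1, 1, 1, 1]          # all machines in the same state
for _ in range(6):
    p = privileged3(x)
    print(x, "privileged:", p)
    x = move3(x, p[0])
```

Starting from a configuration in which all states are equal, only PN-1 is privileged; its move creates a single privilege that travels down to P0 and back.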

Figure: ring p0, p1, p2, p3, p4 with states in {0, 1, 2}.

  P0: if x[0]+1 = x[1] then x[0] := x[0]-1

  PN-1: if (x[N-2] = x[0]) & (x[N-1] ≠ x[0]+1) then x[N-1] := x[0]+1

  Pi (i = 1, 2, …, N-2): if (x[i]+1 = x[i-1]) ∨ (x[i]+1 = x[i+1]) then x[i] := x[i]+1

If all are in the same state? (Figure: execution starting with all machines in state 1.)

Figure: the three states 0, 1, 2.

For two neighbors in different states: draw an arrow between them (figure).

s: 2 1 2 0 1 1 2 2
f(s) = (number of ← arrows) + 2 x (number of → arrows)
f(s) = 4 + 2 x 1 = 6
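A Python sketch of the arrow count for the example s = 2 1 2 0 1 1 2 2. The arrow symbols did not survive in the transcript, so the convention used here is an assumption: a pair of neighbors (a, b) with b = a+1 (mod 3) carries an arrow of weight 1, and a pair with a = b+1 (mod 3) carries an arrow of weight 2. This choice reproduces the value 6 given for the example; the opposite choice would give 9.

```python
def arrows(s):
    """Count the two kinds of arrows between neighbors in different states.

    Assumed convention (not confirmed by the transcript):
      weight-1 arrow: right neighbor = left neighbor + 1 (mod 3)
      weight-2 arrow: left neighbor = right neighbor + 1 (mod 3)
    """
    pairs = list(zip(s, s[1:]))
    weight1 = sum(1 for a, b in pairs if b == (a + 1) % 3)
    weight2 = sum(1 for a, b in pairs if a == (b + 1) % 3)
    return weight1, weight2


def f(s):
    weight1, weight2 = arrows(s)
    return weight1 + 2 * weight2


s = [2, 1, 2, 0, 1, 1, 2, 2]
print(arrows(s), f(s))  # (4, 1) 6
```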

Lemma 1: If there is a unique arrow in a configuration, then it moves back and forth infinitely.
Corollary: It suffices to show that from any configuration we eventually reach a configuration with a unique arrow.
Lemma 2 (liveness): Starting from any initial configuration, at least one process is privileged.

Lemma 3: Starting from any configuration, eventually.
Proof: if during an execution -. else - f=0, but then there is no arrow, and in the next step there is a unique arrow, and then apply Lemma 1.

Lemma 4: Between every two successive moves of PN-1 there is at least one move of P0.
Proof: Recall the program for PN-1:
  if (x[N-2] = x[0]) & (x[N-1] ≠ x[0]+1) then x[N-1] := x[0]+1
After PN-1 makes a step, x[N-1] = x[0]+1, and this makes PN-1 non-privileged. The privilege condition of PN-1 can become true again only after P0 makes a step [P0: if x[0]+1 = x[1] then x[0] := x[0]-1].

Lemma 5: A sequence of moves in which P0 does not move is finite. (Therefore it is impossible, by Lemma 2.)
Proof: By Lemma 4, in such a sequence PN-1 makes at most one step. So, from some point on, only the Pi, 0 < i < N-1, make steps; these are steps 1, 2, 3, 4, 5.
  Steps 1, 2: no change in the number of arrows.
  Steps 3, 4, 5: decrease the number of arrows.
So from some point on there are only steps 1 and 2. The claim follows from the topology.

Theorem: Starting from any initial configuration, the system eventually stabilizes.
Proof: By Lemma 5, P0 makes an infinite number of moves:
  P0 … P0 … P0 … P0 … P0 … …
The moves of P0 divide the execution into phases.

Proof (cont.): Consider the statement "there exists a rightmost arrow, and it points to the right". Across the phases it alternates between false and true (F T F T F T …), so it is falsified in each phase.

This is done only in steps 3, 4, 6. Step 6: ok. So, assume only 3 and 4; in each phase they have Δf = -3. Steps 0 and 7 have Δf ≤ 2. Others have Δf = 0. So, in each phase f is decreased by at least 1.
End of proof of correctness. What about complexity?

Dijkstra's Proof of Correctness
  Divide an execution into phases.
  Function f decreases by 1 during every phase.
  There are infinitely many phases.
  Function f becomes ≤ 2.
  The system stabilizes.

Dijkstra's Proof of Correctness implies:
  t0 = # phases ≤ 2n
  t6 + t7 ≤ 2n
  t3 + t4 + t5 ≤ t7 + n ≤ 3n
  t0 + t3 + t4 + t5 + t6 + t7 ≤ 7n

Refined Analysis – Tighter Bounds
Linear Programming!!!
For Dijkstra's algorithm: Upper bound: Lower bound:

Refined Analysis – Tight Bounds
For new algorithm: Upper bound: Lower bound: Tight

e. Uniform protocol when n is a prime
Uniform protocol: same program at each processor.
Theorem: There is no uniform protocol that solves the mutual exclusion problem (under either a centralized or a distributed scheduler). However, if n is a prime, it is possible to solve the problem under a centralized scheduler.

References
E. W. Dijkstra, Self-stabilizing systems in spite of distributed control, CACM, 17, 11, 1974, pp. 643-644.
E. W. Dijkstra, A belated proof of self-stabilization, Distributed Computing, 1, 1, 1986, pp. 5-6.
M. G. Gouda and T. Herman, Stabilizing Unison, Department of Computer Science, 1990.
V. Chernoy, M. Shalom and S. Zaks, A self-stabilizing algorithm with tight bounds for mutual exclusion on a ring, DISC 2008.
