Published by Caroline Adams. Modified over 9 years ago.
1
Fault-containment in Weakly Stabilizing Systems
Anurag Dasgupta, Sukumar Ghosh, Xin Xiao
University of Iowa
2
Preview
Weak stabilization (Gouda 2001) guarantees:
1. Reachability of the legal configuration from any configuration
2. Closure of the legal configuration under system actions
Once “stable”, if there is a minor perturbation, no recovery guarantee exists under a weakly fair scheduler, let alone “efficient recovery”.
We take a weakly stabilizing leader election algorithm and add fault-containment to it.
3
Our contributions
An exercise in adding fault-containment to a weakly stabilizing leader election algorithm on a line topology. Processes are anonymous.
Expected recovery time from any single failure is O(1).
As m → ∞, the contamination number is O(1) (precisely 4), where m is a tuning parameter.
(Contamination number = maximum number of non-faulty processes that change their states during recovery.)
4
The big picture
[Figure; label: leader]
5
Model and Notations
Consider n processes in a line topology.
N(i) = neighbors of process i
Variable P(i) ∈ N(i) ∪ {⊥} (parent of i)
Macro C(i) = {q ∈ N(i) : P(q) = i} (children of i)
Predicate Leader(i) ≡ (P(i) = ⊥)
Legal configuration:
1. For exactly one process i: P(i) = ⊥
2. ∀ j ≠ i: P(j) = k ⇒ P(k) ≠ j
[Figure labels: Node i, P(i), C(i), Leader]
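The legal-configuration predicate can be written out directly. The following is a minimal sketch, assuming processes are numbered 0..n-1 along the line and that ⊥ is represented by `None`; the function names are illustrative and do not come from the slides.

```python
from typing import List, Optional

def neighbors(i: int, n: int) -> List[int]:
    """N(i): neighbors of process i on a line of n processes (0..n-1)."""
    return [q for q in (i - 1, i + 1) if 0 <= q < n]

def children(i: int, parent: List[Optional[int]]) -> List[int]:
    """C(i): neighbors q of i whose parent pointer P(q) equals i."""
    return [q for q in neighbors(i, len(parent)) if parent[q] == i]

def is_legal(parent: List[Optional[int]]) -> bool:
    """Exactly one leader (P(i) = None), and P(j) = k implies P(k) != j."""
    n = len(parent)
    leaders = [i for i in range(n) if parent[i] is None]
    if len(leaders) != 1:
        return False
    for j in range(n):
        k = parent[j]
        if k is None:
            continue
        if k not in neighbors(j, n):  # a parent must be a neighbor
            return False
        if parent[k] == j:            # j and k point at each other
            return False
    return True

# Example: five processes, process 2 is the leader, all pointers lead to it.
example = [1, 2, None, 2, 3]
example_is_legal = is_legal(example)
```

On a line, the two conditions together force every parent pointer to lead toward the unique leader.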
6
Model and Notations
Shared memory model and central scheduler
Weak fairness of the scheduler
Guarded action by a process: g → A
A computation is a sequence of (global) states and state transitions.
7
Stabilization
A stable (or legal) configuration satisfies a predicate LC defined in terms of the primary variables p that are observable by the application. However, fault-containment often needs secondary variables (a.k.a. auxiliary or state variables) s. Thus:
Local state of process i = (p_i, s_i)
Global state of the system = (p, s), where p = the set of all p_i and s = the set of all s_i
(p, s) ∈ LC ⇔ p ∈ LC_p and s ∈ LC_s
8
Definitions
Containment time is the maximum time needed to establish LC_p from a 1-faulty configuration.
Containment in space means that the primary variables of only O(1) processes change state during recovery from any 1-faulty configuration.
Fault-gap is the time to reach LC (both LC_p and LC_s) from any 1-faulty configuration.
[Figure: timeline marking LC_p restored, LC_s restored, and the fault gap]
9
Weakly stabilizing leader election
We start from the weakly stabilizing leader election algorithm by Devismes, Tixeuil, and Yamashita [ICDCS 2007], and then modify it to add fault-containment. Here is the DTY algorithm for an array of processes.
DTY algorithm: program for any process i in the array
Guarded actions:
R1 :: not leader ∧ N(i) = C(i) → be a leader
R2 :: not leader ∧ N(i) \ (C(i) ∪ {P(i)}) ≠ ∅ → switch parent
R3 :: leader ∧ N(i) ≠ C(i) → parent := k : k ∈ N(i) \ C(i)
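The three DTY guards can be evaluated mechanically on a line. The sketch below is one reading of the rules above, not the authors' code: ⊥ is `None`, R3 is assumed to pick a neighbor that is not a child, and `pick` is a hypothetical hook standing in for whatever choice the scheduler makes among candidate parents.

```python
from typing import Callable, List, Optional, Set

def neighbors(i: int, n: int) -> Set[int]:
    """N(i) on a line of n processes numbered 0..n-1."""
    return {q for q in (i - 1, i + 1) if 0 <= q < n}

def children(i: int, parent: List[Optional[int]]) -> Set[int]:
    """C(i): neighbors whose parent pointer is i."""
    return {q for q in neighbors(i, len(parent)) if parent[q] == i}

def dty_enabled_rule(i: int, parent: List[Optional[int]]) -> Optional[str]:
    """Name of the DTY rule enabled at process i, or None if none is."""
    N, C = neighbors(i, len(parent)), children(i, parent)
    leader = parent[i] is None
    if not leader and N == C:
        return "R1"                        # all neighbors are children
    if not leader and N - C - {parent[i]}:
        return "R2"                        # some other neighbor is available
    if leader and N != C:
        return "R3"                        # leader with a non-child neighbor
    return None

def dty_apply(i: int, parent: List[Optional[int]],
              pick: Callable = min) -> List[Optional[int]]:
    """Return the configuration after process i executes its enabled rule."""
    rule = dty_enabled_rule(i, parent)
    N, C = neighbors(i, len(parent)), children(i, parent)
    new_parent = list(parent)
    if rule == "R1":
        new_parent[i] = None               # become a leader
    elif rule == "R2":
        new_parent[i] = pick(N - C - {parent[i]})  # switch parent
    elif rule == "R3":
        new_parent[i] = pick(N - C)        # give up leadership
    return new_parent
```

For instance, in the two-leader configuration `[None, 0, None]`, process 2 has R3 enabled, and executing it yields `[None, 0, 1]`, in which no guard is enabled.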
10
Recovery from a single failure
With a randomized scheduler, weakly stabilizing systems recover to a legal configuration with probability 1. However, if a single failure occurs, the recovery time can be as large as n (consider situations similar to Gambler’s ruin). For fault-containment, we need something better.
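One way to read the Gambler's ruin remark is as a symmetric random walk on positions 0..n with absorbing ends. This modeling choice is an assumption made for illustration, not a claim about the authors' analysis. Under it, the expected absorption time from position k is k(n - k), so a walk starting one step from an end already needs n - 1 expected steps. The sketch below solves the defining recurrence exactly.

```python
from fractions import Fraction
from typing import List

def expected_duration(n: int) -> List[Fraction]:
    """Expected steps to absorption for a fair walk on 0..n.

    Solves E[0] = E[n] = 0 and E[k] = 1 + (E[k-1] + E[k+1]) / 2
    by writing E[k] = a[k] * t + b[k], where t = E[1] is unknown.
    """
    a = [Fraction(0), Fraction(1)]
    b = [Fraction(0), Fraction(0)]
    for k in range(1, n):
        # Rearranged recurrence: E[k+1] = 2 E[k] - E[k-1] - 2
        a.append(2 * a[k] - a[k - 1])
        b.append(2 * b[k] - b[k - 1] - 2)
    t = -b[n] / a[n]  # enforce the boundary condition E[n] = 0
    return [a[k] * t + b[k] for k in range(n + 1)]

durations = expected_duration(10)
```

The linear growth in n is what the biased scheduler is meant to avoid.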
11
Our strategy We bias a randomized scheduler to achieve our goal. The technique was first illustrated in [Dasgupta, Ghosh, Xiao: SSS 2007]. Here we show that the technique is indeed powerful enough to solve a larger class of problems.
12
Biasing a random scheduler
For fault-containment, each process i uses a secondary variable x(i). A node i updates its primary variable P(i) when the following conditions hold:
1. The guard involving the primary variables is true
2. The randomized scheduler chooses i
3. x(i) ≥ x(k), where k ∈ N(i)
13
Biasing a random scheduler
After the action, x(i) is updated as x(i) := max {x(q) : q ∈ N(i)} + m, where m ∈ Z+ is a tuning parameter (call this update x(i)).
When x(i) < x(k) but conditions 1-2 hold, the primary variable P(i) remains unchanged; only x(i) is incremented by 1 (call this increment x(i)).
[Figure, with m = 5. UPDATE x(i): x(i)=10, x(j)=8, x(k)=7 becomes x(i)=13, x(j)=8, x(k)=7. INCREMENT x(i): x(i)=10, x(j)=8, x(k)=7 becomes x(i)=10, x(j)=8, x(k)=8.]
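The update and increment steps can be sketched as one function. This is a minimal illustration and makes several assumptions: the process has been chosen by the scheduler with its primary guard true, `k` is the neighbor it compares against, and the three-node neighborhood used in the example (i adjacent to both j and k) is a guess at the layout of the slide's figure. The name `biased_step` is hypothetical.

```python
from typing import Dict, List, Tuple

def biased_step(i: str, k: str, x: Dict[str, int],
                nbrs: Dict[str, List[str]], m: int) -> Tuple[bool, Dict[str, int]]:
    """Return (may_change_parent, new x values) for the scheduled process i."""
    new_x = dict(x)
    if x[i] >= x[k]:
        # "update x(i)": jump m above the largest neighboring value
        new_x[i] = max(x[q] for q in nbrs[i]) + m
        return True, new_x
    # "increment x(i)": P(i) stays unchanged, x(i) creeps up by one
    new_x[i] = x[i] + 1
    return False, new_x

# Values from the slide, with m = 5.
x = {"i": 10, "j": 8, "k": 7}
nbrs = {"i": ["j", "k"], "j": ["i"], "k": ["i"]}  # assumed adjacency
moved, after_update = biased_step("i", "k", x, nbrs, m=5)
blocked, after_increment = biased_step("k", "i", x, nbrs, m=5)
```

With a large m, a process that has just moved stays ahead of its neighbors for a long time, so neighbors that want to move must first spend many scheduler turns incrementing.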
14
The Algorithm
Algorithm 1 (containment): program for process i
Guarded actions:
R1 :: (P(i) ≠ ⊥) ∧ (N(i) = C(i)) → P(i) := ⊥
R2 :: (P(i) = ⊥) ∧ (∃ k ∈ N(i) \ C(i)) → P(i) := k
R3a :: (P(i) = j) ∧ (P(j) ≠ ⊥) ∧ (∃ k ∈ N(i) : P(k) ∉ {i, ⊥}) ∧ x(i) ≥ x(k) → P(i) := k; update x(i)
R3b :: (P(i) = j) ∧ (P(j) ≠ ⊥) ∧ (∃ k ∈ N(i) : P(k) ∉ {i, ⊥}) ∧ x(i) < x(k) → increment x(i)
R4a :: (P(i) = j) ∧ (∃ k ∈ N(i) : P(k) = ⊥) ∧ x(i) ≥ x(k) → P(i) := k
R4b :: (P(i) = j) ∧ (∃ k ∈ N(i) : P(k) = ⊥) ∧ x(i) < x(k) → increment x(i)
R5 :: (P(i) = j) ∧ (P(j) = ⊥) ∧ (∃ k ∈ N(i) : P(k) ∉ {i, ⊥}) → P(i) := k
15
Analysis of containment
Consider six cases:
1. Fault at the leader
2. Fault at distance-1 from the leader
3. Fault at distance-2 from the leader
4. Fault at distance-3 from the leader
5. Fault at distance-4 from the leader
6. Fault at distance-5 or greater from the leader
16
Case 1: fault at leader node
[Figure: snapshots of the line of nodes 0 to 7. Labels: R1 applied by node 5; R1 applied by node 4: node 4 is the new leader.]
R1 :: (P(i) ≠ ⊥) ∧ (N(i) = C(i)) → P(i) := ⊥
17
Case 2: fault at distance-1 from the leader node
[Figure: snapshots of the line of nodes 0 to 7. Labels: R1: node 3; R2: node 5.]
R2 :: (P(i) = ⊥) ∧ (∃ k ∈ N(i) \ C(i)) → P(i) := k
18
Case 5: fault at distance-4 from the leader node
[Figure: snapshots of the line of nodes 0 to 7. Labels: R4a(2): x(2) > x(1); R5(4); R2(5); R3a(3): x(3) > x(2); stable.]
Non-faulty processes up to distance 4 from the faulty node are affected.
R4a :: (P(i) = j) ∧ (∃ k ∈ N(i) : P(k) = ⊥) ∧ x(i) ≥ x(k) → P(i) := k
19
Case 6: fault at distance ≥ 5 from the leader node
[Figure: snapshots of the line of nodes 0 to 7. Labels: R4a(2): x(2) > x(1); R3a(3); R5(2); R2(1); R3a(3): x(3) > x(2), x(4); current leader; recovery complete.]
With a high m, it is difficult for node 4 to change its parent, but node 3 can easily do it.
20
Fault-containment in space
Theorem 1. As m → ∞, the effect of a single failure is restricted to within distance 4 of the faulty process, i.e., the algorithm is spatially fault-containing.
Proof idea. Uses the exhaustive case-by-case analysis. The worst case occurs when a node at distance 4 from the leader node fails, as shown earlier.
21
Fault-containment in time
Theorem 2. The expected number of steps needed to contain a single fault is independent of n. Hence algorithm containment is fault-containing in time.
Proof idea. Case-by-case analysis. When a node beyond distance 4 from the leader fails, its impact on the time complexity remains unchanged. A summary of these calculations follows:
22
Fault-containment in time
Case 1: the leader fails. Recovery is completed in a single move regardless of whether node 3 or 4 executes a move.
[Figure: line of nodes 0 to 7]
Case 2: a node i at distance 1 from the leader fails.
(a) P(i) becomes ⊥: recovery is completed in one step
(b) P(i) switches to a new parent: recovery time = 2 + ∑_{n=1}^{∞} n/2^n = 4
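The value 4 in case 2(b) rests on the series ∑ n/2^n summing to 2. As a quick check, the partial sums have the closed form 2 - (N + 2)/2^N, which the sketch below verifies with exact rational arithmetic (the helper name is illustrative).

```python
from fractions import Fraction

def partial_sum(terms: int) -> Fraction:
    """Exact value of sum_{n=1}^{terms} n / 2**n."""
    return sum(Fraction(n, 2 ** n) for n in range(1, terms + 1))

# Closed form of the partial sum: 2 - (N + 2) / 2**N, which tends to 2.
closed_form_holds = all(
    partial_sum(N) == 2 - Fraction(N + 2, 2 ** N) for N in range(1, 31)
)

# Recovery time in case 2(b), truncated after 60 terms of the series.
approx_recovery_time = float(2 + partial_sum(60))
```

Since (N + 2)/2^N vanishes as N grows, the series equals 2 and the recovery time is 2 + 2 = 4.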
23
Fault-containment in time
Summary of expected containment times

                     P(i) becomes ⊥    P(i) switches
Fault at leader      -                 1
Fault at dist-1      1                 4
Fault at dist-2      2                 151/108
Fault at dist-3      131/54            115/36
Fault at dist-4      10/9              29/27
Fault at dist ≥ 5    33/32             115/36

Thus, the expected containment time is O(1).
24
Proof idea of weak stabilization
[Figure: rules R1, R2, R3 of the DTY algorithm set against rules R1, R2, R3, R4, R5 of our algorithm]
Our algorithm executes the same action (P(i) := k) as in DTY, but the guards are biased differently. This is equivalent to adding “different delays” in different paths.
Every computation of our algorithm is also a computation of the DTY algorithm. Since the DTY algorithm is weakly stabilizing, so is our algorithm.
25
Stabilization from multiple failures
Theorem 3. When m → ∞, the expected recovery time from multiple failures is O(1) if the faults occur at distance 9 or more apart.
Proof sketch. Since the contamination number is 4, no non-faulty process is influenced by both failures.
[Figure: two faults, each with a contamination range of 4]
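The bound of 9 is 2 × 4 + 1: two faults contaminate disjoint sets of processes exactly when their radius-4 neighborhoods do not meet. A small sketch of that arithmetic follows, treating positions on the line as integers; the function names are illustrative.

```python
from typing import Set

def affected(fault: int, radius: int = 4) -> Set[int]:
    """Positions within the contamination radius of a fault."""
    return set(range(fault - radius, fault + radius + 1))

def regions_disjoint(fault_a: int, fault_b: int, radius: int = 4) -> bool:
    """True when no process lies within the radius of both faults."""
    return affected(fault_a, radius).isdisjoint(affected(fault_b, radius))

# Smallest separation at which the two regions no longer overlap.
min_safe_distance = min(d for d in range(1, 20) if regions_disjoint(0, d))
```

At distance 8 the process midway between the two faults is within distance 4 of both, which is why 8 is not enough.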
26
Conclusion
1. With increasing m, the containment in space is tighter, but stabilization from arbitrary initial configurations slows down.
2. LC_s = true, so the system is ready to deal with the next single failure as soon as LC_p holds. This reduces the fault-gap and increases system availability.
3. The unbounded secondary variable x can be bounded using the technique discussed in [Dasgupta, Ghosh, Xiao: SSS 2007].
4. It is possible to extend this algorithm to a tree topology (but we did not do it here).
27
Questions?
28
Proof of convergence
Theorem 3. The proposed algorithm recovers from all single faults to a legal configuration.
Proof (using the martingale convergence theorem).
A martingale is a sequence of random variables X_1, X_2, X_3, … such that ∀ n:
1. E(|X_n|) < ∞, and
2. E(X_{n+1} | X_1 … X_n) = X_n
(for a super-martingale replace = with ≤, and for a sub-martingale replace = with ≥)
We use the following corollary of the martingale convergence theorem:
Corollary. If X_n ≥ 0 is a super-martingale, then as n → ∞, X_n converges to X with probability 1, and E(X) ≤ E(X_0).
29
Proof of convergence (continued)
Let X_i be the number of processes with enabled guards in step i. After 0 or 1 failure, X can be 0, 2, or 3 (exhaustive enumeration).
When X_i = 0, X_{i+1} = 0 (already stable)
When X_i = 2, E(X_{i+1}) = 1/2 × 0 + 1/2 × 2 = 1 ≤ 2
When X_i = 3, E(X_{i+1}) = 1/3 × 0 + 1/3 × 2 + 1/3 × 4 = 2 ≤ 3
Thus X_1, X_2, X_3, … is a super-martingale. Using the Corollary, as n → ∞, E(X_n) ≤ E(X_0). Since X is non-negative by definition, X_n converges to 0 with probability 1, and the system stabilizes.
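Each line of the super-martingale check is a one-step conditional expectation, i.e. a probability-weighted sum over the possible next values. The helper below (a hypothetical name, not from the slides) computes such a sum exactly and is shown on the X_i = 3 case, where the next value is 0, 2, or 4 with probability 1/3 each; the same helper applies to the other cases.

```python
from fractions import Fraction
from typing import Dict

def expected_next(outcomes: Dict[int, Fraction]) -> Fraction:
    """E(X_{i+1} | X_i) for a map from next value to its probability."""
    if sum(outcomes.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(value * prob for value, prob in outcomes.items())

third = Fraction(1, 3)
e_from_3 = expected_next({0: third, 2: third, 4: third})

# Super-martingale condition for this state: E(X_{i+1} | X_i = 3) <= 3.
condition_holds = e_from_3 <= 3
```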