Fault-containment in Weakly Stabilizing Systems
Anurag Dasgupta, Sukumar Ghosh, Xin Xiao
University of Iowa
Preview
Weak stabilization (Gouda 2001) guarantees:
- reachability of the legal configuration from any configuration, and
- closure of the legal configuration under system actions.
Once "stable", if there is a minor perturbation, no recovery guarantee exists under a weakly fair scheduler, let alone "efficient recovery". We take a weakly stabilizing leader election algorithm and add fault-containment to it.
Our contributions
An exercise in adding fault-containment to a weakly stabilizing leader election algorithm on a line topology. Processes are anonymous.
- Expected recovery time from all single failures is O(1).
- As m → ∞, the contamination number is O(1) (precisely 4), where m is a tuning parameter.
(Contamination number = maximum number of non-faulty processes that change their states during recovery.)
The big picture
[figure: a line of processes with one designated leader]
Model and Notations
Consider n processes in a line topology.
N(i) = neighbors of process i
Variable P(i) ∈ N(i) ∪ {⊥} (parent of i)
Macro C(i) = {q ∈ N(i) : P(q) = i} (children of i)
Predicate Leader(i) ≡ (P(i) = ⊥)
Legal configuration:
1. For exactly one process i: P(i) = ⊥
2. ∀ j ≠ i: P(j) = k ⇒ P(k) ≠ j
[figure: node i with its parent P(i), its children C(i), and the leader]
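The notation above translates directly into a few predicates. Below is a minimal Python sketch of the model, assuming 0-indexed processes on a line and `None` standing in for ⊥; the names (`neighbors`, `legal_configuration`, …) are illustrative, not from the paper.

```python
BOTTOM = None  # stands in for the ⊥ value

def neighbors(i, n):
    """N(i): neighbors of process i on a line 0..n-1."""
    return [j for j in (i - 1, i + 1) if 0 <= j < n]

def children(i, P, n):
    """C(i) = {q in N(i) : P(q) = i}."""
    return [q for q in neighbors(i, n) if P[q] == i]

def is_leader(i, P):
    """Leader(i) ≡ (P(i) = ⊥)."""
    return P[i] is BOTTOM

def legal_configuration(P, n):
    """Exactly one leader, and no two processes are each other's parent."""
    leaders = [i for i in range(n) if is_leader(i, P)]
    if len(leaders) != 1:
        return False
    return all(is_leader(j, P) or P[P[j]] != j for j in range(n))
```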
Model and Notations
Shared memory model and central scheduler.
Weak fairness of the scheduler.
Guarded action by a process: g → A
A computation is a sequence of (global) states and state transitions.
[figure: node i with its parent P(i), its children C(i), and the leader]
Stabilization
A stable (or legal) configuration satisfies a predicate LC defined in terms of the primary variables p that are observable by the application. However, fault-containment often needs the use of secondary variables (a.k.a. auxiliary or state variables) s. Thus,
Local state of process i = (p_i, s_i)
Global state of the system = (p, s), where p = the set of all p_i and s = the set of all s_i
(p, s) ⊨ LC ≡ (p ⊨ LC_p) ∧ (s ⊨ LC_s)
Definitions
Containment time is the maximum time needed to establish LC_p from a 1-faulty configuration.
Containment in space means the primary variables of only O(1) processes change their state during recovery from any 1-faulty configuration.
Fault gap is the time to reach LC (both LC_p and LC_s) from any 1-faulty configuration.
[figure: timeline — LC_p restored, then LC_s restored; the interval up to LC_s is the fault gap]
Weakly stabilizing leader election
We start from the weakly stabilizing leader election algorithm by Devismes, Tixeuil, Yamashita [ICDCS 2007], and then modify it to add fault-containment. Here is the DTY algorithm for an array of processes.
DTY algorithm: program for any process i in the array
Guarded actions:
R1 :: ¬leader ∧ N(i) = C(i) → be a leader
R2 :: ¬leader ∧ N(i) \ (C(i) ∪ {P(i)}) ≠ ∅ → switch parent
R3 :: leader ∧ N(i) ≠ C(i) → parent := k, where k ∈ N(i) \ C(i)
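For concreteness, here is one way the three DTY guarded actions could be simulated in Python. The rule encoding is my reading of the slide, so treat it as a sketch rather than the authors' implementation.

```python
import random

BOTTOM = None  # stands in for the ⊥ value

def neighbors(i, n): return [j for j in (i - 1, i + 1) if 0 <= j < n]
def children(i, P, n): return [q for q in neighbors(i, n) if P[q] == i]

def dty_step(i, P, n):
    """Execute the enabled DTY rule (if any) at process i; return its name."""
    N, C = set(neighbors(i, n)), set(children(i, P, n))
    if P[i] is not BOTTOM:              # i is not a leader
        if N == C:                      # R1: all neighbors are children
            P[i] = BOTTOM
            return "R1"
        free = N - C - {P[i]}
        if free:                        # R2: a neighbor is neither child nor parent
            P[i] = random.choice(sorted(free))
            return "R2"
    elif N != C:                        # R3: leader with a non-child neighbor resigns
        P[i] = random.choice(sorted(N - C))
        return "R3"
    return None
```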
Recovery from a single failure
With a randomized scheduler, weakly stabilizing systems recover to a legal configuration with probability 1. However, if a single failure occurs, the recovery time can be as large as n (consider situations similar to the gambler's ruin). For fault-containment, we need something better.
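The gambler's-ruin analogy can be made concrete with a toy experiment: model the contaminated region's boundary as an unbiased random walk that must be absorbed at an end of the line before recovery completes. This walk is my simplification of the slide's intuition, not the paper's analysis.

```python
import random

def symmetric_walk_steps(start, n):
    """Steps for an unbiased walk on 0..n to get absorbed at 0 or n."""
    pos, steps = start, 0
    while 0 < pos < n:
        pos += random.choice((-1, 1))
        steps += 1
    return steps

# The expected absorption time from position d is d*(n - d), so it grows
# with n even when the walk starts right next to an endpoint (d = 1).
n = 50
trials = [symmetric_walk_steps(1, n) for _ in range(10_000)]
print(sum(trials) / len(trials))  # ~ 1 * (n - 1) = 49
```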
Our strategy We bias a randomized scheduler to achieve our goal. The technique was first illustrated in [Dasgupta, Ghosh, Xiao: SSS 2007]. Here we show that the technique is indeed powerful enough to solve a larger class of problems.
Biasing a random scheduler
For fault-containment, each process i uses a secondary variable x(i). A node i updates its primary variable P(i) when the following conditions hold:
1. The guard involving the primary variables is true.
2. The randomized scheduler chooses i.
3. x(i) ≥ x(k), where k ∈ N(i).
Biasing a random scheduler
After the action, x(i) is updated as x(i) := max_{q ∈ N(i)} x(q) + m, where m ∈ Z+ is a tuning parameter (call this UPDATE x(i)). When x(i) < x(k) but conditions 1–2 hold, the primary variable P(i) remains unchanged; only x(i) is incremented by 1 (call this INCREMENT x(i)).
[figure: example on the line i–j–k with m = 5. UPDATE: from x(i)=10, x(j)=8, x(k)=7, node i acts and x(i) := 8 + 5 = 13. INCREMENT: with x(k)=7 < x(j)=8, node k only increments x(k) to 8.]
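The two x-moves can be written down directly. The following sketch replays the slide's m = 5 example; mapping processes i, j, k to indices 0, 1, 2 is my choice.

```python
def neighbors(i, n): return [j for j in (i - 1, i + 1) if 0 <= j < n]

def update_x(i, x, n, m):
    """UPDATE x(i): run together with a change of P(i)."""
    x[i] = max(x[q] for q in neighbors(i, n)) + m

def increment_x(i, x):
    """INCREMENT x(i): run when the primary guard holds but x(i) < x(k)."""
    x[i] += 1

# Replay of the slide's example on the line i-j-k with m = 5:
x = [10, 8, 7]                 # x(i)=10, x(j)=8, x(k)=7
update_x(0, x, n=3, m=5)       # i acts: x(i) := max{x(j)} + 5 = 13
assert x[0] == 13
increment_x(2, x)              # k is outrun (x(k)=7 < x(j)=8): x(k) := 8
assert x[2] == 8
```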
The Algorithm
Algorithm 1 (containment): program for process i
Guarded actions:
R1 :: (P(i) ≠ ⊥) ∧ (N(i) = C(i)) → P(i) := ⊥
R2 :: (P(i) = ⊥) ∧ (∃ k ∈ N(i) \ C(i)) → P(i) := k
R3a :: (P(i) = j) ∧ (P(j) ≠ ⊥) ∧ (∃ k ∈ N(i) : P(k) ∉ {i, ⊥}) ∧ x(i) ≥ x(k) → P(i) := k; update x(i)
R3b :: (P(i) = j) ∧ (P(j) ≠ ⊥) ∧ (∃ k ∈ N(i) : P(k) ∉ {i, ⊥}) ∧ x(i) < x(k) → increment x(i)
R4a :: (P(i) = j) ∧ (∃ k ∈ N(i) : P(k) = ⊥) ∧ x(i) ≥ x(k) → P(i) := k
R4b :: (P(i) = j) ∧ (∃ k ∈ N(i) : P(k) = ⊥) ∧ x(i) < x(k) → increment x(i)
R5 :: (P(i) = j) ∧ (P(j) = ⊥) ∧ (∃ k ∈ N(i) : P(k) ∉ {i, ⊥}) → P(i) := k
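Putting the guards together, one possible single-activation step function looks as follows. The rule precedence and the restriction k ≠ P(i) in R3–R5 are my assumptions for making the sketch executable; they are not details fixed by the slide.

```python
import random

BOTTOM, M = None, 5  # ⊥ and the tuning parameter m (m = 5 is arbitrary here)

def neighbors(i, n): return [j for j in (i - 1, i + 1) if 0 <= j < n]
def children(i, P, n): return [q for q in neighbors(i, n) if P[q] == i]

def step(i, P, x, n):
    """One activation of process i under Algorithm 1; returns the rule fired."""
    N, C, j = neighbors(i, n), children(i, P, n), P[i]
    if j is not BOTTOM and set(N) == set(C):          # R1: become leader
        P[i] = BOTTOM
        return "R1"
    if j is BOTTOM:                                   # R2: leader resigns
        ks = [k for k in N if k not in C]
        if ks:
            P[i] = random.choice(ks)
            return "R2"
        return None
    for k in N:                                       # R4a / R4b: a neighbor leads
        if k != j and P[k] is BOTTOM:
            if x[i] >= x[k]:
                P[i] = k
                return "R4a"
            x[i] += 1
            return "R4b"
    for k in N:                                       # R3a / R3b / R5
        if k != j and P[k] not in (i, BOTTOM):
            if P[j] is BOTTOM:                        # parent is the leader: R5
                P[i] = k
                return "R5"
            if x[i] >= x[k]:                          # R3a: switch and update x(i)
                P[i] = k
                x[i] = max(x[q] for q in N) + M
                return "R3a"
            x[i] += 1                                 # R3b: increment x(i)
            return "R3b"
    return None
```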
Analysis of containment
Consider six cases:
1. Fault at the leader
2. Fault at distance-1 from the leader
3. Fault at distance-2 from the leader
4. Fault at distance-3 from the leader
5. Fault at distance-4 from the leader
6. Fault at distance-5 or greater from the leader
Case 1: fault at the leader node
[figure: R1 applied by node 5; then R1 applied by node 4 — node 4 is the new leader]
R1 :: (P(i) ≠ ⊥) ∧ (N(i) = C(i)) → P(i) := ⊥
Case 2: fault at distance-1 from the leader node
[figure: recovery sequence — R1, then R2 applied by node 5]
R2 :: (P(i) = ⊥) ∧ (∃ k ∈ N(i) \ C(i)) → P(i) := k
Case 5: fault at distance-4 from the leader node
[figure: recovery sequence — R4a at node 2 (x(2) > x(1)); R5 at node 4; R2 at node 5; R3a at node 3 (x(3) > x(2)); stable]
Non-faulty processes up to distance 4 from the faulty node are affected.
R4a :: (P(i) = j) ∧ (∃ k ∈ N(i) : P(k) = ⊥) ∧ x(i) ≥ x(k) → P(i) := k
Case 6: fault at distance ≥ 5 from the leader node
[figure: recovery sequence — R4a at node 2 (x(2) > x(1)); R3a at node 3; R5 at node 2; R2 at node 1; R3a at node 3 (x(3) > x(2) and x(4)); recovery complete, current leader unchanged]
With a high m, it is difficult for node 4 to change its parent, but node 3 can easily do it.
Fault-containment in space
Theorem 1. As m → ∞, the effect of a single failure is restricted to within distance-4 of the faulty process, i.e., the algorithm is spatially fault-containing.
Proof idea. Uses the exhaustive case-by-case analysis. The worst case occurs when a node at distance-4 from the leader node fails, as shown earlier.
Fault-containment in time
Theorem 2. The expected number of steps needed to contain a single fault is independent of n. Hence algorithm containment is fault-containing in time.
Proof idea. Case-by-case analysis. When a node beyond distance-4 from the leader fails, its impact on the time complexity remains unchanged. A summary of these calculations follows:
Fault-containment in time
Case 1: the leader fails. Recovery is completed in a single move, regardless of whether node 3 or node 4 executes a move.
Case 2: a node i at distance-1 from the leader fails.
(a) P(i) becomes ⊥: recovery is completed in one step.
(b) P(i) switches to a new parent: recovery time = 2 + Σ_{n=1}^{∞} n/2^n = 4.
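The series in case 2(b) is the standard arithmetico-geometric sum; one way to verify the value 4:

\[
\sum_{n=1}^{\infty} n r^{n} = \frac{r}{(1-r)^{2}} \;\; (|r| < 1)
\quad\Longrightarrow\quad
\sum_{n=1}^{\infty} \frac{n}{2^{n}} = \frac{1/2}{(1/2)^{2}} = 2,
\qquad
2 + \sum_{n=1}^{\infty} \frac{n}{2^{n}} = 2 + 2 = 4.
\]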
Fault-containment in time
Summary of expected containment times:

Fault location      P(i) becomes ⊥    P(i) switches
Fault at leader     1                 —
Fault at dist-1     1                 4
Fault at dist-2     2                 151/108
Fault at dist-3     131/54            115/36
Fault at dist-4     10/9              29/27
Fault at dist ≥ 5   33/32             115/36

Thus, the expected containment time is O(1).
Proof idea of weak stabilization
[figure: DTY algorithm {R1, R2, R3} mapped to our algorithm {R1, R2, R3, R4, R5}]
Our algorithm executes the same action (P(i) := k) as in DTY, but the guards are biased differently. This is equivalent to adding "different delays" in different paths. Every computation of our algorithm is a computation of the DTY algorithm too. Since the DTY algorithm is weakly stabilizing, so is our algorithm.
Stabilization from multiple failures
Theorem 3. When m → ∞, the expected recovery time from multiple failures is O(1) if the faults occur at distance 9 or more apart.
Proof sketch. Since the contamination number is 4, no non-faulty process is influenced by both failures.
[figure: two faults at distance ≥ 9 apart; each contamination region spans 4 hops]
Conclusion
1. With increasing m, the containment in space is tighter, but stabilization from arbitrary initial configurations slows down.
2. LC_s = true, so the system is ready to deal with the next single failure as soon as LC_p holds. This reduces the fault gap and increases system availability.
3. The unbounded secondary variable x can be bounded using the technique discussed in [Dasgupta, Ghosh, Xiao: SSS 2007].
4. It is possible to extend this algorithm to a tree topology (but we do not do it here).
Questions?
Proof of convergence
Theorem 3. The proposed algorithm recovers from all single faults to a legal configuration.
Proof (using the martingale convergence theorem). A martingale is a sequence of random variables X_1, X_2, X_3, … such that ∀ n:
1. E(|X_n|) < ∞, and
2. E(X_{n+1} | X_1, …, X_n) = X_n
(for a super-martingale, replace = with ≤; for a sub-martingale, replace = with ≥).
We use the following corollary of the martingale convergence theorem:
Corollary. If X_n ≥ 0 is a super-martingale, then as n → ∞, X_n converges to a random variable X with probability 1, and E(X) ≤ E(X_0).
Proof of convergence (continued)
Let X_i be the number of processes with enabled guards in step i. After 0 or 1 failure, X can be 0, 2, or 3 (exhaustive enumeration).
When X_i = 0: X_{i+1} = 0 (already stable).
When X_i = 2: E(X_{i+1}) = (1/2)·0 + (1/2)·2 = 1 ≤ 2.
When X_i = 3: E(X_{i+1}) = (1/3)·0 + (1/3)·2 + (1/3)·4 = 2 ≤ 3.
Thus X_1, X_2, X_3, … is a super-martingale. Using the corollary, as n → ∞, E(X_n) ≤ E(X_0). Since X is non-negative and integer-valued, X_n converges to 0 with probability 1, and the system stabilizes.
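As a sanity check, the conditional expectations above can be estimated empirically. The transition distributions in this sketch are taken as uniform over the successor values enumerated on the slide, which is an assumption on my part.

```python
import random

# Successor distributions for X_i, uniform over the slide's enumerated values:
NEXT = {0: (0,), 2: (0, 2), 3: (0, 2, 4)}

def sample_next(x):
    """Draw X_(i+1) given X_i = x, uniformly over the assumed successor set."""
    return random.choice(NEXT[x])

# Empirically confirm the super-martingale property E(X_(i+1) | X_i) <= X_i:
for x0 in (0, 2, 3):
    est = sum(sample_next(x0) for _ in range(100_000)) / 100_000
    print(f"X_i = {x0}: estimated E(X_(i+1)) = {est:.2f} <= {x0}")
```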