Self-stabilizing Distributed Systems Sukumar Ghosh Professor, Department of Computer Science University of Iowa.


Introduction

Failures and Perturbations Fact 1. All modern distributed systems are dynamic. Fact 2. Failures and perturbations are a part of such distributed systems.

Classification of failures: crash failure, omission failure, transient failure, Byzantine failure, software failure, temporal failure, security failure, and environmental perturbations.

Classifying fault-tolerance Masking tolerance. Application runs as it is. The failure does not have a visible impact. All properties (both liveness & safety) continue to hold. Non-masking tolerance. Safety property is temporarily affected, but not liveness. Example 1. Clocks lose synchronization, but recover soon thereafter. Example 2. Multiple processes temporarily enter their critical sections, but thereafter, the normal behavior is restored. Backward error-recovery vs. forward error-recovery

Backward vs. forward error recovery. Backward error recovery: when the safety property is violated, the computation rolls back and resumes from a previous correct state. Forward error recovery: the computation does not try to get the history right, but moves on, as long as the safety property is eventually restored. This is the approach taken by self-stabilizing systems.

So, what is self-stabilization? It is a technique for spontaneous healing after a transient failure or perturbation: a non-masking tolerance (forward error recovery) that guarantees eventual safety following failures. Feasibility was demonstrated by Dijkstra in his 1974 Communications of the ACM article.

Why self-stabilizing systems? It is nice to have the ability to recover spontaneously from any initial configuration to a legitimate configuration. It implies that no initialization is ever required. Such systems can be deployed ad hoc, and are guaranteed to function properly in bounded time. Such systems restore their functionality without any external intervention.

Two properties: convergence (starting from any configuration, the system eventually reaches a legal configuration) and closure (once in a legal configuration, it remains there unless a failure occurs).

Self-stabilizing systems [figure: the state space, with the set of legal configurations as a subset; from any state, the system converges into the legal set]

Example 1: Self-stabilizing mutual exclusion on a ring (Dijkstra 1974). Consider a unidirectional ring of processes 0, 1, …, N-1. In the legal configuration, exactly one token circulates in the network.

Stabilizing mutual exclusion on a ring

{Process 0} repeat
  x[0] = x[N-1] → x[0] := x[0] + 1 mod K
forever

{Process j > 0} repeat
  x[j] ≠ x[j-1] → x[j] := x[j-1]
forever

The state of process j is x[j] ∈ {0, 1, 2, …, K-1}. (Also, K > N.) A TOKEN is an ENABLED GUARD. Hand-execute this first, before proceeding further: start the system from an arbitrary initial configuration.
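The two guarded commands above can be simulated directly. The sketch below is an illustrative Python harness (not part of the original slides): it runs the protocol under a random scheduler from an arbitrary initial configuration and counts the steps until exactly one guard is enabled.

```python
import random

def enabled(x, j, N, K):
    """A process holds the token iff its guard is enabled."""
    if j == 0:
        return x[0] == x[N - 1]
    return x[j] != x[j - 1]

def step(x, j, N, K):
    """Execute the action of an enabled process j."""
    if j == 0:
        x[0] = (x[0] + 1) % K
    else:
        x[j] = x[j - 1]

def stabilize(x, K):
    """From an arbitrary state, schedule enabled guards at random
    until exactly one token remains; return the number of steps."""
    N = len(x)
    steps = 0
    while True:
        tokens = [j for j in range(N) if enabled(x, j, N, K)]
        if len(tokens) == 1:          # legal configuration reached
            return steps
        step(x, random.choice(tokens), N, K)
        steps += 1

random.seed(1)
N, K = 6, 7                           # K > N, as the slide requires
x = [random.randrange(K) for _ in range(N)]
print(stabilize(x, K))                # steps until a single token remains
```

Note that the scheduler never deadlocks: as the correctness outline argues, at least one guard is always enabled.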

Stabilizing mutual exclusion on a ring (N = 6, K = 7)

{Process 0} repeat
  x[0] = x[N-1] → x[0] := x[0] + 1 mod K
forever

{Process j > 0} repeat
  x[j] ≠ x[j-1] → x[j] := x[j-1]
forever

Outline of Correctness Proof. (Absence of deadlock) If no process j > 0 has an enabled guard, then x[0] = x[1] = x[2] = … = x[N-1]. But that means the guard of process 0 is enabled. (Proof of closure) In a legal configuration, if a process executes an action, its own guard becomes disabled and its successor's guard becomes enabled. So the number of tokens (= enabled guards) remains unchanged: if the system is already in a good configuration, it remains so (unless, of course, a failure occurs).

Correctness Proof (continued). Proof of convergence. Since K > N, some value x ∈ {0, 1, …, K-1} is "missing" from the system, i.e., no process currently holds it. Processes 1..N-1 only acquire their states from their left neighbors, so none of them can attain x before process 0 does. Eventually process 0 attains the state x (liveness). Thereafter, all processes attain the state x before process 0 becomes enabled again. This is a legal configuration (only process 0 has a token). Thus the system is guaranteed to recover from a bad configuration to a good configuration.

To disprove. To prove that a given algorithm is not self-stabilizing to L, it is sufficient to show that either (1) there exists a deadlock configuration, or (2) there exists a cycle of illegal configurations (∉ L) in the history of the computation, or (3) the system stabilizes to a configuration L′ ≠ L.

Exercise. Consider a completely connected network of n processes numbered 0, 1, …, n-1. Each process i has a variable L(i) that is initialized to i. The goal of the system is to make the values of all L(i)'s identical. For this, each process i executes the following algorithm:

repeat
  ∃ j ∈ neighbor(i): L(i) ≠ L(j) → L(i) := L(j)
forever

Question: Is the algorithm self-stabilizing?
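One way to build intuition for the exercise is to simulate it. The Python harness below is a hypothetical aid, not an answer: it runs the protocol under one particular randomized scheduler, whereas self-stabilization must hold under every scheduler.

```python
import random

def run(Lvals, max_steps=100000):
    """Run the exercise's protocol under a randomized scheduler.
    The graph is complete, so every other process is a neighbor.
    Returns the step count on convergence, or None on timeout."""
    n = len(Lvals)
    for steps in range(max_steps):
        enabled = [i for i in range(n)
                   if any(Lvals[i] != Lvals[j] for j in range(n) if j != i)]
        if not enabled:
            return steps                  # all L-values identical
        i = random.choice(enabled)
        # copy the value of some neighbor holding a different value
        j = random.choice([j for j in range(n)
                           if j != i and Lvals[j] != Lvals[i]])
        Lvals[i] = Lvals[j]
    return None

random.seed(3)
print(run(list(range(5))))                # steps taken by one random run
```

Try replacing the random choices with an adversarial schedule and see whether convergence is still guaranteed.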

Example 2: Clock phase synchronization. Consider a system of n clocks ticking at the same rate. Each clock is 3-valued, i.e., it ticks as 0, 1, 2, 0, 1, 2, … A failure may arbitrarily alter the clock phases. The clock phases need to stabilize, i.e., they need to return to the same phase. Design an algorithm for this.

The algorithm: clock phase synchronization

{Program for each clock i} (c[i] = phase of clock i, initially arbitrary; ∀i: c[i] ∈ {0, 1, 2})
repeat
  R1. ∃ j ∈ N(i) :: c[j] = c[i] + 1 mod 3 → c[i] := c[i] + 2 mod 3
  R2. ∀ j ∈ N(i) :: c[j] ≠ c[i] + 1 mod 3 → c[i] := c[i] + 1 mod 3
forever

First, verify that it "appears to be" correct. Work out a few examples.
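As a sanity check, the synchronous system can be simulated. The Python sketch below is illustrative and assumes the clocks form an array 0, 1, …, n-1 (the topology suggested by the figure and by the ring-topology exercise); every clock applies R1 or R2 in lockstep each round.

```python
import random

def tick(c):
    """One synchronous round: every clock applies R1 or R2."""
    n = len(c)
    nxt = []
    for i in range(n):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]   # array topology
        if any(c[j] == (c[i] + 1) % 3 for j in nbrs):
            nxt.append((c[i] + 2) % 3)   # R1: some neighbor is one step ahead
        else:
            nxt.append((c[i] + 1) % 3)   # R2: tick normally
    return nxt

random.seed(7)
c = [random.randrange(3) for _ in range(8)]   # arbitrarily corrupted phases
rounds = 0
while len(set(c)) > 1 and rounds < 100:
    c = tick(c)
    rounds += 1
print(c, rounds)                              # all phases equal after a few rounds
```

Once the phases agree, no clock sees a neighbor one step ahead, so every clock applies R2 and they keep ticking in unison (closure).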

Why does it work? Understand the game of arrows on the array of clocks 0, 1, 2, …, n-1. Let D = d[0] + d[1] + d[2] + … + d[n-1], where
  d[i] = 0 if no arrow points towards clock i;
       = i + 1 if a ← points towards clock i;
       = n − i if a → points towards clock i;
       = 1 if both → and ← point towards clock i.
By definition, D ≥ 0. Also, D decreases after every step of the system. So the number of arrows must eventually reduce to 0.

Exercise. 1. Why 3-valued clocks? What happens with clocks of more than three values? 2. Will the algorithm work for a ring topology? Why or why not?

Example 3: Self-stabilizing spanning tree. Problem description: given a connected graph G = (V,E) and a root r, design an algorithm for maintaining a spanning tree in the presence of transient failures that may corrupt the local states of processes (and hence the spanning tree). Let n = |V|.

Different scenarios Parent(2) is corrupted

Different scenarios The distance variable L(3) is corrupted

Definitions. Each process i maintains two variables: P(i), its parent pointer, and L(i), its level, i.e., its distance from the root along tree edges. The root r has L(r) = 0. N(i) denotes the set of neighbors of i.

The algorithm

repeat
  R1. (L(i) ≠ n) ∧ (L(P(i)) ≠ n) ∧ (L(i) ≠ L(P(i)) + 1) → L(i) := L(P(i)) + 1
  R2. (L(i) ≠ n) ∧ (L(P(i)) = n) → L(i) := n
  R3. (L(i) = n) ∧ (∃ k ∈ N(i) : L(k) < n − 1) → L(i) := L(k) + 1; P(i) := k
forever

(In the figure, P(2) is corrupted; the blue labels denote the values of L.)
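The three rules can be exercised on a small example. The Python sketch below is illustrative only: the 6-node graph and the corrupted initial values are made up, and a randomized scheduler stands in for the adversarial daemon.

```python
import random

def stabilize_tree(adj, root, L, P, max_steps=10000):
    """Apply R1-R3 at randomly chosen enabled processes until quiescence."""
    n = len(adj)
    for steps in range(max_steps):
        moves = []
        for i in adj:
            if i == root:
                continue                          # the root never moves; L[root] = 0
            if L[i] != n and L[P[i]] != n and L[i] != L[P[i]] + 1:
                moves.append(('R1', i, None))
            elif L[i] != n and L[P[i]] == n:
                moves.append(('R2', i, None))
            elif L[i] == n:
                ks = [k for k in adj[i] if L[k] < n - 1]
                if ks:
                    moves.append(('R3', i, random.choice(ks)))
        if not moves:
            return steps                          # no guard enabled: stabilized
        rule, i, k = random.choice(moves)
        if rule == 'R1':
            L[i] = L[P[i]] + 1
        elif rule == 'R2':
            L[i] = n
        else:
            L[i], P[i] = L[k] + 1, k

random.seed(5)
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3, 4], 3: [1, 2, 5], 4: [2, 5], 5: [3, 4]}
n = len(adj)
L, P = {0: 0}, {0: 0}
for i in range(1, n):                             # corrupt L and P arbitrarily
    L[i] = random.randrange(n + 1)
    P[i] = random.choice(adj[i])
print(stabilize_tree(adj, 0, L, P))               # steps until quiescence
```

At quiescence every edge (i, P(i)) is well-formed, so the parents form a spanning tree rooted at 0.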

Proof of stabilization. Define an edge from i to P(i) to be well-formed when L(i) ≠ n, L(P(i)) ≠ n, and L(i) = L(P(i)) + 1. In any configuration, the well-formed edges form a spanning forest: delete all edges that are not well-formed. Each tree T(k) in the forest is identified by k, the lowest value of L in that tree.

Example. In the sample graph shown earlier, the original spanning tree is decomposed into two well-formed trees: T(0) = {0, 1} and T(2) = {2, 3, 4, 5}. Let F(k) denote the number of T(k)'s in the forest, and define the tuple F = (F(0), F(1), F(2), …, F(n)). For the sample graph, F = (1, 0, 1, 0, 0, 0) after node 2 has a transient failure.

Proof of stabilization (continued). Minimum F = (1, 0, 0, 0, 0, 0) {the legal configuration}. Maximum F = (1, n-1, 0, 0, 0, 0) (considering lexicographic order). With each action of the algorithm, F decreases lexicographically. Verify the claim! This proves that eventually F becomes (1, 0, 0, 0, 0, 0) and the spanning tree stabilizes. What is an upper bound on the time complexity of this algorithm?

Conclusion. Classical self-stabilization does not allow the code itself to be corrupted. Can we do anything about it? Consider also the fault-containment problem. The concept of a transient fault is now quite relaxed: such failures now include perturbations (like node mobility in wireless sensor networks), changes in the environment, changes in the scale of systems, and changes in user demand for resources. The tools for tolerating these are varied, and still evolving.

Questions?

Applications. Concepts similar to stabilization have been present in the networking area for quite some time. Wireless sensor networks have given us a new platform. There are many examples of systems that recover from limited perturbations; these mostly characterize self-healing and self-organizing systems.

Pursuer-Evader Games. In a disaster zone, rescuers (pursuers) try to track hot spots (evaders) using sensor networks. How soon can the pursuers catch the evader? (Arora, Demirbas, Gouda 2003)

Pursuer-Evader Games. The evader is omniscient, and its strategy is unknown. The pursuer can only see the state of the nearest node, but moves faster than the evader. Design a program for the nodes and the pursuer so that the pursuer can catch the evader (despite the occurrence of faults).

Main idea. A balanced tree (DFS) is continuously maintained with the evader as the root. The pursuer climbs "up the tree" to reach the evader.