Distributed Error-Confinement — Shay Kutten (Technion), with Yossi Azar (Tel Aviv U.) and Boaz Patt-Shamir (Tel Aviv U.)

Talk Overview (1) Confinement in the context of self-stabilization (2) What is error confinement? (3) The new “agility” measure for fault tolerance (4) The new “core-bootstrapping” idea for algorithms (5) Optimization question and answer for “core” construction (6) Additional results: practical considerations, building blocks, lower bound (7) Generalizations, open problems

“Self-Stabilization” versus Error Confinement - Error confinement can be studied in the context of any fault model. - We study error confinement in the context of “self-stabilization” (explained below), since if we manage to handle such a severe kind of fault, handling other faults may be easier.

Common model for distributed algorithms: nodes A, B, C, D, E connected by links; communication is by messages. Nodes have unique IDs (A, B, …); there is no shared memory. Time complexity: sending a message over a link = one time unit (at most, for asynchronous networks). Each node holds a local state (e.g. X=3); each link has a weight.

“Self-Stabilization” - Node’s state: the values of all its variables. - Global state: the states of all nodes. - Legal states: a set of global states (those desired by the algorithm designers). - Stabilization: legality: starting from any global state, eventually the state is legal; closure: starting from a legal global state, no illegal state is reached (except by faults). A “fault” means starting in an illegal state. Only the state may be faulty, not the program!

Self-stabilization example: token passing (a ring of nodes A, B, C, D, E). Legality: exactly ONE node has the token.

Self-stabilization example: token passing. Legality: exactly ONE node has the token; the token circulates by messages.

Self-stabilization problem example: token passing. Legality: exactly ONE node has the token; the token circulates. A fault brings the system to an illegal global state: here, a spurious second token appears.
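For concreteness, the classic protocol for this token-passing legality is Dijkstra's K-state self-stabilizing ring; the sketch below simulates it (indices 0..4 stand for the ring A–E, a random scheduler plays the central daemon, and the counts of simultaneous "tokens" show convergence from an arbitrary faulty start).

```python
import random

def privileged(x, K):
    """Indices of machines currently holding a privilege (a 'token')."""
    n = len(x)
    privs = [i for i in range(1, n) if x[i] != x[i - 1]]
    if x[0] == x[n - 1]:
        privs.append(0)
    return privs

def fire(x, K, i):
    """Machine i uses its privilege (the token moves on)."""
    if i == 0:
        x[0] = (x[0] + 1) % K
    else:
        x[i] = x[i - 1]

random.seed(1)
n, K = 5, 6                                  # K > n guarantees stabilization
x = [random.randrange(K) for _ in range(n)]  # arbitrary (possibly faulty) start

counts = []
for _ in range(200):
    privs = privileged(x, K)
    counts.append(len(privs))                # how many 'tokens' right now?
    fire(x, K, random.choice(privs))

print(counts[0], counts[-1])
```

Starting from an arbitrary state there may be several tokens, but legality (exactly one) is eventually reached, and closure keeps it that way.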

Talk Overview (1) Confinement in the context of self-stabilization (2) What is error confinement? (3) The new “agility” measure for fault tolerance (4) The new “core-bootstrapping” idea for algorithms (5) Optimization question and answer for “core” construction (6) Additional results: practical considerations, building blocks, lower bound (7) Generalizations, open problems

Motivation: “error propagation” (an example from Internet routing). Node A computes its shortest path to C based on messages from S. (1) Assume no fault: S reports its distance 7 to C; A, whose link to S has weight 4, computes “my distance to C via S: 7+4=11”, and traffic to C flows correctly.

Motivation: “error propagation” (example, cont.). (2) With a fault at B: a state-corrupting fault (the adversary modifies data memory) sets B’s stored “distance 0 to C”. S still correctly reports distance 7, and A still computes “my distance to C via S: 7+4=11”.

Recall: a state-corrupting fault (as in self-stabilization) is not malicious! It is just a one-time change of memory content: here, the adversary modifies B’s data memory to claim “distance 0 to C”.

Motivation: “error propagation” (example, cont.). (3) B’s fault propagates to A: B, whose link to A has weight 2, reports “distance 0 to C”, so A computes “my distance to C via B: 2+0=2”.

Motivation: “error propagation” (example, cont.). (4) Traffic to C is sent the wrong way as a result of the fault propagation.

A node claiming “I have distance 0 to everybody” is, actually, how the Internet (then called “ARPANET”) crashed in 1980.
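The propagation step above can be reproduced in a few lines. This is a toy sketch of the slides’ numbers, with one assumption: the B–S edge weight of 10 is an invented placeholder (only its being large matters).

```python
import math

# Edge weights from the slides' example (the B-S weight is an assumption):
edges = {('A', 'S'): 4, ('A', 'B'): 2, ('S', 'C'): 7, ('B', 'S'): 10}
def w(u, v):
    return edges.get((u, v)) or edges.get((v, u)) or math.inf

# Correct converged estimates of each node's distance to C:
dist = {'C': 0, 'S': 7, 'A': 11, 'B': 17}

# One state-corrupting fault: B's memory now (falsely) claims distance 0.
dist['B'] = 0

# One distance-vector update at the NON-faulty node A:
dist['A'] = min(w('A', u) + dist[u] for u in ('S', 'B'))
print(dist['A'])   # 2 + 0 = 2: B's corruption has propagated to A
```

One update at a perfectly correct node suffices for the fault to propagate, which is exactly what error confinement must prevent.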

“Error confinement”: a non-faulty node A outputs only correct output (or no output at all). A’s output (to routing) on B’s corrupted claim: “I do not believe you!” Sounds impossible?

Error Confinement (formally). Π: problem specification; P: protocol. P solves Π with error confinement if, for any execution of P with behavior β (possibly containing a state-corrupting fault), there exists a behavior β′ ∈ Π such that for all non-faulty nodes v: β′_v = β_v. (Here “behavior” ignores time. Note that “stabilization”, in contrast, deals also with faulty nodes.)

Talk Overview (1) Confinement in the context of self-stabilization (2) What is error confinement? (3) The new “agility” measure for fault tolerance (4) The new “core-bootstrapping” idea for algorithms (5) Optimization question and answer for “core” construction (6) Additional results: practical considerations, building blocks, lower bound (7) Generalizations, open problems

Introducing a new measure of fault resilience: the resilience of a protocol is smaller at first. The environment (e.g. a user) gives input to node S at time t0.

The resilience of a protocol is smaller at first (cont.): if the adversary changes the state of S at a time tf shortly after the input, then the input is lost forever.

The resilience of a protocol grows with time: a fault, even in S, can be tolerated if it “waits” until after S has distributed the input value to the other nodes.

The resilience of a protocol grows with time (cont.): once the value is replicated, the adversary needs to hit more nodes to destroy it, at times tf > t1 > t0.

The resilience continues to grow with time: if no fault has occurred by some later time t3, then the input is replicated even further.

The later the faults, the more faults can be tolerated.

The later the faults, the more faults can be tolerated, if the protocol is designed to be robust: in space-time, the nodes holding the value form a growing “cone” around S.

A “narrow” cone means a LESS fault-tolerant algorithm: slower replication, so fewer nodes offer help.

A “wider” cone means a MORE fault-tolerant algorithm: replication to more nodes, faster.

So, a recovery of corrupted values is theoretically possible for an adversary that is constrained according to a space-time cone. But what is the algorithm that does the recovery?

Constraining faults: agility. A c-constrained environment (c ≤ 1) generates faults tf time units after the input only in a minority of Ball_S(c·tf), the nodes within distance c·tf of S. A broadcast algorithm has agility c if it guarantees error confinement against c-constrained environments.
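The ball in this definition is just a BFS ball. A small sketch (the path graph and c = 0.5 are illustrative assumptions) of how many faults a c-constrained adversary may inject as time passes:

```python
from collections import deque

def ball(adj, s, r):
    """Nodes within hop-distance r of s: Ball_s(r)."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        if dist[u] == r:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return set(dist)

def allowed_faults(adj, s, c, t_f):
    """A c-constrained adversary striking at time t_f may corrupt only a
    strict minority of Ball_s(c * t_f)."""
    b = ball(adj, s, int(c * t_f))
    return (len(b) - 1) // 2

# Path graph s-a-b-c-d: the later the fault, the more faults are tolerated.
adj = {'s': ['a'], 'a': ['s', 'b'], 'b': ['a', 'c'], 'c': ['b', 'd'], 'd': ['c']}
print([allowed_faults(adj, 's', 0.5, t) for t in (2, 4, 6, 8)])   # [0, 1, 1, 2]
```

The growing counts are the “lifting of the constraint”: the allowed fault budget rises as the ball around S grows.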

An algorithm’s “agility” measures the rate at which the constraint on the adversary can be lifted.

Talk Overview (1) Confinement in the context of self-stabilization (2) What is error confinement? (3) The new “agility” measure for fault tolerance (4) The new “core-bootstrapping” idea for algorithms (5) Optimization question and answer for “core” construction (6) Additional results: practical considerations, building blocks, lower bound (7) Generalizations, open problems

The message resides at some set of nodes that we term the “core”.

A node can join the core when it “made sure” it heard the votes of all core nodes

and even the fault can be corrected

Let us view again the join of one node

If the core is such that the adversary’s constraint allows it to hit only a minority of the core, then the message passes to the new node correctly (by a majority of the votes). Disclaimer: any connection to actual historical rivalry is coincidental.

Passing a correct message via a faulty node: assume, for now, that Italy’s vote is not distorted even though it passes via a faulty node. One way to implement this is cryptography; we introduced an error-confined protocol that is non-cryptographic.

“Error confinement”: a non-faulty node A outputs only correct output (or no output at all). A’s output (to routing) on B’s corrupted claim: “I do not believe you!”

and even the fault can be corrected

When the core grows, the algorithm can withstand more faults.

Talk Overview (1) Confinement in the context of self-stabilization (2) What is error confinement? (3) The new “agility” measure for fault tolerance (4) The new “core-bootstrapping” idea for algorithms (5) Optimization question and answer for “core” construction (6) Additional results: practical considerations, building blocks, lower bound (7) Generalizations, open problems

Dilemma: should we add a node to the core ASAP?

Advantage: it enlarges the core now.

Dilemma (cont.): the disadvantage is that adding the node now slows future core growth.

Stages of core growth (example: a bad greedy policy). Greedy core growth ⇒ O(D²) time to reach diameter D ⇒ agility = D / D² = 1/D.

Feasibility: the core at time Ti is constrained as a function of Core(T(i−1)): the new core (radius Ri around S) must be reachable from the old core (radius R(i−1)) in the elapsed time. Optimize agility subject to feasibility.

Agility at time Ti: the radius of the core at time Ti, divided by the time (Ti = the i-th time the core grows). We calculated the optimal T’s and R’s: constant agility, and an optimal number of core increases (logarithmic!).

A Hint About the Analysis. Let R(i) be the i-th distinct radius, and let T(i) be the time it starts. Then T(i) ≥ T(i−1) + R(i−1) + R(i). Unfolding: T(i) ≥ R(i) + 2·Σ_{j<i} R(j). Example (greedy cores): T(0)=0, T(1)=1, T(2)=4, … Easy: T(i) = i², so the agility is still only 1/n.
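Taking the recurrence above with equality gives the fastest feasible growth schedule, and a few lines confirm the greedy numbers quoted on this slide (T(i) = i²):

```python
def schedule_times(R):
    """T(i) = T(i-1) + R(i-1) + R(i): the recurrence from the analysis,
    taken with equality (the fastest feasible growth schedule)."""
    T = [0]
    for i in range(1, len(R)):
        T.append(T[-1] + R[i - 1] + R[i])
    return T

greedy = list(range(10))        # greedy cores: R(i) = i
T = schedule_times(greedy)
print(T)                        # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```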

So what’s the best sequence? Guessing R(i) = 2^i? This implies T(i) ≈ 3·2^i, so the agility is about 2^i / (3·2^(i+1)) = 1/6. The same agility is attained also by R(i) = 3^i (with T(i) ≈ 2·3^i). The best sequence is R(i) = T(i−1), which yields the optimal agility (algebraic proof). A similar classic on-line problem: finding a treasure on the real line, where the cost is the ratio of the distance traversed to the actual distance.
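The candidate sequences can also be compared numerically. This sketch makes two assumptions beyond the slides: the recurrence T(i) = T(i−1) + R(i−1) + R(i) is taken with equality starting from T(0) = R(0), and “agility” is measured as the radius/time ratio just before a growth step (the paper’s exact accounting may differ).

```python
import math

def growth_times(R):
    """T(i) = T(i-1) + R(i-1) + R(i), starting from T(0) = R(0)."""
    T = [R[0]]
    for i in range(1, len(R)):
        T.append(T[-1] + R[i - 1] + R[i])
    return T

def tail_agility(R):
    """Radius/time ratio just before the last growth step: between T(i)
    and T(i+1) the core radius is still only R(i)."""
    T = growth_times(R)
    return R[-2] / T[-1]

n = 30
doubling = [2 ** i for i in range(n)]
tripling = [3 ** i for i in range(n)]

best = [1]                       # the slides' rule: next radius = previous T
T = [1]
for i in range(1, n):
    best.append(T[-1])
    T.append(T[-1] + best[i - 1] + best[i])

print(round(tail_agility(doubling), 4))   # 0.1667 (about 1/6)
print(round(tail_agility(tripling), 4))   # 0.1667 (about 1/6)
print(round(tail_agility(best), 4))       # 0.1716: slightly better
```

Under this accounting both geometric sequences level off at 1/6, while R(i) = T(i−1) does slightly better (the limit works out to 3 − 2√2 ≈ 0.1716).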

Talk Overview (1) Confinement in the context of self-stabilization (2) What is error confinement? (3) The new “agility” measure for fault tolerance (4) The new “core-bootstrapping” idea for algorithms (5) Optimization question and answer for “core” construction (6) Additional results: practical considerations, building blocks, lower bound (7) Generalizations, open problems

Additional results. An error-confined protocol to compute distances from a node S (Bellman-Ford’s algorithm is self-stabilizing but is not error-confined). An error-confined protocol for broadcast with a correct source. A lower bound (mandatory slowdown): consider an error-confined algorithm for BROADCAST, even with a correct source; then no correct node v outputs before time 2·dist(source, v), even in the absence of faults. Generalizations, practical considerations.

(Additional results:) Practical considerations. The common practical approach to resiliency by replication: replicate from the primary to ONE secondary.

Practical implications: the common approach replicates to one secondary, which sometimes transmits too. We replicate to everybody, which is impractical, especially if everybody transmits.

Practical implications (cont.). Suggestion: have some (here, 4) secondaries; core = the intersection of Ball(Ri) with the set of secondaries.

(Additional results:) A building block: single-source distance calculation. Input: a “start” signal at the source node (only). Output: the distance d at each node. Behavior: each node’s output sequence is a (possibly empty) silent prefix followed by the correct distance. A self-stabilizing solution is Bellman-Ford (BF), assuming that distance 0 is hardwired at the source; at each step: dist_v ← 1 + min{dist_u : u is a neighbor of v}. But Bellman-Ford is not error-confined!
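The BF rule above is easy to simulate synchronously. A minimal sketch (the path graph, the corrupted state, and the round counts are illustrative assumptions) showing convergence both from a clean start and from a corrupted state:

```python
import math

def bf_round(adj, dist, s):
    """One synchronous Bellman-Ford step:
    dist_v <- 1 + min{dist_u : u neighbor of v}, distance 0 hardwired at s."""
    return {v: 0 if v == s else 1 + min(dist[u] for u in adj[v]) for v in adj}

# Path s - a - b - c
adj = {'s': ['a'], 'a': ['s', 'b'], 'b': ['a', 'c'], 'c': ['b']}

dist = {v: 0 if v == 's' else math.inf for v in adj}   # clean start
for _ in range(3):
    dist = bf_round(adj, dist, 's')
print(dist)   # {'s': 0, 'a': 1, 'b': 2, 'c': 3}

# Self-stabilizing: from a corrupted state it still converges eventually...
dist = {'s': 5, 'a': 0, 'b': 9, 'c': 0}                # arbitrary corruption
for _ in range(4):
    dist = bf_round(adj, dist, 's')
print(dist)   # ...but the intermediate estimates were wrong: not error-confined.
```

In the second run the non-faulty nodes hold wrong estimates for several rounds, which is exactly the failure of error confinement the next slides illustrate.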

Bellman-Ford is not error-confined (an example). In the following figures, red nodes are faulty and 0,1,2,…,9 are distance values (for brevity, node names are not written); e.g., dist_v = 0 and dist_w = 9 at the faulty nodes.

The first time unit after the faults in BF (green marks the nodes that acted): after one time unit, the error has already propagated.

(Animation, omitted: frames after two through nine time units show the erroneous distance values spreading further from the faulty region at each step.)

How to make BF error-confined. Observation on BF (regardless of the initial state, assuming dist_s ≡ 0): for all times t and all nodes v, dist_v ≥ min(t, distance(s,v)), where dist_v is the node’s variable and distance(s,v) is the actual distance. Proof: by induction on time and on dist_v. (Intuitively, if your estimate dist_v is wrong, you increase it every time unit, since it is 1 + the estimate of a neighbor, which is also increased every time unit.)

Observations about BF. Recall: dist_v ≥ min(t, distance(s,v)). Similarly: if distance(s,v) ≤ d (the actual distance), then dist_v ≤ d for all t ≥ d. Proof: by induction on the distance (intuitively, the wave from S has already arrived at v by time d). Hence, if dist_v = d (the node’s estimate) for more than d time units, then distance(s,v) = d.

Error-Confined BF. Maintain an internal distance estimate dist_v à la BF. Maintain a counter of the time since the last change: whenever dist_v changes, reset the counter to 0; otherwise increment the counter by 1. If the counter is larger than the distance estimate dist_v, make dist_v the output value; otherwise don’t change the output! Due diligence before a public change of heart.
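A minimal synchronous sketch of this counter rule (the graph, node names, and the round structure are illustrative assumptions, not the paper's exact code). Note that the non-faulty nodes never output a wrong value, even though one node starts corrupted:

```python
import math

class Node:
    def __init__(self, name, is_source=False):
        self.name = name
        self.is_source = is_source
        self.est = 0 if is_source else math.inf   # internal BF estimate
        self.counter = 0                          # time since est last changed
        self.output = None                        # public value: None = silent

def ec_bf_round(nodes, adj):
    """One synchronous round of counter-guarded ('error-confined') BF."""
    new_est = {}
    for v in nodes.values():
        if v.is_source:
            new_est[v.name] = 0
        else:
            new_est[v.name] = 1 + min(nodes[u].est for u in adj[v.name])
    for v in nodes.values():
        if new_est[v.name] != v.est:
            v.est, v.counter = new_est[v.name], 0  # a change resets the counter
        else:
            v.counter += 1
        if v.counter > v.est:                      # due diligence passed
            v.output = v.est

adj = {'s': ['a'], 'a': ['s', 'b'], 'b': ['a', 'c'], 'c': ['b']}
nodes = {n: Node(n, is_source=(n == 's')) for n in adj}

# State-corrupting fault at b before the run begins:
nodes['b'].est, nodes['b'].counter = 0, 99

log = []
for t in range(12):
    ec_bf_round(nodes, adj)
    log.append((t, {n: nodes[n].output for n in adj}))

print(log[-1][1])   # {'s': 0, 'a': 1, 'b': 2, 'c': 3}
```

Tracing `log` shows the confinement property: nodes s, a, c only ever output their true distances (or stay silent), and even the faulty b eventually stabilizes to the correct output.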

So why is the revised BF error-confined? Non-faulty nodes: the counter indeed counts the time since the last change. By the previous observation, no wrong estimate will survive long enough to be written to the output value. If no output value is installed yet: a correct estimate will be in place d time units after the start signal or the fault, and after d additional time units it will be output. Faulty nodes may indeed make an arbitrary output initially; confinement is about non-faulty nodes! Eventually the estimates stabilize, the counter exceeds the estimate, and the correct output is made (i.e., stabilization), even at faulty nodes.

(Additional:) A building block, Algorithm B: broadcast with a correct source. Simple-minded flooding (relay the message to all neighbors, with marking for termination) is not enough: a single bad “bottleneck” node can corrupt a lot of nodes. But faults are transient…

(Additional:) A building block, Algorithm B (cont.). Idea: piggyback the broadcast value on the BF messages. Maintain an additional estimate variable for the broadcast value; reset the counter if either estimate changes; output the broadcast estimate once counter > distance estimate.

Error confinement and slowdown Is the slowdown mandatory? Theorem: Consider an error-confined algorithm for BCAST with correct source. Then no correct node v outputs before time 2·dist(source,v), even in the absence of faults.

Error confinement implies a factor-2 slowdown (cont.). Proof: take the line graph 0,1,2,…, where node 0 is the source; focus on node i. Derive a contradiction from three executions. e0: no input. e1: input 1 at time 0; at node i, e0 and e1 first differ after step t1 ≥ i. Suppose that i outputs at time t2 < t1 + i. Define e2: at times 0,…,t1, like e0; at time t1: input 0 at the source, and a fault at nodes 1,2,…,i−1 changes their states to be as in e1. Node i can’t tell e1 from e2 by time t2!

Some related notions. Self-stabilization [Dijkstra; Lamport] brings the system to a consistent global state eventually. If the input is re-injected repeatedly (as in the routing-information example), this corrects the states eventually; if the input is not reintroduced by the environment, the input may be lost forever. Local checking [Afek-Kutten-Yung 90] or local detection [Awerbuch-Patt-Shamir-Varghese 91], together with self-stabilizing reset [Katz-Perry 90, AKY 90, Arora-Gouda 90, APV 91], would yield agility only 1/|V|.

Some related notions (cont.). Fault-local algorithms [Kutten-Peleg 95], time-adaptive algorithms [KP 97], fault containment [Ghosh-Gupta-Herman-Pemmaraju 96], and local stabilization [Arora-Zhang 03] allow a “small” (relative to the number of faults) error propagation. Snap-stabilization [Bui-Datta-Petit-Villain 99] considers only the case where some nodes performed a special “purifying” “initiation” action after the faults; a fault may propagate to a non-initiator node until it communicates with initiators. The local stabilization of [Afek-Dolev 97] assumes a node can detect its own faults (since a fault puts the node in a random state).

Open problems Many! Relaxing the known topology requirement? Extension to general reactive problems? Other types of cores? Multiple error batches? Time-adaptive error confinement?