Published by Derek Hudson. Modified over 8 years ago.
2
Distributed Error Confinement. Shay Kutten (Technion), with Boaz Patt-Shamir (Tel Aviv U.) and Yossi Azar (Tel Aviv U.)
3
Talk Overview: (1) What is error confinement? (2) The new “agility” measure for fault tolerance. (3) The new “core-bootstrapping” idea for the algorithm. (4) Optimization question and answer for “core” construction.
4
Motivation: “error propagation” (example). Internet routing: node A computes its shortest path to C based on messages from S. (1) Assume no fault: S announces distance 7 to C over a link of cost 4, so A computes “My distance to C via S: 7+4=11” and sends traffic to C via S. [Figure: nodes S, A, B, C]
5
Motivation: “error propagation” (example). (2) With a fault at B: a state-corrupting fault (the adversary modifies data memory) makes B claim distance 0 to C. S still announces distance 7 to C, and A still holds “My distance to C via S: 7+4=11”. [Figure: nodes S, A, B, C; link costs 4 (A to S) and 2 (A to B)]
6
State-corrupting fault (as in self-stabilization): not malicious! Just a one-time change of memory content. [Figure: the adversary modifies B's data memory, so B claims distance 0 to C; link costs 7, 4 and 2 as before]
7
Motivation: “error propagation” (example). (2) With a fault at B: the faulty B claims distance 0 to C, while S announces distance 7 to C. A still holds “My distance to C via S: 7+4=11”. [Figure: nodes S, A, B, C; link costs 4 (A to S) and 2 (A to B)]
8
Motivation: “error propagation” (example). (3) B’s fault propagated to A: B claims distance 0 to C over a link of cost 2, so A now computes “My distance to C via B: 2+0=2”, which beats the correct route via S (7+4=11). [Figure: nodes S, A, B, C]
9
Motivation: “error propagation” (example). (4) Traffic to C is sent the wrong way as a result of the fault propagation: B’s fault propagated to A, which computed “My distance to C via B: 2+0=2”. [Figure: nodes S, A, B, C; traffic to C flows from A toward B]
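The four steps above can be replayed with a minimal distance-vector sketch. The link costs (4 to S, 2 to B) and the announced distances (7 and 0) come from the example; B's honest distance of 13 and the function name `best_route` are assumptions made only for illustration.

```python
def best_route(link_cost, advertised):
    """Return (next_hop, distance) minimizing link cost + advertised distance."""
    return min(
        ((nbr, link_cost[nbr] + advertised[nbr]) for nbr in link_cost),
        key=lambda pair: pair[1],
    )

link_cost = {"S": 4, "B": 2}       # costs of A's links, as in the example
honest = {"S": 7, "B": 13}         # 13 is an assumed honest value for B
corrupted = {"S": 7, "B": 0}       # after the fault, B claims distance 0 to C

print(best_route(link_cost, honest))     # ('S', 11)
print(best_route(link_cost, corrupted))  # ('B', 2)
```

A plain minimum has no way to distrust B's claim, which is why a single corrupted memory cell is enough to redirect A's traffic.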
10
This is, actually, how the Internet (then called “ARPANET”) crashed in 1980: a faulty node announced “I have distance 0 to everybody”. [Figure: nodes S, A, B, C; traffic for C and D is drawn toward the faulty node]
11
“Error confinement”: a non-faulty node A outputs only correct output (or no output at all). The faulty B claims distance 0 to C; A keeps “My distance to C via S: 7+4=11” and outputs (to routing): “I do not believe you!” Sounds impossible? [Figure: nodes S, A, B, C]
12
Error confinement (formally). $\Sigma$: problem specification; $P$: protocol. $P$ solves $\Sigma$ with error confinement if for any execution of $P$ with behavior $\beta$ (possibly containing a state-corrupting fault), there exists a behavior $\beta' \in \Sigma$ such that for all non-faulty nodes $v$: $\beta'_v = \beta_v$. Notes: “stabilization” deals also with faulty nodes; a behavior ignores time.
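The definition can be read as a predicate over observed outputs. The sketch below is only illustrative: it assumes the specification is given as a finite list of legal behaviors, each a mapping from node to output, and the name `is_error_confined` is hypothetical.

```python
def is_error_confined(behavior, legal_behaviors, faulty):
    """Return True if some legal behavior matches `behavior` on every non-faulty node."""
    non_faulty = [v for v in behavior if v not in faulty]
    return any(
        all(legal.get(v) == behavior[v] for v in non_faulty)
        for legal in legal_behaviors
    )

# Toy specification (assumed): broadcasting the value 11 to nodes S, A, B.
legal_behaviors = [{"S": 11, "A": 11, "B": 11}]
observed = {"S": 11, "A": 11, "B": 0}   # B's state was corrupted

print(is_error_confined(observed, legal_behaviors, faulty={"B"}))  # True
print(is_error_confined(observed, legal_behaviors, faulty=set()))  # False
```

Only the outputs of non-faulty nodes are compared, which is the difference from stabilization noted on the slide.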
13
Talk Overview: (1) What is error confinement? (2) The new “agility” measure for fault tolerance. (3) The “core-bootstrapping” idea for the algorithm. (4) Optimization question and answer for “core” construction.
14
Introducing a new measure of fault resilience. The resilience of a protocol is smaller at first: input is given to S by the environment (e.g. a user) at time $t_0$. [Figure: nodes S, A, B, C, D along a time axis marked $t_0$, $t_1$, $t_2$]
15
The resilience of a protocol is smaller at first (cont.). The environment (e.g. a user) gives input to S at time $t_0$. If the adversary changes the state of S at time $t_f$, shortly after the input... [Figure: nodes S, A, B, C, D along a time axis marked $t_0$, $t_f$, $t_1$, $t_2$]
16
The resilience of a protocol is smaller at first (cont.). The environment (e.g. a user) gives input to S at time $t_0$. If the adversary changes the state of S at time $t_f$, shortly after the input, then the input is lost forever. [Figure: nodes S, A, B, C, D along a time axis marked $t_0$, $t_f$, $t_1$, $t_2$]
17
The resilience of a protocol grows with time. However, a fault, even in S, can be tolerated if it waits until after S has distributed the input value. [Figure: nodes S, A, B, C, D at successive times $t_0$, $t_1$, $t_2$; the input arrives at S and the fault occurs at $t_f$]
18
The resilience of a protocol grows with time (cont.). However, a fault, even in S, can be tolerated if it waits until after S has distributed the input value. [Figure: input at $t_0$, distribution from S to A, B, C, D, fault at $t_f$; time axis marked $t_0$, $t_1$, $t_2$]
19
The resilience of a protocol grows with time. A fault, even in S, can be tolerated if it “waits” until after S has distributed the input value. [Figure: input at $t_0$, distribution from S to A, B, C, D, fault at $t_f$; time axis marked $t_0$, $t_1$, $t_2$]
20
The resilience of a protocol grows with time. A fault, even in S, can be tolerated if it “waits” until after S has distributed the input value. [Figure: after the distribution, S is hit at $t_f$ while A, B, C, D still hold the input; time axis marked $t_0$, $t_1$, $t_2$]
21
The resilience of a protocol grows with time. To destroy the replicated value, the adversary needs to hit more nodes at $t_f > t_1 > t_0$. [Figure: input at $t_0$, nodes S, A, B, C, D at times $t_0$ and $t_1$, fault at $t_f$]
22
The resilience continues to grow with time. If no faults occurred by some later time $t_3$, then the input is replicated even further. [Figure: nodes S, A, B, C, D at times $t_1$, $t_2$, $t_3$]
23
The resilience continues to grow with time. The later the faults, the more faults can be tolerated. [Figure: nodes S, A, B, C, D at times $t_1$, $t_2$, $t_3$; fault at $t_f$]
24
The later the faults, the more faults can be tolerated, if the protocol is designed to be robust. [Figure: space-time diagram (axes: space, time; times $t_1$, $t_2$, $t_3$) in which the nodes holding the input form a cone starting at S]
25
A “narrow” cone: a LESS fault-tolerant algorithm. Slower replication, so fewer nodes offer help. [Figure: space-time diagram with a narrow cone starting at S; times $t_1$, $t_2$, $t_3$]
26
A “wider” cone: a more fault-tolerant algorithm. Replication to more nodes, faster. [Figure: space-time diagram with a wide cone starting at S; times $t_1$, $t_2$, $t_3$]
27
So, a recovery of corrupted values is theoretically possible for an adversary that is constrained according to a space-time cone. But what is the algorithm that does the recovery? [Figure: cone starting at S along the time axis]
28
Constraining faults: agility. A $c$-constrained environment is an environment that generates faults $t_f$ time units after the input ($c \le 1$) only in a minority of the $|\mathrm{Ball}_S(c \cdot t_f)|$ nodes. An algorithm with agility $c$ is a broadcast algorithm that guarantees error confinement against $c$-constrained environments. [Figure: $\mathrm{Ball}_S(c \cdot t_f)$ around S inside the node set V]
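A small sketch of how the constraint could be checked on a concrete graph. It rests on assumptions not stated on the slide: distances are hop counts, “minority” means a strict minority of the ball's nodes, and faults outside the ball are ignored. The names `ball` and `is_c_constrained` are hypothetical.

```python
from collections import deque

def ball(graph, source, radius):
    """Nodes within hop distance `radius` of `source` (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist and dist[u] + 1 <= radius:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def is_c_constrained(graph, source, c, t_f, faulty):
    """True if the faulty nodes are a strict minority of Ball(source, c * t_f)."""
    nodes = ball(graph, source, c * t_f)
    return 2 * len(faulty & nodes) < len(nodes)

# Path S - A - B - C - D (an assumed toy topology).
path = {"S": ["A"], "A": ["S", "B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

print(is_c_constrained(path, "S", 0.5, 1, {"S"}))  # False: the ball is just {S}
print(is_c_constrained(path, "S", 0.5, 4, {"S"}))  # True: the ball is {S, A, B}
```

The two calls mirror the earlier slides: a fault in S right after the input is not allowed, while the same fault at a later time is within the constraint.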
29
An algorithm’s “agility” measures the rate at which the constraint on the adversary can be lifted. [Figure: space-time diagram starting at S, labeled “Agility”]
30
Talk Overview: (1) What is error confinement? (2) The new “agility” measure for fault tolerance. (3) The new “core-bootstrapping” idea for the algorithm. (4) Optimization question and answer for “core” construction.
31
The message resides at a set of nodes we term the “core”.
32
A node can join the core once it has “made sure” that it heard the votes of all core nodes.
36
...and even the fault can be corrected.
44
Let us look again at how one node joins.
45
If the core is such that the adversary’s constraint allows it to hit only a minority of the core, then the message passes to the new node correctly.
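The join step can be pictured as a majority vote over the values reported by the core nodes. This is a minimal sketch of that idea, not the protocol from the talk; the function name `join_core` and the rule of returning `None` without a strict majority are assumptions.

```python
from collections import Counter

def join_core(votes):
    """Return the value held by a strict majority of core votes, else None."""
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if 2 * count > len(votes) else None

print(join_core([11, 11, 0]))  # 11: one corrupted core node is outvoted
print(join_core([11, 0]))      # None: no strict majority, so no output
```

Returning no value when the vote is inconclusive matches the requirement that a non-faulty node outputs only correct output or no output at all.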
46
If the core is such that the adversary’s constraint allows it to hit only a minority of the core, then the message passes to the new node correctly. Disclaimer: any connection to actual historical rivalry is coincidental.
48
“Error confinement”: a non-faulty node A outputs only correct output (or no output at all). The faulty B claims distance 0 to C; A keeps “My distance to C via S: 7+4=11” and outputs (to routing): “I do not believe you!” [Figure: nodes S, A, B, C, D]
49
...and even the fault can be corrected.
50
As the core grows, the algorithm can withstand more faults.
52
Talk Overview: (1) What is error confinement? (2) The new “agility” measure for fault tolerance. (3) The new “core-bootstrapping” idea for the algorithm. (4) Optimization question and answer for “core” construction.
53
Dilemma: should we add a node to the core ASAP? [Figure: nodes S, A, B, C, D, E, F, G, H]
54
Advantage: enlarges the core now. [Figure: nodes S, A, B, C, D, E, F, G, H]
55
Dilemma: should we add a node to the core ASAP? Disadvantage: slows future core growth. [Figure: nodes S, A, B, C, D, E, F, G, H]
58
Stages of core growth (example: bad greedy policy). Greedy core growth: $O(D^2)$ time to reach diameter $D$. Agility $= D / D^2 = 1/D$.
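One way to see where a quadratic bound can come from, under a cost model that is an assumption and not taken from the slides: if the greedy policy grows the core radius by one per stage, and the stage that reaches radius $i$ takes about $i + (i-1) = 2i - 1$ time units (a node at distance $i$ waiting for votes from core nodes up to distance $i-1$ on the far side of S), then the stages add up as follows.

```latex
\[
T_{\text{greedy}}(D) \;=\; \sum_{i=1}^{D} (2i - 1) \;=\; D^2,
\qquad
\text{agility} \;=\; \frac{D}{T_{\text{greedy}}(D)} \;=\; \frac{D}{D^2} \;=\; \frac{1}{D}.
\]
```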
59
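Optimize agility subject to feasibility: the core at time $T_i$ as a function of $\mathrm{Core}(T_{i-1})$. [Figure: $\mathrm{Core}(T_{i-1})$ of radius $R_{i-1}$ around S inside $\mathrm{Core}(T_i)$ of radius $R_i$; nodes U and V]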
60
Agility at time $T_i$: the radius of the core at time $T_i$ divided by the time ($T_i$: the $i$-th time the core grows). [Figure: plot of core radius against time; axis marks $R_1$, $R_2$ and $T_1$, $T_2$, $T_3$; label “Agility”]
61
Agility at time $T_i$: the radius of the core at time $T_i$ divided by the time ($T_i$: the $i$-th time the core grows). We calculated the optimal $T$’s and $R$’s. [Figure: plot of core radius against time; axis marks $R_1$, $R_2$ and $T_1$, $T_2$, $T_3$; label “Agility”]
62
Agility at time $T_i$: the radius of the core at time $T_i$ divided by the time ($T_i$: the $i$-th time the core grows). We calculated the optimal $T$’s and $R$’s. Constant agility. [Figure: plot of core radius against time; axis marks $R_1$, $R_2$ and $T_1$, $T_2$, $T_3$; label “Agility”]
63
Agility at time $T_i$: the radius of the core at time $T_i$ divided by the time ($T_i$: the $i$-th time the core grows). We calculated the optimal $T$’s and $R$’s. Constant agility. Optimal number of core increases (logarithmic!). [Figure: plot of core radius against time; axis marks $R_1$, $R_2$ and $T_1$, $T_2$, $T_3$; label “Agility”]
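To get a feel for why fewer, larger core increases can help, here is a toy comparison of two growth schedules. The cost model is an assumption made for illustration only (growing the core from radius `prev` to radius `r` costs `r + prev` time units), and the doubling schedule is a stand-in for a geometric policy, not the optimal $T$’s and $R$’s computed in the talk.

```python
def growth_time(radii):
    """Total time under the toy model: a step from radius prev to r costs r + prev."""
    total, prev = 0, 0
    for r in radii:
        total += r + prev
        prev = r
    return total

def greedy_schedule(d):
    """Grow the core radius by one in every stage."""
    return list(range(1, d + 1))

def doubling_schedule(d):
    """Double the core radius in every stage, ending exactly at d."""
    radii, r = [], 1
    while r < d:
        radii.append(r)
        r *= 2
    radii.append(d)
    return radii

def agility(radii):
    """Final core radius divided by the total time needed to reach it."""
    return radii[-1] / growth_time(radii)

print(growth_time(greedy_schedule(8)))    # 64
print(growth_time(doubling_schedule(8)))  # 22
print(len(doubling_schedule(8)))          # 4 stages instead of 8
```

In this toy model the greedy schedule's agility shrinks like 1/D, while the doubling schedule stays above 1/3 with a logarithmic number of stages, which is the qualitative picture the slide describes.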
64
Additional results. An error-confined protocol to compute distances from a node S (the Bellman-Ford algorithm is self-stabilizing but is not error-confined). An error-confined protocol for broadcast with a correct source. Lower bound (mandatory slowdown): consider an error-confined algorithm for BROADCAST, even with a correct source. Then no correct node v outputs before time 2·dist(source,v), even in the absence of faults. Generalizations, practical considerations.
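The lower bound can be tabulated for a concrete graph. The sketch below assumes unit delay per link and hop-count distances; the function name `earliest_output_times` is hypothetical, and the code only evaluates the bound 2·dist(source,v), it does not implement the protocol.

```python
from collections import deque

def earliest_output_times(graph, source):
    """Map each reachable node v to the lower bound 2 * dist(source, v)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return {v: 2 * d for v, d in dist.items()}

# Path S - A - B - C (an assumed toy topology).
path = {"S": ["A"], "A": ["S", "B"], "B": ["A", "C"], "C": ["B"]}
print(earliest_output_times(path, "S"))  # {'S': 0, 'A': 2, 'B': 4, 'C': 6}
```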
65
Some related notions. Self-stabilization [Dijkstra, Lamport] would bring the system to a consistent global state eventually. If the input is re-injected repeatedly (as in the routing information example), this corrects the states eventually. If the input is not reintroduced by the environment, then the input may be lost forever. Local checking [Afek-K-Yung90], or local detection [Awerbuch-P-Varghese91], together with self-stabilizing reset [KatzPerry90, AKY90, AroraGouda90, APV91], would yield agility 1/|V|.
66
Some related notions. Fault-local algorithms [K-Peleg95], or time-adaptive [KP97], or fault containment [GhoshGuptaHermanPemmaraju96], or local stabilization [AroraZhang03] would allow a “small” (relative to the number of faults) error propagation. Snap stabilization [BuiDattaPetitVillain99] considers only the case that some nodes performed a special “purifying” “initiation” action after the faults. A fault may propagate to a non-initiator node until it communicates with initiators. Local stabilization of [AfekDolev97] assumes a node can detect its own faults (since a fault puts a node in a random state).
67
Open problems: Many!