Continuous fault containment and local stabilization in path-vector routing
Hongwei Zhang, Anish Arora
November 12, 2018
Motivation
- The study of fault containment has focused largely on cases where faults either stop occurring after a certain moment in time or occur with low frequency.
- In practice, faults may occur with high frequency, and the interval between faults may be shorter than the time the system takes to stabilize.
- E.g., under the Code Red/Nimda attack (2002), memory overflow causes edge BGP speakers to repeatedly fail-stop and rejoin, at a frequency as high as once every minute.
  - The oscillation propagates farther away, in spite of the MRAI (Minimum Route Advertisement Interval) timer and RFD (route flap damping).
Objectives
- Formulate concepts that characterize, and develop mechanisms that achieve, the following properties:
  - in the presence of high-frequency faults, the impact of faults is always locally contained;
  - once faults stop occurring, the system stabilizes within time that is a function of the degree of fault perturbation.
- We study these issues in the context of path-vector routing.
  - To simplify the presentation, we first present a solution for continuous fault containment and local stabilization in path-vector routing, then we present the concepts.
Outline
- Fault propagation in path-vector protocols
- CPV: design pattern, protocol
- Generic concepts for tolerating high-frequency faults
- Analytical & simulation results for CPV
- Concluding remarks
Fault propagation in path-vector protocols
[Figure: chain of nodes d, e, f, g, h, i, each holding a route to d: e [e, d], f [f, e, d], g [g, f, e, d], h [h, g, f, e, d], i [i, h, g, f, e, d]. Figure labels: "all are affected", "unaffected?"]
- The fresh information (route announcement) always lags behind the obsolete information (route withdrawal).
Design pattern of CPV
- Key idea: design a mechanism that enables information about a new network state to catch up with, and stop the propagation of, information about the preceding state (which has become obsolete).
  - The mechanism works whether or not faults stop occurring.
- Parallel diffusing waves (with different propagation speeds):
  + Each stabilization wave, as well as each undo-containment wave, stabilizes itself; each containment wave is stabilized (and deactivated) by the corresponding stabilization or undo-containment wave.
  + Each contained wave (e.g., a stabilization wave) sets the boundary of the corresponding containing wave (e.g., a containment wave).
Outline of CPV
- Whenever a node j needs to change state, it engages a containment wave cw0 before engaging a new stabilization wave sw1, so that cw0 stops the previous stabilization wave sw0 from propagating the existing state of j.
- In the presence of high-frequency faults, another fault f may occur before j executes sw1. There are then two cases:
  - j no longer needs to change state: j engages an undo-containment wave uw0 to stop cw0;
  - j still needs to change state: j lets cw0 propagate.
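The case analysis above can be sketched as a small decision function. This is only an illustrative sketch, not the CPV protocol actions themselves: the names `Wave` and `next_wave`, the representation of a state as a route tuple, and the `containment_active` flag are all assumptions introduced here.

```python
from enum import Enum


class Wave(Enum):
    CONTAINMENT = "cw"
    STABILIZATION = "sw"
    UNDO_CONTAINMENT = "uw"


def next_wave(current_state, target_state, containment_active):
    """Pick the wave node j should engage next (hypothetical helper).

    current_state: the state j currently holds and has propagated
    target_state: the state j should hold given what it knows now
    containment_active: True if j has engaged a containment wave whose
        stabilization wave has not been executed yet
    """
    needs_change = current_state != target_state
    if not containment_active:
        # A state change always starts with a containment wave.
        return Wave.CONTAINMENT if needs_change else None
    if needs_change:
        # The containment wave keeps propagating; the stabilization wave follows.
        return Wave.STABILIZATION
    # A later fault made the change unnecessary: cancel the containment wave.
    return Wave.UNDO_CONTAINMENT
```

For example, if j engaged cw0 to move away from route (j, e, d) and a second fault restores that route before sw1 runs, the function returns `Wave.UNDO_CONTAINMENT`.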
A little more detail
- Containment wave
  - piggybacks the expected next state of a node to its neighbors, so that a neighbor can decide whether to hold an existing SW
  - is a one-way diffusing process, by which a CW can co-exist with the corresponding SW (which is required to contain continuously occurring faults)
- Stabilization wave
  - takes into account the predicted state when choosing the next-hop
- Undo-containment wave
  - does not introduce new variables
Protocol CPV ds > α·(dc+U), dc > α·(du+U), du ≥ 0 containment wave
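As a quick sanity check, the three inequalities on ds, dc and du can be evaluated directly. The sketch below is hypothetical: the function name is invented here, α and U are used only as they appear in the constraint (the slide does not define them), and the numeric values given to them are purely illustrative. The values ds = 30, dc = 10, du = 1 are the ones listed in the simulation setup.

```python
def satisfies_cpv_delays(ds, dc, du, alpha, U):
    """Check ds > alpha*(dc+U), dc > alpha*(du+U), du >= 0 (hypothetical helper)."""
    return ds > alpha * (dc + U) and dc > alpha * (du + U) and du >= 0


# Simulation-setup delays (seconds) with illustrative, assumed alpha and U.
ok = satisfies_cpv_delays(ds=30, dc=10, du=1, alpha=1.0, U=2.0)

# With a much larger assumed U the first inequality fails (30 > 30 is false).
too_tight = satisfies_cpv_delays(ds=30, dc=10, du=1, alpha=1.0, U=20.0)
```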
Action SW (contd.)
- Loop freedom
  - a node not in a CW does not execute SW if its next-hop has executed CW
  - nodes not involved in any CW rank higher than those involved in a CW
- Consider the expected next route of a neighbor, if available via a CW
CPV (contd.): actions CW and UW Note: we skip the actions for information synchronization between neighbors here
Example revisited
[Figure: nodes d, e, f, g, h, i with waves CW1, SW1, CW2, SW2, UW1]
Generic concepts
- Objective: to define concepts that capture the desired system properties in the presence of continuously occurring faults
- Key issue: to differentiate the impact of faults from that of protocol actions
- Concepts defined:
  - perturbed vs. contaminated node
  - perturbation size & contamination range
  - F-containment & F-stabilization
Preliminaries
- A system history H is a sequence q.0, (e.1, t.1), q.1, (e.2, t.2), …, q.(k-1), (e.k, t.k), q.k, … of alternating system states and events, where
  - an event is either the execution of a protocol action or the occurrence of a fault;
  - each state transition "q.(k-1), (e.k, t.k), q.k" means that event e.k at time t.k changes the system state from q.(k-1) to q.k;
  - at every moment in time, at most one event can occur at a node.
- Given a system history H and a state q.k in H, the history prefix H(q.k) is the subsequence of H between q.0 and q.k.
- A computation is a system history (or a suffix of one) in which no fault occurs.
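One way to make these definitions concrete is a small data structure. This is a minimal sketch under assumed names (`Event`, `History`, `prefix`, `suffix`, `is_computation`), with states left as opaque values; it is not taken from the original work.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class Event:
    kind: str    # "action" (protocol action) or "fault"
    node: str    # node at which the event occurs
    time: float  # t.k


@dataclass
class History:
    states: List[Any]    # q.0 .. q.n
    events: List[Event]  # events[k-1] is e.k, leading from q.(k-1) to q.k

    def prefix(self, k):
        """H(q.k): the part of the history between q.0 and q.k."""
        return History(self.states[: k + 1], self.events[:k])

    def suffix(self, k):
        """The part of the history from q.k onward."""
        return History(self.states[k:], self.events[k:])

    def is_computation(self):
        """True if no fault occurs in this history."""
        return all(e.kind == "action" for e in self.events)


h = History(
    states=["q0", "q1", "q2"],
    events=[Event("fault", "d", 1.0), Event("action", "e", 2.0)],
)
```

Here `h` itself is not a computation because e.1 is a fault, but its suffix starting at q.1 is.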
Preliminaries (contd.)
- Given a state q.k and H(q.k), a protocol execution E(q.k) is a set of computations, each of which specifies a computation C(q.k, E(q.k)) for a different state q.k’ in H(q.k) that is either the initial state or a state reached immediately after a fault occurs.
- Given q.k and E(q.k), the stabilization set of q.k, S(q.k, E(q.k)), is the set of nodes that need to change state for the system to stabilize from q.k in the absence of faults.
Perturbation vs. contamination
- Given "q.(k-1), (e, t), q.k" and E(q.k),
  - the corruption set of e at t is cpt(e, t, E(q.k)) = S(q.k, E(q.k)) \ S(q.(k-1), E(q.k));
  - if e is not a state corruption, the correction set of e at t is cct(e, t, E(q.k)) = (S(q.(k-1), E(q.k)) \ S(q.k, E(q.k))) ∩ V.(q.k).
- For every node j ∈ cpt(e, t, E(q.k)),
  - j is perturbed by e if e is a fault;
  - j is contaminated via e if e is the execution of a protocol action.
- For every node j ∈ cct(e, t, E(q.k)), j is corrected by e.
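The two sets are plain set differences, so they are easy to illustrate with Python sets. This sketch makes several assumptions that are not in the original: the stabilization sets before and after the event are passed in directly, V.(q.k) is treated simply as a given set of nodes, the correction set is restricted to V.(q.k) by intersection, and the function names are invented. `correction_set` should only be used when e is not a state corruption.

```python
def corruption_set(S_prev, S_curr):
    """cpt: nodes that enter the stabilization set because of the event."""
    return S_curr - S_prev


def correction_set(S_prev, S_curr, V_curr):
    """cct: nodes that leave the stabilization set, restricted to V_curr."""
    return (S_prev - S_curr) & V_curr


def classify(event_kind, S_prev, S_curr):
    """Label the nodes in the corruption set by the kind of event."""
    cpt = corruption_set(S_prev, S_curr)
    label = "perturbed" if event_kind == "fault" else "contaminated"
    return {node: label for node in cpt}


# Illustrative values: e and f needed to change before the event, f and g after it.
S_prev, S_curr = {"e", "f"}, {"f", "g"}
labels = classify("action", S_prev, S_curr)
corrected = correction_set(S_prev, S_curr, V_curr={"d", "e", "f", "g"})
```

In this example g is contaminated (the event is a protocol action) and e is corrected.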
Perturbed vs. contaminated node
- A perturbed node remains perturbed until it is corrected by a fault or the system reaches a legitimate state.
- A contaminated node remains contaminated until it is corrected by a fault or by the execution of a protocol action.
Example with existing path-vector protocol
[Figure: nodes d, e, f, g, h, i; node labels: perturbed, contaminated, corrected]
Perturbation size & contamination range
- Given q.k, H(q.k), and E(q.k), the perturbation size at q.k, P(q.k, H(q.k), E(q.k)), is the number of perturbed nodes at q.k.
- The contamination range of a perturbed region S’ at q.k, R(S’, q.k), is the maximum hop-distance from the corresponding set of contaminated nodes to S’.
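The contamination range can be computed with a breadth-first search. The sketch below is a hedged illustration: it assumes the network is given as an adjacency dictionary, that the hop-distance from a node to the region S’ is the distance to the nearest node of S’, and that contaminated nodes unreachable from S’ are ignored. The function names are invented here.

```python
from collections import deque


def hop_distances(adj, sources):
    """Multi-source BFS: hop-distance from each reachable node to the nearest source."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def contamination_range(adj, perturbed_region, contaminated):
    """Maximum hop-distance from the contaminated nodes to the perturbed region."""
    dist = hop_distances(adj, perturbed_region)
    return max((dist[n] for n in contaminated if n in dist), default=0)


# Chain d - e - f - g - h - i, as in the example figures.
chain = ["d", "e", "f", "g", "h", "i"]
adj = {n: [] for n in chain}
for a, b in zip(chain, chain[1:]):
    adj[a].append(b)
    adj[b].append(a)

r = contamination_range(adj, perturbed_region={"e"}, contaminated={"f", "g"})
```

With e perturbed and f, g contaminated, the range is 2; the perturbation size is simply the number of perturbed nodes, here 1.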
F-containment & F-stabilization
- A system is F-containing if and only if, for every perturbed region S’ at an arbitrary state q.k, R(S’, q.k) = O(F(|S’|)), where F is a function.
- A system is F-stabilizing if and only if, starting at an arbitrary state q.k with an arbitrary H(q.k) and E(q.k), the system computation is guaranteed to reach a legitimate state within O(F(P(q.k, H(q.k), E(q.k)))) time in the absence of faults, where F is a function.
Analytical results
- L = {q: every up node has found its best route at state q}
- Properties of CPV
  - The contamination range R(S’, q.k) of every perturbed region S’ at any state q.k is O(|S’|).
    - The distance to which a state of a node i propagates is proportional to the time the state lasts.
  - Starting at any state q.k with an arbitrary H(q.k) and E(q.k), a system where CPV is used reaches a legitimate state within O(F(P(q.k, H(q.k), E(q.k)))) time in the absence of faults.
    - F is a function reflecting the routing policies used, and is linear if every node chooses a shortest path.
  - A system where CPV is used is F-containing, with F being a linear function.
    - The higher the frequency at which faults happen to a node, the more tightly they are contained.
Simulation results
- SSFNet, a network simulator with standard-conforming protocol implementations
- Simulation setup
  - parameter setup for CPV and BGP
    - CPV: ds = 30 sec, dc = 10 sec, du = 1 sec
    - BGP: with MRAI timers (30 seconds) and RFD
  - fault scenario: a node repeatedly fail-stops and then rejoins every 30 seconds
  - Internet-type network topology
  - the shortest-path-first policy
Contamination range and the number of nodes affected
Time taken to stabilize
Stability adaptiveness
Concluding remarks
- Frequent transient faults do happen (especially when systems work under unexpected conditions)
  - fault containment and stabilization are desirable as well as possible
- Quality of service and system behavior during stabilization
  - perspectives other than convergence only: time, space, stability, etc.
  - modeling issues: descriptive, derivative
- Continuous fault containment + stabilization
- Local stabilization
Low frequency faults Destination joins Destination fail-stops