Hongwei Zhang Anish Arora

Continuous fault containment and local stabilization in path-vector routing
November 12, 2018

Motivation
Study of fault containment has focused largely on cases where faults either stop occurring after a certain moment in time or occur with low frequency. In practice, faults may occur with high frequency, and the interval between faults may be shorter than the time the system takes to stabilize. E.g., under the Code Red/Nimda attack (2002), memory overflow caused edge BGP speakers to repeatedly fail-stop and rejoin, as often as once every minute, and the oscillation propagated farther away in spite of the MRAI timer and route flap damping (RFD).

Objectives
Formulate concepts that characterize, and develop mechanisms that achieve, the following properties: (1) in the presence of high-frequency faults, the impact of faults is always locally contained; (2) once faults stop occurring, the system stabilizes within time that is a function of the degree of fault perturbation.
We study these issues in the context of path-vector routing. To simplify the presentation, we first present a solution for continuous fault containment and local stabilization in path-vector routing, and then present the generic concepts.

Outline
Fault propagation in path-vector protocols; CPV: design pattern and protocol; generic concepts for tolerating high-frequency faults; analytical and simulation results for CPV; concluding remarks.

Fault propagation in path-vector protocols
[Figure: chain d, e, f, g, h, i with routes e: [e, d], f: [f, e, d], g: [g, f, e, d], h: [h, g, f, e, d], i: [i, h, g, f, e, d]]
The fresh information (route announcement) always lags behind the obsolete information (route withdrawal), so all nodes are affected, even those that could in principle stay unaffected.
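The lag can be illustrated with a toy model, which is an assumption and not the protocol itself: destination d fail-stops at t = 0 and rejoins at t = down_time, and every node on the chain relays both the withdrawal and the later announcement after the same per-hop delay (MRAI and RFD are ignored). The hypothetical function `plain_path_vector` shows that the gap between the two waves never shrinks, so every node on the chain loses its route.

```python
from dataclasses import dataclass


@dataclass
class HopOutage:
    hop: int              # hop distance from destination d (1 = e, 2 = f, ...)
    withdrawal_at: float  # when the obsolete info (withdrawal) reaches the node
    announce_at: float    # when the fresh info (re-announcement) reaches the node


def plain_path_vector(chain_len: int, hop_delay: float, down_time: float) -> list[HopOutage]:
    """Toy model: d fail-stops at t = 0 and rejoins at t = down_time.

    Both the withdrawal and the later announcement advance one hop per
    hop_delay, so the gap between them stays equal to down_time and every
    node on the chain is left without a route for that long.
    """
    outages = []
    for hop in range(1, chain_len + 1):
        withdrawal_at = hop * hop_delay
        announce_at = down_time + hop * hop_delay
        if withdrawal_at < announce_at:  # obsolete info arrives first
            outages.append(HopOutage(hop, withdrawal_at, announce_at))
    return outages


if __name__ == "__main__":
    # Chain e, f, g, h, i behind destination d.
    for o in plain_path_vector(chain_len=5, hop_delay=1.0, down_time=2.0):
        print(f"hop {o.hop}: no route during [{o.withdrawal_at}, {o.announce_at})")
```

Because both waves move at the same speed, no amount of chain length lets the announcement catch up; CPV's answer is to give the information about the newer state a faster wave.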

Design pattern of CPV
Key idea: design a mechanism that enables information about a new network state to catch up with, and stop the propagation of, information about the preceding (now obsolete) state. This works whether or not faults stop occurring.
Parallel diffusing waves with different propagation speeds, plus:
each stabilization wave, as well as each undo-containment wave, stabilizes itself;
each containment wave is stabilized (and deactivated) by the corresponding stabilization or undo-containment wave;
each contained wave (e.g., a stabilization wave) sets the boundary of the corresponding containing wave (e.g., a containment wave).
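Why a faster wave bounds propagation can be worked out under a simple timing model, which is an assumption rather than the slides' analysis: the old stabilization wave leaves node j at t = 0 and advances one hop every ds seconds, and j starts the containment wave at t = tau, advancing one hop every dc < ds seconds. The CW reaches hop k at tau + k·dc and the SW at k·ds, so the old state only gets through hops with k < tau / (ds - dc). The hypothetical helper `old_state_reach` computes this by stepping hop by hop.

```python
def old_state_reach(tau: float, ds: float, dc: float, max_hops: int = 10_000) -> int:
    """Number of hops beyond j that adopt the obsolete state before the CW catches up.

    Toy model: hop k adopts the old state only if the SW reaches it strictly
    before the CW does, i.e. k * ds < tau + k * dc.
    """
    if not dc < ds:
        raise ValueError("the containment wave must be faster (dc < ds)")
    k = 0
    while k < max_hops and (k + 1) * ds < tau + (k + 1) * dc:
        k += 1
    return k  # equals ceil(tau / (ds - dc)) - 1 for tau > 0 (when not capped)


if __name__ == "__main__":
    # ds = 30 s and dc = 10 s, as in the simulation setup.
    for tau in (0, 60, 120):
        print(tau, old_state_reach(tau, ds=30, dc=10))
```

With ds = 30 and dc = 10 the reach is 0, 2, and 5 hops for tau = 0, 60, and 120 seconds: the longer the obsolete state lasts, the farther it spreads, linearly in tau.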

Outline of CPV
Whenever a node j needs to change state, it engages a containment wave cw0 before engaging a new stabilization wave sw1, so that cw0 stops the previous stabilization wave sw0 from propagating the existing state of j.
In the presence of high-frequency faults, another fault f may occur before j executes sw1. There are then two cases:
j no longer needs to change state: j engages an undo-containment wave uw0 to stop cw0;
j still needs to change state: j lets cw0 propagate.
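The two cases above amount to a small decision rule at node j. The sketch below is hypothetical: the `Node` record, the `Action` values, and the hooks `on_best_route` and `execute_sw` are illustrative names, and the sketch does not model how the expected next state piggybacked on cw0 is updated.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ENGAGE_CW = "engage containment wave cw0"
    ENGAGE_UW = "engage undo-containment wave uw0"
    LET_CW_PROPAGATE = "keep cw0 going"
    NONE = "no change"


@dataclass
class Node:
    route: tuple[str, ...]    # state currently propagated by sw0
    cw_pending: bool = False  # cw0 engaged but sw1 not executed yet


def on_best_route(node: Node, best: tuple[str, ...]) -> Action:
    """React to a (re)computed best route, e.g. after a fault."""
    if not node.cw_pending:
        if best == node.route:
            return Action.NONE
        node.cw_pending = True  # contain sw0 before engaging sw1
        return Action.ENGAGE_CW
    # Another fault arrived before sw1 ran.
    if best == node.route:
        node.cw_pending = False  # no change needed after all: undo cw0
        return Action.ENGAGE_UW
    return Action.LET_CW_PROPAGATE  # still need to change state


def execute_sw(node: Node, best: tuple[str, ...]) -> None:
    """sw1: adopt and start propagating the new state; this deactivates cw0."""
    node.route = best
    node.cw_pending = False
```

For example, starting from route ("j", "e", "d"), a fault that makes ("j", "x", "d") best engages cw0; if a second fault restores ("j", "e", "d") before sw1 runs, the node engages uw0 instead.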

A little more detail
Containment wave (CW): piggybacks the expected next state of a node to its neighbors, so that a neighbor can decide whether to hold an existing SW; it is a one-way diffusing process, so that a CW can co-exist with the corresponding SW (which is required to contain continuously occurring faults).
Stabilization wave (SW): takes into account the predicted state when choosing the next hop.
Undo-containment wave (UW): does not introduce new variables.

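Protocol CPV
Timing constraints: ds > α·(dc + U), dc > α·(du + U), du ≥ 0, where ds, dc, and du are the delays associated with the stabilization, containment, and undo-containment waves, respectively.
[Figure: containment wave]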
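The timing constraints can be checked mechanically. α and U are not defined in this transcript, so the sketch simply treats them as given positive numbers; `cpv_delays_ok` and `max_u` are hypothetical helpers. Rearranging the two strict inequalities (for α > 0) gives U < ds/α - dc and U < dc/α - du.

```python
def cpv_delays_ok(ds: float, dc: float, du: float, alpha: float, u: float) -> bool:
    """Check ds > alpha*(dc + U), dc > alpha*(du + U), and du >= 0."""
    return ds > alpha * (dc + u) and dc > alpha * (du + u) and du >= 0


def max_u(ds: float, dc: float, du: float, alpha: float) -> float:
    """Supremum of U permitted by the two strict inequalities, assuming alpha > 0."""
    return min(ds / alpha - dc, dc / alpha - du)


if __name__ == "__main__":
    # Delay values from the simulation setup: ds = 30 s, dc = 10 s, du = 1 s.
    print(cpv_delays_ok(30, 10, 1, alpha=1.0, u=5.0))  # True
    print(cpv_delays_ok(30, 10, 1, alpha=2.0, u=5.0))  # False: 30 > 2*(10 + 5) fails
    print(max_u(30, 10, 1, alpha=1.0))                 # 9.0
```

With the simulation delays and α = 1, any U below 9 s satisfies both constraints; the binding one is dc > α·(du + U).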

Action SW (contd.)
Loop freedom: a node not in a CW does not execute SW if its next hop has executed a CW.
Nodes not involved in any CW rank higher than those involved in a CW.
The expected next route of a neighbor, if available via a CW, is taken into account.
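The three rules can be sketched as a next-hop selection routine. Everything here is an assumption for illustration: the `Neighbor` record and the functions `must_hold_sw` and `choose_route` are hypothetical, ties are broken by route length (matching the shortest-path-first policy used in the simulations), and a standard path-vector loop check (skip routes that already contain the node) is added.

```python
from dataclasses import dataclass
from typing import Optional

Route = tuple[str, ...]


@dataclass
class Neighbor:
    route: Optional[Route]                  # route the neighbor currently advertises
    in_cw: bool = False                     # neighbor is involved in a CW
    expected_route: Optional[Route] = None  # next route piggybacked by its CW


def must_hold_sw(me_in_cw: bool, next_hop: Optional[str], nbrs: dict[str, Neighbor]) -> bool:
    """Loop-freedom rule: a node not in a CW does not execute SW
    while its next hop has executed a CW."""
    return not me_in_cw and next_hop is not None and nbrs[next_hop].in_cw


def choose_route(me: str, nbrs: dict[str, Neighbor]) -> Optional[Route]:
    """Pick a new route: neighbors outside any CW first, then shortest path."""
    best_key, best_route = None, None
    for name, nb in nbrs.items():
        # Use the neighbor's expected next route when its CW provides one.
        route = nb.expected_route if nb.in_cw and nb.expected_route is not None else nb.route
        if route is None or me in route:  # unreachable, or would create a loop
            continue
        key = (nb.in_cw, len(route), name)  # False < True: not-in-CW ranks higher
        if best_key is None or key < best_key:
            best_key, best_route = key, (me,) + route
    return best_route
```

For node g with neighbors f (route [f, e, d]), h (route [h, g, f, e, d]), and x (in a CW, expected route [x, y, z, d]), h is skipped as a loop and f wins because it is not in a CW.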

CPV (contd.): actions CW and UW Note: we skip the actions for information synchronization between neighbors here

Example revisited
[Figure: waves CW1, SW1, CW2, SW2, and UW1 along the chain d, e, f, g, h, i]

Generic concepts
Objective: to define concepts that capture the desired system properties in the presence of continuously occurring faults.
Key issue: to differentiate the impact of faults from that of protocol actions.
Concepts defined: perturbed vs. contaminated node; perturbation size and contamination range; F-containment and F-stabilization.

Preliminaries
A system history H is a sequence q.0, (e.1, t.1), q.1, (e.2, t.2), …, q.(k-1), (e.k, t.k), q.k, … of alternating system states and events, where an event is either the execution of a protocol action or the occurrence of a fault. Each state transition “q.(k-1), (e.k, t.k), q.k” means that event e.k at time t.k changes the system state from q.(k-1) to q.k. At any moment in time, at most one event can occur at a node.
Given a system history H and a state q.k in H, the history prefix H(q.k) is the subsequence of H between q.0 and q.k.
A computation is a system history (or a suffix of one) in which no fault occurs.
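These definitions map directly onto a small data structure. The sketch below is illustrative only: `Event`, `History`, and their methods are hypothetical names, states are left opaque, and `events[i]` is taken to be e.(i+1), the event that turns q.i into q.(i+1).

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Event:
    time: float
    node: str
    is_fault: bool  # True: fault occurrence; False: protocol-action execution


@dataclass
class History:
    states: list[Any]    # q.0, q.1, ..., q.k
    events: list[Event]  # e.1, ..., e.k; events[i] turns states[i] into states[i+1]

    def __post_init__(self) -> None:
        if len(self.states) != len(self.events) + 1:
            raise ValueError("states and events must alternate, starting and ending with a state")

    def prefix(self, k: int) -> "History":
        """H(q.k): the part of H from q.0 through q.k."""
        return History(self.states[: k + 1], self.events[:k])

    def suffix(self, k: int) -> "History":
        """The part of H starting at q.k."""
        return History(self.states[k:], self.events[k:])

    def is_computation(self) -> bool:
        """A computation is a history (or suffix) in which no fault occurs."""
        return not any(e.is_fault for e in self.events)

    def one_event_per_node_per_instant(self) -> bool:
        """At any moment in time, at most one event occurs at a node."""
        seen = set()
        for e in self.events:
            if (e.node, e.time) in seen:
                return False
            seen.add((e.node, e.time))
        return True
```

In a history whose second event is a fault, H(q.1) and the suffix starting at q.2 are computations, while the whole history is not.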

Preliminaries (contd.)
Given a state q.k and H(q.k), a protocol execution E(q.k) is a set of computations, each of which specifies a computation C(q.k, E(q.k)) for a different state q.k’ in H(q.k) that is either the initial state or a state reached immediately after a fault occurs.
Given q.k and E(q.k), the stabilization set of q.k, S(q.k, E(q.k)), is the set of nodes that need to change state for the system to stabilize from q.k in the absence of faults.

Perturbation vs. contamination
Given “q.(k-1), (e, t), q.k” and E(q.k), the corruption set of e at t is
cpt(e, t, E(q.k)) = S(q.k, E(q.k)) \ S(q.(k-1), E(q.k)).
If e is not a state corruption, the correction set of e at t is
cct(e, t, E(q.k)) = (S(q.(k-1), E(q.k)) \ S(q.k, E(q.k))) ∩ V.(q.k).
For every node j ∈ cpt(e, t, E(q.k)): j is perturbed by e if e is a fault; j is contaminated via e if e is the execution of a protocol action.
For every node j ∈ cct(e, t, E(q.k)), j is corrected by e.
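The corruption and correction sets are plain set differences, so classifying the nodes touched by one transition is short. This is a sketch under stated assumptions: both stabilization sets are assumed to be computed against the same E(q.k), V.(q.k) is read as the set of nodes that are up at q.k, and the correction set is treated as empty when e is a state corruption. The function names are hypothetical.

```python
def corruption_set(s_prev: set[str], s_curr: set[str]) -> set[str]:
    # cpt(e, t, E(q.k)): nodes in S(q.k) but not in S(q.(k-1)).
    return s_curr - s_prev


def correction_set(s_prev: set[str], s_curr: set[str], up: set[str]) -> set[str]:
    # cct(e, t, E(q.k)): nodes that left the stabilization set and are in V.(q.k),
    # read here (an assumption) as the nodes that are up at q.k.
    return (s_prev - s_curr) & up


def classify(is_fault: bool, is_state_corruption: bool,
             s_prev: set[str], s_curr: set[str], up: set[str]) -> dict[str, str]:
    """Label the nodes affected by one transition q.(k-1) -> q.k."""
    labels = {}
    for j in corruption_set(s_prev, s_curr):
        labels[j] = "perturbed" if is_fault else "contaminated"
    if not is_state_corruption:  # cct is only defined when e is not a state corruption
        for j in correction_set(s_prev, s_curr, up):
            labels[j] = "corrected"
    return labels
```

The two sets are disjoint by construction (one lies inside S(q.k), the other outside it), so a node never receives two labels from the same event.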

Perturbed vs. contaminated node
A perturbed node remains perturbed until it is corrected by a fault or the system reaches a legitimate state. A contaminated node remains contaminated until it is corrected by a fault or by the execution of a protocol action.

Example with an existing path-vector protocol
[Figure: chain d, e, f, g, h, i, with nodes marked perturbed, contaminated, and corrected]

Perturbation size & contamination range
Given q.k, H(q.k), and E(q.k), the perturbation size at q.k, P(q.k, H(q.k), E(q.k)), is the number of perturbed nodes at q.k.
The contamination range of a perturbed region S’ at q.k, R(S’, q.k), is the maximum hop-distance from the corresponding set of contaminated nodes to S’.
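The contamination range can be computed with a multi-source breadth-first search from S’. The sketch reads "maximum hop-distance from the contaminated nodes to S’" as the maximum, over contaminated nodes, of the distance to the nearest node of S’, and assumes every contaminated node can reach S’; `hop_distances`, `contamination_range`, and `chain` are hypothetical helpers. The perturbation size P is simply the number of perturbed nodes, so it needs no code.

```python
from collections import deque


def hop_distances(graph: dict[str, list[str]], sources: set[str]) -> dict[str, int]:
    """Multi-source BFS: hop distance from each reachable node to the set `sources`."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def contamination_range(graph: dict[str, list[str]], region: set[str],
                        contaminated: set[str]) -> int:
    """R(S', q.k): max hop distance from a contaminated node to the region S'."""
    if not contaminated:
        return 0
    dist = hop_distances(graph, region)
    return max(dist[c] for c in contaminated)  # assumes each contaminated node reaches S'


def chain(nodes: list[str]) -> dict[str, list[str]]:
    """Undirected path topology, e.g. d - e - f - g - h - i."""
    graph = {n: [] for n in nodes}
    for a, b in zip(nodes, nodes[1:]):
        graph[a].append(b)
        graph[b].append(a)
    return graph
```

On the chain d, e, f, g, h, i with S’ = {e}, contamination reaching all of f through i gives R = 4 while |S’| = 1, which is the kind of unbounded spread that F-containment rules out.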

F-containment & F-stabilization
A system is F-containing if and only if, for every perturbed region S’ at an arbitrary state q.k, R(S’, q.k) = O(F(|S’|)), where F is a function.
A system is F-stabilizing if and only if, starting at an arbitrary state q.k with an arbitrary H(q.k) and E(q.k), the system computation is guaranteed to reach a legitimate state within O(F(P(q.k, H(q.k), E(q.k)))) time in the absence of faults, where F is a function.

Analytical results
Legitimate states: L = {q : every up node has found its best route at state q}.
Properties of CPV:
The contamination range R(S’, q.k) of every perturbed region S’ at any state q.k is O(|S’|); the distance to which a state of a node i propagates is proportional to the time the state lasts.
Starting at any state q.k with an arbitrary H(q.k) and E(q.k), a system using CPV reaches a legitimate state within O(F(P(q.k, H(q.k), E(q.k)))) time in the absence of faults, where F is a function reflecting the routing policies used, and is linear if every node chooses a shortest path.
A system using CPV is F-containing, with F a linear function; the higher the frequency of faults at a node, the more tightly they are contained.

Simulation results
Simulator: SSFNet, a network simulator with standard-conforming protocol implementations.
Parameter setup: CPV: ds = 30 s, dc = 10 s, du = 1 s; BGP: MRAI timers (30 s) and RFD.
Fault scenario: a node repeatedly fail-stops and then rejoins every 30 seconds.
Topology and policy: Internet-type network topology; shortest-path-first policy.

Contamination range and the number of nodes affected

Time taken to stabilize

Stability adaptiveness

Concluding remarks
Frequent transient faults do happen (especially when systems work under unexpected conditions); fault containment and stabilization are desirable as well as possible.
Quality of service and system behavior during stabilization: perspectives other than convergence alone (time, space, stability, etc.); modeling issues (descriptive, derivative).
Continuous fault containment + stabilization ⇒ local stabilization.

Low-frequency faults
[Figures: destination joins; destination fail-stops]