A stability-oriented approach to improving BGP convergence

Presentation on theme: "A stability-oriented approach to improving BGP convergence"— Presentation transcript:

1 A stability-oriented approach to improving BGP convergence
Hongwei Zhang, Anish Arora, Zhijun Liu
November 20, 2018

2 Background
Border Gateway Protocol (BGP)
  the protocol for Internet inter-domain routing
  path-vector routes, with loop freedom and support for flexible routing policies
Performance issues with BGP
  slow convergence after faults occur
    O(n!) in the worst case, where n is the number of ASes in a network
    e.g., may take up to 15 minutes after a node fail-stops
  instability during convergence
    may incur many unnecessary route changes
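The loop freedom of path-vector routes mentioned above can be sketched as follows. This is a minimal illustration, not BGP itself: the function names `accept_route` and `advertise` are hypothetical, and an AS path is modeled simply as a list of AS names ending at the destination.

```python
def accept_route(own_as, as_path):
    """Path-vector loop check: reject a route whose AS path already contains us."""
    return own_as not in as_path


def advertise(own_as, as_path):
    """Prepend our own AS to the path before passing the route to neighbors."""
    return [own_as] + list(as_path)


# g may accept the path [f, b, a]; b must reject it, since b is already on it.
g_accepts = accept_route("g", ["f", "b", "a"])
b_accepts = accept_route("b", ["f", "b", "a"])
g_advertises = advertise("g", ["f", "b", "a"])
```

Because every node carries the full path and drops paths that contain itself, no adopted route can loop back through the adopting node.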

3 State-of-the-art techniques
Consistency assertion (Infocom’02)
  rejects inconsistent routes
  (-) does not deal with inconsistency between nodes that are multiple hops apart
  (-) propagates entry-router-ids of routes from one AS to others, which can cause local changes (e.g., of the entry router) to propagate even if the routes themselves do not change
Ghost flushing (Infocom’03)
  withdraws old routes faster than it propagates new routes
  (-) does not prevent the use of invalid routes, even when some “information” regarding the invalidity is known

4 State-of-the-art techniques (contd.)
Route change origin (Globecom’02) & root cause notification (UCLA TR)
  propagate the ID of the node that first withdraws or changes its route after a fault occurs
  (-) do not prevent the use of invalid routes when a node with multiple neighbors fail-stops
  (-) “root cause notification” propagates entry-router-ids of routes, which can lead to unnecessary propagation of local changes

5 What remains to be studied?
Our focus
  the nature of instability during BGP convergence
  fundamental limits on improving stability, and its relationship with BGP convergence speed
  mechanisms to approximate the limits of stability and speed in BGP convergence

6 Outline
Network model & fault model
Nature of instability during BGP convergence
Protocol G-BGP
Analysis & simulation results
Concluding remarks

7 Network model
Network G = (V, E, P)
  V: node set
  E: edge set
  P: routing policies
Autonomous system (AS)
  a set of strongly-connected nodes
Channel
  the set of links between two ASes
Clock
  every node has a clock
  the ratio of clock rates between any two nodes is bounded from above by a constant (no extra constraint on the absolute values of clocks)

8 Fault model
Fail-stop of links and nodes
  a channel (I, J) between ASes I and J is up if there is at least one up link between a node in I and a node in J; otherwise, (I, J) is down
Join of links and nodes
Change in routing policy
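The channel definition above can be written down directly. The sketch below is only an illustration of that definition: the names `channel_is_up`, `link_state`, and `as_of` are hypothetical, links are modeled as node pairs mapped to an up/down flag, and I and J are assumed to be two different ASes.

```python
def channel_is_up(as_i, as_j, link_state, as_of):
    """A channel (I, J) is up iff at least one link between I and J is up.

    link_state: dict mapping a link (u, v) to True (up) or False (down)
    as_of:      dict mapping each node to the AS it belongs to
    """
    for (u, v), up in link_state.items():
        if up and {as_of[u], as_of[v]} == {as_i, as_j}:
            return True
    return False


as_of = {"i1": "I", "i2": "I", "j1": "J"}
one_link_up = {("i1", "j1"): False, ("i2", "j1"): True}
all_links_down = {("i1", "j1"): False, ("i2", "j1"): False}

up_case = channel_is_up("I", "J", one_link_up, as_of)
down_case = channel_is_up("I", "J", all_links_down, as_of)
```

Under this definition a single link failure does not necessarily bring a channel down, which is why the fault model speaks of channels rather than individual links.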

9 Outline
Network model & fault model
Nature of instability during BGP convergence
Protocol G-BGP
Analysis of G-BGP & simulation results
Concluding remarks

10 Nature of instability during BGP convergence
Instability: unnecessary exploration of invalid routes during convergence
Two types of instability
  fault-agnostic instability
  distribution-inherent instability
Illustration here is by examples
  only for simplicity of presentation, we use a sub-graph of Figure 1 of the paper
  each node is an AS itself, unless otherwise mentioned

11 Fault-agnostic instability
Definition: a node adopts an invalid route even when certain information has arrived regarding the fault that invalidates the route
Example (figure: nodes a, h, b, j, f, m, g, l; a is the destination; the legend distinguishes links to the next hop from unused links)
  channel (a, b) fail-stops
  b withdraws its route
  m withdraws its route, but the withdrawal by f is delayed
  g mistakenly regards route [f, b, a] as valid, and adopts it
Route ranking at g
  [m, b, a] most preferred
  [f, b, a] second most preferred
  [j, h, a] least preferred
Possible reasons for f delaying its route withdrawal
  the MRAI timer incurs delay
  “link” (f, b) has a long delay, especially since it may well be a multi-hop route at the physical layer
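The example above can be replayed in a few lines. This is a toy model under stated assumptions, not a BGP implementation: g is assumed to hold the three ranked routes from the slide, and `best_route` is a hypothetical helper that picks the most preferred route g still believes to be available.

```python
# Route ranking at g, most preferred first (taken from the slide).
RANKING = [("m", "b", "a"), ("f", "b", "a"), ("j", "h", "a")]


def best_route(available):
    """Return the most preferred route that g still believes is available."""
    for route in RANKING:
        if route in available:
            return route
    return None


available = set(RANKING)
# Channel (a, b) fail-stops. m's withdrawal reaches g; f's withdrawal is delayed.
available.discard(("m", "b", "a"))
chosen = best_route(available)
# chosen also passes through the failed channel (a, b), so g adopts an invalid route.
```

With plain withdrawals, g has no way to tell that its second choice shares the failed channel with the route it just lost.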

12 Distribution-inherent instability: type I
Definition: a node adopts an invalid route because no related information has arrived
Example
  channel (a, h) and node b fail-stop simultaneously
  h, f, m withdraw their routes, but the withdrawal by j is delayed
  no information related to the fail-stop of channel (a, h) has arrived at g; g mistakenly regards route [j, h, a] as valid, and adopts it

13 Distribution-inherent instability: type II
Definition: a node adopts a valid route that becomes invalid or sub-optimal later
Example
  initially, a has not announced its existence
  a announces its existence
  b gets its route
  f gets its route, but m is delayed in getting its route
  g adopts route [f, b, a], which becomes sub-optimal once g learns [m, b, a] later
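The effect of arrival order in this example can be sketched as below. It is a toy model: it assumes g ranks routes as on the fault-agnostic instability slide ([m, b, a] over [f, b, a] over [j, h, a]), and `adopted_routes` is a hypothetical helper that records every route g adopts as announcements arrive.

```python
# Assumed ranking at g, most preferred first.
RANKING = [("m", "b", "a"), ("f", "b", "a"), ("j", "h", "a")]


def adopted_routes(arrival_order):
    """Return the sequence of routes g adopts as announcements arrive one by one."""
    known = []
    history = []
    for route in arrival_order:
        known.append(route)
        best = min(known, key=RANKING.index)  # lowest index = most preferred
        if not history or history[-1] != best:
            history.append(best)
    return history


# m is delayed: g first adopts [f, b, a], then switches to [m, b, a].
delayed = adopted_routes([("f", "b", "a"), ("m", "b", "a")])
# No delay at m: g adopts [m, b, a] directly and never changes.
in_order = adopted_routes([("m", "b", "a"), ("f", "b", "a")])
```

The extra route change in the delayed case happens even though every route g adopted was valid at the time, which is what distinguishes this type from fault-agnostic instability.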

14 Fault-agnostic vs. distribution-inherent instability
Distribution-inherent instability
  is impossible to completely eliminate in distributed routing
  is not the major cause of slow BGP convergence in practice, especially when most nodes use the shortest-path-first policy
Fault-agnostic instability
  is the major cause of slow BGP convergence
  can be completely eliminated if finer-grained fault information and better fault detection mechanisms are used, which is the objective of G-BGP

15 Outline
Network model & fault model
Nature of instability during BGP convergence
Protocol G-BGP
Analysis & simulation results
Concluding remarks

16 Protocol G-BGP (Grapevine-BGP)
Objective: eliminate fault-agnostic instability
Causes of fault-agnostic instability
  not knowing the exact cause of route changes
    solution: propagate finer-grained fault information
  uncertainty in fault detection
    solution: resolve uncertainty by collaborative clarification and by quickly marking questionable routes
  existence of obsolete information
    solution: reject obsolete information using local sequence numbers

17 Presentation note
We present here only those cases where a node itself is an AS (or, equivalently, all nodes in an AS use the same route).
Please refer to the paper for the cases where nodes in an AS use different routes.

18 Propagate finer-grained fault information
Depending on the type of a fault, different fault information is propagated
  point of channel-failure: when a channel fail-stops
  point of channel-withdrawal: when a channel is up but is not used by any node
  point of segment-withdrawal: when a channel is up but is not used by some node(s)
  point of AS-failure: when all the nodes in an AS fail-stop
  point of node-join: when a node joins

19 Example: point of channel-failure
  channel (a, b) fail-stops
  b withdraws its route; b also propagates <b, a>, denoting the fail-stop of (a, b)
  m withdraws its route and propagates <b, a>; the withdrawal by f is delayed
  g learns that [f, b, a] has become invalid, since it passes through (b, a); g then directly chooses route [j, h, a] without trying [f, b, a] first
Possible reasons for f delaying its route withdrawal
  the MRAI timer incurs delay
  “link” (f, b) has a long delay, especially since it may well be a multi-hop route at the physical layer
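How g might use a point of channel-failure such as <b, a> can be sketched as follows. This is a simplified illustration, not the G-BGP algorithm from the paper: `uses_channel` and `best_valid_route` are hypothetical names, and routes are modeled as tuples of nodes ending at the destination.

```python
def uses_channel(route, channel):
    """True if two consecutive hops of the route traverse the channel (either direction)."""
    x, y = channel
    return any({u, v} == {x, y} for u, v in zip(route, route[1:]))


def best_valid_route(ranked_routes, failed_channels):
    """Pick the most preferred route that avoids every channel known to have failed."""
    for route in ranked_routes:
        if not any(uses_channel(route, ch) for ch in failed_channels):
            return route
    return None


# g's remaining routes after m's withdrawal, most preferred first.
candidates = [("f", "b", "a"), ("j", "h", "a")]
# m's withdrawal carried the point of channel-failure <b, a>.
chosen = best_valid_route(candidates, failed_channels=[("b", "a")])
```

Because the fault information names the failed channel, g can discard [f, b, a] before f's delayed withdrawal ever arrives.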

20 Uncertainty in fault detection
  a fail-stops
  the fail-stop of (a, h) and (a, b), instead of the fail-stop of a, is detected
  h, b, m withdraw their routes, but the withdrawal by j is delayed
  g mistakenly regards [j, h, a] as valid, and adopts it
One solution: g waits for a certain time to see whether j withdraws its route
  but the waiting time may be long due to timers such as MinRouteAdvertisementInterval
  so an alternative solution is desirable

21 Alternative solution: propagate “state-clarifier” when possible
  channel (a, b) fail-stops
  a also detects the fail-stop of (a, b); a then generates a state-clarifier <a, {b}>, denoting the fail-stop of (a, b) only, and sends it along h, …
  the state-clarifier <a, {b}> propagates quickly, without being subject to timer control
  when m withdraws its route, j may already have propagated <a, {b}> to g, or <a, {b}> will arrive at g soon
  when g receives <a, {b}>, g knows that (a, h) is still up; g then adopts [j, h, a]
On the other hand, if it is a instead of (a, b) that has fail-stopped, then no state-clarifier will reach g; in this case, g will learn the invalidity of [j, h, a] and avoid using it
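One way to picture the decision at g is a three-way classification of routes. The sketch below is an assumption-laden simplification, not the protocol as specified in the paper: the labels "valid", "questionable", and "invalid", the function `classify`, and the tuple encodings of <b, a> and <a, {b}> are all hypothetical.

```python
def classify(route, failure_point, clarifier=None):
    """Classify a route given a point of channel-failure and an optional state-clarifier.

    failure_point: (reporter, suspect), e.g. ("b", "a") for <b, a>
    clarifier:     (origin, lost_neighbors), e.g. ("a", {"b"}) for <a, {b}>
    """
    reporter, suspect = failure_point
    hops = list(zip(route, route[1:]))
    if (reporter, suspect) in hops or (suspect, reporter) in hops:
        return "invalid"          # the route crosses the failed channel
    if suspect not in route:
        return "valid"            # the route does not depend on the suspect node
    if clarifier is not None and clarifier[0] == suspect:
        # The suspect is alive and has named the channels it lost.
        idx = route.index(suspect)
        on_route = {route[i] for i in (idx - 1, idx + 1) if 0 <= i < len(route)}
        return "invalid" if on_route & set(clarifier[1]) else "valid"
    return "questionable"         # the suspect itself may have fail-stopped


no_clarifier = classify(("j", "h", "a"), ("b", "a"))
with_clarifier = classify(("j", "h", "a"), ("b", "a"), ("a", {"b"}))
through_failure = classify(("f", "b", "a"), ("b", "a"), ("a", {"b"}))
```

In this model a route through the suspect node stays questionable until a clarifier from that node arrives, mirroring the two cases on the slide: with <a, {b}> g adopts [j, h, a], and without it g avoids the route.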

22 Reject obsolete fault information
  channel (a, b) fail-stops
  the point of channel-failure <b, a> is generated at b, signifying the fail-stop of channel (a, b)
  <b, a> reaches g and f, but is delayed in reaching m
  g changes its route to [j, h, a]
  (a, b) re-joins
  b, f, and g change their routes back
  the delayed <b, a> reaches m, and then g
  g changes its route to [j, h, a] again, after receiving <b, a>, which has become obsolete
Solution: g detects and rejects the obsolete information <b, a>, using the local sequence number maintained at b; g then does not change its route after receiving the obsolete <b, a>
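The idea of rejecting obsolete information with a local sequence number can be sketched as below. The paper's exact mechanism may differ; this sketch assumes that b stamps everything it originates with an increasing counter and that g remembers the highest value seen per originator. The class `FaultInfoFilter` and its method `accept` are hypothetical names.

```python
class FaultInfoFilter:
    """Tracks, per originating node, the newest sequence number seen so far."""

    def __init__(self):
        self.latest = {}  # originator -> highest sequence number accepted

    def accept(self, origin, seq):
        """Accept information only if it is newer than anything seen from origin."""
        if seq <= self.latest.get(origin, -1):
            return False  # obsolete or duplicate: ignore it
        self.latest[origin] = seq
        return True


at_g = FaultInfoFilter()
first_copy = at_g.accept("b", 1)    # <b, a> arrives: (a, b) has fail-stopped
rejoin = at_g.accept("b", 2)        # b's update after (a, b) re-joins
delayed_copy = at_g.accept("b", 1)  # the delayed <b, a> arrives via m
```

Since the counter is local to b, no clock synchronization between nodes is needed for g to recognize the delayed copy as stale.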

23 Outline
Network model & fault model
Nature of instability during BGP convergence
Protocol G-BGP
Analysis & simulation results
Concluding remarks

24 Properties of G-BGP
G-BGP eliminates all fault-agnostic instability
  consequently, when the destination fail-stops, G-BGP converges with no distribution-inherent instability either (and thus makes no unnecessary route changes)
In cases where the shortest-path-first policy is used, G-BGP asymptotically improves BGP convergence speed (except in scenarios where BGP is already optimal, e.g., node join) and achieves optimal speed in several scenarios

25 Simulation results
We implemented G-BGP in SSFNet, a network simulator with standard-conforming BGP implementations
Our simulations with realistic Internet-type networks show an order-of-magnitude improvement in convergence stability and speed

26 An example: when a destination fail-stops
Convergence speed: time to converge
Convergence stability: the number of unnecessary route changes

27 Outline
Network model & fault model
Nature of instability during BGP convergence
Protocol G-BGP
Analysis & simulation results
Concluding remarks

28 Concluding remarks
Eliminating fault-agnostic instability significantly improves BGP convergence speed, and achieves optimal speed in common scenarios (e.g., node/link fail-stop)
Open issues
  how to characterize and reduce the impact of distribution-inherent instability
  how to deal with high-frequency unanticipated faults (such as Internet worm attacks)

