A stability-oriented approach to improving BGP convergence

A stability-oriented approach to improving BGP convergence
Hongwei Zhang, Anish Arora, Zhijun Liu
November 20, 2018

Background
Border Gateway Protocol (BGP)
- the protocol for Internet inter-domain routing
- path-vector routes, with loop freedom and support for flexible routing policies
Performance issues with BGP
- slow convergence after faults occur: O(n!), where n is the number of ASes in the network; e.g., convergence may take up to 15 minutes after a node fail-stops
- instability during convergence: may incur many unnecessary route changes

State of the art techniques
Consistency assertion (Infocom'02)
- rejects inconsistent routes
- (-) does not deal with inconsistency between nodes that are multiple hops apart
- (-) propagates entry-router-ids of routes from one AS to others, which can cause the propagation of local changes (e.g., entry-router) even if the routes themselves do not change
Ghost flushing (Infocom'03)
- withdraws old routes faster than it propagates new routes
- (-) does not prevent the use of invalid routes, even when some "information" regarding the invalidity is known

State of the art techniques (contd.)
Route change origin (Globecom'02) & root cause notification (UCLA TR)
- propagate the ID of the node that first withdraws or changes its route after a fault occurs
- (-) do not prevent the use of invalid routes when a node with multiple neighbors fail-stops
- (-) "root cause notification" propagates entry-router-ids of routes, which can lead to unnecessary propagation of local changes

What remains to be studied? Our focus:
- The nature of instability during BGP convergence
- Fundamental limits on improving stability, and their relationship with BGP convergence speed
- Mechanisms to approximate the limits of stability and speed in BGP convergence

Outline
- Network model & fault model
- Nature of instability during BGP convergence
- Protocol G-BGP
- Analysis & simulation results
- Concluding remarks

Network model
Network G = (V, E, P)
- V: node set
- E: edge set
- P: routing policies
Autonomous system (AS)
- a set of strongly-connected nodes
Channel
- the set of links between two ASes
Clock
- every node has a clock
- the ratio of clock rates between any two nodes is bounded from above by a constant (no extra constraint on the absolute values of clocks)

Fault model
- Fail-stop of links and nodes: a channel (I, J) between ASes I and J is up if there is any up link between a node in I and another node in J; otherwise, (I, J) is down
- Join of links and nodes
- Change in routing policy
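The channel rule in the fault model can be sketched in a few lines of Python. The `Link` type and the `channel_is_up` function are hypothetical names introduced only for illustration; they do not come from the slides or the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """A link between one node in AS `as_a` and one node in AS `as_b` (illustrative type)."""
    as_a: str
    as_b: str
    up: bool


def channel_is_up(links, i, j):
    """Channel (i, j) is up iff at least one link between AS i and AS j is up."""
    return any(
        link.up and {link.as_a, link.as_b} == {i, j}
        for link in links
    )


links = [
    Link("I", "J", up=False),
    Link("I", "J", up=True),   # one surviving link keeps channel (I, J) up
    Link("I", "K", up=False),  # the only link between I and K is down
]
print(channel_is_up(links, "I", "J"))
print(channel_is_up(links, "I", "K"))
```

Under this reading, a single link failure is not a channel fault as long as another link between the same two ASes survives.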

Nature of instability during BGP convergence
Instability: unnecessary exploration of invalid routes during convergence
Two types of instability
- fault-agnostic instability
- distribution-inherent instability
Illustration here is by examples; only for simplicity of presentation,
- we use a sub-graph of Figure 1 of the paper
- each node is an AS itself, unless otherwise mentioned

Fault-agnostic instability
Definition: a node adopts an invalid route even when certain information has arrived regarding the fault that invalidates the route
[Figure: example topology with nodes a (the destination), h, b, j, f, m, g, and l; the legend distinguishes links to the next hop from unused links]
Route ranking at g:
- [m, b, a] most preferred
- [f, b, a] second most preferred
- [j, h, a] least preferred
Sequence of events:
- channel (a, b) fail-stops
- b withdraws its route
- m withdraws its route; but the withdrawal by f is delayed
- g mistakenly regards route [f, b, a] as valid, and adopts it
Possible reasons for f delaying its route withdrawal:
- the MRAI timer incurs delay
- "link" (f, b) has a long delay, especially since it may well be a multi-hop route at the physical layer
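The example can be replayed with a minimal sketch of preference-based route selection at g. The ranking is the one given on the slide; the `best_route` helper and the set-of-announced-routes representation are assumptions made for illustration, not part of BGP or G-BGP as specified in the paper.

```python
# Route ranking at g, most preferred first (taken from the example above).
RANKING_AT_G = [
    ("m", "b", "a"),
    ("f", "b", "a"),
    ("j", "h", "a"),
]


def best_route(ranking, available):
    """Return the most preferred route that has not been withdrawn, or None."""
    for route in ranking:
        if route in available:
            return route
    return None


# Before the fault, all three neighbors have announced a route to g.
available = set(RANKING_AT_G)

# Channel (a, b) fail-stops. m's withdrawal reaches g; f's is still delayed.
available.discard(("m", "b", "a"))

# g falls back to the next route in its ranking, which still crosses (a, b).
chosen = best_route(RANKING_AT_G, available)
print(chosen)
```

The withdrawal from m alone tells g nothing about why the route was withdrawn, so g has no basis for skipping [f, b, a].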

Distribution-inherent instability: type I
Definition: a node adopts an invalid route because no related information has arrived
[Figure: example topology with nodes a, h, b, j, f, m, g, and l]
Sequence of events:
- channel (a, h) and node b fail-stop simultaneously
- h, f, m withdraw their routes; but the withdrawal by j is delayed
- no information related to the fail-stop of channel (a, h) has arrived; g mistakenly regards route [j, h, a] as valid, and adopts it

Distribution-inherent instability: type II
Definition: a node adopts a valid route that becomes invalid or sub-optimal later
[Figure: example topology with nodes a, h, b, j, f, m, g, and l]
Sequence of events:
- initially, a has not announced its existence
- a announces its existence
- b gets its route
- f gets its route; but m is delayed in getting its route
- g adopts route [f, b, a], which becomes sub-optimal once g learns [m, b, a] later

Fault-agnostic vs. distribution-inherent instability
Distribution-inherent instability
- is impossible to completely eliminate in distributed routing
- is not the major cause of slow BGP convergence in practice, especially when most nodes use the shortest-path-first policy
Fault-agnostic instability
- is the major cause of slow BGP convergence
- can be completely eliminated if finer-grained fault information and better fault detection mechanisms are used, which is the objective of G-BGP

Protocol G-BGP (Grapevine-BGP)
Objective: eliminate fault-agnostic instability
Causes of fault-agnostic instability, and the corresponding solutions:
- Not knowing the exact cause of route changes. Solution: propagate finer-grained fault information
- Uncertainty in fault detection. Solution: resolve uncertainty by collaborative clarification and by quickly marking questionable routes
- Existence of obsolete information. Solution: reject obsolete information using local sequence numbers

Presentation note
- We present here only those cases where a node itself is an AS (or, equivalently, all nodes in an AS use the same route)
- Please refer to the paper for the cases where nodes in an AS use different routes

Propagate finer-grained fault information
Depending on the type of a fault, different fault information is propagated:
- Point of channel-failure: when a channel fail-stops
- Point of channel-withdrawal: when a channel is up but is not used by any node
- Point of segment-withdrawal: when a channel is up but is not used by some node(s)
- Point of AS-failure: when all the nodes in an AS fail-stop
- Point of node-join: when a node joins
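The five kinds of fault information could be modelled as a small tagged record. This is only a sketch: the `FaultKind` and `FaultInfo` names and the `origin`/`peer` fields are hypothetical, chosen to mirror the `<b, a>` notation used in the examples, and the paper may encode this information differently.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FaultKind(Enum):
    CHANNEL_FAILURE = "point of channel-failure"        # a channel fail-stops
    CHANNEL_WITHDRAWAL = "point of channel-withdrawal"  # channel up, used by no node
    SEGMENT_WITHDRAWAL = "point of segment-withdrawal"  # channel up, unused by some node(s)
    AS_FAILURE = "point of AS-failure"                  # all nodes in an AS fail-stop
    NODE_JOIN = "point of node-join"                    # a node joins


@dataclass(frozen=True)
class FaultInfo:
    kind: FaultKind
    origin: str                 # node that generated the information (assumed field)
    peer: Optional[str] = None  # other endpoint, when the fault concerns a channel


# <b, a>: b reports the fail-stop of channel (a, b).
info = FaultInfo(FaultKind.CHANNEL_FAILURE, origin="b", peer="a")
print(info.kind.value, (info.origin, info.peer))
```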

Example: point of channel-failure
[Figure: example topology with nodes a, h, b, j, f, m, g, and l]
Sequence of events:
- channel (a, b) fail-stops
- b withdraws its route; b also propagates <b, a>, denoting the fail-stop of (a, b)
- m withdraws its route and propagates <b, a>; the withdrawal by f is delayed
- g learns that [f, b, a] has become invalid since it passes through (b, a); then, g directly chooses route [j, h, a] without trying [f, b, a] first
Possible reasons for f delaying its route withdrawal:
- the MRAI timer incurs delay
- "link" (f, b) has a long delay, especially since it may well be a multi-hop route at the physical layer
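A minimal sketch of how g could use the point of channel-failure `<b, a>` when selecting a route. The helper names (`crosses_channel`, `best_valid_route`) and the tuple representation of routes are assumptions for illustration; only the ranking and the outcome come from the example above.

```python
def crosses_channel(route, channel):
    """True if two consecutive hops of `route` are the two endpoints of `channel`."""
    x, y = channel
    return any({u, v} == {x, y} for u, v in zip(route, route[1:]))


def best_valid_route(ranking, available, failed_channels):
    """Most preferred announced route that crosses no channel known to have failed."""
    for route in ranking:
        if route not in available:
            continue
        if any(crosses_channel(route, ch) for ch in failed_channels):
            continue
        return route
    return None


ranking_at_g = [("m", "b", "a"), ("f", "b", "a"), ("j", "h", "a")]
available = {("f", "b", "a"), ("j", "h", "a")}  # m has withdrawn; f's withdrawal is delayed
failed = {("b", "a")}                           # point of channel-failure <b, a>, learned via m

chosen = best_valid_route(ranking_at_g, available, failed)
print(chosen)
```

With an empty `failed` set the same function reproduces the plain behaviour and returns `('f', 'b', 'a')`, which is the fault-agnostic instability the extra information is meant to remove.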

Uncertainty in fault detection
[Figure: example topology with nodes a, h, b, j, m, g, and l]
Sequence of events:
- a fail-stops
- the fail-stop of (a, h) and (a, b), instead of the fail-stop of a, is detected
- h, b, m withdraw their routes; but the withdrawal by j is delayed
- g mistakenly regards [j, h, a] as valid, and adopts it
One solution: g waits for a certain time to see whether j withdraws its route
- but the waiting time may be long due to timers such as MinRouteAdvertisementInterval
- so an alternative solution is desirable

Alternative solution: propagate "state-clarifier" when possible
[Figure: example topology with nodes a, h, b, j, m, g, and l]
Sequence of events:
- channel (a, b) fail-stops
- a also detects the fail-stop of (a, b); then, a generates a state-clarifier <a, {b}>, denoting the fail-stop of (a, b) only, and sends it along h, …
- the state-clarifier <a, {b}> propagates quickly, without being subject to timer control
- when m withdraws its route, j may have propagated <a, {b}> to g, or <a, {b}> will arrive at g soon
- when g receives <a, {b}>, g knows that (a, h) is still up; then, g adopts [j, h, a]
On the other hand, if it is a instead of (a, b) that has fail-stopped, then no state-clarifier will reach g; in this case, g will learn the invalidity of [j, h, a] and avoid using it
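One way to read the state-clarifier mechanism is as a three-way classification of routes at g. The sketch below is an interpretation, not the paper's algorithm: the `route_status` function, the status labels, and the rule that a route through a suspected node stays questionable until a clarifier from that node arrives are all assumptions made for illustration.

```python
def route_status(route, failure_points, clarifiers):
    """Classify a route as 'invalid', 'questionable', or 'usable'.

    failure_points: set of (reporter, peer) pairs, e.g. ("b", "a") for <b, a>.
    clarifiers: dict mapping a node to the set of neighbors it reported as
        unreachable, e.g. {"a": {"b"}} for the state-clarifier <a, {b}>.
    """
    # A reported peer may have fail-stopped entirely, not just lost one channel.
    suspects = {peer for _, peer in failure_points}
    status = "usable"
    for u, v in zip(route, route[1:]):
        if (u, v) in failure_points or (v, u) in failure_points:
            return "invalid"  # the route crosses a channel known to be down
        for node, other in ((u, v), (v, u)):
            if node in suspects:
                if node not in clarifiers:
                    status = "questionable"  # node may be dead; no clarifier yet
                elif other in clarifiers[node]:
                    return "invalid"  # node is alive but reports this channel down
    return status


failure_points = {("b", "a")}  # <b, a> has reached g
print(route_status(("j", "h", "a"), failure_points, {}))
print(route_status(("j", "h", "a"), failure_points, {"a": {"b"}}))
print(route_status(("f", "b", "a"), failure_points, {"a": {"b"}}))
```

In this reading, the arrival of `<a, {b}>` is what moves [j, h, a] from questionable to usable; if a itself has fail-stopped, no clarifier is ever generated and the route is never promoted.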

Reject obsolete fault information
[Figure: example topology with nodes a, h, b, j, f, m, g, and l]
Sequence of events:
- (a, b) fail-stops
- the point of channel-failure <b, a> is generated at b, signifying the fail-stop of channel (a, b)
- <b, a> reaches g and f, but is delayed in reaching m
- g changes its route to [j, h, a]
- (a, b) re-joins
- b, f, and g change their routes back
- the delayed <b, a> reaches m, and then g
- g changes its route to [j, h, a] again, after receiving <b, a>, which has become obsolete
Solution: g detects and rejects the obsolete information <b, a>, using the local sequence number maintained at b; then, g will not change its route after receiving the obsolete <b, a>
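The sequence-number check can be sketched as follows. The slide only says that a local sequence number maintained at b is used; how it is carried and compared is not stated, so this sketch assumes that b increments the number on every state change of its channel and stamps it on both fault information and route announcements. `FaultInfoFilter` and its methods are hypothetical names.

```python
class FaultInfoFilter:
    """Tracks, per originating node, the newest sequence number seen so far."""

    def __init__(self):
        self.latest_seq = {}

    def observe(self, origin, seq):
        """Record the sequence number carried by any message that originated at `origin`."""
        if seq > self.latest_seq.get(origin, -1):
            self.latest_seq[origin] = seq

    def accept_fault(self, origin, seq):
        """Accept fault information only if it is not older than what is already known."""
        if seq < self.latest_seq.get(origin, -1):
            return False  # obsolete: origin has changed state since this was generated
        self.observe(origin, seq)
        return True


g = FaultInfoFilter()

# (a, b) fail-stops: b generates <b, a> with sequence number 1, and it reaches g.
first = g.accept_fault("b", 1)

# (a, b) re-joins: b re-announces its route, stamped with sequence number 2.
g.observe("b", 2)

# The delayed copy of <b, a> (still sequence number 1) finally arrives via m.
second = g.accept_fault("b", 1)

print(first, second)
```

Because the delayed copy carries an older number than the re-announcement, g ignores it and keeps its current route.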

Properties of G-BGP
- G-BGP eliminates all fault-agnostic instability
- Consequently, when the destination fail-stops, G-BGP converges with no distribution-inherent instability either (and thus has no unnecessary route changes)
- When the shortest-path-first policy is used, G-BGP asymptotically improves BGP convergence speed (except in scenarios where BGP is already optimal, e.g., node join) and achieves optimal speed in several scenarios

Simulation results
- We implemented G-BGP in SSFNet, a network simulator with standard-conforming BGP implementations
- Our simulations with realistic Internet-type networks show an order-of-magnitude improvement in convergence stability and speed

An example: when a destination fail-stops
- Convergence speed: time to converge
- Convergence stability: the number of unnecessary route changes

Concluding remarks
- Eliminating fault-agnostic instability significantly improves BGP convergence speed and achieves optimal speed in common scenarios (e.g., node/link fail-stop)
Open issues
- how to characterize and reduce the impact of distribution-inherent instability
- how to deal with high-frequency unanticipated faults (such as Internet worm attacks)