Routing Measurements: Three Case Studies Jennifer Rexford.



Motivations for Measuring the Routing System
Characterizing the Internet
  – Internet path properties
  – Demands on Internet routers
  – Routing convergence
Improving Internet health
  – Protocol design problems
  – Protocol implementation problems
  – Configuration errors or attacks
Operating a network
  – Detecting and diagnosing routing problems
  – Traffic shifts, routing attacks, flaky equipment, …

Techniques for Measuring Internet Routing
Active probing
  – Inject probes along path through the data plane
  – E.g., using traceroute
Passive route monitoring
  – Capture control-plane messages between routers
  – E.g., using tcpdump or a software router
  – E.g., dumping the routing table on a router
Injecting network events
  – Cause failure/recovery at planned time and place
  – E.g., BGP route beacon, or planned maintenance

Challenges in Measuring Routing
Data vs. control plane
  – Understand relationship between routing protocol messages and the impact on data traffic
Cause vs. effect
  – Identify the root cause for a change in the forwarding path or control-plane messages
Visibility and representativeness
  – Collect routing data from many vantage points
  – Across many Autonomous Systems, or within
Large volume of data
  – Many end-to-end paths
  – Many prefixes and update measurements

Measurement Tools: Traceroute
Traceroute tool exploits TTL-limited probes
  – Observation of the forwarding path
Useful, but introduces many challenges
  – Path changes
  – Non-participating nodes
  – Inaccurate, two-way measurements
  – Hard to map interfaces to routers and ASes
[Figure: probes with TTL=1 and TTL=2 sent from source toward destination, each answered with “time exceeded”]
Send packets with TTL=1, 2, 3, … and record source of “time exceeded” message
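The TTL-limited probing idea can be illustrated without sending real packets. The sketch below is a simulation only: the function name `simulate_traceroute` and the list-based path model are assumptions made for illustration, not part of the traceroute tool. A `None` entry stands for a non-participating node that never sends “time exceeded”.

```python
def simulate_traceroute(path, max_ttl=30):
    """Simulate TTL-limited probing over a known forwarding path.

    path[i] is the address that answers a probe with TTL = i + 1,
    or None if that hop does not reply (a non-participating node).
    The last entry of path is the destination itself.
    """
    hops = []
    for ttl in range(1, min(max_ttl, len(path)) + 1):
        responder = path[ttl - 1]          # node where the TTL reaches zero
        hops.append("*" if responder is None else responder)
    return hops


# One silent hop in the middle of a three-hop path.
print(simulate_traceroute(["10.0.0.1", None, "192.0.2.9"]))
```

The output lists one entry per TTL value, with “*” where no reply was recorded, which is how the non-participating-node problem shows up in practice.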

Measurement: Intradomain Route Monitoring
OSPF is a flooding protocol
  – Every link-state advertisement (LSA) is sent on every link
  – Very helpful for simplifying the monitor
Can participate in the protocol
  – Shared media (e.g., Ethernet): join multicast group and listen to LSAs
  – Point-to-point links: establish an adjacency with a router
… or passively monitor packets on a link
  – Tap a link and capture the OSPF packets

Measurement: Interdomain Route Monitoring
Talk to operational routers using SNMP or telnet at command line
  (-) BGP table dumps are expensive
  (+) Table dumps show all alternate routes
  (-) Update dynamics lost
  (-) Restricted to interfaces provided by vendors
Establish a “passive” BGP session from a workstation running BGP software
  (+) BGP table dumps do not burden operational routers
  (-) Receives only best routes from BGP neighbor
  (+) Update dynamics captured
  (+) Not restricted to interfaces provided by vendors
[Figure: BGP session over TCP]

Collect BGP Data From Many Routers
[Figure: map of routers in U.S. cities (Seattle, San Francisco, Chicago, New York, Atlanta, Dallas, and others) with a route monitor]
BGP is not a flooding protocol

Two Kinds of BGP Monitoring Data
Wide-area, from many ASes
  – RouteViews or RIPE-NCC data
  – Pro: available from many vantage points
  – Con: often just one or two views per AS
Single AS, from many routers
  – Abilene and GEANT public repositories
  – Proprietary data at individual ISPs
  – Pro: comprehensive view of a single AS
  – Con: limited public examples, mostly research nets

Measurement: Injecting Events
Equipment failure/recovery
  – Unplug/reconnect the equipment
  – Packet filters that block all packets
  – Knowing when planned event will take place
  – Shutting down a routing-protocol adjacency
Injecting route announcements
  – Acquire some blocks of IP addresses
  – Acquire a routing-protocol adjacency to a router
  – Announce/withdraw routes on a schedule
  – Beacons:

Two Papers for Today
Both early measurement studies
  – Initially appeared at SIGCOMM’96 and ’97
  – Both won the “best student paper” award
  – Early glimpses into the health of Internet routing
  – Early wave of papers on Internet measurement
Differences in emphasis
  – Paxson96: end-to-end active probing to measure the characteristics of the data plane
  – Labovitz97: passive monitoring of BGP update messages from several ISPs to characterize (in)stability of the interdomain routing system

Paxson Study: Forwarding Loops
Forwarding loop
  – Packet returns to same router multiple times
May cause traceroute to show a loop
  – If loop lasted long enough
  – So many packets traverse the loopy path
Traceroute may reveal false loops
  – Path change that leads to a longer path
  – Causing later probe packets to hit same nodes
Heuristic solution
  – Require traceroute to return same path 3 times
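One way to read the “same path 3 times” heuristic is sketched below. This is an interpretation, not the study's actual code: the function names and the list-of-hops representation are assumptions, and “*” marks a hop that did not reply.

```python
def has_loop(hops):
    """Return True if any responding hop appears more than once in a trace."""
    seen = set()
    for hop in hops:
        if hop == "*":              # silent hops cannot be compared
            continue
        if hop in seen:
            return True
        seen.add(hop)
    return False


def confirmed_loop(traces, required=3):
    """Report a loop only if the last `required` traces are identical and loopy.

    A single looping trace may be a false loop caused by a path change,
    so repeated identical observations are required before reporting.
    """
    if len(traces) < required:
        return False
    recent = traces[-required:]
    return all(t == recent[0] for t in recent) and has_loop(recent[0])


loopy = ["a", "b", "c", "b", "c"]
print(confirmed_loop([loopy, loopy, loopy]))   # same loopy path three times
print(confirmed_loop([loopy, loopy]))          # too few observations
```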

Paxson Study: Causes of Loops
Transient vs. persistent
  – Transient: routing-protocol convergence
  – Persistent: likely configuration problem
Challenges
  – Appropriate time boundary between the two?
  – What about flaky equipment going up and down?
  – Determining the cause of persistent loops?
Anecdote on recent study of persistent loops
  – Provider has static route for customer prefix
  – Customer has default route to the provider

Paxson Study: Path Fluttering
Rapid changes between paths
  – Multiple paths between a pair of hosts
  – Load balancing policies inside the network
Packet-based load balancing
  – Round-robin or random
  – Multiple paths for packets in a single flow
Flow-based load balancing
  – Hash of some fields in the packet header
  – E.g., IP addresses, port numbers, etc.
  – To keep packets in a flow on one path
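The difference between the two load-balancing styles can be sketched in a few lines. The helper names `round_robin` and `pick_path`, the choice of SHA-256, and the exact header fields hashed are illustrative assumptions; real routers use their own hash functions and field sets.

```python
import hashlib
import itertools


def round_robin(num_paths):
    """Packet-based balancing: each successive packet takes the next path."""
    return itertools.cycle(range(num_paths))


def pick_path(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    """Flow-based balancing: hash header fields so a flow stays on one path."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


rr = round_robin(3)
print([next(rr) for _ in range(4)])            # packets of one flow spread out

flow = ("192.0.2.1", "198.51.100.7", 40000, 80, "tcp")
print(pick_path(*flow, 3) == pick_path(*flow, 3))   # same flow, same path
```

With round-robin, packets of a single flow alternate between paths, which is what an active prober sees as fluttering. With hashing, every packet carrying the same header fields maps to the same path.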

Paxson Study: Routing Stability
Route prevalence
  – Likelihood of observing a particular route
  – Relatively easy to measure with sound sampling
  – Poisson arrivals see time averages (PASTA)
  – Most host pairs have a dominant route
Route persistence
  – How long a route endures before a change
  – Much harder to measure through active probes
  – Look for cases of multiple observations
  – Typical host pair has path persistence of a week
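Route prevalence can be estimated directly from sampled observations. The sketch below assumes the samples for one host pair were collected at suitably random (e.g., Poisson-spaced) times, so the fraction of samples showing a route estimates the fraction of time that route was in use. The function name and the tuple representation of a route are assumptions.

```python
from collections import Counter


def prevalence(observed_routes):
    """Return (dominant_route, fraction of observations showing it)."""
    if not observed_routes:
        raise ValueError("need at least one observation")
    counts = Counter(observed_routes)
    route, n = counts.most_common(1)[0]
    return route, n / len(observed_routes)


# Four samples for one host pair; three show the same route.
samples = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "b")]
print(prevalence(samples))
```

Persistence is not recoverable this way, because sparse samples do not show how many changes happened between two observations.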

Paxson Study: Route Asymmetry
Hot-potato routing
  [Figure: early-exit routing; Customer A, Provider A, Provider B, Customer B, with multiple peering points]
Other causes
  – Asymmetric link weights in intradomain routing
  – Cold-potato routing, where AS requests traffic enter at particular place
Consequences
  – Lots of asymmetry
  – One-way delay is not necessarily half of the round-trip time

Labovitz Study: Interdomain Routing
AS-level topology
  – Destinations are IP prefixes (e.g., /8)
  – Nodes are Autonomous Systems (ASes)
  – Links are connections & business relationships
[Figure: AS-level topology between a client and a Web server]

Labovitz Study: BGP Background
Extension of distance-vector routing
  – Support flexible routing policies
  – Avoid count-to-infinity problem
Key idea: advertise the entire path
  – Distance vector: send distance metric per dest d
  – Path vector: send the entire path for each dest d
[Figure: advertisements “d: path (1)” and “d: path (2,1)” for destination d, and the resulting data traffic]
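Advertising the whole path is what lets a node detect loops. A minimal sketch, with hypothetical helper names `advertise` and `accept` and paths modeled as tuples of AS numbers (most recent AS first, matching the “d: path (2,1)” notation):

```python
def advertise(my_as, path):
    """Prepend our own AS number before passing a route to a neighbor."""
    return (my_as,) + tuple(path)


def accept(my_as, path):
    """Reject any advertised path that already contains our own AS."""
    return my_as not in path


p1 = advertise(1, ())        # AS 1 originates destination d
p2 = advertise(2, p1)        # AS 2 passes it on
print(p1, p2)
print(accept(3, p2))         # AS 3 is not on the path, so it can use the route
print(accept(1, p2))         # AS 1 sees itself on the path and discards it
```

A plain distance metric carries no such information, which is why distance-vector protocols can count to infinity after a failure.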

Labovitz Study: BGP Background
BGP is an incremental protocol
  – In theory, no update messages in steady state
Two kinds of update messages
  – Announcement: advertising a new route
  – Withdrawal: withdrawing an old route
Study saw an alarming number of updates
  – At the time, Internet had around 45,000 prefixes
  – Routers were exchanging 3-6 million updates/day
  – Sometimes as high as 30 million in a day
Placing a very high load on the routers

Labovitz Study: Classifying Update Messages
Analyze update messages
  – For each (prefix, peer) tuple
  – Classify the kinds of routing changes
Forwarding instability
  – WADiff: explicit withdraw, replaced by alternate
  – AADiff: implicit withdraw, replaced by alternate
Pathological
  – WADup: explicit withdraw, and then reannounced
  – AADup: duplicate announcement
  – WWDup: duplicate withdrawal
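The five categories can be computed with a small amount of per-(prefix, peer) state. The sketch below is a simplified reading of the taxonomy, not the study's code: the tuple format of an update, the function name `classify`, and the decision to leave the first message for a pair and ordinary withdrawals unlabeled (`None`) are all assumptions.

```python
def classify(updates):
    """Label each update as WADiff, AADiff, WADup, AADup, WWDup, or None.

    updates: iterable of (prefix, peer, kind, route) where kind is "A"
    (announcement) or "W" (withdrawal) and route is None for withdrawals.
    """
    state = {}    # (prefix, peer) -> (last kind, last announced route)
    labels = []
    for prefix, peer, kind, route in updates:
        key = (prefix, peer)
        last_kind, last_route = state.get(key, (None, None))
        label = None
        if kind == "W":
            if last_kind == "W":
                label = "WWDup"                     # withdrawal repeated
            state[key] = ("W", last_route)          # remember what was withdrawn
        else:
            if last_kind == "A":                    # implicit withdraw
                label = "AADup" if route == last_route else "AADiff"
            elif last_kind == "W" and last_route is not None:
                label = "WADup" if route == last_route else "WADiff"
            state[key] = ("A", route)
        labels.append(label)
    return labels


stream = [
    ("p", "x", "A", (1, 2)),
    ("p", "x", "A", (1, 2)),
    ("p", "x", "A", (1, 3)),
    ("p", "x", "W", None),
    ("p", "x", "W", None),
    ("p", "x", "A", (1, 3)),
]
print(classify(stream))
```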

Labovitz Study: Duplicate Withdrawals
Time-space trade-off in router implementation
  – Common system building technique
  – Trade one resource for another
  – Can have surprising side effects
The gory details
  – Ideally, you should not send a withdrawal if you never sent a neighbor a corresponding announcement
  – Requires remembering what update message you sent to each neighbor
  – Easier to just send everyone a withdrawal when your route goes away

Labovitz Study: Practical Impact
“Stateless BGP” is compliant with the standard
  – But, it forces other routers to handle more load
  – So that you don’t have to maintain state
  – Arguably very unfair, and bad for global Internet
One router vendor was largely at fault
  – Router vendor modified its implementation
  – ISPs then deployed the updated software

Labovitz Study: Still Hard to Diagnose Problems
Despite having very detailed view into BGP
  – Some pathologies were very hard to diagnose
Possible causes
  – Flaky equipment
  – Synchronization of BGP timers
  – Interaction between BGP and intradomain routing
  – Policy oscillation
These topics were studied in follow-up studies
  – Example: study of BGP data within a large ISP

ISP Study: Detecting Important Routing Changes
Large volume of BGP update messages
  – Around 2 million/day, and very bursty
  – Too much for an operator to manage
Identify important anomalies
  – Lost reachability
  – Persistent flapping
  – Large traffic shifts
Not the same as root-cause analysis
  – Identify changes and their effects
  – Focus on mitigation, rather than diagnosis
  – Diagnose causes if they occur in/near the AS

Challenge #1: Excess Update Messages
A single routing change
  – Leads to multiple update messages
  – Affects routing decision at multiple routers
[Figure: BGP Updates -> BGP Update Grouping -> Events and Persistent Flapping Prefixes]
Group updates for a prefix with inter-arrival < 70 seconds, and flag prefixes with changes lasting > 10 minutes.
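The grouping rule (inter-arrival under 70 seconds, events longer than 10 minutes flagged) can be sketched as follows. The function names, the `(timestamp, prefix)` input format, and the event tuple layout are assumptions for illustration; the two thresholds come from the slide.

```python
from collections import defaultdict


def group_events(updates, timeout=70.0):
    """Group updates per prefix into events.

    updates: iterable of (timestamp_seconds, prefix).
    Returns a list of (prefix, start, end, update_count); two updates for
    the same prefix belong to one event if they are < timeout apart.
    """
    by_prefix = defaultdict(list)
    for ts, prefix in updates:
        by_prefix[prefix].append(ts)

    events = []
    for prefix, times in by_prefix.items():
        times.sort()
        start = prev = times[0]
        count = 1
        for ts in times[1:]:
            if ts - prev < timeout:
                count += 1                       # still the same event
            else:
                events.append((prefix, start, prev, count))
                start, count = ts, 1             # gap too long: new event
            prev = ts
        events.append((prefix, start, prev, count))
    return events


def persistent_flapping(events, max_duration=600.0):
    """Prefixes with at least one event lasting longer than max_duration."""
    return sorted({p for p, start, end, _ in events if end - start > max_duration})


updates = [(0, "a"), (30, "a"), (90, "a"), (300, "a")]
print(group_events(updates))
```

In this example the first three updates collapse into one event and the fourth, arriving 210 seconds later, starts a new one.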

Determine “Event Timeout”
[Figure: cumulative distribution of BGP update inter-arrival time (BGP beacon), with the point (70, 98%) marked]

Event Duration: Persistent Flapping
[Figure: complementary cumulative distribution of event duration, with the point (600, 0.1%) marked; region beyond it labeled “Long Events”]

Detecting Persistent Flapping
Significant persistent flapping
  – 15.2% of all BGP update messages
  – … though a small number of destination prefixes
  – Surprising, especially since flap dampening is used
Types of persistent flapping
  – Conservative flap-damping parameters (78.6%)
  – Policy oscillations, e.g., MED oscillation (18.3%)
  – Unstable interface or BGP session (3.0%)

Example: Unstable eBGP Session
[Figure: AT&T network with a peer, a customer, and destination prefix p]

Challenge #2: Identify Important Events
Major concerns of network operators
  – Changes in reachability
  – Heavy load of routing messages on the routers
  – Flow of the traffic through the network
[Figure: Events -> Event Classification -> “Typed” Events: No Disruption, Loss/Gain of Reachability, Internal Disruption, Single External Disruption, Multiple External Disruption]
Classify events by the type of impact they have on the network

Event Category – “No Disruption”
[Figure: AT&T routers, neighboring AS 1 and AS 2, and destination prefix p; no traffic shift]
“No Disruption”: each of the border routers has no traffic shift

Event Category – “Internal Disruption”
[Figure: AT&T routers, neighboring AS 1 and AS 2, and destination prefix p; internal traffic shift]
“Internal Disruption”: all of the traffic shifts are internal

Event Type: “Single External Disruption”
[Figure: AT&T routers, neighboring AS 1 and AS 2, and destination prefix p; external traffic shift]
“Single External Disruption”: traffic at one exit point shifts to other exit points

Statistics on Event Classification
Category                        Events   Updates
No Disruption                   50.3%    48.6%
Internal Disruption             15.6%    3.4%
Single External Disruption      20.7%    7.9%
Multiple External Disruption    7.4%     18.2%
Loss/Gain of Reachability       6.0%     21.9%

Challenge #3: Multiple Destinations
A single routing change
  – Affects multiple destination prefixes
[Figure: “Typed” Events -> Event Correlation -> Clusters]
Group events of same type that occur close in time

Main Causes of Large Clusters: BGP Resets
External BGP session resets
  – Failure/recovery of external BGP session
  – E.g., session to another large tier-1 ISP
  – Caused “single external disruption” events
  – Validated by looking at syslog reports on routers
[Figure: AT&T routers, neighboring AS 1 and AS 2, and destination prefix p]

Main Causes of Large Clusters: Hot Potatoes
Hot-potato routing changes
  – Failure/recovery of an intradomain link
  – E.g., leads to changes in IGP path costs
  – Caused “internal disruption” events
  – Validated by looking at OSPF measurements
[Figure: ISP with router P and egress points A, B, and C]
“Hot-potato routing” = route to closest egress point
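A small sketch of why an intradomain change can move traffic even though no BGP route changed. The function name `closest_egress`, the cost values, and the alphabetical tie-break are illustrative assumptions; real routers apply further BGP tie-breaking steps.

```python
def closest_egress(igp_cost, egresses):
    """Pick the egress point with the smallest IGP path cost (name breaks ties)."""
    return min(egresses, key=lambda e: (igp_cost[e], e))


egresses = ["A", "B", "C"]

# IGP costs from one ingress router to each egress before a link failure.
before = {"A": 9, "B": 10, "C": 20}
# The failure lengthens the path to A; the BGP routes are all unchanged.
after = {"A": 11, "B": 10, "C": 20}

print(closest_egress(before, egresses))
print(closest_egress(after, egresses))
```

The selected egress moves from A to B purely because of the IGP cost change, which is the kind of event the study classifies as an internal disruption.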

Challenge #4: Popularity of Destinations
Impact of event on traffic
  – Depends on the popularity of the destinations
[Figure: Clusters and Netflow Data -> Traffic Impact Prediction -> Large Disruptions]
Weight the group of destinations by the traffic volume

ISP Study: Traffic Impact Prediction
Traffic weight
  – Per-prefix measurements from Netflow
  – 10% of prefixes account for 90% of traffic
Traffic weight of a cluster
  – The sum of “traffic weight” of the prefixes
Flag clusters with heavy traffic
  – A few large clusters have large traffic weight
  – Mostly session resets and hot-potato changes
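The cluster weighting step is a sum over per-prefix traffic shares. In the sketch below the function names, the example weights, and the flagging threshold are all hypothetical; the slide does not state what threshold the study used.

```python
def cluster_weight(prefixes, traffic_weight):
    """Sum the traffic weight of every prefix in a cluster (unknown prefixes count as 0)."""
    return sum(traffic_weight.get(p, 0.0) for p in prefixes)


def heavy_clusters(clusters, traffic_weight, threshold):
    """Names of clusters whose total traffic weight reaches the threshold."""
    return [name for name, prefixes in clusters.items()
            if cluster_weight(prefixes, traffic_weight) >= threshold]


# Hypothetical per-prefix shares of total traffic, e.g. derived from Netflow.
weights = {"p1": 0.5, "p2": 0.25, "p3": 0.125}
clusters = {"session-reset": ["p1", "p2"], "other": ["p3", "p4"]}

print(cluster_weight(clusters["session-reset"], weights))
print(heavy_clusters(clusters, weights, threshold=0.5))
```

Because traffic is concentrated on a small share of prefixes, a cluster containing a few popular prefixes can outweigh a much larger cluster of unpopular ones.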

ISP Study: Summary
[Figure: processing pipeline. BGP Updates (10^6) -> BGP Update Grouping -> Events (10^5) and Persistent Flapping Prefixes (10^1) -> Event Classification -> “Typed” Events -> Event Correlation -> Clusters (10^3) and Frequent Flapping Prefixes (10^1) -> Traffic Impact Prediction, with Netflow Data -> Large Disruptions (10^1)]

Three Studies, Three Approaches
End-to-end active probes
  – Measure and characterize the forwarding path
  – Identify the effects on data traffic
Wide-area passive route monitoring
  – Measure and classify BGP routing churn
  – Identify pathologies and improve Internet health
Intra-AS passive route monitoring
  – Detailed measurements of BGP within an AS
  – Aggregate data into small set of major events