PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services. Ming Zhang, Chi Zhang, Vivek Pai, Larry Peterson, Randy Wang. Princeton University.

Presentation transcript:

PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services
Ming Zhang, Chi Zhang, Vivek Pai, Larry Peterson, Randy Wang
Princeton University

2 Motivation
Routing anomalies are common on the Internet
  • Maintenance
  • Power outage
  • Fiber cut
  • Misconfiguration
  • …
Anomalies can affect end-to-end performance
  • Packet losses
  • Packet delays
  • Loss of connectivity

3 Background
Anomaly detection and diagnosis are nontrivial
  • Asymmetric paths
  • Failure information propagation
  • Highly varied durations
  • Limited coverage

4 Contributions
New techniques for
  • Anomaly detection
  • Anomaly isolation
  • Anomaly classification
Large-scale study of anomalies
  • Broad coverage
  • High detection rate, low overhead
  • Characterization of anomalies
  • End-to-end effects
  • Benefits to host service

5 Outline
State of the Art
PlanetSeer Components
  • MonD – passive monitoring
  • ProbeD – active probing
Anomaly Analysis
  • Loop-based anomaly
  • Non-loop anomaly
Bypassing Anomalies
Summary

6 State of the Art
Routing messages
  • BGP: AS-level diagnosis
  • IS-IS, OSPF: Within single ISP
Router/link traffic statistics
  • SNMP, NetFlow: proprietary
End-to-end measurement
  • Ping, traceroute

7 End-to-End Probing
All-pairs probes among n nodes
  • O(n^2) measurement cost
  • Not scalable as n grows
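
To make the O(n^2) cost concrete, here is a small sketch that counts directed source/destination pairs in an all-pairs scheme. The function name is illustrative only, and the 353-node figure is borrowed from the ProbeD Groups slide of this deck; it is used here as an example size, not as a measurement the authors performed.

```python
def all_pairs_probe_count(n: int) -> int:
    """Number of directed (source, destination) pairs among n nodes."""
    return n * (n - 1)


# The count grows quadratically with the number of nodes.
for n in (10, 100, 353):
    print(n, all_pairs_probe_count(n))
```

Each pair would also need to be probed repeatedly to catch short-lived anomalies, so the real cost is this count multiplied by the probing rate.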

8 Key Observation
Combine passive monitoring with active probing
Peer-to-Peer (P2P), Content Distribution Network (CDN)
  • Large client population
  • Geographically distributed nodes
  • Large traffic volume
  • Highly diverse paths
The traffic generated by the services reveals information about the network.

9 Our Approach
Host service
  • CDN
Components
  • Passive monitoring
  • Active probing
Advantages
  • Low overhead
  • Wide coverage
[Figure labels: Client, A, B, C, R1, R2]

10 MonD: Anomaly Detection
Anomaly indicators
  • Time-to-live (TTL) change
    Indicates a routing change
  • n consecutive timeouts (n = 4 in current system)
    Corresponds to an idle period of 3 to 16 seconds; most congestion periods last < 220 ms
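
The two indicators on this slide can be sketched as a small per-flow state machine. This is a minimal illustration, not MonD's actual code: the `FlowState` class, both function names, and the rule that any received packet resets the timeout streak are assumptions made for the example. Only the threshold n = 4 comes from the slide.

```python
from dataclasses import dataclass
from typing import Optional

TIMEOUT_THRESHOLD = 4  # n = 4, as stated on the slide


@dataclass
class FlowState:
    """Hypothetical per-flow bookkeeping for the two indicators."""
    last_ttl: Optional[int] = None
    consecutive_timeouts: int = 0


def observe_packet(state: FlowState, ttl: int) -> bool:
    """Record a received packet; return True if its TTL differs from the previous one."""
    changed = state.last_ttl is not None and ttl != state.last_ttl
    state.last_ttl = ttl
    # Assumption: receiving any packet ends the current timeout streak.
    state.consecutive_timeouts = 0
    return changed


def observe_timeout(state: FlowState) -> bool:
    """Record a timeout; return True when the streak first reaches the threshold."""
    state.consecutive_timeouts += 1
    return state.consecutive_timeouts == TIMEOUT_THRESHOLD
```

Either function returning True would mark the flow as a possible anomaly to hand to active probing.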

11 ProbeD Operation
Baseline probes
  • When a new IP appears
  • From local node
Forward probes
  • When a possible anomaly is detected
  • From multiple nodes (including local node)
Reprobes
  • At 0.5, 1.5, 3.5 and 7.5 hours later
  • From local node
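
The reprobe offsets listed above happen to match a schedule in which the gap between probes doubles each time (0.5, 1, 2, then 4 hours). The slide does not say the schedule was derived this way, so the sketch below is only one way to reproduce the four offsets; the function name and parameters are hypothetical.

```python
from typing import List


def reprobe_offsets(first_gap_hours: float = 0.5, count: int = 4) -> List[float]:
    """Cumulative offsets (in hours) when each gap is twice the previous one."""
    offsets = []
    elapsed = 0.0
    gap = first_gap_hours
    for _ in range(count):
        elapsed += gap
        offsets.append(elapsed)
        gap *= 2
    return offsets


print(reprobe_offsets())  # [0.5, 1.5, 3.5, 7.5]
```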

12 ProbeD Groups
353 nodes, 145 sites, 30 groups
  • According to geographic location
  • One traceroute per group
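
One way to read "one traceroute per group" is that a single node is chosen from each geographic group to probe a destination, alongside the local node. The sketch below assumes a random choice within each group and assumes the local node stands in for its own group; neither detail is stated on the slide, and the function and group names are illustrative.

```python
import random
from typing import Dict, List, Optional


def pick_probers(groups: Dict[str, List[str]], local_node: str,
                 rng: Optional[random.Random] = None) -> List[str]:
    """Return the local node plus one node from every other non-empty group."""
    rng = rng or random.Random()
    chosen = [local_node]
    for name in sorted(groups):
        members = groups[name]
        if local_node in members:
            continue  # assumption: the local node covers its own group
        if members:
            chosen.append(rng.choice(members))  # assumption: random pick
    return chosen
```

With 30 groups this issues at most 30 traceroutes per suspected anomaly, instead of one from each of the 353 nodes.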

13 Estimating Scope
Which routers might be affected?
  • Routers which possibly change their next hops
  • Traceroutes from multiple locations can narrow the scope
[Figure labels: routers r_a, r_b, r_c, r_d; Client; Local ProbeD; Remote ProbeD]

14 Path Diversity
Monitoring Period: 02/2004 – 05/2004
Unique IPs: 887,521
Traversed ASes: 10,
[Figure labels: 215 ASes, 1392 ASes, 1420 ASes; Core, Edge]

15 Confirming Anomalies
Reported anomalies
  • 2,259,588
Conditions
  • Loops
  • Route change
  • Partial unreachability
  • ICMP unreachable
Very conservative confirmation
[Pie chart: Anomaly 12%, Non-anomaly 66%, Undecided 22%]

16 Confirmed Anomaly Breakdown
Confirmed anomalies
  • 271,898
  • 2 per minute
  • 100x more
Temp anomalies
  • Inconsistent probes
[Pie chart: Path Change 44%, Other Outage 23%, Temp Anomalies 16%, Fwd Outage 9%, Persist Loop 7%, Temp Loop 1%]
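
The "2 per minute" rate and the 12% confirmation share can be checked with a short calculation. The counts come from slides 15 and 16; the 90-day length of the monitoring period is an assumption, based on the "3 months" quoted in the summary slide.

```python
reported = 2_259_588   # reported anomalies (slide 15)
confirmed = 271_898    # confirmed anomalies (slide 16)
days = 90              # assumption: "3 months" taken as 90 days

minutes = days * 24 * 60
rate_per_minute = confirmed / minutes
confirmed_share = 100 * confirmed / reported

print(round(rate_per_minute, 2))   # 2.1
print(round(confirmed_share, 1))   # 12.0
```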

17 Scope of Loops
How many routers or ASes are involved?
  • Temp loops involve more routers than persistent loops
  • 97% of persistent loops and 51% of temp loops contain 2 hops
1% of persistent loops cross ASes
15% of temp loops cross ASes

18 Distribution of Loops
Many persistent loops in tier-3, few in tier-1
Worst 10% of tier-1 ASes – implications for largest ISPs
  • 20% traffic
  • 35% persistent loops

19 Duration of Persistent Loops
How long do persistent loops last?
  • Either resolve quickly or last for an extended period

20 Scope of Forward Anomalies
How many routers or ASes are affected?
  • 60% of outages within 1 hop
  • 75% of outages and 68% of changes within 4 hops
78% of outages within 2 ASes
57% of changes within 2 ASes

21 Location of Forward Anomalies
How close are the anomalies to the edges of the network?
  • 44% of outages at the last hop
  • 72% of outages and 40% of changes within 4 hops

22 Distribution of Forward Anomalies
Which ASes are affected?
  • Tier-1 ASes most stable
  • Tier-3 ASes most likely to be affected

23 Overlay Routing
Use alternate path when default path fails
[Figure labels: source, destination, intermediate]
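
The idea of detouring through an intermediate node can be sketched as follows. This is a simplified one-hop illustration rather than the system evaluated in the talk; `pick_route` and the `reachable` callback are hypothetical names, and a real overlay would measure reachability rather than be told it.

```python
from typing import Callable, Iterable, List, Optional


def pick_route(src: str, dst: str, intermediates: Iterable[str],
               reachable: Callable[[str, str], bool]) -> Optional[List[str]]:
    """Use the default path if it works; otherwise try a one-hop detour."""
    if reachable(src, dst):
        return [src, dst]
    for hop in intermediates:
        # The detour only helps if both legs work.
        if reachable(src, hop) and reachable(hop, dst):
            return [src, hop, dst]
    return None  # no intermediate can bypass the failure
```

A failure close to the destination tends to break the second leg for every intermediate, in which case the function returns None.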

24 Bypassing Anomalies
How useful is overlay routing for bypassing failures?
  • Effective in 43% of 62,815 failures, lower than previous studies
  • 32% of bypass paths inflate RTTs by more than a factor of two

25 Summary
Confirm 272,000 anomalies in 3 months
Persistent and temporary loops
  • Persistent loops have narrower scope and either resolve quickly or last for a long time
Path outages and changes
  • Outages are closer to the edge and have narrower scope
Anomaly distribution
  • Skewed. Tier-1 most stable. Tier-3 most problematic.
Overlay routing
  • Bypasses 43% of failures, with latency inflation

26 More Information
In the paper
  • More details about anomaly characteristics
  • End-to-end impacts
  • Classification methodology
  • Optimizations to reduce overheads & improve confirmation rate

27 Classifying Anomalies
Temporary vs. persistent loops
  • Whether the probe exits the loop before reaching the maximum hop count
Path changes vs. outages
  • Changes: follow different paths to clients
  • Outages: stop at intermediate hops
[Figure labels: ProbeD, Client]
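
The distinctions on this slide can be sketched as a classifier over a single traceroute result. This is an illustrative simplification, not the classification methodology from the paper: the function name, the representation of a traceroute as a list of hop addresses with None for unanswered hops, the rule that a repeated address signals a loop, and the default limit of 30 hops are all assumptions.

```python
from typing import List, Optional


def classify(baseline: List[str], probe: List[Optional[str]],
             client: str, max_hops: int = 30) -> str:
    """Label one traceroute relative to a baseline path to the same client."""
    replies = [hop for hop in probe if hop is not None]
    # Assumption: a router address that appears twice indicates a loop.
    if len(set(replies)) < len(replies):
        # Persistent if the probe was still looping at the hop limit.
        return "persistent loop" if len(probe) >= max_hops else "temporary loop"
    # An outage stops at an intermediate hop, short of the client.
    if not replies or replies[-1] != client:
        return "outage"
    # A change still reaches the client, but over a different path.
    return "path change" if replies != baseline else "no anomaly"
```

For example, a probe that reaches the client through a router not on the baseline path is labelled a path change, while one whose last reply comes from a mid-path router is labelled an outage.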

28 Non-anomalies
  • Ultrashort anomalies
  • Path-based TTL
  • Aggressive timeout

29 Identifying Forward Outages
Forward outages
  • Route change
  • ICMP dest unreachable
  • Forward timeout

30 Loop Effect on RTT
How do loops affect RTTs?
  • Loops can incur high latency inflation

31 Loop Effect on Loss Rate
How do loops affect loss rates?
  • 65% of temporary and 55% of persistent loops are preceded by loss rates exceeding 30%

32 Forward Anomaly Effect on RTT
How do forward anomalies affect RTTs?
  • Outages and changes can incur latency inflation
  • Outages have a more negative effect on RTTs

33 Forward Anomaly Effect on Loss Rate
How do forward anomalies affect loss rates?
  • 45% of outages and 40% of changes are preceded by loss rates exceeding 30%

34 Reducing Measurement Overhead
Can we reduce the number of probes?
  • 15 probes can achieve the same accuracy in 80% of cases
  • Flow-based TTL

35 Traffic Breakdown By Tiers