Isolating Wide-Area Network Faults with Baywatch. Colin Scott, with Professor Ethan Katz-Bassett, Dave Choffnes, Italo Cunha, Arvind Krishnamurthy, and Tom Anderson.

Presentation transcript:

1 Isolating Wide-Area Network Faults with Baywatch Colin Scott With Professor Ethan Katz-Bassett, Dave Choffnes, Italo Cunha, Arvind Krishnamurthy, and Tom Anderson

2 A Quick Survey
Raise your hand if you used the Internet…
– …since you got to this room?
– …in the last hour?
– …today?

3 We Need the Internet to Be Reliable
We increasingly depend on the Internet:
– Yesterday: email, web browsing, e-commerce
– Today: Skype, Google Docs, Netflix
– Tomorrow: thin clients + cloud, traffic control, outpatient medical monitoring, …
So we expect it to operate reliably:
– High availability
– Good performance
Does it achieve these goals?

4 Outages happen
They’re expensive, embarrassing, and annoying.
They take a long time to fix:
– Alert
– Troubleshoot
– Repair
There is a lack of good tools for wide-area isolation.

5 Many outages, and most are partial
[Figure; axis label: Number of VPs]
Approximately 90% are partial.

6 And they can be surprisingly long-lasting
Approximately 10% last 10 minutes or longer.

7 But where are the outages?
You can’t fix a problem if you don’t know where it is.
State of the art: traceroute
– Only tells part of the story
– Even with control of source and destination
– Especially without control of destination

8 Example confusion (12/16/10)
“It seems traffic attempting to pass through Level3's network in the Washington, DC area is getting lost in the abyss. Here's a trace from VZ residential FIOS to www.level3.com” – Outages.org list
User 1:
  1 Wireless_Broadband_Router.home [ ]
  2 L100.BLTMMD-VFTTP-40.verizon-gni.net [ ]
  3 G BLTMMD-LCR-04.verizon-gni.net [ ]
  4 so PHIL-BB-RTR2.verizon-gni.net [ ]
  5 so RES-BB-RTR2.verizon-gni.net [ ]
  6 0.ae2.BR2.IAD8.ALTER.NET [ ]
  7 ae7.edge1.washingtondc4.level3.net [ ]
  8 vlan80.csw3.Washington1.Level3.net [ ]
  9 ae ebr2.Washington1.Level3.net [ ]
  10 * * * Request timed out.
User 1: Broken link is in DC

9 Example confusion (12/16/10)
“It seems traffic attempting to pass through Level3's network in the Washington, DC area is getting lost in the abyss. Here's a trace from VZ residential FIOS to www.level3.com” – Outages.org list
User 1: Broken link is in DC
User 2: It’s in Denver?
  1 ( )
  2 l100.washdc-vfttp-47.verizon-gni.net ( )
  3 g washdc-lcr-07.verizon-gni.net ( )
  4 so lcc1-res-bb-rtr1-re1.verizon-gni.net ( )
  5 0.ae1.br1.iad8.alter.net ( )
  6 ae6.edge1.washingtondc4.level3.net ( )
  7 vlan90.csw4.washington1.level3.net ( )
  8 ae ebr1.washington1.level3.net ( )
  9 ae-8-8.ebr1.washington12.level3.net ( )
  10 ae ebr2.washington12.level3.net ( )
  11 ae-6-6.ebr2.chicago2.level3.net ( )
  12 ae ebr1.chicago2.level3.net ( )
  13 ae-3-3.ebr2.denver1.level3.net ( )
  14 ge-9-1.hsa1.denver1.level3.net ( )
  15 ( )
  16 ( )
  17 * * *
Is this even the same problem?
What if it’s on the reverse path? (And paths aren’t symmetric.)

10 System for wide-area failure isolation
Goal: detect and isolate outages online
What kind of outages?
– Long-lasting: not fixing itself (needs some help)
– Avoidable: requires path diversity, no stub ASes
– High impact: outages in PoPs affecting many paths
What kind of isolation?
– IP-link
How quickly?
– Within seconds or a small number of minutes

11 What we want out of isolation
– Direction (forward or reverse)
– Narrowly determined location (link or ASN)
– Alternate working paths (facilitates remediation)
– Online (allows for immediate action)
So, how do we accomplish this?

12 Detecting outages with pings
[Diagram labels: Source, Target, Source; “Ping?”]

13 Detecting outages with pings
[Diagram labels: Source, Target, Source]
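
The idea of detecting outages with pings from several vantage points (VPs), and of distinguishing partial from complete outages, can be sketched roughly as follows. This is a minimal illustration, not the Baywatch implementation: the function name `classify_outage`, the VP names, and the three labels are all assumptions made for the example.

```python
from typing import Dict


def classify_outage(ping_ok: Dict[str, bool]) -> str:
    """Classify a target from per-VP ping results (hypothetical helper).

    ping_ok maps a vantage point name to True if its ping to the target
    was answered and False otherwise.
    """
    if not ping_ok:
        raise ValueError("need at least one vantage point")
    reachable = sum(ping_ok.values())
    if reachable == len(ping_ok):
        return "no outage"
    if reachable == 0:
        return "complete outage"
    # Some VPs reach the target and some do not.
    return "partial outage"


# Hypothetical measurement round: one of three VPs has lost the target.
results = {"vp-a": True, "vp-b": False, "vp-c": True}
print(classify_outage(results))
```

A partial outage is the interesting case here, because the VPs that still reach the target can help with measurements during the outage.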

14 traceroute doesn’t work
[Diagram: Source sends a probe with TTL=1 toward Target]

15 traceroute doesn’t work
[Diagram: S, R1, Target; reply “R1: Time Exceeded”]

16 traceroute doesn’t work
[Diagram: S, R1, Target; probe with TTL=2]

17 traceroute doesn’t work
[Diagram: S, R1, R2, Target; reply “R2: Time Exceeded”]

18 traceroute doesn’t work
[Diagram: S, R1, R2, Target; probe with TTL=3; reply “?”]
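
The TTL walk on these slides can be simulated without touching a network. The sketch below assumes a simple model in which a probe with TTL=k expires at the k-th hop, and the source sees that hop only if the hop's reply can travel back to it. The path names beyond R1 and R2, the `reverse_ok` map, and the `traceroute` function are hypothetical and only illustrate why the output goes dark past a reverse-path failure.

```python
from typing import Dict, List


def traceroute(path: List[str], reverse_ok: Dict[str, bool], max_ttl: int = 30) -> List[str]:
    """Simulate traceroute output from a source S (illustrative model only).

    path lists the forward hops, ending with the target.
    reverse_ok[hop] is True if replies from that hop can get back to S.
    """
    lines = []
    for ttl in range(1, max_ttl + 1):
        # The probe expires at hop number `ttl`, or is delivered to the target.
        hop = path[min(ttl, len(path)) - 1]
        if reverse_ok[hop]:
            lines.append(f"{ttl} {hop}")
            if hop == path[-1]:
                break  # the target answered, so the trace is complete
        else:
            # The reply was sent but never reached S.
            lines.append(f"{ttl} * * *")
    return lines


path = ["R1", "R2", "R3", "R4", "Target"]
# Assumed failure: replies from R3 onward cannot get back to S.
reverse_ok = {"R1": True, "R2": True, "R3": False, "R4": False, "Target": False}
for line in traceroute(path, reverse_ok, max_ttl=5):
    print(line)
```

In this model the forward path is intact all the way to the target, yet the trace looks as if the path ends after R2, which is the ambiguity the slides point at.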

19 Spoofed traceroute ftw
[Diagram: S, S’, R1, R2, Target; reply “R2: Time Exceeded”]

20 Spoofed traceroute ftw
[Diagram: S, S’, R1, R2, R3, Target; reply “R3: Time Exceeded”]

21 Spoofed traceroute ftw
[Diagram: S, S’, R1, R2, R3, R4, Target; replies “R4: Time Exceeded” and “Target: Pong”]
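
A spoofed traceroute can be sketched in the same simulated style: S sends the TTL-limited probes, but writes the address of another vantage point S' into the source field, so each reply is addressed to S'. The `reaches` map and the function name below are assumptions for illustration; no packets are sent.

```python
from typing import Dict, List, Set


def spoofed_traceroute(path: List[str], reaches: Dict[str, Set[str]], spoof_as: str) -> List[str]:
    """Return the hops observed by `spoof_as` (illustrative model only).

    S sends one probe per TTL with `spoof_as` as the claimed source address.
    reaches[hop] is the set of hosts that replies from `hop` can get to.
    """
    seen = []
    for ttl in range(1, len(path) + 1):
        hop = path[ttl - 1]  # the probe expires at, or is delivered to, this hop
        seen.append(hop if spoof_as in reaches[hop] else "*")
    return seen


path = ["R1", "R2", "R3", "R4", "Target"]
# Assumed failure: R3, R4 and Target can reply to S' but not to S.
reaches = {
    "R1": {"S", "S'"},
    "R2": {"S", "S'"},
    "R3": {"S'"},
    "R4": {"S'"},
    "Target": {"S'"},
}
print(spoofed_traceroute(path, reaches, spoof_as="S"))   # ordinary traceroute
print(spoofed_traceroute(path, reaches, spoof_as="S'"))  # replies collected at S'
```

Because the replies avoid the broken reverse path to S, the full forward path becomes visible at S', which suggests the forward direction from S is working.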

22 What now?
[Diagram: S, S’, R1, R2, R3, R4, Target]

23 Measure working reverse paths
OK, somewhere on R3’s reverse path. But where?
[Diagram: S, S’, R1, R2, R3, R4, Target]

24 Historical path atlas
Each host traceroutes each target.
[Diagram: VPs, Targets]

25 Historical path atlas
Each host measures reverse paths.
[Diagram: VPs, Targets]

26 Ping historical hops
[Diagram: S, S’, R1, R2, R3, R4, Target]
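
One plausible reading of "ping historical hops" is: take a reverse path recorded in the atlas before the outage, ping each hop on it from S, and look for the boundary between hops that still answer and hops that do not. The sketch below assumes exactly that reading. The hop names X1 and X2, the `responds` map, and the function `reachability_horizon` are hypothetical, and the slides do not confirm that Baywatch picks the boundary this way.

```python
from typing import Dict, List, Optional, Tuple


def reachability_horizon(
    reverse_path: List[str], responds: Dict[str, bool]
) -> Optional[Tuple[str, str]]:
    """Find the suspect link on a historical reverse path (assumed heuristic).

    reverse_path is ordered from the far hop toward the source S.
    responds[hop] is True if the hop currently answers pings from S.
    Returns (last responsive hop, first unresponsive hop) walking outward
    from S, or None if no such boundary exists.
    """
    outward = list(reversed(reverse_path))  # hops nearest S first
    for near, far in zip(outward, outward[1:]):
        if responds[near] and not responds[far]:
            return near, far
    return None


# Hypothetical atlas entry: R3 -> X2 -> X1 -> R1 -> S before the outage.
historical = ["R3", "X2", "X1", "R1"]
responds = {"R1": True, "X1": True, "X2": False, "R3": False}
print(reachability_horizon(historical, responds))
```

Under these assumptions the link between X1 and X2 is the first place to look, since everything nearer to S still answers.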

27 Putting it all together
Find spoofing VPs that reach the target.
Determine working direction (if any):
– Forward: issue spoofed forward traceroute
– Reverse: VPs spoof towards target as source, issue spoofed reverse traceroute
Failure cases:
– Forward-only: spoof traceroute
– Reverse-only: reverse traceroute from each fwd hop, ping historical hops
– Bi-directional: spoof traceroute
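
The case analysis on this slide can be written down as a small dispatch table. The sketch assumes the two directions have already been tested with spoofed pings (a probe from S spoofed as a VP tests the forward path; a probe from a VP spoofed as S tests the reverse path). The enum, the `classify` function, and the `PLAN` table are illustrative names, and the measurement strings are copied from the slide rather than calls into any real tool.

```python
from enum import Enum
from typing import Dict, List, Optional


class Failure(Enum):
    FORWARD_ONLY = "forward-only"
    REVERSE_ONLY = "reverse-only"
    BIDIRECTIONAL = "bi-directional"


def classify(forward_works: bool, reverse_works: bool) -> Optional[Failure]:
    """Map the two spoofed-ping outcomes to a failure case.

    Returns None when both directions work, i.e. there is nothing to isolate.
    """
    if forward_works and reverse_works:
        return None
    if forward_works:
        return Failure.REVERSE_ONLY  # forward is fine, so the reverse path is broken
    if reverse_works:
        return Failure.FORWARD_ONLY  # reverse is fine, so the forward path is broken
    return Failure.BIDIRECTIONAL


# Measurements to run for each case, as listed on the slide.
PLAN: Dict[Failure, List[str]] = {
    Failure.FORWARD_ONLY: ["spoofed traceroute"],
    Failure.REVERSE_ONLY: ["reverse traceroute from each forward hop", "ping historical hops"],
    Failure.BIDIRECTIONAL: ["spoofed traceroute"],
}

case = classify(forward_works=True, reverse_works=False)
print(case.value, "->", PLAN[case])
```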

28 Results
Baywatch has been running for 4 months.
12 geographically distributed VPs monitoring:
– CloudFront PoPs (16): correlate with app-layer outages
– Popular PoPs wrt # intersecting paths (83), and targets on the “other” side of PoPs (185)
– PlanetLab hosts (76): ground-truth isolation

29 Results
Location (~2500 total)
– PL/Mlab: 1241
– Top 100: 1220
– CloudFront: 38
Duration: average is 453 seconds
Directionality
– Forward: 860
– Reverse: 130
– Bi-directional: 439
– The rest were indeterminate (different path, fixed by time of isolation, …)
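
The counts on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes that "the rest" means the location total minus the three directional counts; the slide does not state that explicitly, so the indeterminate figure is an inference.

```python
# Counts copied from the slide.
locations = {"PL/Mlab": 1241, "Top 100": 1220, "CloudFront": 38}
directions = {"forward": 860, "reverse": 130, "bi-directional": 439}

total = sum(locations.values())         # 2499, matching the "~2500 total"
determinate = sum(directions.values())  # 1429 outages with a known direction
indeterminate = total - determinate     # 1070, assuming "the rest" is relative to the total


def share(count: int, whole: int) -> float:
    """Percentage of `whole` represented by `count`, to one decimal place."""
    return round(100 * count / whole, 1)


for name, count in directions.items():
    print(f"{name}: {share(count, determinate)}% of outages with a known direction")
print(f"indeterminate: {share(indeterminate, total)}% of all outages")
```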

30 Evaluation
Coverage
– How much of the network can we monitor?
– How precise is isolation?
Effectiveness
– When affecting CDN, try application layer
– Corroborate with NANOG
– Post to outages.org

31 Summary
System for wide-area failure isolation
– Detection at fine granularity
– Algorithm for isolation
  Historical, rapidly refreshed path atlas
  Spoofed probing to measure during outage
  Pings to infer reachability

32 Questions?

34 Reverse traceroutes
Reverse path info generally requires:
– IP options support along the path
– Limited spoofing
– A lot of trial and error

35 Simple (real) example: plgmu4.ite.gmu.edu to pl2.bit.uoit.ca
Normal traceroute:
  * * *
  9. * * *
  10. * * *
  11. * * *
  12. * * *
Spoofed traceroute:
  pl2.bit.uoit.ca ( )