Zooming in on Wide-area Latencies to a Global Cloud Provider


Zooming in on Wide-area Latencies to a Global Cloud Provider Yuchen Jin, Sundararajan Renganathan, Ganesh Ananthanarayanan, Junchen Jiang, Venkat Padmanabhan, Manuel Schroder, Matt Calder, Arvind Krishnamurthy

TL;DR When clients experience high Internet latency to cloud services, where does the fault lie? For cloud services, high latency leads to lower user engagement.
1. BlameIt: a tool for Internet fault localization that eases the lives of network engineers with automation & hints.
2. BlameIt uses a hybrid approach (passive + active): use passive end-to-end measurements as much as possible, and issue selected active probes for high-priority incidents.
3. Production deployment of passive BlameIt at Azure: it correctly localizes the fault in all 88 incidents with manual reports, at 72X lower probing overhead.

Public Internet communication is the weak link. Causes of degradation include congestion inside or between ASes, path updates inside an AS, AS-level path changes, and maintenance issues in the client's ISP. Intra-DC and inter-DC communications have seen rapid improvement, but the cloud provider has little visibility into or control over the Internet segment.
[Figure: cloud network connecting DC1 and DC2 to edge locations edge1-edge4, with the Internet segment between edges and clients highlighted as the weak link]

When Internet performance is bad (RTT inflates), where in the path is to blame? The answer determines the remediation: if the problem is at the cloud's end (e.g., Azure, AWS), investigate the server issue or use an alternate egress; if a middle AS is at fault, re-route around the faulty AS or contact that AS's network operations center (NOC); if the client's ISP (e.g., Comcast) is at fault, contact the ISP if the issue is widespread.

When Internet perf is bad (RTT inflates), where in the path is to blame? Prior approaches:
- Passive analysis of end-to-end latency (network tomography for connected graphs): under-constrained due to insufficient coverage of paths. [Network tomography (JASA'96, Statistical Science'04), Boolean tomography (IMC'10), Ghita et al. (CoNEXT'11), VIA (Sigcomm'16), 007 (NSDI'18)]
- Active probing for hop-by-hop latencies (frequent probes from vantage points worldwide): prohibitively expensive at scale. [iPlane (OSDI'06), WhyHigh (IMC'09), Trinocular (Sigcomm'13), Sibyl (NSDI'16), Odin (NSDI'18)]

BlameIt: a hybrid approach. Coarse-grained blame assignment (cloud segment, middle segment, or client segment) uses passive measurements; fine-grained active traceroutes are issued only for (high-priority) middle-segment blames.

Outline Coarse-grained fault localization with passive measurements Fine-grained localization with active probes Evaluation

Architecture: Azure serves hundreds of millions of clients from hundreds of edge locations. The RTT of each TCP handshake (SYN, SYN/ACK, ACK) is logged along with the client IP, device type, and timestamp, and fed into a data analytics cluster.

Quartet: {client IP /24, cloud location, mobile (or) non-mobile device, 5-minute time bucket}. Aggregating at this granularity gives better spatial and temporal fidelity, and over 90% of quartets have at least 10 RTT samples. A quartet is "bad" if its average RTT is over the badness threshold, where badness thresholds are RTT targets that vary across regions and device connection types.
Example: the samples {10.0.6.2, NYC Cloud, mobile, 02:00:33}: 32ms, {10.0.6.7, NYC Cloud, mobile, 02:02:25}: 34ms, and {10.0.6.132, NYC Cloud, mobile, 02:04:49}: 36ms aggregate into the quartet {10.0.6.0/24, NYC Cloud, mobile, time window=1} with an average RTT of 34ms.
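The aggregation above can be sketched as follows. This is a minimal illustration, not Azure's pipeline: the sample field names and the threshold passed in are assumptions, since real badness thresholds vary by region and connection type.

```python
from collections import defaultdict

BUCKET_SECONDS = 300  # 5-minute time buckets, as in the quartet definition

def quartet_key(sample):
    """Map one RTT sample to its quartet: (/24 prefix, cloud location, device class, time bucket)."""
    prefix = sample["client_ip"].rsplit(".", 1)[0] + ".0/24"
    return (prefix, sample["cloud"], sample["mobile"],
            sample["timestamp"] // BUCKET_SECONDS)

def bad_quartets(samples, threshold_ms):
    """Group samples into quartets; keep those whose average RTT exceeds the badness threshold."""
    groups = defaultdict(list)
    for s in samples:
        groups[quartet_key(s)].append(s["rtt_ms"])
    averages = {k: sum(v) / len(v) for k, v in groups.items()}
    return {k: avg for k, avg in averages.items() if avg > threshold_ms}
```

With the three example samples from the slide (32ms, 34ms, 36ms in one 5-minute bucket), the quartet's average is 34ms, so it is flagged as bad for any threshold below 34ms.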

BlameIt for localizing Internet faults:
1. Identify "bad" quartets {IP /24, cloud location, mobile (or) non-mobile device, time bucket}.
2. For each bad quartet, start from the cloud and keep passing the blame downstream if there is no consensus (τ = 80%):
- If more than τ of the quartets to the cloud location have RTTs above the cloud's expected RTT, blame the cloud.
- Else, if more than τ of the quartets sharing the middle segment (BGP path) have RTTs above the middle's expected RTT, blame the middle segment.
- Else, blame the client if it sees a good RTT to another cloud location (e.g., bad to NYC but good to Chicago); otherwise the verdict is "ambiguous".
- If there are not sufficient RTT samples, the verdict is "insufficient".
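The hierarchical elimination can be sketched as a single decision function. This is an illustrative reconstruction under assumptions: the grouped RTT inputs, the minimum sample count, and the `good_rtt_elsewhere` flag are simplifications of how BlameIt actually organizes its data.

```python
TAU = 0.8            # consensus threshold from the slides
MIN_SAMPLES = 10     # assumed minimum number of RTT samples

def frac_bad(rtts, expected_ms):
    """Fraction of RTT samples above the segment's expected RTT."""
    return sum(r > expected_ms for r in rtts) / len(rtts) if rtts else 0.0

def assign_blame(quartet_rtts, cloud_rtts, middle_rtts,
                 cloud_expected, middle_expected, good_rtt_elsewhere):
    """Walk from the cloud downstream, stopping at the first segment with consensus."""
    if len(quartet_rtts) < MIN_SAMPLES:
        return "insufficient"
    if frac_bad(cloud_rtts, cloud_expected) > TAU:
        return "cloud"        # most quartets to this cloud location are bad
    if frac_bad(middle_rtts, middle_expected) > TAU:
        return "middle"       # most quartets sharing this BGP path are bad
    # Only this client looks bad; confirm using its RTT to another cloud location.
    return "client" if good_rtt_elsewhere else "ambiguous"
```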

Key empirical observations:
- Usually only one AS is at fault. E.g., either the client or a middle AS is at fault, but not both simultaneously.
- A smaller "failure set" is more likely than a larger one. E.g., if all clients connecting to a cloud location see bad RTTs, it is the cloud's fault (rather than all the clients being bad simultaneously).
These observations justify hierarchical elimination of the culprit: start with the cloud and stop when we are sufficiently confident to blame a segment.

Learning cloud/middle expected RTTs: each cloud location's and each middle segment's expected RTT is learnt as the median RTT over the previous 14 days. For example, if a cloud location's expected RTT is 40ms, the cloud is blamed when more than τ (= 80%) of quartets to it have RTTs above 40ms.
[Figure: RTT distribution P(RTT) over 30-55ms, with the median at 40ms]
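The expected-RTT computation is simple enough to sketch directly; the per-day sample layout below is an assumption, while the 14-day median window is from the slides.

```python
from statistics import median

def expected_rtt(daily_rtts_ms, today, window_days=14):
    """Expected RTT of one segment = median over the previous `window_days` of samples.

    daily_rtts_ms: dict mapping day index -> list of RTT samples for the segment.
    """
    window = [r for d in range(today - window_days, today)
              for r in daily_rtts_ms.get(d, [])]
    return median(window) if window else None
```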

Outline Coarse-grained fault localization with passive measurements Fine-grained localization with active probes Evaluation

BlameIt: a hybrid approach (recap). Coarse-grained blame assignment (cloud segment, middle segment, or client segment) uses passive measurements; fine-grained active traceroutes are issued only for (high-priority) middle-segment blames.

Approach for localizing middle-segment issues: compare a background traceroute (the picture prior to the fault) against an on-demand traceroute triggered by the passive phase of BlameIt, and blame the AS with the greatest increase in latency contribution. For example, on the path AS 8075 → AS m1 → AS m2 → client AS, if the background traceroute shows cumulative RTTs of 4ms, 6ms, 8ms, 9ms, the contribution of AS m1 is 6 - 4 = 2ms; if the on-demand traceroute shows 4ms, 60ms, 62ms, 64ms, AS m1's contribution jumps to 60 - 4 = 56ms, so AS m1 is blamed.
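The contribution comparison can be sketched as below. The traceroute representation (a list of `(as_name, cumulative_rtt_ms)` pairs in path order) is an assumed simplification; the numbers in the usage example are from the slide.

```python
def as_contributions(trace):
    """Contribution of each AS = its cumulative RTT minus the previous hop's."""
    contrib, prev = {}, 0.0
    for asn, cum_rtt in trace:
        contrib[asn] = cum_rtt - prev
        prev = cum_rtt
    return contrib

def blame_middle_as(background, on_demand):
    """Blame the AS whose contribution grew the most between the two traceroutes."""
    before = as_contributions(background)
    after = as_contributions(on_demand)
    return max(after, key=lambda a: after[a] - before.get(a, 0.0))
```

With the slide's example (AS m1's contribution rising from 2ms to 56ms), `blame_middle_as` returns AS m1.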

Key observations for optimizing probing volume:
- Internet paths are relatively stable, so background traceroutes need to be updated only when the BGP path changes.
- Not all middle-segment issues are worth investigating: most are fleeting (over 60% of issues last at most 5 minutes), so traceroutes are prioritized for long-standing incidents.

Optimizing background traceroutes: they are issued periodically to each BGP path seen from each cloud location, kept to two per day, which hits a "sweet spot" of high localization accuracy and low probing overhead. They are also triggered by BGP churn, i.e., whenever the AS-level path to a client prefix changes at the border routers.

Optimizing on-demand traceroutes: the damage to user experience is approximated by the "client-time product", the number of affected users (distinct IP addresses) times the duration of the RTT degradation. Issues are concentrated: ranked by client-time product, the top 20% of middle segments cover 80% of the damage of all incidents. BlameIt therefore uses the estimated client-time product to prioritize middle-segment issues.
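The prioritization metric is straightforward to sketch; the incident record fields below are assumptions for illustration.

```python
def client_time_product(incident):
    """Approximate damage: distinct affected client IPs x degradation duration."""
    return len(set(incident["client_ips"])) * incident["duration_min"]

def prioritize(incidents, probe_budget):
    """Return the top middle-segment incidents to probe, ranked by client-time product."""
    return sorted(incidents, key=client_time_product, reverse=True)[:probe_budget]
```

Under the 80/20 concentration observed in the slides, spending the probe budget on the top-ranked incidents covers most of the damage.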

Outline Coarse-grained fault localization with passive measurements Fine-grained localization with active probes Evaluation

Evaluation highlights: we compare BlameIt's results against 88 production incidents labeled by manual investigations done by Azure. BlameIt correctly pinned the blame in all 88 incidents.

Blame assignments in production, worldwide over a month: the fractions are generally stable. Cloud-segment issues account for under 4% of bad quartets and are investigated on priority by Azure; the "ambiguous" and "insufficient" verdicts account for a large fraction.

Real-world incident: a peering fault. A high-priority latency issue affected many customers in the US; BlameIt caught it and correctly blamed the middle segment. The issue was due to changes in a peering AS with which Azure peers at multiple locations. Why didn't other monitoring systems catch it, and how did BlameIt avoid the same problems?
- Periodic traceroutes from Azure clients: the limited number of clients issuing traceroutes weren't affected. BlameIt considers all the clients hitting Azure, using passive measurements.
- Users downloading web objects to measure latency: deployed conservatively to limit overheads. BlameIt uses passive data to issue selective active probes.
- Monitoring RTTs to Azure in each {AS, metro}: no single {AS, metro} was badly affected even though many clients were affected countrywide. BlameIt goes more fine-grained by using BGP paths and client IP /24s.
BlameIt is thus able to notice widespread increases in latency without prohibitive overheads.

Finding the best background traceroute frequency. The central tradeoff is traceroute overhead vs. AS-level localization accuracy. Experiment setup: traceroutes from 22 Azure locations to 23,000 BGP prefixes for two weeks, with relative localization accuracy as the metric and the most fine-grained scheme as ground truth. The chosen scheme is 72X cheaper, and the probing scheme can be configured by the operators.

BlameIt summary: a hybrid (passive + active) approach that eases the work of network engineers with automation & hints for investigating latency degradation. Its deployment at Azure produces results with high accuracy at low overheads.