The Need for End-to-End Evaluation of Cloud Availability

Presentation transcript:

The Need for End-to-End Evaluation of Cloud Availability. Zi Hu (1,2), Liang Zhu (1,2), Calvin Ardi (1,2), Ethan Katz-Bassett (1), Harsha Madhyastha (3), John Heidemann (1,2), Minlan Yu (1). Affiliations: (1) USC, (2) ISI, (3) UCR. 11/17/2018

Cloud usage is big and growing. But is the cloud always available?

Need to understand cloud availability. Users care: many applications run in the cloud (Facebook, Gmail, Dropbox, etc.). Businesses care: data and apps in the cloud need to be always available. Providers care: a 5-minute outage costs Google half a million dollars in revenue [1], and there are only a few providers, but real competition.
[1] http://venturebeat.com/2013/08/16/3-minute-outage-costs-google-545000-in-revenue/

Prior work relies on ICMP probes:
Amogh Dhamdhere et al., CoNEXT 2007. NetDiagnoser: Troubleshooting Network Unreachabilities Using End-to-End Probes and Routing Data.
Zheng Zhang et al., SIGCOMM 2008. iSPY: Detecting IP Prefix Hijacking on My Own.
Ethan Katz-Bassett et al., NSDI 2008. Studying Black Holes in the Internet with Hubble.
Lin Quan et al., SIGCOMM 2013. Trinocular: Understanding Internet Reliability through Adaptive Probing.
…
But can we trust ICMP?

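Cloud needs end-to-end measurement. ICMP tests only part of the path between customers and the cloud, while HTTP tests the whole path. On top of that, ICMP is subject to filtering and rate limiting. The cloud is much more complicated than a single host: load balancers, RAID … All must work.
(Diagram: customers, gateway, front end, back end, storage. One point on the path is labeled "ICMP broken here"; the HTTP path runs through all components.)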

Our contributions: (1) show the importance of retries; (2) compare ways to measure cloud availability, network-level vs. app-level (ICMP vs. HTTP); (3) show that end-to-end measurements are necessary, because ICMP can both over- and under-estimate availability.

Methodology. Probing methods (2): HTTP fetches a 1KB file (end to end); ICMP pings the hostname (network-level). Probing runs in rounds of 10/11 minutes each, with an additional 2/8 retries upon a failure. Targets (2 types): VMs (8 Amazon EC2) and storage from 3 vendors (7 Amazon S3, 2 Google Storage, 7 Microsoft Azure). ISI & UW: only for validation.
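The two probing methods and the retry-on-failure rule can be sketched roughly as follows. This is a minimal illustration, not the authors' measurement code: the function names, the 5-second timeout, and the use of the system `ping` command are all assumptions, and the retry count is left as a parameter because the slide gives it only as "2/8".

```python
import http.client
import subprocess
import urllib.request
from typing import Callable, List


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Application-level try: fetch a small object and require HTTP 200.

    The timeout value is an assumption, not taken from the slides.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def icmp_probe(host: str, timeout: float = 5.0) -> bool:
    """Network-level try: send one echo request with the system `ping`.

    Assumes a `ping` binary that accepts `-c 1` (Linux, macOS, BSD).
    """
    try:
        completed = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return completed.returncode == 0


def probe_with_retries(probe: Callable[[], bool], retries: int) -> List[bool]:
    """Try once, then retry only while tries keep failing, up to `retries` times."""
    results = [probe()]
    while not results[-1] and len(results) <= retries:
        results.append(probe())
    return results


def is_outage(results: List[bool]) -> bool:
    """A round counts as an outage only if every try in it failed."""
    return not any(results)
```

A round against a hypothetical object URL would then look like `probe_with_retries(lambda: http_probe("http://example.com/1kb.bin"), retries=2)`. Passing the probe in as a callable keeps the retry logic identical for ICMP and HTTP, which is the fairness point the talk makes about comparing the two.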

Why retries? Packet loss could distort the result: random packet loss can be as high as 1%, while cloud outages are rare (<<1%). It is also unfair to compare HTTP with ICMP without retries: with HTTP, TCP has kernel retries; with ICMP, there are none.

The case for retries. Let k be the number of tries and p the packet loss rate. An outage is declared when all k tries fail. With k=1 try and a p=1% loss rate, there is a 1% probability of a false positive. Retries are needed to eliminate the noise of packet loss.
(Figure: cloud availability.)
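The arithmetic behind this slide can be made concrete with a short sketch. It assumes that losses on successive tries are independent, which the slides do not state; under that assumption the chance that random loss alone fails all k tries is p raised to the power k. The function name and the sample values of k are illustrative only.

```python
def false_outage_probability(loss_rate: float, tries: int) -> float:
    """Probability that random packet loss alone makes all `tries` fail.

    Assumes each try is lost independently with probability `loss_rate`
    (an assumption made for this sketch, not a claim from the slides).
    """
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if tries < 1:
        raise ValueError("tries must be at least 1")
    return loss_rate ** tries


if __name__ == "__main__":
    # Illustrative values of k; p = 1% matches the slide's example.
    for k in (1, 2, 3):
        print(f"k={k}: {false_outage_probability(0.01, k):.2e}")
```

With a single try, the false-positive rate equals the loss rate itself, which is larger than the outage rate being measured. Each extra try multiplies the false-positive rate by p again, so a few retries are enough to push it well below the rare outage rates of interest.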

Comparing network- and app-level probing. Method agreement (>97% of the time): no outage, or a provider outage (e.g., a power outage). Method disagreement (<3%, but still large compared to rare cloud outage rates): an outage inside the cloud, or ICMP filtering/rate limiting.
(Diagram: customers, gateway, front end, back end, storage. ICMP tests only part of the path; HTTP tests the whole path.)
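One way to tally agreement and disagreement between the two methods is sketched below. The category names and the representation of a round as an `(icmp_up, http_up)` pair are assumptions made for illustration; the slides do not describe the authors' data format.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, List, Tuple


class Verdict(Enum):
    AGREE_UP = "both methods see the target up"
    AGREE_DOWN = "both methods see an outage"
    ICMP_ONLY_OUTAGE = "ICMP down, HTTP up"
    HTTP_ONLY_OUTAGE = "ICMP up, HTTP down"


def compare_round(icmp_up: bool, http_up: bool) -> Verdict:
    """Classify one round from one vantage point."""
    if icmp_up and http_up:
        return Verdict.AGREE_UP
    if not icmp_up and not http_up:
        return Verdict.AGREE_DOWN
    return Verdict.ICMP_ONLY_OUTAGE if http_up else Verdict.HTTP_ONLY_OUTAGE


def tally(rounds: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count how many rounds fall into each category."""
    return Counter(compare_round(icmp_up, http_up) for icmp_up, http_up in rounds)


def agreement_rate(rounds: List[Tuple[bool, bool]]) -> float:
    """Fraction of rounds in which ICMP and HTTP report the same state."""
    if not rounds:
        raise ValueError("need at least one round")
    counts = tally(rounds)
    agreed = counts[Verdict.AGREE_UP] + counts[Verdict.AGREE_DOWN]
    return agreed / len(rounds)
```

Note that a round where both methods report an outage counts as agreement. The two disagreement categories are the cases where an ICMP-only measurement would respectively overestimate or underestimate outages.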

Showing the underlying data. Each column of data shows one round, with 24-hour boundaries marked. Each pair of rows shows the ICMP and HTTP observations from one vantage point (VP): the top row (blue) is ICMP and the lower row (red) is HTTP. Color represents the percentage of failed probes: a light color means the probe succeeded, medium colors mean some tries failed, and a dark color means all tries failed. A dark blue diamond marks an ICMP outage and a dark red square marks an HTTP outage. White means the control node failed.
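The color encoding described here amounts to a small mapping from a round's tries to a shade. The sketch below assumes a round is recorded as a list of per-try booleans, with `None` standing for a round where the control node failed; both the representation and the function name are hypothetical.

```python
from typing import List, Optional


def shade(results: Optional[List[bool]]) -> str:
    """Map one round's tries to the shade used in the figure legend.

    `results` holds True for each successful try and False for each failed
    one; None means the control node failed and no data was collected.
    """
    if results is None:
        return "white"
    if not results:
        raise ValueError("a recorded round needs at least one try")
    failed = results.count(False)
    if failed == 0:
        return "light"   # the probe succeeded
    if failed == len(results):
        return "dark"    # all tries failed: counted as an outage
    return "medium"      # some tries failed, but at least one succeeded
```

Only the dark shade corresponds to an outage, since an outage requires every try in the round to fail.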

Agreement between ICMP and HTTP (case study of agreement). A power outage at Amazon EC2 (Singapore) was confirmed by operators. With no outage or a provider outage, ICMP and HTTP report consistent results.

Disagreement between ICMP and HTTP (case 1): ICMP overestimates outages. Three VPs report an ICMP-only outage at Amazon EC2 (N. California): ICMP is down while HTTP is up. tcpdump shows that the ICMP probes reach the VM, so the filtering happens on the return path.

Disagreement between ICMP and HTTP (case 2): ICMP underestimates outages. All VPs report an HTTP-only outage at Amazon S3 (Tokyo): ICMP is up while HTTP is down. The outage was confirmed by operators.

Conclusion. We have shown that retries are needed to eliminate noise, and that ICMP is suspect: it can over- or under-estimate cloud availability. We should use end-to-end probes. Interested in our project? Visit https://ant.isi.edu/availability