Why does the Cloud stop computing?

Why does the Cloud stop computing? Lessons from hundreds of service outages
Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffrey Adityatama, and Kurnia J. Eliazar

COS @ SoCC '16 Oct '16

Bugs, Outages
  2 years ago @ SoCC ’14: study of bugs in datacenter distributed systems (Hadoop, HBase, etc.)

Public reports!
  Headline news and post-mortem reports
  Providers’ transparency
  Untapped information
  Pros/cons
    + Detailed root causes
    + Detailed chain of failures
    + Downtime durations
    + Zero false positive
    -- (Very) incomplete
    -- (High) variance

COS: Cloud Outage Study
  32 services
  597 outages between 2009-2015
  ~70% report downtimes
  ~60% report root causes

Downtime/year
  On average
    6% of services do not reach 99% availability (>88 hours)
    78% do not reach 99.9% (>8.8 hours)
  Worst year
    31% do not reach 99%
    81% do not reach 99.9%
  5-nine availability? It’s just a dream?
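
The hour thresholds on this slide follow from simple arithmetic, sketched below in Python. The function name and the 365-day year are assumptions made for illustration; the slide's ">88 hours" and ">8.8 hours" appear to be rounded versions of the values this computes.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_budget_hours(availability: float) -> float:
    """Yearly downtime (in hours) allowed by a given availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for label, target in [("99%", 0.99), ("99.9%", 0.999), ("99.999%", 0.99999)]:
        hours = downtime_budget_hours(target)
        print(f"{label}: {hours:.4f} hours/year ({hours * 60:.2f} minutes)")
```

A service that is down longer than the budget for a target in a given year misses that target; five-nines leaves only a few minutes per year.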

Root causes (sorted by count)

Interesting Root Causes: Upgrade
  Involves multiple layers
  “a code push behaved differently in widespread use than it had during testing”
  To understand/reproduce, need full ecosystem

Interesting Root Causes: Human mistakes
  Rare now (vs. 10 years ago)
  Config/Upgrade software bugs
  Bugs in automation process
  Similar issues? But root cause origins are different

Config vs. Upgrade Research
  Upgrade #1, need more research?
  Paper count in last few years
  Challenges:
    Multi-layer
    Full ecosystem needed
    Multi-year?
    Reproducible bugs from industry (benchmarks)?
  Paper counts (columns: Conference, Config papers, Upgrade): ASPLOS 1; ATC 6, 2; DSN 8; EuroSys 3; NSDI; OSDI 4; SOSP …; Total 27

Interesting Root Causes: Bugs
  What types of bugs lead to outages?
  Why are they not masked? (please see the paper)
  “Cascading” bugs

“DynamoDB Storage servers query the metadata service for their membership”
“But, on Sunday morning, the metadata service responses exceeded the retrieval time allowed by storage servers [busy timeout]”
“As a result, the storage servers were unable to obtain their membership data, and removed themselves from taking requests”
Diagram labels: Storage servers, Metadata service, Busy, Timeout, Remove self
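
The chain in these quotes (busy metadata service, timeout, servers removing themselves) can be sketched as a toy model in Python. This is not DynamoDB's code: the class, the linear latency model, and every number are hypothetical, chosen only to show how a shared dependency plus a fixed timeout can take all replicas out at once.

```python
from dataclasses import dataclass


@dataclass
class StorageServer:
    """Hypothetical storage server that needs fresh membership data to serve."""
    name: str
    timeout_s: float
    in_service: bool = True

    def refresh_membership(self, response_time_s: float) -> None:
        # A response slower than the allowed retrieval time is treated as a
        # failure, and the server removes itself from taking requests.
        if response_time_s > self.timeout_s:
            self.in_service = False


def metadata_response_time(base_s: float, per_request_s: float, requests: int) -> float:
    """Assumed model: response time grows linearly with concurrent requests."""
    return base_s + per_request_s * requests


def servers_left_in_service(num_servers: int, timeout_s: float,
                            base_s: float, per_request_s: float) -> int:
    servers = [StorageServer(f"s{i}", timeout_s) for i in range(num_servers)]
    latency = metadata_response_time(base_s, per_request_s, num_servers)
    for server in servers:
        server.refresh_membership(latency)
    return sum(server.in_service for server in servers)


if __name__ == "__main__":
    print(servers_left_in_service(10, timeout_s=1.0, base_s=0.1, per_request_s=0.05))
    print(servers_left_in_service(100, timeout_s=1.0, base_s=0.1, per_request_s=0.05))
```

In this model, adding more storage servers does not add redundancy against the failure: every server shares the same metadata dependency and the same timeout, so they all drop out together once the service is busy enough.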

“Each EBS storage server contacts data collection servers and reports information that is used for fleet maintenance”
“data collection servers … had a failure”
“this inability to contact a data collection server triggered a latent memory leak bug in the storage servers …”
“EBS servers continued trying in a way that slowly consumed system memory”
Diagram labels: EBS storage servers, Data collection servers, Failure, Memory leak
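
The quotes do not say how the leak was implemented, so the Python sketch below is only an illustration of the general pattern: a rarely exercised failure path that accumulates state on every retry. Both reporter classes and the `max_pending` parameter are hypothetical.

```python
from collections import deque


class LeakyReporter:
    """Hypothetical reporter that queues records without bound while the collector is down."""

    def __init__(self) -> None:
        self.pending: list = []

    def report(self, record: str, collector_up: bool) -> None:
        self.pending.append(record)
        if collector_up:
            self.pending.clear()  # flushed successfully
        # If the collector is down, nothing is ever dropped: memory grows.


class BoundedReporter:
    """Same idea, but the backlog is capped, so a long outage cannot exhaust memory."""

    def __init__(self, max_pending: int) -> None:
        self.pending: deque = deque(maxlen=max_pending)

    def report(self, record: str, collector_up: bool) -> None:
        self.pending.append(record)  # oldest record is discarded when full
        if collector_up:
            self.pending.clear()


if __name__ == "__main__":
    leaky, bounded = LeakyReporter(), BoundedReporter(max_pending=100)
    for i in range(10_000):  # collector unreachable for the whole run
        leaky.report(f"r{i}", collector_up=False)
        bounded.report(f"r{i}", collector_up=False)
    print(len(leaky.pending), len(bounded.pending))
```

The leaky path is invisible while the collector is healthy, which is why such a bug can stay latent until the dependency fails.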

(more in the paper)

Where is the SPOF?
  Redundancies, redundancies, redundancies!
  Yes, we did that
  So, why do outages still happen?

Failure recovery chain
  Failure Detection
  Failover
  Backups

Imperfect failure recovery chain
  Incomplete Failure Detection
  Failover that Fails
  Backups that also Fail

Imperfect failure recovery chain: incomplete error/failure detection
  Undetected (specific type of) memory leaks
  Load spikes of authentication requests
  “an unexpected hardware behavior”

Imperfect failure recovery chain: failover/recovery that fails
  Bad PLC fails to activate backup power generators
  Failed network switch failover
  DC failover fails due to cold cache problems
  Recovery/re-mirroring storm

Imperfect failure recovery chain: multiple failures!
  Double failures of power, network, storage or server components
  Diverse failures: network+server; storage+fibre cut
  Cascading bugs …
  … that caused many/all redundancies to fail
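
A rough way to see why redundancy alone does not prevent outages is to multiply through the recovery chain. The Python sketch below is a simplified model, not something taken from the study: it assumes independent steps, and all the probabilities are made-up illustration values.

```python
def chain_availability(primary: float, detection: float,
                       failover: float, backup: float) -> float:
    """Availability of a primary with one backup behind a recovery chain.

    The service is up if the primary is up, or if the primary is down AND the
    failure is detected AND failover succeeds AND the backup itself works.
    All four inputs are probabilities in [0, 1]; independence is assumed.
    """
    return primary + (1.0 - primary) * detection * failover * backup


if __name__ == "__main__":
    # Perfect detection and failover: the textbook redundancy result.
    ideal = chain_availability(primary=0.99, detection=1.0, failover=1.0, backup=0.99)
    # Imperfect chain: detection and failover each work 90% of the time.
    imperfect = chain_availability(primary=0.99, detection=0.9, failover=0.9, backup=0.99)
    print(f"ideal:     {ideal:.6f}")
    print(f"imperfect: {imperfect:.6f}")
```

With these illustrative numbers, the perfect chain reaches four nines while the imperfect one falls below 99.9%, even though the same backup exists in both cases. Correlated or cascading failures, which the model ignores, would make the gap larger.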

COS Database: email us / check our website
  More correlations between …
    Root cause & downtime
    Service maturity & downtime
    Root cause & impacts
    Root cause & fixes
    Etc.

Conclusion
  Features and failures are racing with each other
  “Biggest/worst cloud outages of 20YY”, a new year’s tradition
  Hope COS tells the cause
  Many more examples/details in the papers

Thank you! Questions?
  ucare.cs.uchicago.edu
  ceres.cs.uchicago.edu

EXTRA

Manually extract outage “metadata”
  Classifications:

A service outage implies an unplanned unavailability of partial or full features of the service that affects all or a significant number of users, in such a way that the outage is reported publicly. Data loss, staleness, and late deliveries that lead to loss of productivity are also considered an outage.

#Outages/year
  On average: 1/3 of the services have at least 3 unplanned outages per year
  Worst year (between ’09-’14): ½ of the services have at least 4 unplanned outages per year

Downtime by root cause (sorted by median downtime)

Maturity helps? Does service maturity help?
  Based on outage count: in 2014, 24 outages occurred from 9-yr old services

Maturity helps?
  Based on downtime: in 2014, 267 hours of downtime from 17-yr old services
  More mature → more popular → more users → more complex

Interesting Root Causes: Load
  Spikes of non-monitored requests
    User requests (monitored)
    Database index accesses
    Authentication requests (cryptographic consumption)
  Misconfiguration
    Ex: traffic redirection
    Take-away: be careful with traffic-related code/configs
  Recovery feedback loop

Interesting Root Causes: Cross (dependencies)
  Amazon Web Services: Airbnb, Bitbucket, Dropbox, Foursquare, Github, Heroku, Instagram, Minecraft, Netflix, Pinterest, Quora, Reddit, Vine
  Azure: Xbox Live and “52 other services”
  Google DC (co-location): Google Gmail, Search, Drive, Youtube (40% drop of internet traffic for 5 mins)

Studies of failures, enough?
  Not all report “d”owntimes
  Most study only a few services (data behind company walls)