1
Why does the Cloud stop computing?
Lessons from hundreds of service outages
Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffrey Adityatama, and Kurnia J. Eliazar
2
SoCC '16 Oct '16
3
Outages and bugs
2 years ago @ SoCC ’14: study of bugs in datacenter distributed systems (Hadoop, HBase, etc.)
4
Public reports!
Headline news and post-mortem reports
  Providers’ transparency
  Untapped information
Pros/cons
  + Detailed root causes
  + Detailed chain of failures
  + Downtime durations
  + Zero false positive
  -- (Very) incomplete
  -- (High) variance
5
COS: Cloud Outage Study
  32 services
  597 outages between …
  ~70% report downtimes
  ~60% report root causes
6
7
Downtime/year
On average
  6% of services do not reach 99% availability (>88 hours of downtime)
  78% do not reach 99.9% (>8.8 hours)
Worst year
  31% do not reach 99%
  81% do not reach 99.9%
5-nine availability? It’s just a dream?
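The hour thresholds on this slide follow from simple arithmetic on a 365-day year. A minimal sketch of that conversion is below; the function name and the choice to ignore leap years are assumptions made for illustration, not something taken from the talk.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years ignored (assumption)


def max_downtime_hours(availability_percent: float) -> float:
    """Yearly downtime budget, in hours, implied by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.999):
        hours = max_downtime_hours(target)
        print(f"{target}% availability allows {hours:.3f} hours/year "
              f"({hours * 60:.1f} minutes)")
```

A 99% target allows about 87.6 hours per year and a 99.9% target about 8.76 hours, which the slide rounds to 88 and 8.8. Five nines leaves only a few minutes per year.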
8
Root causes (sorted by count)
9
Interesting Root Causes
Upgrade
  Involves multiple layers
  “a code push behaved differently in widespread use than it had during testing”
  To understand/reproduce, need the full ecosystem
10
Interesting Root Causes
Human mistakes
  Rare now (vs. 10 years ago)
  Config/Upgrade software bugs
  Bugs in automation process
  Similar issues? But root cause origins are different
11
Config vs. Upgrade Research
Upgrade #1, need more research?
Paper count in last few years (Conference: Config papers, Upgrade)
  ASPLOS 1, ATC 6 2, DSN 8, EuroSys 3, NSDI, OSDI 4, SOSP, …, Total 27
Challenges:
  Multi-layer
  Full ecosystem needed
  Multi-year?
  Reproducible bugs from industry (benchmarks)?
12
Interesting Root Causes
Bugs
  What types of bugs lead to outages?
  Why are they not masked? (please see the paper)
  “Cascading” bugs
13
“DynamoDB Storage servers query the metadata service for their membership”
“But, on Sunday morning, the metadata service responses exceeded the retrieval time allowed by storage servers [busy timeout]”
“As a result, the storage servers were unable to obtain their membership data, and removed themselves from taking requests”
Diagram: Storage servers, Metadata service (busy), timeout, remove self
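The cascade quoted above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class, the linear latency model, and every number are hypothetical and do not describe how DynamoDB is actually implemented. It shows why many redundant storage servers do not help when they all depend on one slow metadata service and all react to a timeout by removing themselves.

```python
from dataclasses import dataclass


@dataclass
class StorageServer:
    name: str
    timeout_s: float        # hypothetical retrieval time limit
    in_service: bool = True

    def refresh_membership(self, metadata_latency_s: float) -> None:
        # A response slower than the allowed time counts as "no membership
        # data", and the server removes itself from taking requests.
        if metadata_latency_s > self.timeout_s:
            self.in_service = False


def metadata_latency(base_s: float, per_request_s: float, requests: int) -> float:
    """Toy model (assumption): latency grows linearly with concurrent requests."""
    return base_s + per_request_s * requests


def simulate(num_servers: int, timeout_s: float,
             base_s: float, per_request_s: float) -> int:
    """Return how many storage servers are still in service after one refresh."""
    servers = [StorageServer(f"s{i}", timeout_s) for i in range(num_servers)]
    # All servers query at once, so they all observe the same latency.
    latency = metadata_latency(base_s, per_request_s, num_servers)
    for server in servers:
        server.refresh_membership(latency)
    return sum(server.in_service for server in servers)


if __name__ == "__main__":
    print(simulate(100, timeout_s=1.0, base_s=0.2, per_request_s=0.005))
    print(simulate(100, timeout_s=1.0, base_s=0.2, per_request_s=0.01))
```

In the first call the modeled latency stays under the timeout and every server remains in service. In the second, a slightly slower metadata service pushes all servers over the timeout at once, so the whole fleet removes itself.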
14
“Each EBS storage server contacts data collection servers and reports information that is used for fleet maintenance”
“data collection servers … had a failure”
“this inability to contact a data collection server triggered a latent memory leak bug in the storage servers …”
“EBS servers continued trying in a way that slowly consumed system memory”
Diagram: EBS storage servers, Data collection servers (failure), memory leak
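The quoted report does not say how the leak worked internally, so the following is only a generic illustration of a latent leak on a rarely exercised failure path. All names are hypothetical. The unbounded variant keeps every undelivered report while the collector is down; the bounded variant caps the backlog.

```python
from typing import List, Optional


class ReportingClient:
    """Toy storage-server component that reports to a collection server."""

    def __init__(self, max_pending: Optional[int] = None) -> None:
        self.pending: List[str] = []    # reports waiting to be delivered
        self.max_pending = max_pending  # None means unbounded (leaky variant)

    def report(self, payload: str, collector_up: bool) -> None:
        self.pending.append(payload)
        if collector_up:
            # Normal path: everything is delivered, nothing accumulates.
            self.pending.clear()
        elif self.max_pending is not None:
            # Bounded variant: drop the oldest reports beyond the cap.
            excess = len(self.pending) - self.max_pending
            if excess > 0:
                del self.pending[:excess]


def backlog_after(client: ReportingClient, ticks: int, collector_up: bool) -> int:
    """Send one report per tick and return the resulting backlog size."""
    for tick in range(ticks):
        client.report(f"report-{tick}", collector_up)
    return len(client.pending)


if __name__ == "__main__":
    print(backlog_after(ReportingClient(), 10_000, collector_up=True))
    print(backlog_after(ReportingClient(), 10_000, collector_up=False))
    print(backlog_after(ReportingClient(max_pending=100), 10_000, collector_up=False))
```

While the collector is reachable the backlog stays empty, which is why a bug of this kind can remain latent: the growth appears only once the dependency fails.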
15
(more in the paper)
16
Where is the SPOF?
Redundancies, redundancies, redundancies!
Yes, we did that
So, why do outages still happen?
17
Failure recovery chain
  Failure Detection
  Failover
  Backups
18
Imperfect failure recovery chain
  Incomplete Failure Detection
  Failover that Fails
  Backups that also Fail
19
Imperfect failure recovery chain
Incomplete error/failure detection
  Undetected (specific type of) memory leaks
  Load spikes of authentication requests
  “an unexpected hardware behavior”
20
Imperfect failure recovery chain
Failover/recovery that fails
  Bad PLC fails to activate backup power generators
  Failed network switch failover
  DC failover fails due to cold cache problems
  Recovery/re-mirroring storm
21
Imperfect failure recovery chain
Multiple failures!
  Double failures of power, network, storage or server components
  Diverse failures: network+server; storage+fibre cut
  Cascading bugs …
  … that caused many/all redundancies to fail
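A back-of-the-envelope model can make the recovery-chain argument concrete. The sketch below is not from the paper: the independence assumption, the common-cause term, and all probabilities are hypothetical values chosen for illustration. It shows (a) that a failure is masked only if detection, failover, and backup all work, and (b) that even a small chance of a shared cause, such as a cascading bug, dominates the chance that all redundant copies fail together.

```python
def mask_probability(p_detect: float, p_failover: float, p_backup: float) -> float:
    """Chance a failure is masked, assuming the three stages are independent."""
    return p_detect * p_failover * p_backup


def all_replicas_fail(p_single: float, replicas: int, p_common_cause: float = 0.0) -> float:
    """Chance that every replica fails.

    With probability p_common_cause a shared cause takes out all replicas
    at once; otherwise replicas fail independently with p_single each.
    """
    independent = p_single ** replicas
    return p_common_cause + (1 - p_common_cause) * independent


if __name__ == "__main__":
    # Three stages that each work 99% of the time mask fewer than 99% of failures.
    print(f"masked: {mask_probability(0.99, 0.99, 0.99):.6f}")
    # Three replicas, 1% failure chance each, with and without a shared cause.
    print(f"independent only:  {all_replicas_fail(0.01, 3):.9f}")
    print(f"with common cause: {all_replicas_fail(0.01, 3, p_common_cause=0.001):.9f}")
```

Under these made-up numbers, a 0.1% common-cause probability makes a total loss roughly a thousand times more likely than the purely independent estimate, which is one way to read "cascading bugs that caused many/all redundancies to fail".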
22
COS Database: ? Email us / Check our website
More correlations between …
  Root cause & downtime
  Service maturity & downtime
  Root cause & impacts
  Root cause & fixes
  Etc.
23
Conclusion
Features and failures are racing with each other
“Biggest/worst cloud outages of 20YY” – a new year’s tradition
Hope COS tells the cause
Many more examples/details in the papers
24
Thank you! Questions? ucare.cs.uchicago.edu ceres.cs.uchicago.edu
25
EXTRA
26
Manually extract outage “metadata” Classifications:
27
A service outage implies an unplanned unavailability of partial or full features of the service that affects all or a significant number of users, in such a way that the outage is reported publicly. Data loss, staleness, and late deliveries that lead to loss of productivity are also considered an outage.
28
#Outages/year
On average
  1/3 of the services have at least 3 unplanned outages per year
Worst year (between ’09-’14)
  ½ of the services have at least 4 unplanned outages per year
29
Downtime by root cause (sorted by median downtime)
30
Maturity helps?
Does service maturity help?
Based on outage count:
  In 2014, 24 outages occurred in 9-year-old services
31
Maturity helps?
Based on downtime:
  In 2014, 267 hours of downtime came from 17-year-old services
  More mature → more popular → more users → more complex
32
Interesting Root Causes
Load
  Spikes of non-monitored requests
    User requests (monitored)
    Database index accesses
    Authentication requests (cryptographic consumption)
  Misconfiguration
    Ex: traffic redirection
    Take-away: be careful with traffic-related code/configs
  Recovery feedback loop
33
Interesting Root Causes
Cross (dependencies)
  Amazon Web Services: Airbnb, Bitbucket, Dropbox, Foursquare, Github, Heroku, Instagram, Minecraft, Netflix, Pinterest, Quora, Reddit, Vine
  Azure: Xbox Live and “52 other services”
  Google DC (co-location): Google Gmail, Search, Drive, Youtube (40% drop of internet traffic for 5 mins)
34
Studies of failures, enough?
35
Studies of failures, enough?
  Not all report “d”owntimes
  Most study only a few services (data behind company walls)