Why does the Cloud stop computing?

Lessons from hundreds of service outages
Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffrey Adityatama, and Kurnia J. Eliazar

SoCC '16 Oct '16

2 years ago @ SoCC ’14: Bugs
Study of bugs in datacenter distributed systems (Hadoop, HBase, etc.)
Now @ SoCC '16: Outages

Public reports!
Headline news and post-mortem reports
  Providers’ transparency
  Untapped information
Pros/cons
  + Detailed root causes
  + Detailed chain of failures
  + Downtime durations
  + Zero false positives
  -- (Very) incomplete
  -- (High) variance

COS: Cloud Outage Study
  32 services
  597 outages between …
  ~70% report downtimes
  ~60% report root causes

Downtime/year
On average
  6% of services do not reach 99% availability (>88 hours)
  78% do not reach 99.9% (>8.8 hours)
Worst year
  31% do not reach 99%
  81% do not reach 99.9%
5-nine availability? Is it just a dream?
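The hour thresholds on this slide follow from simple arithmetic on a year of 8,760 hours. A minimal sketch, assuming a non-leap year; the function names are illustrative and do not come from the talk:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_hours(availability_pct: float) -> float:
    """Maximum yearly downtime (in hours) that still meets the availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def availability_pct(downtime_hours: float) -> float:
    """Availability (in percent) implied by a given yearly downtime."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.999):
        hours = downtime_budget_hours(target)
        print(f"{target}% availability allows {hours:.4f} hours/year "
              f"({hours * 60:.1f} minutes)")
```

Two nines allow about 87.6 hours per year and three nines about 8.76 hours, which the slide rounds to 88 and 8.8. Five nines leave a little over five minutes per year.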

Root causes (sorted by count)

Interesting Root Causes
Upgrade
  Involves multiple layers
  “a code push behaved differently in widespread use than it had during testing”
  To understand/reproduce, need full ecosystem

Interesting Root Causes
Human mistakes
  Rare now (vs. 10 years ago)
Config/Upgrade software bugs
  Bugs in automation process
Similar issues? But root cause origins are different

Config vs. Upgrade Research
Upgrade #1, need more research?
Paper count in last few years
  Conference | Config papers | Upgrade
  ASPLOS 1
  ATC 6 2
  DSN 8
  EuroSys 3
  NSDI
  OSDI 4
  SOSP
  …
  Total 27
Challenges:
  Multi-layer
  Full ecosystem needed
  Multi-year?
  Reproducible bugs from industry (benchmarks)?

Interesting Root Causes
Bugs
  What types of bugs lead to outages?
  Why are they not masked? (please see the paper)
  “Cascading” bugs

“DynamoDB Storage servers query the metadata service for their membership”
“But, on Sunday morning, the metadata service responses exceeded the retrieval time allowed by storage servers [busy timeout]”
“As a result, the storage servers were unable to obtain their membership data, and removed themselves from taking requests”
Diagram labels: Storage servers, Metadata service, Busy, Timeout, Remove self
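The cascade in these quotes can be illustrated with a toy model. This is only a sketch under stated assumptions, not DynamoDB's actual logic: the class `StorageServer`, the field `timeout_s`, the linear load model in `metadata_response_time`, and all numbers are hypothetical. It shows why a shared dependency that slows down under load can take out every replica at once when each replica reacts to a timeout by removing itself.

```python
from dataclasses import dataclass


@dataclass
class StorageServer:
    name: str
    timeout_s: float          # hypothetical "retrieval time allowed"
    in_service: bool = True

    def refresh_membership(self, response_time_s: float) -> None:
        # Behaviour described in the quotes: a membership query that times
        # out makes the server remove itself from taking requests.
        if response_time_s > self.timeout_s:
            self.in_service = False


def metadata_response_time(base_s: float, per_server_s: float, servers: int) -> float:
    # Assumed model: the metadata service slows down linearly with the
    # number of servers querying it.
    return base_s + per_server_s * servers


def servers_left_in_service(servers: int, timeout_s: float,
                            base_s: float = 0.5, per_server_s: float = 0.01) -> int:
    fleet = [StorageServer(f"s{i}", timeout_s) for i in range(servers)]
    response_time = metadata_response_time(base_s, per_server_s, servers)
    for server in fleet:
        server.refresh_membership(response_time)
    return sum(server.in_service for server in fleet)
```

With these made-up numbers and a 1.0 s timeout, a fleet of 10 servers stays fully in service, while a fleet of 100 pushes the response time past the timeout and every server removes itself. Redundant storage servers do not help because they all share the same dependency and the same reaction to its slowness.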

“Each EBS storage server contacts data collection servers and reports information that is used for fleet maintenance”
“data collection servers … had a failure”
“this inability to contact a data collection server triggered a latent memory leak bug in the storage servers …”
“EBS servers continued trying in a way that slowly consumed system memory”
Diagram labels: EBS storage servers, Data collection servers, Failure, Memory leak
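The quotes do not say how the retries consumed memory, so the following is only one plausible shape of such a bug, with hypothetical names throughout (`Reporter`, `max_pending`, `collector_down`). It contrasts a client that queues unsent reports without limit against one that caps the queue and drops the oldest entries while the collector is unreachable.

```python
from collections import deque


class Reporter:
    """Queues reports and retries them until the collector accepts them."""

    def __init__(self, send, max_pending=None):
        self._send = send                         # callable returning True on success
        self.pending = deque(maxlen=max_pending)  # maxlen=None means unbounded

    def report(self, record):
        self.pending.append(record)  # a bounded deque silently drops the oldest entry
        self.flush()

    def flush(self):
        while self.pending:
            if not self._send(self.pending[0]):
                return               # collector unreachable, retry on a later call
            self.pending.popleft()


def collector_down(record):
    """Stand-in for a failed data collection server."""
    return False


leaky = Reporter(collector_down)                     # grows for as long as the outage lasts
bounded = Reporter(collector_down, max_pending=100)  # memory use is capped
for i in range(1000):
    leaky.report(i)
    bounded.report(i)
```

The trade-off in the bounded variant is deliberate data loss: old maintenance reports are discarded so that a failure in a non-critical dependency cannot exhaust memory on the storage servers.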

(more in the paper)

Where is the SPOF?
Redundancies, redundancies, redundancies!
Yes, we did that
So, why do outages still happen?

Failure recovery chain
Failure Detection → Failover → Backups

Imperfect failure recovery chain
Incomplete Failure Detection → Failover that Fails → Backups that also Fail

Imperfect failure recovery chain
Incomplete error/failure detection
  Undetected (specific type of) memory leaks
  Load spikes of authentication requests
  “an unexpected hardware behavior”

Imperfect failure recovery chain
Failover/recovery that fails
  Bad PLC fails to activate backup power generators
  Failed network switch failover
  DC failover fails due to cold cache problems
  Recovery/re-mirroring storm

Imperfect failure recovery chain
Multiple failures!
  Double failures of power, network, storage or server components
  Diverse failures: network+server; storage+fibre cut
  Cascading bugs …
  … that caused many/all redundancies to fail

COS Database: Email us / Check our website
More correlations between …
  Root cause & downtime
  Service maturity & downtime
  Root cause & impacts
  Root cause & fixes
  Etc.

Conclusion
Features and failures are racing with each other
“Biggest/worst cloud outages of 20YY”: a new year’s tradition
Hope COS tells the cause
Many more examples/details in the papers

Thank you! Questions?
ucare.cs.uchicago.edu
ceres.cs.uchicago.edu

Manually extract outage “metadata” Classifications:

A service outage implies an unplanned unavailability of partial or full features of the service that affects all or a significant number of users, in such a way that the outage is reported publicly. Data loss, staleness, and late deliveries that lead to loss of productivity are also considered outages.

#Outages/year
On average
  1/3 of the services have at least 3 unplanned outages per year
Worst year (between ’09-’14)
  ½ of the services have at least 4 unplanned outages per year

Downtime by root cause (sorted by median downtime)

Maturity helps?
Does service maturity help?
Based on outage count:
  In 2014, 24 outages came from 9-year-old services

Maturity helps?
Based on downtime:
  In 2014, 267 hours of downtime came from 17-year-old services
More mature → more popular → more users → more complex

Interesting Root Causes
Load
  Spikes of non-monitored requests
    User requests (monitored)
    Database index accesses
    Authentication requests (cryptographic consumption)
  Misconfiguration
    Ex: traffic redirection
    Take-away: be careful with traffic-related code/configs
  Recovery feedback loop

Interesting Root Causes
Cross (dependencies)
  Amazon Web Services: Airbnb, Bitbucket, Dropbox, Foursquare, Github, Heroku, Instagram, Minecraft, Netflix, Pinterest, Quora, Reddit, Vine
  Azure: Xbox Live and “52 other services”
  Google DC (co-location): Google Gmail, Search, Drive, Youtube (40% drop of internet traffic for 5 mins)

Studies of failures, enough?
  Not all report “d”owntimes
  Most study only a few services (data behind company walls)
