Why does the Cloud stop computing?

1 Why does the Cloud stop computing?
Lessons from hundreds of service outages Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffrey Adityatama, and Kurnia J. Eliazar

2 SoCC '16 Oct '16

3 Bugs, then outages
2 years ago @ SoCC ’14: study of bugs in datacenter distributed systems (Hadoop, HBase, etc.)

4 Public reports!
Headline news and post-mortem reports
Providers’ transparency
Untapped information
Pros/cons
+ Detailed root causes
+ Detailed chain of failures
+ Downtime durations
+ Zero false positives
-- (Very) incomplete
-- (High) variance

5 COS: Cloud Outage Study
32 services
597 outages between
~70% report downtimes
~60% report root causes

7 Downtime/year
On average
6% of services do not reach 99% availability (>88 hours)
78% do not reach 99.9% (>8.8 hours)
Worst year
31% do not reach 99%
81% do not reach 99.9%
5-nine availability? It’s just a dream?
(Chart: downtime per year, in hours)
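
The hour thresholds on this slide follow from simple arithmetic on a 365-day year. A minimal sketch of that calculation (the function name `downtime_budget_hours` is ours, not from the talk):

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_hours(availability_pct: float) -> float:
    """Maximum yearly downtime (in hours) allowed at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.999):
    hours = downtime_budget_hours(pct)
    print(f"{pct}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

99% allows about 87.6 hours per year and 99.9% about 8.76 hours, which the slide rounds to 88 and 8.8. Five nines leaves a budget of roughly 5.3 minutes per year.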

8 Root causes (sorted by count)

9 Interesting Root Causes
Upgrade
Involves multiple layers
“a code push behaved differently in widespread use than it had during testing”
To understand/reproduce, need full ecosystem

10 Interesting Root Causes
Human mistakes
Rare now (vs. 10 years ago)
Config/Upgrade software bugs
Bugs in automation process
Similar issues? But root cause origins are different

11 Config vs. Upgrade Research
Upgrade #1, need more research?
Paper count in last few years
Challenges:
Multi-layer
Full ecosystem needed
Multi-year?
Reproducible bugs from industry (benchmarks)?
Conference / Config papers / Upgrade
ASPLOS 1
ATC 6 2
DSN 8
EuroSys 3
NSDI
OSDI 4
SOSP
Total 27

12 Interesting Root Causes
Bugs
What types of bugs lead to outages?
Why are they not masked? (pls. see paper)
“Cascading” bugs

13 “DynamoDB Storage servers query the metadata service for their membership”
“But, on Sunday morning, the metadata service responses exceeded the retrieval time allowed by storage servers [busy timeout]”
“As a result, the storage servers were unable to obtain their membership data, and removed themselves from taking requests”
(Diagram labels: Storage servers, Metadata service, Busy, Timeout, Remove self)
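
The quoted chain (busy metadata service, timeout, self-removal) can be illustrated with a toy simulation. This is only a sketch: the class and function names, the linear latency model, and all numbers are assumptions made for illustration, not details from the post-mortem.

```python
from dataclasses import dataclass


@dataclass
class StorageServer:
    name: str
    timeout_ms: float          # hypothetical retrieval deadline
    in_service: bool = True

    def refresh_membership(self, metadata_latency_ms: float) -> None:
        # If the metadata service answers too slowly, the server cannot
        # confirm its membership and removes itself from serving requests.
        if metadata_latency_ms > self.timeout_ms:
            self.in_service = False


def metadata_latency_ms(base_ms: float, queued_requests: int) -> float:
    # Toy load model (an assumption): latency grows linearly with queue depth.
    return base_ms * (1 + queued_requests)


def simulate(num_servers: int, base_ms: float, timeout_ms: float, queued: int) -> int:
    """Return how many storage servers still serve after one refresh round."""
    servers = [StorageServer(f"s{i}", timeout_ms) for i in range(num_servers)]
    latency = metadata_latency_ms(base_ms, queued)
    for server in servers:
        server.refresh_membership(latency)
    return sum(server.in_service for server in servers)


print(simulate(num_servers=100, base_ms=10, timeout_ms=50, queued=2))  # latency 30 ms
print(simulate(num_servers=100, base_ms=10, timeout_ms=50, queued=9))  # latency 100 ms
```

Because every server shares the same dependency and the same deadline, they all drop out together once the latency crosses the timeout. Adding more storage servers does not help in this model.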

14 Data collection servers, memory leak
“Each EBS storage server contacts data collection servers and reports information that is used for fleet maintenance”
“data collection servers … had a failure”
“this inability to contact a data collection server triggered a latent memory leak bug in the storage servers …”
“EBS servers continued trying in a way that slowly consumed system memory”
(Diagram labels: EBS storage servers, Data collection servers, Failure, Memory leak)
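
The actual EBS bug is not described beyond these quotes. As a generic illustration of how retrying against an unreachable dependency can slowly consume memory, the sketch below contrasts an unbounded retry buffer with a bounded one. The `Reporter` class and its parameters are hypothetical.

```python
from collections import deque


class Reporter:
    """Buffers reports while the (hypothetical) collection server is unreachable."""

    def __init__(self, max_pending=None):
        # max_pending=None gives the leak-prone pattern: an unbounded buffer.
        self.pending = deque(maxlen=max_pending)

    def report(self, payload: str, collector_up: bool) -> None:
        self.pending.append(payload)   # with maxlen set, the oldest entry is dropped
        if collector_up:
            self.pending.clear()       # pretend everything was delivered


leaky, bounded = Reporter(), Reporter(max_pending=100)
for i in range(10_000):                # collector stays down the whole time
    leaky.report(f"stats-{i}", collector_up=False)
    bounded.report(f"stats-{i}", collector_up=False)

print(len(leaky.pending), len(bounded.pending))
```

The unbounded buffer grows with every failed attempt, while the bounded one caps memory at the cost of dropping old reports. Which trade-off is acceptable depends on how important the reports are.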

15 (more in the paper)

16 Where is the SPOF?
Redundancies, redundancies, redundancies!
Yes, we did that
So, why do outages still happen?

17 Failure recovery chain
Failure detection
Failover
Backups

18 Imperfect failure recovery chain
Incomplete failure detection
Failover that fails
Backups that also fail

19 Imperfect failure recovery chain
Incomplete error/failure detection
Undetected (specific type of) memory leaks
Load spikes of authentication requests
“an unexpected hardware behavior”

20 Imperfect failure recovery chain
Failover/recovery that fails
Bad PLC fails to activate backup power generators
Failed network switch failover
DC failover fails due to cold cache problems
Recovery/re-mirroring storm

21 Imperfect failure recovery chain
Multiple failures!
Double failures of power, network, storage or server components
Diverse failures: network+server; storage+fibre cut
Cascading bugs …
… that caused many/all redundancies to fail
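
One way to see why redundancy alone does not prevent outages is to treat the recovery chain as three steps that must all succeed. The sketch below assumes the steps succeed independently and uses made-up probabilities. Neither the model nor the numbers come from the study.

```python
def masked_probability(p_detect: float, p_failover: float, p_backup: float) -> float:
    """Chance that a primary failure is masked, assuming detection, failover,
    and the backup each succeed independently (a simplifying assumption)."""
    return p_detect * p_failover * p_backup


def outage_probability(p_detect: float, p_failover: float, p_backup: float) -> float:
    """Chance that a primary failure becomes a visible outage."""
    return 1.0 - masked_probability(p_detect, p_failover, p_backup)


# A perfect chain masks every failure.
print(outage_probability(1.0, 1.0, 1.0))

# Slightly imperfect steps compound (illustrative numbers only).
print(round(outage_probability(0.99, 0.95, 0.99), 4))
```

Cascading bugs and correlated failures, as listed on the slides, break the independence assumption, so the real risk can be higher than this model suggests.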

22 COS Database: ?
Email us / Check our website
More correlations between …
Root cause & downtime
Service maturity & downtime
Root cause & impacts
Root cause & fixes
Etc.

23 Conclusion
Features and failures are racing with each other
“Biggest/worst cloud outages of 20YY” – a new year’s tradition
Hope COS tells the cause
Many more examples/details in the papers

24 Thank you! Questions?
ucare.cs.uchicago.edu
ceres.cs.uchicago.edu

25 EXTRA

26 Manually extract outage “metadata”
Classifications:

27 A service outage is an unplanned unavailability of some or all features of the service that affects all or a significant number of users, in such a way that the outage is reported publicly. Data loss, staleness, and late deliveries that lead to loss of productivity are also considered outages.

28 #Outages/year
On average
1/3 of the services: at least 3 unplanned outages per year
Worst year (between ’09-’14)
½ of the services: at least 4 unplanned outages per year

29 Downtime by root cause (sorted by median downtime)

30 Maturity helps?
Does service maturity help?
Based on outage count:
In 2014, 24 outages occurred in 9-year-old services

31 Maturity helps?
Based on downtime:
In 2014, 267 hours of downtime came from 17-year-old services
More mature → more popular → more users → more complex

32 Interesting Root Causes
Load
Spikes of non-monitored requests
User requests (monitored)
Database index accesses
Authentication requests (cryptographic consumption)
Misconfiguration
Ex: traffic redirection
Take-away: be careful with traffic-related code/configs
Recovery feedback loop

33 Interesting Root Causes
Cross (dependencies)
Amazon Web Services: Airbnb, Bitbucket, Dropbox, Foursquare, Github, Heroku, Instagram, Minecraft, Netflix, Pinterest, Quora, Reddit, Vine
Azure: Xbox Live and “52 other services”
Google DC (co-location): Google Gmail, Search, Drive, Youtube (40% drop of internet traffic for 5 mins)

34 Studies of failures, enough?

35 Studies of failures, enough?
Not all report “d”owntimes
Most study only a few services (data behind company walls)

