What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems Authors: Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake,


1 What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. Authors: Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. Presenter: Richeng Huang

2 This is the cloud computing era! Cloud systems are developing rapidly. They are complex, and their dependability needs to improve. What bugs do we have? How do we classify them? Are there cloud-unique bugs? How should dependability tools improve?

3 Cloud Bug Study (CBS). 6 target systems: Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume. 1-year study. Issues in a 3-year window: Jan 2011 to Jan 2014. ~21,000 issues reviewed. ~3,600 (17%) "vital" issues selected for in-depth study. Vital: issues that affect real deployed systems.

4 Why these 6 systems? They cover distributed cloud computing frameworks, scalable storage systems, distributed key-value stores, synchronization services, and streaming systems.

5 Methodology: Issue Repositories, Analysis, Issue Classifications, Cloud Bug Study DB (CBSDB)

6 Issue Repositories. Luckily, each Apache Software Foundation project maintains a highly organized issue repository. For example: ZooKeeper's issue repository.

7 Example (annotated issue page): Title, Time to resolve, Description, Type & Priority, Discussion

8 Several Classifications
- Aspects: reliability, performance, availability, security, consistency, scalability, topology, QoS
- Hardware: processor, disk, memory, network, node
- Hardware failures: corrupt, limp, stop
- Software bug types: logic, error handling, optimization, config, race, hang, space, load
- Implications: failed operation, performance, component downtime, data loss, data staleness, data corruption
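
The classification on this slide can be read as a small tag taxonomy. The sketch below is only an illustration of that idea, not the schema the authors used for CBSDB: the category keys, the `TAXONOMY` dictionary, the `validate_tags` helper, and the example issue tags are all hypothetical names chosen here. Only the tag values themselves come from the slide.

```python
# Hypothetical encoding of the slide's classification as a tag taxonomy.
# Category keys and function names are assumptions made for illustration.
TAXONOMY = {
    "aspect": {
        "reliability", "performance", "availability", "security",
        "consistency", "scalability", "topology", "qos",
    },
    "hardware": {"processor", "disk", "memory", "network", "node"},
    "hardware_failure": {"corrupt", "limp", "stop"},
    "software_bug": {
        "logic", "error handling", "optimization", "config",
        "race", "hang", "space", "load",
    },
    "implication": {
        "failed operation", "performance", "component downtime",
        "data loss", "data staleness", "data corruption",
    },
}


def validate_tags(tags):
    """Return a list of problems; an empty list means every tag is known."""
    problems = []
    for category, values in tags.items():
        allowed = TAXONOMY.get(category)
        if allowed is None:
            problems.append(f"unknown category: {category}")
            continue
        # Report unknown values in sorted order so the output is stable.
        for value in sorted(set(values) - allowed):
            problems.append(f"unknown {category} tag: {value}")
    return problems


# A made-up issue tagged along several dimensions at once.
example_issue_tags = {
    "aspect": {"availability"},
    "hardware_failure": {"corrupt"},
    "software_bug": {"error handling"},
    "implication": {"component downtime"},
}

print(validate_tags(example_issue_tags))
print(validate_tags({"aspect": {"speed"}}))
```

Keeping the categories as independent dimensions reflects that a single issue can carry one tag from each group (for example, an aspect plus a bug type plus an implication).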

9 Aspects: Reliability. Reliability (45%): operation and job failures/errors, data loss/corruption/staleness. Figure legend: CS = Cassandra, FL = Flume, HB = HBase, HD = HDFS, MR = MapReduce, ZK = ZooKeeper.

10 Aspects: Performance. Reliability (45%), Performance (22%)

11 Aspects: Availability. Reliability (45%), Performance (22%), Availability (16%)

12 Aspects: Security. Reliability (45%), Performance (22%), Availability (16%), Security (8%)

13 There are new aspects in cloud systems. Classical: Reliability (45%), Performance (22%), Availability (16%), Security (8%). New: data consistency, scalability, topology, QoS.

14 Aspects: Data consistency. Data consistency (5%): permanently inconsistent replicas. Various root causes: buggy operational protocols, concurrency bugs, and node failures.

15 Aspects: Reliability (45%), Performance (22%), Availability (16%), Security (8%), Data consistency (5%), Scalability (2%), Topology (1%), QoS (1%). Note: small numbers, but important; hard to test at small scale.

16 Aspects: Reliability (45%), Performance (22%), Availability (16%), Security (8%), Data consistency (5%), Scalability (2%), Topology (1%), QoS (1%). Note: cross-DC, different racks.

17 Aspects: Reliability (45%), Performance (22%), Availability (16%), Security (8%), Data consistency (5%), Scalability (2%), Topology (1%), QoS (1%). Note: typically in vertical/cross-system QoS.

18 Killer Bugs: bugs that simultaneously affect multiple nodes or even the entire cluster. SPoF still exists in many forms: positive feedback loop, buggy failover, repeated bugs after failover, distributed deadlock, …

19 Killer Bugs. The figure shows heat maps of the correlation between the scope of killer bugs (multiple nodes or whole cluster) and hardware/software root causes. A killer bug can be caused by multiple root causes. The number in each cell represents the bug count.

20 Positive feedback loop. Example case in Cassandra. [Diagram; node labels: More nodes, Gossip traffic high, False failure, Recovery, Load high, More false failure]

21 Repeated bugs after failover. A key to no-SPoF: after a successful failover, the system should resume the previously failed operation. But with a software bug, the system will run the same buggy logic again after failover. In HBase, a region server dies due to bad handling of corrupt region files; the live region server that takes over runs the same code and also dies. Eventually, all region servers go offline.
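
The HBase scenario on this slide can be pictured with a toy model. This is a sketch under stated assumptions, not HBase code: `simulate_failover`, its parameters, and the server names are hypothetical, and "crashing" is modeled simply as removing a server from the live list.

```python
def simulate_failover(servers, region_is_corrupt, handles_corruption):
    """Reassign one region from server to server until one can open it.

    Toy model (assumption): every server runs the same code, so either all
    of them handle a corrupt region file or none of them do.
    Returns (servers_still_alive, server_holding_region_or_None).
    """
    alive = list(servers)
    while alive:
        candidate = alive[0]
        if region_is_corrupt and not handles_corruption:
            # Buggy path: the server dies while opening the corrupt region,
            # and failover hands the same region to the next live server.
            alive.pop(0)
            continue
        # Healthy path: the server opens the region and keeps serving it.
        return alive, candidate
    # Every server ran the same buggy logic and died: the cluster is down.
    return alive, None


servers = ["rs1", "rs2", "rs3"]
print(simulate_failover(servers, region_is_corrupt=True, handles_corruption=False))
print(simulate_failover(servers, region_is_corrupt=True, handles_corruption=True))
```

In the buggy case the loop ends with no live servers, which is the cluster-wide outage the slide describes; failover alone does not help when the replacement runs identical logic on identical input.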

22 HW faults vs. SW faults

23 HW faults and modes. 299 issues involve improper handling of node fail-stop failures. A memory card running at 25% of normal speed caused problems in an HBase deployment.

24 Software bug types: Logic (29%), Error handling (18%), Optimization (15%), Configuration (14%), Data race (12%), Hang (4%, e.g. deadlock), Space (4%), Load (4%). Chart labels: Logic, Err-h, Opt, Config, Race, Hang, Space, Load.

25 Implications: Failed operation (42%), Performance (23%), Downtime (18%), Data loss (7%), Data corruption (5%), Data staleness (5%). Chart labels: Opfail, Perf, Down, Loss, Stale, Corrupt.

26 Software/Hardware Faults & Implications. Catch all faults! There is still a long way to go to a highly dependable system.

27 Cloud Bug Study database (CBSDB): a total of 21,399 issues (3,655 vital). Open to the public. Supports bug evolution analysis.
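
As a quick consistency check, the exact counts on this slide can be compared with the rounded "~17% vital" figure quoted earlier in the talk. The variable names below are chosen here for illustration.

```python
# Counts as stated on the CBSDB slide.
total_issues = 21_399
vital_issues = 3_655

# Share of reviewed issues that were classified as vital.
vital_share = vital_issues / total_issues
print(f"{vital_share:.1%}")  # 17.1%
```

The result agrees with the approximate figures given on the earlier overview slide (~3,600 of ~21,000 issues, 17%).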

28 System evolution (figure annotation: Hadoop 2.0)

29 Conclusion: The largest bug study for cloud systems to date. Provides insights into many intricate bugs. Unique bugs in cloud systems. Killer bugs. Cloud Bug Study (CBS) database.

30 Comments: This study required a huge amount of human effort, which is neither efficient nor maintainable. The study reports the distribution of issues but offers no suggestions or solutions for them. The study analyzes only issues that have already been resolved. This information is retrievable from the repositories; experts and developers can get the implications from the issue reports themselves. CBSDB is not active, since maintaining it takes a large amount of time. The authors did not explicitly say how this study should be used for future development.

31 Thoughts and Discussion from Piazza. Combine machine learning and NLP techniques for the classification and tagging task. - Hongwei Wang. They don't provide a possible solution to the problem "why are cloud systems not 100% dependable?" - Eric Badger. They say cloud systems are still far from 100% dependable; an automatic analysis tool is needed. - Sanchit Gupta.

