What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems
Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria; Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, and Tiratat Patana-anake; Thanh Do
First, let’s ask Google
The cloud era: no deep root causes reported…
What does the reliability research community do? Bug studies:
1. A Study of Linux File System Evolution. In FAST ’13.
2. A Comprehensive Study on Real World Concurrency Bug Characteristics. In ASPLOS ’08.
3. Precomputing Possible Configuration Error Diagnoses. In ASE ’11.
…
Open-sourced cloud software means publicly accessible bug repositories.
Questions this study sets out to answer:
– What bugs “live” in the cloud?
– Are there new classes of bugs unique to cloud systems?
– How should cloud dependability tools evolve in the near future?
– Many other questions…
Cloud Bug Study (CBS)
– 6 systems: Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume
– 11 people, a 1-year study
– Issues in a 3-year window: Jan 2011 to Jan 2014
– ~21,000 issues reviewed
– ~3,600 “vital” issues studied in depth
– Result: the Cloud Bug Study (CBS) database
Classifications
– Aspects: reliability, performance, availability, security, consistency, scalability, topology, QoS
– Hardware failures: types of hardware and types of hardware failures
– Software bug types: logic, error handling, optimization, config, race, hang, space, load
– Implications: failed operation, performance, component downtime, data loss, data staleness, data corruption
~25,000 annotations in total, about 7 annotations per issue. A sketch of what one annotated record might look like follows below.
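To make the classification scheme concrete, here is a minimal sketch in Java of what one annotated issue record could look like. The class, enum, and field names are hypothetical illustrations, not the actual CBS database schema.

```java
import java.util.Set;

// Hypothetical sketch of one annotated CBS issue; names are illustrative.
public class CbsIssue {
    enum Aspect { RELIABILITY, PERFORMANCE, AVAILABILITY, SECURITY,
                  CONSISTENCY, SCALABILITY, TOPOLOGY, QOS }
    enum BugType { LOGIC, ERROR_HANDLING, OPTIMIZATION, CONFIG,
                   RACE, HANG, SPACE, LOAD }
    enum Implication { FAILED_OPERATION, PERFORMANCE, DOWNTIME,
                       DATA_LOSS, DATA_STALENESS, DATA_CORRUPTION }

    final String system;            // e.g., "HDFS"
    final String issueId;           // bug-tracker key (hypothetical)
    final Set<Aspect> aspects;      // one issue can touch several aspects
    final Set<BugType> bugTypes;
    final Set<Implication> implications;

    CbsIssue(String system, String issueId, Set<Aspect> aspects,
             Set<BugType> bugTypes, Set<Implication> implications) {
        this.system = system;
        this.issueId = issueId;
        this.aspects = aspects;
        this.bugTypes = bugTypes;
        this.implications = implications;
    }

    // The study reports ~7 annotations per issue on average.
    int annotationCount() {
        return aspects.size() + bugTypes.size() + implications.size();
    }

    public static void main(String[] args) {
        CbsIssue issue = new CbsIssue("HDFS", "HDFS-1234",
            Set.of(Aspect.RELIABILITY), Set.of(BugType.ERROR_HANDLING),
            Set.of(Implication.DATA_LOSS));
        System.out.println(issue.annotationCount());   // 3
    }
}
```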
The Cloud Bug Study (CBS) database is open to the public.
Outline
– Introduction
– Methodology
– Overview of results
– Other CBS database use cases
– Conclusion
Methodology
– 6 systems over a 3-year span, 2011 to 2014
– 20~30 bugs a day. Protein, yeah!
– 17% were “vital” issues affecting real deployments: 3,655 vital issues
Example issue, showing its title, type & priority, description, time to resolve, and discussion.
Outline
– Introduction
– Methodology
– Overview of results
– Other CBS database use cases
– Conclusion
Classifications for each vital issue
– Aspects
– Hardware types and failure modes
– Software bug types
– Implications
– Bug scopes
Overview of results
– Aspects
– Hardware faults vs. software faults
– Implications
Aspects (legend: CS = Cassandra, FL = Flume, HB = HBase, HD = HDFS, MR = MapReduce, ZK = ZooKeeper)
Aspects: Reliability (45%) – operation & job failures/errors, data loss/corruption/staleness
Aspects: Performance (22%)
Aspects: Availability (16%) – node and cluster downtime
Aspects: Security (6%)
Overview of results
– Aspects (classical)
– Aspects (unique): data consistency, scalability, topology, QoS
– Hardware faults vs. software faults
– Implications
Aspects: Data consistency (5%)
– Permanently inconsistent replicas
– Various root causes: buggy operational protocols, concurrency bugs, and node failures
Cassandra cross-DC synchronization [figure: replicas A, B, C in one data center and A’, B’, C’ in another diverge into permanent inconsistency]. Background operational protocols are often buggy!
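A minimal sketch of the failure shape described above, assuming a hypothetical coordinator that marks a cross-DC sync as complete before every remote replica has applied the update. This is illustrative Java, not Cassandra's actual protocol code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration: a crash mid-sync plus a "done" marker written too
// early leaves the two data centers permanently inconsistent, because
// the protocol believes it finished and never repairs the divergence.
public class BuggySync {
    static Map<String, String> dc1 = new HashMap<>();
    static Map<String, String> dc2 = new HashMap<>();

    public static void main(String[] args) {
        dc1.put("A", "v1");
        dc2.put("A", "v1");
        dc1.put("A", "v2");                    // new write lands in DC1

        boolean crashedMidSync = true;         // simulated node failure
        for (Map<String, String> remote : List.of(dc2)) {
            if (crashedMidSync) break;         // update never shipped
            remote.put("A", "v2");
        }
        markSyncComplete();                    // BUG: recorded as done anyway

        // dc1 has v2, dc2 still has v1, and no repair will ever run.
        System.out.println("dc1=" + dc1.get("A") + " dc2=" + dc2.get("A"));
    }

    static void markSyncComplete() { /* persists a "done" marker */ }
}
```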
Aspects: Scalability (2%)
– A small number does not mean unimportant!
– Only found at scale: large cluster sizes, large data, large load, large failures (examples and sketches below)
Large cluster: in Cassandra, an O(n^3) calculation; when a ring position changed, CPU usage exploded 100x.
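A toy sketch of the scaling behavior: a recomputation that is cubic in cluster size n looks harmless in a small test cluster but explodes at production scale. The function name and structure are illustrative, not Cassandra's real implementation.

```java
// Toy illustration of a scalability bug hidden at small scale.
public class RingRecalc {
    // Called whenever a node's ring position changes.
    static long recomputePendingRanges(int n) {
        long work = 0;
        for (int i = 0; i < n; i++)            // each node...
            for (int j = 0; j < n; j++)        // ...against each range...
                for (int k = 0; k < n; k++)    // ...against each replica set
                    work++;                    // O(n^3) total
        return work;
    }

    public static void main(String[] args) {
        // 10 nodes: 1,000 steps, invisible in testing.
        System.out.println(recomputePendingRanges(10));
        // 500 nodes: 125,000,000 steps per ring change -> CPU explosion.
        System.out.println(recomputePendingRanges(500));
    }
}
```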
Large data: in HBase, an insufficient lookup operation over regions R1, R2, R3, … R100K took tens of minutes.
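A sketch of the lookup contrast, with illustrative key formats and region counts (not HBase's actual region-index code): a linear scan over ~100K region start keys versus a logarithmic floor lookup over the same sorted keys.

```java
import java.util.TreeMap;

// Toy contrast: O(n) region lookup vs. O(log n) over sorted start keys.
public class RegionLookup {
    static String[] regionStartKeys;                      // sorted start keys
    static TreeMap<String, Integer> regionIndex = new TreeMap<>();

    // O(n) scan over every region: the "insufficient" lookup
    static int findRegionSlow(String rowKey) {
        int last = 0;
        for (int i = 0; i < regionStartKeys.length; i++)
            if (regionStartKeys[i].compareTo(rowKey) <= 0) last = i;
        return last;
    }

    // O(log n): floor lookup over sorted start keys
    static int findRegionFast(String rowKey) {
        return regionIndex.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        int n = 100_000;                                  // 100K regions
        regionStartKeys = new String[n];
        for (int i = 0; i < n; i++) {
            String key = String.format("row%08d", i * 10);
            regionStartKeys[i] = key;
            regionIndex.put(key, i);
        }
        System.out.println(findRegionSlow("row00500005")); // scans all 100K
        System.out.println(findRegionFast("row00500005")); // ~17 comparisons
    }
}
```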
Large load: in HDFS, 1000x small files arriving in parallel; the system was not expecting small files!
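A toy cost model of the small-files problem, with made-up per-file and per-megabyte constants (not measured HDFS costs): the same total bytes cost far more as a million 1 MB files than as a thousand 1 GB files, because fixed per-file overhead dominates.

```java
// Toy model: fixed per-file overhead (metadata ops, block allocation)
// makes many tiny files far costlier than the same bytes in large files.
public class SmallFilesLoad {
    static final long PER_FILE_OVERHEAD_US = 300; // metadata RPCs, etc.
    static final long PER_MB_COST_US = 10;

    static long writeCostUs(long files, long mbPerFile) {
        return files * (PER_FILE_OVERHEAD_US + mbPerFile * PER_MB_COST_US);
    }

    public static void main(String[] args) {
        // Same 1 TB of data both ways:
        System.out.println(writeCostUs(1_000, 1_000));   // 1,000 x 1 GB files
        System.out.println(writeCostUs(1_000_000, 1));   // 1M x 1 MB: ~30x cost
    }
}
```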
Large failure: an AM managing 16,000 tasks fails; with un-optimized connection handling, recovering tasks 1, 2, 3, … 16K cost 7+ hours.
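A back-of-the-envelope sketch of that recovery cost, with assumed numbers (1.6 s of connection setup per attempt, 400 nodes; both are illustrative): one connection per task, serially, lands in the reported 7+ hour range, while one connection per node would take minutes.

```java
// Toy model of recovery cost after a large failure.
public class RecoveryCost {
    static final double CONN_SETUP_SEC = 1.6;   // assumed setup/timeout cost

    public static void main(String[] args) {
        int tasks = 16_000, nodes = 400;        // node count is hypothetical
        double naive = tasks * CONN_SETUP_SEC;  // one connection per task
        double pooled = nodes * CONN_SETUP_SEC; // one reused conn per node
        System.out.printf("per-task: %.1f h%n", naive / 3600);   // ~7.1 h
        System.out.printf("per-node: %.1f min%n", pooled / 60);  // ~10.7 min
    }
}
```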
From the examples above, protocol algorithms must anticipate:
– large cluster sizes
– large data
– large request loads of various kinds
– large-scale failures
Hence the need for scalability bug detection tools.
Aspects: Topology (1%)
– Systems have problems when deployed on certain network topologies: cross-DC, different racks, new layering architectures
– Typically unseen in pre-deployment
Aspects: QoS (1%)
– Fundamental for multi-tenant systems
– Two main points: horizontal/intra-system QoS and vertical/cross-system QoS
Overview of results
– Aspects (classical)
– Aspects (unique): data consistency, scalability, topology, QoS
– Hardware faults vs. software faults
– Implications
HW faults vs. SW faults: “Hardware can fail, and reliability should come from software.”
HW faults and modes
– 299 issues involve improper handling of node fail-stop failures
– A memory card running at 25% of normal speed caused problems in an HBase deployment
Hardware faults vs. software faults
– Hardware failures: components and modes
– Software bug types
Software bug types: Logic (29%) – many domain-specific issues
Software bug types: Error handling (18%) – see Aspirator, Yuan et al. [OSDI ’14]. A sketch of the pattern follows below.
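A minimal Java sketch of the bug pattern Aspirator hunts for, with hypothetical method names: a caught exception is silently swallowed, so callers proceed as if the operation succeeded.

```java
import java.io.IOException;
import java.util.logging.Logger;

// Classic error-handling bug: the swallowed exception silently turns a
// disk failure into later data loss. Method names are illustrative.
public class SwallowedError {
    static final Logger LOG = Logger.getLogger("demo");

    void flushEditLog() {
        try {
            syncToDisk();
        } catch (IOException e) {
            // BUG: swallowed -- callers believe the edits are durable
        }
    }

    void flushEditLogFixed() {
        try {
            syncToDisk();
        } catch (IOException e) {
            LOG.severe("edit log sync failed: " + e);
            throw new IllegalStateException("cannot guarantee durability", e);
        }
    }

    void syncToDisk() throws IOException {
        throw new IOException("disk failure");   // simulated fault
    }

    public static void main(String[] args) {
        new SwallowedError().flushEditLog();
        System.out.println("flush returned normally despite disk failure");
    }
}
```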
Software bug types: Optimization (15%)
Software bug types: Configuration (14%)
– Automating Configuration Troubleshooting. [OSDI ’10]
– Precomputing Possible Configuration Error Diagnoses. [ASE ’11]
– Do Not Blame Users for Misconfigurations. [SOSP ’13]
Software bug types: Race (12%)
– < 50% are local concurrency bugs: buggy thread interleavings; tons of existing work
– > 50% are distributed concurrency bugs: reorderings of messages, crashes, and timeouts; more work is needed – SAMC [OSDI ’14] (a toy example follows below)
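A toy distributed concurrency bug, not drawn from any specific issue in the study: the master assumes a node's REGISTER message is processed before its first HEARTBEAT, and a reordered delivery crashes the handler.

```java
import java.util.HashMap;
import java.util.Map;

// Toy distributed race: message reordering violates an implicit
// "REGISTER arrives first" assumption. Illustrative names only.
public class ReorderBug {
    static Map<String, Long> lastSeen = new HashMap<>();

    static void onRegister(String node) { lastSeen.put(node, 0L); }

    static void onHeartbeat(String node, long ts) {
        // BUG: NullPointerException if the HEARTBEAT wins the race,
        // because get() returns null and unboxing to long fails.
        long prev = lastSeen.get(node);
        lastSeen.put(node, Math.max(prev, ts));
    }

    public static void main(String[] args) {
        onHeartbeat("node1", 42);   // reordered delivery -> crash here
        onRegister("node1");
    }
}
```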
Software bug types: Hang (4%)
– Classical deadlocks (a minimal example below)
– Un-served jobs, stalled operations, … What are the root causes? How do we detect them?
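For the classical-deadlock case, a minimal Java example: two threads take the same two locks in opposite order and block forever (this program intentionally never terminates).

```java
// Textbook deadlock: opposite lock-acquisition order across two threads.
public class DeadlockDemo {
    static final Object lockA = new Object();
    static final Object lockB = new Object();

    public static void main(String[] args) {
        new Thread(() -> {
            synchronized (lockA) {
                sleep(50);
                synchronized (lockB) { }       // waits for the other thread forever
            }
        }).start();
        new Thread(() -> {
            synchronized (lockB) {
                sleep(50);
                synchronized (lockA) { }       // waits for the other thread forever
            }
        }).start();
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
    }
}
```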
Software bug types: Space (4%)
– Big data + leak = big leak
– Clean-up operations must be flawless (sketch below)
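A sketch of the leak pattern, with illustrative sizes: a per-request buffer is released only on the success path, so every failed request strands memory, and at cloud request volumes a small leak becomes a big leak.

```java
import java.util.ArrayList;
import java.util.List;

// "Big data + leak = big leak": cleanup skipped on the failure path.
public class CleanupLeak {
    static final List<byte[]> inFlight = new ArrayList<>();

    static void handleRequest(boolean fails) {
        byte[] buf = new byte[1 << 20];       // 1 MB per request
        inFlight.add(buf);
        if (fails) return;                    // BUG: early exit skips cleanup
        inFlight.remove(buf);                 // only reached on success
    }

    static void handleRequestFixed(boolean fails) {
        byte[] buf = new byte[1 << 20];
        inFlight.add(buf);
        try {
            if (fails) return;
        } finally {
            inFlight.remove(buf);             // cleanup on every path
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) handleRequest(i % 2 == 0);
        System.out.println("stranded buffers: " + inFlight.size());  // 50
    }
}
```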
Software bug types: Load (4%)
– Happens when systems face high request load
– Relates to QoS and admission control (sketch below)
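A minimal sketch of the admission-control idea this category points to, using a semaphore to bound in-flight requests; the limit and names are illustrative, not from any of the studied systems.

```java
import java.util.concurrent.Semaphore;

// Bound concurrent requests instead of letting a burst take the node down.
public class AdmissionControl {
    static final Semaphore slots = new Semaphore(64); // max in-flight requests

    static boolean tryServe(Runnable request) {
        if (!slots.tryAcquire()) return false;  // shed load early
        try {
            request.run();
            return true;
        } finally {
            slots.release();
        }
    }

    public static void main(String[] args) {
        boolean served = tryServe(() -> System.out.println("handled"));
        System.out.println("admitted: " + served);
    }
}
```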
Overview of results
– Aspects (classical)
– Aspects (unique): data consistency, scalability, topology, QoS
– Hardware faults vs. software faults
– Implications
Implications
– Failed operation (42%)
– Performance (23%)
– Downtimes (18%)
– Data loss (7%)
– Data corruption (5%)
– Data staleness (5%)
Root causes: every implication can be caused by all kinds of hardware and software faults!
“Killer” bugs: bugs that simultaneously affect multiple nodes or even the entire cluster. Single points of failure still exist in many forms:
– positive feedback loops (sketch below)
– buggy failover
– repeated bugs after failover
– …
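A toy simulation of one killer-bug shape, the positive feedback loop, with made-up constants: timeouts trigger retries, retries raise load, and higher load causes more timeouts until the whole cluster spirals down.

```java
// Toy positive feedback loop: retry amplification under overload.
public class RetryStorm {
    public static void main(String[] args) {
        double load = 0.9;                     // fraction of capacity
        for (int round = 1; round <= 6; round++) {
            // Past ~80% of capacity, the timeout rate climbs quickly.
            double timeoutRate = Math.min(1.0, Math.max(0, load - 0.8) * 5);
            load = 0.9 + load * timeoutRate;   // each timeout adds a retry
            System.out.printf("round %d: load=%.2f timeouts=%.0f%%%n",
                              round, load, timeoutRate * 100);
        }
    }
}
```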
Outline
– Introduction
– Methodology
– Overview of results
– Other CBS database use cases
– Conclusion
CBS database
– 50+ per-system and aggregate graphs from mining the CBS database over the past year
– Still more waiting to be studied…
Components with the most issues: cross-system issues are prevalent! How should we enhance reliability for interactions among multiple cloud systems?
Most challenging types of issues
Top k% of most complicated issues
System evolution: Hadoop 2.0
Conclusion
– One of the largest bug studies for cloud systems
– Many interesting findings, but more questions can be raised from our analysis: What types of performance issues exist? What are the root causes of hang issues? …
– The Cloud Bug Study (CBS) database
Thank you! http://ucare.cs.uchicago.edu/