What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems
Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria; Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, and Tiratat Patana-anake; Thanh Do
First, let’s ask Google
The cloud era: no deep root causes reported…
What does the reliability research community do? Bug studies:
1. A Study of Linux File System Evolution. In FAST ’13.
2. A Comprehensive Study on Real World Concurrency Bug Characteristics. In ASPLOS ’08.
3. Precomputing Possible Configuration Error Diagnoses. In ASE ’11.
…
Open-sourced cloud software means publicly accessible bug repositories.
Questions this study sets out to answer:
– What bugs “live” in the cloud?
– Are there new classes of bugs unique to cloud systems?
– How should cloud dependability tools evolve in the near future?
– Many other questions…
Cloud Bug Study (CBS)
– 6 systems: Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume
– 11 people, a 1-year study
– Issues in a 3-year window: Jan 2011 to Jan 2014
– ~21,000 issues reviewed
– ~3,600 “vital” issues studied in depth
– Result: the Cloud Bug Study (CBS) database
Classifications
– Aspects: reliability, performance, availability, security, consistency, scalability, topology, QoS
– Hardware failures: types of hardware and types of hardware failures
– Software bug types: logic, error handling, optimization, config, race, hang, space, load
– Implications: failed operation, performance, component downtime, data loss, data staleness, data corruption
~25,000 annotations in total, about 7 annotations per issue. A sketch of what one annotated record might look like follows below.
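To make the classification scheme concrete, here is a minimal sketch in Java of what one annotated issue record could look like. The class, enum, and field names are hypothetical illustrations, not the actual CBS database schema.

```java
import java.util.Set;

// Hypothetical sketch of one annotated CBS issue; names are illustrative.
public class CbsIssue {
    enum Aspect { RELIABILITY, PERFORMANCE, AVAILABILITY, SECURITY,
                  CONSISTENCY, SCALABILITY, TOPOLOGY, QOS }
    enum BugType { LOGIC, ERROR_HANDLING, OPTIMIZATION, CONFIG,
                   RACE, HANG, SPACE, LOAD }
    enum Implication { FAILED_OPERATION, PERFORMANCE, DOWNTIME,
                       DATA_LOSS, DATA_STALENESS, DATA_CORRUPTION }

    final String system;            // e.g., "HDFS"
    final String issueId;           // bug-tracker key (hypothetical)
    final Set<Aspect> aspects;      // one issue can touch several aspects
    final Set<BugType> bugTypes;
    final Set<Implication> implications;

    CbsIssue(String system, String issueId, Set<Aspect> aspects,
             Set<BugType> bugTypes, Set<Implication> implications) {
        this.system = system;
        this.issueId = issueId;
        this.aspects = aspects;
        this.bugTypes = bugTypes;
        this.implications = implications;
    }

    // The study reports ~7 annotations per issue on average.
    int annotationCount() {
        return aspects.size() + bugTypes.size() + implications.size();
    }

    public static void main(String[] args) {
        CbsIssue issue = new CbsIssue("HDFS", "HDFS-1234",
            Set.of(Aspect.RELIABILITY), Set.of(BugType.ERROR_HANDLING),
            Set.of(Implication.DATA_LOSS));
        System.out.println(issue.annotationCount());   // 3
    }
}
```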
The Cloud Bug Study (CBS) database is open to the public.
Outline
– Introduction
– Methodology
– Overview of results
– Other CBS database use cases
– Conclusion
Methodology
– 6 systems over a 3-year span, 2011 to 2014
– 20~30 bugs a day. Protein, yeah!
– 17% were “vital” issues affecting real deployments: 3,655 vital issues
Example issue, showing its title, type & priority, description, time to resolve, and discussion.
Outline
– Introduction
– Methodology
– Overview of results
– Other CBS database use cases
– Conclusion
Classifications for each vital issue
– Aspects
– Hardware types and failure modes
– Software bug types
– Implications
– Bug scopes
Overview of results
– Aspects
– Hardware faults vs. software faults
– Implications
Aspects (legend: CS = Cassandra, FL = Flume, HB = HBase, HD = HDFS, MR = MapReduce, ZK = ZooKeeper)
Aspects: Reliability (45%) – operation & job failures/errors, data loss/corruption/staleness
Aspects: Performance (22%)
Aspects: Availability (16%) – node and cluster downtime
Aspects: Security (6%)
Overview of results
– Aspects (classical)
– Aspects (unique): data consistency, scalability, topology, QoS
– Hardware faults vs. software faults
– Implications
Aspects: Data consistency (5%)
– Permanently inconsistent replicas
– Various root causes: buggy operational protocols, concurrency bugs, and node failures
Cassandra cross-DC synchronization [figure: replicas A, B, C in one data center and A’, B’, C’ in another diverge into permanent inconsistency]. Background operational protocols are often buggy!
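A minimal sketch of the failure shape described above, assuming a hypothetical coordinator that marks a cross-DC sync as complete before every remote replica has applied the update. This is illustrative Java, not Cassandra's actual protocol code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration: a crash mid-sync plus a "done" marker written too
// early leaves the two data centers permanently inconsistent, because
// the protocol believes it finished and never repairs the divergence.
public class BuggySync {
    static Map<String, String> dc1 = new HashMap<>();
    static Map<String, String> dc2 = new HashMap<>();

    public static void main(String[] args) {
        dc1.put("A", "v1");
        dc2.put("A", "v1");
        dc1.put("A", "v2");                    // new write lands in DC1

        boolean crashedMidSync = true;         // simulated node failure
        for (Map<String, String> remote : List.of(dc2)) {
            if (crashedMidSync) break;         // update never shipped
            remote.put("A", "v2");
        }
        markSyncComplete();                    // BUG: recorded as done anyway

        // dc1 has v2, dc2 still has v1, and no repair will ever run.
        System.out.println("dc1=" + dc1.get("A") + " dc2=" + dc2.get("A"));
    }

    static void markSyncComplete() { /* persists a "done" marker */ }
}
```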
Aspects: Scalability (2%)
– A small number does not mean unimportant!
– Only found at scale: large cluster sizes, large data, large load, large failures (examples and sketches below)
Large cluster: in Cassandra, an O(n^3) calculation; when a ring position changed, CPU usage exploded 100x.
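A toy sketch of the scaling behavior: a recomputation that is cubic in cluster size n looks harmless in a small test cluster but explodes at production scale. The function name and structure are illustrative, not Cassandra's real implementation.

```java
// Toy illustration of a scalability bug hidden at small scale.
public class RingRecalc {
    // Called whenever a node's ring position changes.
    static long recomputePendingRanges(int n) {
        long work = 0;
        for (int i = 0; i < n; i++)            // each node...
            for (int j = 0; j < n; j++)        // ...against each range...
                for (int k = 0; k < n; k++)    // ...against each replica set
                    work++;                    // O(n^3) total
        return work;
    }

    public static void main(String[] args) {
        // 10 nodes: 1,000 steps, invisible in testing.
        System.out.println(recomputePendingRanges(10));
        // 500 nodes: 125,000,000 steps per ring change -> CPU explosion.
        System.out.println(recomputePendingRanges(500));
    }
}
```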
Large data: in HBase, an insufficient lookup operation over regions R1, R2, R3, … R100K took tens of minutes.
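A sketch of the lookup contrast, with illustrative key formats and region counts (not HBase's actual region-index code): a linear scan over ~100K region start keys versus a logarithmic floor lookup over the same sorted keys.

```java
import java.util.TreeMap;

// Toy contrast: O(n) region lookup vs. O(log n) over sorted start keys.
public class RegionLookup {
    static String[] regionStartKeys;                      // sorted start keys
    static TreeMap<String, Integer> regionIndex = new TreeMap<>();

    // O(n) scan over every region: the "insufficient" lookup
    static int findRegionSlow(String rowKey) {
        int last = 0;
        for (int i = 0; i < regionStartKeys.length; i++)
            if (regionStartKeys[i].compareTo(rowKey) <= 0) last = i;
        return last;
    }

    // O(log n): floor lookup over sorted start keys
    static int findRegionFast(String rowKey) {
        return regionIndex.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        int n = 100_000;                                  // 100K regions
        regionStartKeys = new String[n];
        for (int i = 0; i < n; i++) {
            String key = String.format("row%08d", i * 10);
            regionStartKeys[i] = key;
            regionIndex.put(key, i);
        }
        System.out.println(findRegionSlow("row00500005")); // scans all 100K
        System.out.println(findRegionFast("row00500005")); // ~17 comparisons
    }
}
```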
Large load: in HDFS, 1000x small files arriving in parallel; the system was not expecting small files!
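A toy cost model of the small-files problem, with made-up per-file and per-megabyte constants (not measured HDFS costs): the same total bytes cost far more as a million 1 MB files than as a thousand 1 GB files, because fixed per-file overhead dominates.

```java
// Toy model: fixed per-file overhead (metadata ops, block allocation)
// makes many tiny files far costlier than the same bytes in large files.
public class SmallFilesLoad {
    static final long PER_FILE_OVERHEAD_US = 300; // metadata RPCs, etc.
    static final long PER_MB_COST_US = 10;

    static long writeCostUs(long files, long mbPerFile) {
        return files * (PER_FILE_OVERHEAD_US + mbPerFile * PER_MB_COST_US);
    }

    public static void main(String[] args) {
        // Same 1 TB of data both ways:
        System.out.println(writeCostUs(1_000, 1_000));   // 1,000 x 1 GB files
        System.out.println(writeCostUs(1_000_000, 1));   // 1M x 1 MB: ~30x cost
    }
}
```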
Large failure: an AM managing 16,000 tasks fails; with un-optimized connection handling, recovering tasks 1, 2, 3, … 16K cost 7+ hours.
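A back-of-the-envelope sketch of that recovery cost, with assumed numbers (1.6 s of connection setup per attempt, 400 nodes; both are illustrative): one connection per task, serially, lands in the reported 7+ hour range, while one connection per node would take minutes.

```java
// Toy model of recovery cost after a large failure.
public class RecoveryCost {
    static final double CONN_SETUP_SEC = 1.6;   // assumed setup/timeout cost

    public static void main(String[] args) {
        int tasks = 16_000, nodes = 400;        // node count is hypothetical
        double naive = tasks * CONN_SETUP_SEC;  // one connection per task
        double pooled = nodes * CONN_SETUP_SEC; // one reused conn per node
        System.out.printf("per-task: %.1f h%n", naive / 3600);   // ~7.1 h
        System.out.printf("per-node: %.1f min%n", pooled / 60);  // ~10.7 min
    }
}
```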
From the examples above, protocol algorithms must anticipate:
– large cluster sizes
– large data
– large request loads of various kinds
– large-scale failures
Hence the need for scalability bug detection tools.
Aspects: Topology (1%)
– Systems have problems when deployed on certain network topologies: cross-DC, different racks, new layering architectures
– Typically unseen in pre-deployment
Aspects: QoS (1%)
– Fundamental for multi-tenant systems
– Two main points: horizontal/intra-system QoS and vertical/cross-system QoS
Overview of results
– Aspects (classical)
– Aspects (unique): data consistency, scalability, topology, QoS
– Hardware faults vs. software faults
– Implications
HW faults vs. SW faults: “Hardware can fail, and reliability should come from software.”
HW faults and modes
– 299 issues involve improper handling of node fail-stop failures
– A memory card running at 25% of normal speed caused problems in an HBase deployment
Hardware faults vs. software faults
– Hardware failures: components and modes
– Software bug types
Software bug types: Logic (29%) – many domain-specific issues
Software bug types: Error handling (18%) – see Aspirator, Yuan et al. [OSDI ’14]. A sketch of the pattern follows below.
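A minimal Java sketch of the bug pattern Aspirator hunts for, with hypothetical method names: a caught exception is silently swallowed, so callers proceed as if the operation succeeded.

```java
import java.io.IOException;
import java.util.logging.Logger;

// Classic error-handling bug: the swallowed exception silently turns a
// disk failure into later data loss. Method names are illustrative.
public class SwallowedError {
    static final Logger LOG = Logger.getLogger("demo");

    void flushEditLog() {
        try {
            syncToDisk();
        } catch (IOException e) {
            // BUG: swallowed -- callers believe the edits are durable
        }
    }

    void flushEditLogFixed() {
        try {
            syncToDisk();
        } catch (IOException e) {
            LOG.severe("edit log sync failed: " + e);
            throw new IllegalStateException("cannot guarantee durability", e);
        }
    }

    void syncToDisk() throws IOException {
        throw new IOException("disk failure");   // simulated fault
    }

    public static void main(String[] args) {
        new SwallowedError().flushEditLog();
        System.out.println("flush returned normally despite disk failure");
    }
}
```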
Software bug types: Optimization (15%)
Software bug types: Configuration (14%)
– Automating Configuration Troubleshooting. [OSDI ’10]
– Precomputing Possible Configuration Error Diagnoses. [ASE ’11]
– Do Not Blame Users for Misconfigurations. [SOSP ’13]
Software bug types: Race (12%)
– < 50% are local concurrency bugs: buggy thread interleavings; tons of existing work
– > 50% are distributed concurrency bugs: reorderings of messages, crashes, and timeouts; more work is needed – SAMC [OSDI ’14] (a toy example follows below)
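A toy distributed concurrency bug, not drawn from any specific issue in the study: the master assumes a node's REGISTER message is processed before its first HEARTBEAT, and a reordered delivery crashes the handler.

```java
import java.util.HashMap;
import java.util.Map;

// Toy distributed race: message reordering violates an implicit
// "REGISTER arrives first" assumption. Illustrative names only.
public class ReorderBug {
    static Map<String, Long> lastSeen = new HashMap<>();

    static void onRegister(String node) { lastSeen.put(node, 0L); }

    static void onHeartbeat(String node, long ts) {
        // BUG: NullPointerException if the HEARTBEAT wins the race,
        // because get() returns null and unboxing to long fails.
        long prev = lastSeen.get(node);
        lastSeen.put(node, Math.max(prev, ts));
    }

    public static void main(String[] args) {
        onHeartbeat("node1", 42);   // reordered delivery -> crash here
        onRegister("node1");
    }
}
```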
Software bug types: Hang (4%)
– Classical deadlocks (a minimal example below)
– Un-served jobs, stalled operations, … What are the root causes? How do we detect them?
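For the classical-deadlock case, a minimal Java example: two threads take the same two locks in opposite order and block forever (this program intentionally never terminates).

```java
// Textbook deadlock: opposite lock-acquisition order across two threads.
public class DeadlockDemo {
    static final Object lockA = new Object();
    static final Object lockB = new Object();

    public static void main(String[] args) {
        new Thread(() -> {
            synchronized (lockA) {
                sleep(50);
                synchronized (lockB) { }       // waits for the other thread forever
            }
        }).start();
        new Thread(() -> {
            synchronized (lockB) {
                sleep(50);
                synchronized (lockA) { }       // waits for the other thread forever
            }
        }).start();
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) { }
    }
}
```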
Software bug types: Space (4%)
– Big data + leak = big leak
– Clean-up operations must be flawless (sketch below)
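A sketch of the leak pattern, with illustrative sizes: a per-request buffer is released only on the success path, so every failed request strands memory, and at cloud request volumes a small leak becomes a big leak.

```java
import java.util.ArrayList;
import java.util.List;

// "Big data + leak = big leak": cleanup skipped on the failure path.
public class CleanupLeak {
    static final List<byte[]> inFlight = new ArrayList<>();

    static void handleRequest(boolean fails) {
        byte[] buf = new byte[1 << 20];       // 1 MB per request
        inFlight.add(buf);
        if (fails) return;                    // BUG: early exit skips cleanup
        inFlight.remove(buf);                 // only reached on success
    }

    static void handleRequestFixed(boolean fails) {
        byte[] buf = new byte[1 << 20];
        inFlight.add(buf);
        try {
            if (fails) return;
        } finally {
            inFlight.remove(buf);             // cleanup on every path
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) handleRequest(i % 2 == 0);
        System.out.println("stranded buffers: " + inFlight.size());  // 50
    }
}
```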
Software bug types: Load (4%)
– Happens when systems face high request load
– Relates to QoS and admission control (sketch below)
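A minimal sketch of the admission-control idea this category points to, using a semaphore to bound in-flight requests; the limit and names are illustrative, not from any of the studied systems.

```java
import java.util.concurrent.Semaphore;

// Bound concurrent requests instead of letting a burst take the node down.
public class AdmissionControl {
    static final Semaphore slots = new Semaphore(64); // max in-flight requests

    static boolean tryServe(Runnable request) {
        if (!slots.tryAcquire()) return false;  // shed load early
        try {
            request.run();
            return true;
        } finally {
            slots.release();
        }
    }

    public static void main(String[] args) {
        boolean served = tryServe(() -> System.out.println("handled"));
        System.out.println("admitted: " + served);
    }
}
```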
Overview of results
– Aspects (classical)
– Aspects (unique): data consistency, scalability, topology, QoS
– Hardware faults vs. software faults
– Implications
Implications
– Failed operation (42%)
– Performance (23%)
– Downtimes (18%)
– Data loss (7%)
– Data corruption (5%)
– Data staleness (5%)
Root causes: every implication can be caused by all kinds of hardware and software faults!
“Killer” bugs: bugs that simultaneously affect multiple nodes or even the entire cluster. Single points of failure still exist in many forms:
– positive feedback loops (sketch below)
– buggy failover
– repeated bugs after failover
– …
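A toy simulation of one killer-bug shape, the positive feedback loop, with made-up constants: timeouts trigger retries, retries raise load, and higher load causes more timeouts until the whole cluster spirals down.

```java
// Toy positive feedback loop: retry amplification under overload.
public class RetryStorm {
    public static void main(String[] args) {
        double load = 0.9;                     // fraction of capacity
        for (int round = 1; round <= 6; round++) {
            // Past ~80% of capacity, the timeout rate climbs quickly.
            double timeoutRate = Math.min(1.0, Math.max(0, load - 0.8) * 5);
            load = 0.9 + load * timeoutRate;   // each timeout adds a retry
            System.out.printf("round %d: load=%.2f timeouts=%.0f%%%n",
                              round, load, timeoutRate * 100);
        }
    }
}
```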
Outline
– Introduction
– Methodology
– Overview of results
– Other CBS database use cases
– Conclusion
CBS database
– 50+ per-system and aggregate graphs from mining the CBS database over the past year
– Still more waiting to be studied…
Components with the most issues: cross-system issues are prevalent! How should we enhance reliability for interactions among multiple cloud systems?
Most challenging types of issues
Top k% of most complicated issues
System evolution: Hadoop 2.0
Conclusion
– One of the largest bug studies for cloud systems
– Many interesting findings, but more questions can be raised from our analysis: What types of performance issues exist? What are the root causes of hang issues? …
– The Cloud Bug Study (CBS) database
Thank you! http://ucare.cs.uchicago.edu/