What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems Authors: Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake,

Presentation transcript:

What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. Authors: Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. Presenter: Richeng Huang 1

This is the cloud computing era! Cloud systems are developing rapidly. They are complex, and their dependability needs to improve. What bugs do we have? How do we classify them? Are there cloud-unique bugs? How should dependability tools improve? 2

Cloud Bug Study (CBS). 6 target systems: Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume. 1-year study. Issues in a 3-year window: Jan 2011 to Jan 2014. ~21,000 issues reviewed. ~3,600 (17%) “vital” issues for in-depth study. Vital: affects real deployed systems. 3

Why these 6 systems? Distributed cloud computing frameworks, scalable storage systems, distributed key-value stores, synchronization services, streaming systems. 4

Methodology Issue Repositories Analysis Issue Classifications Cloud Bug Study DB (CBSDB) 5

Issue Repositories. Luckily, each Apache Software Foundation project maintains a highly organized issue repository. For example: ZooKeeper’s issue repository. 6

Example: Title, Time to resolve, Description, Type & Priority, Discussion. 7

Several Classifications. Aspects: reliability, performance, availability, security, consistency, scalability, topology, QoS. Hardware: processor, disk, memory, network, node. Hardware failures: corrupt, limp, stop. Software bug types: logic, error handling, optimization, config, race, hang, space, load. Implications: failed operation, performance, component downtime, data loss, data staleness, data corruption. 8
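
The classification axes on this slide can be read as a small tagging schema. The sketch below is only an illustration of that idea, not the CBSDB format: the `TaggedIssue` record, its field names, and the `count_by_aspect` helper are assumptions made for this example, and the two sample issues are invented.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

# Category names are taken from the slide; the enum layout is an assumption.
Aspect = Enum("Aspect", "RELIABILITY PERFORMANCE AVAILABILITY SECURITY "
                        "CONSISTENCY SCALABILITY TOPOLOGY QOS")
Hardware = Enum("Hardware", "PROCESSOR DISK MEMORY NETWORK NODE")
HardwareFailure = Enum("HardwareFailure", "CORRUPT LIMP STOP")
BugType = Enum("BugType", "LOGIC ERROR_HANDLING OPTIMIZATION CONFIG "
                          "RACE HANG SPACE LOAD")
Implication = Enum("Implication", "FAILED_OPERATION PERFORMANCE DOWNTIME "
                                  "DATA_LOSS DATA_STALENESS DATA_CORRUPTION")


@dataclass(frozen=True)
class TaggedIssue:
    """Hypothetical record: one issue may carry several tags per axis."""
    issue_id: str
    aspects: frozenset
    bug_types: frozenset = frozenset()
    implications: frozenset = frozenset()
    hardware: frozenset = frozenset()
    hardware_failures: frozenset = frozenset()


def count_by_aspect(issues):
    """Count how many issues carry each aspect tag."""
    counts = Counter()
    for issue in issues:
        counts.update(issue.aspects)
    return counts


# Invented sample data, for illustration only.
sample = [
    TaggedIssue("demo-1", frozenset({Aspect.RELIABILITY, Aspect.CONSISTENCY}),
                bug_types=frozenset({BugType.RACE})),
    TaggedIssue("demo-2", frozenset({Aspect.RELIABILITY}),
                implications=frozenset({Implication.FAILED_OPERATION})),
]
aspect_counts = count_by_aspect(sample)
```

Because an issue can have more than one tag on an axis, per-axis counts from a schema like this can add up to more than the number of issues.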

Aspects: Reliability. Reliability (45%): operation and job failures/errors, data loss/corruption/staleness. 9 Legend: CS = Cassandra, FL = Flume, HB = HBase, HD = HDFS, MR = MapReduce, ZK = ZooKeeper

Aspects: Performance Reliability (45%) Performance (22%) 10

Aspects: Availability Reliability (45%) Performance (22%) Availability(16%) 11

Aspects: Security Reliability (45%) Performance (22%) Availability(16%) Security(8%) 12

There are new aspects in cloud systems. Classical: Reliability (45%), Performance (22%), Availability (16%), Security (8%). New: data consistency, scalability, topology, QoS. 13

Aspects: Data consistency. Data consistency (5%): permanently inconsistent replicas. Various root causes: buggy operational protocols, concurrency bugs, and node failures. 14

Aspects Reliability (45%) Performance (22%) Availability(16%) Security(8%) Data consistency (5%) Scalability (2%) Topology(1%) QoS (1%) 15 Small numbers, but important; hard to test at small scale

Aspects Reliability (45%) Performance (22%) Availability(16%) Security(8%) Data consistency (5%) Scalability (2%) Topology(1%) QoS (1%) 16 Cross DC, Different racks

Aspects Reliability (45%) Performance (22%) Availability(16%) Security(8%) Data consistency (5%) Scalability (2%) Topology(1%) QoS (1%) 17 Typically in vertical/cross-system QoS.

Killer Bugs: bugs that simultaneously affect multiple nodes or even the entire cluster. SPoF (single point of failure) still exists in many forms: positive feedback loop, buggy failover, repeated bugs after failover, distributed deadlock, … 18

Killer Bugs. The figure shows heat maps of the correlation between the scope of killer bugs (multiple nodes or whole cluster) and hardware/software root causes. A killer bug can have multiple root causes. The number in each cell is the bug count. 19

Positive feedback loop 20 [Figure: two loop diagrams. Labels: Failure, Recovery, High load, More false failures. Example case in Cassandra: More nodes, High gossip traffic, More false failures.]
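
A toy model can make the Cassandra loop concrete. The sketch below is not Cassandra's gossip protocol: the linear load formula, every parameter value, and the `simulate_gossip_loop` name are assumptions chosen only to show how a cluster just past a load threshold keeps growing instead of recovering.

```python
def simulate_gossip_loop(nodes, rounds, gossip_cost_per_peer=0.01,
                         base_load=0.5, overload_threshold=1.0,
                         nodes_added_per_alarm=10):
    """Return the cluster size after each round of a toy feedback loop.

    Assumed model: per-node load grows linearly with cluster size because
    every node gossips with more peers. All numbers are illustrative.
    """
    history = []
    for _ in range(rounds):
        load = base_load + gossip_cost_per_peer * nodes
        if load > overload_threshold:
            # Overloaded nodes miss heartbeats, so healthy peers are falsely
            # declared dead; an operator or elasticity tool reacts by adding
            # nodes, which raises gossip traffic further.
            nodes += nodes_added_per_alarm
        history.append(nodes)
    return history


stable = simulate_gossip_loop(nodes=40, rounds=5)    # stays below threshold
runaway = simulate_gossip_loop(nodes=60, rounds=3)   # already above threshold
```

In this model the 40-node cluster never changes size, while the 60-node cluster grows every round, because each corrective action increases the load that triggered it.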

Repeated bugs after failover. A key to no-SPoF: after a successful failover, the system should resume the previously failed operation. But for software bugs, after a failover the system will run the same buggy logic again… In HBase, a region server dies due to bad handling of corrupt region files; the failed regions are then assigned to a live region server, which runs the same code and also dies. Eventually, all region servers go offline. 21
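
The HBase scenario on this slide can be mimicked with a few lines of Python. This is a toy illustration and not HBase code: `region_server_open`, `failover`, the dictionary-based region, and the server names are all hypothetical.

```python
def region_server_open(region):
    """Toy handler with a deliberate bug: it crashes on a corrupt region file."""
    if region["corrupt"]:
        raise RuntimeError("unhandled corrupt region file")
    return "serving"


def failover(region, servers):
    """Assign `region` to each live server in turn.

    Returns (dead_servers, serving_server); serving_server is None when
    every server crashed on the same region.
    """
    dead = []
    for server in servers:
        try:
            region_server_open(region)
            return dead, server      # region is served; failover succeeded
        except RuntimeError:
            dead.append(server)      # same buggy code path, same crash
    return dead, None


servers = ["rs1", "rs2", "rs3"]
cascade = failover({"corrupt": True}, servers)
healthy = failover({"corrupt": False}, servers)
```

With a corrupt region every server in the list dies in turn, which is the sense in which failover alone does not remove a single point of failure caused by a software bug.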

HW faults vs. SW faults 22

HW faults and modes. 299 cases of improper handling of node fail-stop failures. A memory card running at 25% of normal speed causes problems in an HBase deployment. 23

Software bug types: Logic (29%), Error handling (18%), Optimization (15%), Configuration (14%), Data race (12%), Hang (4%): deadlock, Space (4%), Load (4%). 24 Chart labels: Logic, Err-h, Opt, Config, Race, Hang, Space, Load

Implications: Failed operation (42%), Performance (23%), Downtimes (18%), Data loss (7%), Data corruption (5%), Data staleness (5%). 25 Chart labels: Opfail, Perf, Down, Loss, Stale, Corrupt

Software/Hardware Faults & Implications 26 Catch all faults! Still a long way from a highly dependable system.

Cloud Bug Study database (CBSDB): a total of 21,399 issues (3,655 vital). Open to the public. Bug evolution analysis. 27
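
As a quick check, the two counts on this slide agree with the roughly 17% share of vital issues quoted earlier in the talk. The variable names below are just labels for this calculation.

```python
TOTAL_ISSUES = 21_399   # issues reviewed, from the slide
VITAL_ISSUES = 3_655    # issues marked vital, from the slide

vital_share = VITAL_ISSUES / TOTAL_ISSUES
print(f"{vital_share:.1%}")  # prints 17.1%
```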

System evolution 28 Hadoop 2.0

Conclusion. The largest bug study for cloud systems to date. Provides insights into many intricate bugs. Unique bugs in cloud systems. Killer bugs. Cloud Bug Study (CBS) database. 29

Comments. This study involves a huge amount of human effort, which is neither efficient nor maintainable. The study finds the distribution of issues but offers no suggestions or solutions for them. The study analyzes issues that have all been resolved. This information is retrievable from the repositories. Experts and developers can get the implications from the issue reports themselves. CBSDB is not active, and maintaining it takes a large amount of time. The authors did not explicitly say how we are supposed to use this study for future development. 30

Thoughts and Discussion from Piazza. Combine machine learning and NLP techniques for the classification and tagging task. - Hongwei Wang. They don’t provide a possible solution to the problem “why are cloud systems not 100% dependable?” - Eric Badger. They say cloud systems are still far from 100% dependable. An automatic analysis tool is needed. - Sanchit Gupta 31