CHARACTERIZING CLOUD COMPUTING HARDWARE RELIABILITY Authors: Kashi Venkatesh Vishwanath and Nachiappan Nagappan. Presented by: Vibhuti Dhiman

OUTLINE » 1. Introduction » 2. Datacenter Characterization » 3. Characterizing Faults » 4. Failure Patterns » 5. Related Work » 6. Conclusion

INTRODUCTION
Background: "Hardware component failure is the norm rather than the exception." The presence of survivable networks is insufficient; what if the source and destination computing resources themselves fail?
Abstract: Datacenters (DCs) host hundreds of thousands of servers networked via hundreds of switches/routers that communicate with each other to coordinate tasks in order to deliver cloud computing services.

The servers in turn consist of multiple hard disks, memory modules, network cards, processors, etc., each of which can fail. The paper's focus is a detailed analysis of component failures; it ties together component failure patterns to arrive at server failure rates for the DCs.
Paper Objectives:
1. Explore the relationship between failures and a large number of factors, for instance the age of the machine.
2. Quantify the relationship between successive failures on the same machine.
3. Perform predictive exploration in a DC to mine for factors that explain the reasons behind failures.

Show empirically that the reliability of machines that have already seen a hardware failure in the past is markedly different from that of servers that have not seen any such event.

OUTLINE » 1. Introduction » 2. Datacenter Characterization » 3. Characterizing Faults » 4. Failure Patterns » 5. Related Work » 6. Conclusion

Data Sources used in the study
1. Inventory of machines: a variety of information regarding the servers, for instance a unique serial number to identify the server, the location of the datacenter, and the role of the machine.
2. Hardware replacements: drawn from the trouble tickets filed for hardware incidents; includes information such as when the ticket was filed and how the fault was fixed.
3. Configuration of machines: used to track the failure rate of individual components, for instance the number of hard disks and memory modules, their serial IDs, and the associated server ID.
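Since all three sources are keyed on the server identity, they can be combined into a single analysis table. Below is a minimal sketch of such a join in pandas; the file names, column names, and schema are assumptions for illustration, not the authors' actual pipeline.

```python
import pandas as pd

# Sketch: join the three data sources into one per-incident record keyed on
# the server serial number. File and column names are hypothetical.
inventory     = pd.read_csv("inventory.csv")      # serial_no, datacenter, role, ...
replacements  = pd.read_csv("replacements.csv")   # serial_no, ticket_date, component, fix, ...
configuration = pd.read_csv("configuration.csv")  # serial_no, num_disks, num_dimms, ...

incidents = (replacements
             .merge(inventory, on="serial_no", how="left")
             .merge(configuration, on="serial_no", how="left"))

print(incidents.head())
```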

Server Inventory (nature and configuration of machines used in the dataset)
1. Subset of machines: details on part replacement for over 100,000 servers.
2. Age profile of machines: the age of the machine when a fault/repair happened. 90% of the machines in the study were less than 4 years old, but there were also machines around 9 years old.
3. Machine configuration: on average there were 4 disks and 5 memory modules per server. 60% of the servers have only 1 disk, but 20% of the servers have more than 4 disks.

OUTLINE » 1. Introduction » 2. Datacenter Characterization » 3. Characterizing Faults » 4. Failure Patterns » 5. Related Work » 6. Conclusion

Some Statistics
All numbers reported henceforth are normalized to 100 servers.
The authors observed a total of 20 replacements in a period of 14 months, spread across around 9 machines. This is an Annual Failure Rate (AFR) of about 8%.
The average number of repairs seen by a 'repaired' machine is 2.
The cost per server repair (which includes downtime, the IT ticketing system to dispatch a technician, and the hardware itself) is $300. This amounts to close to 2.5 million dollars per year for 100,000 servers.
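A small worked check of the arithmetic behind these figures, using only the normalized numbers quoted above (the constant names are ours):

```python
# Worked arithmetic behind the statistics above (numbers are per 100 servers
# over the 14-month observation window, as reported on the slide).
MONTHS_OBSERVED = 14
REPLACEMENTS_PER_100 = 20      # total part replacements observed
FAILED_MACHINES_PER_100 = 9    # machines that saw at least one replacement
COST_PER_SERVER_REPAIR = 300   # USD: downtime + ticketing/technician + hardware

# Annual Failure Rate: fraction of machines failing, scaled to a 12-month year.
afr = (FAILED_MACHINES_PER_100 / 100) * (12 / MONTHS_OBSERVED)
print(f"AFR ~ {afr:.1%}")  # ~7.7%, i.e. roughly 8%

# Average repairs per repaired machine.
print(f"Repairs per repaired machine ~ {REPLACEMENTS_PER_100 / FAILED_MACHINES_PER_100:.1f}")

# Annual repair bill for a 100,000-server fleet at $300 per failing server.
fleet = 100_000
print(f"Annual cost ~ ${afr * fleet * COST_PER_SERVER_REPAIR:,.0f}")  # ~$2.3M, close to $2.5M
```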

Classifying Failures for Servers
Hard disks are not only the most replaced component, they are also the dominant reason behind server failure!

Failure Rate for Components

Component Failure Rate estimation
Look at the total number of components of each type and determine the total number of failures of the corresponding type. The numbers are an approximation, since the tickets omit certain information, such as which one of the many hard disks in a RAID array failed. The percentage is obtained by dividing the total number of replacements by the total number of components. This can result in double counting the disks in a RAID array, so the values reported are an upper bound on the individual component failure rate.
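A minimal sketch of this estimation; the component counts below are made up purely for illustration, and only the division and the upper-bound caveat follow the slide:

```python
from collections import Counter

# Illustrative (made-up) counts of installed components and replacement tickets.
installed    = Counter({"hard_disk": 400_000, "memory_module": 500_000, "raid_controller": 90_000})
replacements = Counter({"hard_disk": 10_000, "memory_module": 1_200, "raid_controller": 250})

for component, total in installed.items():
    # Tickets do not say which disk in a RAID array failed, so one incident may
    # be charged against several disks; the percentage is thus an upper bound.
    rate = replacements[component] / total
    print(f"{component}: {rate:.2%} replacement rate (upper bound)")
```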

Age distribution of hard disk failures

Number of repairs against age in weeks: in the initial stage, the number of repairs grows approximately exponentially with age; then, as saturation begins, growth slows to a roughly constant rate and eventually tapers off.

Classifying Failures - Second Technique: Classification Trees
Goal: to see whether failures can be predicted using metrics collected from the environment, operation and design of the servers in the DC.
Metrics used: datacenter name; location; manufacturer; design (number of disks, memory capacity).
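The slides do not name the tooling; below is a minimal sketch of this approach using a standard decision-tree learner (scikit-learn), assuming a hypothetical per-server table whose columns mirror the metrics listed above plus a had_failure label:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical per-server table: one row per server, with the metrics listed
# above and a boolean label saying whether the server saw a hardware failure.
servers = pd.read_csv("servers.csv")

categorical = ["datacenter", "location", "manufacturer"]
numeric = ["num_disks", "memory_gb"]

X = servers[categorical + numeric].copy()
X[categorical] = OrdinalEncoder().fit_transform(X[categorical])
y = servers["had_failure"]

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Feature importances hint at which metrics best separate failing servers;
# the slides report datacenter and manufacturer as the strongest indicators.
for name, importance in zip(X.columns, tree.feature_importances_):
    print(f"{name}: {importance:.2f}")
```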

Important observations from Classification Trees
1. The age of the server, the configuration of the server, the location of the server within a rack, and the workload run on the machine were not found to be significant indicators of failures.
2. The actual DATACENTER in which the server is located could play an important role in the reliability of the system.
3. The MANUFACTURER is also an interesting result, as different hardware vendors have different inherent reliability values associated with them.

OUTLINE » 1. Introduction » 2. Datacenter Characterization » 3. Characterizing Faults » 4. Failure Patterns » 5. Related Work » 6. Conclusion

Examine a number of different predictors for failures
Metric used: Repairs Per Machine (RPM), obtained by dividing the total number of repairs by the total number of machines.
Process to plot the graph:
1. Group machines based on the number of hard disks they contain.
2. Look for strong indicators of failure rate in the number of servers, the average age, and the number of hard disks.
3. Plot the RPM as a function of the number of hard disks in a server.
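A minimal sketch of the RPM computation, grouping machines by disk count; the tiny table below is fabricated purely to illustrate the steps:

```python
import pandas as pd

# Made-up per-machine data: disk count and number of repair events per machine.
machines = pd.DataFrame({
    "machine_id":  [1, 2, 3, 4, 5, 6],
    "num_disks":   [1, 1, 4, 4, 12, 12],
    "num_repairs": [0, 1, 0, 2, 3, 5],
})

# Repairs per machine (RPM) for each disk-count bucket, over all machines.
rpm_all = machines.groupby("num_disks")["num_repairs"].mean()

# The same metric restricted to machines that saw at least one repair event,
# which is where the slides report the clearest structure.
repaired = machines[machines["num_repairs"] > 0]
rpm_repaired = repaired.groupby("num_disks")["num_repairs"].mean()

print(rpm_all)
print(rpm_repaired)
```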

Repairs per machine as a function of number of disks. This includes all machines, not just those that were repaired.

Repairs per machine as a function of number of disks. This is only for machines that saw at least 1 repair event.

Understanding Failure Patterns - To Summarize:
» There is some structure present in the failure characteristics of servers that have already seen a failure event in the past.
» There is no such obvious pattern in the aggregate set of machines.
» The number of repairs on a machine shows a very strong correlation with the number of disks the machine has.

Further understanding Successive Failures
Observation: 20% of all repeat failures happen within a day of the first failure; 50% of all repeat failures happen within 2 weeks of the first failure.
The distribution of days between successive failures fits an inverse curve very well.
Successive Failures:
» The general form of the inverse equation is D = C1 + C2/N, where D is the number of days between successive failures, C1 and C2 are constants, and N is the number of repeat repairs observed.
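A sketch of fitting this inverse form with scipy; the (N, D) pairs below are fabricated solely to illustrate the fitting step, not taken from the study:

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit the inverse form D = C1 + C2/N from the slide to some (N, D) pairs.
N = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
D = np.array([30.0, 16.0, 11.0, 9.0, 7.5, 6.8])

def inverse_curve(n, c1, c2):
    return c1 + c2 / n

(c1, c2), _ = curve_fit(inverse_curve, N, D)
print(f"Fitted curve: D = {c1:.1f} + {c2:.1f}/N")
```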

OUTLINE » 1. Introduction » 2. Datacenter Characterization » 3. Characterizing Faults » 4. Failure Patterns » 5. Related Work » 6. Conclusion

» Jeffrey Dean presented numbers and experiences from running the Google infrastructure. He observed that disk AFR is in the range 1-5% and the server crash rate is in the range 2-4%.
» Google - they classified all faults and found that software-related errors are around 35%, followed by configuration faults at around 30%. Human and networking-related errors are 11% each, and hardware errors are less than 10%.
» Pinheiro et al. [15] - they find that disk reliability ranges from 1.7% to 8.6%, and that temperature and utilization have low correlation with failures.
» Weihand et al. - their conclusion is that disk failure rate is not indicative of storage subsystem failure rate.

OUTLINE » 1. Introduction » 2. Datacenter Characterization » 3. Characterizing Faults » 4. Failure Patterns » 5. Related Work » 6. Conclusion

Cloud computing infrastructure puts the onus on the underlying software, which in turn runs on commodity hardware. This makes cloud computing infrastructure vulnerable to hardware failures.
Hard disks are the number ONE replaced component.
8% of servers can expect to see at least ONE hardware incident in a given year.
Upon seeing a failure, the chance of seeing another failure on the same server is high. The authors observe that the distribution of successive failures on a machine fits an inverse curve.
It is also observed that the location of the datacenter and the manufacturer are the strongest indicators of failures.

Limitations: The reports are based on a limited time period of 14 months. The results are potentially biased by the environmental conditions, technology, workload characteristics, etc. prevalent during that period. The authors do not investigate the cause of each fault or even its timing; the investigation covers only the repair events at a coarse scale and what model they fit.