CHARACTERIZING CLOUD COMPUTING HARDWARE RELIABILITY Authors: Kashi Venkatesh Vishwanath ; Nachiappan Nagappan Presented By: Vibhuti Dhiman
OUTLINE » 1.Introduction » 2. Datacenter Characterization » 3. Characterizing Faults » 4. Failure Patterns » 5. Related work » 6. Conclusion
INTRODUCTION
Background: “Hardware component failure is the norm rather than exception.” The presence of survivable networks is insufficient: what if the source and destination computing resources themselves fail?
Abstract: Datacenters (DCs) host hundreds of thousands of servers, networked via hundreds of switches and routers, that communicate with each other to coordinate tasks and deliver cloud computing services.
The servers in turn consist of multiple hard disks, memory modules, network cards, processors, etc., each of which is capable of failing. The paper presents a detailed analysis of component failures and ties component failure patterns together to arrive at server failure rates for the DCs.
Paper Objectives:
» Explore the relationship between failures and a large number of factors, for instance the age of the machine.
» Quantify the relationship between successive failures on the same machine.
» Perform a predictive exploration in a DC to mine for factors that explain the reasons behind failures.
» Show empirically that the reliability of machines that have already seen a hardware failure is completely different from that of servers that have not seen any such event.
DATACENTER CHARACTERIZATION
Data Sources Used in the Study
1. Inventory of machines: a variety of information about each server, for instance a unique serial number identifying the server, the location of the datacenter, and the role of the machine.
2. Hardware replacements: part of the trouble tickets filed for hardware incidents. It includes information such as when the ticket was filed and how the fault was fixed.
3. Configuration of machines: used to track the failure rate of individual components, for instance the number of hard disks and memory modules, their serial IDs, and the associated server ID.
Server Inventory (nature and configuration of the machines in the dataset)
1. Subset of machines: details on part replacements for over 100,000 servers.
2. Age profile of machines: the age of each machine when a fault or repair happened. 90% of the machines in the study were less than 4 years old, but some machines were around 9 years old.
3. Machine configuration: on average there were 4 disks and 5 memory modules per server. 60% of the servers have only 1 disk, but 20% of the servers have more than 4 disks.
CHARACTERIZING FAULTS
Some Statistics
» All numbers reported henceforth are normalized to 100 servers.
» The authors observed a total of 20 replacements over a period of 14 months, spread across around 9 machines. This corresponds to an Annual Failure Rate (AFR) of about 8%.
» The average number of repairs seen by a ‘repaired’ machine is 2.
» The cost per server repair (which includes downtime, the IT ticketing system to send a technician, and hardware repairs) is $300. This amounts to close to 2.5 million dollars for 100,000 servers.
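The AFR and the fleet-wide cost can be reproduced from the figures on this slide. The following is a minimal sketch, assuming that the 14-month count of repaired machines scales linearly to 12 months and that each failing server incurs one $300 repair per year; the function and variable names are illustrative and not from the paper.

```python
# Figures taken from the slide, normalized to 100 servers.
SERVERS = 100
MACHINES_REPAIRED = 9        # machines that saw at least one replacement
REPLACEMENTS = 20            # total replacements in the window
OBSERVATION_MONTHS = 14
COST_PER_REPAIR_USD = 300
FLEET_SIZE = 100_000


def annual_failure_rate(failed, total, months):
    """Fraction of machines failing per year.

    Assumption: failures scale linearly from the observation
    window to a 12-month year.
    """
    return (failed / total) * (12 / months)


afr = annual_failure_rate(MACHINES_REPAIRED, SERVERS, OBSERVATION_MONTHS)
repairs_per_repaired_machine = REPLACEMENTS / MACHINES_REPAIRED

# Assumption: the fleet cost uses the rounded 8% AFR and one repair
# per failing server per year.
fleet_cost_usd = round(afr, 2) * FLEET_SIZE * COST_PER_REPAIR_USD

print(f"AFR: {afr:.1%}")
print(f"Repairs per repaired machine: {repairs_per_repaired_machine:.1f}")
print(f"Fleet cost: ${fleet_cost_usd:,.0f}")
```

Under these assumptions the AFR comes out just under 8%, and the fleet cost lands at roughly 2.4 million dollars, consistent with the "close to 2.5 million" quoted on the slide.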
Classifying Failures for Servers
Hard disks are not only the most replaced component; they are also the dominant reason behind server failure.
Failure Rate for Components
Component Failure Rate Estimation
» Count the total number of components of each type and the total number of failures of that type.
» The numbers are approximations because the data omit certain details, such as which of the hard disks in a RAID array failed.
» The percentage is obtained by dividing the total number of replacements by the total number of components.
» This can double count disks in a RAID array, so the reported values are an upper bound on the individual component failure rate.
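The estimate described above is a simple ratio per component type. The sketch below illustrates it; the installed counts follow the per-server averages quoted earlier (4 disks and 5 memory modules across 100 servers), but the replacement counts are hypothetical placeholders, not values from the paper.

```python
def component_failure_rate(replacements, installed):
    """Replacements divided by installed components of one type.

    Because a replacement ticket may not say which disk in a RAID
    array failed, this ratio is an upper bound on the true
    per-component failure rate.
    """
    if installed <= 0:
        raise ValueError("installed must be positive")
    return replacements / installed


# Installed counts per 100 servers, from the slide's per-server averages.
installed = {"hard disk": 400, "memory module": 500}

# HYPOTHETICAL replacement counts, for illustration only.
replaced = {"hard disk": 10, "memory module": 2}

rates = {
    part: component_failure_rate(replaced[part], installed[part])
    for part in installed
}

for part, rate in rates.items():
    print(f"{part}: at most {rate:.1%}")
```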
Age distribution of hard disk failures
Number of repairs against age in weeks
» In the initial stage, growth is approximately exponential; as saturation begins, growth slows and the curve eventually flattens.
» That is, failures grow almost exponentially with age, then grow at a constant rate after a saturation phase, and eventually taper off.
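One curve with this shape is the logistic function. The slide does not name a model, so the logistic form and every parameter value below are assumptions chosen only to illustrate the three phases (near-exponential start, slowdown, plateau).

```python
import math


def logistic(t, ceiling, rate, midpoint):
    """Logistic growth curve: near-exponential early, flat late.

    All parameters are illustrative assumptions:
      ceiling  - value the curve levels off at
      rate     - early exponential growth rate per week
      midpoint - week at which half of the ceiling is reached
    """
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


CEILING, RATE, MIDPOINT = 100.0, 0.1, 100.0

# Early on, each week multiplies the count by roughly exp(RATE).
early_ratio = logistic(1, CEILING, RATE, MIDPOINT) / logistic(0, CEILING, RATE, MIDPOINT)

# Late in life, week-over-week growth is negligible.
late_growth = logistic(301, CEILING, RATE, MIDPOINT) - logistic(300, CEILING, RATE, MIDPOINT)

print(f"early weekly growth factor: {early_ratio:.4f}")
print(f"late weekly increase: {late_growth:.2e}")
```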
Classifying Failures - Second Technique: Classification Trees
Goal: to see whether failures can be predicted using metrics collected from the environment, operation, and design of the servers in the DC.
Metrics used: datacenter name, location, manufacturer, and design (number of disks, memory capacity).
Important Observations from Classification Trees
1. The age of the server, the configuration of the server, the location of the server within a rack, and the workload run on the machine: none of these was found to be a significant indicator of failures.
2. The actual DATACENTER in which the server is located could have an important role to play in the reliability of the system.
3. The MANUFACTURER is also an interesting result: different hardware vendors have different inherent reliability values associated with them.
FAILURE PATTERNS
Examine a number of different predictors for failures
Metric used: Repairs Per Machine (RPM), obtained by dividing the total number of repairs by the total number of machines.
Process to plot the graph:
1. Group machines based on the number of hard disks they contain.
2. Look for strong indicators of failure rate in the number of servers, the average age, and the number of hard disks.
3. Plot the RPM as a function of the number of hard disks in a server.
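The grouping and the RPM calculation can be sketched as follows. The record layout (number of disks, number of repairs), the `only_repaired` flag, and the sample fleet are all hypothetical, made up to show the computation rather than taken from the paper's dataset.

```python
from collections import defaultdict


def repairs_per_machine(machines, only_repaired=False):
    """Return {disk_count: RPM} for (num_disks, num_repairs) records.

    RPM = total repairs in a group / number of machines in the group.
    With only_repaired=True, machines with zero repairs are excluded.
    """
    repairs = defaultdict(int)
    count = defaultdict(int)
    for num_disks, num_repairs in machines:
        if only_repaired and num_repairs == 0:
            continue
        repairs[num_disks] += num_repairs
        count[num_disks] += 1
    return {d: repairs[d] / count[d] for d in sorted(count)}


# HYPOTHETICAL fleet: (number of disks, number of repairs) per machine.
fleet = [(1, 0), (1, 0), (1, 1), (1, 0), (4, 2), (4, 0), (8, 3), (8, 1)]

rpm_all = repairs_per_machine(fleet)
rpm_repaired = repairs_per_machine(fleet, only_repaired=True)

print("all machines:     ", rpm_all)
print("repaired machines:", rpm_repaired)
```

Computing both variants from the same records makes it easy to compare the aggregate population with the subset of machines that saw at least one repair.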
Repairs per machine as a function of number of disks. This includes all machines, not just those that were repaired.
Repairs per machine as a function of number of disks. This is only for machines that saw at least 1 repair event.
Understanding Failure Patterns
To Summarize:
» There is some structure present in the failure characteristics of servers that have already seen some failure event in the past.
» There is no such obvious pattern in the aggregate set of machines.
» The number of repairs on a machine shows a very strong correlation to the number of disks the machine has.
Further Understanding Successive Failures
Observation: 20% of all repeat failures happen within a day of the first failure; 50% of all repeat failures happen within 2 weeks of the first failure.
The distribution of days between successive failures fits an inverse curve very well.
Successive Failures:
» The general form of the inverse equation is D = C1 + C2 / N, where D is the number of days between successive failures, C1 and C2 are constants, and N is the number of times the second repair occurred.
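Because D = C1 + C2 / N is linear in 1/N, the constants can be estimated with ordinary least squares on the transformed variable. The sketch below assumes that fitting approach (the slide does not say how the authors fitted the curve) and uses synthetic points generated from made-up constants, not the paper's data.

```python
def fit_inverse(ns, ds):
    """Least-squares fit of D = c1 + c2 / N; returns (c1, c2).

    Substituting x = 1 / N turns the model into the straight line
    D = c1 + c2 * x, which has a closed-form solution.
    """
    xs = [1.0 / n for n in ns]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_d = sum(ds) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxd = sum((x - mean_x) * (d - mean_d) for x, d in zip(xs, ds))
    c2 = sxd / sxx
    c1 = mean_d - c2 * mean_x
    return c1, c2


# SYNTHETIC data from made-up constants (C1 = 2, C2 = 40), used only
# to check that the fit recovers them.
ns = [1, 2, 4, 5, 10, 20]
ds = [2.0 + 40.0 / n for n in ns]

c1, c2 = fit_inverse(ns, ds)
print(f"C1 = {c1:.3f}, C2 = {c2:.3f}")
```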
RELATED WORK
» Jeffrey Dean presented numbers and experiences from running the Google infrastructure. He observed that disk AFR is in the range of 1 to 5% and the server crash rate is in the range of 2 to 4%.
» Google: They classified all faults and found that software-related errors account for around 35%, followed by configuration faults at around 30%. Human and networking-related errors are 11% each, and hardware errors are less than 10%.
» Pinheiro et al. [15]: They find that annualized disk failure rates range from 1.7% to 8.6%, and that temperature and utilization have low correlation to failures.
» Weihang Jiang et al.: Their conclusion is that disk failure rate is not indicative of storage subsystem failure rate.
CONCLUSION
» Cloud computing infrastructure puts the onus on the underlying software, which in turn runs on commodity hardware. This makes cloud computing infrastructure vulnerable to hardware failures.
» Hard disks are the number ONE replaced component.
» 8% of the servers can expect to see at least ONE hardware incident in a given year.
» Upon seeing a failure, the chance of seeing another failure on the same server is high.
» The authors observe that the distribution of successive failures on a machine fits an inverse curve.
» It is also observed that the location of the datacenter and the manufacturer are the strongest indicators of failures.
Limitations:
» The reports are based on a limited time period of 14 months.
» The results are potentially biased by the environmental conditions, technology, workload characteristics, etc. prevalent during that period.
» The authors do not investigate the cause of a fault or even its timing. The investigation covers only repair events at a coarse scale and which model they fit.