Published by Ralf Stokes. Modified over 9 years ago.
1
Advanced Topics in Storage Systems: Disk Failures
Based on:
  Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? - Bianca Schroeder and Garth A. Gibson. FAST 2007
  Failure Trends in a Large Disk Drive Population - Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso, Google Inc. FAST 2007
Presented by: Yaroslav Kagansky
2
Lecture Contents
Research methodology in this field.
MTTF and AFR: widely used, yet not very precise.
Various factors that affect a disk's lifetime.
Analysis of SMART data and its ability to predict future disk failures.
Conclusions and my point of view.
3
A few words about the papers
Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?
  Focuses on the accuracy of MTTF and AFR and on other common assumptions about disk failures.
  Based on hardware replacement and warranty service logs.
  Examines various rotation speeds and interfaces (e.g. SATA, SCSI, FC).
  Data was collected from several different organizations.
Failure Trends in a Large Disk Drive Population
  Focuses on building a disk failure prediction model.
  Data was collected by a software daemon running on Google's servers.
  Examines low-cost disks only (i.e. 5400/7200 rpm SATA drives).
  Based on data from Google only.
4
Research methodology
How should we define a ‘disk failure’?
  Both papers consider a drive to have failed if it was replaced as part of a repair procedure.
A hard drive is a very complicated system.
  Large amounts of data are needed to reach reliable conclusions.
How was the data collected?
  Google’s system (next slide).
  Hardware replacement and warranty service logs.
Ignoring bad batches.
5
The complexity of a storage system
6
Google’s data collection system
The daemon collects various types of information from Google's servers.
The data is stored in a central repository for later analysis (GFS format).
The data is analyzed with the MapReduce framework.
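The map and reduce steps of such an analysis can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Google's pipeline: the record fields `model` and `replaced` and the sample records are hypothetical.

```python
from collections import defaultdict

# Hypothetical drive records; the field names are assumptions for illustration.
records = [
    {"model": "A", "replaced": True},
    {"model": "A", "replaced": False},
    {"model": "B", "replaced": False},
]

def map_record(record):
    """Map step: emit one (model, (drives, replacements)) pair per record."""
    yield record["model"], (1, 1 if record["replaced"] else 0)

def reduce_counts(pairs):
    """Reduce step: sum drive and replacement counts per model."""
    totals = defaultdict(lambda: [0, 0])
    for model, (drives, replacements) in pairs:
        totals[model][0] += drives
        totals[model][1] += replacements
    return {model: tuple(counts) for model, counts in totals.items()}

pairs = [pair for record in records for pair in map_record(record)]
summary = reduce_counts(pairs)
print(summary)  # {'A': (2, 1), 'B': (1, 0)}
```

In a real MapReduce job the map calls run in parallel on many machines and the framework groups the pairs by key before the reduce step; here both steps run in one process.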
7
Reliability metrics
Annualized Failure Rate (AFR)
  The percentage of disk drives in a population that fail during a test, scaled to a per-year estimate.
  Provided by the vendors, and typically based on extrapolation from accelerated life test data of small populations or from returned-unit databases.
  Accelerated life tests don’t take environmental factors into account.
  Poor predictors of actual failure rates.
Mean Time To Failure (MTTF)
  The MTTF is estimated as the number of power-on hours per year divided by the AFR.
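The relation between MTTF and AFR can be checked with a short calculation. The sketch below assumes a drive that is powered on all year (8,760 hours); the function names are made up for this example.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, assuming the drive is powered on all year

def afr_from_mttf(mttf_hours, power_on_hours_per_year=HOURS_PER_YEAR):
    """AFR as a fraction, from MTTF = power-on hours per year / AFR."""
    return power_on_hours_per_year / mttf_hours

def mttf_from_afr(afr, power_on_hours_per_year=HOURS_PER_YEAR):
    """MTTF in hours, from an AFR given as a fraction (0.01 means 1%)."""
    return power_on_hours_per_year / afr

afr = afr_from_mttf(1_000_000)
print(f"{afr:.3%}")  # 0.876%
```

So an MTTF of 1,000,000 hours corresponds to an AFR just under 0.9%, which means that in a population of 1,000 such drives one would expect roughly nine failures per year.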
8
AFR inaccuracy
The figure shows a significant discrepancy between the observed annual replacement rate (ARR) and the datasheet AFR for all data sets.
While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%.
That is, the observed ARRs, by data set and type, are up to a factor of 15 higher than the datasheet AFRs.
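The "factor of 15" follows directly from the numbers on this slide. A quick check, assuming the highest observed ARR is compared with the highest datasheet AFR (the function name is made up for this example):

```python
def discrepancy_factor(observed_arr_percent, datasheet_afr_percent):
    """How many times higher the observed rate is than the datasheet rate."""
    return observed_arr_percent / datasheet_afr_percent

# Highest observed ARR (13.5%) against the highest datasheet AFR (0.88%).
factor = discrepancy_factor(13.5, 0.88)
print(round(factor, 1))  # 15.3
```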
9
Cumulative operating time
Failure rates of hardware products typically follow a “bathtub curve”, with high failure rates at the beginning (infant mortality) and at the end (wear-out) of the lifecycle.
The figure on this slide shows the failure rate pattern expected over the life cycle of hard drives.
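One common way to sketch a bathtub-shaped failure rate is to add a decreasing Weibull hazard (infant mortality), a constant baseline, and an increasing Weibull hazard (wear-out). The shape, scale and baseline values below are illustrative assumptions, not values taken from either paper.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t_years):
    """Illustrative bathtub curve: infant mortality + baseline + wear-out."""
    infant = weibull_hazard(t_years, shape=0.5, scale=2.0)    # shape < 1: decreasing
    wear_out = weibull_hazard(t_years, shape=4.0, scale=6.0)  # shape > 1: increasing
    baseline = 0.02                                           # constant "useful life" rate
    return infant + baseline + wear_out

for t in (0.1, 2.0, 7.0):
    print(t, round(bathtub_hazard(t), 3))
```

With these parameters the rate is high for very young drives, lowest in mid-life, and high again for old drives, which is the pattern the datasheet model expects and which the observed replacement rates do not follow.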
10
Age-dependent replacement rates
Replacement rates in all years except the first are higher than the datasheet values.
Replacement rates rise significantly over the years.
11
Age-dependent replacement rates
The steadily increasing replacement rate contradicts the common assumption that the replacement rate stays steady after the first year.
The figure on the previous slide shows that the early onset of wear-out seems to have a much stronger impact on lifecycle replacement rates than ‘infant mortality’.
12
Utilization
We define ‘utilization’ as the fraction of its powered-on time during which the drive is active.
We would expect a very strong correlation between high utilization and higher failure rates.
But the results paint a more complex picture than that.
13
Utilization
Only the very young and very old disk groups show the expected behavior.
It’s possible that the failure modes associated with higher utilization are more prominent early in a drive’s lifetime.
  The drives that survive the infant mortality phase are the least susceptible to those failure modes.
Previously reported high correlation between utilization and failures has been based on extrapolations from manufacturers’ accelerated life experiments.
  Those experiments are likely to model early-life failure characteristics better, and as such they agree with the trend observed for the young age groups.
14
Temperature
Temperature is often quoted as the most important environmental factor affecting disk drive reliability.
Previous studies have indicated that temperature deltas as low as 15 °C can nearly double disk drive failure rates.
But again, the results are very surprising.
15
Temperature
Temperature has a visible effect only at the high end of the observed range, and especially for older drives.
In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates.
We can conclude that at moderate temperatures other effects likely influence failure rates much more strongly than temperature does.
16
Failure rates and the Poisson process
The Poisson assumption implies that the number of failures during a given time interval (e.g. a week or a month) follows a Poisson distribution.
A key property of a Poisson process is that failures are independent of each other.
The time between failures also doesn’t fit an exponential distribution.
The researchers found a strong correlation between failures in consecutive weeks and months.
The correlation coefficient between consecutive weeks is 0.72, and between consecutive months it is 0.79.
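Two simple checks of the Poisson assumption can be sketched as follows: the correlation between the counts of consecutive intervals (near zero if failures are independent) and the ratio of variance to mean (1 for a Poisson distribution). The weekly counts below are made-up numbers, not data from the paper, and the helper names are assumptions for this example.

```python
from statistics import mean, pvariance

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def lag1_autocorrelation(counts):
    """Correlation between each interval's count and the next interval's count."""
    return pearson(counts[:-1], counts[1:])

def dispersion_index(counts):
    """Variance-to-mean ratio; equals 1 for a Poisson distribution."""
    return pvariance(counts) / mean(counts)

# Hypothetical weekly replacement counts, for illustration only.
weekly_failures = [2, 3, 5, 8, 9, 7, 4, 2, 1, 1, 3, 6]
print(lag1_autocorrelation(weekly_failures))
print(dispersion_index(weekly_failures))
```

A lag-1 correlation well above zero, like the 0.72 and 0.79 reported on this slide, indicates that a week or month with many failures tends to be followed by another one, which contradicts the independence assumption.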
17
Correlation between failures
The figure shows the number of disk replacements in a week as a function of the number of disk replacements in the previous week.
The poor fit to a Poisson process may be caused by the fact that failure rates aren’t steady over the lifetime of the system.
18
Using SMART data to predict failures
SMART: Self-Monitoring, Analysis and Reporting Technology.
The researchers tried to build a disk failure prediction model from data that can be read from a disk’s SMART parameters.
They looked for the SMART parameters that correlate most strongly with future failures.
Can we build a reliable failure prediction model based on SMART only?
19
Scan Errors
Large scan error counts can indicate surface defects, and are therefore believed to indicate lower reliability.
The researchers found that drives with scan errors are ten times more likely to fail than drives with no errors.
They also found that a higher number of errors lowers a disk's chance of survival.
20
Reallocation Counts
When the drive’s logic believes that a sector is damaged (typically as a result of recurring soft errors or a hard error), it can remap the faulty sector number to a new physical sector drawn from a pool of spares.
Reallocation counts reflect the number of times this has happened, and are seen as an indication of drive surface wear.
The researchers found that, after their first reallocation, drives are over 14 times more likely to fail within 60 days than drives without reallocation counts.
21
Probational Counts
Disk drives put suspect bad sectors “on probation” until they either fail permanently and are reallocated or continue to work without problems.
Probational counts can therefore be seen as a softer error indication.
Drives with nonzero probational counts are 16 times more likely to fail within 60 days than drives with zero probational counts.
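Statements of the form "N times more likely to fail", as on this slide and the two before it, compare the failure rate of drives that show a signal with the rate of drives that do not. A minimal sketch of that ratio, using made-up counts chosen only for illustration (they are not the paper's data):

```python
def relative_risk(failed_with, total_with, failed_without, total_without):
    """Failure rate of drives with the signal divided by the rate of drives without it."""
    rate_with = failed_with / total_with
    rate_without = failed_without / total_without
    return rate_with / rate_without

# Hypothetical counts: 80 of 1,000 drives with a nonzero count failed
# within the window, against 50 of 10,000 drives with a zero count.
ratio = relative_risk(80, 1_000, 50, 10_000)
print(round(ratio, 1))  # 16.0
```

A large ratio says that the signal is a strong warning when it is present. It says nothing about how many failing drives never show the signal at all.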
22
Other parameters that were studied
The researchers also examined other parameters, but found no strong correlation between them and disk failures.
Seek errors: occur when a disk drive fails to track a sector properly and has to wait for another revolution to read from or write to it.
  For some manufacturers, there is no correlation between failure rates and seek errors.
Power cycles: the power cycles indicator counts the number of times a drive is powered up and down.
  For drives up to 2 years old there is no significant correlation between failures and a high power cycle count, but for drives 3 years and older, higher power cycle counts can increase the absolute failure rate by over 2%.
23
Predictive Power of SMART Parameters
Given how strongly some SMART parameters were found to correlate with higher failure rates, the researchers were hopeful that accurate failure prediction models based on SMART signals could be created.
However, out of all failed drives, over 56% have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count.
In other words, models based only on those signals can never predict more than half of the failed drives.
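The limit described here can be computed as the share of failed drives that show at least one of the four signals, since no model built only on those signals can flag the rest. A minimal sketch, with hypothetical field names and made-up drive records:

```python
# Field names are assumptions for this example, not real SMART attribute names.
SIGNALS = ("scan_errors", "reallocations", "offline_reallocations", "probational")

def has_any_signal(drive):
    """True if the drive reports a nonzero count for at least one signal."""
    return any(drive.get(signal, 0) > 0 for signal in SIGNALS)

def detectable_fraction(failed_drives):
    """Upper bound on the share of failures a signal-only model can predict."""
    flagged = sum(1 for drive in failed_drives if has_any_signal(drive))
    return flagged / len(failed_drives)

# Four hypothetical failed drives; only the first shows any signal.
failed = [
    {"scan_errors": 3},
    {},
    {"reallocations": 0},
    {"probational": 0, "scan_errors": 0},
]
print(detectable_fraction(failed))  # 0.25
```

With over 56% of failed drives showing none of the signals, the same calculation on the data described in this slide gives an upper bound below 44%.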
24
Conclusions
It is very difficult to conduct serious research in the field of disk failures.
  A lot of data needs to be collected.
  There isn’t much related work in this field, mostly vendors’ technical papers.
AFR, MTTF and some common assumptions about disk failures tend to be incorrect.
  The effect of temperature on the failure rate.
  Correlation between disk failures.
SMART parameters can be used to build a disk failure prediction model.
  Even the most indicative parameters presented here leave more than half of the failures unpredicted.
  It is possible, however, that models that use parameters beyond those provided by SMART could achieve significantly better accuracy.
25
My point of view
Research in this field is very important.
  A lot of resources can be saved if we are able to predict disk failures.
  How can we make research in this field easier?
Neither paper presents a good prediction model.
  Both only criticize the current situation.
  A good continuation of both papers would be to present a prediction model and examine its performance.
Not enough detail is given about the software running on the machines that were tested (i.e. which OS and programs those servers were running).
What about home users and small organizations?
  Maybe MTTF/AFR figures are more accurate for those users.