
Page 1: Advanced Topics in Storage Systems - Disk Failures
Based on:
- Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? - Bianca Schroeder and Garth A. Gibson, FAST 2007
- Failure Trends in a Large Disk Drive Population - Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso, Google Inc., FAST 2007
Presented by: Yaroslav Kagansky

Page 2: Lecture Contents
- Research methodology in this field.
- MTTF and AFR: widely used, yet not so precise.
- Factors that affect a disk's lifetime.
- SMART data analysis and its ability to predict future disk failures.
- Conclusions and my point of view.

Page 3: A Few Words About the Papers
Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?
- Focuses on the accuracy of MTTF, AFR, and other common assumptions in the field of disk failures.
- Based on hardware replacement and warranty service logs.
- Examines various rotation speeds and interfaces (SATA, SCSI, FC).
- Data was collected from several different organizations.
Failure Trends in a Large Disk Drive Population
- Focuses on building a disk failure prediction model.
- Data was collected by a software daemon running on Google's servers.
- Examines low-cost disks only (5400/7200 RPM SATA drives).
- Based on data from Google only.

Page 4: Research Methodology
How should we define a 'disk failure'?
- Both papers consider a drive to have failed if it was replaced as part of a repair procedure.
A hard drive is a very complicated system:
- Large amounts of data are needed in order to reach sound conclusions.
How was the data collected?
- Google's monitoring system (next slide).
- Hardware replacement and warranty service logs.
- Bad batches were excluded.

Page 5: The Complexity of a Storage System

Page 6: Google's Data Collection System
- The daemon collects various types of information from Google's servers.
- The data is stored in a central repository (on GFS) for later analysis.
- The data is analyzed with the MapReduce framework, as sketched below.
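To make the pipeline concrete, here is a toy illustration of the map/reduce aggregation pattern described above. It is not Google's actual code; the record layout and the `reallocated_sectors` field name are assumptions made for the example.

```python
# Toy map/reduce aggregation -- not Google's pipeline. Each record is a
# hypothetical (drive_model, smart_snapshot) pair collected by the daemon.
from collections import defaultdict

def map_phase(records):
    """Emit (key, value) pairs: one reallocation count per drive model."""
    for model, smart in records:
        yield model, smart.get("reallocated_sectors", 0)

def reduce_phase(pairs):
    """Aggregate values per key: total reallocations per model."""
    totals = defaultdict(int)
    for model, count in pairs:
        totals[model] += count
    return dict(totals)

records = [
    ("model-A", {"reallocated_sectors": 3}),
    ("model-A", {"reallocated_sectors": 0}),
    ("model-B", {"reallocated_sectors": 12}),
]
print(reduce_phase(map_phase(records)))  # {'model-A': 3, 'model-B': 12}
```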

Page 7: Reliability Metrics
Annualized Failure Rate (AFR)
- The percentage of disk drives in a population that fail during a test, scaled to a per-year estimate.
- Typically based on extrapolation from accelerated life tests of small populations, or from returned-unit databases; provided by the vendors.
- Accelerated life tests do not take environmental factors into account, so they are poor predictors of actual failure rates.
Mean Time To Failure (MTTF)
- The MTTF is estimated as the number of power-on hours per year divided by the AFR (see the sketch below).
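A minimal sketch of the relationship between the two metrics, assuming the drive is powered on around the clock (8760 hours per year); the function names are mine, not from the papers.

```python
# Converting between datasheet MTTF and AFR, assuming 24/7 operation.
POWER_ON_HOURS_PER_YEAR = 24 * 365  # 8760 hours

def afr_from_mttf(mttf_hours: float) -> float:
    """Annualized failure rate implied by a datasheet MTTF."""
    return POWER_ON_HOURS_PER_YEAR / mttf_hours

def mttf_from_afr(afr: float) -> float:
    """MTTF (in hours) implied by an annualized failure rate."""
    return POWER_ON_HOURS_PER_YEAR / afr

if __name__ == "__main__":
    # The title question: what does an MTTF of 1,000,000 hours mean?
    print(f"AFR for MTTF=1,000,000h: {afr_from_mttf(1_000_000):.2%}")  # ~0.88%
    # And the reverse: an observed ARR of 3% implies a much lower MTTF.
    print(f"MTTF for ARR=3%: {mttf_from_afr(0.03):,.0f} hours")        # ~292,000 hours
```

This is why a datasheet MTTF of 1,000,000 hours corresponds to the roughly 0.88% AFR quoted on the next slide.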

Page 8: AFR Inaccuracy
- The data show a significant discrepancy between the observed annualized replacement rate (ARR) and the datasheet AFR for all data sets.
- While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%.
- That is, depending on the data set and drive type, the observed ARRs are up to a factor of 15 higher than the datasheet AFRs.

Page 9: Failure Rates over Cumulative Operating Time
- Failure rates of hardware products typically follow a "bathtub curve": high failure rates at the beginning (infant mortality) and the end (wear-out) of the lifecycle, with a lower, flatter rate in between (an illustrative hazard-rate sketch follows).
- (Figure: the failure rate pattern expected over the life cycle of hard drives.)
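As an illustration of the bathtub shape (not a model fitted in either paper), a common textbook construction adds a decreasing infant-mortality term, a flat useful-life term, and an increasing wear-out term; the Weibull parameters below are invented purely to produce the shape.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate: h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t_years):
    """Illustrative bathtub curve: infant mortality + constant baseline + wear-out."""
    infant = weibull_hazard(t_years, shape=0.5, scale=80.0)  # decreasing early-life term
    baseline = 0.02                                          # flat 2%/year useful-life term
    wearout = weibull_hazard(t_years, shape=5.0, scale=7.0)  # increasing wear-out term
    return infant + baseline + wearout

for year in (0.1, 1, 3, 5, 6):
    print(f"year {year}: hazard ~ {bathtub_hazard(year):.3f} per year")
```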

Page 10: Age-Dependent Replacement Rates
- Replacement rates in all years (except the first) are larger than the datasheet rate.
- Replacement rates rise significantly over the years.

Page 11: Age-Dependent Replacement Rates (cont.)
- A steadily increasing replacement rate contradicts the common assumption that after the first year the replacement rate stays steady.
- From the figure on the previous slide, we see that early onset of wear-out has a much stronger impact on lifecycle replacement rates than infant mortality.

Page 12: Utilization
- 'Utilization' is defined as the fraction of time the drive is active out of the time it is powered on.
- We would expect a strong correlation between high utilization and higher failure rates, but the results paint a more complex picture than that.

Page 13: Utilization (cont.)
- Only the very young and very old disk age groups show the expected behavior.
- It is possible that the failure modes associated with higher utilization are more prominent early in a drive's lifetime, and that the drives which survive the infant mortality phase are the least susceptible to those failure modes.
- The previously reported high correlation between utilization and failures was based on extrapolations from manufacturers' accelerated life experiments. Those experiments are likely to better model early-life failure characteristics, which is why they agree with the trend observed for the young age groups.

Page 14: Temperature
- Temperature is often quoted as the most important environmental factor affecting disk drive reliability.
- Previous studies have indicated that temperature increases as low as 15°C can nearly double disk drive failure rates.
- But again, the results are surprising.

Page 15: Temperature (cont.)
- Temperature affects failure rates only at the high end of the measured range, and especially for older drives.
- In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates.
- At moderate temperatures, other effects likely influence failure rates much more strongly than temperature does.

Page 16: Failure Rates and the Poisson Process
- The Poisson assumption implies that the number of failures during a given time interval (e.g. a week or a month) follows the Poisson distribution (a Poisson process).
- A key property of this model is the independence of failures.
- In the data, the time between failures also does not fit an exponential distribution.
- The researchers found a strong correlation between failures in consecutive weeks and months: the correlation coefficient is 0.72 between consecutive weeks and 0.79 between consecutive months (see the sketch below).
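A sketch of the kind of independence check involved, using made-up weekly counts (the papers' raw data is not reproduced here): under a homogeneous Poisson process the lag-1 autocorrelation of weekly counts should be near zero.

```python
import numpy as np

# Hypothetical weekly replacement counts -- not the papers' data.
weekly_failures = np.array([3, 5, 4, 9, 11, 8, 2, 1, 6, 7, 12, 10])

# Correlate each week's count with the previous week's count (lag-1 autocorrelation).
previous_week = weekly_failures[:-1]
this_week = weekly_failures[1:]
r = np.corrcoef(previous_week, this_week)[0, 1]
print(f"lag-1 autocorrelation: {r:.2f}")
# A memoryless Poisson process with a constant rate would give r close to 0;
# the papers report 0.72 for consecutive weeks and 0.79 for consecutive months.
```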

Page 17: Correlation Between Failures
- (Figure: number of disk replacements in a week as a function of the number of replacements in the previous week.)
- The fact that failure rates are not steady over the lifetime of the system may explain the poor fit to a Poisson process.

Page 18: Using SMART Data to Predict Failures
- SMART: Self-Monitoring, Analysis and Reporting Technology.
- The researchers tried to build a disk failure prediction model from the data that can be acquired from a disk's SMART parameters.
- They looked for the SMART parameters with the strongest correlation with future failures.
- Can a reliable failure prediction model be built on SMART data alone?

Page 19: Scan Errors
- Large scan error counts can indicate surface defects, and are therefore believed to indicate lower reliability.
- The group of drives with scan errors was found to be ten times more likely to fail than the group with no errors.
- The higher the error count, the lower the drive's chance of survival.

Page 20: Reallocation Counts
- When the drive's logic believes a sector is damaged (typically as a result of recurring soft errors or a hard error), it can remap the faulty sector number to a new physical sector drawn from a pool of spares.
- The reallocation count reflects the number of times this has happened, and is seen as an indication of drive surface wear.
- The researchers found that after their first reallocation, drives are over 14 times more likely to fail within 60 days than drives with no reallocations.

Page 21: Probational Counts
- Disk drives put suspect bad sectors "on probation" until they either fail permanently and are reallocated, or continue to work without problems.
- Probational counts can therefore be seen as a softer error indication.
- Drives with non-zero probational counts are 16 times more likely to fail within 60 days than drives with zero probational counts (a relative-risk sketch follows).
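A hedged sketch of how such a relative-risk factor can be computed from labeled drive records; the records below are invented and only illustrate the calculation, not the papers' data.

```python
# Each record is (smart_count, failed_within_60_days) -- hypothetical data.
def relative_failure_risk(drives):
    """P(fail | count > 0) divided by P(fail | count == 0) for a binary SMART signal."""
    flagged = [failed for count, failed in drives if count > 0]
    clean = [failed for count, failed in drives if count == 0]
    return (sum(flagged) / len(flagged)) / (sum(clean) / len(clean))

# Toy population: 100 of 1,000 flagged drives failed vs. 350 of 50,000 clean drives.
drives = [(1, 1)] * 100 + [(1, 0)] * 900 + [(0, 1)] * 350 + [(0, 0)] * 49_650
print(f"{relative_failure_risk(drives):.1f}x more likely to fail")  # ~14.3x
```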

Page 22: Other Parameters That Were Studied
The researchers also examined other parameters, but did not find a strong correlation between them and disk failures.
- Seek Errors: seek errors occur when a disk drive fails to properly track a sector and needs to wait for another revolution to read from or write to it. For some manufacturers, there is no correlation between failure rates and seek errors.
- Power Cycles: the power cycles indicator counts the number of times a drive is powered up and down. For drives up to two years old there is no significant correlation between failures and high power cycle counts, but for drives three years and older, higher power cycle counts can increase the absolute failure rate by over 2%.

Page 23: Predictive Power of SMART Parameters
- Given how strongly some SMART parameters correlate with higher failure rates, the researchers were hopeful that accurate predictive failure models based on SMART signals could be created. However...
- Of all failed drives, over 56% have no count in any of the four strong SMART signals: scan errors, reallocation counts, offline reallocation counts, and probational counts.
- In other words, models based only on those signals can never predict more than half of the failed drives (see the sketch below).
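A minimal sketch of that ceiling: any predictor that fires only on the four strong signals has its recall capped by the fraction of failed drives that show at least one of them. The drive records and field names below are hypothetical.

```python
def fires(smart):
    """Predict failure iff any of the four strong SMART signals is non-zero."""
    strong_signals = ("scan_errors", "reallocations", "offline_reallocations", "probational")
    return any(smart.get(signal, 0) > 0 for signal in strong_signals)

# Hypothetical failed-drive population: 56 out of every 100 failed drives
# show none of the strong signals, matching the fraction reported above.
failed_drives = [{"scan_errors": 2}] * 30 + [{"reallocations": 1}] * 14 + [{}] * 56
recall = sum(fires(d) for d in failed_drives) / len(failed_drives)
print(f"upper bound on recall with these signals: {recall:.0%}")  # 44%
```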

Page 24: Conclusions
- It is very difficult to conduct serious research in the field of disk failures: a lot of data needs to be collected, and there is little related work in this field, mostly vendors' technical papers.
- AFR, MTTF, and some common assumptions about disk failures tend to be incorrect, e.g. the effect of temperature on the failure rate and the correlation between disk failures.
- SMART parameters can be used for building a disk failure prediction model, but even the most indicative parameters presented here could not predict more than half of the failures. It is possible, however, that models using parameters beyond those provided by SMART could achieve significantly better accuracy.

Page 25: My Point of View
- Research in this field is very important: many resources can be saved if we are able to predict disk failures. How can we make research in this field easier?
- Neither of the papers presents a good prediction model; both only critique the current situation. A good continuation for both papers would be to present a prediction model and examine its performance.
- There are not enough details about the software aspects of the machines that were tested (i.e. which OS and programs those servers were running).
- What about home users and small organizations? Maybe the MTTF/AFR figures are more accurate for those users.