EASY workshop 20011 The Shape of Failure Taliver Heath, Richard Martin and Thu Nguyen Rutgers University Department of Computer Science EASY Workshop July.

Slides:



Advertisements
Similar presentations
Dummy Dependent variable Models
Advertisements

Group Research 1: AKHTAR, Kamran SU, Hao SUN, Qiang TANG, Yue
Improving capital measurement using micro data Abdul Azeez Erumban CBS, the Hague.
2. Computer Clusters for Scalable Parallel Computing
Availability in Globally Distributed Storage Systems
CMPT 855Module Network Traffic Self-Similarity Carey Williamson Department of Computer Science University of Saskatchewan.
Dark and Panic Lab Computer Science, Rutgers University1 Impact of Layering and Faults on Availability and its End-to-End Implications Ricardo Bianchini,
Dark and Panic Lab Computer Science, Rutgers University1 Evaluating the Impact of Communication Architecture on Performability of Cluster-Based Services.
458 Fitting models to data – II (The Basics of Maximum Likelihood Estimation) Fish 458, Lecture 9.
Distributed Object System. Project Goals Develop a distributed system for performing time-consuming calculations. Load Balancing support. Fault Tolerance.
Before start… Earlier work single-path routing in sensor networks
An Empirical Examination of Current High-Availability Clustering Solutions’ Performance Jeffrey Absher DePaul University Research Symposium Presentation.
Performance Evaluation
1 Fundamentals of Reliability Engineering and Applications Dr. E. A. Elsayed Department of Industrial and Systems Engineering Rutgers University
2001 ©R.P.Martin Using Distributed Data Structures for Constructing Cluster-Based Servers Richard Martin, Kiran Nagaraja and Thu Nguyen Rutgers University.
Michael Over.  Which devices/links are most unreliable?  What causes failures?  How do failures impact network traffic?  How effective is network.
Understanding Network Failures in Data Centers: Measurement, Analysis and Implications Phillipa Gill University of Toronto Navendu Jain & Nachiappan Nagappan.
Transforming the data Modified from: Gotelli and Allison Chapter 8; Sokal and Rohlf 2000 Chapter 13.
1 Chapters 9 Self-SimilarTraffic. Chapter 9 – Self-Similar Traffic 2 Introduction- Motivation Validity of the queuing models we have studied depends on.
Information Networks Power Laws and Network Models Lecture 3.
1 Reading Report 9 Yin Chen 29 Mar 2004 Reference: Multivariate Resource Performance Forecasting in the Network Weather Service, Martin Swany and Rich.
Ketan Patel, Igor Markov, John Hayes {knpatel, imarkov, University of Michigan Abstract Circuit reliability is an increasingly important.
Stats for Engineers Lecture 9. Summary From Last Time Confidence Intervals for the mean t-tables Q Student t-distribution.
Chapter 4 – Modeling Basic Operations and Inputs  Structural modeling: what we’ve done so far ◦ Logical aspects – entities, resources, paths, etc. 
Traffic Modeling.
R. Kass/W03P416/Lecture 7 1 Lecture 7 Some Advanced Topics using Propagation of Errors and Least Squares Fitting Error on the mean (review from Lecture.
Network Characterization via Random Walks B. Ribeiro, D. Towsley UMass-Amherst.
THE MANAGEMENT AND CONTROL OF QUALITY, 5e, © 2002 South-Western/Thomson Learning TM 1 Chapter 9 Statistical Thinking and Applications.
1 Statistical Distribution Fitting Dr. Jason Merrick.
Jhih-sin Jheng 2009/09/01 Machine Learning and Bioinformatics Laboratory.
Chapter 7 Point Estimation
Microsoft Reseach, CambridgeBrendan Murphy. Measuring System Behaviour in the field Brendan Murphy Microsoft Research Cambridge.
An Energy-Efficient Voting-Based Clustering Algorithm for Sensor Networks Min Qin and Roger Zimmermann Computer Science Department, Integrated Media Systems.
Fault-Tolerant Computing Systems #4 Reliability and Availability
Automatic Statistical Evaluation of Resources for Condor Daniel Nurmi, John Brevik, Rich Wolski University of California, Santa Barbara.
Basic Hydrology & Hydraulics: DES 601
1 Introduction to Statistics − Day 4 Glen Cowan Lecture 1 Probability Random variables, probability densities, etc. Lecture 2 Brief catalogue of probability.
Robust Regression. Regression Methods  We are going to look at three approaches to robust regression:  Regression with robust standard errors  Regression.
1 Introduction to Statistics − Day 3 Glen Cowan Lecture 1 Probability Random variables, probability densities, etc. Brief catalogue of probability densities.
CSE 5331/7331 F'07© Prentice Hall1 CSE 5331/7331 Fall 2007 Regression Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist.
Assumptions of Multiple Regression 1. Form of Relationship: –linear vs nonlinear –Main effects vs interaction effects 2. All relevant variables present.
CS203 – Advanced Computer Architecture Dependability & Reliability.
R. Kass/Sp07P416/Lecture 71 More on Least Squares Fit (LSQF) In Lec 5, we discussed how we can fit our data points to a linear function (straight line)
Privacy Vulnerability of Published Anonymous Mobility Traces Chris Y. T. Ma, David K. Y. Yau, Nung Kwan Yip (Purdue University) Nageswara S. V. Rao (Oak.
CS Statistical Machine learning Lecture 7 Yuan (Alan) Qi Purdue CS Sept Acknowledgement: Sargur Srihari’s slides.
Chapter 3 INTERVAL ESTIMATES
Advanced Quantitative Techniques
BIOST 513 Discussion Section - Week 10
Chapter 4 Basic Estimation Techniques
Chapter 4 Continuous Random Variables and Probability Distributions
The Exponential and Gamma Distributions
Chapter 3 INTERVAL ESTIMATES
Basic Hydrology: Flood Frequency
Basic Estimation Techniques
Pick up the Pieces Average White Band.
The general linear model and Statistical Parametric Mapping
Chapter 4: Seasonal Series: Forecasting and Decomposition
Maximum Likelihood Estimation
Chapter 30: Standard Data
EECS 582 Midterm Review Mosharaf Chowdhury EECS 582 – F16.
Basic Estimation Techniques
Lecture 15 Sections 7.3 – 7.5 Objectives:
Systems Issues for Scalable, Fault Tolerant Internet Services
Mathematical Foundations of BME Reza Shadmehr
10701 / Machine Learning Today: - Cross validation,
I. Statistical Tests: Why do we use them? What do they involve?
Jeffrey E. Korte, PhD BMTRY 747: Foundations of Epidemiology II
Estimating the number of components with defects post-release that showed no defects in testing C. Stringfellow A. Andrews C. Wohlin H. Peterson Jeremy.
EVENT PROJECTION Minzhao Liu, 2018
Copyright © 2015 Elsevier Inc. All rights reserved.
Presentation transcript:

EASY workshop The Shape of Failure Taliver Heath, Richard Martin and Thu Nguyen Rutgers University Department of Computer Science EASY Workshop July 2001

EASY workshop Goal and Motivation Characterize workstation availability Scalable Internet Services –built from clusters for scalability and fault isolation –but components not designed for availability Current Availability methods ad-hoc –over-engineer and hope for the best –restless sleep next to pagers

EASY workshop Design Approach Decompose system into components Characterize fault behavior of each component in isolation Design system so desired overall failure rate tolerates failure rates of components This work: whole workstation is a component

EASY workshop Approximating the TTF Ideal: distribution of Time to Failure (TTF) of workstation Approximate “failure” with reboot TTF  TTB

EASY workshop Methodolgy Collect system last logs Observe reboot times Collect length of time between boots (TTB) Fit observed data to multiple distributions to see which is most representative

EASY workshop Observed Systems Undergrad cluster –20 Ultra1’s open to juniors+seniors, 1 admin Machine room cluster –17 Ultra1’s,2 sparc20’s, operator access only, 3 admins Industrial cluster –8 Netra’s, 9 e450’s, 21 Ultra’s 1’s, operator access only, 1 admin

EASY workshop Matching to a distribution Maximum Likelihood Estimates to approximate the distribution Least squares fit to a quantile-quantile plot of data points to the distributions: –Exponential, Weibull, Pareto, Rayleigh Best match is a Weibull distribution

EASY workshop Measured vs. Modeled: ugrad

EASY workshop Measured vs. Molded: machine room

EASY workshop Measured vs. Modeled: Industrial

EASY workshop Side by side comparison

EASY workshop Results Workstations that have been up longer are more likely to stay up than those recently rebooted Weibull shape <1 mean systems not memoryless Similar results across all 3 clusters –timescales different, but shape of curves the same

EASY workshop Implications OS rejuvenation? –is effect large enough to observe? Useful lifetime < bathtub model? –Is a 3 year useful life < decay area? –All systems stay in the “flat-region”? Load balancing? Not clean when restarted? Upgrades

EASY workshop Limitations TTB only approximates TTF –e.g. a disk error may be a “failure” not captured –downtime not measured Many factors aggregated –difficult to determine problematic sub-component Independence assumption –model assumes independent experiments

EASY workshop Future Work Independence assumption –Conditional probability I.e. if A reboots, is B more likely to reboot soon? Event loggers (measurability) –Are reboots correlated with load? –What are the first-order factors? More/longer industrial data Diversification and comparison of systems –Same models apply to windows, linux?

EASY workshop More Info