Modeling and Predicting Machine Availability in Volatile Computing Environments (Rich Wolski, John Brevik, Dan Nurmi, University of California, Santa Barbara)

Bust a Move Young MC

Modeling and Predicting Machine Availability in Volatile Computing Environments
Rich Wolski, John Brevik, Dan Nurmi
University of California, Santa Barbara

Explorations In Grid Computing
Performance: How can programs extract high performance levels given that the resource pool is heterogeneous and dynamically changing?
- The Network Weather Service
- On-line performance monitoring and prediction
Programming: What programming abstractions are needed to enable the Grid paradigm?
- EveryWare
- Toolkit for building global programs
Analysis: How do we reason about the Grid globally?
- G-Commerce
- Systemwide efficiency, stability, etc.

Fortune Telling
Grid resource performance varies dynamically
- Machines, networks, and storage systems are shared by competing applications
- Federation
Either the system or the application itself must “tolerate” performance variation
- Dynamic scheduling
Scheduling requires a prediction of future performance levels
- What performance level will be deliverable?

Skepticism
Is it really possible to predict future performance levels?
- Self-similarity
- Non-stationarity
- With what accuracy?
- For how long into the future?
NWS: on-line, semi-non-parametric time series techniques (see the sketch below)
- Use a running tabulation of forecast error to choose between competing forecasters
- Bandwidth, latency, CPU load, available memory, battery power
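As an illustration of the forecaster-selection idea (not the actual NWS code), here is a minimal Python sketch: several simple predictors run side by side, and the one with the lowest accumulated absolute error so far supplies the next prediction. The particular predictor set, window size, and smoothing constant are assumptions made for the example.

```python
import numpy as np

def _exp_smooth(history, alpha=0.3):
    """Simple exponential smoothing over the history."""
    s = history[0]
    for x in history[1:]:
        s = alpha * x + (1 - alpha) * s
    return s

# Candidate forecasters: each maps a measurement history to a one-step prediction.
FORECASTERS = {
    "last_value":     lambda h: float(h[-1]),
    "running_mean":   lambda h: float(h.mean()),
    "sliding_median": lambda h: float(np.median(h[-10:])),
    "exp_smooth":     _exp_smooth,
}

def forecast_next(series):
    """Return a prediction for the next measurement, made by whichever
    forecaster has accumulated the lowest absolute error on the series so far."""
    series = np.asarray(series, dtype=float)
    errors = {name: 0.0 for name in FORECASTERS}
    for t in range(1, len(series)):
        history = series[:t]
        for name, f in FORECASTERS.items():
            errors[name] += abs(f(history) - series[t])
    best = min(errors, key=errors.get)
    return FORECASTERS[best](series), best

# Example: forecast_next([0.42, 0.45, 0.44, 0.90, 0.43, 0.41])
```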

What About Machine Availability?

The “Normal” Approach
Each measurement is modeled as a “sample” from a random variable
- Time invariant
- IID (independent, identically distributed)
- Stationary (IID forever)
Well studied in the literature
- Exponential distributions
  - Compose well
  - Memoryless
  - Popular in the database and fault-tolerance communities
- Pareto distributions
  - Potentially related to self-similarity
  - “Heavy-tailed,” implying non-predictability
  - Popular in the networking, Internet, and distributed-systems communities
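As a brief numeric aside (not from the slides), the contrast between the two families is easy to verify with scipy: the exponential is memoryless, while a Pareto that has already survived for a while is likely to survive even longer. The parameter values below are arbitrary.

```python
from scipy.stats import expon, pareto

s, t = 50.0, 30.0                     # arbitrary elapsed / additional times (hours)

# Exponential (memoryless): surviving s hours tells you nothing about the future.
scale = 100.0                         # arbitrary mean lifetime
print(expon.sf(s + t, scale=scale) / expon.sf(s, scale=scale))   # P(X > s+t | X > s)
print(expon.sf(t, scale=scale))                                  # equals P(X > t)

# Pareto (heavy-tailed): the longer a machine has been up, the longer it is
# likely to stay up, so the conditional survival exceeds the unconditional one.
b, xm = 1.5, 10.0                     # arbitrary shape and minimum
print(pareto.sf(s + t, b, scale=xm) / pareto.sf(s, b, scale=xm))
print(pareto.sf(t, b, scale=xm))
```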

Our “Abnormal” Approach
Measure availability as “lifetime” in a variety of settings
- Student lab at UCSB, Condor pool
  - New NWS availability sensors
- Data used in the fault-tolerance community for checkpointing research
  - Predicting optimal checkpoints
Develop robust software for MLE parameter estimation
Fit Exponential, Pareto, and Weibull distributions
Compare the fits (see the sketch below)
- Visually
- Goodness-of-fit tests
Goal is to provide an automated mechanism for the NWS
- Let the best distribution win
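A hedged sketch of what such an automated fitting step could look like, using scipy's built-in MLE fitting and a Kolmogorov-Smirnov statistic to rank the candidates. This is a generic illustration, not the authors' software; the synthetic data and the use of KS as the ranking criterion are assumptions.

```python
from scipy import stats

def best_fit(lifetimes):
    """Fit candidate distributions by maximum likelihood and rank them with a
    Kolmogorov-Smirnov statistic.  Note: estimating parameters from the same
    sample biases the KS p-value, so this serves only as a ranking heuristic."""
    candidates = {
        "exponential": stats.expon,
        "pareto":      stats.pareto,
        "weibull":     stats.weibull_min,
    }
    results = {}
    for name, dist in candidates.items():
        params = dist.fit(lifetimes, floc=0)                 # MLE, location pinned at 0
        ks_stat, p_value = stats.kstest(lifetimes, dist.cdf, args=params)
        results[name] = (params, ks_stat, p_value)
    return min(results.items(), key=lambda kv: kv[1][1])     # smallest KS statistic wins

# Hypothetical usage with synthetic availability durations (hours):
lifetimes = stats.weibull_min.rvs(0.6, scale=120.0, size=500, random_state=0)
winner, (params, ks, p) = best_fit(lifetimes)
print(winner, params, ks, p)
```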

UCSB Student Computing Labs
Approximately 85 machines running Red Hat Linux, located in three separate buildings
Open to all Computer Science graduate and undergraduate students
- Only graduate students have building keys
The power switch is not protected
- Anyone with physical access to a machine can reboot it by power-cycling it
Students routinely “clean off” competing users or intrusive processes to get better performance
NWS deployed to monitor the duration between restarts
Can we model the time-to-reboot?

UCSB Empirical CDF

MLE Weibull Fit to UCSB Data

Comparing Fits at UCSB

The Visual Acid Test

More Systems
Condor: cycle-harvesting system (M. Livny, U. Wisconsin)
- Workstations in a “pool” run the (trusted) Condor daemons
- When a machine running a Condor job becomes “busy,” the job is terminated (vanilla universe)
- Unknown and constantly changing number of workstations in the UWisc Condor pool (~1000 Linux workstations)
Long, Muir, and Golding Internet survey (1995)
- Pinged rpc.statd as a heartbeat
- Used extensively in the fault-tolerance community to model host failure
- 1170 hosts covering 3 months of spring

The Condor Picture: April 2003 through Oct 2004, 600 hosts

More Condor: April 2003 through July 2005, 900 hosts

Condor Clusters: April 2003 through July 2005, 730 hosts

Condor Non-cluster: April 2003 through July 2005, 170 hosts

Modeling Lessons
Machine availability looks like it is well-modeled by a Weibull, but Condor process lifetime is trickier
- Hyper-exponentials do well, but are hard to fit and use
- Log-normal looks better in the large (needs more investigation)
- May be able to do a piece-wise fit for “desktops”
Who should care?
- Grid simulators
  - Availability is critical
- P2P systems
  - OceanStore, CAN, Tapestry, etc. all assume very basic availability distributions in their proofs
- Replication systems
A good fit does not mean that model fitting works best for predicting availability => data shortage

Predicting Individual Machine Behavior
Estimating Mean Time To Failure (MTTF) is relatively easy
- Unless the data is Pareto, the mean is the “expected value”
- Probably not what is needed to support scheduling
  - The cost of being below the mean is not the same as the cost of being above it
“At least how much time will elapse before this machine reboots, with 95% certainty?”
- The answer is the 0.05 quantile (not an expectation) of the cumulative distribution function (CDF), as in the sketch below
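A small sketch, with assumed Weibull parameters, makes the mean-versus-quantile distinction concrete:

```python
from scipy.stats import weibull_min

# Illustrative (assumed) Weibull parameters for availability durations, in hours.
shape, scale = 0.55, 120.0
availability = weibull_min(shape, scale=scale)

print("mean (MTTF-style estimate):", availability.mean())
# 0.05 quantile: with 95% certainty the machine stays up at least this long
# (under the fitted model and a stationarity assumption).
print("0.05 quantile:", availability.ppf(0.05))
```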

Certainty in an Uncertain World
Predictions of the form
- “For at least how long will this machine be available, with X% certainty?”
Requires two estimates if certainty is to be quantified
- Estimate the (1-X) quantile of the distribution of availability => Q_x
- Estimate the lower X% confidence bound on the statistic Q_x => Q_(x,lb)
If the estimates are unbiased and the distribution is stationary, future availability durations will be larger than Q_(x,lb) X% of the time, guaranteed

Neo-classical Methods
The classical (parametric) method has some drawbacks
- Which distribution?
- MLE is computationally challenging or impossible for some distributions and/or data sets
- Requires quite a bit of data to get a good fit
- Quantiles near the tails are “squeezed,” so fit error is significant
- Estimating confidence bounds for high-order models is computationally (and theoretically) difficult
Non-parametric techniques
- Can usually only recover a statistic, not the distribution
- Those that appeal to the CLT may have an asymptote problem
New non-parametric invention based on binomial assumptions (sketched below)
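The binomial idea can be sketched as follows: each observation falls below the true q-quantile with probability q, so the count of such observations is Binomial(n, q), and an order statistic serves as a distribution-free lower confidence bound. The code below is an illustrative reconstruction under that assumption, not necessarily the authors' exact algorithm.

```python
import numpy as np
from scipy.stats import binom

def binomial_lower_bound(samples, q=0.05, confidence=0.95):
    """Distribution-free lower confidence bound on the q-quantile.

    Whether each sample falls below the true q-quantile is a Bernoulli(q)
    event, so the number of such samples is Binomial(n, q).  The k-th
    smallest sample is a lower bound with confidence P[Binomial(n, q) >= k];
    take the largest k that still meets the requested confidence.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    best_k = 0
    for k in range(1, n + 1):
        if binom.sf(k - 1, n, q) >= confidence:   # P[Binomial(n, q) >= k]
            best_k = k
        else:
            break
    if best_k == 0:
        # Too few samples for the requested confidence; fall back to the sample
        # minimum (the achieved confidence is then lower than requested).
        return x[0]
    return x[best_k - 1]

# Hypothetical usage with synthetic availability durations (hours):
rng = np.random.default_rng(0)
durations = rng.weibull(0.6, size=200) * 120.0
print(binomial_lower_bound(durations, q=0.05, confidence=0.95))
```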

Experiments in Fortune Telling
CSIL, Condor (2 years), and Long data sets
Split into training and experimental periods
- Use only machines with 20 or more training samples
- Using synthetic data, we noticed that the best method needed at least 20 samples
Use each method to estimate a 95% confidence bound on the 0.05 quantile from the training period
Record success if 95%-100% of the remaining experimental availability durations are >= the estimate
Report % success (want to see 95%); a sketch of this protocol appears below
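A minimal sketch of the evaluation protocol as described on this slide; the per-machine 50/50 split policy and the function and variable names are assumptions.

```python
def evaluate_method(machine_traces, estimator, q=0.05, confidence=0.95, split=0.5):
    """Evaluation loop in the spirit of this slide: train on the first part of
    each machine's availability trace, estimate a lower bound on the 0.05
    quantile, and count a success when at least 95% of the held-out durations
    meet or exceed the bound."""
    successes, evaluated = 0, 0
    for durations in machine_traces:              # one list of durations per machine
        cut = int(len(durations) * split)
        train, test = durations[:cut], durations[cut:]
        if len(train) < 20 or not test:           # slides: need >= 20 training samples
            continue
        bound = estimator(train, q=q, confidence=confidence)
        frac_ok = sum(d >= bound for d in test) / len(test)
        evaluated += 1
        successes += frac_ok >= 0.95
    return successes / max(evaluated, 1)          # "% success": want to see ~95%

# e.g. evaluate_method(traces, binomial_lower_bound), where `traces` is a
# list of per-machine duration lists (hypothetical variable).
```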

Non-parametric methods seem to work Data Set MLE Weibull Method BootstrappingBinomial Method CSIL93.3%96.7%96.6% Condor99.5%97.1%95.0% Long91.7%96.0%97.3% Weibull over-estimates the tail for Condor data Bootstrapping works okay, but is very computationally expensive

On-going Work with Condor
Checkpoint scheduling
- Parametric method reduces network load dramatically
Applications
- LDPC investigation => lowest observed error rates
- Ramsey search
- GridSAT
- UCSBGrid
Automatic Program Overlay
- On-demand Condor as grid programming middleware
NWS Condor Integration
- Publishing NWS forecasts via Hawkeye
- Incorporating the machine availability predictor

Thanks and More
- Miron Livny and the Condor group at the University of Wisconsin
- Darrell Long (UCSC) and James Plank (UTK)
- UCSB Facilities Staff
- NSF SCI and DOE
Middleware and Applications Yielding Heterogeneous Environments for Metacomputing at UCSB
- Students: Matthew Allen, Wahid Chrabakh, Ryan Garver, Andrew Mutz, Dan Nurmi, Erik Peterson, Fred Tu, Lamia Youseff, Ye Wen
- Research Staff: John Brevik, Graziano Obertelli