Pick up the Pieces (Average White Band)

Modeling Resource Availability in Federated, Globally Distributed Computing Environments
Rich Wolski, Dan Nurmi (University of California, Santa Barbara)
John Brevik (Wheaton College)

Virtualization
- Characterize resource performance in terms of predicted:
  - Performance level (CPU fraction, bandwidth, latency, available memory)
  - Availability duration
- Classify resources in terms of:
  - Equivalence
  - Statistical independence
- From these, we can build "virtual machines" with provable performance and availability characteristics:
  - Compute machines
  - Storage machines

Sample-Based Techniques
- Each measurement is modeled as a "sample" from a random variable
  - Time invariant
  - IID (independent, identically distributed)
  - Stationary (IID forever)
- Well studied in the literature
- Exponential distributions
  - Compose well
  - Memoryless
  - Popular in the database and fault-tolerance communities
- Pareto distributions
  - Potentially related to self-similarity
  - "Heavy-tailed," implying non-predictability
  - Popular in the networking, Internet, and distributed-systems communities
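The memoryless property that makes exponentials so convenient can be checked empirically: the chance of surviving another t seconds does not depend on how long a lifetime has already lasted. A minimal sketch (the rate and times are illustrative, not from the slides):

```python
import random

random.seed(42)
RATE = 0.5          # failure rate (lambda) of the exponential distribution
N = 200_000         # number of simulated lifetimes
s, t = 1.0, 2.0     # conditioning time and look-ahead time

samples = [random.expovariate(RATE) for _ in range(N)]

# P(X > t): fraction of lifetimes exceeding t
p_gt_t = sum(1 for x in samples if x > t) / N

# P(X > s + t | X > s): among lifetimes exceeding s, fraction exceeding s + t
survivors = [x for x in samples if x > s]
p_cond = sum(1 for x in survivors if x > s + t) / len(survivors)

# Memorylessness: the two probabilities agree up to sampling noise
print(p_gt_t, p_cond)
```

A Weibull with shape other than 1 would fail this check, which is exactly why it can encode "infant mortality" or wear-out behavior that an exponential cannot.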

Why Not Weibull?
- Proposed originally by Waloddi Weibull in 1939
- PDF: f(x) = (a/b) * ((x - c)/b)^(a-1) * e^(-((x - c)/b)^a)
  - a is the shape parameter, a > 0
  - b is the scale parameter, b > 0
  - c is the location parameter, c in (-inf, inf)
- Used extensively in reliability engineering
  - Modeling lifetime distributions
  - Modeling extreme values in bounded cases
- Not memoryless: P(X > x + k | X > x) != P(X > k)
- Maximum Likelihood Estimation (MLE) of the parameters is "hard"
  - Requires solving a non-linear system of equations or an optimization problem
  - Sensitive to the numerical stability of the numerical algorithms
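The slides' own fitting software used Mathematica/MATLAB solvers (see the final slide). To illustrate why the MLE is tractable but fiddly, here is a minimal pure-Python sketch for the two-parameter case (c = 0), assuming illustrative parameter values; the shape estimate comes from a profile-likelihood root, and the scale then has a closed form:

```python
import math
import random

def weibull_mle(data, tol=1e-9):
    """Fit a two-parameter Weibull (shape a, scale b) by maximum likelihood.

    Solves the profile-likelihood equation for the shape by bisection,
    then recovers the scale in closed form: b = (mean(x^a))^(1/a).
    """
    n = len(data)
    logs = [math.log(x) for x in data]
    mean_log = sum(logs) / n

    def g(a):
        # Profile score for the shape; zero at the MLE, increasing in a
        pa = [x ** a for x in data]
        return sum(p * l for p, l in zip(pa, logs)) / sum(pa) - 1.0 / a - mean_log

    lo, hi = 1e-3, 100.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            hi = mid
        else:
            lo = mid
    a = 0.5 * (lo + hi)
    b = (sum(x ** a for x in data) / n) ** (1.0 / a)
    return a, b

# Sanity check: recover known parameters from synthetic lifetimes
# (inverse-transform sampling: X = b * (-ln(1 - U))^(1/a))
random.seed(1)
true_a, true_b = 0.7, 3.0   # shape < 1: "infant mortality" failure pattern
sample = [true_b * (-math.log(1.0 - random.random())) ** (1.0 / true_a)
          for _ in range(20_000)]
a_hat, b_hat = weibull_mle(sample)
print(a_hat, b_hat)
```

The "hard" part the slide alludes to shows up here as the one-dimensional root-find: with three parameters (location c free), the system becomes genuinely multi-dimensional and numerically delicate.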

Our Initial Investigation
- Measure availability as "lifetime" in a variety of settings
  - Student lab at UCSB, Condor pool
  - New NWS availability sensors
  - Data used in the fault-tolerance community for checkpointing research (predicting the optimal checkpoint)
- Develop robust software for MLE parameter estimation
  - Automatically fit Exponential, Pareto, and Weibull distributions
- Compare the fits
  - Visually
  - Goodness-of-fit tests
- Goal is to provide an automated mechanism for the NWS: let the best distribution win

UCSB Student Computing Labs
- Approximately 85 machines running Red Hat Linux, located in three separate buildings
- Open to all Computer Science graduate and undergraduate students
  - Only graduates have building keys
- The power switch is not protected
  - Anyone with physical access to a machine can reboot it by power cycling it
  - Students routinely "clean off" competing users or intrusive processes to get better performance
- NWS deployed and monitoring the duration between restarts
- Can we model the time-to-reboot?

UCSB Empirical CDF
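An empirical CDF like the one on this slide is simple to compute from raw uptime samples: sort the data and step up by 1/n at each observation. A minimal sketch (the uptime values are made up for illustration):

```python
def ecdf(samples):
    """Return the empirical CDF of `samples` as sorted (value, F(value)) pairs."""
    xs = sorted(samples)
    n = len(xs)
    return [(x, (i + 1) / n) for i, x in enumerate(xs)]

# Hypothetical machine uptimes in hours
uptimes = [2.5, 40.0, 13.1, 90.4, 7.7, 160.2, 22.9, 5.0]
for value, frac in ecdf(uptimes):
    print(f"{value:7.1f}  {frac:.3f}")
```

Plotting these pairs against the fitted distributions' CDFs gives the visual comparisons shown on the next slides.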

MLE Weibull Fit to UCSB Data

Comparing Fits at UCSB

Goodness of Fit
- Kolmogorov-Smirnov (K-S) goodness-of-fit test, p-values averaged over 1000 subsamples, each of size 100:
  - Weibull: 0.36
  - Exponential: 2 x 10^-5
  - Pareto: 5 x 10^-4
- Anderson-Darling (A-D) goodness-of-fit test:
  - Weibull: 0.07
  - Exponential: 0
  - Pareto: 0
- At the 0.95 significance level, reject the null hypothesis for both Exponential and Pareto
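The K-S statistic behind these p-values is just the largest vertical gap between the empirical CDF and the hypothesized CDF. A minimal sketch (the exponential example is illustrative; turning the statistic into a p-value needs tables or the subsampling the slide describes):

```python
import math
import random

def ks_statistic(samples, cdf):
    """One-sample Kolmogorov-Smirnov statistic: sup |F_n(x) - F(x)|."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        fx = cdf(x)
        # Compare F(x) against the empirical CDF just before and at x
        d = max(d, abs((i + 1) / n - fx), abs(i / n - fx))
    return d

random.seed(7)
rate = 0.25
data = [random.expovariate(rate) for _ in range(1000)]

# Distance to the true generating distribution should be small...
d_good = ks_statistic(data, lambda x: 1.0 - math.exp(-rate * x))
# ...and noticeably larger against a badly mis-specified rate
d_bad = ks_statistic(data, lambda x: 1.0 - math.exp(-4 * rate * x))
print(d_good, d_bad)
```

The A-D test works the same way but weights the tails more heavily, which is why it is the stricter of the two for lifetime data.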

Can do Better with a few Statistical Tricks

Condor
- Cycle-harvesting system (M. Livny, U. Wisconsin)
- Workstations in a "pool" run the (trusted) Condor daemons
  - Each owner agrees to contribute a machine by installing and running Condor
- Condor users submit job-control scripts to a batch queue
- When a machine becomes "idle," Condor schedules a waiting job
  - Machine owners specify what "idle" and "busy" mean
- When a machine running a Condor job becomes "busy":
  - The job is checkpointed and requeued (standard universe), or
  - The job is terminated (vanilla universe)
- The NWS sensor uses the vanilla universe and records process lifetime
- Unknown and constantly changing number of workstations in the U. Wisconsin Condor pool (> 1500)
  - 210 machines used by Condor for the NWS sensor

Condor Weibull Fit

Comparing Condor Fits

Long, Muir, and Golding Internet Survey (1995)
- 1170 hosts "across" the Internet in 1995
- Used the response to rpc.statd (NFS daemon) as a heartbeat
- Long, Muir, and Golding (UCSC, HP Labs) investigated exponentials as models for:
  - Availability time
  - Downtime
- Plank and Elwasif (UTK, 1998) and Plank and Thomason (UTK, 2000) used the data and exponentials as the basis for checkpoint-interval determination
- All researchers concluded that the data is not well modeled by exponentials
  - No plausible distribution was determined

Weibull Again

If the Weibull Fits, Wear It
- Three different availability surveys under three different sets of circumstances:
  - UCSB student labs: adversarial chaos
  - U. Wisconsin Condor pool: background cycle harvesting
  - Internet host survey: convolution of host and network availability, circa 1995
- In all three cases an MLE-fit Weibull is, by far, the best model
  - Visual and goodness-of-fit evidence
- Uncharacteristically, the assumptions for the model seem to hold
  - Stationarity and independence

What Does This Mean for VGrADS?
- If a continuous, closed-form distribution is needed to model machine availability in federated distributed systems, a Weibull is probably the best choice
  - Empirical evidence from different scenarios makes bias unlikely
  - Weibulls were invented to model lifetimes
- Why should we care?
  - Grid simulators (probably useful to uGrid)
  - Optimal checkpoint scheduling (paper in progress)
  - Replication systems: independence allows us to set the joint failure probability
- This does not mean that Weibulls are best for predicting availability
  - We can beat the distributional approach using a non-parametric method

Optimal Checkpoint Interval
- Goal: minimize the expected execution time, given a checkpoint overhead cost C for each checkpoint
- Old formula (Vaidya's approximation): e^(L(T + C)) * (1 - LT), where L is the (exponential) failure rate and T is the optimal checkpoint interval
- Our new formula, based on Weibulls: (b + C + ((b + C)/b)^a * a * b) / (((b + C)/b)^a * a), for a two-parameter Weibull with shape a and scale b
- Conservative value
  - Optimal unconditional value
  - A conditional value may be possible
  - Requires the application to recalculate the interval at each checkpoint
  - Pie in the sky for now
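For comparison with the slide's formulas, the classical first-order result for exponentially distributed failures (Young, 1974) is T ~= sqrt(2C/L). A minimal sketch of that textbook baseline (not the slide's Weibull formula; the numbers are illustrative):

```python
import math

def young_checkpoint_interval(checkpoint_cost, failure_rate):
    """Young's first-order approximation to the optimal checkpoint interval
    under exponentially distributed failures:

        T ~= sqrt(2 * C / L)

    Valid when the checkpoint cost C is small relative to the MTBF (1/L)."""
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)

# Example: 60-second checkpoints, one failure per day on average
C = 60.0                # seconds per checkpoint
L = 1.0 / 86_400.0      # failures per second
T = young_checkpoint_interval(C, L)
print(f"checkpoint every {T / 60:.1f} minutes")
```

A Weibull-based interval differs from this baseline precisely because the hazard rate is no longer constant, which is what the conditional recalculation bullet above is getting at.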

Where We Are and What's Next
- Automatic fitting software prototyped for availability
  - Uses Mathematica and/or MATLAB for solver quality
- New NWS sensors going up on the VGrADS testbed
- Non-parametric failure-prediction software prototyped for individual machines
- We need to:
  - Integrate with the NWS infrastructure
  - Develop the VGrADS presentation layer
  - Develop classification software (independence and equivalence)
  - Translate results to the time-series realm
  - Study the time-to-availability problem
  - Develop an optimal checkpoint interval determination service
- Dan Nurmi, John Brevik

Thanks
- Miron Livny and the Condor group at the University of Wisconsin
- Darrell Long (UCSC) and James Plank (UTK)
- UCSB Facilities Staff
- NSF and DOE
nurmi@cs.ucsb.edu, jbrevik@wheatonma.edu, rich@cs.ucsb.edu