Download presentation
Presentation is loading. Please wait.
Published byLoreen Briggs Modified over 8 years ago
1
Bust a Move Young MC
2
Modeling and Predicting Machine Availability in Volatile Computing Environments Rich Wolski John Brevik Dan Nurmi University of California, Santa Barbara
3
Explorations In Grid Computing Performance: How can programs extract high performance levels given that the resource pool is heterogeneous and dynamically changing? —The Network Weather Service —On-line performance monitoring and prediction Programming: What programming abstractions are needed to enable the Grid paradigm? —EveryWare —Toolkit for building global programs Analysis: How do we reason about the Grid globally? —G-Commerce —Systemwide efficiency, stability, etc.
4
Fortune Telling Grid resource performance varies dynamically —Machines, networks and storage systems are shared by competing applications —Federation Either the system or the application itself must “tolerate” performance variation —Dynamic scheduling Scheduling requires a prediction of future performance levels —What performance level will be deliverable?
5
Skepticism Is it really possible to predict future performance levels? —Self-similarity —Non-stationarity —With what accuracy? —For how long into the future? NWS On-line, semi non-parametric time series techniques —Use running tabulation of forecast error to choose between competing forecasters —Bandwidth, latency, CPU load, available memory, battery power
6
What About Machine Availability?
7
The “Normal” Approach Each measurement is modeled as a “sample” from a random variable —Time invariant —IID (independent, identically distributed) —Stationary (IID forever) Well studied in the literature —Exponential distributions Compose well Memoryless Popular in database and fault-tolerance communities —Pareto distributions Potentially related to self-similarity “heavy-tailed” implying non-predictability Popular in networking, Internet, and Dist. System communities
8
Our “Abnormal” Approach Measure availability as “lifetime” in a variety of settings —Student lab at UCSB, Condor pool New NWS availability sensors —Data used in fault-tolerance community for checkpointing research Predicting optimal checkpoint Develop robust software for MLE parameter estimation Fit Exponential, Pareto, and Weibull distributions Compare the fits —Visually —Goodness of fit tests Goal is to provide an automated mechanism for the NWS —Let the best distribution win
9
UCSB Student Computing Labs Approximately 85 machines running Red Hat Linux located in three separate buildings Open to all Computer Science graduate and undergraduates —Only graduates have building keys Power-switch is not protected —Anyone with physical access to the machine can reboot it by power cycling it Students routinely “clean off” competing users or intrusive processes to gain better performance response NWS deployed and monitoring duration between restarts Can we model the time-to-reboot?
10
UCSB Empirical CDF
11
MLE Weibull Fit to UCSB Data
12
Comparing Fits at UCSB
13
The Visual Acid Test
14
More Systems Condor: Cycle harvesting system (M. Livny, U. Wisconsin) —Workstations in a “pool” run the (trusted) Condor daemons —When a machine running a Condor job becomes “busy” Job is terminated (vanilla universe) —Unknown and constantly changing number of workstations in UWisc Condor Pool (~ 1000 Linux Workstations) Long, Muir, Golding Internet Survey (1995) —Pinged the rpc.statd as a heartbeat —Used extensive in fault-tolerance community to model host failure —1170 hosts covering 3 months of Spring
16
The Condor Picture April 2003 through Oct 2004, 600 hosts
17
More Condor April 2003 through July 2005, 900 hosts
18
Condor Clusters April 2003 through July 2005, 730 hosts
19
Condor Non-cluster April 2003 through July 2005, 170 hosts
20
Modeling Lessons Machine availability looks like it is well-modeled by a Weibull, but Condor process lifetime is trickier —Hyper-exponentials do well, but hard to fit and use —Log-normal looks better in the large (need to investigate more) —May be able to do piece-wise fit for “desktops” Who should care? —Grid simulators Availability is critical —P2P systems Oceanstore, CAN, TAPESTRY, etc. all assume very basic availability distributions in their proofs —Replication systems It does not mean, that model fitting works best for predicting availability => data shortage
21
Predicting Individual Machine Behavior Estimating Mean Time to Failure (MTTF) is relatively easy —Unless the data is Pareto, the mean is the “expected value” —Probably not what is needed to support scheduling The cost of being below the mean is not the same as the cost of being above it “At least how much time will elapse before this machine reboots with 95% certainty?” —The answer is the 0.05 quantile (not an expectation) from the cumulative distribution function (CDF)
22
Certainty in an Uncertain World Predictions of the form —“For at least how long with this machine be available with X% certainty?” Requires two estimates if certainty is to be quantified —Estimate the (1-X) quantile for the distribution of availability => Q x —Estimate the lower X% confidence bound on the statistic Q x => Q (x,lb) If the estimates are unbiased, and the distribution is stationary, future availability duration will be larger than Q (x,lb) X% of the time, guaranteed
23
Neo-classical Methods The classical (parametric) method has some drawbacks —Which distribution? —MLE is computationally challenging or impossible for some distributions and/or data sets —Requires quite a bit of data to get a good fit —Quantiles near the tails are “squeezed” so fit error is significant —Estimating confidence bounds for high-order models is computationally (and theoretically) difficult Non-parametric techniques —Can usually only recover a statistic and not the distribution —Those that appeal to the CLT may have an asymptote problem New non-parametric invention based on Binomial assumptions
24
Experiments in Fortune Telling CSIL, Condor (2 years), and Long data sets Split into training and experimental periods —Use only machines with 20 training samples or more —Using synthetic data we noticed that be best method needed at least 20 samples Use methods to estimate 95% confidence on 0.05 quantile from training period Record success if 95%-100% of the remaining experimental availability durations >= estimate Report % success (want to see 95%)
25
Non-parametric methods seem to work Data Set MLE Weibull Method BootstrappingBinomial Method CSIL93.3%96.7%96.6% Condor99.5%97.1%95.0% Long91.7%96.0%97.3% Weibull over-estimates the tail for Condor data Bootstrapping works okay, but is very computationally expensive
26
On-going Work with Condor Checkpoint scheduling —Parametric method reduces network load dramatically Applications —LDPC investigation => lowest observed error rates —Ramsey search —GridSAT —UCSBGrid Automatic Program Overlay —On-demand Condor as a grid programming middleware NWS Condor Integration —Publishing NWS forecasts via Hawkeye —Incorporating machine availability predictor
27
Thanks and More Miron Livny and the Condor group at the University of Wisconsin Darrell Long (UCSC) and James Plank (UTK) UCSB Facilities Staff NSF SCI and DOE Middleware and Applications Yielding Heterogeneous Environments for Metacomputing at UCSB Students: Matthew Allen, Wahid Chrabakh, Ryan Garver, Andrew Mutz, Dan Nurmi, Erik Peterson, Fred Tu, Lamia Youseff, Ye Wen Research Staff: John Brevik, Graziano Obertelli www.cs.ucsb.edu/~rich rich@cs.ucsb.edu
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.