Resource Characterization
Rich Wolski, Dan Nurmi, and John Brevik
Computer Science Department, University of California, Santa Barbara
VGrADS Site Visit, April 28, 2005

VGrADS Vision

VGrADS Functional Decomposition (so far)
[Diagram: layers labeled Grid Services, Virtual Grid, Program Preparation, Information Services, Resource Access, vgES, Programming Tools, VG, Globus, and MDS, over resources such as clusters, networks, archives, radars, and microscopes; applications shown include EMAN, LEAD, and GridSAT.]

VGrADS Information Services
[Diagram: components labeled Resources, vgES, Monitoring and Data Management, Resource Characterization, and Data Abstractions.]

Resource Characterization
- Abstract description of resources in terms of program-accessible attributes
- Quantitative approach: use automatic statistical methods to capture the dynamics of changing resource characteristics
  - Monitor data is plentiful and noisy
  - Must summarize the quantitative behavior of each resource
    - Reduce the complexity associated with using quantitative performance and reliability readings
    - Summaries must be statistically "reliable" to enable effective program-based reasoning and debugging (confidence measures)
- Key questions:
  - Can we make effective quantitative characterizations?
  - Can we deliver the characterizations scalably and fast enough?

Characterization Research
- Leverage previous work: NWS makes time-series predictions for characteristics that are well modeled as continuously changing levels
  - Network bandwidth and latency (end-to-end)
  - CPU load
  - Available memory
- New research: focus on characteristics that do not fit time-series models well
  - Resource availability and failure prediction
    - Predicted duration-until-next-failure as a quantitative characterization
    - Used to schedule checkpoints (poster by Dan Nurmi)
  - Batch queue wait-time prediction
    - A new approach to an old problem

Batch Queue Wait Time
- Goal: rigorous confidence bounds on the amount of time a specific job will wait in a batch queue before it is scheduled on a cluster or parallel machine
  - The statistical nature of the problem implies that a quantifiable confidence range is necessary
  - Need an answer that applies to an individual job, not the "average" or "expected" wait time
- A twenty-year-old problem
  - Job execution times and inter-arrival times are heavy-tailed
  - Policies are non-stationary and often ad hoc
- Previous work
  - Smith, Taylor, and Foster (IPDPS)
  - Downey (IPPS 1997)
  - Feitelson

Modeling and Prediction are Different
- Estimating the full wait-time distribution is fun but hard
  - From the distribution, calculating an expectation is possible
  - The sample average is an unbiased, robust estimate of the mean
  - Unless the data is Pareto, the mean is the "expected value"
  - But the mean is probably not what a user or scheduler needs
    - The cost of being below the mean is not the same as the cost of being above it
- "At most how long will I have to wait before my job runs?"
  - The answer is a percentile
- "At most how long will I have to wait before my job runs, with 95% confidence?"
  - The answer is the 95th percentile of the queue wait-time distribution, read from its cumulative distribution function (CDF)
- Goal: estimate the percentiles without fitting a distribution (see the sketch below)
- Better goal: estimate percentiles with quantified confidence bounds
  - A statistical guarantee at specified confidence levels
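As a minimal illustration of the point above, the hypothetical Python sketch below (not from the slides; the simulated Pareto wait times are an assumption) contrasts the sample mean with an empirical 95th percentile of heavy-tailed queue waits. The percentile answers "at most how long will I wait, 95% of the time?"; the mean does not.

    # Hypothetical sketch: mean vs. percentile for heavy-tailed queue waits.
    import random

    random.seed(0)
    # Simulated wait times (seconds) with a heavy, Pareto-like tail:
    # most jobs start quickly, a few wait a very long time.
    waits = [random.paretovariate(1.5) * 60 for _ in range(10_000)]

    def empirical_quantile(samples, q):
        """Return the q-th empirical quantile (0 < q < 1) of the samples."""
        ordered = sorted(samples)
        k = min(len(ordered) - 1, int(q * len(ordered)))
        return ordered[k]

    mean_wait = sum(waits) / len(waits)
    p95_wait = empirical_quantile(waits, 0.95)

    print(f"mean wait : {mean_wait:8.1f} s")   # does not bound any single job's wait
    print(f"95th pct  : {p95_wait:8.1f} s")    # "at most this long, 95% of the time"

Note that this empirical quantile is only a point estimate; the next slide is about attaching a statistical confidence bound to it.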

The Brevik Method
- John Brevik's invention, based on the binomial distribution
  - The probability that exactly j of n sample values fall below the q-th quantile is C(n, j) * q^j * (1 - q)^(n - j)
  - The probability that k or fewer values fall below the q-th quantile is the sum of these terms for j = 0, ..., k
- Very robust, requiring few sample points (why so few suffice is not fully understood)
  - The standard Normal approximation does not work
- Requires multi-precision arithmetic to calculate, because n and j can be quite large (see the sketch below)
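The sketch below is a hedged illustration of a binomial upper confidence bound on a quantile, in the spirit of the method described on this slide; the function name and interface are hypothetical, and exact rational arithmetic (Python's Fraction with arbitrary-precision integers) stands in for the multi-precision computation the slide mentions.

    # Binomial upper confidence bound on the q-th quantile (sketch).
    from fractions import Fraction
    from math import comb

    def quantile_upper_bound(samples, q, confidence):
        """Return a sample value that exceeds the true q-th quantile of the
        underlying distribution with probability >= confidence, or None if
        there are too few samples for the requested confidence level."""
        n = len(samples)
        ordered = sorted(samples)
        qf = Fraction(str(q))          # exact rational arithmetic
        cdf = Fraction(0)
        for j in range(n):
            # P(exactly j of the n samples fall below the true q-th quantile)
            cdf += comb(n, j) * qf**j * (1 - qf)**(n - j)
            # The (j+1)-th order statistic exceeds the true q-th quantile
            # exactly when at most j samples lie below it, which happens
            # with probability cdf.
            if cdf >= confidence:
                return ordered[j]      # (j+1)-th smallest sample (0-indexed)
        return None

    # Example: an upper bound on the 0.95 quantile at 95% confidence.
    # bound = quantile_upper_bound(wait_history, 0.95, 0.95)

The binomial coefficients and tail sums are computed exactly here, which matches the slide's observation that a Normal approximation is inadequate and that large n and j demand multi-precision arithmetic.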

How Well Does It Work?
- Examine batch queue logs that record wait times
- Choose a quantile and a confidence level
  - For example, the 0.95 quantile with 95% confidence
- For each job (see the replay sketch below)
  - Calculate the upper limit on the quantile
  - Observe whether the job's wait time is less than that limit
- For the entire trace, record the percentage of job wait times that fall below the prediction
  - This percentage should be at least the target quantile (here, 95%) if the method is working
- Five sites and machines (NERSC, LANL, LLNL, SDSC, TACC)
- Nine years of logs (1996 through 2005)
- More than 1,000,000 jobs
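A hedged sketch of the trace-replay evaluation described above, assuming the quantile_upper_bound function from the earlier sketch and a list of observed wait times in submission order; the minimum-history threshold and the unbounded history window are assumptions (the slides note the history must be managed adaptively).

    # Hypothetical replay of a batch-queue log: predict an upper bound on the
    # 0.95 quantile before each job, then check whether its actual wait fell
    # below the prediction.
    def replay_trace(jobs, q=0.95, confidence=0.95, min_history=50):
        history, correct, total = [], 0, 0
        for wait in jobs:
            if len(history) >= min_history:
                bound = quantile_upper_bound(history, q, confidence)
                if bound is not None:
                    total += 1
                    if wait <= bound:
                        correct += 1
            history.append(wait)   # naive: keeps all history; the slides call
                                   # for adaptive history control instead
        return correct / total if total else float("nan")

    # If the method is working, the returned fraction should be at least q.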

Quantifiable Confidence

Capturing Dynamics

Adapting to Non-stationarity

Choosing the Best Worst Case

Choosing the Best Middle Case

Choosing the Best Number of Processors

A Batch of Results
- The Brevik Method can predict quantiles with specified levels of confidence
  - Must control the history adaptively to handle non-stationarity
  - Robust and data-frugal enough to work for processor counts too (a much harder problem)
- Combinations of quantiles provide a qualitative way to evaluate resources (sketched below)
  - If both the median and the 95th percentile are lower, chances are the job will start sooner
- Quantiles provide a quantitative way to predict possible outcomes
  - For example, there is a 45% chance that a job will start between the median and the 95th percentile
- Possibly new scheduling research: quantitative contingency scheduling
  - Build a schedule with contingencies based on quantiles
  - Adjust based on conditional predictions
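Purely as an illustration of the qualitative comparison rule above (the machine names and quantile bounds are made up, not results from the slides), a selection step might look like this:

    # Hypothetical comparison of candidate machines by predicted wait-time
    # quantile bounds: if both the median and the 95th-percentile bounds are
    # lower, the job will probably start sooner there.
    candidates = {
        "machine_a": {"median": 120.0, "p95": 3600.0},   # bounds in seconds
        "machine_b": {"median": 300.0, "p95": 1800.0},
    }

    def dominates(x, y):
        """True if x's median and 95th-percentile bounds are both lower."""
        return x["median"] <= y["median"] and x["p95"] <= y["p95"]

    a, b = candidates["machine_a"], candidates["machine_b"]
    if dominates(a, b):
        choice = "machine_a"
    elif dominates(b, a):
        choice = "machine_b"
    else:
        # Neither dominates: a contingency schedule could submit to one and
        # re-evaluate against the other's quantiles as conditions change.
        choice = "machine_b" if b["p95"] < a["p95"] else "machine_a"
    print("submit to", choice)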

Delivering the Good News
- A quick look at the new VGrADS interface to the NWS
- 100x100 "virtual grids" delivered to the application level in 6 seconds
- Need a visual

Conclusions
- New automatic resource characterizations
  - New approaches to batch queue wait time and machine availability
  - Lead to new scheduling techniques (cf. the Dan Nurmi poster)
  - Quantifiable confidence levels
  - Result: a better abstract representation of resources and their dynamics
- New information-system data structures
  - Scalable and high performance
  - Provide an instantaneous "picture" of the resources
  - Result: virtualization in the information system promotes scalability and performance