
Workload Modeling and its Effect on Performance Evaluation Dror Feitelson Hebrew University

Performance Evaluation In system design –Selection of algorithms –Setting parameter values In procurement decisions –Value for money –Meet usage goals For capacity planning

The Good Old Days… The skies were blue The simulation results were conclusive Our scheme was better than theirs Feitelson & Jette, JSSPP 1997

But in their papers, Their scheme was better than ours!

How could they be so wrong?

Performance evaluation depends on: The system’s design (what we teach in algorithms and data structures) Its implementation (what we teach in programming courses) The workload to which it is subjected The metric used in the evaluation Interactions between these factors

Outline for Today Three examples of how workloads affect performance evaluation Workload modeling –Getting data –Fitting, correlations, stationarity… –Heavy tails, self similarity… Research agenda In the context of parallel job scheduling

Example #1 Gang Scheduling and Job Size Distribution

Gang What?!? Time slicing parallel jobs with coordinated context switching Ousterhout matrix Optimization: Alternative scheduling Ousterhout, ICDCS 1982

Packing Jobs Use a buddy system for allocating processors Feitelson & Rudolph, Computer 1990

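To make the packing concrete, here is a minimal sketch of buddy-system processor allocation, assuming a power-of-two machine size; class and method names are illustrative, not from any real scheduler:

```python
# A minimal sketch of buddy-system processor allocation. Requests are
# rounded up to the next power of two; larger free blocks are split
# into "buddies" as needed, and merged back on release.

class BuddyAllocator:
    def __init__(self, nprocs):
        self.nprocs = nprocs             # must be a power of two
        self.free = {nprocs: [0]}        # block size -> list of start offsets

    def alloc(self, n):
        """Allocate a power-of-two block of >= n processors.
        Returns (start, size) or None if no block is available."""
        size = 1
        while size < n:
            size *= 2                    # round the request up
        s = size
        while s <= self.nprocs and not self.free.get(s):
            s *= 2                       # smallest free block that fits
        if s > self.nprocs:
            return None
        start = self.free[s].pop()
        while s > size:                  # split, keeping one buddy free
            s //= 2
            self.free.setdefault(s, []).append(start + s)
        return (start, size)

    def release(self, start, size):
        """Free a block and merge buddies back together."""
        while size < self.nprocs:
            buddy = start ^ size         # buddy offset differs in one bit
            if buddy in self.free.get(size, []):
                self.free[size].remove(buddy)
                start = min(start, buddy)
                size *= 2
            else:
                break
        self.free.setdefault(size, []).append(start)
```

Note how a 3-processor request receives a 4-processor block: that is the internal fragmentation, while the predefined block boundaries are what make alternative scheduling more likely.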

The Question: The buddy system leads to internal fragmentation But it also improves the chances of alternative scheduling, because processors are allocated in predefined groups Which effect dominates the other?

The Answer (part 1): Feitelson & Rudolph, JPDC 1996

The Answer (part 2):

Many small jobs Many sequential jobs Many power-of-two jobs Practically no jobs use the full machine Conclusion: the buddy system should work well

Verification Feitelson, JSSPP 1996

Example #2 Parallel Job Scheduling and Job Scaling

Variable Partitioning Each job gets a dedicated partition for the duration of its execution Resembles 2D bin packing Packing large jobs first should lead to better performance But what about correlation of size and runtime?

Scaling Models Constant work –Parallelism for speedup: Amdahl’s Law –Large first → SJF Constant time –Size and runtime are uncorrelated Memory bound –Large first → LJF –Full-size jobs lead to blockout Worley, SIAM JSSC 1990

“Scan” Algorithm Keep jobs in separate queues according to size (sizes are powers of 2) Serve the queues Round Robin, scheduling all jobs from each queue (they pack perfectly) Assuming constant work model, large jobs only block the machine for a short time But the memory bound model would lead to excessive queueing of small jobs Krueger et al., IEEE TPDS 1994

The Data Data: SDSC Paragon, 1995/6

Conclusion Parallelism used for better results, not for faster results Constant work model is unrealistic Memory bound model is reasonable Scan algorithm will probably not perform well in practice

Example #3 Backfilling and User Runtime Estimation

Backfilling Variable partitioning can suffer from external fragmentation Backfilling optimization: move jobs forward to fill in holes in the schedule Requires knowledge of expected job runtimes

Variants EASY backfilling: make a reservation for the first queued job only Conservative backfilling: make reservations for all queued jobs
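A hedged sketch of the EASY variant, under simplifying assumptions (processor counts only, runtime estimates taken at face value); all names are illustrative:

```python
# A minimal sketch of EASY backfilling: a reservation is made for the
# first queued job only, and later jobs may jump ahead if they do not
# delay that reservation.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int       # processors requested
    estimate: float  # user's runtime estimate

def easy_schedule(queue, free, running, now):
    """queue: waiting jobs in arrival order. free: idle processors.
    running: list of (expected end time, nodes) for running jobs.
    Starts jobs in place and returns those started."""
    started = []
    while queue and queue[0].nodes <= free:      # head starts if it fits
        job = queue.pop(0)
        free -= job.nodes
        running.append((now + job.estimate, job.nodes))
        started.append(job)
    if not queue:
        return started
    # Reservation for the blocked head job: the earliest time when
    # enough processors are expected free, based on runtime estimates.
    head, avail = queue[0], free
    shadow, extra = float("inf"), 0
    for end, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            shadow = end                  # head's reservation time
            extra = avail - head.nodes    # processors spare even then
            break
    # Backfill: a later job may start now if it fits and is harmless,
    # i.e. it ends before the shadow time or uses only spare nodes.
    for job in list(queue[1:]):
        harmless = now + job.estimate <= shadow or job.nodes <= extra
        if job.nodes <= free and harmless:
            queue.remove(job)
            free -= job.nodes
            if now + job.estimate > shadow:
                extra -= job.nodes        # it occupies spare nodes at shadow
            running.append((now + job.estimate, job.nodes))
            started.append(job)
    return started
```

Conservative backfilling differs in that every queued job holds a reservation, so a backfill candidate must be harmless with respect to all of them, not just the head job.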

User Runtime Estimates Lower estimates improve the chance of being backfilled, and hence response time Estimates that are too low run the risk of having the job killed So estimates should be accurate, right?

They Aren’t Mu’alem & Feitelson, IEEE TPDS 2001

Surprising Consequences Inaccurate estimates actually lead to improved performance Performance evaluation results may depend on the accuracy of runtime estimates –Example: EASY vs. conservative –Using different workloads –And different metrics

EASY vs. Conservative Using CTC SP2 workload

EASY vs. Conservative Using Jann workload model

EASY vs. Conservative Using Feitelson workload model

Conflicting Results Explained Jann uses accurate runtime estimates This leads to a tighter schedule EASY is not affected too much Conservative manages less backfilling of long jobs, because it respects more reservations

Conservative is bad for the long jobs, but good for the short ones, whose reservations are respected

Conflicting Results Explained Response time sensitive to long jobs, which favor EASY Slowdown sensitive to short jobs, which favor conservative All this does not happen at CTC, because estimates are so loose that backfill can occur even under conservative

Verification Run CTC workload with accurate estimates

But What About My Model? It simply does not have such small, long jobs

Workload Data Sources

No Data Innovative, unprecedented systems –Wireless –Hand-held Use an educated guess –Self-similarity –Heavy tails –Zipf distribution

Serendipitous Data Data may be collected for various reasons –Accounting logs –Audit logs –Debugging logs –Just-so logs Can lead to wealth of information

NASA Ames iPSC/860 log Excerpt of jobs from Oct–Dec 1993, one record per job: user (user4, user41, user42, sysadmin, …), command (cmd, nqs, pwd, …), nodes, runtime, date, and time Feitelson & Nitzberg, JSSPP 1995

Distribution of Job Sizes

Distribution of Resource Use

Degree of Multiprogramming

System Utilization

Job Arrivals

Arriving Job Sizes

Distribution of Interarrival Times

Distribution of Runtimes

User Activity

Repeated Execution

Application Moldability

Distribution of Run Lengths

Predictability in Repeated Runs

Recurring Findings Many small and serial jobs Many power-of-two jobs Weak correlation of job size and duration Job runtimes are bounded but have CV>1 Inaccurate user runtime estimates Non-stationary arrivals (daily/weekly cycle) Power-law user activity, run lengths

Instrumentation Passive: snoop without interfering Active: modify the system –Collecting the data interferes with system behavior –Saving or downloading the data causes additional interference –Partial solution: model the interference

Data Sanitation Strange things happen Leaving them in is “safe” and “faithful” to the real data But it risks situations in which a non-representative situation dominates the evaluation results

Arrivals to SDSC SP2

Arrivals to LANL CM-5

Arrivals to CTC SP2

Arrivals to SDSC Paragon What are they doing at 3:30 AM?

3:30 AM Nearly every day, a set of 16 jobs is run by the same user Most probably the same set, as they typically have a similar pattern of runtimes Most probably these are administrative jobs that are executed automatically

Arrivals to CTC SP2

Arrivals to SDSC SP2

Arrivals to LANL CM-5

Arrivals to SDSC Paragon

Are These Outliers? These large activity outbreaks are easily distinguished from normal activity They last for several days to a few weeks They appear at intervals of several months to more than a year They are each caused by a single user! –Therefore easy to remove

Two Aspects In workload modeling, should you include this in the model? –In a general model, probably not –Conduct separate evaluation for special conditions (e.g. DOS attack) In evaluations using raw workload data, there is a danger of bias due to unknown special circumstances

Automation The idea: –Cluster daily data based on various workload attributes –Remove days that appear alone in a cluster –Repeat The problem: –Strange behavior often spans multiple days Cirne & Berman, Wkshp Workload Charact. 2001

Workload Modeling

Statistical Modeling Identify attributes of the workload Create empirical distribution of each attribute Fit empirical distribution to create model Synthetic workload is created by sampling from the model distributions
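A minimal sketch of this pipeline using SciPy; the lognormal form and the stand-in data are illustrative assumptions, not part of the original slides:

```python
# Empirical distribution -> fitted model -> synthetic workload.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
observed = rng.lognormal(mean=4.0, sigma=2.0, size=10_000)  # stand-in for a trace

# 1. Empirical distribution (CDF) of the attribute.
xs = np.sort(observed)
ecdf = np.arange(1, len(xs) + 1) / len(xs)

# 2. Fit a model distribution to the empirical data.
shape, loc, scale = stats.lognorm.fit(observed, floc=0)

# 3. Synthetic workload: sample from the fitted model.
synthetic = stats.lognorm.rvs(shape, loc=loc, scale=scale,
                              size=10_000, random_state=rng)
```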

Fitting by Moments Calculate model parameters to fit moments of empirical data Problem: does not fit the shape of the distribution

Jann et al., JSSPP 1997

Fitting by Moments Calculate model parameters to fit moments of empirical data Problem: does not fit the shape of the distribution Problem: very sensitive to extreme data values
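A small illustration of moment fitting, assuming a two-parameter gamma model (mean kθ, variance kθ²); it also shows the sensitivity to extreme values quantified on the next slide:

```python
# Fit a gamma distribution by matching the first two sample moments.
import numpy as np

def fit_gamma_by_moments(x):
    m, v = np.mean(x), np.var(x)
    theta = v / m          # scale
    k = m / theta          # shape
    return k, theta

rng = np.random.default_rng(0)
data = rng.gamma(2.0, 3.0, size=100_000)
print(fit_gamma_by_moments(data))            # close to (2, 3)
data_with_tail = np.append(data, 1e6)        # a single extreme value...
print(fit_gamma_by_moments(data_with_tail))  # ...distorts both parameters
```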

Effect of Extreme Runtime Values Change when top records omitted:
omit     mean    CV
0.01%   -2.1%   -29%
0.02%   -3.0%   -35%
0.04%   -3.7%   -39%
0.08%   -4.6%   -39%
0.16%   -5.7%   -42%
0.31%   -7.1%   -42%
Downey & Feitelson, PER 1999

Alternative: Fit to Shape Maximum likelihood: what distribution parameters were most likely to lead to the given observations –Needs an initial guess of the functional form Phase-type distributions –Construct the desired shape Goodness of fit –Kolmogorov-Smirnov: difference in CDFs –Anderson-Darling: added emphasis on the tail –May need to sample the observations
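A hedged sketch of fitting by shape with SciPy: a maximum-likelihood fit with lognormal as the assumed functional form, followed by a Kolmogorov-Smirnov check:

```python
# MLE fit plus Kolmogorov-Smirnov goodness-of-fit test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.lognormal(mean=3.0, sigma=1.5, size=5_000)  # stand-in data

shape, loc, scale = stats.lognorm.fit(sample, floc=0)    # MLE fit
ks = stats.kstest(sample, "lognorm", args=(shape, loc, scale))
print(f"max CDF difference D={ks.statistic:.4f}, p={ks.pvalue:.3f}")
```

Two caveats: the p-value is only indicative when the parameters were fitted from the same data, and with very large logs the test rejects even visually good fits, which is one reason to sample the observations before testing.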

Correlations Correlation can be measured by the correlation coefficient It can be modeled by a joint distribution function Both may not be very useful

Correlation Coefficient
system         CC
CTC SP2        …
KTH SP2        …
SDSC SP2       …
LANL CM-5      …
SDSC Paragon   0.305
Gives low results for the correlation of runtime and size in parallel systems

Distributions A restricted version of a joint distribution

Modeling Correlation Divide the range of one attribute into sub-ranges Create a separate model of the other attribute for each sub-range Models can be independent, or a model parameter can depend on the sub-range
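A minimal sketch of this sub-range approach for the size/runtime correlation: bin jobs by size and fit a separate runtime distribution per bin. The bin edges and the lognormal form are illustrative assumptions:

```python
# One runtime model per job-size sub-range.
import numpy as np
from scipy import stats

def runtime_models_by_size(sizes, runtimes, edges=(1, 2, 16, 128, 100_000)):
    sizes, runtimes = np.asarray(sizes), np.asarray(runtimes)
    models = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (sizes >= lo) & (sizes < hi)
        if mask.sum() >= 30:  # need enough observations for a stable fit
            models[(lo, hi)] = stats.lognorm.fit(runtimes[mask], floc=0)
    return models  # to synthesize a job: draw a size, then sample its bin's model
```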

Stationarity Problem of daily/weekly activity cycle –Not important if unit of activity is very small (network packet) –Very meaningful if unit of work is long (parallel job)

How to Modify the Load Multiply interarrivals or runtimes by a factor –Changes the effective length of the day Multiply machine size by a factor –Modifies packing properties Add users

Stationarity Problem of daily/weekly activity cycle –Not important if unit of activity is very small (network packet) –Very meaningful if unit of work is long (parallel job) Problem of new/old system –Immature workload –Leftover workload

Heavy Tails

Tail Types When a distribution has mean m, what is the distribution of samples that are larger than x? Light: expected to be smaller than x+m Memoryless: expected to be x+m Heavy: expected to be larger than x+m

Formal Definition Tail decays according to a power law: Pr[X > x] ~ x^(-a), 0 < a < 2 Test: the log-log complementary distribution plot is asymptotically a straight line with slope -a
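A short sketch of the LLCD test named on the slide: plot the empirical complementary CDF on log-log axes and look for an asymptotically straight line:

```python
# Log-log complementary distribution (LLCD) plot.
import numpy as np
import matplotlib.pyplot as plt

def llcd_plot(sample):
    x = np.sort(np.asarray(sample, dtype=float))
    ccdf = 1.0 - np.arange(1, len(x) + 1) / len(x)  # empirical Pr[X > x]
    plt.loglog(x[:-1], ccdf[:-1])  # drop the last point, where the CCDF is 0
    plt.xlabel("x")
    plt.ylabel("Pr[X > x]")
    plt.show()
```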

Consequences Large deviations from the mean are realistic Mass disparity –small fraction of samples responsible for large part of total mass –Most samples together account for negligible part of mass Crovella, JSSPP 2001

Unix File Sizes Survey, 1993

Unix File Sizes LLCD

Consequences Large deviations from the mean are realistic Mass disparity –small fraction of samples responsible for large part of total mass –Most samples together account for negligible part of mass Infinite moments –For a ≤ 1 the mean is undefined –For a ≤ 2 the variance is undefined Crovella, JSSPP 2001

Pareto Distribution With parameter a = 1 the density is proportional to 1/x² The expectation is then ∫ x · (1/x²) dx = ln x, i.e. it grows with the number of samples
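A quick demonstration of this divergence, sampling Pareto(a = 1) by inverse transform (x = 1/u for uniform u) and printing the running mean, which keeps growing instead of converging:

```python
# Running mean of Pareto(a=1) samples grows without bound.
import numpy as np

rng = np.random.default_rng(2)
samples = 1.0 / rng.uniform(size=1_000_000)  # Pareto, a = 1, x >= 1
running_mean = np.cumsum(samples) / np.arange(1, len(samples) + 1)
for n in (10**2, 10**4, 10**6):
    print(f"mean of first {n:>9,} samples: {running_mean[n - 1]:8.2f}")
```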

Pareto Samples

Effect of Samples from Tail In simulation: –A single sample may dominate results –Example: response times of processes In analysis: –Average long-term behavior may never happen in practice

Real Life Data samples are necessarily bounded The question is how to generalize to the model distribution –Arbitrary truncation –Lognormal or phase-type distributions –Something in between

Solution 1: Truncation Postulate an upper bound on the distribution Question: where to put the upper bound Probably OK for qualitative analysis May be problematic for quantitative simulations

Solution 2: Model the Sample Approximate the empirical distribution using a mixture of exponentials (e.g. phase-type distributions) In particular, exponential decay beyond the highest sample In some cases, a lognormal distribution provides a good fit Good for mathematical analysis

Solution 3: Dynamic Place an upper bound on the distribution The location of the bound depends on the total number of samples required Note: the bound does not change during the simulation
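A hedged sketch of the dynamic approach; the N^(1/a) scaling rule for the bound (motivated by the expected maximum of N Pareto(a) samples) is an assumption used here for illustration:

```python
# Truncated Pareto sampling with a bound chosen from the sample count.
import numpy as np

def truncated_pareto(a, n_samples, rng):
    bound = n_samples ** (1.0 / a)            # fixed for the whole run
    u = rng.uniform(size=n_samples)
    # Inverse-transform sampling of Pareto(a) restricted to [1, bound]:
    # F(x) = (1 - x**-a) / (1 - bound**-a).
    cdf_at_bound = 1.0 - bound ** -a
    return (1.0 - u * cdf_at_bound) ** (-1.0 / a)

rng = np.random.default_rng(3)
x = truncated_pareto(1.0, 100_000, rng)       # all samples lie below 100,000
```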

Self Similarity

The Phenomenon The whole has the same structure as certain parts Example: fractals In workloads: burstiness at many different time scales Note: relates to a time series

Job Arrivals to SDSC Paragon

Process Arrivals to SDSC Paragon

Long-Range Correlation A burst of activity implies that values in the time series are correlated A burst covering a large time frame implies correlation over a long range This is contrary to assumptions about the independence of samples

Aggregation Replace each subsequence of m consecutive values by their mean If self-similar, the new series will have statistical properties that are similar to the original (i.e. bursty) If independent, will tend to average out

Poisson Arrivals

Tests Essentially based on the burstiness-retaining nature of aggregation Rescaled range (R/s) metric: the range (sum) of n samples as a function of n

R/s Metric

Tests Essentially based on the burstiness-retaining nature of aggregation Rescaled range (R/s) metric: the range (sum) of n samples as a function of n Variance-time metric: the variance of an aggregated time series as a function of the aggregation level
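A sketch of the variance-time test; the aggregation levels are illustrative. For independent data the log-log slope is -1; a self-similar series decays more slowly (slope -b with 0 < b < 1, Hurst parameter H = 1 - b/2):

```python
# Variance of the aggregated series as a function of aggregation level.
import numpy as np

def variance_time(series, levels=(1, 2, 4, 8, 16, 32, 64, 128)):
    series = np.asarray(series, dtype=float)
    points = []
    for m in levels:
        k = len(series) // m
        agg = series[: k * m].reshape(k, m).mean(axis=1)  # aggregated series
        points.append((m, agg.var()))
    return points  # fit a line to log(var) vs. log(m) to estimate the slope
```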

Variance Time Metric

Modeling Self Similarity Generate workload by an on-off process –During on period, generate work at a steady pace –During off period, do nothing On and off period lengths are heavy tailed Multiplex many such sources Leads to long-range correlation
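A minimal sketch of this on-off construction with Pareto period lengths; all parameter values are illustrative:

```python
# Multiplexed on-off sources with heavy-tailed period lengths.
import numpy as np

def onoff_traffic(n_sources=50, length=10_000, a=1.5, seed=0):
    rng = np.random.default_rng(seed)
    total = np.zeros(length)
    for _ in range(n_sources):
        t, on = 0, rng.random() < 0.5
        while t < length:
            dur = int(rng.pareto(a) + 1)      # heavy-tailed period length
            if on:
                total[t : t + dur] += 1       # steady work while on
            t += dur
            on = not on
    return total  # bursty at many time scales; aggregation keeps the burstiness
```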

Research Areas

Effect of Users Workload is generated by users Human users do not behave like a random sampling process –Feedback based on system performance –Repetitive working patterns

Feedback User population is finite Users back off when performance is inadequate Negative feedback → better system stability Need to explicitly model this behavior
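A hedged sketch of what explicitly modeling this feedback might look like: a closed loop with a finite user population, where each user's think time before the next job stretches with the slowdown just experienced; all parameters are illustrative:

```python
# Closed-loop user model: high slowdown -> longer think times -> lower load.
import heapq, random

def closed_loop(n_users=20, service=1.0, horizon=100_000.0, seed=0):
    random.seed(seed)
    events = [(random.expovariate(0.1), u) for u in range(n_users)]
    heapq.heapify(events)                  # (submit time, user)
    server_free, slowdowns = 0.0, []
    while events:
        t, u = heapq.heappop(events)
        if t > horizon:
            break
        start = max(t, server_free)        # FCFS single server
        run = random.expovariate(1.0 / service)
        server_free = start + run
        slowdown = (server_free - t) / run
        slowdowns.append(slowdown)
        think = 10.0 * slowdown * random.expovariate(1.0)  # back off
        heapq.heappush(events, (server_free + think, u))
    return sum(slowdowns) / len(slowdowns)
```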

Locality of Sampling Users display different levels of activity at different times At any given time, only a small subset of users is active

Active Users

Locality of Sampling Users display different levels of activity at different times At any given time, only a small subset of users is active These users repeatedly do the same thing Workload observed by system is not a random sample from long-term distribution

SDSC Paragon Data

Growing Variability

SDSC Paragon Data

Locality of Sampling The questions: How does this affect the results of performance evaluation? Can this be exploited by the system, e.g. by a scheduler?

Hierarchical Workload Models Model of user population –Modify load by adding/deleting users Model of a single user’s activity –Built-in self similarity using heavy-tailed on/off times Model of application behavior and internal structure –Capture interaction with system attributes

A Small Problem We don’t have data for these models Especially for user behavior such as feedback –Need interaction with cognitive scientists And for distribution of application types and their parameters –Need detailed instrumentation

Final Words…

We like to think that we design systems based on solid foundations…

But beware: the foundations might be baseless assumptions!

Computer Systems are Complex We should have more “science” in computer science: Collect data rather than make assumptions Run experiments under different conditions Make measurements and observations Make predictions and verify them Share data and programs to promote good practices and ensure comparability

Advice from the Experts “Science is built of facts as a house is built of stones. But a collection of facts is no more a science than a heap of stones is a house” -- Henri Poincaré “Everything should be made as simple as possible, but not simpler” -- Albert Einstein

Acknowledgements Students: Ahuva Mu’alem, David Talby, Uri Lublin Larry Rudolph / MIT Data in Parallel Workloads Archive –Joefon Jann / IBM –Allen Downey / Wellesley –CTC SP2 log / Steven Hotovy –SDSC Paragon log / Reagan Moore –SDSC SP2 log / Victor Hazelwood –LANL CM-5 log / Curt Canada –NASA iPSC/860 log / Bill Nitzberg