Synthesizing Representative I/O Workloads for TPC-H
J. Zhang*, A. Sivasubramaniam*, H. Franke, N. Gautam*, Y. Zhang, S. Nagar
* Pennsylvania State University; IBM T.J. Watson; Rutgers University

Outline
Motivation
Related Work
Methodology
–Arrival Time
–Access Pattern
–Request Sizes
Accuracy of synthetic traces
Concluding Remarks

Motivation
I/O subsystems are critical for commercial services and production environments.
Realistic application workloads are essential for system design and evaluation.
TPC-H is a decision-support workload for business enterprises.

Disadvantages of Traces
Not easily obtainable
Can be very large
Difficult to get statistical confidence
Very difficult to change workload behavior
Do not isolate the influence of individual parameters
On the other hand, a deeper understanding of the workload can:
–Help generate a synthetic workload
–Help in system design itself

What do we need to synthesize?
Inter-arrival times (temporal behavior) of disk block requests
Access pattern (spatial behavior) of the blocks being referenced
Size (volume) of each I/O request

Related Work
Scientific application I/O behavior
–Time-series models for arrivals
–Sequentiality/Markov models for access patterns
Commercial/production workloads
–Self-similar arrival patterns
–Sequentiality in TPC-H/TPC-D
No prior complete synthesis of all three attributes for TPC-H

Our TPC-H Workload
Trace collection platform
–IBM Netfinity 8-way SMP with 2.5 GB memory and 15 disks
–Linux
–DB2 UDB EE V7.2
TPC-H configuration
–Power run of the 22 queries
–Tables partitioned across the disks
–30 GB dataset

Validation
Pipeline: identify characteristics from the original I/O traces, generate synthetic traces from them, and replay both through DiskSim 2.0 to compare response-time CDFs.
Metrics:
–RMS: root-mean-square error of the differences between the two CDF curves
–nRMS: RMS/m, where m is the average response time of the original trace
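To make the metrics concrete, here is a minimal Python sketch of RMS and nRMS between two response-time CDFs. It assumes `orig_rt` and `synth_rt` are arrays of per-request response times from the DiskSim runs; the number and placement of CDF sample points are our assumption, not the paper's.

```python
import numpy as np

def ecdf_at(samples, xs):
    """Empirical CDF of `samples` evaluated at the points `xs`."""
    s = np.sort(samples)
    return np.searchsorted(s, xs, side="right") / len(s)

def rms_nrms(orig_rt, synth_rt, npoints=1000):
    """RMS distance between the two response-time CDFs, and nRMS = RMS / m,
    where m is the mean response time of the original trace."""
    xs = np.linspace(0.0, max(orig_rt.max(), synth_rt.max()), npoints)
    diff = ecdf_at(orig_rt, xs) - ecdf_at(synth_rt, xs)
    rms = float(np.sqrt(np.mean(diff ** 2)))
    return rms, rms / float(orig_rt.mean())
```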

Overall Methodology
Arrival pattern characteristics
–Investigate correlations: time series, self-similarity, iid distributions
Access pattern characteristics
–Sequentiality / pseudo-sequentiality / randomness
–Size characteristics
Investigate correlations between time, space, and volume to obtain the final synthesis

Arrival Pattern
Statistical analysis
–Auto-correlation function (ACF) plots: show the correlation between the current inter-arrival time and the one x steps away

–Correlations seem very weak (<0.15 for 12 queries, <0.30 for the rest)
–Errors with time-series models (AR/MA/ARIMA/ARFIMA) are high
–No suggestion of self-similarity either
–Perhaps iid (independent and identically distributed) is not a bad assumption
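The ACF check is easy to reproduce; below is a generic sample-autocorrelation sketch (not the paper's code), assuming `arrival_times` is a sorted array of request timestamps from the trace.

```python
import numpy as np

def acf(x, max_lag=50):
    """Sample autocorrelation of a series for lags 1..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = float(np.dot(x, x))
    return np.array([np.dot(x[:-k], x[k:]) / denom
                     for k in range(1, max_lag + 1)])

# inter = np.diff(arrival_times)        # inter-arrival times from the trace
# print(np.abs(acf(inter)).max())       # weak (< 0.3) for these traces
```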

Fitting Distributions
–Tried hyper-exponential, normal, and Pareto distributions
–Used Maximum Likelihood Estimation (normal/Pareto) and Expectation Maximization (hyper-exponential) to estimate distribution parameters
–Used the K-S test to measure goodness of fit
–Ensured the maximum distance between the fitted distribution and the original CDF was less than 0.1
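A sketch of the fit-and-test step using SciPy's MLE fitters and K-S test. SciPy has no built-in hyper-exponential/EM fitter, so that branch of the procedure above is omitted here.

```python
import numpy as np
from scipy import stats

def fit_and_test(inter):
    """Fit candidate distributions to inter-arrival times by MLE and score
    each with the K-S statistic (max CDF distance; accept if < 0.1)."""
    results = {}
    for name, dist in [("normal", stats.norm), ("pareto", stats.pareto)]:
        params = dist.fit(inter)                   # MLE parameter estimates
        d, _ = stats.kstest(inter, dist.cdf, args=params)
        results[name] = (params, d)
    return results
```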

Comparing CDF of fitted distribution and data

Access Pattern (Location + Size)
Most studies use sequentiality to describe TPC-H; however, this is not always the case. The queries fall into three categories:
–Category 1: Q10, Q4, Q14
–Category 2: Q12, Q1, Q3, Q5, Q7, Q8, Q15, Q18, Q19, Q21
–Category 3: Q20, Q9, Q17
(Figure: block location vs. arrival time for a representative query of each category.)

Category 1: Intermingling Sequential Streams
Consider the following definitions:
–Run: a strictly sequential set of I/O requests
–Stream: a pseudo-sequential set of I/O requests that can be interrupted by another stream
–i.e., a stream may consist of several runs that are interrupted by runs of other streams

Run and Stream
(Figure: an example run of 5 requests; a pseudo-sequential stream of 4 requests; and an example trace in which Stream A and Stream B are interleaved.)

Secondary Attributes
–Run length: number of requests in a run
–Run start location: start sector of a run
–Stream length: number of requests in a stream
–Inter-stream jump distance: spatial separation between the start of a run and the previous request
–Intra-stream jump distance: spatial separation between successive requests within a stream
–Number of active streams (at any instant)
–Interference distance: number of requests between two successive requests of a stream
Empirical distributions for these are derived from the trace (a run-extraction sketch follows).
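The sketch below extracts runs under the definitions above, assuming `locations` and `sizes` are per-request start sectors and block counts. Detecting streams (runs that resume where an earlier, interrupted stream left off) needs extra bookkeeping that this sketch omits.

```python
import numpy as np

def extract_runs(locations, sizes):
    """Split a block-level trace into strictly sequential runs (a run
    continues while each request starts exactly where the previous one
    ended) and return the empirical run lengths and run start sectors."""
    run_lengths, run_starts = [], []
    length, start = 1, locations[0]
    for i in range(1, len(locations)):
        if locations[i] == locations[i - 1] + sizes[i - 1]:
            length += 1                    # still sequential: extend the run
        else:
            run_lengths.append(length)     # run breaks: record and restart
            run_starts.append(start)
            length, start = 1, locations[i]
    run_lengths.append(length)
    run_starts.append(start)
    return np.array(run_lengths), np.array(run_starts)
```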

Location Synthesis – Q10 (time and size taken from the trace)
–LocIID: locations are i.i.d.
–LocRUN: incorporates the run-length and run-start-location distributions
–LocSTREAM: combines all stream and run statistics

Request Size
Request sizes are one of 64, 128, 192, 256, 320, 384, 448, or 512 blocks.
But the attributes (location, size, time) are not independent!

Correlations between size and location
(Figure: fraction of requests at each size, for all requests vs. run-start vs. within-run requests.)

Correlations between size and time

Correlations between location and time

Final Synthesis Methodology (Category 1)
–Location: use LocSTREAM to generate start locations. There are two kinds of requests: run-start requests and within-run requests.
–Time: use Pr(inter-arrival time | run-start request) and Pr(inter-arrival time | within-run request) to generate times.
–Size: (1) for run-start requests, use Pr(size | inter-arrival time of run-start requests) to generate sizes; (2) for within-run requests, use Pr(size | within-run request) to generate sizes.
A sketch of this loop follows.
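One possible rendering of the recipe in Python. The empirical conditional distributions and the inter-arrival-time buckets used for the size conditioning are assumed inputs; the slides do not specify the binning.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(vals, probs):
    """Draw one value from an empirical (values, probabilities) pair."""
    return vals[rng.choice(len(vals), p=probs)]

def bucket(iat, edges=(0.001, 0.01, 0.1)):
    """Hypothetical discretization of inter-arrival times (seconds)."""
    return sum(iat > e for e in edges)

def synthesize(n, next_location, iat_dist, size_dist):
    """Category-1 synthesis loop: next_location() yields (sector,
    is_run_start) from the LocSTREAM model; iat_dist and size_dist hold
    the empirical conditional distributions described above."""
    t, trace = 0.0, []
    for _ in range(n):
        loc, is_start = next_location()
        iat = sample(*iat_dist[is_start])         # Pr(IAT | run start?)
        t += iat
        if is_start:                              # Pr(size | IAT of run starts)
            size = sample(*size_dist[("start", bucket(iat))])
        else:                                     # Pr(size | within-run)
            size = sample(*size_dist[("within",)])
        trace.append((t, loc, size))
    return trace
```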

The methodology is easily adapted for Category 2 (strictly sequential) and Category 3 (random) queries.
Validation: compare the response-time characteristics of the synthesized and real traces.

Validation of CDF of response times (Category 1)

Validation of CDF of response times (Category 2)

Validation of CDF of response times (Category 3)

Storage Requirements
(Table: per-query storage fraction (×0.001) and nRMS, for queries Q1, Q3, Q4, Q5, Q6, Q7, Q8, Q9, … and Q12, Q14, Q15, Q17, Q18, Q19, Q20, …; the numeric values were not preserved in this transcript.)

Contributions
A synthesis methodology that captures
–intermingling streams of requests
–correlations between request attributes
An application of this methodology to TPC-H
Along the way (for TPC-H):
–iid models can capture arrival-time characteristics
–strict sequentiality is not always the case

Backup slides

Validating arrival time synthesis

LocSTREAM
1. Use Pr(stream length) to generate stream lengths.
2. Use Pr(run length | stream length) to generate run lengths within each stream.
3. Generate the start location for each run:
   a) Use Pr(inter-stream jump distance) to generate the start location of the first run in a stream.
   b) Use Pr(intra-stream jump distance | this stream) to generate the start locations of the other runs in the stream.
4. Use Pr(interference distance) to interleave all the streams.
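A sketch of these four steps, assuming `dists` supplies samplers for the empirical distributions named above. Step 4 is simplified to a uniform interleaving rather than drawing interference distances, and the fixed request size is an assumption made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
REQ_BLOCKS = 64   # assumed fixed request size (blocks) for this sketch

def loc_stream(n_streams, dists):
    """Generate (start_sector, is_run_start) pairs following the four
    LocSTREAM steps; `dists` maps names to empirical samplers:
    stream_len(), run_len(slen), inter_jump(), intra_jump()."""
    streams, cursor = [], 0
    for _ in range(n_streams):
        slen, runs = dists["stream_len"](), []                # step 1
        while slen > 0:
            rlen = min(dists["run_len"](slen), slen)          # step 2
            runs.append(rlen)
            slen -= rlen
        cursor += dists["inter_jump"]()                       # step 3a
        streams.append({"runs": runs, "next": cursor})
    out = []
    while any(s["runs"] for s in streams):
        live = [i for i, s in enumerate(streams) if s["runs"]]
        s = streams[rng.choice(live)]   # step 4, simplified: uniform pick
        for i in range(s["runs"].pop(0)):
            out.append((s["next"], i == 0))
            s["next"] += REQ_BLOCKS
        s["next"] += dists["intra_jump"]()                    # step 3b
    return out
```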