Trace-based Network Bandwidth Analysis and Prediction Yi QIAO 06/10/2002

OUTLINE 1. Introduction 2. Data Collection and Transformation 3. Basic Statistical Analysis of Bandwidth 4. Trace Classification 5. Bandwidth Prediction 6. Conclusion

1. Introduction Fact: Network bandwidth is one of the most important characteristics of both WANs and LANs. We want to know: What does a bandwidth time series look like? Are there any correlations between bandwidth at different times? Does bandwidth from different traces share any common properties? Is network bandwidth predictable or not? Are there any differences between bandwidth data from long-period traces and those from short traces?

Step by step: Trace Collection and Transformation Classification of the Traces Bandwidth Prediction

2. Data Collection and Transformation Three Data Sets: I. NLANR short-period (90 seconds) WAN traces II. AUCKLAND long-period (1 day) WAN traces III. BC traces: 2 WAN traces and 2 LAN traces

Converting a trace file to bandwidth data: start from the original trace file (timestamp + IP header + TCP header), keep only the timestamp and packet length (from the IP header), assign packets to bins according to their timestamps and compute the instantaneous bandwidth of each bin, and write the final bandwidth file.
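A minimal sketch of this binning step in Python (assuming the trace has already been reduced to parallel arrays of packet timestamps and lengths; the function name and the bytes-per-second unit are illustrative choices, not taken from the original tools):

```python
import numpy as np

def packets_to_bandwidth(timestamps, lengths, bin_size):
    """Bin packets by timestamp and compute the instantaneous bandwidth of each bin.

    timestamps: packet arrival times in seconds
    lengths:    packet lengths in bytes (taken from the IP header)
    bin_size:   bin width in seconds (e.g. 0.001, 0.01, 0.1, 1)
    Returns an array of bandwidth values in bytes per second, one per bin.
    """
    timestamps = np.asarray(timestamps, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    # Index of the bin each packet falls into, relative to the first packet
    bins = ((timestamps - timestamps.min()) // bin_size).astype(int)
    bytes_per_bin = np.bincount(bins, weights=lengths)
    return bytes_per_bin / bin_size
```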

3. Basic Statistical Analysis After some basic statistical analysis of the bandwidth data, such as the mean and maximum value of bandwidth and the standard deviation of bandwidth, we get … [Table: correlation coefficients]

[Table: COV, Max/Mean, and Min/Mean at different bin sizes; correlation coefficients showing the relationship between mean, min, and max bandwidth] Now, what's the effect of bin size on these properties?
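A small sketch of the per-trace statistics mentioned here, computed from the binned series produced above (COV is taken to mean the coefficient of variation, standard deviation over mean; the dictionary keys are illustrative):

```python
import numpy as np

def basic_stats(bandwidth):
    """Summary statistics of one binned bandwidth series."""
    mean = bandwidth.mean()
    return {
        "mean": mean,
        "max": bandwidth.max(),
        "std": bandwidth.std(),
        "cov": bandwidth.std() / mean,            # coefficient of variation
        "max_over_mean": bandwidth.max() / mean,
        "min_over_mean": bandwidth.min() / mean,
    }

# Effect of bin size: recompute the series at several bin widths
# for bs in (0.001, 0.01, 0.1, 1.0):
#     print(bs, basic_stats(packets_to_bandwidth(timestamps, lengths, bs)))
```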

Relationship between bin sizes and COV Relationship between bin sizes and Max/Mean

4. Trace Classification How to? What does the time series plot look like? What does the shape of the ACF plot look like? What percentage of ACFs is significant? What best describes the distribution (histogram) of bandwidth? What does the PSD plot look like? Is it decreasing linearly (in a log-log plot) as the frequency increases? Result: 12 classes for NLANR traces, 8 classes for AUCKLAND traces.
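These classification features can be computed directly from the binned series. The following is a rough sketch; the ±1.96/√N significance band for the ACF and the simple periodogram PSD are standard textbook choices and may differ from what the original study used:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation function for lags 1..max_lag."""
    x = x - x.mean()
    var = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / var
                     for k in range(1, max_lag + 1)])

def frac_significant_acf(x, max_lag=100):
    """Fraction of ACF values outside the approximate 95% significance band."""
    band = 1.96 / np.sqrt(len(x))
    return float(np.mean(np.abs(acf(x, max_lag)) > band))

def psd(x):
    """Simple periodogram estimate of the power spectral density."""
    spectrum = np.abs(np.fft.rfft(x - x.mean())) ** 2 / len(x)
    freqs = np.fft.rfftfreq(len(x))
    return freqs[1:], spectrum[1:]   # drop the DC component
```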

I. NLANR short-period WAN traces classification: A. Class 1: Not predictable, under-utilized. ACF: Small values, low percentage of significant ACFs. Bandwidth Distribution: Heavy-tailed distribution y = x^(-α). PSD: Flat, contains all-frequency components like white noise. Bin size: 0.001 s
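As an illustrative check of the "heavy-tailed distribution y = x^(-α)" criterion (a rough sketch of one possible fit, not the procedure used in the original study), one can fit a line to the log-log histogram of the binned bandwidth:

```python
import numpy as np

def estimate_power_law_alpha(bandwidth, num_bins=50):
    """Rough estimate of alpha in y = x^(-alpha) from a log-log histogram fit."""
    positive = bandwidth[bandwidth > 0]
    counts, edges = np.histogram(positive, bins=num_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    nonzero = counts > 0
    # Slope of log(count) vs. log(bandwidth); alpha is the negated slope
    slope, _intercept = np.polyfit(np.log(centers[nonzero]),
                                   np.log(counts[nonzero]), 1)
    return -slope
```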

Effect of different bin sizes: 0.01 s, 0.1 s, 1 s. Different bin sizes can all give us some useful information; we should use all of these bin sizes for each trace.

B. Class 2: Little predictability, under-utilized. ACF: Small values, low percentage of significant ACFs. Bandwidth Distribution: Multiple heavy-tailed distributions of the form y = x^(-α). PSD: Flat, contains all-frequency components like white noise. Bin size: 0.1 s for ACF; 0.001 s for other plots

C. Class 2a: No predictability, well-utilized. ACF: Small values, low percentage of significant ACFs. Bandwidth Distribution: Left branch: half of a normal distribution; right branch: heavy-tailed distribution y = x^(-α). PSD: Flat, contains all-frequency components like white noise. Bin size: 0.1 s for ACF; 0.001 s for other plots

D. Class 4: Some predictability, under-utilized. ACF: Over 50% significant ACFs. Bandwidth Distribution: Multiple heavy-tailed distributions of the form y = x^(-α). PSD: Decreasing linearly in a log-log plot as frequency increases; low-frequency components are dominant. Bin size: 0.1 s for ACF; 0.001 s for other plots

E. Class 5: Some predictability, fairly-utilized. ACF: Over 50% significant ACFs, high-frequency vibration. Bandwidth Distribution: Left branch: half of a normal distribution; right branch: heavy-tailed distribution y = x^(-α). PSD: A dominant frequency (frequency band) component. Bin size: 0.01 s for ACF; 0.001 s for other plots

II. AUCKLAND long-period WAN traces classification: A. Class 1: Good predictability, fairly-utilized. ACF: Over 90% significant ACFs, regular and smooth plot. Bandwidth Distribution: Two separate parts and two separate peaks, all heavy-tailed. PSD: Decreasing linearly in a log-log plot as frequency increases; low-frequency components are dominant. Bin size: 1 s for all plots

B. Class 1a: Good predictability, fairly-utilized. ACF: Over 85% significant ACFs, regular and smooth plot. Bandwidth Distribution: Two separate parts and two separate peaks, with large parts overlapping. PSD: Decreasing linearly in a log-log plot as frequency increases; low-frequency components are dominant. Bin size: 1 s for all plots

C. Class 2: Some predictability, well-utilized. ACF: Over 70% significant ACFs, with some high-frequency fluctuation. Bandwidth Distribution: Left branch: half of a normal distribution; right branch: heavy-tailed distribution y = x^(-α). PSD: Decreasing linearly in a log-log plot as frequency increases; low-frequency components are dominant. Bin size: 1 s for all plots

III. Tree-based Classification Why do this? Some classes can be very similar to each other while others are quite different; this can best be described by a tree structure. Tree-based classification enables us to classify traces at different granularities.

A.Tree-based Classification for NLANR traces

B. Tree-based Classification for Auckland traces

IV. Summary of Traces Classification Summary for NLANR traces (12 classes)

Summary for AUCKLAND traces (8 classes)

Pie Chart for NLANR traces and AUCKLAND traces

What else can we learn? All the long traces have some predictability. Most of the short traces are not predictable, and even for those short traces that are predictable, their predictability is still not as good as that of the long traces. Only a small fraction of the short traces make good use of the bandwidth, while all the long traces have good (or fairly good) utilization of the bandwidth. All traces that are predictable demonstrate some degree of long-range dependence, including both short NLANR traces and long AUCKLAND traces.

5. Bandwidth Prediction What do we want to know? What is the real predictability of each class that we identified? Which prediction model is best suited for bandwidth prediction? What is the effect of different bin sizes on bandwidth prediction? Prediction models used (part of the RPS Toolkit): MEAN, LAST, MA, BM, AR, ARMA, ARIMA, ARFIMA
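The study uses the predictors from the RPS Toolkit. As a rough Python stand-in (not the RPS implementation), an AR(p) predictor fitted by least squares might look like the sketch below; the default order p = 32 matches the AR 32 setting used later:

```python
import numpy as np

def fit_ar(train, p=32):
    """Least-squares fit of AR(p) coefficients: y_t = sum_i a_i * y_{t-i}."""
    n = len(train)
    # Row for time t holds [y_{t-1}, y_{t-2}, ..., y_{t-p}]
    X = np.column_stack([train[p - 1 - i : n - 1 - i] for i in range(p)])
    coeffs, *_ = np.linalg.lstsq(X, train[p:], rcond=None)
    return coeffs

def ar_one_step(series, coeffs):
    """One-step-ahead predictions over `series`, always using the true past values."""
    p = len(coeffs)
    return np.array([np.dot(coeffs, series[t - p:t][::-1])
                     for t in range(p, len(series))])
```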

How to evaluate predictability? Three evaluation criteria: I. The ratio of the mean squared error (msqerr) to the variance of the testing sequence, that is, msqerr / Var(testing sequence). II. How well does the error distribution fit the normal distribution? (= 1 ideally) III. What percentage of ACFs of the prediction error is significant? (= 0 ideally)
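A sketch of these three criteria computed from the one-step prediction errors on the testing sequence (the normal-probability-plot correlation is used here as an illustrative measure of normality and may differ from the statistic used in the original work; frac_significant_acf is the helper defined in the classification sketch above):

```python
import numpy as np
from scipy import stats

def evaluate_prediction(predictions, actual, max_lag=100):
    """The three predictability criteria, computed on the testing sequence."""
    errors = actual - predictions

    # I.  Mean squared error normalized by the variance of the testing sequence
    msqerr_ratio = np.mean(errors ** 2) / np.var(actual)

    # II. Closeness of the error distribution to normal: correlation coefficient
    #     of the normal probability plot (1.0 ideally)
    _, (_slope, _intercept, r) = stats.probplot(errors, dist="norm")

    # III. Fraction of significant error autocorrelations (0.0 ideally)
    sig_acf = frac_significant_acf(errors, max_lag)

    return {"msqerr/var": msqerr_ratio, "normal_fit": r, "sig_acf_frac": sig_acf}
```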

I. Effectiveness of different predictors A. Bandwidth prediction for NLANR traces. Mean squared error / variance of the testing sequence. Bin size: 0.01 s

Normal distribution fit; percentage of error ACFs that are significant. Bin size: 0.01 s

B. Bandwidth prediction for AUCKLAND traces. Mean squared error / variance of the testing sequence. Bin size: 10 s

Normal distribution fit; percentage of error ACFs that are significant. Bin size: 10 s

C. Bandwidth prediction for BC traces. Mean squared error / variance of the testing sequence. Bin size: 10 s for the 2 WAN traces, 0.1 s for the 2 LAN traces

What does bandwidth prediction really look like? An AUCKLAND trace (bin sizes: 1000 s, 100 s, 10 s and 1 s); an NLANR trace (bin sizes: 1 s, 0.1 s, 0.01 s and 0.001 s)

D. Observations For almost all classes of traces, the AR model yields optimal or near-optimal prediction results among the eight predictors tested. For almost all classes and all predictors, the error distributions are very close to normal. The value of sigacffrac for the AR model is almost always the lowest among all predictors for any class. Our expectations of predictability for the different classes have been confirmed by the real results: all the long traces are predictable, and a large fraction of them have very good predictability, while for short traces only 20% have some predictability. BC traces also have some predictability.

II. Influence of bin size on bandwidth prediction A. NLANR traces (AR 32). Mean squared error / variance of the testing sequence at different bin sizes (0.001 s, 0.01 s, 0.1 s and 1 s)

B. AUCKLAND traces (AR 32). Mean squared error / variance of the testing sequence at different bin sizes (1 s, 10 s, 100 s and 1000 s)

C. Observations For NLANR traces, a bin size of 0.1 second gives the best prediction among the four bin sizes. For most AUCKLAND traces, a bin size of 100 seconds or 10 seconds gives the best prediction performance among the four bin sizes. For any trace, there probably exists an optimal bin size that gives the best prediction performance.
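To probe for such an optimal bin size, one can simply re-bin the trace at several candidate widths and compare the normalized AR prediction error at each, as in the sketch below (reusing the earlier helper functions; the half-and-half train/test split and the candidate widths are illustrative assumptions):

```python
import numpy as np

def sweep_bin_sizes(timestamps, lengths,
                    bin_sizes=(0.001, 0.01, 0.1, 1.0), p=32):
    """msqerr/variance of an AR(p) one-step predictor at each candidate bin size."""
    results = {}
    for bs in bin_sizes:
        bw = packets_to_bandwidth(timestamps, lengths, bs)
        split = len(bw) // 2                         # first half fits, second half tests
        coeffs = fit_ar(bw[:split], p)
        preds = ar_one_step(bw[split - p:], coeffs)  # warm up with the last p training points
        actual = bw[split:]
        results[bs] = np.mean((actual - preds) ** 2) / np.var(actual)
    return results
```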

D. Further Probe For AUCKLAND traces, there seems to be an optimal bin size for bandwidth prediction… There seems to be an optimal bin size around 20 seconds. Red: a Class 1 trace; Green: a Class 1c trace

6. Conclusion Bandwidth traces can be classified based on their time series plots, ACF plots, bandwidth distributions, and PSD plots. Most long-period WAN traces are predictable, with some degree of long-range dependence. A small fraction of short-period WAN traces have some predictability, also with some degree of long-range dependence. The BC LAN traces are also predictable. The AR model is an ideal model for prediction because of its accuracy and efficiency. For each trace, there exists an "optimal" bin size at which we can get the best prediction performance.

Acknowledgement Many Thanks to Peter, Dong, and Jason!