

© 2010 AT&T Intellectual Property. All rights reserved. AT&T, the AT&T logo and all other AT&T marks contained herein are trademarks of AT&T Intellectual Property and/or AT&T affiliated companies. All other marks contained herein are the property of their respective owners.

Mining Port-level IP Traffic Data
Errol Caby
AT&T Labs

Outline
- IP traffic metrics
- Exploring the relationship between IP traffic metrics
- Classifying IP traffic patterns
- Making IP traffic projections

IP Traffic Metrics

IP services such as VPN (Virtual Private Network) are provided through ports, which are identified by IP address and circuit ID. High utilization levels (high traffic levels compared to the port's bandwidth) may degrade these services. It is therefore valuable to analyze IP traffic data at the port level, to identify the ports that currently have high utilization and to predict those that will reach high utilization within a given period of time.

Two IP traffic metrics:
- Monthly utilization: the monthly utilization of a circuit is the average of the daily peak utilization for the month, where utilization measures the fraction (or percent) of bandwidth used.
- Hours of over-utilization: the hours of over-utilization of a circuit is the length of time (in hours) in a month during which the utilization exceeds a specified threshold.
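The two metric definitions above can be sketched in a few lines of Python. This is only an illustration: the slides do not say how utilization is sampled, so the sketch assumes one utilization sample per hour (each sample above the threshold then counts as one hour). The function names, the sample data, and the 0.8 threshold are hypothetical.

```python
from statistics import mean


def monthly_utilization(daily_peaks):
    """Average of the daily peak utilizations (fractions of bandwidth)."""
    return mean(daily_peaks)


def hours_over_utilization(hourly_utilization, threshold):
    """Count hourly samples above the threshold (assumes one sample per hour)."""
    return sum(1 for u in hourly_utilization if u > threshold)


# Hypothetical data: three days of hourly utilization for one port.
hourly = [0.2] * 20 + [0.9] * 4 + [0.3] * 24 + [0.5] * 22 + [0.85] * 2

# Split into days of 24 samples and take each day's peak.
days = [hourly[i:i + 24] for i in range(0, len(hourly), 24)]
daily_peaks = [max(day) for day in days]

m1 = monthly_utilization(daily_peaks)               # mean of 0.9, 0.3, 0.85
m2 = hours_over_utilization(hourly, threshold=0.8)  # 4 + 2 samples above 0.8
```

With finer-grained samples, the count would have to be scaled by the sampling interval to yield hours.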

Exploring The Relationship Between The Two IP Traffic Metrics

Let m_1 and m_2 denote the monthly utilization and hours of over-utilization metrics, respectively. (That is, if x is a port, then m_1(x) and m_2(x) denote its monthly utilization and hours of over-utilization.) We would like to examine the relationship between m_1 and m_2. In particular, we would like to find a mapping f such that m_1(x) = f(m_2(x)) for any port x.

The challenge

The available data consisted of the two traffic metrics evaluated on disjoint sets of ports: monthly utilization was calculated for one set, and hours of over-utilization was calculated for a different, disjoint set.

Exploring The Relationship Between The Two IP Traffic Metrics (cont.)

A definition (consistency): Let n_1 and n_2 be two metrics on a set W. We say that n_1 and n_2 are consistent on W if, for all u and v in W, n_1(u) < n_1(v) if and only if n_2(u) < n_2(v).

Assume that m_1 and m_2 (the monthly utilization and hours of over-utilization, respectively) are consistent on the set of ports y for which m_2(y) > 0. Consistency makes f order-preserving on this set, which has the following consequences:
- If Y is a set of ports with m_2(y) > 0 for all y in Y, then f maps the p-th percentile of {m_2(y) | y in Y} to the p-th percentile of {m_1(y) | y in Y}. That is, f maps percentiles to corresponding percentiles.
- Furthermore, let X be a set of ports on which m_1 has been evaluated, such that {m_1(x) | x in X} and {m_1(y) | y in Y} can be considered samples from the same distribution. (The values in {m_1(y) | y in Y} are assumed to be unknown; the values in {m_1(x) | x in X} are known.) Then the mapping f can be determined from the result above: if m_2(y_0) is the p-th percentile of {m_2(y) | y in Y}, then f(m_2(y_0)) can be estimated by the p-th percentile of {m_1(x) | x in X}.
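The percentile-matching estimate described above can be sketched as follows. The slides do not specify how percentiles are computed, so this sketch assumes the empirical CDF for the rank and a nearest-rank rule for the quantile; the function names and the sample values are hypothetical.

```python
import math
from bisect import bisect_right


def percentile_rank(sorted_values, v):
    """Fraction of sample values <= v (empirical CDF)."""
    return bisect_right(sorted_values, v) / len(sorted_values)


def quantile(sorted_values, p):
    """Nearest-rank quantile: smallest value whose empirical CDF is >= p."""
    n = len(sorted_values)
    idx = max(0, min(n - 1, math.ceil(p * n) - 1))
    return sorted_values[idx]


def estimate_f(m2_on_Y, m1_on_X):
    """Return an estimate of f built by matching percentiles of the two samples."""
    ys = sorted(m2_on_Y)
    xs = sorted(m1_on_X)

    def f(m2_value):
        p = percentile_rank(ys, m2_value)  # percentile of m2_value within Y
        return quantile(xs, p)             # same percentile within m1 on X

    return f


# Hypothetical samples: over-utilization hours on Y (all > 0) and
# monthly utilization on a disjoint port set X.
f = estimate_f(
    m2_on_Y=[1, 5, 10, 40],
    m1_on_X=[0.55, 0.6, 0.7, 0.8, 0.9, 0.95, 0.97, 0.99],
)
estimates = [f(1), f(10), f(40)]
```

The estimate is only as good as the assumption that the m_1 values on X and on Y come from the same distribution.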

Illustration: Exploring The Relationship Between The IP Traffic Metrics At The Circuit Level

[Figure: plot of estimated points of the mapping f; axes: Over-Utilization Hrs and Monthly Utilization]

Illustration: Exploring The Relationship Between The IP Traffic Metrics At The Circuit Level (cont.)

A closed form of the mapping f may be estimated through curve fitting. A good fit was found using a curve of the form [formula not preserved in the transcript].

[Figure: fitted curve; axes: Over-Utilization Hrs and Monthly Utilization]

Mining IP Traffic Patterns

Objective: devise an algorithm for mining the time series history of the monthly utilization for a large number of ports that
- classifies the time series pattern for each port, and
- forecasts the monthly utilization a number of months into the future, port by port, in order to identify ports whose utilization will soon exceed the over-utilization threshold.

The algorithm should be simple, so that it runs quickly and places few requirements on the computing environment (e.g., it does not require a sophisticated computing platform).

Normalizing Utilization

The IP environment is dynamic, and a port's bandwidth may change. Because monthly utilization expresses the percent of the bandwidth used, it must be adjusted to reveal the true pattern of the traffic. This can be done by normalizing monthly utilization, that is, expressing it in terms of a single bandwidth for the entire time period considered.
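A minimal sketch of this normalization: each month's utilization is converted back to traffic (utilization times that month's bandwidth) and then re-expressed as a fraction of one reference bandwidth. The slides do not say which bandwidth serves as the reference; using the most recent one here is an assumption, and the function name and data are hypothetical.

```python
def normalize_utilization(utilization, bandwidth, reference_bandwidth):
    """Re-express each month's utilization relative to one reference bandwidth."""
    return [u * b / reference_bandwidth for u, b in zip(utilization, bandwidth)]


# Hypothetical port: bandwidth doubled from 100 to 200 Mbps in the third month.
utilization = [0.5, 0.6, 0.35]   # fraction of that month's bandwidth
bandwidth = [100, 100, 200]      # Mbps in effect each month

# Assumption: normalize to the most recent bandwidth.
normalized = normalize_utilization(utilization, bandwidth, bandwidth[-1])
```

In this example the raw series appears to drop in the third month, while the normalized series (roughly 0.25, 0.30, 0.35) shows steady growth in traffic.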

Example: Normalizing Traffic Patterns

The plot on the left is the original time series of monthly utilization; the plot on the right is the normalized monthly utilization. Note that the patterns are different.

[Figure: left panel, Utilization vs. Month; right panel, Adj. Utilization vs. Month]

Traffic Pattern Classification (cont.)

To describe a port's traffic, the curve (from a small set of families of curves) that is closest to the traffic time series is found. This curve, together with the root-mean-square error, describes the traffic pattern: the curve gives the general trend of the traffic, and the root-mean-square error captures the fluctuation about this trend. The traffic pattern can then be classified according to the family to which the curve belongs.

For simplicity, the families of curves considered were 2-parameter families of the form y = a*f(x) + b, where f is a function of x. The three functions f(x) = x, f(x) = x^2 and f(x) = log_e(x) were found to be sufficient to capture many of the patterns that occur. The resulting three families of curves are:
- y = a*x + b (constant growth rate)
- y = a*x^2 + b (increasing growth rate)
- y = a*log_e(x) + b (slowing growth rate)

Since the curves (models) are linear in the parameters, the best-fitting curve in a family can be found by the usual least squares technique. Also, since the curves all have two parameters, the best-fitting curve overall can be found by choosing the one that maximizes R^2 (equivalently, the one that minimizes the residual sum of squares).
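The classification step above can be sketched with ordinary least squares on the transformed variable t = f(x). This is an illustrative sketch, not the presenter's implementation: the names `FAMILIES`, `fit_family` and `classify` are hypothetical, months are assumed to be numbered from 1 so that log_e(x) is defined, and the sample series is synthetic.

```python
import math

# The three families y = a*f(x) + b from the slide, keyed by a hypothetical label.
FAMILIES = {
    "linear": lambda x: x,         # constant growth rate
    "quadratic": lambda x: x * x,  # increasing growth rate
    "log": math.log,               # slowing growth rate
}


def fit_family(f, xs, ys):
    """Least-squares fit of y = a*f(x) + b; returns (a, b, r_squared, rmse)."""
    t = [f(x) for x in xs]
    n = len(xs)
    t_mean = sum(t) / n
    y_mean = sum(ys) / n
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    s_ty = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, ys))
    a = s_ty / s_tt
    b = y_mean - a * t_mean
    ss_res = sum((yi - (a * ti + b)) ** 2 for ti, yi in zip(t, ys))
    ss_tot = sum((yi - y_mean) ** 2 for yi in ys)
    return a, b, 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n)


def classify(xs, ys):
    """Fit all three families and return the name and fit with the largest R^2."""
    fits = {name: fit_family(f, xs, ys) for name, f in FAMILIES.items()}
    best = max(fits, key=lambda name: fits[name][2])
    return best, fits[best]


# Synthetic series with slowing growth: y = 0.1*log(x) + 0.3 over 12 months.
months = list(range(1, 13))
series = [0.1 * math.log(x) + 0.3 for x in months]
family, (a, b, r2, rmse) = classify(months, series)
```

Because every family has the same number of parameters and is fitted to the same points, comparing R^2 across families is equivalent to comparing residual sums of squares.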

IP Traffic Projection

To evaluate how well the three families of curves succeeded in describing and differentiating traffic patterns, and how well they predicted future traffic:
- The set of points (say n points) in the available time series was divided into two sets, the first n - k points and the last k points, where k < n - k.
- The curve (from all three families of curves) that best fitted the first n - k points, i.e., maximized R^2, was selected as the one describing the traffic pattern.
- The mean absolute error between this curve and the traffic time series, calculated for the last k points, was then compared with the corresponding mean absolute errors for the best-fitting curves (based on the first n - k points) from the other two families of curves.
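The evaluation procedure above can be sketched as a simple holdout test. The sketch is self-contained and illustrative only: the helper names are hypothetical, months are assumed to be numbered from 1, and the series is synthetic. The 17/5 split matches the split used in the examples that follow.

```python
import math

FAMILIES = {
    "linear": lambda x: x,
    "quadratic": lambda x: x * x,
    "log": math.log,
}


def fit(f, xs, ys):
    """Least-squares fit of y = a*f(x) + b; returns (a, b, r_squared)."""
    t = [f(x) for x in xs]
    n = len(xs)
    t_mean, y_mean = sum(t) / n, sum(ys) / n
    a = (sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, ys))
         / sum((ti - t_mean) ** 2 for ti in t))
    b = y_mean - a * t_mean
    ss_res = sum((yi - (a * ti + b)) ** 2 for ti, yi in zip(t, ys))
    ss_tot = sum((yi - y_mean) ** 2 for yi in ys)
    return a, b, 1.0 - ss_res / ss_tot


def evaluate(ys, k):
    """Fit on the first n-k points; report R^2 there and MAE on the last k."""
    n = len(ys)
    if not 0 < k < n - k:
        raise ValueError("need 0 < k < n - k")
    xs = list(range(1, n + 1))  # months numbered from 1 so log(x) is defined
    train_x, train_y = xs[:n - k], ys[:n - k]
    test_x, test_y = xs[n - k:], ys[n - k:]
    results = {}
    for name, f in FAMILIES.items():
        a, b, r2 = fit(f, train_x, train_y)
        mae = sum(abs(y - (a * f(x) + b)) for x, y in zip(test_x, test_y)) / k
        results[name] = (r2, mae)
    # The family with the largest R^2 on the fitting window is selected.
    selected = max(results, key=lambda name: results[name][0])
    return selected, results


# Synthetic series with accelerating growth, 22 months, last 5 held out.
series = [0.2 + 0.001 * x * x for x in range(1, 23)]
selected, results = evaluate(series, k=5)
```

The check of interest is whether the selected family also has the smallest mean absolute error on the held-out points, which it does for this synthetic series.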

IP Traffic Projection: Example 1

The best-fitting curve to the first 17 points is of the form y = a*log_e(x) + b.

The mean absolute error between this curve and the last 5 points of the traffic time series is smaller than the mean absolute errors of the best-fitting curves from the other families.

IP Traffic Projection: Example 2

The best-fitting curve to the first 17 points is of the form y = a*x^2 + b.

The mean absolute error between this curve and the last 5 points of the traffic time series is smaller than the mean absolute errors of the best-fitting curves from the other families.

Conclusion

Testing the algorithm on a small set of ports has yielded results suggesting that the three families of 2-parameter curves may be sufficient to capture the key elements of the traffic patterns. Full evaluation awaits the full-scale implementation of the algorithm.