Kun-chan Lan National ICT Australia John Heidemann USC/ISI

On the feasibility of utilizing correlations between user populations for traffic inference
Kun-chan Lan, National ICT Australia
John Heidemann, USC/ISI
2/22/2019 LCN 2005

Need for traffic inference
Need to integrate data from multiple points to get a network-wide view of traffic
Difficult to collect packet-level data at every single router
  An OC48 link can generate 100 GB of data per hour
Indirect measurement is needed to reduce measurement cost

Network tomography
One form of traffic inference (Zhang '00, Medina '02, Zhang '03)
  Uses link load statistics, collected via SNMP, to estimate the point-to-point traffic matrix
  Can be solved by linear programming, Bayesian inference, EM, etc.
[Figure: panels (a) and (b), example topology with nodes a, b, c, d, numbered links 1 to 5, and traffic shares of 50%, 30% and 20%]

Our work
Also in the context of traffic inference
Our approach: utilize correlations between user populations for traffic inference to reduce measurement overhead
  Infer the traffic of network A from measurements taken at network B, given that traffic A and traffic B are correlated

Traffic correlations
Examples of traffic correlations
  Organization membership (Padmanabhan '00)
  Web caching (Wang '99)
  Diurnal pattern (Paxson '94)
  Sharing of common resource (Zhang '90, Lan '01)
Research issues
  When is traffic correlated: traffic aggregation? similar users?
  How is traffic correlated: temporal? spatial?
  Which traffic parameters are correlated?

Outline of the talk
Temporal and spatial correlations between user populations
The effect of heavy-hitter flows on the observed correlation
Utilize correlations between user populations for traffic inference
Conclusion and future work

Data from similar networks
Goal: can we infer traffic at A based on traffic at B if A and B are similar networks?
Similar networks
  Networks with similar user populations
  More formal definitions later
Two subnets of a research institute, ISI: AI Division and Networking Division
  Users are mainly researchers
Two subnets of a university, USC: SAL and EEB
  Users are mainly students from CS and EE
The USC traces have 2.5 times as many users as the ISI traces

Traces
Larger number of users in UNI traces
Web/TCP is the dominant traffic in both traces
RES: Information Sciences Institute
UNI: University of Southern California

How traffic is correlated
Focus on web traffic in this study
Traffic parameters of web traffic
  User behavior
    Number of pages requested per user
    Page inter-arrival time (i.e. user "think" time)
  Page
    Number of objects within each page
    Object inter-arrival time
  Object
    Size of object
Distributions of traffic parameters are obtained from RAMP
[Figure labels: user-behavior parameters, application-specific parameters; temporal, spatial]

RAMP: RApid Model Parameterization
[Figure: a tcpdump trace is fed to RAMP, which outputs CDFs (e.g. page arrival, object size) used as ns web model parameters]
Network characteristics
  Bottleneck link BW
  RTT
  Number of nodes
User behavior
  Number of users
  User arrival
  Number of pages per user
  Page arrival
  Number of objects per page
  Object arrival
  Object size

[Figure: two panels, ISI and USC]
Temporal correlation in the same network
  => temporal regularity in user behavior?

[Figure: two panels, ISI and USC; annotation: variation in tail]
Spatial correlation between "similar" networks
  => similar users tend to access similar documents?
Stronger correlation between USC subnets
  => larger number of users in USC traces
  => effect of traffic aggregation

Cause of the variation in the tail?
[Figure: distributions of number of objects per page and object size, before and after removing flows > 1MB]
Heavy-hitter flows
  Flows that consume most of the network bandwidth
Distribution of the number of objects per page is bi-modal
The difference in the tail is less significant once heavy-hitter flows are removed
Mean of data above the 99% quantile
  before: ISI-a (583KB), ISI-b (2672KB)
  after: ISI-a (382KB), ISI-b (479KB)
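The tail statistic quoted above (mean of the data above the 99% quantile, before and after removing flows larger than 1MB) can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names, the integer top-1% rule and the synthetic sizes are all assumptions made for the example.

```python
def tail_mean(sizes, top_percent=1):
    """Mean of the largest `top_percent` percent of samples (at least one sample)."""
    ordered = sorted(sizes, reverse=True)
    k = max(1, len(ordered) * top_percent // 100)  # integer count of tail samples
    tail = ordered[:k]
    return sum(tail) / len(tail)


def drop_heavy_hitters(sizes, limit=1_000_000):
    """Remove flows larger than `limit` bytes (1MB threshold taken from the slide)."""
    return [s for s in sizes if s <= limit]


# Synthetic flow sizes in bytes (made up for illustration, not trace data).
sizes = [10_000] * 980 + [200_000] * 15 + [5_000_000] * 5

before = tail_mean(sizes)
after = tail_mean(drop_heavy_hitters(sizes))
print(before, after)
```

With only five very large flows in a thousand, the tail mean drops by more than an order of magnitude once they are removed, which is the kind of effect the before/after numbers on the slide describe.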

Reduce the effect of heavy-hitters
=> Increase the level of traffic aggregation
  Larger user population (× 2.5)
  Longer measurement period (3 hours, 6 hours, 12 hours)

Utilize correlations between user populations for traffic inference
Model traffic (T) as T = f(N, U, A)
  N: number of users
  U: user-behavior parameters (e.g. user "think" time)
  A: application-specific parameters (e.g. object size)
Our approach
  Based on initial measurements at t0 confirming the "similarity" between networks n1 and n2
  Use future measurements of n2 to predict the traffic in n1 at t1 and t2
  Assuming the correlation between n1 and n2 remains relatively unchanged over time
  Approximation of tail behavior

[Figure: timelines for n1 and n2. At t0 both networks are measured, giving f(Nn1t0, Un1t0, An1t0) and f(Nn2t0, Un2t0, An2t0), related through α and g(). At t1 and t2 only n2 is measured: f(Nn2t1, Un2t1, An2t1) and f(Nn2t2, Un2t2, An2t2)]
Derive N, U and A via RAMP
Test the similarity between An1 and An2
The derived α and g() can be used to predict future traffic of n1
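A rough sketch of the prediction step, under strong assumptions: the slides do not define α or g() precisely, so this example simply assumes g is linear, g(x) = c·x, and estimates c from the ratio of mean traffic volumes measured in both networks at t0. The function name and the numbers are hypothetical.

```python
from statistics import mean


def fit_linear_g(n1_baseline, n2_baseline):
    """Fit g(x) = c * x from baseline measurements of both networks at t0.

    Assumption: c is the ratio of mean traffic volumes; this is only one
    simple way to obtain a linear g.
    """
    c = mean(n1_baseline) / mean(n2_baseline)
    return lambda x: c * x


# Hypothetical per-interval traffic volumes (MB) at t0.
g = fit_linear_g(n1_baseline=[20, 40], n2_baseline=[10, 20])

# Later, only n2 is measured; n1 is inferred through g.
n2_at_t1 = 50
predicted_n1_at_t1 = g(n2_at_t1)
print(predicted_n1_at_t1)
```

The prediction is only as good as the assumption that the relationship measured at t0 still holds at t1 and t2.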

Similarity test
Normalize the tested distributions first
Strictly similar
  Test whether two distributions differ significantly in mean (Student's t-test), variance (F-test) and shape (K-S test)
  Two distributions are strictly similar if they pass all three tests at the 99% confidence level
Similarity function
  $s = w_1 m + w_2 v + w_3 D$
  $m = |\mu(N_1) - \mu(N_2)| / |\max(\mu(N_1), \mu(N_2))|$
  $v = |\sigma^2(N_1) - \sigma^2(N_2)| / |\max(\sigma^2(N_1), \sigma^2(N_2))|$
  $\mu$: mean, $\sigma^2$: variance, $D$: Kolmogorov-Smirnov D value, $N_1$ and $N_2$: data samples
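The similarity function can be sketched in a few lines. This is an illustration under assumptions: the slides give no values for the weights, so equal weights are used here, the normalization step is left to the caller, and the sample variance from the standard library stands in for the variance term.

```python
from bisect import bisect_right
from statistics import mean, variance


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between the empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)
        fb = bisect_right(sb, x) / len(sb)
        d = max(d, abs(fa - fb))
    return d


def similarity(n1, n2, weights=(1 / 3, 1 / 3, 1 / 3)):
    """s = w1*m + w2*v + w3*D; smaller values mean more similar samples.

    Assumes the larger mean and the larger variance are non-zero.
    """
    w1, w2, w3 = weights
    m1, m2 = mean(n1), mean(n2)
    v1, v2 = variance(n1), variance(n2)
    m = abs(m1 - m2) / abs(max(m1, m2))
    v = abs(v1 - v2) / abs(max(v1, v2))
    return w1 * m + w2 * v + w3 * ks_distance(n1, n2)


a = [1, 2, 3, 4]
b = [2, 3, 4, 5]
print(similarity(a, a), similarity(a, b))
```

Identical samples score 0; the shifted sample `b` scores higher because its mean and its empirical CDF both differ from those of `a`, even though the variances match.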

Evaluations
Take traces from n1 and n2 (tracen1 and tracen2)
Run similarity tests over tracen1 and tracen2 for each traffic parameter
If tracen1 and tracen2 pass the similarity tests
  Take another trace from n2, tracen2new
  Derive a simulation model of n1 (modeln1) based on tracen2new via RAMP
  Generate synthetic traffic of n1 (tracen1syn) using modeln1
  Compare tracen1syn against tracen2new
    First-order statistics, higher-order statistics (wavelet scaling plot)

Case 1
Goal: infer n1 (Networking Division) model using n2 (AI Division) trace
Test              Value     Result
Student's t-test  m=0.01    pass (99% confidence level)
F-test            v=0.009
K-S test          D=0.011   fail (critical value = )
[Figures: distribution of object size; output of n1 model: flow size, flow duration]

Case 2
Goal: infer n1 (Networking Division) model using n2 (business office) trace
Test              Value     Result
Student's t-test  m=0.16    fail
F-test            v=0.9
K-S test          D=0.049
[Figures: distribution of object size; output of n1 model: flow size, flow duration]

How to deal with variations in the tail?
Difficult to predict/simulate the tail behavior
  Requires a long simulation time to reach steady state
Our approximation
  Model body and tail separately
    c: cutoff point between body and tail
  Approximate the tail behavior with a constant value d
    d: the maximum possible value during the simulation
    For the object size distribution, d = bottleneck link BW × duration of simulation
Cumulative probability $f_T(x)$
  $f_{upper}(x) = \begin{cases} f_B(x), & x < c \\ q(c), & c < x < d \\ 1, & x > d \end{cases}$
  $f_{lower}(x) = \begin{cases} f_B(x), & x < c \\ 1, & x > c \end{cases}$
[Figure: cumulative probability vs. object size, showing f_B(x), q(c), c and d]
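The two bounding CDFs can be written down directly. The sketch below is an illustration, not the authors' code: the body CDF is taken to be an empirical CDF, the helper names are made up, and the behavior exactly at x = c and x = d (which the slide leaves open) is an assumption.

```python
from bisect import bisect_right


def empirical_cdf(samples):
    """Return F(x) = fraction of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def make_bounds(body_cdf, c, d):
    """Build (f_lower, f_upper) from the body CDF, cutoff c and cap d."""
    q_c = body_cdf(c)  # cumulative probability at the cutoff

    def f_lower(x):
        # All tail mass is placed at c, the smallest possible tail value.
        return body_cdf(x) if x < c else 1.0

    def f_upper(x):
        # All tail mass is placed at d, the largest possible tail value.
        if x < c:
            return body_cdf(x)
        if x < d:
            return q_c
        return 1.0

    return f_lower, f_upper


# Hypothetical numbers: cap d = bottleneck bandwidth x simulation duration.
bottleneck_bytes_per_s = 10
duration_s = 10
d = bottleneck_bytes_per_s * duration_s

body = empirical_cdf([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
f_lower, f_upper = make_bounds(body, c=8, d=d)
print(f_lower(8), f_upper(8), f_upper(d))
```

The names refer to the object sizes each CDF produces: `f_lower` never generates a value above c, while `f_upper` assigns every tail sample the value d.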

Effect of tail approximation
Effect of c on simulation: what is the minimum q(c) required?
  We look at q(c) at 99%, 98%, 97%, 96%, 95%
Performance metrics
  Total BW generated during the entire simulation: deviation = (BWmodel - BWtrace) / BWtrace
  Wavelet scaling plots
Our approximation performs well when q(c) is above 99%
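The bandwidth deviation metric is a signed relative error. A minimal sketch, with made-up byte counts standing in for the simulated and measured totals:

```python
def bw_deviation(bw_model, bw_trace):
    """Relative deviation of simulated total bandwidth from the trace total."""
    return (bw_model - bw_trace) / bw_trace


# Hypothetical totals in bytes over the whole run.
model_total = sum([40_000, 35_000, 30_000])
trace_total = sum([50_000, 30_000, 20_000])
deviation = bw_deviation(model_total, trace_total)
print(f"{deviation:+.1%}")
```

A positive value means the model generated more traffic than the trace, a negative value less.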

Open issues
Applicability of our approach for larger time scales
  We looked at the time scale of a few hours
Applicability of our approach for larger networks
  Could traffic from two different POPs still exhibit "similar" traffic statistics due to traffic aggregation?
Non-trivial to compute the correlation function g() between different networks
  For our traces, g(y)=cx
Similarity test
  We looked at only simple first-order statistics
  Higher-order statistics comparison might be necessary

Conclusion
Based on traces of web traffic, we observed that traffic can be correlated between similar networks
We showed the effect of heavy-hitter traffic on the tail
We showed the effect of traffic aggregation on the correlation of traffic distributions
We presented a methodology to infer traffic at places where continuously taking measurements is infeasible
We evaluated our methodology via simulations

Questions?

Approximation of tail behavior
If the original trace can be described by a Pareto distribution, i.e. $f_{trace}(x) = \alpha\beta^{\alpha} / x^{\alpha+1}$
  $E(f_{lower}) - E(f_{trace}) = -\alpha\beta(\beta/c)^{\alpha-1}/(\alpha-1) + c(\beta/c)^{\alpha}$
  $E(f_{upper}) - E(f_{trace}) = -\alpha\beta(\beta/c)^{\alpha-1}/(\alpha-1) + d(\beta/c)^{\alpha} + (d-c)(1-(\beta/c)^{\alpha})$
The difference is smaller when c increases or when d decreases
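The effect of the cutoff c can be checked numerically for the lower approximation only. The sketch assumes a Pareto density αβ^α/x^(α+1) with shape α > 1 and scale β, and assumes the lower approximation moves all probability mass above c onto the point c; the parameter values are arbitrary.

```python
def pareto_tail_mass(alpha, beta, c):
    """P(X > c) for a Pareto distribution with shape alpha and scale beta."""
    return (beta / c) ** alpha


def pareto_tail_expectation(alpha, beta, c):
    """Contribution of the region x > c to the mean; requires alpha > 1."""
    return alpha * beta * (beta / c) ** (alpha - 1) / (alpha - 1)


def lower_gap(alpha, beta, c):
    """E(f_lower) - E(f_trace) when the tail mass is moved onto the point c."""
    return c * pareto_tail_mass(alpha, beta, c) - pareto_tail_expectation(alpha, beta, c)


gap_small_c = lower_gap(alpha=2.0, beta=1.0, c=10.0)
gap_large_c = lower_gap(alpha=2.0, beta=1.0, c=100.0)
print(gap_small_c, gap_large_c)
```

For these parameters the gap is negative (the lower approximation underestimates the mean) and its magnitude shrinks as c grows, in line with the statement that the difference is smaller when c increases.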
