1
CMU SCS Data Mining Meets Systems: Tools and Case Studies Christos Faloutsos SCS CMU
2
CMU SCS PDL 2008 C. Faloutsos
Thanks
– Spiros Papadimitriou (CMU -> IBM)
– Mengzhi Wang (CMU -> Google)
– Jimeng Sun (CMU -> IBM)
3
Outline
– Problem 1: workload characterization
– Problem 2: self-* monitoring
– Problem 3: BGP mining
– (Problem 4: sensor mining)
– (Problem 5: large graphs & Hadoop)
Tools: fractals, SVD, wavelets, tensors, PageRank
4
Problem #1
Given: a signal (e.g., #bytes over time)
Find: patterns, periodicities, and/or compress
[Figure: #bytes vs. time, bytes per 30’ (also: packets per day; earthquakes per year)]
5
Problem #1
– model bursty traffic
– generate realistic traces (Poisson does not work)
[Figure: # bytes vs. time, Poisson]
6
Motivation
– predict queue length distributions (e.g., to give probabilistic guarantees)
– “learn” traffic, for buffering, prefetching, ‘active disks’, web servers
7
Q: any ‘pattern’?
– Not Poisson
– spike; silence; more spikes; more silence… any rules?
[Figure: # bytes vs. time]
8
Solution: self-similarity
[Figure: # bytes vs. time]
9
But:
Q1: How to generate realistic traces; extrapolate?
Q2: How to estimate the model parameters?
10
Approach
Q1: How to generate a sequence that is
– bursty
– self-similar
– and has similar queue length distributions
11
Approach
A: ‘binomial multifractal’ [Wang+02] ~ 80-20 ‘law’:
– 80% of bytes/queries etc. on first half
– repeat recursively
b: bias factor (e.g., 80%)
12
Binary multifractals
[Figure: interval split into two halves, labeled 20 and 80]
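The recursive 80-20 split can be sketched in a few lines. This is a minimal illustration, not the code from [Wang+02]: the function name `b_model` and its parameters are made up here, and the optional `randomize` flag (choosing at random which half receives the larger share) is an assumption added for illustration; the slides only state that the first half gets the fraction b.

```python
import numpy as np

def b_model(total, b, levels, randomize=False, rng=None):
    """Sketch of a binomial multifractal trace with 2**levels time ticks.

    total:     total volume (e.g., bytes) to distribute
    b:         bias factor, e.g. 0.8 for the 80-20 'law'
    randomize: hypothetical option; if True, the heavy half is picked at random
    """
    if rng is None:
        rng = np.random.default_rng()
    trace = np.array([float(total)])
    for _ in range(levels):
        if randomize:
            heavy_left = rng.random(trace.size) < 0.5
        else:
            # As on the slide: the first half always receives fraction b.
            heavy_left = np.ones(trace.size, dtype=bool)
        left = np.where(heavy_left, b, 1.0 - b) * trace
        right = trace - left
        # Interleave so each interval is replaced by its two halves, in order.
        trace = np.column_stack([left, right]).ravel()
    return trace

if __name__ == "__main__":
    print(b_model(100, 0.8, 2))
```

Every split conserves the volume of its parent interval, so the trace always sums to `total`, whatever the bias factor.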
14
Parameter estimation
Q2: How to estimate the bias factor b?
15
Parameter estimation
Q2: How to estimate the bias factor b?
A: MANY ways [Crovella+96]
– Hurst exponent
– variance plot
– even DFT amplitude spectrum! (‘periodogram’)
– More robust: ‘entropy plot’ [Wang+02]
Mengzhi Wang, Tara Madhyastha, Ngai Hang Chang, Spiros Papadimitriou and Christos Faloutsos, Data Mining Meets Performance Evaluation: Fast Algorithms for Modeling Bursty Traffic, ICDE 2002
16
Entropy plot
Rationale:
– burstiness: inverse of uniformity
– entropy measures uniformity of a distribution
– find entropy at several granularities, to see whether/how our distribution is close to uniform
17
Entropy plot
Entropy E(n) after n levels of splits
n=1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$
[Figure: interval split into two halves; p_1, p_2 = % of bytes in each half]
18
Entropy plot
Entropy E(n) after n levels of splits
n=1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$
n=2: $E(2) = -\sum_{i=1}^{4} p_{2,i} \log_2 p_{2,i}$
[Figure: interval split into quarters with fractions p_{2,1}, p_{2,2}, p_{2,3}, p_{2,4}]
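The definition of E(n) above translates directly into code. The sketch below is illustrative only: the name `entropy_plot` is made up, and it assumes the trace length is a power of two so that the interval can be halved all the way down.

```python
import numpy as np

def entropy_plot(trace):
    """Return [E(0), E(1), ..., E(L)] for a trace of length 2**L."""
    x = np.asarray(trace, dtype=float)
    levels = x.size.bit_length() - 1
    if x.size == 0 or 2 ** levels != x.size:
        raise ValueError("trace length must be a power of two")
    p = x / x.sum()                      # fraction of volume per finest cell
    entropies = []
    for n in range(levels, -1, -1):      # from finest level down to n = 0
        nz = p[p > 0]                    # 0 * log(0) is taken as 0
        entropies.append(-(nz * np.log2(nz)).sum())
        if n > 0:
            p = p.reshape(-1, 2).sum(axis=1)   # merge adjacent cells
    return np.array(entropies[::-1])

if __name__ == "__main__":
    print(entropy_plot(np.ones(8)))      # uniform trace
    print(entropy_plot([0, 0, 5, 0]))    # all volume in a single cell
```

A uniform trace gains one bit per level, and a trace with all its volume in one cell stays at zero; bursty traces fall in between.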
19
Real traffic
Has linear entropy plot (-> self-similar)
[Figure: entropy E(n) vs. # of levels (n); slope 0.73]
20
Observation - intuition:
slope = intrinsic dimensionality =~ ‘degrees of freedom’, or info-bits per coordinate-bit
– uniform dataset: slope = 1
– multi-point (all points at the same position): slope = 0
[Figure: entropy E(n) vs. # of levels (n); slope 0.73]
21
Entropy plot - Intuition
Slope ~ intrinsic dimensionality (in fact, ‘Information fractal dimension’) = info bit per coordinate bit
E.g., Dim = 1: pick a point; reveal its coordinate bit-by-bit. How much info is each bit worth to me?
22
Entropy plot
Slope ~ intrinsic dimensionality (in fact, ‘Information fractal dimension’) = info bit per coordinate bit
E.g., Dim = 1: Is MSB 0? ‘Info’ value = E(1): 1 bit
23
Entropy plot
Slope ~ intrinsic dimensionality (in fact, ‘Information fractal dimension’) = info bit per coordinate bit
E.g., Dim = 1: Is MSB 0? Is next MSB = 0?
24
Entropy plot
Slope ~ intrinsic dimensionality (in fact, ‘Information fractal dimension’) = info bit per coordinate bit
E.g., Dim = 1: Is MSB 0? Is next MSB = 0? Info value = 1 bit = E(2) - E(1) = slope!
25
Entropy plot
Repeat, for all points at same position (Dim = 0):
26
Entropy plot
Repeat, for all points at same position (Dim = 0): we need 0 bits of info to determine position -> slope = 0 = intrinsic dimensionality
27
Entropy plot
Real (and 80-20) datasets can be in-between: bursts, gaps, smaller bursts, smaller gaps, at every scale
[Figure: three cases, Dim = 1; Dim = 0; 0 < Dim < 1]
28
(Fractals, again)
What set of points could have behavior between point and line?
29
Cantor dust
Eliminate the middle third
Recursively!
34
Cantor dust
Dimensionality? (no length; infinite # points!)
Answer: log 2 / log 3 ≈ 0.63
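A short derivation of that answer, using the usual box-counting argument (the symbols N_k and r_k are introduced here for illustration and do not appear on the slides): after k rounds of removing middle thirds, the set is covered by N_k pieces, each of length r_k.

```latex
\begin{align*}
N_k &= 2^k, \qquad r_k = 3^{-k} \\
D &= \lim_{k \to \infty} \frac{\log N_k}{\log (1/r_k)}
   = \lim_{k \to \infty} \frac{k \log 2}{k \log 3}
   = \frac{\log 2}{\log 3} \approx 0.63
\end{align*}
```

The total length after k rounds is N_k r_k = (2/3)^k, which goes to 0, consistent with "no length" while the number of pieces grows without bound.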
35
Some more entropy plots: Poisson vs. real
Poisson: slope ~ 1 -> uniformly distributed
[Figure: entropy plots; Poisson slope 1, real traffic slope 0.73]
36
B-model
– b-model traffic gives a perfectly linear plot
– Lemma: its slope is $\text{slope} = -b \log_2 b - (1-b) \log_2 (1-b)$
– Fitting: do entropy plot; get slope; solve for b
[Figure: E(n) vs. n]
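The "solve for b" step has no closed form, but the slope is monotonically decreasing in b on [0.5, 1], so a simple bisection works. This is a sketch under that observation; the function names are hypothetical and the slides do not say which solver was actually used.

```python
import math

def binary_entropy(b):
    """Slope of the b-model entropy plot: -b log2 b - (1-b) log2 (1-b)."""
    if b <= 0.0 or b >= 1.0:
        return 0.0                      # limit value: 0 * log(0) = 0
    return -b * math.log2(b) - (1.0 - b) * math.log2(1.0 - b)

def bias_from_slope(slope, tol=1e-12):
    """Invert binary_entropy on [0.5, 1] by bisection."""
    if not 0.0 <= slope <= 1.0:
        raise ValueError("slope must lie in [0, 1]")
    lo, hi = 0.5, 1.0                   # entropy falls from 1 to 0 on this range
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binary_entropy(mid) > slope:
            lo = mid                    # entropy still too high: b is larger
        else:
            hi = mid
    return (lo + hi) / 2.0

if __name__ == "__main__":
    # Slope reported for real traffic on the slides.
    print(round(bias_from_slope(0.73), 3))
```

For the slope of 0.73 shown for real traffic, this gives a bias factor close to 0.8, i.e., roughly the 80-20 'law'.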
37
Experimental setup
– Disk traces (from HP [Wilkes 93])
– Web traces from LBL: http://repository.cs.vt.edu/ lbl-conn-7.tar.Z
38
Model validation
– Linear entropy plots
– Bias factors b: 0.6-0.8
– smallest b / smoothest: nntp traffic
39
Web traffic - results
LBL, NCDF of queue lengths (log-log scales)
[Figure: Prob(> l) vs. queue length l]
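To make the plotted quantity concrete: one way to obtain queue lengths from a trace is to feed it to a single queue drained at a constant rate (the Lindley recursion) and then estimate Prob(> l) empirically. The slides do not describe the queueing setup used in the experiments, so the constant-rate server and the function names below are assumptions for illustration.

```python
import numpy as np

def queue_lengths(arrivals, service_rate):
    """Queue length after each tick: q = max(0, q + arrivals - service_rate)."""
    q = 0.0
    lengths = []
    for a in arrivals:
        q = max(0.0, q + a - service_rate)
        lengths.append(q)
    return np.array(lengths)

def ncdf(values, thresholds):
    """Empirical Prob(value > l) for each threshold l."""
    values = np.asarray(values, dtype=float)
    return np.array([(values > l).mean() for l in thresholds])

if __name__ == "__main__":
    q = queue_lengths([3, 0, 0, 5, 0], service_rate=2)
    print(q)
    print(ncdf(q, [0, 2]))
```

Comparing this curve for a real trace and for a synthetic trace with the fitted b is one way to check whether the model reproduces queueing behavior.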
40
Conclusions
Multifractals (80/20, ‘b-model’, Multiplicative Wavelet Model (MWM)) for analysis and synthesis of bursty traffic
41
Books
Fractals: Manfred Schroeder: Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise, W.H. Freeman and Company, 1991 (Probably the BEST book on fractals!)
42
Outline
– Problem 1: workload characterization
– Problem 2: self-* monitoring
– Problem 3: BGP mining
– (Problem 4: sensor mining)
– (Problem 5: large graphs & Hadoop)
43
Clusters/data center monitoring
– Monitor correlations of multiple measurements
– Automatically flag anomalous behavior
– InteMon: intelligent monitoring system
– warsteiner.db.cs.cmu.edu/demo/intemon.jsp
44
Publication
Evan Hoke, Jimeng Sun, John D. Strunk, Gregory R. Ganger, Christos Faloutsos. InteMon: Continuous Mining of Sensor Data in Large-scale Self-* Infrastructures. ACM SIGOPS Operating Systems Review, 40(3):38-44. ACM Press, July 2006
45
Under the hood: SVD
– Singular Value Decomposition
– Done incrementally
Spiros Papadimitriou, Jimeng Sun and Christos Faloutsos, Streaming Pattern Discovery in Multiple Time-Series, VLDB 2005, Trondheim, Norway.
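The slides give no detail on how the decomposition is done incrementally. As a rough illustration of the idea only, the sketch below tracks a single dominant direction with a recursive least-squares style update and an exponential forgetting factor. It is an assumption-laden sketch, not necessarily the algorithm of the VLDB 2005 paper; `track_direction` and `forgetting` are hypothetical names.

```python
import numpy as np

def track_direction(stream, forgetting=0.96):
    """Update one weight vector per arriving measurement vector x."""
    w = None
    energy = 1e-3                         # small positive start avoids division by zero
    for x in stream:
        x = np.asarray(x, dtype=float)
        if w is None:
            w = np.zeros_like(x)
            w[0] = 1.0                    # arbitrary initial direction
        y = w @ x                         # projection onto current direction
        energy = forgetting * energy + y * y
        residual = x - y * w              # part of x not explained by w
        w = w + (y / energy) * residual   # nudge w toward explaining the residual
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u = np.array([2.0, 1.0]) / np.sqrt(5.0)       # synthetic hidden direction
    load = rng.uniform(0.2, 0.9, size=500)
    stream = load[:, None] * u + rng.normal(scale=0.01, size=(500, 2))
    print(track_direction(stream))
```

Each update costs time proportional to the number of measurements and needs no stored history, which is what makes this style of update attractive for continuous monitoring.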
46
Singular Value Decomposition (SVD)
SVD (~ LSI ~ KL ~ PCA ~ spectral analysis...)
– LSI: S. Dumais; M. Berry
– KL: e.g., Duda+Hart
– PCA: e.g., Jolliffe
– Details: [Press+]
[Figure: u of CPU1 vs. u of CPU2, with points for t=1, t=2]
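A batch version of the picture above can be tried with `numpy.linalg.svd`. The data here are synthetic and made up for illustration (two CPU measurements driven by one hidden load, plus small noise); the point is only that the first right singular vector recovers the line along which the two measurements move together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: CPU2 runs at about half the level of CPU1.
load = rng.uniform(0.2, 0.9, size=200)
X = np.column_stack([load, 0.5 * load]) + rng.normal(scale=0.01, size=(200, 2))

# Rows are time ticks (t=1, t=2, ...), columns are the two measurements.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

direction = Vt[0]                  # dominant direction in (CPU1, CPU2) space
energy = s**2 / np.sum(s**2)       # share of total energy per component

if __name__ == "__main__":
    print("direction:", direction)
    print("energy share:", energy)
```

When one component carries almost all the energy, a new point that falls far from that direction is a natural candidate for an anomaly flag.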
50
Outline
– Problem 1: workload characterization
– Problem 2: self-* monitoring
– Problem 3: BGP mining
– (Problem 4: sensor mining)
– (Problem 5: large graphs & Hadoop)
51
BGP updates
With
– Aditya Prakash (CMU)
– Michalis Faloutsos (UC Riverside)
– Nicholas Valler (UC Riverside)
– Dave Andersen (CMU)
52
Tool #0: Time plot
Time series: # updates per 600 s, Washington router, 09/2004-09/2006
53
Tool #0: Time plot
– Observation #1: Missing values
– Observation #2: Bursty
54
Tool #1: Wavelets
55
Wavelets - DWT
– Short window Fourier transform (SWFT)
– But: how short should the window be?
[Figure: value vs. time; freq vs. time]
56
Wavelets - DWT
Answer: multiple window sizes! -> DWT
[Figure: time-frequency tilings for time domain, DFT, SWFT, DWT]
57
Haar wavelets
– subtract sum of left half from right half
– repeat recursively for quarters, eighths, ...
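The Haar recipe can be sketched with the standard pairwise pyramid, which yields the same half, quarter, eighth differences up to scaling. Assumptions in this sketch: the signal length is a power of two, the 1/sqrt(2) normalization is used so that energy is preserved, and the name `haar_dwt` is made up here.

```python
import numpy as np

def haar_dwt(signal):
    """Return (overall smooth coefficient, list of detail arrays, coarsest first)."""
    s = np.asarray(signal, dtype=float)
    if s.size == 0 or s.size & (s.size - 1):
        raise ValueError("signal length must be a power of two")
    details = []
    while s.size > 1:
        left, right = s[0::2], s[1::2]
        details.append((right - left) / np.sqrt(2))   # right minus left
        s = (left + right) / np.sqrt(2)               # smooth part, half as long
    return s[0], details[::-1]

if __name__ == "__main__":
    smooth, details = haar_dwt([1, 1, 1, 1, 5, 5, 5, 5])
    print(smooth)
    for level, d in enumerate(details):
        print(level, d)
```

For a step signal like the one above, only the coarsest detail coefficient is nonzero, while a short spike shows up at every scale around its position.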
58
‘Tornado plot’ for Washington router
Dark areas correspond to high energy
[Figure: frequency (low freq. to high freq.) vs. time]
59
Tornado plot: wavelet transform for Washington router, 09/2004-09/2006, all coefficients and detail levels 1-12
Observations:
1. Obvious spikes (E1): tornados that “touch down”
2. Prolonged spikes (E2 and E3): coarser scales have high values but finer scales do not
3. Intermittent waves (E4 and E5): high-energy entries at nearby scales correspond to local periodic motion
60
E2: Prolonged spike
Sustained period of relatively high activity
[Figure: magnification of updates on 28th Aug. 2005; # updates vs. time]
61
Tool #2: logarithms
62
Tool #2: logarithms
Prominent ‘clothesline’ at ~50 updates per 600 secs.
Culprit IP addresses:
– 192.211.42.0/24
– 216.109.38.0/24
– 207.157.115.0/24
All from Alabama (Supercomputing Center)!
63
Outline
– Problem 1: workload characterization
– Problem 2: self-* monitoring
– Problem 3: BGP mining
– (Problem 4: sensor mining)
– (Problem 5: large graphs & Hadoop)
Tools: fractals, SVD, wavelets, tensors, PageRank
64
Main point
Two-way street:
<- DM can use such infrastructures to find patterns
-> DM can help such systems/networks etc. to become self-healing, self-adjusting, ‘self-*’
Hot topic in Data Mining: finding patterns in Tera- and Peta-bytes
65
Additional resources
Machine learning classes at SCS/MLD
Tom Mitchell’s book on Machine Learning
– Classification
– Clustering/Anomaly detection
– Support vector machines
– Graphical models
– Bayesian networks
66
www.cs.cmu.edu/~christos
For code, papers etc.
WeH 7107
christos cs