Published by Myles Lindsey. Modified over 9 years ago.
1
Carnegie Mellon, DB/IR '06, C. Faloutsos
Data Mining on Streams
Christos Faloutsos
CMU
2
THANK YOU!
Prof. Panos Ipeirotis
Julia Mills
3
Joint work with
Spiros Papadimitriou (CMU->IBM)
Jimeng Sun (CMU/CS)
Anthony Brockwell (CMU/Stat)
Jeanne Vanbriesen (CMU/CivEng)
Greg Ganger (CMU/ECE)
4
Outline
Problem and motivation
Single-sequence mining: AWSOM
Co-evolving sequences: SPIRIT
Lag correlations: BRAID
Conclusions
5
Problem definition - example
Each sensor collects data (x_1, x_2, …, x_t, …)
6
Problem definition
Given: one or more sequences
x_1, x_2, …, x_t, …
(y_1, y_2, …, y_t, …
…)
Find
– patterns; correlations; outliers
– incrementally!
7
Limitations / Challenges
Find patterns using a method that is
nimble: limited resources
– Memory
– Bandwidth, power, CPU
incremental: on-line, ‘any-time’ response
– single pass (‘you get to see it only once’)
automatic: no human intervention
– e.g., in remote environments
8
Application domains
Sensor devices
– Temperature, weather measurements
– Road traffic data
– Geological observations
– Patient physiological data
Embedded devices
– Network routers
– Intelligent (active) disks
9
Motivation - Applications (cont’d)
‘Smart house’
– sensors monitor temperature, humidity, air quality
video surveillance
10
Motivation - Applications (cont’d)
civil/automobile infrastructure
– bridge vibrations [Oppenheim+02]
– road conditions / traffic monitoring
11
Motivation - Applications (cont’d)
Weather, environment/anti-pollution
– volcano monitoring
– air/water pollutant monitoring
12
Motivation - Applications (cont’d)
Computer systems
– ‘Active Disks’ (buffering, prefetching)
– web servers (ditto)
– network traffic monitoring
– ...
13
InteMon
w/ Evan Hoke, Jimeng Sun
self-* PetaByte data center at CMU
14
Outline
Problem and motivation
Single-sequence mining: AWSOM
Co-evolving sequences: SPIRIT
Lag correlations: BRAID
Conclusions
15
Single sequence mining - AWSOM
with Spiros Papadimitriou (CMU -> IBM)
Anthony Brockwell (CMU/Stat)
16
Problem definition
Semi-infinite streams of values (time series) x_1, x_2, …, x_t, …
Find patterns, forecasts, outliers…
[Figure annotations: Periodicity? (daily); Periodicity? (twice daily); “Noise”??]
17
Requirements / Goals
Adapt and handle arbitrary periodic components
and
nimble (limited resources, single pass)
on-line, any-time
automatic (no human intervention/tuning)
18
Overview
Introduction / Related work
Background
Main idea
Experimental results
19
Wavelets
Example – Haar transform
[Figure: signal x_t over time t and its Haar coefficients W_{1,1}, W_{1,2}, W_{1,3}, W_{1,4}, W_{2,1}, W_{2,2}, W_{3,1} and V_{4,1} (“constant”); axes: time, frequency]
20
Wavelets
Why we like them
Wavelets compress many real signals well:
– Image compression and processing
– Vision
– Astronomy, seismology, …
Wavelet coefficients can be updated as new points arrive
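To make the Haar transform named on these slides concrete, here is a minimal batch sketch in Python. The function name `haar_transform`, the orthonormal 1/sqrt(2) scaling, and the power-of-two length restriction are assumptions made for illustration. The slides do not specify them, and the incremental (streaming) update the talk relies on is not shown here.

```python
import math

def haar_transform(x):
    """Orthonormal Haar decomposition of a sequence whose length is a power of two.

    Returns (details, smooth): details[j] holds the detail (wavelet)
    coefficients of level j+1, finest level first; smooth is the single
    remaining scaling ("constant") coefficient.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    s = [float(v) for v in x]
    details = []
    root2 = math.sqrt(2.0)
    while len(s) > 1:
        # Pairwise differences give the details, pairwise sums the next smooth level.
        d = [(s[i] - s[i + 1]) / root2 for i in range(0, len(s), 2)]
        s = [(s[i] + s[i + 1]) / root2 for i in range(0, len(s), 2)]
        details.append(d)
    return details, s[0]
```

Each coefficient depends only on one aligned block of input points, so a coefficient at level j can be finalized once its 2^j points have arrived, without revisiting older data.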
21
Overview
Introduction / Related work
Background
Main idea
Experimental results
22
AWSOM
[Figure: signal x_t over time t, shown equal to its wavelet coefficients W_{1,1}, W_{1,2}, W_{1,3}, W_{1,4}, W_{2,1}, W_{2,2}, W_{3,1}, V_{4,1}; axes: time, frequency]
23
AWSOM
[Figure: signal x_t over time t and its wavelet coefficients W_{1,1}, W_{1,2}, W_{1,3}, W_{1,4}, W_{2,1}, W_{2,2}, W_{3,1}, V_{4,1}; axes: time, frequency]
24
AWSOM - idea
[Figure: coefficients W_{l,t}, W_{l,t-1}, W_{l,t-2} at level l and W_{l',t'}, W_{l',t'-1}, W_{l',t'-2} at level l']
$W_{l,t} = \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$
$W_{l',t'} = \beta_{l',1} W_{l',t'-1} + \beta_{l',2} W_{l',t'-2} + \dots$
25
More details…
Update of wavelet coefficients (incremental)
Update of linear models (incremental; RLS)
Feature selection (single-pass)
– Not all correlations are significant
– Throw away the insignificant ones (“noise”)
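The slide mentions that the linear models are updated incrementally with RLS (recursive least squares). The following is a generic textbook RLS sketch, not the authors' implementation. The class name, the `delta` initialization of the inverse-covariance matrix, and the optional forgetting factor `lam` are assumptions for illustration.

```python
import numpy as np

class RecursiveLeastSquares:
    """Generic RLS for y ~ beta . x with k regression coefficients."""

    def __init__(self, k, delta=1e6, lam=1.0):
        self.beta = np.zeros(k)        # current coefficient estimates
        self.P = delta * np.eye(k)     # estimate of the inverse covariance
        self.lam = lam                 # forgetting factor (1.0 = no forgetting)

    def update(self, x, y):
        """Absorb one (regressors, target) pair in O(k^2) time."""
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        gain = Px / (self.lam + x @ Px)
        err = y - self.beta @ x        # prediction error before the update
        self.beta = self.beta + gain * err
        self.P = (self.P - np.outer(gain, Px)) / self.lam
        return err
```

In the AWSOM setting the regressors would be earlier wavelet coefficients (for example W_{l,t-1} and W_{l,t-2}) and the target the newest coefficient W_{l,t}. Each update touches only the k-by-k matrix, which is consistent with the O(k^2) update time quoted on the complexity slide.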
26
Complexity
Model update
Space: O(lg N + mk^2) ≈ O(lg N)
Time: O(k^2) ≈ O(1)
Where
– N: number of points (so far)
– k: number of regression coefficients; fixed
– m: number of linear models; O(lg N)
27
Overview
Introduction / Related work
Background
Main idea
Experimental results
28
Results - Synthetic data
Triangle pulse
Mix (sine + square)
AR captures wrong trend (or none)
Seasonal AR estimation fails
[Figure panels: AWSOM, AR, Seasonal AR]
29
Results - Real data
Automobile traffic
– Daily periodicity
– Bursty “noise” at smaller scales
AR fails to capture any trend
Seasonal AR estimation fails
30
Results - real data
Sunspot intensity
– Slightly time-varying “period”
AR captures wrong trend
Seasonal ARIMA
– wrong downward trend, despite help by human!
31
Conclusions
Adapt and handle arbitrary periodic components
and
nimble
– Limited memory (logarithmic)
– Constant-time update
on-line, any-time
– Single pass over the data
automatic: No human intervention/tuning
32
Outline
Problem and motivation
Single-sequence mining: AWSOM
Co-evolving sequences: SPIRIT
Lag correlations: BRAID
Conclusions
33
Part 2
SPIRIT: Mining co-evolving streams
[Papadimitriou, Sun, Faloutsos, VLDB05]
34
Motivation
E.g., chlorine concentration in water distribution network
35
Motivation
water distribution network: normal operation
May have hundreds of measurements, but it is unlikely they are completely unrelated!
[Figure: chlorine concentrations over Phase 1, Phase 2, Phase 3]
36
Motivation
water distribution network: normal operation, major leak
[Figure: chlorine concentrations over Phase 1, Phase 2, Phase 3; sensors near leak; sensors away from leak]
38
Motivation
We would like to discover a few “hidden (latent) variables” that summarize the key trends
[Figure: chlorine concentrations, Phase 1; actual measurements (n streams); k hidden variable(s), k = 1]
39
Motivation
We would like to discover a few “hidden (latent) variables” that summarize the key trends
[Figure: chlorine concentrations, Phase 1 and Phase 2; actual measurements (n streams); k hidden variable(s), k = 2]
40
Motivation
We would like to discover a few “hidden (latent) variables” that summarize the key trends
[Figure: chlorine concentrations, Phase 1, Phase 2 and Phase 3; actual measurements (n streams); k hidden variable(s), k = 1]
41
Goals
Discover “hidden” (latent) variables for:
– Summarization of main trends for users
– Efficient forecasting, spotting outliers/anomalies
and the usual:
nimble: Limited memory requirements
on-line, any-time: (single pass etc)
automatic: No special parameters to tune
42
Related work
Stream mining
Stream SVD [Guha, Gunopulos, Koudas / KDD03]
StatStream [Zhu, Shasha / VLDB02]
Clustering [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04]
Classification [Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]
43
Related work
Stream mining
Piecewise approximations [Palpanas, Vlachos, Keogh, et al / ICDE 2004]
Queries on streams [Dobra, Garofalakis, Gehrke, et al / SIGMOD02], [Madden, Franklin, Hellerstein, et al / OSDI02], [Considine, Li, Kollios, et al / ICDE04], [Hammad, Aref, Elmagarmid / SSDBM03]
…
44
Overview
Part 2
Method
Experiments
Conclusions & Other work
45
Stream correlations
Step 1: How to capture correlations?
Step 2: How to do it incrementally, when we have a very large number of points?
Step 3: How to dynamically adjust the number of hidden variables?
46
1. How to capture correlations?
[Figure: first sensor, Temperature t_1 over time; axis from 20 °C to 30 °C]
47
1. How to capture correlations?
[Figure: first sensor and second sensor, Temperature t_2 over time; axis from 20 °C to 30 °C]
48
1. How to capture correlations
Correlations: Let’s take a closer look at the first three value-pairs…
[Figure: Temperature t_1 vs. Temperature t_2; both axes from 20 °C to 30 °C]
49
1. How to capture correlations
First three lie (almost) on a line in the space of value-pairs…
O(n) numbers for the slope, and
One number for each value-pair (offset on line)
offset = “hidden variable”
[Figure: Temperature t_1 vs. Temperature t_2, both axes from 20 °C to 30 °C; points labeled time=1, time=2, time=3]
50
1. How to capture correlations
Other pairs also follow the same pattern: they lie (approximately) on this line
[Figure: Temperature t_1 vs. Temperature t_2; both axes from 20 °C to 30 °C]
51
Stream correlations
Step 1: How to capture correlations?
Step 2: How to do it incrementally, when we have a very large number of points?
Step 3: How to dynamically adjust the number of hidden variables?
52
Incremental updates
[Figure: Temperature T_1 vs. Temperature T_2, both axes from 20 °C to 30 °C; “error” marked]
53
Incremental updates
Algorithm runs in O(n) where n = # of streams
no need to access old data
[Figure: Temperature T_1 axis from 20 °C to 30 °C; “error” marked]
54
Stream correlations
Principal Component Analysis (PCA)
The “line” is the first principal component (PC)
This line is optimal: it minimizes the sum of squared projection errors
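As a batch (non-streaming) illustration of the claim that the line is the first principal component, the sketch below computes that direction with an SVD and returns the sum of squared projection errors it minimizes. The function name and the mean-centering step are assumptions for illustration. The slides do not say whether the data are centered.

```python
import numpy as np

def first_principal_component(X):
    """Return (w, y, err) for the rows of X (one value-pair or n-tuple per row).

    w   : unit vector along the best-fitting line (first principal component)
    y   : one "hidden variable" per row (offset along the line)
    err : sum of squared projection errors, which w minimizes
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # assumption: center the data first
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    w = Vt[0]                                    # first right singular vector
    y = Xc @ w
    residual = Xc - np.outer(y, w)
    return w, y, float((residual ** 2).sum())
```

For two temperature sensors that track each other closely, `w` comes out close to (1, 1)/sqrt(2) up to sign, and `err` is small compared with the total variance.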
55
2. Incremental update
Given number of hidden variables k
Assuming k is known
We know how to update the slope
For each new point x and for i = 1, …, k:
$y_i := w_i^T x$ (proj. onto $w_i$)
$d_i \leftarrow d_i + y_i^2$ (energy ∝ i-th eigenval.)
$e_i := x - y_i w_i$ (error)
$w_i \leftarrow w_i + (1/d_i)\, y_i e_i$ (update estimate)
$x \leftarrow x - y_i w_i$ (repeat with remainder)
[Figure: point x, its projection y_1 onto w_1, the error e_1, and w_1 updated]
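The per-point update on this slide can be sketched in a few lines of Python. This is an illustrative transcription of the five steps, not the authors' code. The function name, the array layout (one row of `W` per hidden variable), the small positive initial energies, and the forgetting factor `lam` (set to 1.0 so it has no effect by default) are assumptions.

```python
import numpy as np

def spirit_update(W, d, x, lam=1.0):
    """One incremental update for a new point x (length n).

    W   : (k, n) array, row i is the current estimate of w_i (updated in place)
    d   : (k,) array of accumulated energies d_i (updated in place)
    lam : assumed forgetting factor; 1.0 reproduces the slide's rule
    Returns the k hidden-variable values y_1..y_k for this point.
    """
    x = np.asarray(x, dtype=float).copy()
    k = W.shape[0]
    y = np.zeros(k)
    for i in range(k):
        y[i] = W[i] @ x                      # projection onto w_i
        d[i] = lam * d[i] + y[i] ** 2        # accumulate energy
        e = x - y[i] * W[i]                  # projection error
        W[i] = W[i] + (y[i] / d[i]) * e      # move w_i toward the data
        x = x - y[i] * W[i]                  # repeat with the remainder
    return y
```

A typical call initializes `W = np.eye(k, n)` and `d = np.full(k, 1e-3)` and then feeds one n-dimensional measurement vector per time tick. The work per point is O(nk), and no earlier points are stored.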
56
Stream correlations
Step 1: How to capture correlations?
Step 2: How to do it incrementally, when we have a very large number of points?
Step 3: How to dynamically adjust k, the number of hidden variables?
57
Answer
When the reconstruction accuracy is too low (say, <95%), then introduce another hidden variable (k++)
[How to initialize its values: tricky]
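One way to read the rule on this slide is as an energy test: compare the energy retained by the k hidden variables with the total energy of the measurements. The sketch below assumes that reading. The function name, the energy-ratio definition of "reconstruction accuracy", and the cap at the number of streams are assumptions, and the "tricky" initialization of the new hidden variable is deliberately left out.

```python
def adjust_k(k, energy_total, energy_kept, n_streams, min_ratio=0.95):
    """Return k + 1 when the retained energy drops below min_ratio, else k.

    energy_total : running sum of squared measurement values (all streams)
    energy_kept  : running sum of squared hidden-variable values
    """
    too_low = energy_total > 0 and energy_kept < min_ratio * energy_total
    if too_low and k < n_streams:
        return k + 1
    return k
```

For example, with 100 units of total energy and only 90 retained, `adjust_k(2, 100.0, 90.0, 10)` asks for a third hidden variable.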
58
Missing values
[Figure: Temperature T_1 vs. Temperature T_2, both axes from 20 °C to 30 °C; true values (pair); all possible value pairs (given only t_1); best guess (given correlations: intersection)]
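The "best guess" in the figure is the point where the set of possible value pairs meets the correlation line. A minimal sketch of that idea for a single hidden variable follows. The function name, the boolean mask interface, and the assumption that the line passes through the origin (that is, the data are already centered) are illustrative choices, not details from the slides.

```python
import numpy as np

def fill_missing(w, x, known):
    """Estimate missing entries of x from the line direction w.

    w     : direction of the correlation line (length n)
    x     : measurement vector; entries where known is False are ignored
    known : boolean mask, True where x was actually observed
    """
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    known = np.asarray(known, dtype=bool)
    # Hidden variable that best explains the observed coordinates (least squares).
    y = (w[known] @ x[known]) / (w[known] @ w[known])
    x_hat = x.copy()
    x_hat[~known] = y * w[~known]            # read the missing values off the line
    return x_hat
```

With `w` proportional to (1, 2) and only the first coordinate observed as 3, the estimate for the second coordinate is 6.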
59
Forecasting
Assume we want to forecast the next value for a particular stream (e.g. auto-regression)
[Figure: n streams, next value marked “?”]
60
Forecasting
Option 1: One complex model per stream
– Next value = function of previous values on all streams
– Captures correlations
– Too costly! [ ~ O(n^3) ]
[Figure: n streams]
61
Forecasting
Option 1: One complex model per stream
Option 2: One simple model per stream
– Next value = function of previous value on same stream
– Worse accuracy, but maybe acceptable
– But, still need n models
[Figure: n streams]
62
Forecasting
[Figure: n streams and hidden variables]
k hidden vars
k << n and already capture correlations
Only k simple models
Efficiency & robustness
63
Time/space requirements
Incremental PCA
O(nk) space (total) and time (per tuple), i.e.,
– Independent of # points
– Linear w.r.t. # streams (n)
– Linear w.r.t. # hidden variables (k)
In fact,
Can be done in real time
64
Overview
Part 2
Method
Experiments
Conclusions & Other work
65
Experiments
Chlorine concentration [CMU Civil Engineering]
166 streams
2 hidden variables (~4% error)
[Figure: Measurements and Reconstruction]
66
Experiments
Chlorine concentration [CMU Civil Engineering]
[Figure: hidden variables]
Both capture global, periodic pattern
Second: ~ first, but phase-shifted
Can express any phase-shift…
67
Experiments
Light measurements
54 sensors
2-4 hidden variables (~6% error)
[Figure: measurement and reconstruction]
68
Experiments
Light measurements
1 & 2: main trend (as before)
3 & 4: potential anomalies and outliers
[Figure: hidden variables; label “intermittent”]
69
Conclusions
SPIRIT: Discovers hidden variables for
– Summarization of main trends for users
– Efficient forecasting, spotting outliers/anomalies
Incremental, real time computation
nimble: With limited memory
automatic: No special parameters to tune
70
Outline
Problem and motivation
Single-sequence mining: AWSOM
Co-evolving sequences: SPIRIT
Lag correlations: BRAID
Conclusions
71
Part 3: BRAID: Discovering Lag Correlations in Multiple Streams
Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos
SIGMOD’05
72
Lag Correlations
Examples
– A decrease in interest rates typically precedes an increase in house sales by a few months
– Higher amounts of fluoride in the drinking water lead to fewer dental cavities, some years later
73
Lag Correlations
Example of lag-correlated sequences
These sequences are correlated with lag l=1300 time-ticks
[Figure: CCF (Cross-Correlation Function)]
74
Lag Correlations
Example of lag-correlated sequences
[Figure: CCF (Cross-Correlation Function)]
how to compute it quickly, cheaply, incrementally
75
Challenging Problems
Problem definitions
– Given two co-evolving sequences X and Y, determine
  whether there is a lag correlation
  if yes, what the lag length l is
– Given k numerical sequences X_1, …, X_k, report
  which pairs have a lag correlation
  the corresponding lag for each pair
76
Our solution
Ideal characteristics:
– ‘Any-time’ processing, and fast
  Computation time per time tick is constant
– Nimble
  Memory space requirement is sub-linear in the sequence length
– Accurate
  Approximation introduces small error
77
Related Work
Sequence indexing
– Agrawal et al. (FODO 1993)
– Faloutsos et al. (SIGMOD 1994)
– Keogh et al. (SIGMOD 2001)
Compression (wavelet and random projections)
– Gilbert et al. (VLDB 2001), Guha et al. (VLDB 2004)
– Dobra et al. (SIGMOD 2002), Ganguly et al. (SIGMOD 2003)
Data Stream Management
– Abadi et al. (VLDB Journal 2003)
– Motwani et al. (CIDR 2003)
– Chandrasekaran et al. (CIDR 2003)
– Cranor et al. (SIGMOD 2003)
78
Related Work
Pattern discovery
– Clustering for data streams
  Guha et al. (TKDE 2003)
– Monitoring multiple streams
  Zhu et al. (VLDB 2002)
– Forecasting
  Yi et al. (ICDE 2000)
  Papadimitriou et al. (VLDB 2003)
None of the previously published methods focuses on this problem
79
Overview
Introduction / Related work
Background
Main ideas
Theoretical analysis
Experimental results
80
Main Idea (1)
Incremental computation
– Sufficient statistics (formulas not preserved in this transcript)
  Sum of X
  Square sum of X
  Inner-product for X and the shifted Y
– Compute R(l) incrementally:
  Covariance of X and Y
  Variance of X
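The slide names the sufficient statistics but their formulas are not in the transcript. The sketch below shows one standard way to maintain such statistics for a single fixed lag and to derive the correlation coefficient R(l) from them. The class name, the choice to correlate x at time t with y at time t - l, and the use of a small buffer of recent y values are assumptions for illustration, and the windowed smoothing that BRAID adds on top is not shown.

```python
from collections import deque
import math

class LagCorrelation:
    """Running correlation between x_t and y_{t-lag} from sufficient statistics."""

    def __init__(self, lag):
        self.lag = lag
        self.y_buf = deque(maxlen=lag + 1)   # last lag+1 values of y
        self.n = 0
        self.sx = self.sy = 0.0              # sums
        self.sxx = self.syy = 0.0            # square sums
        self.sxy = 0.0                       # inner product of x and shifted y

    def update(self, x, y):
        """Absorb the values of both sequences at the current time tick."""
        self.y_buf.append(y)
        if len(self.y_buf) <= self.lag:
            return                           # not enough history for this lag yet
        y_lag = self.y_buf[0]                # y at time t - lag
        self.n += 1
        self.sx += x
        self.sy += y_lag
        self.sxx += x * x
        self.syy += y_lag * y_lag
        self.sxy += x * y_lag

    def value(self):
        """Correlation coefficient R(lag), or 0.0 if it is undefined."""
        if self.n == 0:
            return 0.0
        cov = self.sxy - self.sx * self.sy / self.n
        var_x = self.sxx - self.sx ** 2 / self.n
        var_y = self.syy - self.sy ** 2 / self.n
        if var_x <= 0 or var_y <= 0:
            return 0.0
        return cov / math.sqrt(var_x * var_y)
```

Each tick costs O(1) per tracked lag, but the buffer here grows linearly with the lag. That is the cost the smoothing and geometric probing ideas on the following slides aim to avoid.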
81
Main Idea (2)
Sequence smoothing
[Figure labels: Lag, Correlation, Time, t=n]
82
Main Idea (2)
Sequence smoothing
– Means of windows for each level
– Sufficient statistics computed from the means
– CCF computed from the sufficient statistics
– But, it allows a partial redundancy
[Figure labels: Lag, Correlation, Level, h=0, Time, t=n]
83
Main Idea (3)
Geometric lag probing
[Figure labels: Lag, Correlation, Level, h=0, Time, t=n]
84
Main Idea (3)
Geometric lag probing
– Use colored windows
– Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …}
– Use a cubic spline to interpolate
[Figure labels: Lag, Correlation, Level, h=0, Time, t=n]
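To illustrate why geometric lag probing keeps the cost logarithmic, the sketch below generates the probed lags and fills in the lags in between. Both function names are hypothetical. The interpolation uses NumPy's piecewise-linear `np.interp` as a plainly simpler stand-in for the cubic spline mentioned on the slide, so the interpolated values will differ from what a spline would give.

```python
import numpy as np

def probe_lags(max_lag):
    """Lags 0, 1, 2, 4, 8, ... up to max_lag: O(log max_lag) values."""
    lags = [0]
    lag = 1
    while lag <= max_lag:
        lags.append(lag)
        lag *= 2
    return lags

def interpolate_ccf(lags, values, max_lag):
    """Estimate the CCF at every lag 0..max_lag from the probed lags.

    Linear interpolation stands in for the cubic spline; lags beyond the
    last probed lag take the last probed value.
    """
    all_lags = np.arange(max_lag + 1)
    return np.interp(all_lags, lags, values)
```

For a maximum lag of 1024 only 12 lags are tracked instead of 1025.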
85
Overview
Introduction / Related work
Background
Main ideas
Theoretical analysis
Experimental results
86
Experimental results
Setup
– Intel Xeon 2.8GHz, 1GB memory, Linux
– Datasets: Sines, SpikeTrains, Humidity, Light, Temperature, Kursk, Sunspots
– Enhanced BRAID, b=16
Evaluation
– Estimation error of lag correlations
– Computation time
87
Detecting Lag Correlations (2)
SpikeTrains
[Figure: CCF (Cross-Correlation Function)]
BRAID closely estimates the correlation coefficients
88
Detecting Lag Correlations (3)
Humidity
[Figure: CCF (Cross-Correlation Function)]
BRAID closely estimates the correlation coefficients
89
Detecting Lag Correlations (4)
Light
[Figure: CCF (Cross-Correlation Function)]
BRAID closely estimates the correlation coefficients
90
Detecting Lag Correlations (5)
Kursk
[Figure: CCF (Cross-Correlation Function)]
BRAID closely estimates the correlation coefficients
91
Estimation Error
Largest relative error is about 1%
Datasets      Lag correlation (Naive)   Lag correlation (BRAID)   Estimation error (%)
Sines         716                       716                       0.000
SpikeTrains   2841                      2830                      0.387
Humidity      3842                      3855                      0.338
Light         567                       570                       0.529
Kursk         1463                      1472                      0.615
Sunspots      1156                      1168                      1.038
92
Performance
Almost linear w.r.t. sequence length
Up to 40,000 times faster
93
Group Lag Correlations
Two correlated pairs from 55 Temperature sequences
Each sensor is located in a different place
[Figure: sequences #16, #19, #47, #48; estimation of CCF of #16 and #19; estimation of CCF of #47 and #48]
94
Conclusions
Automatic lag correlation detection on stream data
incremental
– online, ‘any-time’
nimble
– O(log n) space, O(1) time to update the statistics
– Up to 40,000 times faster than the naive implementation
Accurate
– Detecting the correct lag within 1% relative error or less
95
Overall Conclusions
Mining streaming numerical data: challenging!
Extensions: streaming matrix data (e.g., network traffic matrix)
[Figure axes: IP-source, IP-destination, time]
96
Thank you
christos cs.cmu.edu
www.cs.cmu.edu/~christos
[InteMon demo]