Carnegie Mellon, DB/IR '06, C. Faloutsos. Data Mining on Streams. Christos Faloutsos, CMU.

1 Data Mining on Streams. Christos Faloutsos, CMU

2 THANK YOU! Prof. Panos Ipeirotis; Julia Mills

3 Joint work with: Spiros Papadimitriou (CMU->IBM); Jimeng Sun (CMU/CS); Anthony Brockwell (CMU/Stat); Jeanne Vanbriesen (CMU/CivEng); Greg Ganger (CMU/ECE)

4 Outline: Problem and motivation; Single-sequence mining: AWSOM; Co-evolving sequences: SPIRIT; Lag correlations: BRAID; Conclusions

5 Problem definition - example: Each sensor collects data (x_1, x_2, …, x_t, …)

6 Problem definition. Given: one or more sequences x_1, x_2, …, x_t, … (y_1, y_2, …, y_t, …; …). Find: –patterns; correlations; outliers –incrementally!

7 Limitations / Challenges. Find patterns using a method that is: nimble: limited resources –Memory –Bandwidth, power, CPU; incremental: on-line, ‘any-time’ response –single pass (‘you get to see it only once’); automatic: no human intervention –e.g., in remote environments

8 Application domains. Sensor devices: –Temperature, weather measurements –Road traffic data –Geological observations –Patient physiological data. Embedded devices: –Network routers –Intelligent (active) disks

9 Motivation - Applications (cont’d). ‘Smart house’: –sensors monitor temperature, humidity, air quality; video surveillance

10 Motivation - Applications (cont’d). Civil/automobile infrastructure: –bridge vibrations [Oppenheim+02] –road conditions / traffic monitoring

11 Motivation - Applications (cont’d). Weather, environment/anti-pollution: –volcano monitoring –air/water pollutant monitoring

12 Motivation - Applications (cont’d). Computer systems: –‘Active Disks’ (buffering, prefetching) –web servers (ditto) –network traffic monitoring –...

13 InteMon, w/ Evan Hoke, Jimeng Sun: self-* PetaByte data center at CMU

14 Outline: Problem and motivation; Single-sequence mining: AWSOM; Co-evolving sequences: SPIRIT; Lag correlations: BRAID; Conclusions

15 Single sequence mining - AWSOM, with Spiros Papadimitriou (CMU -> IBM), Anthony Brockwell (CMU/Stat)

16 Problem definition. Semi-infinite streams of values (time series) x_1, x_2, …, x_t, … Find patterns, forecasts, outliers… [Figure annotations: Periodicity? (daily); Periodicity? (twice daily); “Noise”??]

17 Requirements / Goals. Adapt and handle arbitrary periodic components, and: nimble (limited resources, single pass); on-line, any-time; automatic (no human intervention/tuning)

18 Overview: Introduction / Related work; Background; Main idea; Experimental results

19 Wavelets. Example – Haar transform. [Figure: signal x_t over time t, decomposed on a time/frequency grid into coefficients W_{1,1}, W_{1,2}, W_{1,3}, W_{1,4}; W_{2,1}, W_{2,2}; W_{3,1}; and V_{4,1} (“constant”)]

20 Wavelets: why we like them. Wavelets compress many real signals well: –Image compression and processing –Vision –Astronomy, seismology, … Wavelet coefficients can be updated as new points arrive
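
The last point on the slide above, that wavelet coefficients can be updated as new points arrive, can be illustrated with a minimal Python sketch. The class name `IncrementalHaar`, the orthonormal 1/sqrt(2) scaling, and the sign convention (left value minus right value) are assumptions made for illustration; they are not taken from the slides.

```python
import math

class IncrementalHaar:
    """Streaming Haar transform (illustrative sketch).

    Keeps at most one unpaired smooth value per level, so the working
    state grows logarithmically with the number of points seen.
    """

    def __init__(self):
        self.pending = []  # pending[l]: unpaired smooth value at level l, or None
        self.coeffs = []   # coeffs[l]: detail coefficients emitted at level l

    def push(self, value):
        level = 0
        while True:
            if level == len(self.pending):
                self.pending.append(None)
                self.coeffs.append([])
            if self.pending[level] is None:
                # No partner yet: park the value and wait for the next one.
                self.pending[level] = value
                return
            left = self.pending[level]
            self.pending[level] = None
            # Detail (difference) stays at this level ...
            self.coeffs[level].append((left - value) / math.sqrt(2))
            # ... and the smooth part (scaled sum) is carried one level up.
            value = (left + value) / math.sqrt(2)
            level += 1

haar = IncrementalHaar()
signal = [1, 3, 5, 7, 2, 4, 6, 8]
for point in signal:
    haar.push(point)
print([len(level) for level in haar.coeffs])
```

With eight points the sketch emits four, two, and one detail coefficients at the three levels, plus one overall smooth value left in `pending`; the transform preserves the signal's energy (sum of squares).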

21 Overview: Introduction / Related work; Background; Main idea; Experimental results

22 AWSOM. [Figure: x_t over time t = its wavelet decomposition W_{1,1}, W_{1,2}, W_{1,3}, W_{1,4}, W_{2,1}, W_{2,2}, W_{3,1}, V_{4,1} on time/frequency axes]

23 AWSOM. [Figure: x_t over time t and its wavelet decomposition W_{1,1}, W_{1,2}, W_{1,3}, W_{1,4}, W_{2,1}, W_{2,2}, W_{3,1}, V_{4,1} on time/frequency axes]

24 AWSOM - idea. Model each wavelet coefficient from earlier coefficients at the same level: W_{l,t} ≈ β_{l,1} W_{l,t-1} + β_{l,2} W_{l,t-2} + …; likewise at another level l’: W_{l’,t’} ≈ β_{l’,1} W_{l’,t’-1} + β_{l’,2} W_{l’,t’-2} + …

25 More details… Update of wavelet coefficients (incremental); Update of linear models (incremental; RLS); Feature selection (single-pass): –Not all correlations are significant –Throw away the insignificant ones (“noise”)
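
The per-level linear models and their incremental update via RLS (recursive least squares) can be sketched as follows. This is a generic textbook RLS without a forgetting factor, not the authors' implementation; the class name `RecursiveLeastSquares`, the initial value `delta`, and the test sequence (a pure sinusoid standing in for the coefficients of one wavelet level) are assumptions for illustration.

```python
import math

class RecursiveLeastSquares:
    """Textbook RLS for y ~ beta . x with k coefficients (illustrative sketch)."""

    def __init__(self, k, delta=1e3):
        self.k = k
        self.beta = [0.0] * k
        # P tracks the inverse of the regressor covariance; delta is an assumed start value.
        self.P = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]

    def update(self, x, y):
        k = self.k
        Px = [sum(self.P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(k))
        gain = [v / denom for v in Px]
        err = y - sum(self.beta[i] * x[i] for i in range(k))
        self.beta = [self.beta[i] + gain[i] * err for i in range(k)]
        # P stays symmetric, so x^T P equals (P x)^T; cost per update is O(k^2).
        self.P = [[self.P[i][j] - gain[i] * Px[j] for j in range(k)] for i in range(k)]
        return err

# Stand-in for the coefficients of one level: a sinusoid with period 8, which
# satisfies w[t] = 2*cos(theta)*w[t-1] - w[t-2] exactly.
theta = 2 * math.pi / 8
w = [math.sin(theta * t) for t in range(200)]

rls = RecursiveLeastSquares(k=2)
for t in range(2, len(w)):
    rls.update([w[t - 1], w[t - 2]], w[t])
print(rls.beta)
```

Each update touches only the k-by-k matrix `P` and the current regressors, so no old data is revisited; the fitted coefficients approach 2*cos(theta) and -1 for this sequence.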

26 Complexity. Model update: Space: O(lg N + m k^2) ≈ O(lg N); Time: O(k^2) ≈ O(1). Where: –N: number of points (so far) –k: number of regression coefficients; fixed –m: number of linear models; O(lg N)

27 Overview: Introduction / Related work; Background; Main idea; Experimental results

28 Results - Synthetic data. Triangle pulse; Mix (sine + square). AR captures wrong trend (or none). Seasonal AR estimation fails. [Figure panels: AWSOM, AR, Seasonal AR]

29 Results - Real data. Automobile traffic: –Daily periodicity –Bursty “noise” at smaller scales. AR fails to capture any trend. Seasonal AR estimation fails

30 Results - Real data. Sunspot intensity: –Slightly time-varying “period”. AR captures wrong trend. Seasonal ARIMA: –wrong downward trend, despite help by human!

31 Conclusions. Adapt and handle arbitrary periodic components, and: nimble: limited memory (logarithmic), constant-time update; on-line, any-time: single pass over the data; automatic: no human intervention/tuning

32 Outline: Problem and motivation; Single-sequence mining: AWSOM; Co-evolving sequences: SPIRIT; Lag correlations: BRAID; Conclusions

33 Part 2. SPIRIT: Mining co-evolving streams [Papadimitriou, Sun, Faloutsos, VLDB05]

34 Motivation. E.g., chlorine concentration in a water distribution network

35 Motivation. Water distribution network, normal operation. May have hundreds of measurements, but it is unlikely they are completely unrelated! [Figure: chlorine concentrations over Phase 1, Phase 2, Phase 3]

36 Motivation. Water distribution network: normal operation, then major leak. [Figure: chlorine concentrations over Phase 1, Phase 2, Phase 3; sensors near leak vs. sensors away from leak]

37 Motivation. Water distribution network: normal operation, then major leak. [Figure: chlorine concentrations over Phase 1, Phase 2, Phase 3; sensors near leak vs. sensors away from leak]

38 Motivation. We would like to discover a few “hidden (latent) variables” that summarize the key trends. [Figure: actual measurements (n streams) of chlorine concentrations vs. k hidden variable(s); Phase 1; k = 1]

39 Motivation. We would like to discover a few “hidden (latent) variables” that summarize the key trends. [Figure: actual measurements (n streams) of chlorine concentrations vs. k hidden variable(s); Phase 1, Phase 2; k = 2]

40 Motivation. We would like to discover a few “hidden (latent) variables” that summarize the key trends. [Figure: actual measurements (n streams) of chlorine concentrations vs. k hidden variable(s); Phase 1, Phase 2, Phase 3; k = 1]

41 Goals. Discover “hidden” (latent) variables for: –Summarization of main trends for users –Efficient forecasting, spotting outliers/anomalies; and the usual: nimble: limited memory requirements; on-line, any-time (single pass etc.); automatic: no special parameters to tune

42 Related work: stream mining. Stream SVD [Guha, Gunopulos, Koudas / KDD03]; StatStream [Zhu, Shasha / VLDB02]; Clustering [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04]; Classification [Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]

43 Related work: stream mining. Piecewise approximations [Palpanas, Vlachos, Keogh, et al / ICDE 2004]; Queries on streams [Dobra, Garofalakis, Gehrke, et al / SIGMOD02], [Madden, Franklin, Hellerstein, et al / OSDI02], [Considine, Li, Kollios, et al / ICDE04], [Hammad, Aref, Elmagarmid / SSDBM03] …

44 Overview, Part 2: Method; Experiments; Conclusions & Other work

45 Stream correlations. Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust the number of hidden variables?

46 1. How to capture correlations? [Figure: first sensor, temperature t_1 (20°C to 30°C) over time]

47 1. How to capture correlations? [Figure: first sensor and second sensor; temperature t_2 (20°C to 30°C) over time]

48 1. How to capture correlations? Correlations: let’s take a closer look at the first three value-pairs… [Figure: temperature t_1 vs. temperature t_2, both axes 20°C to 30°C]

49 1. How to capture correlations? First three lie (almost) on a line in the space of value-pairs… → O(n) numbers for the slope, and → one number for each value-pair (offset on line); offset = “hidden variable”. [Figure: temperature t_1 vs. temperature t_2, both axes 20°C to 30°C; points at time=1, time=2, time=3]

50 1. How to capture correlations? Other pairs also follow the same pattern: they lie (approximately) on this line. [Figure: temperature t_1 vs. temperature t_2, both axes 20°C to 30°C]

51 Stream correlations. Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust the number of hidden variables?

52 Incremental updates. [Figure: temperature T_1 vs. temperature T_2, both axes 20°C to 30°C, with “error” marked]

53 Incremental updates. Algorithm runs in O(n), where n = # of streams; no need to access old data. [Figure: temperature T_1 axis 20°C to 30°C, with “error” marked]

54 Stream correlations. Principal Component Analysis (PCA): the “line” is the first principal component (PC). This line is optimal: it minimizes the sum of squared projection errors

55 2. Incremental update. Given the number of hidden variables k (assuming k is known), we know how to update the slope. For each new point x and for i = 1, …, k:
y_i := w_i^T x (projection onto w_i)
d_i ← d_i + y_i^2 (energy ∝ i-th eigenvalue)
e_i := x – y_i w_i (error)
w_i ← w_i + (1/d_i) y_i e_i (update estimate)
x ← x – y_i w_i (repeat with remainder)
[Figure: x, its projection y_1 onto w_1, the error e_1, and w_1 updated]
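
The update steps on the slide above can be sketched in plain Python. The function name `spirit_update`, the `forgetting` parameter (1.0 reproduces the slide's update as transcribed), the starting direction, and the small initial energy are assumptions for illustration; this is a sketch of the listed steps, not the authors' code.

```python
import math

def spirit_update(x, W, d, forgetting=1.0):
    """Apply the slide's five steps for one new point x.

    x: list of n stream values; W: list of k direction vectors (updated in place);
    d: list of k energies (updated in place). Returns the k hidden variables.
    """
    x = list(x)
    hidden = []
    for i in range(len(W)):
        w = W[i]
        y = sum(wj * xj for wj, xj in zip(w, x))              # projection onto w_i
        d[i] = forgetting * d[i] + y * y                      # accumulated energy
        e = [xj - y * wj for xj, wj in zip(x, w)]             # projection error
        w = [wj + (y / d[i]) * ej for wj, ej in zip(w, e)]    # update estimate
        W[i] = w
        x = [xj - y * wj for xj, wj in zip(x, w)]             # remainder for next i
        hidden.append(y)
    return hidden

# Two perfectly correlated streams: every point lies on the line through (0.6, 0.8).
W = [[1.0, 0.0]]   # assumed starting direction
d = [1e-3]         # assumed small initial energy
for t in range(500):
    a = 1.0 + 0.5 * math.sin(0.3 * t)   # the single hidden variable
    spirit_update([0.6 * a, 0.8 * a], W, d)
print(W[0])
```

Each point costs O(nk) work and only `W` and `d` are kept, matching the claim that old data never needs to be accessed. On this toy input the direction estimate settles near (0.6, 0.8).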

56 Stream correlations. Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust k, the number of hidden variables?

57 Answer: when the reconstruction accuracy is too low (say, <95%), introduce another hidden variable (k++). [How to initialize its values: tricky]
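
A hedged sketch of the rule on the slide above. It assumes that "reconstruction accuracy" is measured as the fraction of the input energy (sum of squared values) retained by the hidden variables; the function name and its arguments are hypothetical, and the initialization of the new hidden variable, which the slide calls tricky, is left out.

```python
def maybe_add_hidden_variable(k, n, input_energy, retained_energy, threshold=0.95):
    """Return the new number of hidden variables.

    k: current number of hidden variables; n: number of streams;
    input_energy: running sum of squared input values;
    retained_energy: running sum of squared hidden-variable values.
    """
    too_inaccurate = input_energy > 0 and retained_energy < threshold * input_energy
    if too_inaccurate and k < n:
        return k + 1   # k++ as on the slide
    return k

print(maybe_add_hidden_variable(k=1, n=3, input_energy=100.0, retained_energy=90.0))
```

The check itself is constant-time per point as long as the two energies are maintained as running sums.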

58 Missing values. [Figure: temperature T_1 vs. temperature T_2, both axes 20°C to 30°C; true values (pair); all possible value pairs (given only t_1); best guess (given correlations: intersection)]
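
The "best guess" in the figure above, the intersection of the known-value constraint with the correlation line, can be sketched as follows. The sketch assumes a single hidden variable (k = 1) and a line through the origin (for example, centered data); `fill_missing` is a hypothetical helper name, and missing values are marked with `None`.

```python
def fill_missing(x, w):
    """Fill None entries of x using the direction w of the correlation line.

    The hidden variable is estimated by least squares from the observed
    coordinates only, then projected back onto the missing coordinates.
    """
    observed = [j for j, value in enumerate(x) if value is not None]
    denom = sum(w[j] ** 2 for j in observed)
    if denom == 0:
        raise ValueError("observed coordinates carry no information about the line")
    y = sum(w[j] * x[j] for j in observed) / denom   # hidden variable
    return [value if value is not None else y * w[j] for j, value in enumerate(x)]

filled = fill_missing([3.0, None], [0.6, 0.8])
print(filled)
```

With only the first coordinate known, the estimate for the second is the point on the line whose first coordinate matches the observation.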

59 Forecasting. Assume we want to forecast the next value for a particular stream (e.g., auto-regression). [Figure: n streams, next value “?”]

60 Forecasting. Option 1: one complex model per stream: –Next value = function of previous values on all streams –Captures correlations –Too costly! [~O(n^3)]. [Figure: n streams]

61 Forecasting. Option 1: one complex model per stream. Option 2: one simple model per stream: –Next value = function of previous value on same stream –Worse accuracy, but maybe acceptable –But, still need n models. [Figure: n streams]

62 Forecasting. Hidden variables: k hidden vars, k << n, and they already capture correlations. Only k simple models: efficiency & robustness. [Figure: n streams, k hidden vars]

63 Time/space requirements. Incremental PCA: O(nk) space (total) and time (per tuple), i.e.: independent of # points; linear w.r.t. # streams (n); linear w.r.t. # hidden variables (k). In fact, can be done in real time

64 Overview, Part 2: Method; Experiments; Conclusions & Other work

65 Experiments. Chlorine concentration: 166 streams; 2 hidden variables (~4% error). [Figure: measurements vs. reconstruction] [CMU Civil Engineering]

66 Experiments. Chlorine concentration, hidden variables: both capture global, periodic pattern; second: ~ first, but phase-shifted; can express any phase-shift… [CMU Civil Engineering]

67 Experiments. Light measurements: 54 sensors; 2-4 hidden variables (~6% error). [Figure: measurement vs. reconstruction]

68 Experiments. Light measurements: 1 & 2: main trend (as before); 3 & 4: potential anomalies and outliers. [Figure labels: hidden variables; intermittent]

69 Conclusions. SPIRIT discovers hidden variables for: –Summarization of main trends for users –Efficient forecasting, spotting outliers/anomalies; incremental, real-time computation; nimble: with limited memory; automatic: no special parameters to tune

70 Outline: Problem and motivation; Single-sequence mining: AWSOM; Co-evolving sequences: SPIRIT; Lag correlations: BRAID; Conclusions

71 Part 3: BRAID: Discovering Lag Correlations in Multiple Streams. Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos. SIGMOD’05

72 Lag Correlations. Examples: –A decrease in interest rates typically precedes an increase in house sales by a few months –Higher amounts of fluoride in the drinking water lead to fewer dental cavities, some years later

73 Lag Correlations. Example of lag-correlated sequences: these sequences are correlated with lag l=1300 time-ticks. [Figure: CCF (Cross-Correlation Function)]

74 Lag Correlations. Example of lag-correlated sequences. CCF (Cross-Correlation Function): how to compute it quickly, cheaply, incrementally

75 Challenging Problems. Problem definitions: –For two given co-evolving sequences X and Y, determine: whether there is a lag correlation; if yes, what is the lag length l –For k given numerical sequences, X_1, …, X_k, report: which pairs have a lag correlation; the corresponding lag for each pair

76 Our solution. Ideal characteristics: –‘Any-time’ processing, and fast: computation time per time tick is constant –Nimble: memory space requirement is sub-linear in the sequence length –Accurate: approximation introduces small error

77 Related Work. Sequence indexing: –Agrawal et al. (FODO 1993) –Faloutsos et al. (SIGMOD 1994) –Keogh et al. (SIGMOD 2001). Compression (wavelet and random projections): –Gilbert et al. (VLDB 2001), Guha et al. (VLDB 2004) –Dobra et al. (SIGMOD 2002), Ganguly et al. (SIGMOD 2003). Data Stream Management: –Abadi et al. (VLDB Journal 2003) –Motwani et al. (CIDR 2003) –Chandrasekaran et al. (CIDR 2003) –Cranor et al. (SIGMOD 2003)

78 Related Work. Pattern discovery: –Clustering for data streams: Guha et al. (TKDE 2003) –Monitoring multiple streams: Zhu et al. (VLDB 2002) –Forecasting: Yi et al. (ICDE 2000), Papadimitriou et al. (VLDB 2003). None of the previously published methods focuses on this problem

79 Overview: Introduction / Related work; Background; Main ideas; Theoretical analysis; Experimental results

80 Main Idea (1): Incremental computation. –Sufficient statistics: sum of X; square sum of X; inner product of X and the shifted Y –Compute R(l) incrementally, from the covariance of X and Y and the variance of X
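
A minimal sketch of how the sufficient statistics named above can yield the correlation at one fixed lag without revisiting old data. The formulas are the standard Pearson-correlation identities, not copied from the slide (whose formulas did not survive in the transcript); the class name `LagCorrelation` and the pairing convention (x at time t with y at time t - lag) are assumptions. Unlike BRAID, this sketch buffers the last `lag` values of Y, so its memory grows with the lag.

```python
import math
from collections import deque

class LagCorrelation:
    """Running correlation between x[t] and y[t - lag] (illustrative sketch)."""

    def __init__(self, lag):
        self.lag = lag
        self.recent_y = deque(maxlen=lag + 1)   # holds y[t - lag] .. y[t]
        self.m = 0                              # number of aligned pairs so far
        self.sum_x = self.sum_y = 0.0
        self.sum_xx = self.sum_yy = self.sum_xy = 0.0

    def push(self, x, y):
        self.recent_y.append(y)
        if len(self.recent_y) <= self.lag:
            return                              # no lagged partner available yet
        y_lagged = self.recent_y[0]
        self.m += 1
        self.sum_x += x
        self.sum_y += y_lagged
        self.sum_xx += x * x
        self.sum_yy += y_lagged * y_lagged
        self.sum_xy += x * y_lagged

    def correlation(self):
        if self.m < 2:
            return None
        cov = self.sum_xy - self.sum_x * self.sum_y / self.m
        var_x = self.sum_xx - self.sum_x ** 2 / self.m
        var_y = self.sum_yy - self.sum_y ** 2 / self.m
        if var_x <= 0 or var_y <= 0:
            return None
        return cov / math.sqrt(var_x * var_y)

# X is Y delayed by 5 ticks.
y_values = [math.sin(0.2 * t) for t in range(500)]
x_values = [0.0] * 5 + y_values[:-5]

at_lag_5 = LagCorrelation(5)
at_lag_0 = LagCorrelation(0)
for x, y in zip(x_values, y_values):
    at_lag_5.push(x, y)
    at_lag_0.push(x, y)
print(at_lag_5.correlation(), at_lag_0.correlation())
```

Each new pair of values updates a handful of running sums in constant time; the correlation is highest at the true lag and visibly lower at lag 0.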

81 Main Idea (2): Sequence smoothing. [Figure: correlation vs. lag; time axis up to t=n]

82 Main Idea (2): Sequence smoothing. –Means of windows for each level –Sufficient statistics computed from the means –CCF computed from the sufficient statistics –But, it allows a partial redundancy. [Figure: correlation vs. lag; level h=0; time axis up to t=n]

83 Main Idea (3): Geometric lag probing. [Figure: correlation vs. lag; level h=0; time axis up to t=n]

84 Main Idea (3): Geometric lag probing. –Use colored windows –Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …} –Use a cubic spline to interpolate. [Figure: correlation vs. lag; level h=0; time axis up to t=n]
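
The geometric progression of probed lags can be illustrated with a small batch sketch. It computes the correlation exactly at each probed lag for clarity, rather than incrementally from smoothed windows, and it omits the cubic-spline interpolation between probed lags; the function names are hypothetical.

```python
import math

def geometric_lags(max_lag):
    """Return [0, 1, 2, 4, 8, ...] up to max_lag."""
    lags = [0]
    lag = 1
    while lag <= max_lag:
        lags.append(lag)
        lag *= 2
    return lags

def lag_correlation(x, y, lag):
    """Pearson correlation between x[t] and y[t - lag], computed in batch."""
    xs = x[lag:]
    ys = y[:len(y) - lag]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# X is Y delayed by 8 ticks, so the true lag is one of the probed values.
y_values = [math.sin(0.2 * t) for t in range(400)]
x_values = [0.0] * 8 + y_values[:-8]

lags = geometric_lags(64)
best = max(lags, key=lambda lag: lag_correlation(x_values, y_values, lag))
print(lags, best)
```

Probing only powers of two keeps the number of tracked lags logarithmic in the maximum lag; when the true lag falls between probes, an interpolation step such as the cubic spline named on the slide would be needed to estimate it.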

85 Overview: Introduction / Related work; Background; Main ideas; Theoretical analysis; Experimental results

86 Experimental results. Setup: –Intel Xeon 2.8GHz, 1GB memory, Linux –Datasets: Sines, SpikeTrains, Humidity, Light, Temperature, Kursk, Sunspots –Enhanced BRAID, b=16. Evaluation: –Estimation error of lag correlations –Computation time

87 Detecting Lag Correlations (2): SpikeTrains. [Figure: CCF (Cross-Correlation Function)] BRAID closely estimates the correlation coefficients

88 Detecting Lag Correlations (3): Humidity. [Figure: CCF (Cross-Correlation Function)] BRAID closely estimates the correlation coefficients

89 Detecting Lag Correlations (4): Light. [Figure: CCF (Cross-Correlation Function)] BRAID closely estimates the correlation coefficients

90 Detecting Lag Correlations (5): Kursk. [Figure: CCF (Cross-Correlation Function)] BRAID closely estimates the correlation coefficients

91 Estimation Error. Largest relative error is about 1%.
Datasets      Lag correlation (Naive)   Lag correlation (BRAID)   Estimation error (%)
Sines         716                       716                       0.000
SpikeTrains   2841                      2830                      0.387
Humidity      3842                      3855                      0.338
Light         567                       570                       0.529
Kursk         1463                      1472                      0.615
Sunspots      1156                      1168                      1.038

92 Performance. Almost linear w.r.t. sequence length; up to 40,000 times faster

93 Group Lag Correlations. Two correlated pairs from 55 Temperature sequences; each sensor is located in a different place. [Figure: sequences #16, #19, #47, #48; estimation of CCF of #16 and #19; estimation of CCF of #47 and #48]

94 Conclusions. Automatic lag correlation detection on stream data: incremental: online, ‘any-time’; nimble: –O(log n) space, O(1) time to update the statistics –Up to 40,000 times faster than the naive implementation; accurate: –Detecting the correct lag within 1% relative error or less

95 Overall Conclusions. Mining streaming numerical data: challenging! Extensions: streaming matrix data (e.g., network traffic matrix). [Figure: IP-source × IP-destination × time]

96 Thank you. christos cs.cmu.edu; www.cs.cmu.edu/~christos [InteMon demo]

