Presentation is loading. Please wait.

Presentation is loading. Please wait.

BRAID: Discovering Lag Correlations in Multiple Streams Yasushi Sakurai (NTT Cyber Space Labs) Spiros Papadimitriou (Carnegie Mellon Univ.) Christos Faloutsos.

Similar presentations


Presentation on theme: "BRAID: Discovering Lag Correlations in Multiple Streams Yasushi Sakurai (NTT Cyber Space Labs) Spiros Papadimitriou (Carnegie Mellon Univ.) Christos Faloutsos."— Presentation transcript:

1 BRAID: Discovering Lag Correlations in Multiple Streams Yasushi Sakurai (NTT Cyber Space Labs) Spiros Papadimitriou (Carnegie Mellon Univ.) Christos Faloutsos (Carnegie Mellon Univ.)

2 SIGMOD 2005 Y. Sakurai et al 2 Motivation Data-stream applications  Network analysis  Sensor monitoring  Financial data analysis  Moving object tracking Goal  Monitor multiple numerical streams  Determine which pairs are correlated with lags  Report the value of each such lag (if any)

3 SIGMOD 2005 Y. Sakurai et al 3 Lag Correlations Examples  A decrease in interest rates typically precedes an increase in house sales by a few months  Higher amounts of fluoride in the drinking water leads to fewer dental cavities, some years later  High CPU utilization on server 1 precedes high CPU utilization for server 2 by a few minutes

4 SIGMOD 2005 Y. Sakurai et al 4 Lag Correlations Example of lag-correlated sequences These sequences are correlated with lag l=1300 time-ticks CCF (Cross-Correlation Function)

5 SIGMOD 2005 Y. Sakurai et al 5 Lag Correlations CCF (Cross-Correlation Function) Example of lag-correlated sequences  Fast (high performance)  Nimble (Low memory consumption)  Accurate (good approximation)

6 SIGMOD 2005 Y. Sakurai et al 6 Problem #1: PAIR of sequences For given two co-evolving sequences X and Y, determine  Whether there is a lag correlation  If yes, what is the lag length l Any time, on semi-infinite streams ? yes; l = 1,300 X Y

7 SIGMOD 2005 Y. Sakurai et al 7 Problem #2: k-way For given k numerical sequences, X 1,…,X k, report  Which pairs (if any) have a lag correlation  The corresponding lag for such pairs again, ‘any time’, streaming fashion ? X 1 and X 2 ; l = 1,300... X1X1 X2X2 XkXk

8 SIGMOD 2005 Y. Sakurai et al 8 Our solution, BRAID characteristics:  ‘Any-time’ processing, and fast Computation time per time tick is constant  Nimble Memory space requirement is sub-linear of sequence length  Accurate Approximation introduces small error

9 SIGMOD 2005 Y. Sakurai et al 9 Sequence indexing  Agrawal et al. (FODO 1993)  Faloutsos et al. (SIGMOD 1994)  Keogh et al. (SIGMOD 2001) Compression (wavelet and random projections)  Gilbert et al. (VLDB 2001)  Guha et al. (VLDB 2004)  Dobra et al.(SIGMOD 2002)  Ganguly et al.(SIGMOD 2003) Related Work

10 SIGMOD 2005 Y. Sakurai et al 10 Data Stream Management  Abadi et al. (VLDB Journal 2003)  Motwani et al. (CIDR 2003)  Chandrasekaran et al. (CIDR 2003)  Cranor et al. (SIGMOD 2003) Related Work

11 SIGMOD 2005 Y. Sakurai et al 11 Related Work Pattern discovery  Clustering for data streams Guha et al. (TKDE 2003)  Monitoring multiple streams Zhu et al. (VLDB 2002)  Forecasting Yi et al. (ICDE 2000) Papadimitriou et al. (VLDB 2003) None of previously published methods focuses on the problem

12 SIGMOD 2005 Y. Sakurai et al 12 Overview Introduction / Related work Background Main ideas Theoretical analysis Experimental results

13 SIGMOD 2005 Y. Sakurai et al 13 Background CCF (Cross-Correlation Function) positively correlated un-correlated ++ anti-correlated (lower than -  ) Lag correlation Lag Correlation

14 SIGMOD 2005 Y. Sakurai et al 14 Background Definition of ‘ score ’, the absolute value of R(l) Lag correlation  Given a threshold ,  A local maximum  The earliest such maximum, if more maxima exist details

15 SIGMOD 2005 Y. Sakurai et al 15 Overview Introduction / Related work Background Main ideas Theoretical analysis Experimental results

16 SIGMOD 2005 Y. Sakurai et al 16 Why not ‘ naive ’ ? Naive solution:  Compute correlation coefficient for each lag l = 0, 1, 2, 3, …, n/2 But,  O(n) space  O(n 2 ) time or O(n log n) time w/ FFT t=n Time Lag Correlation n/2 0

17 SIGMOD 2005 Y. Sakurai et al 17 Main Idea (1) Incremental computing:  the correlation coefficient of two sequences is ‘algebraic’ -> can be computed incrementally we need to maintain only 6 ‘sufficient statistics’:  Sequence length n  Sum of X, Square sum of X  Sum of Y, Square sum of Y  Inner-product for X and the shifted Y

18 SIGMOD 2005 Y. Sakurai et al 18 Main Idea (1) Incremental computing: Sequence length n Sum of X : Square sum of X : Inner-product for X and the shifted Y :  Compute R(l) incrementally: Covariance of X and Y: Variance of X: details

19 SIGMOD 2005 Y. Sakurai et al 19 Main Idea (1) Complexity Naive (incremental) BRAID Space O(n)O(n)O(n)O(n) Comp. time O(n log n)O(n)O(n) Better, but not good enough!

20 SIGMOD 2005 Y. Sakurai et al 20 Main Idea (2) Lag Correlation Geometric lag probing

21 SIGMOD 2005 Y. Sakurai et al 21 Main Idea (2) 0 1 2 48 Lag Correlation Geometric lag probing ie., compute the correlation coefficient for lag: l = 0, 1, 2, 4,... 2 h O(log n) estimations

22 SIGMOD 2005 Y. Sakurai et al 22 Main Idea (2) Geometric lag probing But, so far, we still need O(n) space because the longest lag is n/2 Naive (incremental) BRAID Space O(n)O(n)O(n)O(n) Comp. time O(n log n)O(n)O(n)O(log n)

23 SIGMOD 2005 Y. Sakurai et al 23 Main Idea (3) Lag Correlation Sequence smoothing t=n Time Reminder: Naïve:

24 SIGMOD 2005 Y. Sakurai et al 24 Main Idea (3) Lag Correlation Level h=0 t=n Time Sequence smoothing  Means of windows for each level  Sufficient statistics computed from the means  CCF computed from the sufficient statistics  But, it allows a partial redundancy

25 SIGMOD 2005 Y. Sakurai et al 25 Putting it all together: Lag Correlation Level h=0 t=n Time Geometric lag probing + smoothing  Use colored windows  Keep track of only a geometric progression of the lag values: l={0,1,2,4,8,…,2 h,…}

26 SIGMOD 2005 Y. Sakurai et al 26 Putting it all together: Geometric lag probing + smoothing  Use colored windows  Keep track of only a geometric progression of the lag values: l={0,1,2,4,8,…,2 h,…} Lag Correlation Level h=0 t=n Time h=0 Y X l=0

27 SIGMOD 2005 Y. Sakurai et al 27 Putting it all together: Geometric lag probing + smoothing  Use colored windows  Keep track of only a geometric progression of the lag values: l={0,1,2,4,8,…,2 h,…} Lag Correlation Level h=0 t=n Time h=0 Y X l=1

28 SIGMOD 2005 Y. Sakurai et al 28 Putting it all together: Geometric lag probing + smoothing  Use colored windows  Keep track of only a geometric progression of the lag values: l={0,1,2,4,8,…,2 h,…} Lag Correlation Level h=1 t h =n/2 Time h=1 Y X l=2

29 SIGMOD 2005 Y. Sakurai et al 29 Putting it all together: Geometric lag probing + smoothing  Use colored windows  Keep track of only a geometric progression of the lag values: l={0,1,2,4,8,…,2 h,…} Lag Correlation Level h=2 Time h=2 Y X t h =n/4 l=4

30 SIGMOD 2005 Y. Sakurai et al 30 Putting it all together: Geometric lag probing + smoothing  Use colored windows  Keep track of only a geometric progression of the lag values: l={0,1,2,4,8,…,2 h,…} Lag Correlation Level h=3 Time h=3 Y X t h =n/8 l=8

31 SIGMOD 2005 Y. Sakurai et al 31 Putting it all together: Lag Correlation Level h=0 t=n Time Geometric lag probing + smoothing  Use colored windows  Keep track of only a geometric progression of the lag values: l={0,1,2,4,8,…,2 h,…}  Use a cubic spline to interpolate

32 SIGMOD 2005 Y. Sakurai et al 32 Thus: Complexity Naive (incremental) BRAID Space O(n)O(n)O(n)O(n)O(log n) Comp. time O(n log n)O(n)O(n)O(1) * (*) Computation time: O(logn) And actually, amortized time: O(1)

33 SIGMOD 2005 Y. Sakurai et al 33 Overview Introduction / Related work Background Main ideas  enhancing the accuracy Theoretical analysis Experimental results details

34 SIGMOD 2005 Y. Sakurai et al 34 Enhanced Probing Scheme Q: How to probe more densely than 2 h ? Lag Correlation Level h=0 t=n Time

35 SIGMOD 2005 Y. Sakurai et al 35 Enhanced Probing Scheme Q: How to probe more densely than 2 h ? A: probe in a mixture of geometric and arithmetic progressions Lag Correlation Level h=0 t=n Time

36 SIGMOD 2005 Y. Sakurai et al 36 Enhanced Probing Scheme Basic scheme: b=1 (one number for each level) Enhanced scheme: b>1  Example of b=4  Probing the CCF in a mixture of geometric and arithmetic progressions: l={0,1,…,7;8,10,12,14;16,20,24,28;32,40,…} Level h=0 Time t=n Correlation Lag step:1step: 2 step: 4

37 SIGMOD 2005 Y. Sakurai et al 37 Overview Introduction / Related work Background Main ideas Theoretical analysis Experimental results

38 SIGMOD 2005 Y. Sakurai et al 38 Theoretical Analysis - Accuracy Effect of smoothing Effect of geometric lag probing For sequences with low frequencies, smoothing introduces only small error BRAIDS will provide no error, if lag probing satisfies the sampling theorem (Nyquist’s)

39 SIGMOD 2005 Y. Sakurai et al 39 Effect of geometric lag probing  Informally, BRAIDS will provide no error, if lag probing satisfies the sampling theorem (Nyquist’s)  Formally: Theorem 2 f R : the Nyquist frequency of CCF, f R = min( f x, f y ) f x, f y : the Nyquist frequencies of X and Y Theoretical Analysis - Accuracy BRAID will find the lag correlations perfectly, if details

40 SIGMOD 2005 Y. Sakurai et al 40 Theoretical Analysis - Complexity Naive solution  O(n) space  O(n) time per time tick BRAID  O(log n) space  O(1) time for updating sufficient statistics  O(log n) time for interpolating (when output is required) details

41 SIGMOD 2005 Y. Sakurai et al 41 Overview Introduction / Related work Background Main ideas Theoretical analysis Experimental results

42 SIGMOD 2005 Y. Sakurai et al 42 Experimental results Setup  Intel Xeon 2.8GHz, 1GB memory, Linux  Datasets: Synthetic: Sines, SpikeTrains, Real: Humidity, Light, Temperature, Kursk, Sunspots  Enhanced BRAID, b=16

43 SIGMOD 2005 Y. Sakurai et al 43 Experimental results Evaluation  Accuracy for CCF  Accuracy for the lag estimation  Computation time  k -way lag correlations

44 SIGMOD 2005 Y. Sakurai et al 44 Accuracy for CCF (1) Sines CCF (Cross-Correlation Function) BRAID perfectly estimates the correlation coefficients of the sinusoidal wave

45 SIGMOD 2005 Y. Sakurai et al 45 Accuracy for CCF (2) SpikeTrains CCF (Cross-Correlation Function) BRAID closely estimates the correlation coefficients

46 SIGMOD 2005 Y. Sakurai et al 46 Accuracy for CCF (3) Humidity (Real data) CCF (Cross-Correlation Function) BRAID closely estimates the correlation coefficients

47 SIGMOD 2005 Y. Sakurai et al 47 Accuracy for CCF (4) Light (Real data) CCF (Cross-Correlation Function) BRAID closely estimates the correlation coefficients

48 SIGMOD 2005 Y. Sakurai et al 48 Accuracy for CCF (5) Kursk (Real data) CCF (Cross-Correlation Function) BRAID closely estimates the correlation coefficients

49 SIGMOD 2005 Y. Sakurai et al 49 Accuracy for CCF (6) Sunspots (Real data) CCF (Cross-Correlation Function) BRAID closely estimates the correlation coefficients

50 SIGMOD 2005 Y. Sakurai et al 50 Experimental results Evaluation  Accuracy for CCF  Accuracy for the lag estimation  Computation time  k -way lag correlations

51 SIGMOD 2005 Y. Sakurai et al 51 Estimation Error of Lag Correlations Largest relative error is about 1% Datasets Lag correlationEstimation error (%) NaiveBRAID Sines 716 0.000 SpikeTrains 284128300.387 Humidity 384238550.338 Light 5675700.529 Kursk 146314720.615 Sunspots 115611681.038

52 SIGMOD 2005 Y. Sakurai et al 52 Experimental results Evaluation  Accuracy for CCF  Accuracy for the lag estimation  Computation time  k -way lag correlations

53 SIGMOD 2005 Y. Sakurai et al 53 Computation time Reduce computation time dramatically Up to 40,000 times faster

54 SIGMOD 2005 Y. Sakurai et al 54 Experimental results Evaluation  Accuracy for CCF  Accuracy for the lag estimation  Computation time  k -way lag correlations

55 SIGMOD 2005 Y. Sakurai et al 55 Group Lag Correlations 55 Temperature sequences Two correlated pairs Estimation of CCF of #16 and #19 Estimation of CCF of #47 and #48 #16#19#47 #48

56 SIGMOD 2005 Y. Sakurai et al 56 Conclusions Automatic lag correlation detection on data stream 1. ‘Any-time’ 2. Nimble  O(log n) space, O(1) time to update the statistics 3. Fast  Up to 40,000 times faster than the naive implementation 4. Accurate  within 1% relative error or less

57 SIGMOD 2005 Y. Sakurai et al 57 Effect of geometric lag probing  Informally, BRAIDS will provide no error, if lag probing satisfies the sampling theorem (Nyquist’s)  Formally: Theorem 2 f R : the Nyquist frequency of CCF, f R = min( f x, f y ) f x, f y : the Nyquist frequencies of X and Y Theoretical Analysis - Accuracy BRAID will find the lag correlations perfectly, if details

58 SIGMOD 2005 Y. Sakurai et al 58 Effect of Probing Dataset: Sines Lag correlation with b=1 l R =1024

59 SIGMOD 2005 Y. Sakurai et al 59 Effect of Probing Dataset: Light Lag correlation with b=1 l R =630


Download ppt "BRAID: Discovering Lag Correlations in Multiple Streams Yasushi Sakurai (NTT Cyber Space Labs) Spiros Papadimitriou (Carnegie Mellon Univ.) Christos Faloutsos."

Similar presentations


Ads by Google