Published by Leonard McBride. Modified over 9 years ago.
1
BRAID: Stream Mining through Group Lag Correlations
Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos
SIGMOD 2005
2
Outline
- Introduction
- Proposed method
- EXPERIMENTS
- CONCLUSIONS
3
Introduction
- Data streams
- Lag correlations, for example: higher amounts of fluoride in water → fewer dental cavities some years later
- Goal: monitor multiple numerical streams and determine which pairs are correlated with a lag, and the value of that lag
4
Introduction
- Given k numerical sequences X_1, …, X_k, report all pairs X_i and X_j such that X_i follows X_j with lag l
5
Introduction
6
Introduction
- This paper proposes BRAID for handling data streams
- Any-time processing, and fast
- Nimble
- Accurate
- Small resource consumption
7
Proposed method
- Data stream X: {x_1, …, x_t, …, x_n}, where x_n is the most recent value
- R(0): the correlation of X and Y, which have the same length n, at zero lag
- R(0) is measured by the Pearson ρ coefficient
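The formula image from this slide is not preserved in the transcript. As a minimal sketch, the standard Pearson coefficient for two equal-length sequences can be computed as below. The function name `pearson` and the choice to return 0.0 for a constant sequence are assumptions made for illustration, not something taken from the slides.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-deviation and squared deviations around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant sequence as uncorrelated
    return cov / math.sqrt(var_x * var_y)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly linear relationship
```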
8
Proposed method
- For lag l, consider only the common (overlapping) part of X and the shifted Y
9
Proposed method
10
- R(l): the correlation coefficient when X is delayed by l
- Score at lag l:
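A minimal sketch of R(l) on the overlapping part follows. It assumes that "X is delayed by l" means x_t is paired with y_(t-l), and that the score is the absolute value of R(l); the score formula itself is not preserved in this transcript, so both points are assumptions. The names `lag_correlation` and `score` are illustrative.

```python
import math

def lag_correlation(xs, ys, lag):
    """Correlation of x_t with y_(t-lag), using only the overlapping part."""
    n = len(xs)
    if len(ys) != n:
        raise ValueError("sequences must have the same length")
    if not 0 <= lag <= n - 2:
        raise ValueError("lag must leave at least two overlapping points")
    x_part = xs[lag:]        # x_(lag+1), ..., x_n
    y_part = ys[:n - lag]    # y_1, ..., y_(n-lag)
    k = n - lag
    mean_x = sum(x_part) / k
    mean_y = sum(y_part) / k
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_part, y_part))
    var_x = sum((x - mean_x) ** 2 for x in x_part)
    var_y = sum((y - mean_y) ** 2 for y in y_part)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant overlap counts as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def score(xs, ys, lag):
    # Assumption: the score is the absolute value of R(lag).
    return abs(lag_correlation(xs, ys, lag))

ys = [1, 3, 2, 5, 4, 7, 6, 9]
xs = [0, 0] + ys[:-2]          # X repeats Y two ticks later
print(score(xs, ys, 2))
```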
11
Proposed method
- For large lags l ≈ n, R(l) is computed from too few overlapping points of the original and shifted time sequences
- Therefore the maximum lag m is restricted to n/2
12
Proposed method
- Naive solution:
- At time n, access all values of X and Y and compute R(l) for every lag l (= 0, 1, …)
- Choose the earliest maximum score above the threshold r, or report that there is no lag
- The solution is based on three major steps
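The naive solution can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the score is taken as |R(l)|, x_t is paired with y_(t-l), "earliest maximum" is read as the first local maximum of the score above the threshold, and lags run up to m = n/2. The names `score` and `naive_lag` are hypothetical.

```python
import math

def score(xs, ys, lag):
    """|R(lag)| on the overlap, pairing x_t with y_(t-lag) (assumed convention)."""
    xp, yp = xs[lag:], ys[:len(ys) - lag]
    k = len(xp)
    mx, my = sum(xp) / k, sum(yp) / k
    cov = sum((x - mx) * (y - my) for x, y in zip(xp, yp))
    vx = sum((x - mx) ** 2 for x in xp)
    vy = sum((y - my) ** 2 for y in yp)
    return 0.0 if vx == 0 or vy == 0 else abs(cov) / math.sqrt(vx * vy)

def naive_lag(xs, ys, threshold):
    """Earliest lag whose score is a local maximum above threshold, else None."""
    n = len(xs)
    m = n // 2                                   # maximum lag, as on the slides
    scores = [score(xs, ys, lag) for lag in range(m + 1)]
    for lag, s in enumerate(scores):
        left = scores[lag - 1] if lag > 0 else -1.0
        right = scores[lag + 1] if lag < m else -1.0
        if s > threshold and s >= left and s >= right:
            return lag
    return None

signal = [math.sin(0.3 * t) for t in range(-3, 40)]
ys = signal[3:]        # y_t
xs = signal[:40]       # x_t = y_(t-3): X follows Y with lag 3
print(naive_lag(xs, ys, threshold=0.9))
```

Every call recomputes all scores from the raw values, so the cost per time tick grows with n; this is the cost the rest of the method tries to avoid.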
13
Proposed method
- Some sufficient statistics are needed so that R can be computed easily
- Sx(l,n): sum of the values of X
- Sxx(l,n): sum of the squares of the values of X
- Sxy(l): sum of the products of X and the lagged Y
14
Proposed method
- R(l) is obtained from the sufficient statistics
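The formula image is not preserved in this transcript. One standard way to write the lag correlation in terms of sums is sketched below. The index convention is an assumption: S_x(a,b) is taken to be the sum of x_t over ticks a to b, x_t is paired with y_(t-l), and S_y, S_yy and V_y are defined analogously to S_x, S_xx and V_x. The helper symbol V_x is introduced here for readability.

```latex
\begin{align*}
S_x(a,b) &= \sum_{t=a}^{b} x_t, \qquad
S_{xx}(a,b) = \sum_{t=a}^{b} x_t^2, \qquad
S_{xy}(l) = \sum_{t=l+1}^{n} x_t\, y_{t-l} \\
V_x(a,b) &= S_{xx}(a,b) - \frac{S_x(a,b)^2}{b-a+1} \\
R(l) &= \frac{S_{xy}(l) - \dfrac{S_x(l+1,n)\, S_y(1,n-l)}{n-l}}
             {\sqrt{V_x(l+1,n)}\,\sqrt{V_y(1,n-l)}}
\end{align*}
```

This is the Pearson coefficient of the n - l overlapping pairs, rewritten so that only running sums are needed.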
15
Proposed method
- R(l) can be estimated at any point in time; only five sufficient statistics need to be tracked
- Computing the cross-correlation function between two sequences still takes linear time
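A minimal sketch of tracking the five statistics incrementally for one fixed lag is given below. The class name `LagStats` and its methods are hypothetical, and the pairing of x_t with y_(t-lag) is an assumed convention. Each update costs constant time, but a buffer of the last `lag` values of Y is still needed.

```python
import math
from collections import deque

class LagStats:
    """Running sufficient statistics for one fixed lag (pairs x_t with y_(t-lag))."""

    def __init__(self, lag):
        self.lag = lag
        self.recent_y = deque(maxlen=lag + 1)   # holds y_(t-lag), ..., y_t
        self.count = 0                          # number of overlapping pairs
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y):
        self.recent_y.append(y)
        if len(self.recent_y) <= self.lag:
            return                      # y_(t-lag) does not exist yet
        y_old = self.recent_y[0]        # y_(t-lag)
        self.count += 1
        self.sx += x
        self.sy += y_old
        self.sxx += x * x
        self.syy += y_old * y_old
        self.sxy += x * y_old

    def correlation(self):
        k = self.count
        if k < 2:
            return 0.0
        cov = self.sxy - self.sx * self.sy / k
        var_x = self.sxx - self.sx ** 2 / k
        var_y = self.syy - self.sy ** 2 / k
        if var_x <= 0 or var_y <= 0:
            return 0.0  # assumption: constant data counts as uncorrelated
        return cov / math.sqrt(var_x * var_y)

stats = LagStats(lag=2)
ys = [1, 3, 2, 5, 4, 7, 6, 9]
xs = [0, 0] + ys[:-2]
for x, y in zip(xs, ys):
    stats.update(x, y)
print(stats.correlation())
```

Tracking every lag 0, 1, …, m this way would need one such object per lag, which is why the work per tick remains linear in n.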
16
Proposed method
- Track only a geometric progression of lag values: l = 0, 1, 2, …, 2^i, …
- Only O(log n) values to keep track of, instead of the O(n) that the naive solution requires
- The space required still grows linearly with the length n
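The set of tracked lags can be sketched in a few lines. The function name `probe_lags` is hypothetical, and capping the lags at m = n/2 follows the restriction stated on the earlier slide.

```python
def probe_lags(n):
    """Lags 0, 1, 2, 4, ..., 2**i that do not exceed the maximum lag m = n // 2."""
    m = n // 2
    lags = [0]
    lag = 1
    while lag <= m:
        lags.append(lag)
        lag *= 2
    return lags

print(probe_lags(64))
```

For n = 64 this yields seven lags instead of the 33 lags (0 to 32) that the naive solution would evaluate.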
17
Proposed method
- To compute R(l) at any time, a sliding window of size l must be kept; with m = n/2 this needs O(n) space
- Instead of operating only on the original time sequences, we also compute smoothed versions of them, by taking the means of non-overlapping windows
18
Proposed method
- Window size: powers of g = 2
- X: original time sequence
- Ax_h: smoothed version with windows of length 2^h
- Ax_0: the original sequence; Ax_1: consists of n/2 ticks; etc.
- The sufficient statistics of Ax_h need to be computed only every 2^h time ticks
- At time n, O(log n) levels are needed; the sufficient statistics are computed for each level
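The smoothing hierarchy can be sketched as below. This is a batch illustration with hypothetical names (`window_means`, `smoothing_levels`); a streaming version would instead emit one new level-h value every 2^h ticks. Dropping an incomplete trailing window is an assumption of this sketch.

```python
def window_means(xs, h):
    """Means of consecutive non-overlapping windows of length 2**h."""
    w = 2 ** h
    full = len(xs) // w                 # an incomplete trailing window is dropped
    return [sum(xs[i * w:(i + 1) * w]) / w for i in range(full)]

def smoothing_levels(xs):
    """All levels h = 0, 1, ... that still contain at least one full window."""
    levels = []
    h = 0
    while 2 ** h <= len(xs):
        levels.append(window_means(xs, h))
        h += 1
    return levels

x = [1, 3, 5, 7, 9, 11, 2, 4]
for h, level in enumerate(smoothing_levels(x)):
    print(h, level)
```

Level h has about n / 2^h values, so a lag of l ticks on the original sequence corresponds to roughly l / 2^h positions on level h.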
20
Proposed method
- In contrast with small lags, the larger lags are tracked sparsely
- A cubic spline is used to interpolate the missing correlation coefficients
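The slides do not say which kind of cubic spline is used. As an illustration only, the sketch below builds a natural cubic spline (second derivative zero at both ends) in plain Python and evaluates it at a lag that was not tracked. The function name `natural_cubic_spline` and the sample correlation values are made up for the example.

```python
from bisect import bisect_right

def natural_cubic_spline(xs, ys):
    """Return a function S(x) interpolating the knots (xs[i], ys[i])."""
    n = len(xs) - 1                              # number of intervals
    if n < 1 or len(ys) != n + 1:
        raise ValueError("need at least two knots and matching lengths")
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)                          # second derivatives; ends stay 0
    if n >= 2:
        # Tridiagonal system for the interior second derivatives.
        diag = [2.0 * (h[k] + h[k + 1]) for k in range(n - 1)]
        rhs = [6.0 * ((ys[k + 2] - ys[k + 1]) / h[k + 1]
                      - (ys[k + 1] - ys[k]) / h[k]) for k in range(n - 1)]
        for k in range(1, n - 1):                # forward elimination
            w = h[k] / diag[k - 1]
            diag[k] -= w * h[k]
            rhs[k] -= w * rhs[k - 1]
        m[n - 1] = rhs[n - 2] / diag[n - 2]      # back substitution
        for k in range(n - 3, -1, -1):
            m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]

    def spline(x):
        # Locate the interval containing x (clamped to the outer intervals).
        i = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        a, b = xs[i + 1] - x, x - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return spline

lags = [1, 2, 4, 8, 16]                    # geometrically spaced lags
corr = [0.20, 0.35, 0.80, 0.60, 0.10]      # made-up correlation values
estimate = natural_cubic_spline(lags, corr)
print(estimate(5))                         # interpolated value at an untracked lag
```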
21
Proposed method
- Ax_h(t): window average at time tick t for level h
- Ax_0(t) ≡ x_t
22
Proposed method
- Sufficient statistics:
24
EXPERIMENTS
27
Conclusion
- Proposed BRAID to detect lag correlations on streaming data
- Any-time processing
- Low resource consumption
- High accuracy