BRAID: Stream Mining through Group Lag Correlations
Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos
SIGMOD 2005
Outline
- Introduction
- Proposed method
- Experiments
- Conclusions
Introduction
- Data streams
- Lag correlations
  - Example: higher amounts of fluoride in water → fewer dental cavities some years later
- Goal
  - Monitor multiple numerical streams and determine which pairs are correlated with a lag, and the value of that lag
Introduction
- Given k numerical sequences X_1, ..., X_k, report all pairs X_i and X_j such that X_i follows X_j with lag l
Introduction
- This paper proposes BRAID, a method for handling data streams
  - Any-time processing, and fast
  - Nimble
  - Accurate
  - Small resource consumption
Proposed method
- Data stream X: {x_1, ..., x_t, ..., x_n}, where x_n is the most recent value
- R(0): correlation of X and Y, which have the same length n, at zero lag
- R(0) is the Pearson correlation coefficient ρ of the two sequences
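For reference, the Pearson coefficient of two length-n sequences can be written as below. The bar notation for the means is an assumption introduced here, not notation taken from the slide.

```latex
\rho = \frac{\sum_{t=1}^{n} (x_t - \bar{x})(y_t - \bar{y})}
            {\sqrt{\sum_{t=1}^{n} (x_t - \bar{x})^2}\,\sqrt{\sum_{t=1}^{n} (y_t - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t,
\quad
\bar{y} = \frac{1}{n}\sum_{t=1}^{n} y_t
```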
Proposed method
- For lag l, consider the common part of X and the shifted Y
- R(l): correlation coefficient when X is delayed by l
- The score at lag l is derived from R(l)
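A minimal Python sketch of R(l) on the overlapping part of the two sequences. The function names are hypothetical, and taking the score to be |R(l)| is an assumption made here, as is returning 0.0 when one segment is constant.

```python
import math

def lag_correlation(x, y, lag):
    """R(lag): Pearson correlation of x_{l+1..n} against y_{1..n-l}."""
    n = len(x)
    if len(y) != n:
        raise ValueError("x and y must have the same length")
    if not 0 <= lag <= n - 2:
        raise ValueError("lag must leave at least two overlapping points")
    xs = x[lag:]       # x delayed by `lag`: x_{l+1}, ..., x_n
    ys = y[:n - lag]   # matching part of y: y_1, ..., y_{n-l}
    m = n - lag        # length of the common part
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant segment counts as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def score(x, y, lag):
    """Assumed score: absolute value of R(lag)."""
    return abs(lag_correlation(x, y, lag))
```

For example, if x repeats y two ticks later, `score(x, y, 2)` is close to 1 while the score at lag 0 is lower.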
Proposed method
- For large lags l ≈ n, the original and the shifted time sequence overlap in too few points to compute R(l)
- Restrict the maximum lag m to n/2
Proposed method
- Naive solution:
  - At time n, access all values of X and Y and compute R(l) for every lag l (= 0, 1, ...)
  - Choose the earliest maximum score above the threshold r, or report that there is no lag
- The solution is based on three major steps
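The naive solution can be sketched as follows. This is an illustration under stated assumptions: the score is taken to be |R(l)|, "earliest maximum" is read as the earliest local maximum of the score, lags run from 0 to n/2, and the names `naive_lag` and `threshold` are hypothetical.

```python
import math

def _corr(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if one is constant)."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def naive_lag(x, y, threshold):
    """Return (lag, score) for the earliest local maximum above threshold, else None."""
    n = len(x)
    if len(y) != n or n < 4:
        raise ValueError("need two sequences of equal length >= 4")
    max_lag = n // 2
    # Recompute every score from scratch: one linear pass per lag.
    scores = [abs(_corr(x[l:], y[:n - l])) for l in range(max_lag + 1)]
    for l, s in enumerate(scores):
        left = scores[l - 1] if l > 0 else -math.inf
        right = scores[l + 1] if l < max_lag else -math.inf
        if s > threshold and s >= left and s >= right:
            return l, s
    return None
```

Each call touches all n values once per lag, which is the cost the later slides set out to reduce.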
Proposed method
- Some sufficient statistics are needed so that R can be computed easily
  - Sx(l,n): sum of the values of X
  - Sxx(l,n): sum of the squared values of X
  - Sxy(l): sum of the products of X and the lag-l shifted Y
Proposed method
- R(l) is obtained from these sufficient statistics
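One way to write R(l) in terms of such sums is shown below. It follows algebraically from the Pearson formula applied to the common part of length n - l; the shortened symbols S_x, S_y, S_xx, S_yy, S_xy are notation assumed here and may differ from the slide's Sx(l,n) indexing.

```latex
S_x = \sum_{t=l+1}^{n} x_t, \quad
S_y = \sum_{t=1}^{n-l} y_t, \quad
S_{xx} = \sum_{t=l+1}^{n} x_t^2, \quad
S_{yy} = \sum_{t=1}^{n-l} y_t^2, \quad
S_{xy} = \sum_{t=l+1}^{n} x_t\, y_{t-l}

R(l) = \frac{S_{xy} - \dfrac{S_x S_y}{n-l}}
            {\sqrt{S_{xx} - \dfrac{S_x^2}{n-l}}\;\sqrt{S_{yy} - \dfrac{S_y^2}{n-l}}}
```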
Proposed method
- R(l) can be estimated at any point in time; only five sufficient statistics need to be tracked
- It still takes linear time to compute the cross-correlation function between two sequences
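A sketch of how the five sums could be maintained incrementally for one fixed lag, so that R(l) is available after every tick. The class `LagStats` and its fields are hypothetical names, and buffering the last l values of Y in a deque is one possible design, not necessarily the authors' implementation.

```python
import math
from collections import deque

class LagStats:
    """Running sums for one lag: Sx, Sy, Sxx, Syy, Sxy over the common part."""

    def __init__(self, lag):
        self.lag = lag
        self.pending_y = deque()  # the most recent `lag` values of y
        self.m = 0                # number of (x_t, y_{t-lag}) pairs seen
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, x_t, y_t):
        """Consume one tick of both streams in O(1) time."""
        self.pending_y.append(y_t)
        if len(self.pending_y) <= self.lag:
            return  # x_t has no partner y_{t-lag} yet
        y_old = self.pending_y.popleft()  # this is y_{t-lag}
        self.m += 1
        self.sx += x_t
        self.sy += y_old
        self.sxx += x_t * x_t
        self.syy += y_old * y_old
        self.sxy += x_t * y_old

    def correlation(self):
        """R(lag) from the running sums (0.0 if undefined)."""
        if self.m < 2:
            return 0.0
        cov = self.sxy - self.sx * self.sy / self.m
        var_x = self.sxx - self.sx ** 2 / self.m
        var_y = self.syy - self.sy ** 2 / self.m
        if var_x <= 0 or var_y <= 0:
            return 0.0
        return cov / math.sqrt(var_x * var_y)
```

Tracking every lag up to n/2 this way would need one such object per lag, which is why the total cost per tick remains linear.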
Proposed method
- Keep track of only a geometric progression of lag values: l = 0, 1, 2, 4, ..., 2^i, ...
- Only O(log n) numbers to keep track of, instead of the O(n) that the naive solution requires
- The space required still grows linearly with the length n
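The geometric progression of lags is easy to enumerate. The helper name `geometric_lags` is hypothetical.

```python
def geometric_lags(max_lag):
    """Lags 0, 1, 2, 4, ..., up to the largest power of two <= max_lag."""
    lags = [0]
    lag = 1
    while lag <= max_lag:
        lags.append(lag)
        lag *= 2
    return lags

print(geometric_lags(20))  # [0, 1, 2, 4, 8, 16]
```

For a maximum lag of 20 this tracks 6 lags rather than 21.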
Proposed method
- To compute R(l) at any time, a sliding window of size l must be kept; with maximum lag m = n/2 this needs O(n) space
- Instead of operating only on the original time sequences, also compute smoothed versions of them, using the means of non-overlapping windows
Proposed method
- Window sizes are powers of g = 2
- X: original time sequence
- Ax_h: smoothed version with windows of length 2^h
- Ax_0 is the original sequence, Ax_1 consists of n/2 ticks, etc.
- The sufficient statistics of Ax_h need to be computed only every 2^h time ticks
- At time n, O(log n) levels are needed; the sufficient statistics are computed for each level
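A small batch sketch of the smoothing step: level h replaces each non-overlapping window of 2^h values by its mean. The function name `window_means` is hypothetical, and dropping an incomplete trailing window is an assumption.

```python
def window_means(x, h):
    """Means of non-overlapping windows of length 2**h (level h of the hierarchy)."""
    w = 2 ** h
    # Stop before a trailing window that is not yet full.
    return [sum(x[i:i + w]) / w for i in range(0, len(x) - w + 1, w)]

x = [1, 3, 5, 7, 9, 11, 13, 15]
print(window_means(x, 1))  # [2.0, 6.0, 10.0, 14.0]
print(window_means(x, 2))  # [4.0, 12.0]
```

Level 1 has n/2 values and level 2 has n/4, so a new value appears at level h only once every 2^h ticks.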
Proposed method
- In contrast with the small lags, the larger ones are sparse
- Use a cubic spline to interpolate the missing correlation coefficients
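As an illustration of the interpolation step, here is a pure-Python natural cubic spline through coefficients known at sparse lags. The natural boundary conditions (zero second derivative at both ends) and the name `natural_cubic_spline` are assumptions; the slide does not say which spline variant is used.

```python
from bisect import bisect_right

def natural_cubic_spline(xs, ys):
    """Return a function interpolating the points (xs[i], ys[i]); xs must increase."""
    n = len(xs) - 1  # number of intervals
    if n < 1 or len(ys) != len(xs):
        raise ValueError("need at least two points and matching lengths")
    h = [xs[i + 1] - xs[i] for i in range(n)]
    if any(step <= 0 for step in h):
        raise ValueError("xs must be strictly increasing")
    m2 = [0.0] * (n + 1)  # second derivatives at the knots; the ends stay 0
    if n >= 2:
        # Tridiagonal system for the interior second derivatives (Thomas algorithm).
        diag = [2.0 * (h[k] + h[k + 1]) for k in range(n - 1)]
        rhs = [6.0 * ((ys[k + 2] - ys[k + 1]) / h[k + 1] - (ys[k + 1] - ys[k]) / h[k])
               for k in range(n - 1)]
        for k in range(1, n - 1):  # forward elimination
            w = h[k] / diag[k - 1]
            diag[k] -= w * h[k]
            rhs[k] -= w * rhs[k - 1]
        m2[n - 1] = rhs[n - 2] / diag[n - 2]
        for k in range(n - 3, -1, -1):  # back substitution
            m2[k + 1] = (rhs[k] - h[k + 1] * m2[k + 2]) / diag[k]

    def evaluate(x):
        i = min(max(bisect_right(xs, x) - 1, 0), n - 1)  # interval containing x
        a = xs[i + 1] - x
        b = x - xs[i]
        return ((m2[i] * a ** 3 + m2[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m2[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m2[i + 1] * h[i] / 6.0) * b)

    return evaluate
```

With coefficients known at lags 0, 1, 2, 4, 8, calling the returned function at 3, 5, 6 or 7 gives an estimate for the lags that were not tracked.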
Proposed method
- Ax_h(t): window average at time tick t for level h
- Ax_0(t) ≡ x_t
Proposed method
- Sufficient statistics at level h are computed on the window averages Ax_h
EXPERIMENTS
Conclusion
- Proposed BRAID to detect lag correlations on streaming data
  - At any time
  - Low resource consumption
  - High accuracy