BRAID: Discovering Lag Correlations in Multiple Streams Yasushi Sakurai (NTT Cyber Space Labs) Spiros Papadimitriou (Carnegie Mellon Univ.) Christos Faloutsos.

Slides:



Advertisements
Similar presentations
Discovering Lag Interval For Temporal Dependencies Larisa Shwartz Liang Tang, Tao Li, Larisa Shwartz1 Liang Tang, Tao Li
Advertisements

Fast Algorithms For Hierarchical Range Histogram Constructions
Slide 1 Bayesian Model Fusion: Large-Scale Performance Modeling of Analog and Mixed- Signal Circuits by Reusing Early-Stage Data Fa Wang*, Wangyang Zhang*,
Yasuhiro Fujiwara (NTT Cyber Space Labs)
Fast, Memory-Efficient Traffic Estimation by Coincidence Counting Fang Hao 1, Murali Kodialam 1, T. V. Lakshman 1, Hui Zhang 2, 1 Bell Labs, Lucent Technologies.
Carnegie Mellon DB/IR '06C. Faloutsos#1 Data Mining on Streams Christos Faloutsos CMU.
Yoshiharu Ishikawa (Nagoya University) Yoji Machida (University of Tsukuba) Hiroyuki Kitagawa (University of Tsukuba) A Dynamic Mobility Histogram Construction.
Efficient Distribution Mining and Classification Yasushi Sakurai (NTT Communication Science Labs), Rosalynn Chong (University of British Columbia), Lei.
Streaming Pattern Discovery in Multiple Time-Series Spiros Papadimitriou Jimeng Sun Christos Faloutsos Carnegie Mellon University VLDB 2005, Trondheim,
SIGMOD 2006University of Alberta1 Approximately Detecting Duplicates for Streaming Data using Stable Bloom Filters Presented by Fan Deng Joint work with.
Stabbing the Sky: Efficient Skyline Computation over Sliding Windows COMP9314 Lecture Notes.
Validation and Monitoring Measures of Accuracy Combining Forecasts Managing the Forecasting Process Monitoring & Control.
Query Assurance on Data Streams  Ke Yi (AT&T Labs, now at HKUST)  Feifei Li (Boston U, now at Florida State)  Marios Hadjieleftheriou (AT&T Labs) 
Modeling Pixel Process with Scale Invariant Local Patterns for Background Subtraction in Complex Scenes (CVPR’10) Shengcai Liao, Guoying Zhao, Vili Kellokumpu,
WindMine: Fast and Effective Mining of Web-click Sequences SDM 2011Y. Sakurai et al.1 Yasushi Sakurai (NTT) Lei Li (Carnegie Mellon Univ.) Yasuko Matsubara.
Carnegie Mellon 1 Maximum Likelihood Estimation for Information Thresholding Yi Zhang & Jamie Callan Carnegie Mellon University
Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
On Discovering Moving Clusters in Spatio-temporal Data Panos Kalnis National University of Singapore Nikos Mamoulis University of Hong Kong Spiridon Bakiras.
Abdullah Mueen UC Riverside Suman Nath Microsoft Research Jie Liu Microsoft Research.
Communication-Efficient Distributed Monitoring of Thresholded Counts Ram Keralapura, UC-Davis Graham Cormode, Bell Labs Jai Ramamirtham, Bell Labs.
A faster reliable algorithm to estimate the p-value of the multinomial llr statistic Uri Keich and Niranjan Nagarajan (Department of Computer Science,
SSCP: Mining Statistically Significant Co-location Patterns Sajib Barua and Jörg Sander Dept. of Computing Science University of Alberta, Canada.
Adaptive Sampling in Distributed Streaming Environment Ankur Jain 2/4/03.
Probabilistic Skyline Operator over sliding Windows Wan Qian HKUST DB Group.
10-603/15-826A: Multimedia Databases and Data Mining SVD - part I (definitions) C. Faloutsos.
A Multiresolution Symbolic Representation of Time Series
Laurent Itti: CS599 – Computational Architectures in Biological Vision, USC Lecture 7: Coding and Representation 1 Computational Architectures in.
Privacy Preservation for Data Streams Feifei Li, Boston University Joint work with: Jimeng Sun (CMU), Spiros Papadimitriou, George A. Mihaila and Ioana.
CS 591 A11 Algorithms for Data Streams Dhiman Barman CS 591 A1 Algorithms for the New Age 2 nd Dec, 2002.
RACE: Time Series Compression with Rate Adaptivity and Error Bound for Sensor Networks Huamin Chen, Jian Li, and Prasant Mohapatra Presenter: Jian Li.
R. Werner Solar Terrestrial Influences Institute - BAS Time Series Analysis by descriptive statistic.
Fault Diagnosis System for Wireless Sensor Networks Praharshana Perera Supervisors: Luciana Moreira Sá de Souza Christian Decker.
Fast and Exact Monitoring of Co-evolving Data Streams Yasuko Matsubara, Yasushi Sakurai (Kumamoto University) Naonori Ueda (NTT) Masatoshi Yoshikawa (Kyoto.
Approximate Frequency Counts over Data Streams Loo Kin Kong 4 th Oct., 2002.
New Streaming Algorithms for Fast Detection of Superspreaders Shobha Venkataraman* Joint work with: Dawn Song*, Phillip Gibbons ¶,
Multiple Aggregations Over Data Streams Rui ZhangNational Univ. of Singapore Nick KoudasUniv. of Toronto Beng Chin OoiNational Univ. of Singapore Divesh.
Fast Mining and Forecasting of Complex Time-Stamped Events Yasuko Matsubara (Kyoto University), Yasushi Sakurai (NTT), Christos Faloutsos (CMU), Tomoharu.
BRAID: Stream Mining through Group Lag Correlations Yasushi Sakurai Spiros Papadimitriou Christos Faloutsos SIGMOD 2005.
Smita Vijayakumar Qian Zhu Gagan Agrawal 1.  Background  Data Streams  Virtualization  Dynamic Resource Allocation  Accuracy Adaptation  Research.
AutoPlait: Automatic Mining of Co-evolving Time Sequences Yasuko Matsubara (Kumamoto University) Yasushi Sakurai (Kumamoto University) Christos Faloutsos.
Experimental Results ■ Observations:  Overall detection accuracy increases as the length of observation window increases.  An observation window of 100.
Carnegie Mellon School of Computer Science Forecasting with Cyber-physical Interactions in Data Centers Lei Li PDL Seminar 9/28/2011.
Computer Science and Engineering Efficiently Monitoring Top-k Pairs over Sliding Windows Presented By: Zhitao Shen 1 Joint work with Muhammad Aamir Cheema.
Abdullah Mueen Eamonn Keogh University of California, Riverside.
k-Shape: Efficient and Accurate Clustering of Time Series
Replicating Memory Behavior for Performance Skeletons Aditya Toomula PC-Doctor Inc. Reno, NV Jaspal Subhlok University of Houston Houston, TX By.
Carnegie Mellon Novelty and Redundancy Detection in Adaptive Filtering Yi Zhang, Jamie Callan, Thomas Minka Carnegie Mellon University {yiz, callan,
Stream Monitoring under the Time Warping Distance Yasushi Sakurai (NTT Cyber Space Labs) Christos Faloutsos (Carnegie Mellon Univ.) Masashi Yamamuro (NTT.
Streaming Pattern Discovery in Multiple Time-Series Jimeng Sun Spiros Papadimitrou Christos Faloutsos PARALLEL DATA LABORATORY Carnegie Mellon University.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Data Mining - assoc. rules C. Faloutsos.
By: Gang Zhou Computer Science Department University of Virginia 1 Medians and Beyond: New Aggregation Techniques for Sensor Networks CS851 Seminar Presentation.
Estimating PageRank on Graph Streams Atish Das Sarma (Georgia Tech) Sreenivas Gollapudi, Rina Panigrahy (Microsoft Research)
Forecasting Demand. Problems with Forecasts Forecasts are Usually Wrong. Every Forecast Should Include an Estimate of Error. Forecasts are More Accurate.
Enabling Real Time Alerting through streaming pattern discovery Chengyang Zhang Computer Science Department University of North Texas 11/21/2016 CRI Group.
Dense-Region Based Compact Data Cube
Forecasting with Cyber-physical Interactions in Data Centers (part 3)
On-line Detection of Real Time Multimedia Traffic
Distributed Network Traffic Feature Extraction for a Real-time IDS
Augmented Sketch: Faster and More Accurate Stream Processing
A Framework for Automatic Resource and Accuracy Management in A Cloud Environment Smita Vijayakumar.
Supporting Fault-Tolerance in Streaming Grid Applications
Kijung Shin1 Mohammad Hammoud1
An Adaptive Middleware for Supporting Time-Critical Event Response
Jimeng Sun · Charalampos (Babis) E
Smita Vijayakumar Qian Zhu Gagan Agrawal
Range-Efficient Computation of F0 over Massive Data Streams
Heavy Hitters in Streams and Sliding Windows
Online Analytical Processing Stream Data: Is It Feasible?
Efficient Processing of Top-k Spatial Preference Queries
Presentation transcript:

BRAID: Discovering Lag Correlations in Multiple Streams Yasushi Sakurai (NTT Cyber Space Labs) Spiros Papadimitriou (Carnegie Mellon Univ.) Christos Faloutsos (Carnegie Mellon Univ.)

SIGMOD 2005 Y. Sakurai et al 2 Motivation Data-stream applications  Network analysis  Sensor monitoring  Financial data analysis  Moving object tracking Goal  Monitor multiple numerical streams  Determine which pairs are correlated with lags  Report the value of each such lag (if any)

SIGMOD 2005 Y. Sakurai et al 3 Lag Correlations Examples  A decrease in interest rates typically precedes an increase in house sales by a few months  Higher amounts of fluoride in the drinking water leads to fewer dental cavities, some years later  High CPU utilization on server 1 precedes high CPU utilization for server 2 by a few minutes

SIGMOD 2005 Y. Sakurai et al 4 Lag Correlations Example of lag-correlated sequences These sequences are correlated with lag l=1300 time-ticks CCF (Cross-Correlation Function)

SIGMOD 2005 Y. Sakurai et al 5 Lag Correlations CCF (Cross-Correlation Function) Example of lag-correlated sequences  Fast (high performance)  Nimble (Low memory consumption)  Accurate (good approximation)

SIGMOD 2005 Y. Sakurai et al 6 Problem #1: PAIR of sequences For given two co-evolving sequences X and Y, determine  Whether there is a lag correlation  If yes, what is the lag length l Any time, on semi-infinite streams ? yes; l = 1,300 X Y

SIGMOD 2005 Y. Sakurai et al 7 Problem #2: k-way For given k numerical sequences, X 1,…,X k, report  Which pairs (if any) have a lag correlation  The corresponding lag for such pairs again, ‘any time’, streaming fashion ? X 1 and X 2 ; l = 1, X1X1 X2X2 XkXk

SIGMOD 2005 Y. Sakurai et al 8 Our solution, BRAID characteristics:  ‘Any-time’ processing, and fast Computation time per time tick is constant  Nimble Memory space requirement is sub-linear of sequence length  Accurate Approximation introduces small error

SIGMOD 2005 Y. Sakurai et al 9 Sequence indexing  Agrawal et al. (FODO 1993)  Faloutsos et al. (SIGMOD 1994)  Keogh et al. (SIGMOD 2001) Compression (wavelet and random projections)  Gilbert et al. (VLDB 2001)  Guha et al. (VLDB 2004)  Dobra et al.(SIGMOD 2002)  Ganguly et al.(SIGMOD 2003) Related Work

SIGMOD 2005 Y. Sakurai et al 10 Data Stream Management  Abadi et al. (VLDB Journal 2003)  Motwani et al. (CIDR 2003)  Chandrasekaran et al. (CIDR 2003)  Cranor et al. (SIGMOD 2003) Related Work

SIGMOD 2005 Y. Sakurai et al 11 Related Work Pattern discovery  Clustering for data streams Guha et al. (TKDE 2003)  Monitoring multiple streams Zhu et al. (VLDB 2002)  Forecasting Yi et al. (ICDE 2000) Papadimitriou et al. (VLDB 2003) None of previously published methods focuses on the problem

SIGMOD 2005 Y. Sakurai et al 12 Overview Introduction / Related work Background Main ideas Theoretical analysis Experimental results

SIGMOD 2005 Y. Sakurai et al 13 Background CCF (Cross-Correlation Function) positively correlated un-correlated ++ anti-correlated (lower than -  ) Lag correlation Lag Correlation

SIGMOD 2005 Y. Sakurai et al 14 Background Definition of ‘ score ’, the absolute value of R(l) Lag correlation  Given a threshold ,  A local maximum  The earliest such maximum, if more maxima exist details

SIGMOD 2005 Y. Sakurai et al 15 Overview Introduction / Related work Background Main ideas Theoretical analysis Experimental results

SIGMOD 2005 Y. Sakurai et al 16 Why not ‘ naive ’ ? Naive solution:  Compute correlation coefficient for each lag l = 0, 1, 2, 3, …, n/2 But,  O(n) space  O(n 2 ) time or O(n log n) time w/ FFT t=n Time Lag Correlation n/2 0

SIGMOD 2005 Y. Sakurai et al 17 Main Idea (1) Incremental computing:  the correlation coefficient of two sequences is ‘algebraic’ -> can be computed incrementally we need to maintain only 6 ‘sufficient statistics’:  Sequence length n  Sum of X, Square sum of X  Sum of Y, Square sum of Y  Inner-product for X and the shifted Y

SIGMOD 2005 Y. Sakurai et al 18 Main Idea (1) Incremental computing: Sequence length n Sum of X : Square sum of X : Inner-product for X and the shifted Y :  Compute R(l) incrementally: Covariance of X and Y: Variance of X: details

SIGMOD 2005 Y. Sakurai et al 19 Main Idea (1) Complexity Naive (incremental) BRAID Space O(n)O(n)O(n)O(n) Comp. time O(n log n)O(n)O(n) Better, but not good enough!

SIGMOD 2005 Y. Sakurai et al 20 Main Idea (2) Lag Correlation Geometric lag probing

SIGMOD 2005 Y. Sakurai et al 21 Main Idea (2) Lag Correlation Geometric lag probing ie., compute the correlation coefficient for lag: l = 0, 1, 2, 4,... 2 h O(log n) estimations

SIGMOD 2005 Y. Sakurai et al 22 Main Idea (2) Geometric lag probing But, so far, we still need O(n) space because the longest lag is n/2 Naive (incremental) BRAID Space O(n)O(n)O(n)O(n) Comp. time O(n log n)O(n)O(n)O(log n)

SIGMOD 2005 Y. Sakurai et al 23 Main Idea (3) Lag Correlation Sequence smoothing t=n Time Reminder: Naïve:

SIGMOD 2005 Y. Sakurai et al 24 Main Idea (3) Lag Correlation Level h=0 t=n Time Sequence smoothing  Means of windows for each level  Sufficient statistics computed from the means  CCF computed from the sufficient statistics  But, it allows a partial redundancy

SIGMOD 2005 Y. Sakurai et al 25 Putting it all together: Lag Correlation Level h=0 t=n Time Geometric lag probing + smoothing  Use colored windows  Keep track of only a geometric progression of the lag values: l={0,1,2,4,8,…,2 h,…}

SIGMOD 2005 Y. Sakurai et al 26 Putting it all together: Geometric lag probing + smoothing  Use colored windows  Keep track of only a geometric progression of the lag values: l={0,1,2,4,8,…,2 h,…} Lag Correlation Level h=0 t=n Time h=0 Y X l=0

SIGMOD 2005 Y. Sakurai et al 27 Putting it all together: Geometric lag probing + smoothing  Use colored windows  Keep track of only a geometric progression of the lag values: l={0,1,2,4,8,…,2 h,…} Lag Correlation Level h=0 t=n Time h=0 Y X l=1

SIGMOD 2005 Y. Sakurai et al 28 Putting it all together: Geometric lag probing + smoothing  Use colored windows  Keep track of only a geometric progression of the lag values: l={0,1,2,4,8,…,2 h,…} Lag Correlation Level h=1 t h =n/2 Time h=1 Y X l=2

SIGMOD 2005 Y. Sakurai et al 29 Putting it all together: Geometric lag probing + smoothing  Use colored windows  Keep track of only a geometric progression of the lag values: l={0,1,2,4,8,…,2 h,…} Lag Correlation Level h=2 Time h=2 Y X t h =n/4 l=4

SIGMOD 2005 Y. Sakurai et al 30 Putting it all together: Geometric lag probing + smoothing  Use colored windows  Keep track of only a geometric progression of the lag values: l={0,1,2,4,8,…,2 h,…} Lag Correlation Level h=3 Time h=3 Y X t h =n/8 l=8

SIGMOD 2005 Y. Sakurai et al 31 Putting it all together: Lag Correlation Level h=0 t=n Time Geometric lag probing + smoothing  Use colored windows  Keep track of only a geometric progression of the lag values: l={0,1,2,4,8,…,2 h,…}  Use a cubic spline to interpolate

SIGMOD 2005 Y. Sakurai et al 32 Thus: Complexity Naive (incremental) BRAID Space O(n)O(n)O(n)O(n)O(log n) Comp. time O(n log n)O(n)O(n)O(1) * (*) Computation time: O(logn) And actually, amortized time: O(1)

SIGMOD 2005 Y. Sakurai et al 33 Overview Introduction / Related work Background Main ideas  enhancing the accuracy Theoretical analysis Experimental results details

SIGMOD 2005 Y. Sakurai et al 34 Enhanced Probing Scheme Q: How to probe more densely than 2 h ? Lag Correlation Level h=0 t=n Time

SIGMOD 2005 Y. Sakurai et al 35 Enhanced Probing Scheme Q: How to probe more densely than 2 h ? A: probe in a mixture of geometric and arithmetic progressions Lag Correlation Level h=0 t=n Time

SIGMOD 2005 Y. Sakurai et al 36 Enhanced Probing Scheme Basic scheme: b=1 (one number for each level) Enhanced scheme: b>1  Example of b=4  Probing the CCF in a mixture of geometric and arithmetic progressions: l={0,1,…,7;8,10,12,14;16,20,24,28;32,40,…} Level h=0 Time t=n Correlation Lag step:1step: 2 step: 4

SIGMOD 2005 Y. Sakurai et al 37 Overview Introduction / Related work Background Main ideas Theoretical analysis Experimental results

SIGMOD 2005 Y. Sakurai et al 38 Theoretical Analysis - Accuracy Effect of smoothing Effect of geometric lag probing For sequences with low frequencies, smoothing introduces only small error BRAIDS will provide no error, if lag probing satisfies the sampling theorem (Nyquist’s)

SIGMOD 2005 Y. Sakurai et al 39 Effect of geometric lag probing  Informally, BRAIDS will provide no error, if lag probing satisfies the sampling theorem (Nyquist’s)  Formally: Theorem 2 f R : the Nyquist frequency of CCF, f R = min( f x, f y ) f x, f y : the Nyquist frequencies of X and Y Theoretical Analysis - Accuracy BRAID will find the lag correlations perfectly, if details

SIGMOD 2005 Y. Sakurai et al 40 Theoretical Analysis - Complexity Naive solution  O(n) space  O(n) time per time tick BRAID  O(log n) space  O(1) time for updating sufficient statistics  O(log n) time for interpolating (when output is required) details

SIGMOD 2005 Y. Sakurai et al 41 Overview Introduction / Related work Background Main ideas Theoretical analysis Experimental results

SIGMOD 2005 Y. Sakurai et al 42 Experimental results Setup  Intel Xeon 2.8GHz, 1GB memory, Linux  Datasets: Synthetic: Sines, SpikeTrains, Real: Humidity, Light, Temperature, Kursk, Sunspots  Enhanced BRAID, b=16

SIGMOD 2005 Y. Sakurai et al 43 Experimental results Evaluation  Accuracy for CCF  Accuracy for the lag estimation  Computation time  k -way lag correlations

SIGMOD 2005 Y. Sakurai et al 44 Accuracy for CCF (1) Sines CCF (Cross-Correlation Function) BRAID perfectly estimates the correlation coefficients of the sinusoidal wave

SIGMOD 2005 Y. Sakurai et al 45 Accuracy for CCF (2) SpikeTrains CCF (Cross-Correlation Function) BRAID closely estimates the correlation coefficients

SIGMOD 2005 Y. Sakurai et al 46 Accuracy for CCF (3) Humidity (Real data) CCF (Cross-Correlation Function) BRAID closely estimates the correlation coefficients

SIGMOD 2005 Y. Sakurai et al 47 Accuracy for CCF (4) Light (Real data) CCF (Cross-Correlation Function) BRAID closely estimates the correlation coefficients

SIGMOD 2005 Y. Sakurai et al 48 Accuracy for CCF (5) Kursk (Real data) CCF (Cross-Correlation Function) BRAID closely estimates the correlation coefficients

SIGMOD 2005 Y. Sakurai et al 49 Accuracy for CCF (6) Sunspots (Real data) CCF (Cross-Correlation Function) BRAID closely estimates the correlation coefficients

SIGMOD 2005 Y. Sakurai et al 50 Experimental results Evaluation  Accuracy for CCF  Accuracy for the lag estimation  Computation time  k -way lag correlations

SIGMOD 2005 Y. Sakurai et al 51 Estimation Error of Lag Correlations Largest relative error is about 1% Datasets Lag correlationEstimation error (%) NaiveBRAID Sines SpikeTrains Humidity Light Kursk Sunspots

SIGMOD 2005 Y. Sakurai et al 52 Experimental results Evaluation  Accuracy for CCF  Accuracy for the lag estimation  Computation time  k -way lag correlations

SIGMOD 2005 Y. Sakurai et al 53 Computation time Reduce computation time dramatically Up to 40,000 times faster

SIGMOD 2005 Y. Sakurai et al 54 Experimental results Evaluation  Accuracy for CCF  Accuracy for the lag estimation  Computation time  k -way lag correlations

SIGMOD 2005 Y. Sakurai et al 55 Group Lag Correlations 55 Temperature sequences Two correlated pairs Estimation of CCF of #16 and #19 Estimation of CCF of #47 and #48 #16#19#47 #48

SIGMOD 2005 Y. Sakurai et al 56 Conclusions Automatic lag correlation detection on data stream 1. ‘Any-time’ 2. Nimble  O(log n) space, O(1) time to update the statistics 3. Fast  Up to 40,000 times faster than the naive implementation 4. Accurate  within 1% relative error or less

SIGMOD 2005 Y. Sakurai et al 57 Effect of geometric lag probing  Informally, BRAIDS will provide no error, if lag probing satisfies the sampling theorem (Nyquist’s)  Formally: Theorem 2 f R : the Nyquist frequency of CCF, f R = min( f x, f y ) f x, f y : the Nyquist frequencies of X and Y Theoretical Analysis - Accuracy BRAID will find the lag correlations perfectly, if details

SIGMOD 2005 Y. Sakurai et al 58 Effect of Probing Dataset: Sines Lag correlation with b=1 l R =1024

SIGMOD 2005 Y. Sakurai et al 59 Effect of Probing Dataset: Light Lag correlation with b=1 l R =630