Carnegie Mellon DB/IR '06C. Faloutsos#1 Data Mining on Streams Christos Faloutsos CMU.

Presentation transcript:

Carnegie Mellon DB/IR '06C. Faloutsos#1 Data Mining on Streams Christos Faloutsos CMU

Carnegie Mellon DB/IR '06C. Faloutsos#2 THANK YOU! Prof. Panos Ipeirotis Julia Mills

Carnegie Mellon DB/IR '06C. Faloutsos#3 Joint work with Spiros Papadimitriou (CMU->IBM) Jimeng Sun (CMU/CS) Anthony Brockwell (CMU/Stat) Jeanne Vanbriesen (CMU/CivEng) Greg Ganger (CMU/ECE)

Carnegie Mellon DB/IR '06C. Faloutsos#4 Outline Problem and motivation Single-sequence mining: AWSOM Co-evolving sequences: SPIRIT Lag correlations: BRAID Conclusions

Carnegie Mellon DB/IR '06C. Faloutsos#5 Problem definition - example Each sensor collects data (x_1, x_2, …, x_t, …)

Carnegie Mellon DB/IR '06C. Faloutsos#6 Problem definition Given: one or more sequences x_1, x_2, …, x_t, … (y_1, y_2, …, y_t, …) Find –patterns; correlations; outliers –incrementally!

Carnegie Mellon DB/IR '06C. Faloutsos#7 Limitations / Challenges Find patterns using a method that is: nimble: limited resources –Memory –Bandwidth, power, CPU; incremental: on-line, ‘any-time’ response –single pass (‘you get to see it only once’); automatic: no human intervention –e.g., in remote environments

Carnegie Mellon DB/IR '06C. Faloutsos#8 Application domains Sensor devices –Temperature, weather measurements –Road traffic data –Geological observations –Patient physiological data Embedded devices –Network routers –Intelligent (active) disks

Carnegie Mellon DB/IR '06C. Faloutsos#9 Motivation - Applications (cont’d) ‘Smart house’ –sensors monitor temperature, humidity, air quality –video surveillance

Carnegie Mellon DB/IR '06C. Faloutsos#10 Motivation - Applications (cont’d) civil/automobile infrastructure –bridge vibrations [Oppenheim+02] – road conditions / traffic monitoring

Carnegie Mellon DB/IR '06C. Faloutsos#11 Motivation - Applications (cont’d) Weather, environment/anti-pollution –volcano monitoring –air/water pollutant monitoring

Carnegie Mellon DB/IR '06C. Faloutsos#12 Motivation - Applications (cont’d) Computer systems –‘Active Disks’ (buffering, prefetching) –web servers (ditto) –network traffic monitoring –...

Carnegie Mellon InteMon w/ Evan Hoke, Jimeng Sun self-* PetaByte data center at CMU

Carnegie Mellon DB/IR '06C. Faloutsos#14 Outline Problem and motivation Single-sequence mining: AWSOM Co-evolving sequences: SPIRIT Lag correlations: BRAID conclusions

Carnegie Mellon DB/IR '06C. Faloutsos#15 Single sequence mining - AWSOM with Spiros Papadimitriou (CMU -> IBM) Anthony Brockwell (CMU/Stat)

Carnegie Mellon DB/IR '06C. Faloutsos#16 Problem definition Semi-infinite streams of values (time series) x_1, x_2, …, x_t, … Find patterns, forecasts, outliers… [Figure annotations: Periodicity? (daily); Periodicity? (twice daily); “Noise”??]

Carnegie Mellon DB/IR '06C. Faloutsos#17 Requirements / Goals Adapt and handle arbitrary periodic components, and be: nimble (limited resources, single pass); on-line, any-time; automatic (no human intervention/tuning)

Carnegie Mellon DB/IR '06C. Faloutsos#18 Overview Introduction / Related work Background Main idea Experimental results

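Carnegie Mellon DB/IR '06C. Faloutsos#19 Wavelets Example – Haar transform [Figure: signal x_t over time t, decomposed into Haar wavelet coefficients W_{1,1}, W_{1,2}, W_{1,3}, W_{1,4}, W_{2,1}, W_{2,2}, W_{3,1} and the “constant” coefficient V_{4,1}; axes: time, frequency]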

Carnegie Mellon DB/IR '06C. Faloutsos#20 Wavelets Why we like them Wavelets compress many real signals well: –Image compression and processing –Vision –Astronomy, seismology, … Wavelet coefficients can be updated as new points arrive
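
The last point above (coefficients can be updated as new points arrive) can be sketched as follows. This is a minimal illustration of a streaming Haar transform, not the AWSOM code itself: the class name `StreamingHaar`, the orthonormal 1/sqrt(2) scaling, and the convention that level 1 is the finest level are assumptions made for this sketch.

```python
import math


class StreamingHaar:
    """Emits Haar detail coefficients incrementally as points arrive.

    At most one unpaired smooth value is kept per level, so the state
    grows as O(log N) for N points seen so far.
    """

    def __init__(self):
        # pending[l] holds the unpaired smooth value at level l, or None
        self.pending = []

    def push(self, x):
        """Consume one value; return the (level, detail) pairs it completes."""
        emitted = []
        level, value = 0, float(x)
        while True:
            if level == len(self.pending):
                self.pending.append(None)
            if self.pending[level] is None:
                # No partner yet at this level: store the value and stop.
                self.pending[level] = value
                return emitted
            left = self.pending[level]
            self.pending[level] = None
            detail = (left - value) / math.sqrt(2)   # wavelet coefficient
            smooth = (left + value) / math.sqrt(2)   # passed to the next level
            emitted.append((level + 1, detail))
            level, value = level + 1, smooth


if __name__ == "__main__":
    haar = StreamingHaar()
    for t, v in enumerate([4, 2, 5, 5, 1, 3, 0, 2]):
        print(t, haar.push(v))
```

Each call does a constant amount of work per completed level, and old points are never revisited, which matches the single-pass requirement stated earlier in the talk.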

Carnegie Mellon DB/IR '06C. Faloutsos#21 Overview Introduction / Related work Background Main idea Experimental results

Carnegie Mellon DB/IR '06C. Faloutsos#22 AWSOM [Figure: signal x_t over time t = its Haar wavelet coefficients W_{1,1}, W_{1,2}, W_{1,3}, W_{1,4}, W_{2,1}, W_{2,2}, W_{3,1}, V_{4,1}; axes: time, frequency]

Carnegie Mellon DB/IR '06C. Faloutsos#23 AWSOM [Figure: signal x_t over time t and its Haar wavelet coefficients W_{1,1}, W_{1,2}, W_{1,3}, W_{1,4}, W_{2,1}, W_{2,2}, W_{3,1}, V_{4,1}; axes: time, frequency]

Carnegie Mellon DB/IR '06C. Faloutsos#24 AWSOM - idea W_{l,t} ≈ β_{l,1} W_{l,t-1} + β_{l,2} W_{l,t-2} + … ; W_{l’,t’} ≈ β_{l’,1} W_{l’,t’-1} + β_{l’,2} W_{l’,t’-2} + … [Figure: each wavelet coefficient is modeled as a linear function of its predecessors at the same level]

Carnegie Mellon DB/IR '06C. Faloutsos#25 More details… Update of wavelet coefficients (incremental) Update of linear models (incremental; RLS) Feature selection (single-pass) –Not all correlations are significant –Throw away the insignificant ones (“noise”)
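
The slide names RLS (recursive least squares) for the incremental update of the linear models. The sketch below shows a textbook RLS update without a forgetting factor; the class name `RecursiveLeastSquares`, the initial scale `delta`, and the sinusoidal test series are assumptions for illustration and stand in for the wavelet coefficients of one level.

```python
import math


class RecursiveLeastSquares:
    """Fits y ~ beta . x one sample at a time, with O(k^2) work per update."""

    def __init__(self, k, delta=1e3):
        self.beta = [0.0] * k
        # P approximates the inverse of sum(x x^T); start with a large diagonal.
        self.P = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]

    def predict(self, x):
        return sum(b * v for b, v in zip(self.beta, x))

    def update(self, x, y):
        """Absorb one (x, y) sample; return the prediction error before the update."""
        k = len(self.beta)
        Px = [sum(self.P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = 1.0 + sum(x[i] * Px[i] for i in range(k))
        gain = [v / denom for v in Px]
        err = y - self.predict(x)
        self.beta = [self.beta[i] + gain[i] * err for i in range(k)]
        # P is symmetric, so x^T P equals (P x)^T.
        self.P = [[self.P[i][j] - gain[i] * Px[j] for j in range(k)]
                  for i in range(k)]
        return err


if __name__ == "__main__":
    # A pure sinusoid satisfies w_t = 2 cos(0.5) w_{t-1} - w_{t-2}.
    w = [math.sin(0.5 * t) for t in range(200)]
    rls = RecursiveLeastSquares(k=2)
    for t in range(2, len(w)):
        rls.update([w[t - 1], w[t - 2]], w[t])
    print(rls.beta)
```

With k fixed, each update costs O(k^2) time and the model needs O(k^2) space, independent of the number of points seen.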

Carnegie Mellon DB/IR '06C. Faloutsos#26 Complexity Model update –Space: O(lg N + m k^2) ≈ O(lg N) –Time: O(k^2) ≈ O(1) Where –N: number of points (so far) –k: number of regression coefficients; fixed –m: number of linear models; O(lg N)

Carnegie Mellon DB/IR '06C. Faloutsos#27 Overview Introduction / Related work Background Main idea Experimental results

Carnegie Mellon DB/IR '06C. Faloutsos#28 Results - Synthetic data Triangle pulse Mix (sine + square) AR captures wrong trend (or none) Seasonal AR estimation fails [Figure panels: AWSOM, AR, Seasonal AR]

Carnegie Mellon DB/IR '06C. Faloutsos#29 Results - Real data Automobile traffic –Daily periodicity –Bursty “noise” at smaller scales AR fails to capture any trend Seasonal AR estimation fails

Carnegie Mellon DB/IR '06C. Faloutsos#30 Results - real data Sunspot intensity –Slightly time-varying “period” AR captures wrong trend Seasonal ARIMA –wrong downward trend, despite help by human!

Carnegie Mellon DB/IR '06C. Faloutsos#31 Conclusions Adapt and handle arbitrary periodic components, and: nimble –Limited memory (logarithmic) –Constant-time update; on-line, any-time –Single pass over the data; automatic –No human intervention/tuning

Carnegie Mellon DB/IR '06C. Faloutsos#32 Outline Problem and motivation Single-sequence mining: AWSOM Co-evolving sequences: SPIRIT Lag correlations: BRAID conclusions

Carnegie Mellon DB/IR '06C. Faloutsos#33 Part 2 SPIRIT: Mining co-evolving streams [Papadimitriou, Sun, Faloutsos, VLDB05]

Carnegie Mellon DB/IR '06C. Faloutsos#34 Motivation E.g., chlorine concentration in water distribution network

Carnegie Mellon DB/IR '06C. Faloutsos#35 Motivation water distribution network, normal operation May have hundreds of measurements, but it is unlikely they are completely unrelated! [Figure: chlorine concentrations over Phase 1, Phase 2, Phase 3]

Carnegie Mellon DB/IR '06C. Faloutsos#36 Motivation water distribution network [Figure: chlorine concentrations over Phase 1, Phase 2, Phase 3; normal operation vs. major leak; sensors near leak vs. sensors away from leak]

Carnegie Mellon DB/IR '06C. Faloutsos#37 Motivation water distribution network [Figure: chlorine concentrations over Phase 1, Phase 2, Phase 3; normal operation vs. major leak; sensors near leak vs. sensors away from leak]

Carnegie Mellon DB/IR '06C. Faloutsos#38 Motivation We would like to discover a few “hidden (latent) variables” that summarize the key trends [Figure: chlorine concentrations, actual measurements (n streams) vs. k hidden variable(s); Phase 1 shown, k = 1]

Carnegie Mellon DB/IR '06C. Faloutsos#39 Motivation We would like to discover a few “hidden (latent) variables” that summarize the key trends [Figure: chlorine concentrations, actual measurements (n streams) vs. k hidden variable(s); Phase 1 and Phase 2 shown, k = 2]

Carnegie Mellon DB/IR '06C. Faloutsos#40 Motivation We would like to discover a few “hidden (latent) variables” that summarize the key trends [Figure: chlorine concentrations, actual measurements (n streams) vs. k hidden variable(s); Phase 1, Phase 2 and Phase 3 shown, k = 1]

Carnegie Mellon DB/IR '06C. Faloutsos#41 Goals Discover “hidden” (latent) variables for: –Summarization of main trends for users –Efficient forecasting, spotting outliers/anomalies and the usual: nimble: Limited memory requirements on-line, any-time: (single pass etc) automatic: No special parameters to tune

Carnegie Mellon DB/IR '06C. Faloutsos#42 Related work Stream mining Stream SVD [Guha, Gunopulos, Koudas / KDD03] StatStream [Zhu, Shasha / VLDB02] Clustering [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04], Classification [Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]

Carnegie Mellon DB/IR '06C. Faloutsos#43 Related work Stream mining Piecewise approximations [Palpanas, Vlachos, Keogh, et al / ICDE 2004] Queries on streams [Dobra, Garofalakis, Gehrke, et al / SIGMOD02], [Madden, Franklin, Hellerstein, et al / OSDI02], [Considine, Li, Kollios, et al / ICDE04], [Hammad, Aref, Elmagarmid / SSDBM03] …

Carnegie Mellon DB/IR '06C. Faloutsos#44 Overview Part 2 Method Experiments Conclusions & Other work

Carnegie Mellon DB/IR '06C. Faloutsos#45 Stream correlations Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust the number of hidden variables?

Carnegie Mellon DB/IR '06C. Faloutsos#46 1. How to capture correlations? [Figure: temperature T_1 of the first sensor over time, between 20°C and 30°C]

Carnegie Mellon DB/IR '06C. Faloutsos#47 1. How to capture correlations? [Figure: temperatures of the first sensor and the second sensor (T_2) over time, between 20°C and 30°C]

Carnegie Mellon DB/IR '06C. Faloutsos#48 1. How to capture correlations Correlations: Let’s take a closer look at the first three value-pairs… [Figure: scatter plot of temperature T_2 against temperature T_1, both axes from 20°C to 30°C]

Carnegie Mellon DB/IR '06C. Faloutsos#49 1. How to capture correlations First three lie (almost) on a line in the space of value-pairs… –O(n) numbers for the slope, and –One number for each value-pair (offset on line) offset = “hidden variable” [Figure: scatter plot of temperature T_2 against temperature T_1, 20°C to 30°C, with the points for time=1, time=2, time=3 marked]

Carnegie Mellon DB/IR '06C. Faloutsos#50 1. How to capture correlations Other pairs also follow the same pattern: they lie (approximately) on this line [Figure: scatter plot of temperature T_2 against temperature T_1, both axes from 20°C to 30°C]

Carnegie Mellon DB/IR '06C. Faloutsos#51 Stream correlations Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust the number of hidden variables?

Carnegie Mellon Incremental updates [Figure: scatter plot of temperature T_2 against temperature T_1, 20°C to 30°C, showing the error of a new point relative to the line]

Carnegie Mellon Incremental updates Algorithm runs in O(n), where n = # of streams; no need to access old data [Figure: scatter plot against temperature T_1, 20°C to 30°C, showing the error of a new point relative to the line]

Carnegie Mellon DB/IR '06C. Faloutsos#54 Stream correlations Principal Component Analysis (PCA) The “line” is the first principal component (PC) This line is optimal: it minimizes the sum of squared projection errors

Carnegie Mellon DB/IR '06C. Faloutsos#55 2. Incremental update Given number of hidden variables k (assuming k is known), we know how to update the slope. For each new point x and for i = 1, …, k:
  y_i := w_i^T x  (projection onto w_i)
  d_i ← d_i + y_i^2  (energy ∝ i-th eigenvalue)
  e_i := x – y_i w_i  (error)
  w_i ← w_i + (1/d_i) y_i e_i  (update estimate)
  x ← x – y_i w_i  (repeat with remainder)
[Figure: point x, its projection y_1 onto w_1, the error e_1, and the updated w_1]
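
The five update steps on this slide translate almost line for line into code. The sketch below is one possible reading, assuming plain Python lists for vectors, no forgetting factor, a small positive initial energy, and the updated w_i in the final "remainder" step; the function name `spirit_update` is hypothetical.

```python
def spirit_update(W, d, x):
    """Apply the per-point update for k hidden variables.

    W: list of k weight vectors (each a list of n floats), updated in place
    d: list of k energies (positive floats), updated in place
    x: new n-dimensional point
    Returns the k hidden-variable values y_1..y_k for this point.
    """
    x = list(x)
    ys = []
    for i in range(len(W)):
        w = W[i]
        y = sum(wj * xj for wj, xj in zip(w, x))        # projection onto w_i
        d[i] += y * y                                    # energy
        e = [xj - y * wj for xj, wj in zip(x, w)]        # error
        W[i] = [wj + (y / d[i]) * ej for wj, ej in zip(w, e)]  # update estimate
        x = [xj - y * wj for xj, wj in zip(x, W[i])]     # repeat with remainder
        ys.append(y)
    return ys


if __name__ == "__main__":
    # Two streams where the second is always twice the first.
    W, d = [[1.0, 0.0]], [1e-3]
    for a in [1, 2, 3] * 10:
        spirit_update(W, d, [a, 2 * a])
    print(W[0][1] / W[0][0])  # direction of w_1, expected close to 2
```

The work per point is O(nk) and no past points are stored, which is consistent with the time and space figures quoted later in the talk.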

Carnegie Mellon DB/IR '06C. Faloutsos#56 Stream correlations Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust k, the number of hidden variables?

Carnegie Mellon DB/IR '06C. Faloutsos#57 Answer When the reconstruction accuracy is too low (say, <95%) then introduce another hidden variable (k++) [How to initialize its values: tricky]
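
A rough sketch of the check described above follows. It assumes that "reconstruction accuracy" is measured as the fraction of input energy retained by the hidden variables, which the slide does not spell out; the class name `EnergyTracker` is hypothetical, and initializing the new hidden variable is left out because the slide flags it as tricky.

```python
class EnergyTracker:
    """Tracks how much of the input energy the hidden variables retain."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.total = 0.0      # running sum of ||x||^2 over all points
        self.captured = 0.0   # running sum of ||y||^2 over all points

    def observe(self, x, y):
        """x: raw measurements for one tick; y: the hidden-variable values."""
        self.total += sum(v * v for v in x)
        self.captured += sum(v * v for v in y)

    def needs_more_hidden_variables(self):
        """True when the retained fraction drops below the threshold (k++)."""
        if self.total == 0.0:
            return False
        return self.captured / self.total < self.threshold


if __name__ == "__main__":
    tracker = EnergyTracker()
    tracker.observe([3.0, 4.0], [3.0])   # only 9 of 25 energy units retained
    print(tracker.needs_more_hidden_variables())
```

The ratio is meaningful as an accuracy measure only if the weight vectors are close to orthonormal, which is a further assumption of this sketch.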

Carnegie Mellon DB/IR '06C. Faloutsos#58 Missing values [Figure: scatter plot of temperature T_2 against temperature T_1, 20°C to 30°C, showing the true values (pair), all possible value pairs (given only T_1), and the best guess (given correlations: intersection)]

Carnegie Mellon DB/IR '06C. Faloutsos#59 Forecasting Assume we want to forecast the next value for a particular stream (e.g. auto-regression) [Figure: n streams, next value marked “?”]

Carnegie Mellon DB/IR '06C. Faloutsos#60 Forecasting Option 1: One complex model per stream –Next value = function of previous values on all streams –Captures correlations –Too costly! [~O(n^3)] [Figure: n streams]

Carnegie Mellon DB/IR '06C. Faloutsos#61 Forecasting Option 1: One complex model per stream Option 2: One simple model per stream –Next value = function of previous value on same stream –Worse accuracy, but maybe acceptable –But, still need n models [Figure: n streams]

Carnegie Mellon DB/IR '06C. Faloutsos#62 Forecasting n streams, summarized by k hidden variables: k << n, and the hidden variables already capture correlations Only k simple models: efficiency & robustness

Carnegie Mellon DB/IR '06C. Faloutsos#63 Time/space requirements Incremental PCA O(nk) space (total) and time (per tuple), i.e., Independent of # points Linear w.r.t. # streams (n) Linear w.r.t. # hidden variables (k) In fact, Can be done in real time

Carnegie Mellon DB/IR '06C. Faloutsos#64 Overview Part 2 Method Experiments Conclusions & Other work

Carnegie Mellon DB/IR '06C. Faloutsos#65 Experiments Chlorine concentration [CMU Civil Engineering] 166 streams 2 hidden variables (~4% error) [Figure: measurements vs. reconstruction]

Carnegie Mellon DB/IR '06C. Faloutsos#66 Experiments Chlorine concentration hidden variables Both capture global, periodic pattern Second: ~ first, but phase-shifted Can express any phase-shift… [CMU Civil Engineering]

Carnegie Mellon DB/IR '06C. Faloutsos#67 Experiments Light measurements 54 sensors 2-4 hidden variables (~6% error) [Figure: measurement vs. reconstruction]

Carnegie Mellon DB/IR '06C. Faloutsos#68 Experiments Light measurements: hidden variables 1 & 2: main trend (as before) 3 & 4: potential anomalies and outliers (intermittent)

Carnegie Mellon DB/IR '06C. Faloutsos#69 Conclusions SPIRIT : Discovers hidden variables for –Summarization of main trends for users –Efficient forecasting, spotting outliers/anomalies Incremental, real time computation nimble: With limited memory automatic: No special parameters to tune

Carnegie Mellon DB/IR '06C. Faloutsos#70 Outline Problem and motivation Single-sequence mining: AWSOM Co-evolving sequences: SPIRIT Lag correlations: BRAID Conclusions

Carnegie Mellon DB/IR '06C. Faloutsos#71 Part 3: BRAID: Discovering Lag Correlations in Multiple Streams Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos SIGMOD’05

Carnegie Mellon DB/IR '06C. Faloutsos#72 Lag Correlations Examples –A decrease in interest rates typically precedes an increase in house sales by a few months –Higher amounts of fluoride in the drinking water lead to fewer dental cavities, some years later

Carnegie Mellon DB/IR '06C. Faloutsos#73 Lag Correlations Example of lag-correlated sequences These sequences are correlated with lag l=1300 time-ticks CCF (Cross-Correlation Function)

Carnegie Mellon DB/IR '06C. Faloutsos#74 Lag Correlations Example of lag-correlated sequences CCF (Cross-Correlation Function): how to compute it quickly, cheaply, incrementally?
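
For reference, the baseline that the question above is measured against can be sketched as a direct computation of the CCF. This is a naive illustration, not BRAID: it recomputes a Pearson correlation from scratch for every lag, and it assumes the convention that x at time t is paired with y at time t - lag. The function names are hypothetical.

```python
import math


def lag_correlation(x, y, lag):
    """Pearson correlation between x[t] and y[t - lag] over the overlap."""
    xs, ys = x[lag:], y[:len(y) - lag]
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def ccf(x, y, max_lag):
    """Cross-correlation function for lags 0..max_lag (naive, O(n) per lag)."""
    return [lag_correlation(x, y, lag) for lag in range(max_lag + 1)]


if __name__ == "__main__":
    y = [math.sin(0.3 * t) for t in range(200)]
    x = [0.0] * 5 + y[:-5]          # x is y delayed by 5 ticks
    r = ccf(x, y, 15)
    print(max(range(len(r)), key=lambda lag: r[lag]))
```

Every new point forces a full recomputation here, which is the cost the rest of this part of the talk sets out to avoid.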

Carnegie Mellon DB/IR '06C. Faloutsos#75 Challenging Problems Problem definitions –Given two co-evolving sequences X and Y, determine: whether there is a lag correlation; if yes, what the lag length l is –Given k numerical sequences X_1, …, X_k, report: which pairs have a lag correlation; the corresponding lag for each pair

Carnegie Mellon DB/IR '06C. Faloutsos#76 Our solution Ideal characteristics: –‘Any-time’ processing, and fast: computation time per time tick is constant –Nimble: memory space requirement is sub-linear in the sequence length –Accurate: approximation introduces small error

Carnegie Mellon DB/IR '06C. Faloutsos#77 Related Work Sequence indexing –Agrawal et al. (FODO 1993) –Faloutsos et al. (SIGMOD 1994) –Keogh et al. (SIGMOD 2001) Compression (wavelet and random projections) –Gilbert et al. (VLDB 2001), Guha et al. (VLDB 2004) –Dobra et al. (SIGMOD 2002), Ganguly et al. (SIGMOD 2003) Data Stream Management –Abadi et al. (VLDB Journal 2003) –Motwani et al. (CIDR 2003) –Chandrasekaran et al. (CIDR 2003) –Cranor et al. (SIGMOD 2003)

Carnegie Mellon DB/IR '06C. Faloutsos#78 Related Work Pattern discovery –Clustering for data streams: Guha et al. (TKDE 2003) –Monitoring multiple streams: Zhu et al. (VLDB 2002) –Forecasting: Yi et al. (ICDE 2000), Papadimitriou et al. (VLDB 2003) None of the previously published methods focuses on this problem

Carnegie Mellon DB/IR '06C. Faloutsos#79 Overview Introduction / Related work Background Main ideas Theoretical analysis Experimental results

Carnegie Mellon DB/IR '06C. Faloutsos#80 Main Idea (1) Incremental computation –Sufficient statistics: sum of X; square sum of X; inner-product for X and the shifted Y –Compute R(l) incrementally, from the covariance of X and Y and the variance of X [formulas not preserved in this transcript]
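
Since the slide's formulas did not survive, the sketch below shows one standard way to obtain a correlation coefficient from running sums, for a single fixed lag. It is an illustration under stated assumptions rather than the BRAID definition: the class name `LagCorrelator` is hypothetical, x at time t is paired with y at time t - lag, and the sketch keeps a buffer of the last `lag` values of y, which the smoothing idea on the following slides is meant to avoid.

```python
import math
from collections import deque


class LagCorrelator:
    """Maintains sufficient statistics for the correlation at one fixed lag."""

    def __init__(self, lag):
        self.lag = lag
        self.y_buffer = deque()   # the most recent `lag` values of y
        self.m = 0                # number of (x_t, y_{t-lag}) pairs so far
        self.sx = self.sy = 0.0   # sums
        self.sxx = self.syy = 0.0 # square sums
        self.sxy = 0.0            # inner product of x and the shifted y

    def push(self, x, y):
        """Consume one time tick in O(1) time."""
        self.y_buffer.append(y)
        if len(self.y_buffer) <= self.lag:
            return                # no partner for x yet
        y_old = self.y_buffer.popleft()   # this is y_{t-lag}
        self.m += 1
        self.sx += x
        self.sy += y_old
        self.sxx += x * x
        self.syy += y_old * y_old
        self.sxy += x * y_old

    def correlation(self):
        if self.m < 2:
            return 0.0
        cov = self.sxy - self.sx * self.sy / self.m
        var_x = self.sxx - self.sx ** 2 / self.m
        var_y = self.syy - self.sy ** 2 / self.m
        if var_x <= 0.0 or var_y <= 0.0:
            return 0.0
        return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    y = [math.sin(0.3 * t) for t in range(200)]
    x = [0.0] * 5 + y[:-5]
    tracker = LagCorrelator(lag=5)
    for a, b in zip(x, y):
        tracker.push(a, b)
    print(tracker.correlation())
```

The update per tick is constant time; only the answer for the tracked lag is available at any moment, so several trackers would be needed to cover several lags.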

Carnegie Mellon DB/IR '06C. Faloutsos#81 Main Idea (2) Lag Correlation Sequence smoothing t=n Time

Carnegie Mellon DB/IR '06C. Faloutsos#82 Main Idea (2) Lag Correlation Level h=0 t=n Time Sequence smoothing –Means of windows for each level –Sufficient statistics computed from the means –CCF computed from the sufficient statistics –But, it allows a partial redundancy

Carnegie Mellon DB/IR '06C. Faloutsos#83 Main Idea (3) Lag Correlation Level h=0 t=n Time Geometric lag probing

Carnegie Mellon DB/IR '06C. Faloutsos#84 Main Idea (3) Geometric lag probing –Use colored windows –Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …} –Use a cubic spline to interpolate [Figure: lag correlation by level, from level h=0, over time up to t=n]
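
The probing-plus-interpolation idea can be sketched as below. The slide only says "cubic spline"; the natural boundary condition (zero second derivative at both ends), the function names, and the sample correlation values are assumptions made for this illustration.

```python
import bisect


def geometric_lags(max_lag):
    """Lags 0, 1, 2, 4, 8, ... up to max_lag."""
    lags, step = [0], 1
    while step <= max_lag:
        lags.append(step)
        step *= 2
    return lags


def natural_cubic_spline(xs, ys):
    """Return a function interpolating (xs, ys) with a natural cubic spline."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    M = [0.0] * (n + 1)              # second derivatives at the knots
    if n >= 2:
        # Tridiagonal system for the interior second derivatives M_1..M_{n-1}.
        diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
        rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
               for i in range(1, n)]
        for j in range(1, n - 1):    # forward elimination
            factor = h[j] / diag[j - 1]
            diag[j] -= factor * h[j]
            rhs[j] -= factor * rhs[j - 1]
        sol = [0.0] * (n - 1)
        sol[-1] = rhs[-1] / diag[-1]
        for j in range(n - 3, -1, -1):   # back substitution
            sol[j] = (rhs[j] - h[j + 1] * sol[j + 1]) / diag[j]
        M[1:n] = sol

    def evaluate(x):
        i = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 1)
        a, b = xs[i + 1] - x, x - xs[i]
        return ((M[i] * a ** 3 + M[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - M[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - M[i + 1] * h[i] / 6.0) * b)

    return evaluate


if __name__ == "__main__":
    lags = geometric_lags(16)                      # [0, 1, 2, 4, 8, 16]
    probed = [0.10, 0.50, 0.90, 0.30, -0.20, 0.00]  # made-up correlations
    estimate = natural_cubic_spline(lags, probed)
    print([round(estimate(lag), 3) for lag in range(17)])
```

Only about log2(max_lag) correlations need to be maintained; the values at the remaining lags are estimates read off the spline.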

Carnegie Mellon DB/IR '06C. Faloutsos#85 Overview Introduction / Related work Background Main ideas Theoretical analysis Experimental results

Carnegie Mellon DB/IR '06C. Faloutsos#86 Experimental results Setup –Intel Xeon 2.8GHz, 1GB memory, Linux –Datasets: Sines, SpikeTrains, Humidity, Light, Temperature, Kursk, Sunspots –Enhanced BRAID, b=16 Evaluation –Estimation error of lag correlations –Computation time

Carnegie Mellon DB/IR '06C. Faloutsos#87 Detecting Lag Correlations (2) SpikeTrains CCF (Cross-Correlation Function) BRAID closely estimates the correlation coefficients

Carnegie Mellon DB/IR '06C. Faloutsos#88 Detecting Lag Correlations (3) Humidity CCF (Cross-Correlation Function) BRAID closely estimates the correlation coefficients

Carnegie Mellon DB/IR '06C. Faloutsos#89 Detecting Lag Correlations (4) Light CCF (Cross-Correlation Function) BRAID closely estimates the correlation coefficients

Carnegie Mellon DB/IR '06C. Faloutsos#90 Detecting Lag Correlations (5) Kursk CCF (Cross-Correlation Function) BRAID closely estimates the correlation coefficients

Carnegie Mellon DB/IR '06C. Faloutsos#91 Estimation Error Largest relative error is about 1% [Figure: estimation error (%) of lag correlation, BRAID vs. Naive, on datasets Sines, SpikeTrains, Humidity, Light, Kursk, Sunspots]

Carnegie Mellon DB/IR '06C. Faloutsos#92 Performance Almost linear w.r.t. sequence length Up to 40,000 times faster

Carnegie Mellon DB/IR '06C. Faloutsos#93 Group Lag Correlations Two correlated pairs from 55 Temperature sequences Each sensor is located in a different place Estimation of CCF of #16 and #19 Estimation of CCF of #47 and #48 [Figure: sequences #16, #19, #47, #48]

Carnegie Mellon DB/IR '06C. Faloutsos#94 Conclusions Automatic lag correlation detection on stream data incremental – online, ‘any-time’ nimble –O(log n) space, O(1) time to update the statistics –Up to 40,000 times faster than the naive implementation Accurate –Detecting the correct lag within 1% relative error or less

Carnegie Mellon DB/IR '06C. Faloutsos#95 Overall Conclusions Mining streaming numerical data: challenging! Extensions: streaming matrix data (e.g., network traffic matrix) [Figure: axes IP-source, IP-destination, time]

Carnegie Mellon DB/IR '06C. Faloutsos#96 Thank you christos cs.cmu.edu [InteMon demo]