Streaming Pattern Discovery in Multiple Time-Series Spiros Papadimitriou Jimeng Sun Christos Faloutsos Carnegie Mellon University VLDB 2005, Trondheim, Norway

2 Motivation
Several settings where many deployed sensors measure some quantity, e.g.:
– Traffic in a network
– Temperatures in a large building
– Chlorine concentration in a water distribution network
Values are typically correlated
Would be very useful if we could summarize them on the fly

3 Motivation
May have hundreds of measurements, but it is unlikely they are completely unrelated!
[Figure: chlorine concentrations in a water distribution network under normal operation; Phases 1–3, sensors near the leak vs. sensors away from the leak]

4 Motivation
May have hundreds of measurements, but it is unlikely they are completely unrelated!
[Figure: chlorine concentrations under normal operation vs. a major leak; Phases 1–3, sensors near the leak vs. sensors away from the leak]

5 Motivation
We would like to discover a few “hidden (latent) variables” that summarize the key trends
[Figure: actual measurements (n streams) vs. k hidden variable(s); chlorine concentrations, Phase 1, k = 1]

6 Motivation
We would like to discover a few “hidden (latent) variables” that summarize the key trends
[Figure: actual measurements (n streams) vs. k hidden variable(s); chlorine concentrations, Phases 1–2, k = 2]

7 Motivation
We would like to discover a few “hidden (latent) variables” that summarize the key trends
[Figure: actual measurements (n streams) vs. k hidden variable(s); chlorine concentrations, Phases 1–3, k = 1]

8 Goals
Discover “hidden” (latent) variables for:
– Summarization of main trends for users
– Efficient forecasting, spotting outliers/anomalies
Incremental, real-time computation
Limited memory requirements

9 Related work
Stream mining:
– Stream SVD [Guha, Gunopulos, Koudas / KDD03]
– StatStream [Zhu, Shasha / VLDB02]
– Clustering [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al. / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04]
– Classification [Wang, Fan, et al. / KDD03], [Hulten, Spencer, Domingos / KDD01]
– Piecewise approximations [Palpanas, Vlachos, Keogh, et al. / ICDE04]
– Queries on streams [Dobra, Garofalakis, Gehrke, et al. / SIGMOD02], [Madden, Franklin, Hellerstein, et al. / OSDI02], [Considine, Li, Kollios, et al. / ICDE04], [Hammad, Aref, Elmagarmid / SSDBM03]
…

10 Overview
Method outline
Experiments

11 Stream correlations
Step 1: How to capture correlations?
Step 2: How to do it incrementally, when we have a very large number of points?
Step 3: How to dynamically adjust the number of hidden variables?

12 1. How to capture correlations?
First sensor
[Figure: temperature T1 over time, ranging 20°C–30°C]

13 1. How to capture correlations?
First sensor, second sensor
[Figure: temperatures T1 and T2 over time, ranging 20°C–30°C]

14 1. How to capture correlations?
Correlations: let’s take a closer look at the first three value-pairs…
[Figure: scatter plot of value-pairs, temperature T2 vs. temperature T1, both 20°C–30°C]

15 1. How to capture correlations?
The first three value-pairs (time = 1, 2, 3) lie (almost) on a line in the space of value-pairs, so we need:
– O(n) numbers for the slope, and
– One number for each value-pair (offset on the line)
offset = “hidden variable”
[Figure: T2 vs. T1 scatter plot with the fitted line]

16 1. How to capture correlations?
Other pairs also follow the same pattern: they lie (approximately) on this line
[Figure: T2 vs. T1 scatter plot; all value-pairs near the line]

17 Stream correlations
Step 1: How to capture correlations?
Step 2: How to do it incrementally, when we have a very large number of points?
Step 3: How to dynamically adjust the number of hidden variables?

18 2. Incremental update
For each new point:
– Project onto current line
– Estimate error
[Figure: new value-pair in the T2 vs. T1 plot, with its projection onto the line and the error]

19 2. Incremental update
For each new point:
– Project onto current line
– Estimate error
– Rotate the line in the direction of the error and in proportion to its magnitude ⇒ O(n) time
[Figure: the line being rotated toward the new value]

20 2. Incremental update
For each new point:
– Project onto current line
– Estimate error
– Rotate the line in the direction of the error and in proportion to its magnitude
[Figure: the updated line after the rotation]

21 Stream correlations
Principal Component Analysis (PCA): the “line” is the first principal component (PC) vector
This line is optimal: it minimizes the sum of squared projection errors
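To make the optimality claim concrete, here is a small batch-PCA check (a sketch, not part of the talk; the synthetic data, the variable names, and the use of NumPy’s SVD are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated "temperature" streams: value-pairs near a line in 2-D.
t1 = 20 + 10 * rng.random(200)
t2 = t1 + rng.normal(scale=0.2, size=200)
X = np.column_stack([t1, t2])
Xc = X - X.mean(axis=0)                  # center, as batch PCA assumes

# First principal component via SVD: direction of maximum variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
w = Vt[0]

def projection_error(direction):
    """Sum of squared distances from the points to the line through the mean."""
    proj = Xc @ direction
    return np.sum((Xc - np.outer(proj, direction)) ** 2)

# The PC direction beats any other unit direction, e.g. the coordinate axes.
print(projection_error(w) <= projection_error(np.array([1.0, 0.0])))  # → True
```

This is the batch (offline) view; the slides that follow replace it with an incremental update so the line never has to be recomputed from scratch.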

22 2. Incremental update
Given the number of hidden variables k (assuming k is known), we know how to update the slope (detailed equations in the paper).
For each new point x and for i = 1, …, k:
– y_i := w_i^T x (projection onto w_i)
– d_i ← d_i + y_i² (energy ∝ i-th eigenvalue)
– e_i := x − y_i w_i (error)
– w_i ← w_i + (1/d_i) y_i e_i (update estimate)
– x ← x − y_i w_i (repeat with the remainder)
[Figure: w_1 being updated toward the new point x, with projection y_1 and error e_1]
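These update equations can be sketched directly in code. This is a minimal, hedged Python version: the function name, state layout, and test data are ours, and any forgetting factor from the paper’s detailed equations is omitted.

```python
import numpy as np

def spirit_update(W, d, x):
    """One pass of the slide's update equations over a new n-dimensional point x.

    W: list of k direction vectors w_i; d: list of k accumulated energies d_i.
    Returns the hidden variables y_1..y_k along with the updated state.
    """
    x = np.asarray(x, dtype=float).copy()
    y = np.empty(len(W))
    for i in range(len(W)):
        y[i] = W[i] @ x                # y_i := w_i^T x      (projection onto w_i)
        d[i] += y[i] ** 2              # d_i <- d_i + y_i^2  (energy ~ i-th eigenvalue)
        e = x - y[i] * W[i]            # e_i := x - y_i w_i  (error)
        W[i] += (y[i] / d[i]) * e      # w_i <- w_i + (1/d_i) y_i e_i
        x -= y[i] * W[i]               # repeat with the remainder
    return y, W, d

# Feed 1000 points that all lie along the direction (1, 1, -1): the single
# tracked direction should converge to the unit vector along that line.
rng = np.random.default_rng(1)
W, d = [np.array([1.0, 0.0, 0.0])], [1e-3]
for _ in range(1000):
    s = rng.normal()
    y, W, d = spirit_update(W, d, np.array([s, s, -s]))
print(W[0])  # ≈ ±(1, 1, -1)/sqrt(3)
```

Note the O(nk) cost per point: one pass over the k directions, each an O(n) vector update, with no stored history.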

23 Stream correlations
Step 1: How to capture correlations?
Step 2: How to do it incrementally, when we have a very large number of points?
Step 3: How to dynamically adjust k, the number of hidden variables?

24 3. Number of hidden variables
If we had three sensors with similar measurements, the points would again lie on a line (i.e., one hidden variable, k = 1), but in 3-D space
[Figure: value-tuple space (T1, T2, T3) with points on a line]

25 3. Number of hidden variables
Assume one sensor intermittently gets stuck: now, no line can give a good approximation
[Figure: value-tuple space (T1, T2, T3) with points straying off the line]

26 3. Number of hidden variables
Assume one sensor intermittently gets stuck: now, no line can give a good approximation, but a plane will do (two hidden variables, k = 2)
[Figure: value-tuple space (T1, T2, T3) with points on a plane]

27 3. Number of hidden variables (PCs)
Keep track of the energy maintained by the approximation with k variables (PCs):
– Reconstruction accuracy, w.r.t. total squared error
Increment (or decrement) k if the fraction of energy maintained goes below (or above) a threshold:
– If below 95%, k ← k + 1
– If above 98%, k ← k − 1
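The thresholding rule can be written as a small helper. Only the 95%/98% thresholds come from the slide; the function and parameter names are our illustrative choices.

```python
def adjust_k(k, energy_kept, energy_total, f_low=0.95, f_high=0.98, k_max=None):
    """Grow k if the k-variable approximation keeps too little of the total
    energy, shrink it if it keeps more than necessary."""
    ratio = energy_kept / energy_total if energy_total > 0 else 1.0
    if ratio < f_low and (k_max is None or k < k_max):
        return k + 1          # below the low threshold: add a hidden variable
    if ratio > f_high and k > 1:
        return k - 1          # above the high threshold: drop one
    return k

assert adjust_k(2, 90.0, 100.0) == 3   # 90% kept  -> increment
assert adjust_k(2, 99.0, 100.0) == 1   # 99% kept  -> decrement
assert adjust_k(2, 96.0, 100.0) == 2   # in band   -> unchanged
```

The two thresholds form a hysteresis band, so k does not oscillate when the energy ratio sits near a single cutoff.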

28 Missing values
[Figure: T2 vs. T1 plot showing the true value-pair, the set of all possible value-pairs given only t1 (a vertical line), and the best guess given the correlations: the intersection with the correlation line]
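The geometric picture on this slide amounts to intersecting the known-coordinate line with the correlation line. A toy sketch, where the mean `mu`, direction `w`, and function name are our illustrative assumptions:

```python
import numpy as np

def estimate_missing_t2(t1, mu, w):
    """Intersect {T1 = t1} with the correlation line mu + s*w; return the
    implied T2 (the "best guess" given the correlations)."""
    s = (t1 - mu[0]) / w[0]       # how far along the line the known T1 puts us
    return mu[1] + s * w[1]

mu = np.array([25.0, 25.0])                 # mean value-pair
w = np.array([1.0, 1.0]) / np.sqrt(2.0)     # correlation line: T2 tracks T1
print(estimate_missing_t2(27.0, mu, w))     # ≈ 27.0
```

With k > 1 hidden variables the same idea generalizes: project the known coordinates onto the PC subspace and read off the missing ones from the reconstruction.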

29 Forecasting
Assume we want to forecast the next value for a particular stream (e.g., by auto-regression)
[Figure: n streams, with the next value of one stream marked “?”]

30 Forecasting
Option 1: One complex model per stream
– Next value = function of previous values on all streams
– Captures correlations
– Too costly! [~O(n³)]

31 Forecasting
Option 1: One complex model per stream
Option 2: One simple model per stream
– Next value = function of the previous value on the same stream
– Worse accuracy, but maybe acceptable
– But we still need n models

32 Forecasting
Instead, forecast the k hidden variables: k << n, and they already capture the correlations
⇒ Only k simple models: efficiency & robustness
[Figure: n streams reduced to k hidden variables]
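As a sketch of this idea, one can fit one simple auto-regressive model per hidden variable and map the one-step forecasts back through the PC directions. The least-squares AR fit, the AR order, and all names here are our illustrative choices, not the talk’s exact method.

```python
import numpy as np

def fit_ar(series, p=2):
    """Least-squares AR(p): y[t] ~ a1*y[t-1] + ... + ap*y[t-p]."""
    Y = series[p:]
    X = np.column_stack(
        [series[p - j - 1 : len(series) - j - 1] for j in range(p)]
    )
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return coef

def forecast_streams(hidden_histories, W):
    """Forecast each hidden variable one step ahead (assumes p = 2),
    then map back to the n streams via the PC directions W (shape k x n)."""
    y_next = np.array([fit_ar(h) @ h[-1:-3:-1] for h in hidden_histories])
    return W.T @ y_next            # reconstruct next values of all n streams

# One smooth hidden variable driving two streams (k = 1, n = 2):
t = np.arange(200)
h = np.sin(2 * np.pi * t / 50.0)          # AR(2) captures a sinusoid exactly
W = np.array([[0.6, 0.8]])                # hypothetical PC direction
pred = forecast_streams([h], W)           # forecasts for both streams
```

Only k tiny models are fit, instead of n coupled ones, which is where the efficiency and robustness on this slide come from.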

33 Time/space requirements
Incremental PCA needs O(nk) space (total) and time (per tuple), i.e.:
– Independent of the number of points (t)
– Linear w.r.t. the number of streams (n)
– Linear w.r.t. the number of hidden variables (k)
In fact, it can be done in real time [demo]

34 Overview
Method outline
Experiments

35 Experiments: Chlorine concentration [CMU Civil Engineering]
166 streams, 2 hidden variables (~4% error)
[Figure: measurements vs. reconstruction]

36 Experiments: Chlorine concentration [CMU Civil Engineering]
Both hidden variables capture the global, periodic pattern
The second is similar to the first, but “phase-shifted”; together they can express any “phase shift”
[Figure: the two hidden variables]

37 Experiments: Light measurements
54 sensors, 2–4 hidden variables (~6% error)
[Figure: measurement vs. reconstruction]

38 Experiments: Light measurements
Hidden variables 1 & 2: main trend (as before)
Hidden variables 3 & 4: intermittent; potential anomalies and outliers
[Figure: the four hidden variables]

39 Experiments: Missing values [CMU ECE]
Correlations are already captured by the hidden variables and provide information about missing values
– Quickly back on track, if mis-estimated
[Figure: sensor 7 reconstructed from everything else, via the hidden variables]

40 Experiments: Missing values [CMU ECE]
Correlations are already captured by the hidden variables and provide information about missing values
– Quickly back on track, if mis-estimated
[Figure: sensor 8 reconstructed from everything else, via the hidden variables]

41 Wall-clock times
Constant time per tuple and per stream
[Figure: time (sec) vs. stream size (time ticks t), vs. number of streams (n), and vs. number of PCs (k)]

42 Conclusion
Many settings with hundreds of streams, but:
– Stream values are, by nature, related
– In reality, there are only a few variables
Discover hidden variables for:
– Summarization of main trends for users
– Efficient forecasting, spotting outliers/anomalies
Incremental, real-time computation with limited memory

43 End Thank you