Streaming Pattern Discovery in Multiple Time-Series. Jimeng Sun, Spiros Papadimitriou, Christos Faloutsos. Parallel Data Laboratory, Carnegie Mellon University.

© January 16http:// Motivation: Co-evolving time series (data streams) appear in many different applications, e.g. disk access traffic in network clusters, Internet flow traffic in a network, temperatures in a large building, and chlorine concentration in a water distribution network. Values are typically correlated. It would be very useful if we could summarize them on the fly.

Example: water distribution network. [Figure: chlorine concentrations over time across Phase 1, Phase 2, and Phase 3; labels: normal operation, sensors near leak, sensors away from leak.]

Goals: Discover “hidden” (latent) variables for summarization of main trends for users and for efficient forecasting and spotting of outliers/anomalies, with incremental, real-time computation and limited memory requirements.

Example: chlorine measurements in a water distribution network. [Figure: chlorine concentrations across Phase 1, Phase 2, and Phase 3; labels: normal operation, major leak, sensors near leak, sensors away from leak.]

Example: hidden variable, Phase 1 (k = 1). We would like to discover a few “hidden (latent) variables” that summarize the key trends. [Figure: actual measurements of chlorine concentrations (n streams) alongside the k hidden variable(s), Phase 1.]

Example: hidden variable tracking, Phase 1 and Phase 2 (k = 2). We would like to discover a few “hidden (latent) variables” that summarize the key trends. [Figure: actual measurements of chlorine concentrations (n streams) alongside the k hidden variable(s), Phase 1 and Phase 2.]

Example: hidden variable tracking, Phase 1 through Phase 3 (k = 1). We would like to discover a few “hidden (latent) variables” that summarize the key trends. [Figure: actual measurements of chlorine concentrations (n streams) alongside the k hidden variable(s), Phase 1, Phase 2, and Phase 3.]

Method outline. Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust the number of hidden variables?

1. How to capture correlations? [Figure: Temperature T1 of the first sensor over time; axis marked at 20 °C and 30 °C.]

1. How to capture correlations? [Figure: Temperature T2 of the second sensor over time, shown with the first sensor; axis marked at 20 °C and 30 °C.]

1. How to capture correlations? Correlations: let’s take a closer look at the first three value-pairs. [Figure: plot of Temperature T1 against Temperature T2; both axes marked at 20 °C and 30 °C.]

1. How to capture correlations? The first three value-pairs lie (almost) on a line in the space of value-pairs ⇒ O(n) numbers for the slope, and ⇒ one number for each value-pair (its offset on the line). offset = “hidden variable”. [Figure: plot of Temperature T1 against Temperature T2, both axes marked at 20 °C and 30 °C; points labeled time=1, time=2, time=3.]

1. How to capture correlations? Other pairs also follow the same pattern: they lie (approximately) on this line. [Figure: plot of Temperature T1 against Temperature T2; both axes marked at 20 °C and 30 °C.]
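The idea on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the direction `w` is fixed by hand rather than learned, the temperature readings are made up, and the helper names (`unit`, `hidden_variable`, `reconstruct`, `residual`) are hypothetical. Each value-pair is reduced to one number, its offset along the line, and the pair is then approximately recovered from that number.

```python
import math

def unit(v):
    """Scale v to unit length so that a dot product with it is an offset along the line."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def hidden_variable(w, x):
    """Offset of the point x along the unit direction w (the 'hidden variable')."""
    return sum(wi * xi for wi, xi in zip(w, x))

def reconstruct(w, y):
    """Point on the line through the origin with direction w at offset y."""
    return [y * wi for wi in w]

def residual(w, x):
    """Distance between x and its reconstruction from the hidden variable."""
    x_hat = reconstruct(w, hidden_variable(w, x))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))

# Assumed direction: both sensors move together, so the line is T1 = T2.
w = unit([1.0, 1.0])

# Made-up (T1, T2) readings in degrees Celsius, for illustration only.
readings = [(20.0, 20.5), (25.0, 24.6), (30.0, 30.2)]
for x in readings:
    y = hidden_variable(w, x)
    print(x, "hidden variable:", round(y, 3), "residual:", round(residual(w, x), 3))
```

With n sensors instead of two, `w` holds n numbers and each time tick still contributes a single offset, which is the storage argument made on the slide.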

Method outline. Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust the number of hidden variables?

Experiments: chlorine concentration. 166 streams, 2 hidden variables (~4% error). [Figure: measurements from a sensor and their reconstruction from the hidden variables. Data: CMU Civil Engineering.]

Experiments: chlorine concentration, hidden variables [CMU Civil Engineering]. Both capture the global, periodic pattern. The second is similar to the first, but “phase-shifted”. Together they can express any “phase-shift”.
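One way to read the “any phase-shift” remark is through a standard trigonometric identity. As an idealization that the slide does not state, assume the two hidden variables behave like $\sin(\omega t)$ and $\cos(\omega t)$, i.e. the same periodic pattern a quarter period apart; the symbols $a$, $b$, $R$, $\varphi$ and $\omega$ are introduced here for illustration only.

```latex
\[
a \sin(\omega t) + b \cos(\omega t) \;=\; R \sin(\omega t + \varphi),
\qquad R = \sqrt{a^2 + b^2}, \quad a = R\cos\varphi, \quad b = R\sin\varphi .
\]
```

Under that assumption, a stream that follows the periodic pattern with its own delay $\varphi$ and amplitude $R$ is a weighted sum of the two hidden variables, with weights $a$ and $b$.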

Conclusion: There are many settings with hundreds of streams, but stream values are, by nature, related. We proposed a method to discover hidden variables as a summarization of main trends for users; it requires only incremental computation, without buffering any past data. Future work: apply it to more applications, e.g. performance monitoring for storage systems and network systems.

Related work. Stream SVD: [Guha, Gunopulos, Koudas / KDD03]. StatStream: [Zhu, Shasha / VLDB02]. Clustering: [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al. / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04]. Classification: [Wang, Fan, et al. / KDD03], [Hulten, Spencer, Domingos / KDD01]. Piecewise approximations: [Palpanas, Vlachos, Keogh, et al. / ICDE 2004].

Experiments: light measurements. 54 sensors, 2-4 hidden variables (~6% error). [Figure: measurement and reconstruction.]

Experiments: light measurements, hidden variables. 1 & 2: main trend (as before). 3 & 4: potential anomalies and outliers. [Figure: hidden variables; label: intermittent.]

Stream correlations. Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust the number of hidden variables?

2. Incremental update. For each new point: project onto the current line; estimate the error. [Figure: plot of Temperature T1 against Temperature T2, both axes marked at 20 °C and 30 °C; labels: new value, error.]

2. Incremental update. For each new point: project onto the current line; estimate the error; rotate the line in the direction of the error and in proportion to its magnitude ⇒ O(n) time. [Figure: plot of Temperature T1 against Temperature T2, both axes marked at 20 °C and 30 °C; labels: new value, error.]

2. Incremental update. For each new point: project onto the current line; estimate the error; rotate the line in the direction of the error and in proportion to its magnitude. [Figure: plot of Temperature T1 against Temperature T2; both axes marked at 20 °C and 30 °C.]

Stream correlations: Principal Component Analysis (PCA). The “line” is the first principal component (PC) vector. This line is optimal: it minimizes the sum of squared projection errors.
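The optimality claim can be written out as follows. This is a sketch under assumptions: $x_t$ denotes the value-tuple observed at time $t$, $w$ is a unit-length direction, and the line is taken to pass through the origin (the slide does not say whether the data are centered first).

```latex
\[
w_1 \;=\; \arg\min_{\|w\| = 1} \sum_{t} \bigl\| x_t - (w^T x_t)\, w \bigr\|^2
    \;=\; \arg\max_{\|w\| = 1} \sum_{t} (w^T x_t)^2 ,
\]
\[
\text{because}\quad \bigl\| x_t - (w^T x_t)\, w \bigr\|^2 \;=\; \|x_t\|^2 - (w^T x_t)^2
\quad \text{whenever } \|w\| = 1 .
\]
```

Minimizing the squared projection error is therefore the same as maximizing the energy $\sum_t (w^T x_t)^2$ captured along the line.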

2. Incremental update, given the number of hidden variables k. Assuming k is known, we know how to update the slope (detailed equations in paper). For each new point $x$ and for $i = 1, \ldots, k$:
$y_i := w_i^T x$ (projection onto $w_i$)
$d_i \leftarrow d_i + y_i^2$ (energy $\propto$ $i$-th eigenvalue)
$e_i := x - y_i w_i$ (error)
$w_i \leftarrow w_i + \frac{1}{d_i} y_i e_i$ (update estimate)
$x \leftarrow x - y_i w_i$ (repeat with remainder)
[Figure: the point $x$, its projection $y_1$ onto $w_1$, the error $e_1$, and the updated $w_1$.]
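The five update steps on this slide can be transcribed almost literally into Python. The sketch below is not the authors' implementation: the function names `dot` and `update`, the starting directions, the small positive starting value of each `d[i]`, and the synthetic stream are all assumptions made for illustration, and the slide refers to the paper for the detailed equations.

```python
import math

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))

def update(W, d, x):
    """Apply the slide's update for one new point x.

    W is a list of k direction estimates (each a list of n floats) and d is a
    list of k energy estimates; both are modified in place.
    Returns the k hidden-variable values for this point.
    """
    x = list(x)  # work on a copy: x is reduced to its remainder step by step
    ys = []
    for i in range(len(W)):
        w = W[i]
        y = dot(w, x)                                        # project onto w_i
        d[i] += y * y                                        # accumulate energy
        e = [xj - y * wj for xj, wj in zip(x, w)]            # projection error
        w = [wj + (y / d[i]) * ej for wj, ej in zip(w, e)]   # rotate toward the error
        W[i] = w
        x = [xj - y * wj for xj, wj in zip(x, w)]            # remainder for the next i
        ys.append(y)
    return ys

# Synthetic stream (assumption): two sensors that always stay in ratio 3:4.
W = [[1.0, 0.0]]   # assumed starting direction
d = [0.01]         # assumed small positive starting energy, avoids division by zero
for t in range(500):
    s = 1.0 + 0.5 * math.sin(0.1 * t)
    update(W, d, [0.6 * s, 0.8 * s])
print("estimated direction:", W[0])
```

Each call touches every direction once and does a constant number of passes over the n coordinates, and no past points are stored, which matches the incremental, limited-memory goals stated earlier in the deck.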

Stream correlations. Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust k, the number of hidden variables?

3. Number of hidden variables. If we had three sensors with similar measurements, again the points would lie on a line (i.e., one hidden variable, k = 1), but in 3-D space. [Figure: value-tuple space with axes T1, T2, T3.]

3. Number of hidden variables. Assume one sensor intermittently gets stuck. Now, no line can give a good approximation. [Figure: value-tuple space with axes T1, T2, T3.]

3. Number of hidden variables. Assume one sensor intermittently gets stuck. Now, no line can give a good approximation, but a plane will do (two hidden variables, k = 2). [Figure: value-tuple space with axes T1, T2, T3.]

Number of hidden variables (PCs). Keep track of the energy maintained by the approximation with k variables (PCs), i.e. the reconstruction accuracy w.r.t. total squared error. Increment (or decrement) k if the fraction of energy maintained goes below (or above) a threshold: if below 95%, k ← k + 1; if above 98%, k ← k − 1.
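The threshold rule can be sketched as a small Python function. The function name `adjust_k`, its parameters, and the clamping of k to the range 1..n are assumptions added for illustration; only the 95% and 98% thresholds and the direction of each adjustment come from the slide. How the two energy totals are accumulated is left to the caller.

```python
def adjust_k(k, energy_total, energy_kept, n, low=0.95, high=0.98):
    """Return the new number of hidden variables.

    energy_total: total squared magnitude of the stream values seen so far
    energy_kept:  squared magnitude retained by the k hidden variables
    n:            number of streams, used as an upper bound on k (assumption)
    """
    if energy_total <= 0.0:
        return k  # nothing observed yet, leave k unchanged
    fraction = energy_kept / energy_total
    if fraction < low and k < n:
        return k + 1  # approximation too coarse: add a hidden variable
    if fraction > high and k > 1:
        return k - 1  # more hidden variables than needed: drop one
    return k

# Hypothetical energy totals, for illustration only.
print(adjust_k(k=2, energy_total=100.0, energy_kept=90.0, n=5))
print(adjust_k(k=2, energy_total=100.0, energy_kept=96.0, n=5))
print(adjust_k(k=2, energy_total=100.0, energy_kept=99.0, n=5))
```

The gap between the two thresholds leaves a band in which k is unchanged, so small fluctuations in the retained energy do not make k flip back and forth.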