Published by Prudence Cox. Modified over 9 years ago.
1
Streaming Pattern Discovery in Multiple Time-Series
Jimeng Sun, Spiros Papadimitriou, Christos Faloutsos
Parallel Data Laboratory, Carnegie Mellon University
2
© January 16 http://www.pdl.cmu.edu/
Motivation
Co-evolving time series (data streams) appear in many different applications, e.g.:
- Disk access traffic in network clusters
- Internet flow traffic in a network
- Temperatures in a large building
- Chlorine concentration in a water distribution network
Values are typically correlated.
It would be very useful if we could summarize them on the fly.
3
Example
[Figure: chlorine concentrations over time in a water distribution network under normal operation (Phase 1, Phase 2, Phase 3), for sensors near the leak and sensors away from the leak.]
4
Goals
Discover “hidden” (latent) variables for:
- Summarization of main trends for users
- Efficient forecasting, spotting outliers/anomalies
Incremental, real-time computation
Limited memory requirements
5
Example: chlorine measurements
[Figure: chlorine concentrations in a water distribution network across Phase 1, Phase 2 and Phase 3 (normal operation, then a major leak), for sensors near the leak and sensors away from the leak.]
6
Example: hidden variable
We would like to discover a few “hidden (latent) variables” that summarize the key trends.
[Figure: actual measurements (n streams of chlorine concentrations) and the k hidden variable(s) in Phase 1, where k = 1.]
7
Example: hidden variable tracking
We would like to discover a few “hidden (latent) variables” that summarize the key trends.
[Figure: actual measurements (n streams of chlorine concentrations) and the k hidden variable(s) through Phase 1 and Phase 2, where k = 2.]
8
Example: hidden variable tracking
We would like to discover a few “hidden (latent) variables” that summarize the key trends.
[Figure: actual measurements (n streams of chlorine concentrations) and the k hidden variable(s) through Phase 1, Phase 2 and Phase 3, where k = 1.]
9
Method outline
Step 1: How to capture correlations?
Step 2: How to do it incrementally, when we have a very large number of points?
Step 3: How to dynamically adjust the number of hidden variables?
10
1. How to capture correlations?
[Figure: Temperature T1 of the first sensor over time, axis from 20 °C to 30 °C.]
11
1. How to capture correlations?
[Figure: Temperature T2 of the second sensor over time, shown together with the first sensor; axis from 20 °C to 30 °C.]
12
1. How to capture correlations
Correlations: let’s take a closer look at the first three value-pairs…
[Figure: Temperature T1 plotted against Temperature T2, both axes from 20 °C to 30 °C.]
13
1. How to capture correlations
The first three value-pairs lie (almost) on a line in the space of value-pairs…
- O(n) numbers for the slope, and
- One number for each value-pair (offset on line)
offset = “hidden variable”
[Figure: Temperature T1 plotted against Temperature T2, both axes from 20 °C to 30 °C, with the points for time=1, time=2 and time=3 marked on the line.]
14
1. How to capture correlations
Other pairs also follow the same pattern: they lie (approximately) on this line.
[Figure: Temperature T1 plotted against Temperature T2, both axes from 20 °C to 30 °C.]
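A minimal sketch of the idea on these slides: once a line direction is fixed, each value-pair is summarized by a single number, its offset along the line. The readings, the helper names (`unit`, `offset_on_line`, `reconstruct`) and the choice of a line through the origin with slope 1 are all assumptions made for illustration; they are not taken from the slides.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def offset_on_line(point, direction):
    """Offset of the point's projection along a unit direction (the 'hidden variable')."""
    return sum(p * d for p, d in zip(point, direction))

def reconstruct(offset, direction):
    """Approximate the original value-pair from its offset alone."""
    return [offset * d for d in direction]

# Hypothetical (T1, T2) readings in °C from two nearby sensors.
pairs = [(20.0, 20.5), (25.0, 24.6), (30.0, 30.2)]
direction = unit([1.0, 1.0])  # assumed line: through the origin, slope 1

for t, pair in enumerate(pairs, start=1):
    y = offset_on_line(pair, direction)
    approx = reconstruct(y, direction)
    print(f"time={t} offset={y:.2f} approx={[round(a, 2) for a in approx]}")
```

Storing one offset per time step plus the direction (n numbers for n streams) is what makes the summary compact.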
15
Method outline
Step 1: How to capture correlations?
Step 2: How to do it incrementally, when we have a very large number of points?
Step 3: How to dynamically adjust the number of hidden variables?
16
Experiments: chlorine concentration
166 streams, 2 hidden variables (~4% error)
[Figure: measurements from a sensor and their reconstruction from the hidden variables.]
[CMU Civil Engineering]
17
Experiments: chlorine concentration
[Figure: hidden variables.]
[CMU Civil Engineering]
Both capture the global, periodic pattern
Second: ~ first, but “phase-shifted”
Can express any “phase-shift”…
18
Conclusion
Many settings have hundreds of streams, but stream values are, by nature, related.
We proposed a method to discover hidden variables that
- summarizes the main trends for users
- requires only incremental computation, without buffering any past data
Future work: apply the method to more applications, e.g., performance monitoring for storage systems and network systems.
19
Related work
Stream SVD: [Guha, Gunopulos, Koudas / KDD03]
StatStream: [Zhu, Shasha / VLDB02]
Clustering: [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04]
Classification: [Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]
Piecewise approximations: [Palpanas, Vlachos, Keogh, et al / ICDE 2004]
20
Experiments: Light measurements
54 sensors, 2-4 hidden variables (~6% error)
[Figure: measurement and reconstruction.]
21
Experiments: Light measurements
1 & 2: main trend (as before)
3 & 4: potential anomalies and outliers (intermittent)
[Figure: hidden variables.]
22
Stream correlations
Step 1: How to capture correlations?
Step 2: How to do it incrementally, when we have a very large number of points?
Step 3: How to dynamically adjust the number of hidden variables?
23
2. Incremental update
For each new point:
- Project onto current line
- Estimate error
[Figure: Temperature T1 plotted against Temperature T2, both axes from 20 °C to 30 °C, showing the new value and its error relative to the current line.]
24
2. Incremental update
For each new point:
- Project onto current line
- Estimate error
- Rotate line in the direction of the error and in proportion to its magnitude
O(n) time
[Figure: Temperature T1 plotted against Temperature T2, both axes from 20 °C to 30 °C, showing the new value and its error relative to the current line.]
25
2. Incremental update
For each new point:
- Project onto current line
- Estimate error
- Rotate line in the direction of the error and in proportion to its magnitude
[Figure: Temperature T1 plotted against Temperature T2, both axes from 20 °C to 30 °C.]
26
Stream correlations
Principal Component Analysis (PCA)
The “line” is the first principal component (PC) vector.
This line is optimal: it minimizes the sum of squared projection errors.
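The optimality claim can be checked numerically in two dimensions. The sketch below assumes a line through the origin (no mean-centering) and uses hypothetical temperature pairs; the function names are illustrative. It computes the principal direction in closed form from the second moments and compares it against a brute-force scan over candidate angles.

```python
import math

def squared_projection_error(points, theta):
    """Sum of squared distances from the points to the line through the origin at angle theta."""
    wx, wy = math.cos(theta), math.sin(theta)
    total = 0.0
    for x, y in points:
        proj = x * wx + y * wy  # offset of the point along the line
        total += (x - proj * wx) ** 2 + (y - proj * wy) ** 2
    return total

def principal_angle(points):
    """Angle of the first principal component of the (uncentered) 2-D second-moment matrix."""
    sxx = sum(x * x for x, _ in points)
    syy = sum(y * y for _, y in points)
    sxy = sum(x * y for x, y in points)
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)

# Hypothetical (T1, T2) readings in °C.
points = [(20.0, 20.5), (22.0, 21.7), (25.0, 24.6), (27.0, 27.4), (30.0, 30.2)]

best = principal_angle(points)
scan = [math.pi * i / 1800 for i in range(1800)]  # candidate angles in 0.1 degree steps
best_scanned = min(scan, key=lambda th: squared_projection_error(points, th))

print("closed-form angle (degrees):", round(math.degrees(best), 2))
print("best scanned angle (degrees):", round(math.degrees(best_scanned), 2))
```

No scanned angle should give a smaller error than the closed-form principal direction, which is the sense in which the line is optimal.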
27
2. Incremental update
Given number of hidden variables k
Assuming k is known, we know how to update the slope (detailed equations in paper).
For each new point $x$ and for $i = 1, \dots, k$:
- $y_i := w_i^T x$ (projection onto $w_i$)
- $d_i \leftarrow d_i + y_i^2$ (energy, proportional to the i-th eigenvalue)
- $e_i := x - y_i w_i$ (error)
- $w_i \leftarrow w_i + (1/d_i)\, y_i e_i$ (update estimate)
- $x \leftarrow x - y_i w_i$ (repeat with remainder)
[Figure: the point x, its projection y_1 onto w_1, the error e_1, and the updated w_1.]
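The update steps on this slide can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `update_hidden_variables`, the initial weights, the small initial energy and the optional forgetting factor `lam` (left at 1.0 so the energy update matches the slide) are all choices made here.

```python
def update_hidden_variables(weights, energies, x, lam=1.0):
    """Apply one streaming update for a new point x.

    weights  : list of k direction estimates w_i (lists of length n), modified in place
    energies : list of k energy estimates d_i, modified in place
    lam      : assumed forgetting factor; 1.0 reproduces d_i <- d_i + y_i^2
    Returns the k hidden variables y_1..y_k for this point.
    """
    x = list(x)
    hidden = []
    for i, w in enumerate(weights):
        y = sum(wj * xj for wj, xj in zip(w, x))    # y_i := w_i^T x
        energies[i] = lam * energies[i] + y * y     # d_i <- d_i + y_i^2
        e = [xj - y * wj for xj, wj in zip(x, w)]   # e_i := x - y_i w_i
        for j in range(len(w)):
            w[j] += y * e[j] / energies[i]          # w_i <- w_i + (1/d_i) y_i e_i
        x = [xj - y * wj for xj, wj in zip(x, w)]   # x <- x - y_i w_i (remainder)
        hidden.append(y)
    return hidden

# Hypothetical stream: two sensors reporting the same temperature.
weights = [[1.0, 0.0]]   # assumed starting direction for k = 1
energies = [1e-3]        # assumed small positive starting energy
for t in range(200):
    reading = 20.0 + (t % 10)
    y = update_hidden_variables(weights, energies, [reading, reading])

print("estimated direction:", [round(c, 3) for c in weights[0]])
```

Each update touches every weight once per hidden variable, so the cost per new point is O(nk), with no past data buffered.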
28
Stream correlations
Step 1: How to capture correlations?
Step 2: How to do it incrementally, when we have a very large number of points?
Step 3: How to dynamically adjust k, the number of hidden variables?
29
3. Number of hidden variables
If we had three sensors with similar measurements:
Again, points would lie on a line (i.e., one hidden variable, k = 1), but in 3-D space.
[Figure: value-tuple space with axes T1, T2, T3.]
30
3. Number of hidden variables
Assume one sensor intermittently gets stuck.
Now, no line can give a good approximation.
[Figure: value-tuple space with axes T1, T2, T3.]
31
3. Number of hidden variables
Assume one sensor intermittently gets stuck.
Now, no line can give a good approximation.
But a plane will do (two hidden variables, k = 2).
[Figure: value-tuple space with axes T1, T2, T3.]
32
Number of hidden variables (PCs)
Keep track of the energy maintained by the approximation with k variables (PCs):
- Reconstruction accuracy, w.r.t. total squared error
Increment (or decrement) k if the fraction of energy maintained goes below (or above) a threshold:
- If below 95%, $k \leftarrow k + 1$
- If above 98%, $k \leftarrow k - 1$
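The threshold rule above can be sketched as a small function. The name `adjust_k`, the way the two energies are passed in, and the bounds on k (at least 1, optionally at most `k_max`) are assumptions for illustration; the slide gives only the 95% and 98% thresholds.

```python
def adjust_k(k, total_energy, retained_energy, low=0.95, high=0.98, k_max=None):
    """Return the new number of hidden variables given the fraction of energy retained.

    total_energy    : running sum of squared input values (assumed bookkeeping)
    retained_energy : running sum of squared hidden-variable values (assumed bookkeeping)
    """
    if total_energy <= 0.0:
        return k  # nothing observed yet, keep k unchanged
    fraction = retained_energy / total_energy
    if fraction < low and (k_max is None or k < k_max):
        return k + 1  # approximation loses too much energy: add a hidden variable
    if fraction > high and k > 1:
        return k - 1  # approximation is more than accurate enough: drop one
    return k

# Hypothetical energy readings.
print(adjust_k(1, total_energy=100.0, retained_energy=90.0))
print(adjust_k(2, total_energy=100.0, retained_energy=99.0))
print(adjust_k(2, total_energy=100.0, retained_energy=96.5))
```

Keeping a gap between the two thresholds avoids flipping k back and forth when the retained fraction hovers near a single cut-off.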