Streaming Pattern Discovery in Multiple Time-Series. Jimeng Sun, Spiros Papadimitriou, Christos Faloutsos. Parallel Data Laboratory, Carnegie Mellon University.
© January 16http:// Motivation: Co-evolving time series (data streams) appear in many different applications, e.g. disk access traffic in network clusters, Internet flow traffic in a network, temperatures in a large building, and chlorine concentration in a water distribution network. Values are typically correlated. It would be very useful if we could summarize them on the fly.
Example: water distribution network. [Figure: chlorine concentrations over time across Phases 1, 2, and 3, with normal operation marked; curves for sensors near the leak and sensors away from the leak.]
Goals: Discover “hidden” (latent) variables for summarization of main trends for users, and for efficient forecasting and spotting outliers/anomalies. Incremental, real-time computation. Limited memory requirements.
Example: chlorine measurements in a water distribution network. [Figure: chlorine concentrations across Phases 1, 2, and 3, covering normal operation and a major leak; curves for sensors near the leak and sensors away from the leak.]
Example: hidden variable. We would like to discover a few “hidden (latent) variables” that summarize the key trends. [Figure: chlorine concentrations in Phase 1; actual measurements (n streams) and k hidden variable(s), k = 1.]
Example: hidden variable tracking. We would like to discover a few “hidden (latent) variables” that summarize the key trends. [Figure: chlorine concentrations in Phases 1 and 2; actual measurements (n streams) and k hidden variable(s), k = 2.]
Example: hidden variable tracking. We would like to discover a few “hidden (latent) variables” that summarize the key trends. [Figure: chlorine concentrations in Phases 1, 2, and 3; actual measurements (n streams) and k hidden variable(s), k = 1.]
Method outline Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust the number of hidden variables?
1. How to capture correlations? [Figure: temperature T1 of the first sensor over time; axis ticks at 20°C and 30°C.]
1. How to capture correlations? [Figure: temperature T2 of the second sensor over time, shown together with the first sensor; axis ticks at 20°C and 30°C.]
1. How to capture correlations? Correlations: let’s take a closer look at the first three value-pairs… [Figure: scatter plot of (T1, T2) value-pairs; both temperature axes run from 20°C to 30°C.]
1. How to capture correlations? The first three value-pairs (time = 1, 2, 3) lie (almost) on a line in the space of value-pairs. Describing them takes O(n) numbers for the slope, plus one number for each value-pair (its offset on the line). The offset is the “hidden variable”. [Figure: scatter plot of (T1, T2) value-pairs with the fitted line; both temperature axes run from 20°C to 30°C.]
1. How to capture correlations? Other pairs also follow the same pattern: they lie (approximately) on this line. [Figure: scatter plot of all (T1, T2) value-pairs around the fitted line; both temperature axes run from 20°C to 30°C.]
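To make the "offset on the line" idea concrete, here is a minimal sketch for two streams. The readings are made up for illustration, the line is assumed to pass through the origin (no mean removal), and the helper name `principal_direction_2d` is ours, not from the slides.

```python
import math

# Hypothetical readings (degrees C) from two nearby sensors at five time ticks.
t1 = [20.0, 22.0, 25.0, 27.0, 30.0]
t2 = [20.3, 21.8, 25.1, 27.2, 29.7]


def principal_direction_2d(xs, ys):
    """Unit vector along the best-fit line through the origin for 2-D points."""
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Closed-form major-axis angle of the 2x2 second-moment matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))


w = principal_direction_2d(t1, t2)

# One hidden variable per time tick: the offset of the value-pair along the line.
hidden = [w[0] * a + w[1] * b for a, b in zip(t1, t2)]

# Reconstruct both streams from the hidden variable alone.
recon = [(y * w[0], y * w[1]) for y in hidden]

# Relative squared reconstruction error over both streams.
err = sum((a - ra) ** 2 + (b - rb) ** 2
          for (a, b), (ra, rb) in zip(zip(t1, t2), recon))
total = sum(a * a + b * b for a, b in zip(t1, t2))
rel_err = err / total
print(w, rel_err)
```

With n streams instead of two, the direction `w` would hold n numbers, while each time tick still contributes a single hidden value.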
Method outline Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust the number of hidden variables?
Experiments: chlorine concentration [CMU Civil Engineering]. 166 streams, 2 hidden variables (~4% error). [Figure: measurements from a sensor and their reconstruction from the hidden variables.]
Experiments: chlorine concentration, hidden variables [CMU Civil Engineering]. Both hidden variables capture the global, periodic pattern. The second is similar to the first, but “phase-shifted”. Can express any “phase-shift”…
Conclusion: There are many settings with hundreds of streams, but stream values are, by nature, related. We proposed a method that discovers hidden variables as a summarization of the main trends for users, and that requires only incremental computation, without buffering any past data. Future work: apply the method to more applications, e.g., performance monitoring for storage systems and network systems.
Related work: Stream SVD [Guha, Gunopulos, Koudas / KDD03]; StatStream [Zhu, Shasha / VLDB02]; Clustering [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04]; Classification [Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]; Piecewise approximations [Palpanas, Vlachos, Keogh, et al / ICDE 2004].
Experiments: light measurements. 54 sensors, 2-4 hidden variables (~6% error). [Figure: measurement and reconstruction.]
Experiments: light measurements, hidden variables. 1 & 2: main trend (as before). 3 & 4: potential anomalies and outliers (intermittent).
Stream correlations Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust the number of hidden variables?
2. Incremental update. For each new point: project onto the current line; estimate the error. [Figure: a new value and its error relative to the current line in the (T1, T2) value-pair space; both temperature axes run from 20°C to 30°C.]
2. Incremental update. For each new point: project onto the current line; estimate the error; rotate the line in the direction of the error and in proportion to its magnitude. This takes O(n) time. [Figure: a new value and its error relative to the current line in the (T1, T2) value-pair space; both temperature axes run from 20°C to 30°C.]
2. Incremental update. For each new point: project onto the current line; estimate the error; rotate the line in the direction of the error and in proportion to its magnitude. [Figure: the rotated line in the (T1, T2) value-pair space; both temperature axes run from 20°C to 30°C.]
Stream correlations: Principal Component Analysis (PCA). The “line” is the first principal component (PC) vector. This line is optimal: it minimizes the sum of squared projection errors.
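The optimality claim can be written out as follows. This is a sketch in our own notation, assuming the points $x_1, \dots, x_t$ are n-dimensional column vectors and that the line passes through the origin; the symbol $\tau$ for the time index is not from the slides.

```latex
w_1 \;=\; \arg\min_{\|w\| = 1} \; \sum_{\tau=1}^{t} \left\| x_\tau - \left( w^{T} x_\tau \right) w \right\|^{2}
```

Here $(w^{T} x_\tau)\, w$ is the projection of $x_\tau$ onto the line with direction $w$, so each term is one squared projection error.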
2. Incremental update, given the number of hidden variables k. Assuming k is known, we know how to update the slope (detailed equations in paper). For each new point $x$ and for $i = 1, \dots, k$:
$y_i := w_i^{T} x$ (projection onto $w_i$)
$d_i \leftarrow d_i + y_i^{2}$ (energy, $\propto$ $i$-th eigenvalue)
$e_i := x - y_i w_i$ (error)
$w_i \leftarrow w_i + \frac{1}{d_i}\, y_i e_i$ (update estimate)
$x \leftarrow x - y_i w_i$ (repeat with remainder)
[Figure: point $x$, its projection $y_1$ onto $w_1$, the error $e_1$, and the updated $w_1$.]
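The update steps on this slide can be sketched in a few lines of plain Python. This is an illustrative reading of the slide, not the authors' code: the function name `update`, the starting direction, the small initial energy `1e-3`, and the synthetic two-sensor stream are all assumptions made for the demo.

```python
import math


def update(x, W, d):
    """Apply one update for a new point x (sequence of n floats).

    W is a list of k direction vectors and d a list of k energies;
    both are modified in place. Returns the k hidden-variable values.
    """
    x = list(x)  # remainder, shrinks as each direction is peeled off
    ys = []
    for i in range(len(W)):
        w = W[i]
        y = sum(wj * xj for wj, xj in zip(w, x))        # projection onto w_i
        d[i] += y * y                                   # accumulated energy
        e = [xj - y * wj for xj, wj in zip(x, w)]       # projection error
        W[i] = [wj + (y / d[i]) * ej for wj, ej in zip(w, e)]  # rotate w_i
        x = [xj - y * wj for xj, wj in zip(x, W[i])]    # pass on the remainder
        ys.append(y)
    return ys


# Demo (assumed setup): two sensors that always report the same value,
# one hidden variable, and a deliberately wrong starting direction.
W = [[1.0, 0.0]]
d = [1e-3]
for t in range(500):
    a = 25.0 + 5.0 * math.sin(0.1 * t)
    update((a, a), W, d)

w1 = W[0]
norm = math.sqrt(sum(v * v for v in w1))
print(w1, norm)
```

In this demo the direction drifts from the first axis toward the diagonal, where both sensors carry equal weight, without storing any past point. Each call costs O(nk) arithmetic operations.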
Stream correlations Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust k, the number of hidden variables?
3. Number of hidden variables. If we had three sensors with similar measurements, the points would again lie on a line (i.e., one hidden variable, k = 1), but in 3-D space. [Figure: value-tuple space with axes T1, T2, T3.]
3. Number of hidden variables. Assume one sensor intermittently gets stuck. Now, no line can give a good approximation. [Figure: value-tuple space with axes T1, T2, T3.]
3. Number of hidden variables. Assume one sensor intermittently gets stuck. Now, no line can give a good approximation, but a plane will do (two hidden variables, k = 2). [Figure: value-tuple space with axes T1, T2, T3.]
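A small numeric sketch of the stuck-sensor case follows. The readings are invented and assumed to be mean-removed (deviations from a baseline), the three sensors are assumed to agree exactly while healthy, and `residual_energy` is a helper name of ours. The sketch only checks the line that fit the healthy data, not every possible line.

```python
import math


def residual_energy(points, basis):
    """Fraction of total squared norm left after projecting the points
    onto the subspace spanned by an orthonormal basis."""
    total = 0.0
    left = 0.0
    for p in points:
        r = list(p)
        for u in basis:
            y = sum(ui * ri for ui, ri in zip(u, r))
            r = [ri - y * ui for ri, ui in zip(r, u)]
        total += sum(pi * pi for pi in p)
        left += sum(ri * ri for ri in r)
    return left / total


# Hypothetical mean-removed readings; sensor 3 freezes at 5.0 from tick 5 on.
a = [-5.0, -3.0, 0.0, 2.0, 5.0, 3.0, -1.0, -4.0]
points = [(v, v, v if t < 5 else 5.0) for t, v in enumerate(a)]

s3 = 1.0 / math.sqrt(3.0)
s2 = 1.0 / math.sqrt(2.0)
line = [(s3, s3, s3)]                      # k = 1: the line for healthy data
plane = [(s2, s2, 0.0), (0.0, 0.0, 1.0)]   # k = 2: a plane containing all points

line_err = residual_energy(points, line)
plane_err = residual_energy(points, plane)
print(line_err, plane_err)
```

Under these assumptions the single line loses roughly a quarter of the energy, while the plane reconstructs every point up to rounding.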
Number of hidden variables (PCs). Keep track of the energy maintained by the approximation with k variables (PCs), i.e., the reconstruction accuracy w.r.t. total squared error. Increment (or decrement) k if the fraction of energy maintained goes below (or above) a threshold: if below 95%, $k \leftarrow k + 1$; if above 98%, $k \leftarrow k - 1$.
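The threshold rule can be sketched as a small function. The name `adjust_k`, its parameters, and the guards that keep k between 1 and an optional `k_max` are assumptions added for illustration; only the 95% and 98% thresholds come from the slide.

```python
def adjust_k(k, total_energy, kept_energy, low=0.95, high=0.98, k_max=None):
    """Return the new number of hidden variables.

    total_energy: squared norm of the measurements seen so far.
    kept_energy:  squared norm captured by the current k hidden variables.
    """
    fraction = kept_energy / total_energy
    if fraction < low and (k_max is None or k < k_max):
        return k + 1  # too much energy lost: add a hidden variable
    if fraction > high and k > 1:
        return k - 1  # more than enough retained: drop a hidden variable
    return k          # inside the band: leave k unchanged


print(adjust_k(1, 100.0, 90.0))   # fraction 0.90, below the low threshold
print(adjust_k(2, 100.0, 99.0))   # fraction 0.99, above the high threshold
print(adjust_k(2, 100.0, 96.5))   # fraction 0.965, inside the band
```

The gap between the two thresholds gives a band in which k is left unchanged.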