Published by Ronald Casey. Modified over 9 years ago.
1
Streaming Pattern Discovery in Multiple Time-Series
Spiros Papadimitriou, Jimeng Sun, Christos Faloutsos
Carnegie Mellon University
VLDB 2005, Trondheim, Norway
2
2 Motivation
Several settings where many deployed sensors measure some quantity, e.g.:
– Traffic in a network
– Temperatures in a large building
– Chlorine concentration in a water distribution network
Values are typically correlated
Would be very useful if we could summarize them on the fly
3
3 Motivation
May have hundreds of measurements, but it is unlikely they are completely unrelated!
[Figure: chlorine concentrations in a water distribution network under normal operation, Phases 1 to 3; sensors near the leak vs. sensors away from the leak]
4
4 Motivation
May have hundreds of measurements, but it is unlikely they are completely unrelated!
[Figure: chlorine concentrations in a water distribution network, normal operation followed by a major leak, Phases 1 to 3; sensors near the leak vs. sensors away from the leak]
5
5 Motivation
We would like to discover a few “hidden (latent) variables” that summarize the key trends
[Figure: chlorine concentrations, Phase 1; actual measurements (n streams) vs. k hidden variable(s), k = 1]
6
6 Motivation
We would like to discover a few “hidden (latent) variables” that summarize the key trends
[Figure: chlorine concentrations, Phases 1 and 2; actual measurements (n streams) vs. k hidden variable(s), k = 2]
7
7 Motivation
We would like to discover a few “hidden (latent) variables” that summarize the key trends
[Figure: chlorine concentrations, Phases 1 to 3; actual measurements (n streams) vs. k hidden variable(s), k = 1]
8
8 Goals
Discover “hidden” (latent) variables for:
– Summarization of main trends for users
– Efficient forecasting, spotting outliers/anomalies
Incremental, real-time computation
Limited memory requirements
9
9 Related work
Stream mining:
– Stream SVD [Guha, Gunopulos, Koudas / KDD03]
– StatStream [Zhu, Shasha / VLDB02]
– Clustering [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04]
– Classification [Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]
– Piecewise approximations [Palpanas, Vlachos, Keogh, et al / ICDE 2004]
– Queries on streams [Dobra, Garofalakis, Gehrke, et al / SIGMOD02], [Madden, Franklin, Hellerstein, et al / OSDI02], [Considine, Li, Kollios, et al / ICDE04], [Hammad, Aref, Elmagarmid / SSDBM03]
– …
10
10 Overview Method outline Experiments
11
11 Stream correlations Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust the number of hidden variables?
12
12 1. How to capture correlations?
[Figure: temperature T1 of the first sensor over time, ranging between 20 °C and 30 °C]
13
13 1. How to capture correlations?
[Figure: temperature T2 of the second sensor over time, ranging between 20 °C and 30 °C, shown together with the first sensor]
14
14 1. How to capture correlations?
Correlations: let’s take a closer look at the first three value-pairs…
[Figure: value-pairs plotted in the plane of temperature T1 and temperature T2, both axes from 20 °C to 30 °C]
15
15 1. How to capture correlations?
First three lie (almost) on a line in the space of value-pairs…
– O(n) numbers for the slope, and
– One number for each value-pair (offset on line)
offset = “hidden variable”
[Figure: value-pairs at time=1, time=2, time=3 in the plane of temperature T1 and temperature T2, both axes from 20 °C to 30 °C]
16
16 1. How to capture correlations?
Other pairs also follow the same pattern: they lie (approximately) on this line
[Figure: all value-pairs in the plane of temperature T1 and temperature T2, both axes from 20 °C to 30 °C]
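As a rough illustration of the "offset on the line = hidden variable" idea, the sketch below projects a few value-pairs onto a fixed direction and rebuilds them from the single offset. The readings, the direction (1, 1), the assumption that the line passes through the origin, and the name `project_onto_line` are all made up for this example; they are not taken from the slides.

```python
import math

def project_onto_line(points, direction):
    """Project each point onto a line through the origin.

    Returns (offsets, reconstructions): one offset (the "hidden variable")
    per point, plus the point rebuilt from that offset alone.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    w = [c / norm for c in direction]  # unit-length direction vector
    offsets = [sum(p_j * w_j for p_j, w_j in zip(p, w)) for p in points]
    reconstructions = [[y * w_j for w_j in w] for y in offsets]
    return offsets, reconstructions

# Hypothetical temperature pairs (T1, T2) in °C, roughly on the line T1 = T2.
pairs = [(20.0, 20.5), (25.0, 24.8), (30.0, 30.1)]
offsets, reconstructions = project_onto_line(pairs, (1.0, 1.0))
for pair, y, rebuilt in zip(pairs, offsets, reconstructions):
    print(pair, "-> offset", round(y, 2), "-> rebuilt", [round(v, 2) for v in rebuilt])
```

Storing the direction once and one offset per time tick is what lets n correlated streams be summarized by far fewer numbers.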
17
17 Stream correlations Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust the number of hidden variables?
18
18 2. Incremental update
For each new point:
– Project onto current line
– Estimate error
[Figure: a new value and its error relative to the current line, in the plane of temperature T1 and temperature T2, both axes from 20 °C to 30 °C]
19
19 2. Incremental update
For each new point:
– Project onto current line
– Estimate error
– Rotate line in the direction of the error and in proportion to its magnitude
O(n) time
[Figure: a new value, its error, and the rotated line, in the plane of temperature T1 and temperature T2, both axes from 20 °C to 30 °C]
20
20 2. Incremental update
For each new point:
– Project onto current line
– Estimate error
– Rotate line in the direction of the error and in proportion to its magnitude
[Figure: the updated line in the plane of temperature T1 and temperature T2, both axes from 20 °C to 30 °C]
21
21 Stream correlations Principal Component Analysis (PCA) The “line” is the first principal component (PC) vector This line is optimal: it minimizes the sum of squared projection errors
22
22 2. Incremental update
Given the number of hidden variables k (assuming k is known), we know how to update the slope (detailed equations in paper).
For each new point $x$ and for $i = 1, \ldots, k$:
– $y_i := w_i^T x$ (projection onto $w_i$)
– $d_i \leftarrow d_i + y_i^2$ (energy $\propto$ $i$-th eigenvalue)
– $e_i := x - y_i w_i$ (error)
– $w_i \leftarrow w_i + (1/d_i)\, y_i e_i$ (update estimate)
– $x \leftarrow x - y_i w_i$ (repeat with remainder)
[Figure: point $x$, its projection $y_1$ onto $w_1$, the error $e_1$, and the updated $w_1$]
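The update steps on this slide can be sketched in a few lines of plain Python. This is only a minimal reading of the slide, not the paper's full algorithm: the function name `incremental_update`, the list-based data layout, and the small positive starting value for each d_i (used to avoid dividing by zero) are assumptions made for illustration.

```python
def incremental_update(x, W, d):
    """Apply the slide's per-point update in place.

    x: the new point (n values, one per stream)
    W: list of k direction estimates w_i, each a list of n floats
    d: list of k energy estimates d_i (assumed to start small and positive)
    Returns the k hidden-variable values y_i for this point.
    """
    x = list(x)  # work on a copy; the remainder is overwritten below
    ys = []
    for i in range(len(W)):
        w = W[i]
        y = sum(w_j * x_j for w_j, x_j in zip(w, x))        # y_i := w_i^T x
        d[i] += y * y                                       # d_i <- d_i + y_i^2
        e = [x_j - y * w_j for x_j, w_j in zip(x, w)]       # e_i := x - y_i w_i
        w = [w_j + (y / d[i]) * e_j for w_j, e_j in zip(w, e)]  # rotate w_i
        W[i] = w
        x = [x_j - y * w_j for x_j, w_j in zip(x, w)]       # remainder for i+1
        ys.append(y)
    return ys

# Hypothetical two-stream input where the second stream is twice the first.
W = [[1.0, 0.0]]
d = [1e-3]
for t in range(1, 21):
    incremental_update((0.1 * t, 0.2 * t), W, d)
print(W[0])  # direction estimate after 20 points
```

Each point costs O(nk) arithmetic and only W and d are kept between points, which matches the time and space claims made later in the deck.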
23
23 Stream correlations Step 1: How to capture correlations? Step 2: How to do it incrementally, when we have a very large number of points? Step 3: How to dynamically adjust k, the number of hidden variables?
24
24 3. Number of hidden variables
If we had three sensors with similar measurements
Again: points would lie on a line (i.e., one hidden variable, k = 1), but in 3-D space
[Figure: value-tuple space with axes T1, T2, T3]
25
25 3. Number of hidden variables
Assume one sensor intermittently gets stuck
Now, no line can give a good approximation
[Figure: value-tuple space with axes T1, T2, T3]
26
26 3. Number of hidden variables
Assume one sensor intermittently gets stuck
Now, no line can give a good approximation
But a plane will do (two hidden variables, k = 2)
[Figure: value-tuple space with axes T1, T2, T3]
27
27 Number of hidden variables (PCs)
Keep track of energy maintained by approximation with k variables (PCs):
– Reconstruction accuracy, w.r.t. total squared error
Increment (or decrement) k if fraction of energy maintained goes below (or above) a threshold
– If below 95%, $k \leftarrow k + 1$
– If above 98%, $k \leftarrow k - 1$
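The threshold rule above might be coded as follows. The 95% and 98% thresholds come from the slide; the function name `adjust_k`, the optional `k_max` cap, and the floor of one hidden variable are assumptions added so the sketch is safe to call.

```python
def adjust_k(k, total_energy, kept_energy, low=0.95, high=0.98, k_max=None):
    """Return the new number of hidden variables.

    total_energy: running sum of squared input values
    kept_energy:  running sum of squared hidden-variable values
    """
    if total_energy <= 0:
        return k  # nothing observed yet, so there is nothing to compare
    fraction = kept_energy / total_energy
    if fraction < low and (k_max is None or k < k_max):
        return k + 1  # approximation too lossy: add a hidden variable
    if fraction > high and k > 1:
        return k - 1  # more detail than needed: drop a hidden variable
    return k

print(adjust_k(2, total_energy=100.0, kept_energy=90.0))  # below 95%: k grows
print(adjust_k(2, total_energy=100.0, kept_energy=99.0))  # above 98%: k shrinks
```

Leaving a gap between the two thresholds keeps k from flipping back and forth when the energy fraction hovers near a single cut-off.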
28
28 Missing values
[Figure: plane of temperature T1 and temperature T2, both axes from 20 °C to 30 °C, showing the true values (pair), all possible value pairs (given only t1), and the best guess (given correlations: intersection)]
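One way to read the "intersection" in the figure, for a single hidden variable: pick the point on the correlation line that agrees with the values we did observe, then read the missing coordinates off that point. The sketch below does this with a least-squares fit over the observed coordinates; the name `fill_missing`, the use of `None` for a missing reading, and the restriction to k = 1 are assumptions for illustration, and this is not necessarily the procedure used in the paper.

```python
def fill_missing(x, w):
    """Fill in missing entries of x using a single direction w (k = 1).

    x: list of n readings, with None marking a missing value
    w: direction of the correlation line (n floats)
    """
    observed = [(x_j, w_j) for x_j, w_j in zip(x, w) if x_j is not None]
    denom = sum(w_j * w_j for _, w_j in observed)
    if denom == 0:
        raise ValueError("no observed coordinate constrains the hidden variable")
    # Hidden-variable value that best explains the observed coordinates.
    y = sum(x_j * w_j for x_j, w_j in observed) / denom
    # Keep observed values; read missing ones off the line at offset y.
    return [x_j if x_j is not None else y * w_j for x_j, w_j in zip(x, w)]

# Hypothetical case: T1 = 20 °C observed, T2 missing, line T1 = T2.
print(fill_missing([20.0, None], [1.0, 1.0]))
```

With exactly one observed coordinate, the least-squares fit reduces to intersecting the line with the set of all pairs that share that coordinate, as drawn in the figure.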
29
29 Forecasting
Assume we want to forecast the next value for a particular stream (e.g. auto-regression)
[Figure: n streams, with the next value marked “?”]
30
30 Forecasting
Option 1: One complex model per stream
– Next value = function of previous values on all streams
– Captures correlations
– Too costly! [~ $O(n^3)$]
[Figure: n streams]
31
31 Forecasting
Option 1: One complex model per stream
Option 2: One simple model per stream
– Next value = function of previous value on same stream
– Worse accuracy, but maybe acceptable
– But, still need n models
[Figure: n streams]
32
32 Forecasting
k hidden vars: k << n, and they already capture correlations
Only k simple models: efficiency & robustness
[Figure: n streams summarized by k hidden variables]
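To make "only k simple models" concrete, the sketch below fits one first-order auto-regressive coefficient per hidden variable and maps the k forecasts back to all n streams through the direction vectors. The AR(1) model with no intercept, the batch least-squares fit, and the names `fit_ar1` and `forecast_streams` are assumptions chosen for brevity; the slides do not specify the model.

```python
def fit_ar1(series):
    """Least-squares coefficient a for the model y[t+1] ≈ a * y[t]."""
    num = sum(prev * nxt for prev, nxt in zip(series[:-1], series[1:]))
    den = sum(prev * prev for prev in series[:-1])
    return num / den if den else 0.0

def forecast_streams(hidden_history, W):
    """Forecast the next value of every stream from k hidden variables.

    hidden_history: list of k series, one per hidden variable
    W: list of k direction vectors, each with n entries
    """
    n = len(W[0])
    forecast = [0.0] * n
    for series, w in zip(hidden_history, W):
        y_next = fit_ar1(series) * series[-1]  # one simple model per hidden var
        for j in range(n):
            forecast[j] += y_next * w[j]       # map back to stream j
    return forecast

# Hypothetical single hidden variable that doubles at every tick.
print(forecast_streams([[1.0, 2.0, 4.0, 8.0]], [[1.0, 0.5]]))
```

Only k models are fitted regardless of n, and every stream's forecast still reflects the correlations encoded in W.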
33
33 Time/space requirements
Incremental PCA: O(nk) space (total) and time (per tuple), i.e.,
– Independent of # points (t)
– Linear w.r.t. # streams (n)
– Linear w.r.t. # hidden variables (k)
In fact, it can be done in real time [demo]
34
34 Overview Method outline Experiments
35
35 Experiments: Chlorine concentration
166 streams
2 hidden variables (~4% error)
[Figure: measurements vs. reconstruction] [CMU Civil Engineering]
36
36 Experiments: Chlorine concentration
Both capture global, periodic pattern
Second: ~ first, but “phase-shifted”
Can express any “phase-shift”…
[Figure: hidden variables] [CMU Civil Engineering]
37
37 Experiments: Light measurements
54 sensors
2-4 hidden variables (~6% error)
[Figure: measurement vs. reconstruction]
38
38 Experiments: Light measurements
1 & 2: main trend (as before)
3 & 4: potential anomalies and outliers
[Figure: hidden variables, with an “intermittent” label]
39
39 Experiments: Missing values
Correlations already captured by hidden variables
Provide information about missing values
– Quickly back on track, if mis-estimated
[Figure: reconstruct sensor 7 given everything else (via hidden variables)] [CMU ECE]
40
40 Experiments: Missing values
Correlations already captured by hidden variables
Provide information about missing values
– Quickly back on track, if mis-estimated
[Figure: reconstruct sensor 8 given everything else (via hidden variables)] [CMU ECE]
41
41 Wall-clock times
Constant time per tuple and per stream
[Figure: three panels of time (sec): vs. stream size (time ticks t), vs. # of streams (n), vs. # of hidden variables / PCs (k)]
42
42 Conclusion
Many settings with hundreds of streams, but
– Stream values are, by nature, related
– In reality, there are only a few variables
Discover hidden variables for
– Summarization of main trends for users
– Efficient forecasting, spotting outliers/anomalies
Incremental, real-time computation
With limited memory
43
43 End Thank you