Published by Kaley Colley. Modified over 10 years ago.
1 StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time
Pankaj Kumar Madhukar Rakesh Kumar Singh Puspendra Kumar
Project Instructor: Prof. P. K. Reddy
2 Goal
- Given tens of thousands of high-speed time series data streams, detect high-value correlations, both synchronized and time-lagged, over sliding windows in real time.
- Real time
  - High update frequency of the data streams
  - Fixed response time, online
[Figure label: Correlated!]
3 Our approach
- Naive algorithm
  - N: number of streams
  - w: size of sliding window
  - Space O(N) and time O(N^2 w), versus space O(N^2) and time O(N^2)
- Suppose that the streams are updated every second.
  - With a Pentium 4 PC, the exact computing method can only monitor 700 streams with a delay of 2 minutes.
- Our approach
  - Use the Discrete Fourier Transform (DFT) to approximate correlation
  - Use a grid structure to filter out unlikely pairs
  - Our approach can monitor 10,000 streams with a delay of 2 minutes.
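For reference, the exact all-pairs method that the slide calls naive can be sketched as below. This is a minimal illustration, not the authors' code: the function names `pearson` and `naive_correlated_pairs`, the plain-list representation of each stream's current sliding window, and the threshold test are all assumptions made here. Each of the N(N-1)/2 pairs costs O(w) work, which gives the O(N^2 w) time on the slide.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length windows, O(w) time."""
    w = len(x)
    mean_x = sum(x) / w
    mean_y = sum(y) / w
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen for this sketch: constant window
    return cov / math.sqrt(var_x * var_y)


def naive_correlated_pairs(windows, threshold):
    """Check every pair of streams: O(N^2) pairs, O(w) work per pair."""
    return [
        (i, j)
        for i, j in combinations(range(len(windows)), 2)
        if pearson(windows[i], windows[j]) >= threshold
    ]


# Three toy streams, each holding its current sliding window (w = 4).
windows = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
pairs = naive_correlated_pairs(windows, threshold=0.9)
```

Here only streams 0 and 1 move together, so `pairs` contains the single pair `(0, 1)`.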
4 Roadmap
- Goal
- StatStream
  - Data structure
  - Correlation approximation
  - Grid structure
- Empirical study
- Future work
5 Stream synoptic data structure
- Three-level time interval hierarchy
  - Time point, basic window, sliding window
- Basic window (the key to our technique)
  - The computation for basic window i must finish by the end of basic window i+1.
  - The basic window time is the system response time.
- Digests
  - Sliding window digests: sum, DFT coefficients
  - Basic window digests: sum, DFT coefficients
[Figure labels: sliding window, basic window, time point; one set of basic window digests per basic window]
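The hierarchy above can be illustrated with the simplest digest, the sum. The sketch below is an assumption-laden illustration rather than the StatStream implementation: the class name `SumDigest` and its fields are hypothetical, and it tracks only sums, not DFT coefficients. It shows how the sliding window digest is refreshed once per completed basic window by adding the newest basic window digest and dropping the oldest.

```python
from collections import deque


class SumDigest:
    """Keeps per-basic-window sums and the sum over the sliding window."""

    def __init__(self, basic_window_size, num_basic_windows):
        self.b = basic_window_size
        # One sum per basic window currently inside the sliding window.
        self.basic_sums = deque(maxlen=num_basic_windows)
        self.sliding_sum = 0.0
        self._current = []  # time points of the basic window being filled

    def append(self, value):
        """Add one time point; return True when a basic window completes."""
        self._current.append(value)
        if len(self._current) < self.b:
            return False
        new_sum = sum(self._current)
        if len(self.basic_sums) == self.basic_sums.maxlen:
            # The oldest basic window leaves the sliding window.
            self.sliding_sum -= self.basic_sums[0]
        self.basic_sums.append(new_sum)  # deque drops the oldest entry
        self.sliding_sum += new_sum
        self._current = []
        return True


# Basic windows of 2 points, sliding window of 2 basic windows.
digest = SumDigest(basic_window_size=2, num_basic_windows=2)
for value in [1, 2, 3, 4, 5, 6]:
    digest.append(value)
```

After the six points, the sliding window covers the basic windows (3, 4) and (5, 6), so the basic sums are 7 and 11 and the sliding sum is 18. Results are refreshed only at basic window boundaries, which matches the statement that the basic window time is the system response time.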
6 Roadmap
- Motivation and Goal
- Related work
- StatStream
  - Data structure
  - Correlation approximation
  - Grid structure
- Empirical study
- Future work
7 Synchronized Correlation Uses Basic Windows
- Inner product of aligned basic windows
[Figure labels: stream x, stream y, sliding window, basic window]
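The point of this slide can be checked numerically: the inner product over a sliding window is the sum of the inner products over its aligned basic windows, so only the newest basic window has to be computed when the window slides. The sketch below assumes that the sliding window length is a multiple of the basic window size b; the helper names are hypothetical.

```python
def inner(x, y):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def basic_window_inner_products(x, y, b):
    """Inner product of each pair of aligned basic windows of size b."""
    return [inner(x[i:i + b], y[i:i + b]) for i in range(0, len(x), b)]


x = [1, 2, 3, 4, 5, 6]
y = [6, 5, 4, 3, 2, 1]
parts = basic_window_inner_products(x, y, b=2)

# Sliding by one basic window: drop the oldest part, add the newest one.
new_x, new_y = [7, 8], [0, 1]
slid_total = sum(parts) - parts[0] + inner(new_x, new_y)
```

For this data `parts` is `[16, 24, 16]`, its sum equals `inner(x, y)`, and `slid_total` equals the inner product recomputed from scratch over the shifted window.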
8 Approximate Synchronized Correlation
- Approximate with an orthogonal function family (e.g. DFT)
- Inner product of the time series ≈ inner product of the digests
- The time and space complexity is reduced from O(b) to O(n).
  - b: size of basic window
  - n: size of the digests (n << b)
- E.g., 120 time points reduce to 4 digests
[Figure: series x_1 ... x_8 and y_1 ... y_8 projected onto basis functions f_1(1..8), f_2(1..8), f_3(1..8)]
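A small sketch of the idea, under assumptions stated here rather than taken from the slides: the digest is the first n unnormalized DFT coefficients of a real-valued basic window, with n - 1 < b / 2, and the inner product is estimated through Parseval's relation, counting each nonzero frequency twice because a real series has conjugate-symmetric coefficients. The function names `dft_digest` and `approx_inner` are hypothetical, and the digest is computed directly rather than incrementally.

```python
import cmath
import math


def dft_digest(x, n):
    """First n unnormalized DFT coefficients of the real sequence x."""
    b = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / b) for t in range(b))
        for k in range(n)
    ]


def approx_inner(dx, dy, b):
    """Estimate sum(x[t] * y[t]) from two digests in O(n) time.

    Assumes real input and n - 1 < b / 2, so each kept frequency k >= 1
    has a mirrored partner b - k that contributes the same real part.
    """
    total = (dx[0] * dy[0].conjugate()).real
    total += 2 * sum((p * q.conjugate()).real for p, q in zip(dx[1:], dy[1:]))
    return total / b


b = 16
x = [3 + 2 * math.cos(2 * math.pi * t / b) for t in range(b)]
y = [1 + math.cos(2 * math.pi * t / b) + 0.5 * math.sin(2 * math.pi * t / b)
     for t in range(b)]

exact = sum(p * q for p, q in zip(x, y))
approx = approx_inner(dft_digest(x, 2), dft_digest(y, 2), b)
```

These two toy series contain only frequencies 0 and 1, so two coefficients already reproduce the exact inner product up to floating-point error. For real data the estimate is only as good as the share of energy captured by the first n coefficients.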
9 Approximate Lagged Correlation
- Inner product with unaligned windows
- The time complexity is reduced from O(b) to O(n^2), as opposed to O(n) for synchronized correlation.
[Figure label: sliding window]
11 Grid Structure (to avoid checking all pairs)
- The DFT coefficients yield a vector. High correlation => closeness in the vector space
  - We can use a grid structure and look in the neighborhood; this returns a superset of the highly correlated pairs.
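The neighborhood lookup can be sketched as follows. This is an illustrative sketch under assumptions made here, not the authors' grid: each stream is represented by a short real-valued digest vector, the cell width equals the distance threshold, and the function name `grid_candidates` is hypothetical. With that cell width, any two vectors within the threshold differ by at most one cell index per dimension, so scanning the 3^d surrounding cells cannot miss a close pair, although it can return pairs that are not close.

```python
import math
from collections import defaultdict
from itertools import product


def grid_candidates(vectors, cell):
    """Return candidate pairs (i, j), i < j, from same or adjacent cells."""

    def key(v):
        return tuple(math.floor(c / cell) for c in v)

    grid = defaultdict(list)
    for sid, v in enumerate(vectors):
        grid[key(v)].append(sid)

    pairs = set()
    for sid, v in enumerate(vectors):
        k = key(v)
        # Visit the cell itself and every neighbouring cell.
        for offset in product((-1, 0, 1), repeat=len(k)):
            neighbour = tuple(a + o for a, o in zip(k, offset))
            for other in grid.get(neighbour, ()):
                if other > sid:
                    pairs.add((sid, other))
    return pairs


vectors = [(0.1, 0.1), (0.2, 0.15), (5.0, 5.0), (0.9, 1.1)]
candidates = grid_candidates(vectors, cell=1.0)
```

Stream 2 sits far from the others and is never paired. Streams 0 and 3 are reported as candidates even though their distance is about 1.28, which is why the result is a superset that still needs an exact check on the surviving pairs.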
13 Empirical Study
- Response time
  - Exact (naïve method): T = k_0 b N^2
14 Empirical Study
- DFT-grid:
  - Updating digests: T_1 = k_1 b N
  - Detecting correlation: T_2 = k_2 N^2
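Taking these cost models at face value, together with the exact-method cost T = k_0 b N^2 from slide 13, the ratio of the two response times can be worked out as below. The slides do not give values for the constants k_0, k_1, k_2, so this is only a reading of the formulas, not a measured speedup.

```latex
\frac{T}{T_1 + T_2}
  = \frac{k_0\, b\, N^2}{k_1\, b\, N + k_2\, N^2}
  = \frac{k_0\, b}{k_1\, b / N + k_2}
  \;\longrightarrow\; \frac{k_0\, b}{k_2}
  \qquad (N \to \infty)
```

Under these models the advantage of the DFT-grid method grows with the basic window size b once N is large enough that the digest update term k_1 b N is small next to k_2 N^2.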
15 Empirical Study (cont.)
- Approximation errors
  - Larger digests, a larger sliding window, and a smaller basic window give a better approximation.
  - The approximation errors are small for the stock data.
- Precision: the quality of the grid structure
17 Future work
- Algorithmic:
  - Dynamic clustering of streams
  - Outlier detection
    - A stream that becomes less correlated with the other streams in its cluster
- Applications:
  - Data-intensive applications requiring correlation among many streams
  - Network traffic monitoring:
    - Unusually high correlation between two links in a network might suggest an anomaly.
  - Medical time series:
    - High correlation between two regions of the human brain during fMRI testing might suggest a functional connection.
  - Some domain-specific definition of correlation might be more appropriate.
    - E.g., in fMRI time series, detrending before correlating.