1 StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time Pankaj Kumar Madhukar Rakesh Kumar Singh Puspendra Kumar Project Instructor:


Slide 1: StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time
Pankaj Kumar Madhukar Rakesh Kumar Singh Puspendra Kumar
Project Instructor: Prof. P.K. Reddy

Slide 2: Goal
- Given tens of thousands of high-speed time series data streams, detect high-value correlations, both synchronized and time-lagged, over sliding windows in real time.
- Real time
  - high update frequency of the data streams
  - fixed response time, online
[Figure: streams marked "Correlated!"]

Slide 3: Our approach
- Naive algorithm
  - N: number of streams
  - w: size of sliding window
  - space O(N) and time O(N^2 w), versus space O(N^2) and time O(N^2).
- Suppose that the streams are updated every second.
  - With a Pentium 4 PC, the exact computing method can only monitor 700 streams with a delay of 2 minutes.
- Our approach
  - Use the Discrete Fourier Transform to approximate correlation
  - Use a grid structure to filter out unlikely pairs
  - Our approach can monitor 10,000 streams with a delay of 2 minutes.
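For reference, the exact baseline can be sketched as follows. This is a minimal illustration, not the presenters' code: the function names `pearson` and `naive_correlated_pairs` and the `threshold` parameter are assumptions made for the example. It recomputes every pair from scratch, which is the O(N^2 w) time variant.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Exact Pearson correlation of two equal-length windows, O(w) time."""
    w = len(x)
    mean_x = sum(x) / w
    mean_y = sum(y) / w
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant window: correlation is undefined, report 0 (assumed convention)
    return cov / math.sqrt(var_x * var_y)

def naive_correlated_pairs(windows, threshold):
    """Check all N(N-1)/2 pairs. `windows` maps stream id -> last w values."""
    hits = []
    for (i, x), (j, y) in combinations(sorted(windows.items()), 2):
        r = pearson(x, y)
        if r >= threshold:
            hits.append((i, j, r))
    return hits

windows = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]}
hits = naive_correlated_pairs(windows, threshold=0.9)
```

Every update repeats this work for all pairs, which is why the exact method stops scaling at a few hundred streams.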

Slide 4: Roadmap
- Goal
- StatStream
  - Data Structure
  - Correlation Approximation
  - Grid structure
- Empirical study
- Future work

Slide 5: Stream synoptic data structure
- Three-level time interval hierarchy
  - Time point, basic window, sliding window
- Basic window (the key to our technique)
  - The computation for basic window i must finish by the end of basic window i+1.
  - The basic window time is the system response time.
- Digests
  - Sliding window digests: sum, DFT coefs
  - Basic window digests: sum, DFT coefs
[Figure: a sliding window divided into basic windows, each made of time points; every basic window carries its own digests.]
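The hierarchy above can be sketched as a small per-stream structure. This is a hedged illustration: the class name `StreamDigest` is hypothetical, and keeping a sum of squares next to the sum is an assumption made here so that a standard deviation could be derived later; the DFT coefficients are left out to keep the sketch short.

```python
from collections import deque

class StreamDigest:
    """Per-stream digests: one (sum, sum of squares) pair per basic window."""

    def __init__(self, basic_window_size, num_basic_windows):
        self.b = basic_window_size
        self.k = num_basic_windows
        self.current = []        # time points of the basic window being filled
        self.digests = deque()   # digests of the completed basic windows
        self.window_sum = 0.0    # sliding window digest: sum
        self.window_sumsq = 0.0  # sliding window digest: sum of squares

    def append(self, value):
        """Add one time point; return True when a basic window was just completed."""
        self.current.append(value)
        if len(self.current) < self.b:
            return False
        s = sum(self.current)
        sq = sum(v * v for v in self.current)
        self.current = []
        self.digests.append((s, sq))
        self.window_sum += s
        self.window_sumsq += sq
        if len(self.digests) > self.k:
            # Slide: drop the digest of the oldest basic window.
            old_s, old_sq = self.digests.popleft()
            self.window_sum -= old_s
            self.window_sumsq -= old_sq
        return True

    def mean(self):
        n = len(self.digests) * self.b
        return self.window_sum / n if n else 0.0

d = StreamDigest(basic_window_size=2, num_basic_windows=2)
completed = [d.append(v) for v in [1, 2, 3, 4, 5, 6]]
```

The sliding window digest changes only when a basic window completes, which matches the statement that the basic window time is the system response time.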

Slide 6: Roadmap
- Motivation and Goal
- Related work
- StatStream
  - Data Structure
  - Correlation Approximation
  - Grid structure
- Empirical study
- Future work

Slide 7: Synchronized Correlation Uses Basic Windows
- Inner product of aligned basic windows
[Figure: stream x and stream y, each showing a sliding window split into aligned basic windows.]
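One way to write out why aligned basic windows suffice is the standard identity for Pearson correlation. The notation is assumed for this sketch: the sliding window holds w = kb time points, split into k basic windows of size b, and X_j, Y_j denote the j-th aligned basic windows of x and y.

```latex
% Assumed notation: w = k b; \bar{x}, \sigma_x are the mean and standard
% deviation of x over the sliding window (likewise for y).
\[
\operatorname{corr}(x, y)
  = \frac{\frac{1}{w}\sum_{i=1}^{w} x_i y_i - \bar{x}\,\bar{y}}{\sigma_x \sigma_y},
\qquad
\sum_{i=1}^{w} x_i y_i = \sum_{j=1}^{k} \langle X_j, Y_j \rangle .
\]
```

When the window slides by one basic window, only one term of the right-hand sum enters and one leaves.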

Slide 8: Approximate Synchronized Correlation
- Approximate with an orthogonal function family (e.g. DFT)
- The inner product of the time series is approximated by the inner product of the digests
- The time and space complexity is reduced from O(b) to O(n).
  - b: size of basic window
  - n: size of the digests (n << b)
- e.g. 120 time points reduce to 4 digests
[Figure: series x_1 to x_8 and y_1 to y_8 projected onto basis functions f_1, f_2, f_3, each sampled at points 1 to 8.]
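A minimal sketch of the digest idea, assuming the unitary DFT as the orthogonal family and real-valued streams. The function names are hypothetical, and the sketch assumes n is at most b/2 so that each kept coefficient above index 0 has a distinct conjugate mirror.

```python
import cmath
import math

def dft_digest(window, n):
    """First n coefficients of the unitary DFT of one basic window, O(b * n) time."""
    b = len(window)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * f * t / b) for t, v in enumerate(window))
        / math.sqrt(b)
        for f in range(n)
    ]

def approx_inner_product(dx, dy):
    """Approximate <x, y> from two digests in O(n) time.

    By Parseval's theorem the full set of b coefficients gives the exact inner
    product. For real signals coefficient b - f is the conjugate of coefficient f,
    so each kept coefficient with f >= 1 is counted twice.
    """
    total = (dx[0] * dy[0].conjugate()).real
    for cx, cy in zip(dx[1:], dy[1:]):
        total += 2.0 * (cx * cy.conjugate()).real
    return total

b = 8
x = [math.cos(2 * math.pi * t / b) for t in range(b)]
y = [math.cos(2 * math.pi * t / b) + 1.0 for t in range(b)]
exact = sum(p * q for p, q in zip(x, y))
approx = approx_inner_product(dft_digest(x, 2), dft_digest(y, 2))
```

In this example the energy of both signals sits entirely in the first two coefficients, so the approximation matches the exact value up to rounding; signals with more high-frequency content lose accuracy as n shrinks.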

Slide 9: Approximate Lagged Correlation
- Inner product with unaligned windows
- The time complexity is reduced from O(b) to O(n^2), as opposed to O(n) for synchronized correlation.
[Figure: sliding windows of two streams offset by a time lag.]

Slide 10: Roadmap
- Motivation and Goal
- Related work
- StatStream
  - Data Structure
  - Correlation Approximation
  - Grid structure
- Empirical study
- Future work

Slide 11: Grid Structure (to avoid checking all pairs)
- The DFT coefficients yield a vector. High correlation => closeness in the vector space
  - We can use a grid structure and look in the neighborhood; this returns a superset of the highly correlated pairs.
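The neighborhood lookup can be sketched as below. This is an illustration under assumptions, not the presenters' implementation: `features` is assumed to map each stream id to a short tuple of floats (for example, parts of its normalized DFT coefficients), and `cell_size` is assumed to be the largest distance at which two streams can still count as correlated.

```python
import math
from collections import defaultdict
from itertools import product

def candidate_pairs(features, cell_size):
    """Return pairs (i, j), i < j, whose feature vectors fall in the same or adjacent cells.

    Two vectors that differ by at most cell_size in every coordinate always land
    in the same or neighboring cells, so the result is a superset of the close pairs.
    """
    grid = defaultdict(list)
    for sid, vec in features.items():
        cell = tuple(math.floor(c / cell_size) for c in vec)
        grid[cell].append(sid)

    pairs = set()
    for cell, members in grid.items():
        # Visit the cell itself and every neighbor (3^d cells for d dimensions).
        for offset in product((-1, 0, 1), repeat=len(cell)):
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            for a in members:
                for b in grid.get(neighbor, ()):
                    if a < b:
                        pairs.add((a, b))
    return pairs

features = {"a": (0.1, 0.1), "b": (0.3, 0.1), "c": (0.9, 0.9), "d": (0.6, 0.1)}
pairs = candidate_pairs(features, cell_size=0.25)
```

The candidates still need an exact or digest-based check afterwards, since adjacent cells can hold vectors that are further apart than the threshold. The number of neighbor cells grows as 3^d, so this only pays off for a small number of dimensions.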

Slide 12: Roadmap
- Motivation and Goal
- Related work
- StatStream
  - Data Structure
  - Correlation Approximation
  - Grid structure
- Empirical study
- Future work

Slide 13: Empirical Study
- Response time
  - Exact (naïve method): T = k_0 b N^2

Slide 14: Empirical Study
- DFT-grid:
  - Updating digests: T_1 = k_1 b N
  - Detecting correlation: T_2 = k_2 N^2
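Putting the two cost models side by side gives a rough sense of the gain. The derivation below uses only the formulas on the slides; the constants k_0, k_1, k_2 are not given in the transcript, so no numeric speedup is claimed.

```latex
\[
\frac{T}{T_1 + T_2}
  = \frac{k_0\, b\, N^2}{k_1\, b\, N + k_2\, N^2}
  = \frac{k_0\, b}{k_1\, b / N + k_2}
  \;\longrightarrow\; \frac{k_0\, b}{k_2}
  \quad (N \to \infty,\ b \text{ fixed}).
\]
```

Under these models the advantage over the exact method grows with the basic window size b once N is large enough for the pair-detection term to dominate.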

Slide 15: Empirical Study (cont.)
- Approximation errors
  - A larger digest size, a larger sliding window and a smaller basic window give a better approximation.
  - The approximation errors are small for the stock data.
- Precision: the quality of the grid structure

Slide 16: Roadmap
- Motivation and Goal
- Related work
- StatStream
  - Data Structure
  - Correlation Approximation
  - Grid structure
- Empirical study
- Future work

Slide 17: Future work
- Algorithmic:
  - dynamic clustering of streams
  - outlier detection
    - a stream that becomes less correlated with the other streams in its cluster
- Applications:
  - Data-intensive applications requiring correlation among many streams.
  - Network traffic monitoring:
    - An unusually high correlation between two links in a network might suggest an anomaly.
  - Medical time series:
    - A high correlation between two regions of the human brain during fMRI testing might suggest a functional connection.
  - A domain-specific definition of correlation might be more appropriate.
    - E.g., in fMRI time series, detrending before correlating.

