Published by Marilyn Collins. Modified over 9 years ago.
1
Efficient Elastic Burst Detection in Data Streams. Yunyue Zhu and Dennis Shasha, Department of Computer Science, Courant Institute of Mathematical Sciences, New York University. SIGKDD 2003.
2
Abstract Burst detection: find abnormal aggregates in data streams. Sliding windows: in some applications, we want to monitor many sliding window sizes simultaneously. Brute force: O(n^2). Shifted Wavelet Tree: near-linear time.
3
Problem Statement For a time series x_1, x_2, …, x_n, we are given a set of window sizes w_1, w_2, …, w_m, an aggregate function F, and a threshold f(w_j), j = 1, 2, …, m, associated with each window size. Monitoring elastic window aggregates of the time series means finding all subsequences, over all window sizes, whose aggregate crosses the threshold for that window size, i.e., all i and j such that F(x[i .. i + w_j - 1]) ≥ f(w_j).
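As a baseline for the problem statement, a brute-force monitor can be sketched as follows. This is only an illustration under assumptions: the aggregate F is taken to be the sum, positions are 0-based, and the name brute_force_bursts and the dictionary layout of the thresholds are hypothetical, not taken from the slides.

```python
from typing import Dict, List, Tuple


def brute_force_bursts(x: List[float],
                       thresholds: Dict[int, float]) -> List[Tuple[int, int]]:
    """Return (start, w) for every window of size w whose sum reaches thresholds[w].

    Assumption: the aggregate F is the sum, and `start` is a 0-based index.
    Every window of every size is examined, which is what the SWT tries to avoid.
    """
    # Prefix sums let each window sum be read off in constant time.
    prefix = [0.0]
    for value in x:
        prefix.append(prefix[-1] + value)

    alarms = []
    for w, f_w in sorted(thresholds.items()):
        for start in range(len(x) - w + 1):
            if prefix[start + w] - prefix[start] >= f_w:
                alarms.append((start, w))
    return alarms
```

For example, brute_force_bursts([1, 0, 5, 6, 0, 1], {2: 10, 3: 11}) reports the size-2 window starting at index 2 and the size-3 windows starting at indices 1 and 2.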
4
Wavelet Tree Haar Wavelet Tree. Level 0: original time series. Level 1: pairwise averages and differences of the adjacent data items at level 0. Level i: pairwise averages and differences of the averages at level i - 1. The wavelet coefficients can represent the trend of the time series.
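The level-by-level construction can be sketched in a few lines. The sketch assumes the input length is a power of two and uses the unnormalized convention (a + b) / 2 and (a - b) / 2 for averages and differences; the slides do not state a normalization, and the function name haar_tree is hypothetical.

```python
from typing import List, Tuple


def haar_tree(x: List[float]) -> Tuple[List[List[float]], List[List[float]]]:
    """Build the levels of a Haar wavelet tree.

    Returns (averages, differences), where averages[0] is the original series,
    averages[i] holds the pairwise averages at level i, and differences[i - 1]
    holds the pairwise differences computed at level i.
    Assumption: len(x) is a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")

    averages = [[float(v) for v in x]]
    differences = []
    current = averages[0]
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        avg = [(current[j] + current[j + 1]) / 2 for j in pairs]
        diff = [(current[j] - current[j + 1]) / 2 for j in pairs]
        averages.append(avg)
        differences.append(diff)
        current = avg
    return averages, differences
```

Under this convention, an average at level i multiplied by 2^i gives the sum of the window it covers, which is how a wavelet coefficient can be turned into a sum aggregate.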
5
Wavelet Tree (cont.) Wavelet coefficient → aggregate: average and difference → sum. Problem: the windows at the same level are non-overlapping, so a burst that straddles a window boundary may go undetected.
6
Shifted Wavelet Tree Add an additional “line” of windows at each level, shifted by half a window size so that they overlap the original windows. They can be maintained explicitly or implicitly.
7
Shifted Wavelet Tree (cont.) Any subsequence of length w, w ≤ 2^i, is included in one of the windows at level i + 1 of the SWT. We say that windows of size w, 2^(i-1) < w ≤ 2^i, are monitored by level i + 1 of the SWT. [Figure: SWT windows at Level 3 and Level 4]
8
SWT Construction For each level i (i ≥ 1): compute the pairwise aggregate (sum) of each two consecutive data items at level i - 1, then downsample by keeping every second item in the series of aggregates; the result is the input for the next higher level of the SWT. Cost: O(n), where n is the time series length.
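The construction steps can be sketched as below. This is a minimal illustration, assuming the sum aggregate and 0-based indexing; the name build_swt and the list-of-lists layout are hypothetical, and the slides do not say how boundaries are handled when the length is not a power of two.

```python
from typing import List


def build_swt(x: List[float]) -> List[List[float]]:
    """Build the levels of a shifted wavelet tree for the sum aggregate.

    levels[0] is the original series. For i >= 1, levels[i][j] is the sum of
    the window of size 2**i that starts at position j * 2**(i - 1), so
    neighbouring windows at a level overlap by half their size.
    """
    levels = [list(x)]
    base = list(x)  # non-overlapping cover taken from the previous level
    while len(base) >= 2:
        # Pairwise aggregate of every two consecutive items of the cover.
        shifted = [base[j] + base[j + 1] for j in range(len(base) - 1)]
        levels.append(shifted)
        # Downsample: every second aggregate gives the next non-overlapping cover.
        base = shifted[::2]
    return levels
```

Each level does work proportional to the length of its cover, which halves at every step, so the total is linear in the length of the series.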
9
Search for a Burst Given a window size w ≤ 2^i and threshold f(w), search in two stages: (1) a potential burst is detected at level i + 1 of the SWT; (2) a detailed search is performed in those subsequences of size 2^i with sum ≥ f(w). Cost: O(k), where k is the number of alarms (output size).
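The two-stage search can be illustrated with the sketch below. It makes several assumptions that are not in the slides: the data are nonnegative (so a window sum bounds the sum of any subsequence inside it), the level i + 1 window sums are read from prefix sums rather than from a maintained SWT, windows are truncated at the end of the series, and the name find_bursts is hypothetical.

```python
from typing import List


def find_bursts(x: List[float], w: int, f_w: float) -> List[int]:
    """Return the 0-based starts of all windows of size w with sum >= f_w.

    Stage 1 filters with windows of size 2**(i + 1) stepped by 2**i, where i is
    the smallest integer with w <= 2**i. Stage 2 searches in detail only inside
    windows that pass the filter. Assumption: all values in x are nonnegative.
    """
    n = len(x)
    prefix = [0.0]
    for value in x:
        prefix.append(prefix[-1] + value)

    i = (w - 1).bit_length()            # smallest i with w <= 2**i
    size, step = 2 ** (i + 1), 2 ** i   # shape of the level i + 1 windows

    alarms = set()
    for start in range(0, n, step):
        end = min(start + size, n)
        # Stage 1: if the enclosing window is below f_w, nothing inside can reach it.
        if prefix[end] - prefix[start] < f_w:
            continue
        # Stage 2: detailed search over the size-w subsequences in this window.
        for s in range(start, end - w + 1):
            if prefix[s + w] - prefix[s] >= f_w:
                alarms.add(s)
    return sorted(alarms)
```

When few enclosing windows reach the threshold, most of the series is skipped after stage 1, which is why the cost depends on the number of alarms.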
10
Streaming Algorithm Assume that new data becomes available at every time unit. The window sizes satisfy 2^L < w_1 < w_2 < … < w_m < 2^U. Maintain levels L+2 to U+1 of the SWT, which monitor those windows. Two methods: online algorithm and batch algorithm.
11
Streaming Algorithm: Online Algorithm Whenever a new data item becomes available, update the 2(U - L) aggregates of the windows in the SWT. If the aggregate at level i exceeds δ_i, perform a detailed search on the windows monitored by level i. For level i, the threshold is δ_i = min f(w_j) over 2^(i-2) < w_j ≤ 2^(i-1). Response time = one time unit.
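The per-level thresholds δ_i can be derived from the per-window thresholds as sketched below. The function name level_thresholds and the dictionary representation are hypothetical; the only rule taken from the slides is that a window size w with 2^(i-2) < w ≤ 2^(i-1) is monitored by level i.

```python
from typing import Dict


def level_thresholds(f: Dict[int, float]) -> Dict[int, float]:
    """Map each SWT level i to delta_i, the smallest f(w) among the window
    sizes w monitored by that level, i.e. 2**(i - 2) < w <= 2**(i - 1).
    """
    delta: Dict[int, float] = {}
    for w, f_w in f.items():
        # (w - 1).bit_length() is the smallest k with w <= 2**k;
        # such a window size is monitored by level k + 1.
        level = (w - 1).bit_length() + 1
        delta[level] = min(f_w, delta.get(level, float("inf")))
    return delta
```

Taking the minimum is what makes the level a safe filter: if the aggregate at level i stays below δ_i, none of the window sizes it monitors can raise an alarm, assuming a monotonically increasing aggregate.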
12
Streaming Algorithm: Batch Algorithm Maintain the aggregates at level L+1 The aggregate in the most recently completed window of level L+1 is updated every time unit. An aggregate of a window at the upper levels will not be computed until all the data in that window are available. Once an aggregate at a certain upper level is updated, we also check alarms for time intervals monitored by that level. Higher throughput, longer response time.
13
Other Aggregates The monitoring of many other aggregates based on elastic windows could benefit from our data structure, as long as the following conditions hold. 1. The aggregate F is monotonically increasing or decreasing with respect to the window, e.g., Max, Count → monotonically increasing; Min → monotonically decreasing. 2. The alarm domain is one-sided, that is, monotonically increasing → [threshold, ∞); monotonically decreasing → (-∞, threshold].
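The monotonicity condition can be checked empirically on a small series with the sketch below. It is a brute-force test for illustration only, not part of the data structure; the helper name is_monotone is hypothetical, and passing the check on one series does not prove that an aggregate is monotonic in general.

```python
from typing import Callable, Sequence


def is_monotone(F: Callable[[Sequence[float]], float],
                x: Sequence[float],
                increasing: bool = True) -> bool:
    """Check on x that F over any window bounds F over every window it contains.

    increasing=True tests F(inner) <= F(outer); increasing=False tests
    F(inner) >= F(outer). Runs in O(n^4) window pairs, so use tiny inputs only.
    """
    n = len(x)
    for c in range(n):
        for d in range(c + 1, n + 1):
            outer = F(x[c:d])
            for a in range(c, d):
                for b in range(a + 1, d + 1):
                    inner = F(x[a:b])
                    if increasing and inner > outer:
                        return False
                    if not increasing and inner < outer:
                        return False
    return True
```

On a sample such as [3, 1, 4, 1, 5], max, len (count) and the spread max - min pass the increasing test and min passes the decreasing test, while the arithmetic mean passes neither.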
14
Extension to Two Dimensions The problem is to report the positions of spatial sliding windows (rectangular regions) of different sizes within which the density exceeds some predefined threshold. The same techniques as in the one-dimensional SWT apply. [Figures: Wavelet Tree 2D; Shifted Wavelet Tree 2D]
15
Effectiveness Study Bursts in the number of times that countries were mentioned in the presidential State of the Union addresses.
16
Effectiveness Study (cont.) A single predefined sliding window size is insufficient. Bursts at large time scales are not necessarily reflected at smaller time scales; they may be composed of many consecutive “bumps”.
17
Effectiveness Study (cont.) Bursts in population distribution data (1990). Window sizes: 1°x1°, 2°x2° and 5°x5° in latitude/longitude.
18
Performance Study Experiments were run on a 1.5 GHz Pentium 4 PC with 512 MB of main memory running Windows 2000. Datasets: (1) The Gamma Ray data set: 12 hours of data from a small region of the sky where gamma ray bursts were actually reported. The data are a time series of the number of photons observed (events) every 0.1 second, with 19,015 events in total. (2) The NYSE TAQ Stock data set: tick-by-tick trading activity of IBM stock between July 1st, 1998 and July 1st, 2002, comprising 5,331,145 trading records (ticks). Each record contains trading time, trading price and trading volume.
19
Performance Study (cont.) Training thresholds: use the first few hours of the Gamma Ray data and the first year of the Stock data as training data. For a window of size w, compute the aggregates on the training data with a sliding window of size w to obtain a vector y of aggregates; the threshold is f(w) = avg(y) + ξ·std(y). Window sizes: 5, 10, …, 5·N_w time units, where N_w is the number of windows and varies from 5 to 50. Time units: 0.1 sec for the Gamma Ray data and 1 min for the Stock data.
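The threshold rule f(w) = avg(y) + ξ·std(y) can be sketched as follows. Two details are assumptions: the aggregate is the sum, and std is computed as the population standard deviation (the slides do not say whether the population or sample form was used). The names train_threshold and window_sizes are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def train_threshold(training: List[float], w: int, xi: float) -> float:
    """Return f(w) = avg(y) + xi * std(y), where y holds the sums of all
    sliding windows of size w over the training data.

    Assumption: std is the population standard deviation (statistics.pstdev).
    """
    y = [sum(training[s:s + w]) for s in range(len(training) - w + 1)]
    return mean(y) + xi * pstdev(y)


def window_sizes(n_w: int) -> List[int]:
    """Window sizes 5, 10, ..., 5 * n_w (in time units)."""
    return [5 * k for k in range(1, n_w + 1)]
```

A larger ξ raises every threshold and so reduces the number of alarms, which matters here because the processing time depends on the output size.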
20
Performance Study (cont.) The processing time of our algorithm is output-dependent.
21
Performance Study (cont.) Experiments on stock data.
22
Performance Study (cont.) Use spread as the aggregate function.
23
Conclusion and Future Work This paper introduces the elastic window model and demonstrates the desirability of the new model. It presents a novel data structure for efficient detection of elastic bursts and other aggregates. Experiments show that our algorithm is faster than a brute-force algorithm by several orders of magnitude. Future work: a robust way of setting the thresholds; non-monotonic aggregates.