Published by Emmalee Voils. Modified over 10 years ago.
1
Sampling From a Moving Window Over Streaming Data
Brian Babcock*, Mayur Datar, Rajeev Motwani
Stanford University
*Speaker
2
Continuous Data Streams
Data streams arise in a number of applications
– IP packets in a network
– Call records (telecom)
– Cash register data (retail sales)
– Sensor networks
Large volumes of data
Online processing
Data is read once and discarded
Memory is limited
3
Why Moving Windows?
Timeliness matters
– Old/obsolete data is not useful
Scalability matters
– Querying the entire history may be impractical
Solution: restrict queries to a window of recent data
– As new data arrives, old data expires
– Addresses timeliness and scalability
4
Two Types of Windows
Sequence-Based
– The most recent n elements from the data stream
– Assumes a (possibly implicit) sequence number for each element
Timestamp-Based
– All elements from the data stream in the last m units of time (e.g., the last week)
– Assumes a (possibly implicit) arrival timestamp for each element
Sequence-based windows are the focus for most of the talk
5
Sampling From a Data Stream
Inputs:
– Sample size k
– Window size n >> k (alternatively, time duration m)
– Stream of data elements that arrive online
Output:
– k elements chosen uniformly at random from the last n elements (alternatively, from all elements that have arrived in the last m time units)
Goal:
– Maintain a data structure that can produce the desired output at any time upon request
6
A Simple, Unsatisfying Approach
Choose a random subset X = {x_1, …, x_k}, X ⊆ {0, 1, …, n-1}
The sample always consists of the non-expired elements whose indexes are equal to x_1, …, x_k (modulo n)
Uses only O(k) memory
Technically produces a uniform random sample of each window, but unsatisfying because the sample is highly periodic
Unsuitable for many real applications, particularly those with periodicity in the data
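A minimal Python sketch of the periodic approach on this slide. The class name `PeriodicSampler`, its methods, and the 0-based indexing are illustrative assumptions, not part of the talk.

```python
import random


class PeriodicSampler:
    """Keeps the elements whose index (mod n) falls in a fixed random set X."""

    def __init__(self, k, n, rng=None):
        self.n = n
        rng = rng or random.Random()
        # X: k distinct residues drawn once, uniformly from {0, ..., n-1}.
        self.slots = {x: None for x in rng.sample(range(n), k)}
        self.count = 0  # number of elements seen so far (0-based next index)

    def add(self, value):
        r = self.count % self.n
        if r in self.slots:
            # The previous occupant of this slot arrived exactly n elements
            # ago, so it expires at the moment this element arrives.
            self.slots[r] = (self.count, value)
        self.count += 1

    def sample(self):
        # (index, value) pairs; fewer than k until the stream reaches n elements.
        return [entry for entry in self.slots.values() if entry is not None]
```

Because X is fixed once, the sampled positions repeat every n elements, which is the periodicity the slide objects to.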
7
Another Simple Approach: Oversample
As each element arrives, remember it with probability p = (c k log n)/n for a constant c; otherwise discard it
Discard elements when they expire
When asked to produce a sample, choose k elements at random from the set in memory
Expected memory usage is O(k log n)
Memory usage is also O(k log n) with high probability (whp)
The algorithm can fail if fewer than k elements from a window are remembered; however, whp this will not happen
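The oversampling idea can be sketched in Python roughly as follows. The class name, the default `c=4.0`, the use of the natural logarithm, and capping p at 1 are assumptions made for illustration; the slide only fixes the form of p up to a constant.

```python
import math
import random
from collections import deque


class OversampleSampler:
    def __init__(self, k, n, c=4.0, rng=None):
        self.k, self.n = k, n
        self.rng = rng or random.Random()
        # p = c*k*log(n)/n, capped at 1. The constant c is a hypothetical
        # tuning knob: larger c lowers the failure probability.
        self.p = min(1.0, c * k * math.log(n) / n)
        self.kept = deque()  # remembered (index, value) pairs, oldest first
        self.count = 0       # 0-based index of the next arriving element

    def add(self, value):
        # The window after this arrival covers indices count-n+1 .. count.
        oldest_valid = self.count - self.n + 1
        while self.kept and self.kept[0][0] < oldest_valid:
            self.kept.popleft()  # discard expired elements
        if self.rng.random() < self.p:
            self.kept.append((self.count, value))
        self.count += 1

    def sample(self):
        if len(self.kept) < self.k:
            # The failure case from the slide: too few elements remembered.
            raise RuntimeError("fewer than k elements remembered")
        return self.rng.sample(list(self.kept), self.k)
```

Raising an exception on failure is one possible design choice; a caller could equally return a short sample.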
8
Reservoir Sampling
Classic online algorithm due to Vitter (1985)
Maintains a fixed-size uniform random sample
– Size of the data stream need not be known in advance
Data structure: reservoir of k data elements
As the ith data element arrives:
– Add it to the reservoir with probability p = k/i, discarding a randomly chosen data element from the reservoir to make room (the first k elements simply fill the reservoir)
– Otherwise (with probability 1-p) discard it
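For reference, a short Python sketch of reservoir sampling as described on this slide. The function name and the optional `rng` parameter are illustrative assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of k items from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # p = k/i >= 1: the first k elements simply fill the reservoir.
            reservoir.append(item)
        elif rng.random() < k / i:
            # Keep the ith element with probability k/i, evicting a
            # uniformly chosen resident to make room.
            reservoir[rng.randrange(k)] = item
    return reservoir
```

Note that the sample is drawn from the whole stream seen so far, not from a recent window.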
9
Why It Doesn't Work With Moving Windows
Suppose an element in the reservoir expires
Need to replace it with a randomly chosen element from the current window
However, in the data stream model we have no access to past data
Could store the entire window, but this would require O(n) memory
10
Chain-Sample
Include each new element in the sample with probability 1/min(i,n)
As each element is added to the sample, choose the index of the element that will replace it when it expires
When the ith element expires, the window will be (i+1…i+n), so choose the index from this range
Once the element with that index arrives, store it and choose the index that will replace it in turn, building a chain of potential replacements
When an element is chosen to be discarded from the sample, discard its chain as well
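The steps on this slide can be sketched in Python for a sample of size 1. The class and attribute names are hypothetical, indices are 1-based to match the slide, and handling a sample of size k (for example by running k independent copies) is an assumption left outside this sketch.

```python
import random
from collections import deque


class ChainSampler:
    """Chain-sample for a single sample over the last n elements (sketch)."""

    def __init__(self, n, rng=None):
        self.n = n
        self.rng = rng or random.Random()
        self.chain = deque()    # (index, value); chain[0] is the current sample
        self.next_index = None  # index chosen to replace the last chain element
        self.count = 0          # 1-based index of the most recent element

    def add(self, value):
        self.count += 1
        i = self.count
        # The window is now i-n+1 .. i; at most one stored element can expire.
        if self.chain and self.chain[0][0] <= i - self.n:
            self.chain.popleft()  # its successor in the chain takes over
        if self.rng.random() < 1.0 / min(i, self.n):
            # The new element becomes the sample: drop the old sample
            # together with its whole chain.
            self.chain.clear()
            self.chain.append((i, value))
            self.next_index = self.rng.randint(i + 1, i + self.n)
        elif i == self.next_index:
            # This element was pre-selected as a replacement: store it and
            # pick the index of its own replacement in turn.
            self.chain.append((i, value))
            self.next_index = self.rng.randint(i + 1, i + self.n)

    def sample(self):
        return self.chain[0] if self.chain else None
```

The chain is never empty after the first arrival: the replacement index of an expiring element always lies in the window that remains once it expires.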
11
Example
[Figure: chain-sample walkthrough over a stream of element values: 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3]
12
Memory Usage of Chain-Sample
Let T(x) denote the expected length of the chain from the element with index i when the most recent index is i+x
$$T(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 + \frac{1}{n} \sum_{j<x} T(j) & \text{for } x \ge 1 \end{cases}$$
The expected length of each chain is less than $T(n) \le e \approx 2.718$
Expected memory usage is O(k)
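As a numeric check of the recurrence on this slide, the sketch below evaluates T(x) directly. It assumes the recurrence is applied from x = 0 (so T(0) = 1) with T(j) = 0 for j < 0; under that assumption T(x) = (1 + 1/n)^x. Starting the recurrence at x = 1 instead only shifts the exponent by one, and T(n) stays below e either way. The function name is hypothetical.

```python
import math


def expected_chain_length(n, x):
    """Evaluate T(x) = 1 + (1/n) * sum_{j < x} T(j), with T(j) = 0 for j < 0."""
    running = 0.0  # sum of T(j) for 0 <= j < current x
    value = 0.0
    for _ in range(x + 1):
        value = 1.0 + running / n
        running += value
    return value


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, expected_chain_length(n, n), (1 + 1 / n) ** n, math.e)
```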
13
Memory Usage of Chain-Sample
Chain consists of hops with lengths 1…n
Chain of length j can be represented by partition of n into j ordered integer parts
– j-1 hops with sum less than n plus a remainder
Each such partition has probability $n^{-j}$
Number of such partitions is $\binom{n}{j} < (ne/j)^j$
Probability of any such partition is small [$O(n^{-c})$] when j = O(k log n)
Uses O(k log n) memory whp
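One way to read the counting argument on this slide, written out for a single chain as a sketch: multiply the number of partitions by the probability of each one (a union bound), then choose j large enough. The thresholds j ≥ e² and j ≥ c ln n are assumptions chosen to make the arithmetic simple.

```latex
\Pr[\text{chain has length } j]
  \;\le\; \binom{n}{j}\, n^{-j}
  \;<\; \left(\frac{ne}{j}\right)^{j} n^{-j}
  \;=\; \left(\frac{e}{j}\right)^{j}.

\text{If } j \ge e^{2} \text{ and } j \ge c \ln n:\qquad
\left(\frac{e}{j}\right)^{j} \;\le\; e^{-j} \;\le\; e^{-c \ln n} \;=\; n^{-c}.
```

So a single chain has length O(log n) whp, which gives O(k log n) over k chains.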
14
Comparison of Algorithms
Chain-sample is preferable to oversampling:
– Better expected memory usage: O(k) vs. O(k log n)
– Same high-probability memory bound of O(k log n)
– No chance of failure due to sample size shrinking below k
Algorithm      Expected      High-Probability
Periodic       O(k)          O(k)
Oversample     O(k log n)    O(k log n)
Chain-Sample   O(k)          O(k log n)
15
Timestamp-Based Windows
Window at time t consists of all elements whose arrival timestamp is at least t' = t-m
The number of elements in the window is not known in advance and may vary over time
None of the previous algorithms will work
– All require windows with a constant, known number of elements
16
Priority-Sample
We describe priority-sample for k=1
Assign each element a randomly chosen priority
The element with the highest priority is the sample
An element is ineligible if there is another element with a later timestamp and higher priority
Only store eligible, non-expired elements
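A Python sketch of priority-sample for k=1 as described on this slide. The class name, the choice of priorities uniform in [0, 1), and the assumption that timestamps arrive in non-decreasing order are illustrative and not taken from the talk.

```python
import random
from collections import deque


class PrioritySampler:
    """Priority-sample for one sample over the last m time units (sketch)."""

    def __init__(self, m, rng=None):
        self.m = m
        self.rng = rng or random.Random()
        # Eligible elements in arrival order as (timestamp, priority, value).
        # Priorities never increase from front to back.
        self.eligible = deque()

    def _expire(self, now):
        # The window at time `now` holds timestamps >= now - m.
        while self.eligible and self.eligible[0][0] < now - self.m:
            self.eligible.popleft()

    def add(self, timestamp, value):
        priority = self.rng.random()
        # The new element is later than everything stored, so any stored
        # element with a lower priority becomes ineligible.
        while self.eligible and self.eligible[-1][1] < priority:
            self.eligible.pop()
        self.eligible.append((timestamp, priority, value))
        self._expire(timestamp)

    def sample(self, now):
        self._expire(now)
        # The front element has the highest priority among non-expired elements.
        return self.eligible[0][2] if self.eligible else None
```

When the current sample expires, the next eligible element is already stored, so no past data has to be revisited.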
17
Memory Usage of Priority-Sample
Imagine that the elements were stored in a treap totally ordered by arrival timestamp and heap-ordered by priority
The eligible elements would represent the right spine of the treap
We only store the eligible elements
Therefore expected memory usage is O(log n), or O(k log n) for samples of size k
O(k log n) is also an upper bound (whp)
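A small Python sketch of why the expected count is logarithmic, assuming independent priorities with no ties. The element that is ith from the end is eligible exactly when it has the highest priority among the last i elements, which happens with probability 1/i, so the expected number of eligible elements among n is the harmonic number H_n ≈ ln n. Both function names are hypothetical.

```python
def count_eligible(priorities):
    """Count elements that have no later element with a higher priority."""
    count, best = 0, float("-inf")
    for p in reversed(priorities):  # scan from newest to oldest
        if p >= best:
            count += 1
            best = p
    return count


def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n, the expected number of eligible elements."""
    return sum(1.0 / i for i in range(1, n + 1))
```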
18
Conclusion
Our contributions:
– Introduced the problem of maintaining a sample over a moving window from a data stream
– Developed the Chain-Sample algorithm for this problem with sequence-based windows
– Developed the Priority-Sample algorithm for this problem with timestamp-based windows
Future work:
– What else can be computed in sublinear space over moving windows on data streams?
– For example: The next talk!