1
1 Distributed Streams Algorithms for Sliding Windows Phillip B. Gibbons, Srikanta Tirthapura
2
2 Abstract Algorithms for estimating aggregate functions over a “sliding window” of the N most recent data items in one or more streams.
3
3 Single stream The first ε-approximation scheme for the number of 1’s in a sliding window. The first ε-approximation scheme for the sum of integers in [0..R] in a sliding window. Both algorithms are optimal in worst-case time and space. Both algorithms are deterministic.
4
4 Distributed Streams The first randomized (ε, δ)-approximation scheme for the number of 1’s in a sliding window over the union of distributed streams.
5
5 Usage Network monitoring, data warehousing, telecommunications, sensor networks.
6
6 Multiple Data Sources - Distributed Stream Model. Only the most recent data is important - “Sliding Window”.
7
7 The Goal of the Algorithms Approximating a function F while minimizing: 1. The total memory. 2. The time taken by each party to process a data item. 3. The time to produce an estimate (query time).
8
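8 Definition 1 An (ε, δ)-approximation scheme for a quantity X: a randomized procedure that, given any positive ε < 1 and δ < 1, computes an estimate of X that is within a relative error of ε with probability at least 1 - δ. An ε-approximation scheme: a deterministic procedure that, given any positive ε < 1, computes an estimate whose worst-case relative error is at most ε.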
9
9 An Example for Basic Counting Problem
10
10 Algorithms for Distributed Streams Each party observes only its own stream. Each party communicates with other parties only when an estimate is requested. Each party sends a message to a Referee, who computes the estimate.
11
11 The Idea Store a wave consisting of many random samples of the stream. Samples that contain only the recent items are sampled with high probability, while those containing old items are sampled with lower probability.
12
12 Contributions Introducing a data structure called waves. Presenting the first ε-approximation scheme for Basic Counting. Presenting the first ε-approximation scheme for the sum of integers in [0..R]. Both are optimal in worst-case space, processing time and query time.
13
13 Contributions Presenting the first randomized (ε, δ)-approximation scheme for the number of 1’s in a sliding window over the union of distributed streams.
14
14 Related Work From the paper of Datar et al.: using the Exponential Histogram data structure.
15
15 Exponential Histogram Maintain more information about recently seen items and less about old items. The k0 most recent 1’s are assigned to individual buckets. The k1 next most recent 1’s are assigned to buckets of size 2. The k2 next most recent 1’s are assigned to buckets of size 4. And so on, until every 1 among the last N items is assigned to some bucket.
16
16 Exponential Histogram Each ki is either … or …. The last bucket is discarded if its position no longer falls within the window. If the new item is a 1, it is assigned to a new bucket of size 1. If this makes k0 too large, the two least recent buckets of size 1 are merged to form a bucket of size 2. If k1 is now too large, the two least recent buckets of size 2 are merged, and so on, resulting in a cascade of up to log N bucket merges in the worst case. The approach using waves avoids this cascading.
17
17 The Basic Wave Assumption: 1/ε is an integer. Counters: 1. pos - the current length of the stream. 2. rank - the current number of 1’s in the stream. The wave contains the positions of the recent 1’s in the stream, arranged at different “levels”. For i = 0, 1, .., l-1, level i contains the positions of the 1/ε + 1 most recent 1-bits whose 1-rank is a multiple of 2^i.
18
18 An Example for Basic Wave The crest of the wave is always over the largest 1-rank. N = 48, 1/ε = 3, l = 5.
19
19 Estimation Steps:
1. Let s = max(0, pos-n+1). (We estimate the number of 1’s in [s, pos].)
2. Let p1 be the maximum position in the wave less than s, and p2 the minimum position greater than or equal to s. Let r1 and r2 be the 1-ranks of p1 and p2 respectively.
3. Return rank-r+1, where r = r2 if r2-r1 = 1, and otherwise r = (r1+r2)/2.
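The basic wave and the estimation steps above can be sketched in Python. This is a minimal illustration, not the authors' code. The class name `BasicWave`, the level count ceil(log2(2εN)) (chosen to match the example N = 48, 1/ε = 3, l = 5), the 1/ε + 1 entries kept per level, and the sentinel entry at position 0 with rank 0 are all assumptions of the sketch. Positions are counted from 1, and the query scans the whole wave instead of running in O(1) time.

```python
import math
from collections import deque

class BasicWave:
    def __init__(self, N, inv_eps):
        self.pos = 0    # current length of the stream
        self.rank = 0   # current number of 1's in the stream
        # Assumed number of levels: ceil(log2(2 * eps * N)), at least 1.
        self.levels = max(1, math.ceil(math.log2(2 * N / inv_eps)))
        # Level i keeps the (position, rank) pairs of the 1/eps + 1 most recent
        # 1-bits whose rank is a multiple of 2**i. A sentinel (0, 0) stands for
        # a virtual 1 before the stream starts (assumption of this sketch).
        self.wave = [deque([(0, 0)], maxlen=inv_eps + 1)
                     for _ in range(self.levels)]

    def update(self, bit):
        self.pos += 1
        if bit:
            self.rank += 1
            for i, level in enumerate(self.wave):
                if self.rank % (1 << i) == 0:
                    level.append((self.pos, self.rank))  # oldest entry drops out

    def estimate(self, n):
        """Estimate the number of 1's among the last n <= N items."""
        s = max(1, self.pos - n + 1)
        entries = {e for level in self.wave for e in level}
        inside = [r for p, r in entries if p >= s]
        if not inside:
            return 0                      # no 1-bit lies in the window
        r1 = max(r for p, r in entries if p < s)
        r2 = min(inside)
        r = r2 if r2 - r1 == 1 else (r1 + r2) / 2
        return self.rank - r + 1
```

Calling `update(bit)` once per stream item and then `estimate(n)` for any n up to N gives a value whose relative error stays within ε under the assumptions listed above.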
20
20 LEMMA 1 The procedure returns an estimate that is within a relative error of ε of the actual number of 1’s in the window.
21
21 Proof Let j be the smallest numbered level containing position p1. By returning the midpoint of the range [r1, r2], we guarantee that the absolute error is at most (r2-r1)/2. There is at most a 2^j gap between r1 and its next larger rank r2, so the absolute error in our estimate is at most 2^(j-1). Let r3 be the earliest 1-rank at level j-1. Then r3 > r1, and hence r3 >= r2. By definition, level j-1 holds 1/ε + 1 ranks spaced 2^(j-1) apart, so the window contains at least rank-r3 >= (1/ε) 2^(j-1) 1’s, and the relative error is at most ε.
22
22 Improvement Use modulo-N’ counters for pos and rank, and store the positions in the wave as modulo-N’ numbers, so that each takes only log N’ bits. Keep track of both the largest 1-rank discarded (r1) and the smallest 1-rank still in the wave (r2), so that the number of 1’s can be answered in O(1) time. Instead of storing a single position in multiple levels, store each position only at its maximal level.
23
23 Improvement
24
24 Improvement The positions at each level are stored in a fixed-length queue, so that each time a new position is added, the position at the end of the queue is removed. A doubly linked list of the positions in the wave is maintained in increasing order. Storing the difference between consecutive positions instead of the absolute positions reduces the space from … to ….
25
25 The deterministic wave algorithm Upon receiving a stream bit b:
1. Increment pos (modulo N’ = 2N).
2. If the head (p, r) of the linked list L has expired (p <= pos-N), then discard it from L and from its queue, and store r as the largest 1-rank discarded.
3. If b = 1 then do:
(a) Increment rank, and determine the corresponding wave level j, the largest j such that rank is a multiple of 2^j.
(b) If the level j queue is full, discard the tail of the queue and splice it out of L.
(c) Add (pos, rank) to the head of the level j queue and the tail of L.
26
26 Answering a query for a sliding window of size N:
1. Let r1 be the largest 1-rank discarded. (If there is no such r1, return rank as the exact answer.) Let r2 be the 1-rank at the head of the linked list L. (If L is empty, return 0.)
2. Return rank-r+1, where r = r2 if r2-r1 = 1, and otherwise r = (r1+r2)/2.
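The update and query steps of the deterministic wave can be sketched as follows. This is a hedged illustration under several assumptions that do not come from the slides: the class name `DeterministicWave`, the queue lengths (1/ε + 1 at the top level and half of that, rounded up, at the other levels), the level count ceil(log2(2εN)), and `r1` starting at 0 to stand for a virtual 1 before the stream begins. Counters are plain integers, not modulo-N’ values, and an `OrderedDict` plays the role of the doubly linked list L.

```python
import math
from collections import OrderedDict, deque

class DeterministicWave:
    def __init__(self, N, inv_eps):
        self.N = N
        self.pos = 0
        self.rank = 0
        self.r1 = 0  # largest 1-rank discarded by expiry (0 = virtual start)
        self.levels = max(1, math.ceil(math.log2(2 * N / inv_eps)))
        full = inv_eps + 1
        # Assumed queue lengths: ceil(full / 2) below the top level, full at the top.
        self.caps = [(full + 1) // 2] * (self.levels - 1) + [full]
        self.queues = [deque() for _ in self.caps]
        self.L = OrderedDict()  # position -> (rank, level), oldest first

    def update(self, bit):
        self.pos += 1
        if self.L:  # step 2: expire the head of L if it left the window
            p, (r, j) = next(iter(self.L.items()))
            if p <= self.pos - self.N:
                del self.L[p]
                self.queues[j].popleft()
                self.r1 = r
        if bit:     # step 3
            self.rank += 1
            # Largest j with rank a multiple of 2**j, capped at the top level.
            tz = (self.rank & -self.rank).bit_length() - 1
            j = min(tz, self.levels - 1)
            q = self.queues[j]
            if len(q) == self.caps[j]:
                del self.L[q.popleft()]  # splice the oldest level-j entry out of L
            q.append(self.pos)
            self.L[self.pos] = (self.rank, j)

    def query(self):
        """Estimate the number of 1's among the last N items."""
        if self.pos <= self.N:
            return self.rank             # nothing has left the window yet
        if not self.L:
            return 0
        r2 = next(iter(self.L.values()))[0]
        r = r2 if r2 - self.r1 == 1 else (self.r1 + r2) / 2
        return self.rank - r + 1
```

Each update touches at most one expired entry and one queue, so both `update` and `query` take constant time in this sketch.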
27
27 Wave algorithm: Space - …. Processing time for each item - O(1). Estimate time - O(1). Related work (Datar et al.): Space - …. Processing time for each item - O(log(εN)).
28
28 Sum of Bounded Integers The sum over a sliding window can range from 0 to NR. Let N’ be the smallest power of 2 greater than or equal to 2RN. Counters (modulo N’): pos - the current length of the stream; total - the running sum. There are l = log(2εNR) levels. A triple (p, v, z) is stored for each item: p - the position of the data item, v - the value of the data item, z - the partial sum through this item.
29
29 The answer to a query is the midpoint of the interval [total-z2+v2, total-z1).
30
30 The Algorithm for the sum of the last N items in a data stream Upon receiving a stream value v between 0 and R:
1. Increment pos (modulo N’).
2. If the head (p, v’, z) of the linked list L has expired (p <= pos-N), then discard it from L and from its queue, and store z as the largest partial sum discarded.
3. If v > 0 then do:
(a) Determine the largest j such that some number in (total, total+v] is a multiple of 2^j. Add v to total.
(b) If the level j queue is full, discard the tail of the queue and splice it out of L.
(c) Add (pos, v, total) to the head of the level j queue and the tail of L.
31
31 Step 3a The desired wave level is the largest position j such that some number y in the interval (total, total+v] has 0’s in all bit positions less than j. For such a y, y-1 and y differ in bit position j. If bit j changes from 1 to 0 at any point in [total, total+v], then j is not the largest. Hence j is the position of the most-significant bit that is 0 in total and 1 in total+v, that is, the most-significant bit that is 1 in the bitwise XOR of total and total+v.
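The XOR rule of step 3a can be checked against a direct search. The function names below are hypothetical, and the sketch ignores both the cap at the top wave level and the modulo-N’ arithmetic.

```python
def wave_level(total, v):
    """Most-significant set bit of total XOR (total + v), for v > 0."""
    return (total ^ (total + v)).bit_length() - 1

def wave_level_bruteforce(total, v):
    """Largest j such that some y in (total, total + v] is a multiple of 2**j."""
    j = 0
    while any(y % (1 << (j + 1)) == 0 for y in range(total + 1, total + v + 1)):
        j += 1
    return j

# total = 3 (binary 011), v = 2: the interval (3, 5] contains 4, so the level is 2.
print(wave_level(3, 2), wave_level_bruteforce(3, 2))
```

The two functions agree because a multiple of 2^j lies in (total, total+v] exactly when total and total+v differ after dropping their j lowest bits.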
32
32 Answering a query for a sliding window of size N:
1. Let z1 be the largest partial sum discarded from L. (If there is no such z1, return total as the exact answer.) Let (p2, v2, z2) be the head of the linked list L. (If L is empty, return 0.)
2. Return total - (z1+z2-v2)/2.
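The sum algorithm and its query can be sketched along the same lines. As before, this is an illustration under assumptions that are not in the slides: the class name `SumWave`, the queue lengths (1/ε + 1 at the top level, half of that rounded up elsewhere), the level count ceil(log2(2εNR)), `z1` starting at 0, plain integer counters in place of modulo-N’ ones, and an `OrderedDict` standing in for the linked list L.

```python
import math
from collections import OrderedDict, deque

class SumWave:
    def __init__(self, N, R, inv_eps):
        self.N = N
        self.pos = 0
        self.total = 0   # running sum of the whole stream
        self.z1 = 0      # largest partial sum discarded by expiry
        self.levels = max(1, math.ceil(math.log2(2 * N * R / inv_eps)))
        full = inv_eps + 1
        self.caps = [(full + 1) // 2] * (self.levels - 1) + [full]  # assumed sizes
        self.queues = [deque() for _ in self.caps]
        self.L = OrderedDict()  # position -> (v, z, level), oldest first

    def update(self, v):
        self.pos += 1
        if self.L:  # step 2: expire the head of L if it left the window
            p, (_, z, j) = next(iter(self.L.items()))
            if p <= self.pos - self.N:
                del self.L[p]
                self.queues[j].popleft()
                self.z1 = z
        if v > 0:   # step 3
            # Step 3a: most-significant bit of total XOR (total + v), capped.
            j = min((self.total ^ (self.total + v)).bit_length() - 1,
                    self.levels - 1)
            self.total += v
            q = self.queues[j]
            if len(q) == self.caps[j]:
                del self.L[q.popleft()]
            q.append(self.pos)
            self.L[self.pos] = (v, self.total, j)

    def query(self):
        """Estimate the sum of the last N items."""
        if self.pos <= self.N:
            return self.total            # nothing has left the window yet
        if not self.L:
            return 0
        v2, z2, _ = next(iter(self.L.values()))
        return self.total - (self.z1 + z2 - v2) / 2
```

The returned value is the midpoint of the interval of sums consistent with the stored partial sums z1 and z2 - v2.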
33
33 Wave algorithm: Space - O((1/ε)(log N + log R)) memory words of O(log N + log R) bits each. Processing time for each item - O(1). Estimate time - O(1). Related work (Datar et al.): Space - O((1/ε)(log N + log R)) buckets of log N + log(log N + log R) bits each. Processing time for each item - O(log N + log R).
34
34 Distributed Streams Three definitions of a sliding window over a collection of t > 1 distributed streams:
1. Seeking the total number of 1’s in the last N items in each of the t streams (tN items in total).
2. A single logical stream has been split arbitrarily among the parties. Each party receives items that include a sequence number in the logical stream. Seeking the total number of 1’s in the last N items in the logical stream.
3. Seeking the total number of 1’s in the last N items in the position-wise union of the t streams.
35
35 Solution for First Scenario: Apply the single stream algorithm to each stream. To answer a query, each party sends its count to the Referee. The Referee sums the answers. Because each individual count is within ε relative error, so is the total.
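The last claim follows from the triangle inequality. Writing x_j for party j's true count and \hat{x}_j for its estimate (notation introduced here for illustration only):

```latex
\left|\sum_{j=1}^{t}\hat{x}_j-\sum_{j=1}^{t}x_j\right|
\;\le\; \sum_{j=1}^{t}\left|\hat{x}_j-x_j\right|
\;\le\; \sum_{j=1}^{t}\epsilon\,x_j
\;=\; \epsilon\sum_{j=1}^{t}x_j .
```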
36
36 Solution for Second Scenario: To answer a query, each party sends its wave to the Referee. The Referee computes the maximum sequence number over all the parties, uses each wave to obtain an estimate over the resulting window, and sums the results. Because each individual count is within ε relative error, so is the total.
37
37 Randomized Waves A randomized wave contains the positions of the recent 1’s in the data stream, stored at different levels. Each level i contains the most recently selected positions of the 1-bits, where a position is selected into level i with probability 2^-i. The deterministic wave selects 1 out of every 2^i 1-bits, at regular intervals. A randomized wave selects an expected 1 out of every 2^i 1-bits, at random intervals. The randomized wave retains more positions per level.
38
38 The Basic Randomized Wave Let N’ be the smallest power of 2 that is at least 2N. Let d = log N’. Let ε < 1 be the desired relative error. Each party Pj maintains a basic randomized wave for its stream consisting of d+1 queues, Qj(0),..,Qj(d), one for each level. A pseudo-random hash function h maps positions to levels according to an exponential distribution.
39
39 The Steps for Maintaining the Randomized Wave Party Pj, upon receiving a stream bit b:
1. Increment pos (modulo N’ = 2N).
2. Discard any position p at the tail of a queue that has expired (p <= pos-N).
3. If b = 1, then for l = 0,..,h(pos) do:
(a) If the level l queue Qj(l) is full, then discard the tail of Qj(l).
(b) Add pos to the head of Qj(l).
The sample for each level, stored in a queue, contains the c/ε^2 most recent positions selected into the level (c = 36).
40
40 Consider a queue Qj(l) whose tail is position i. Then Qj(l) contains all the 1-bits in the interval [i, pos] whose positions hash to a value greater than or equal to l. Call [i, pos] the range of Qj(l). As we move from level l to l+1, the range may increase. The queues at lower numbered levels may have ranges that fail to contain the window, but as we move to higher levels, we will find a level whose range contains the window.
41
41 Answering a query for a sliding window of size n <= N After each party has observed pos bits:
1. Each party j sends its wave, {Qj(0),..,Qj(log N’)}, to the Referee. Let s = max(0, pos-n+1). Then W = [s, pos] is the desired window.
2. For j = 1,..,t, let lj be the minimum level such that the tail of Qj(lj) is a position p <= s.
3. Let l* = max{lj}, j = 1,..,t. Let U be the union of all positions in Q1(l*),..,Qt(l*).
4. Return 2^l* · |U ∩ W|.
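The randomized wave and the Referee's computation can be sketched as below. Several details are assumptions made for illustration and do not come from the slides: the SHA-256 based hash `level_of` (any hash shared by all parties would do), the queue length c/ε^2, the names `RandomizedWave` and `referee_estimate`, and the test for whether a queue covers the window. The sketch does not inspect the tail of each queue. It records the newest position each queue dropped for lack of room and treats the queue as covering the window when that position lies before s. Positions are counted from 1 and all parties are assumed to have seen the same number of bits.

```python
import hashlib
import math
from collections import deque

def level_of(pos, d, seed=0):
    """Hypothetical shared hash h: result >= l with probability 2**-l, for l <= d."""
    digest = hashlib.sha256(f"{seed}:{pos}".encode()).digest()
    x = int.from_bytes(digest[:8], "big") | (1 << d)  # bit d set caps the result at d
    return (x & -x).bit_length() - 1                  # number of trailing zero bits

class RandomizedWave:
    def __init__(self, N, eps, c=36):
        self.N = N
        self.pos = 0
        self.d = (2 * N - 1).bit_length()   # d = log2(N'), N' = smallest power of 2 >= 2N
        self.cap = math.ceil(c / eps ** 2)  # assumed queue length
        self.queues = [deque() for _ in range(self.d + 1)]
        self.evicted = [0] * (self.d + 1)   # newest position dropped from a full queue

    def update(self, bit):
        self.pos += 1
        for q in self.queues:               # step 2: drop the expired position, if any
            if q and q[0] <= self.pos - self.N:
                q.popleft()
        if bit:                             # step 3: insert into levels 0..h(pos)
            for l in range(level_of(self.pos, self.d) + 1):
                q = self.queues[l]
                if len(q) == self.cap:
                    self.evicted[l] = q.popleft()
                q.append(self.pos)

def referee_estimate(waves, n):
    """Estimate the number of 1's in the position-wise union over the last n items."""
    pos = waves[0].pos
    s = max(1, pos - n + 1)

    def min_level(w):  # lowest level whose queue still holds every sample in the window
        return next((l for l in range(w.d + 1) if w.evicted[l] < s), w.d)

    l_star = max(min_level(w) for w in waves)
    union = set().union(*(w.queues[l_star] for w in waves))
    return (1 << l_star) * sum(1 for p in union if p >= s)
```

When no queue has overflowed, l* is 0 and the estimate equals the exact union count. Sampling error only appears once the level 0 queues are too short to cover the window.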
42
42 The algorithm returns an estimate for the Union Counting Problem, for any sliding window of size n <= N, that is within a relative error ε with probability greater than 2/3. Space - …