Distributed Streams Algorithms for Sliding Windows. Phillip B. Gibbons, Srikanta Tirthapura.

Presentation transcript:

1 Distributed Streams Algorithms for Sliding Windows Phillip B. Gibbons, Srikanta Tirthapura

2 Abstract Algorithms for estimating aggregate functions over a "sliding window" of the N most recent data items in one or more streams.

3 Single stream The first ε-approximation scheme for the number of 1's in a sliding window. The first ε-approximation scheme for the sum of integers in [0..R] in a sliding window. Both algorithms are deterministic and optimal in worst-case time and space.

4 Distributed Streams The first randomized ε-approximation scheme for the number of 1's in a sliding window over the union of distributed streams.

5 Usage Network Monitoring Data Warehousing Telecommunications Sensor Networks

6 Multiple Data Sources - the Distributed Stream Model. Only the most recent data is important - the "Sliding Window".

7 The Goal Approximate a function F while minimizing: 1. the total memory, 2. the time taken by each party to process a data item, 3. the time to produce an estimate (query time).

8 Definition 1 - An (ε,δ)-approximation scheme for a quantity X: a randomized procedure that, given any positive ε < 1 and δ < 1, computes an estimate of X that is within relative error ε with probability at least 1 - δ. ε-approximate: an estimate whose worst-case relative error is at most ε.

9 An Example of the Basic Counting Problem

10 Algorithms for Distributed Streams Each party observes only its own stream. Parties communicate only when an estimate is requested: each party then sends a message to a Referee, who computes the estimate.

11 The Idea Store a "wave" consisting of many samples of the stream. Recent items are sampled with high probability, while older items are sampled with lower probability.

12 Contributions Introduction of a data structure called the wave. The first ε-approximation scheme for Basic Counting. The first ε-approximation scheme for the sum of integers in [0..R]. Both are optimal in worst-case space, processing time, and query time.

13 Contributions The first randomized (ε,δ)-approximation scheme for the number of 1's in a sliding window over the union of distributed streams.

14 Related Work From the paper of Datar et al.: the Exponential Histogram data structure.

15 Exponential Histogram Maintain more information about recently seen items, less about old items. The k0 most recent 1's are assigned to individual buckets (size 1). The next k1 most recent 1's are assigned to buckets of size 2. The next k2 most recent 1's are assigned to buckets of size 4, and so on, until the last of the N items are assigned to some bucket.

16 Exponential Histogram Each ki takes one of two consecutive values determined by ε. The last bucket is discarded if its position no longer falls within the window. If the new item is a 1, it is assigned to a new bucket of size 1. If this makes k0 too large, the two least recent buckets of size 1 are merged to form a bucket of size 2. If k1 is now too large, the two least recent buckets of size 2 are merged, and so on, resulting in a cascade of up to log N bucket merges in the worst case. The wave approach avoids this cascading.
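For comparison with the wave approach, here is a minimal runnable sketch of the cascading merge described above. The bucket-capacity parameter k, the expiry rule, and the estimate (full count minus half the oldest bucket) are illustrative choices, not the exact parameters of Datar et al.

```python
from collections import deque

class ExponentialHistogram:
    """Sketch of the EH cascading merge for counting 1's in a sliding window.
    k bounds the number of buckets per size (an illustrative parameter)."""

    def __init__(self, window, k):
        self.window = window
        self.k = k
        self.pos = 0
        self.buckets = deque()            # (timestamp of most recent 1, size), newest first

    def add(self, bit):
        self.pos += 1
        # Discard the last bucket if it no longer falls within the window.
        if self.buckets and self.buckets[-1][0] <= self.pos - self.window:
            self.buckets.pop()
        if bit == 1:
            self.buckets.appendleft((self.pos, 1))   # new bucket of size 1
            self._cascade()

    def _cascade(self):
        # If too many buckets share a size, merge the two least recent of that
        # size into one of double the size; the merge may cascade upward.
        size = 1
        while True:
            idxs = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(idxs) <= self.k:
                break
            i, j = idxs[-1], idxs[-2]     # the two least recent buckets of this size
            ts = self.buckets[j][0]       # keep the more recent timestamp
            del self.buckets[i]
            self.buckets[j] = (ts, 2 * size)
            size *= 2

    def estimate(self):
        # Standard EH estimate: full count, minus half of the oldest bucket.
        if not self.buckets:
            return 0
        return sum(s for _, s in self.buckets) - self.buckets[-1][1] // 2
```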

17 The Basic Wave Assumption: 1/ε is an integer. Counters: 1. pos - the current length of the stream. 2. rank - the current number of 1's in the stream. The wave contains the positions of the recent 1's in the stream, arranged at different "levels". For i = 0, 1, .., l-1 (with l = log(2εN) levels), level i contains the positions of the 1/ε + 1 most recent 1-bits whose 1-rank is a multiple of 2^i.

18 An Example of the Basic Wave The crest of the wave is always over the largest 1-rank. N = 48, 1/ε = 3, l = 5.

19 Estimation Steps: Let s = max(0, pos - n + 1) {estimate the number of 1's in [s, pos]}. Let p1 be the maximum position in the wave less than s, and p2 the minimum position greater than or equal to s. Let r1 and r2 be the 1-ranks of p1 and p2, respectively. Return rank - r + 1, where r = r2 if r2 - r1 = 1 and otherwise r = (r1 + r2)/2.
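The estimation steps above can be written down directly. A small sketch, assuming the wave's contents are exposed as a list of (position, 1-rank) pairs; the function name, argument layout, and the handling of the two edge cases are ours.

```python
def estimate_ones(entries, pos, rank, n):
    """Estimate the number of 1's in the last n positions (the steps of slide 19).

    entries: (position, one_rank) pairs currently stored in the wave.
    pos: current stream length; rank: current number of 1's seen so far.
    """
    s = max(0, pos - n + 1)            # estimate the number of 1's in [s, pos]
    before = [r for p, r in entries if p < s]
    after = [r for p, r in entries if p >= s]
    if not after:
        return 0                       # no stored 1-bit falls inside the window
    r2 = min(after)                    # 1-rank of p2, the earliest stored position >= s
    if not before:
        return rank - r2 + 1           # the wave covers the whole window: exact
    r1 = max(before)                   # 1-rank of p1, the latest stored position < s
    r = r2 if r2 - r1 == 1 else (r1 + r2) / 2.0
    return rank - r + 1
```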

20 LEMMA 1 The procedure returns an estimate that is within relative error ε of the actual number of 1's in the window.

21 Proof Let j be the smallest-numbered level containing position p1. By returning the midpoint of the range [r1, r2], we guarantee that the absolute error is at most (r2 - r1)/2. At level j there is a gap of at most 2^j between r1 and its next larger 1-rank r2, so the absolute error in our estimate is at most 2^(j-1). Let r3 be the earliest 1-rank at level j-1; by definition r3 > r1 and r3 >= r2, so the window contains at least rank - r3 + 1 >= (1/ε)·2^(j-1) ones, and the relative error is therefore at most ε.

22 Improvement Use modulo-N' counters for pos and rank, and store the positions in the wave as modulo-N' numbers - each takes only log N' bits. Keep track of both the largest 1-rank discarded (r1) and the smallest 1-rank (r2) still in the wave - the number of 1's can then be answered in O(1) time. Instead of storing a single position in multiple levels, store each position only at its maximal level.

23 Improvement

24 Improvement The positions at each level are stored in a fixed-length queue, so that each time a new position is added, the position at the end of the queue is removed. Maintain a doubly linked list of the positions in the wave in increasing order. Storing the difference between consecutive positions instead of the absolute positions further reduces the space.

25 The deterministic wave algorithm Upon receiving a stream bit b:
1. Increment pos (modulo N' = 2N).
2. If the head (p, r) of the linked list L has expired (p <= pos - N), then discard it from L and from its queue, and store r as the largest 1-rank discarded.
3. If b = 1 then do:
(a) Increment rank, and determine the corresponding wave level j: the largest j such that rank is a multiple of 2^j.
(b) If the level j queue is full, discard the tail of the queue and splice it out of L.
(c) Add (pos, rank) to the head of the level j queue and the tail of L.

26 Answering a query for a sliding window of size N:
1. Let r1 be the largest 1-rank discarded. (If there is no such r1, return rank as the exact answer.) Let r2 be the 1-rank at the head of the linked list L. (If L is empty, return 0.)
2. Return rank - r + 1, where r = r2 if r2 - r1 = 1 and otherwise r = (r1 + r2)/2.
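Putting slides 17, 22, 25 and 26 together, here is a minimal runnable sketch of the deterministic wave. The level-count formula, the queue capacity of 1/ε + 1, and the use of plain (non-modular) counters and a Python list in place of the O(1)-splice linked list are assumptions made for clarity, not the paper's exact bookkeeping.

```python
import math
from collections import deque

class DeterministicWave:
    """Sketch of the deterministic wave for Basic Counting (slides 17, 22, 25, 26).

    Simplifications: counters are plain integers rather than modulo-N' values,
    and the linked list L is a Python list, so splicing is O(|L|) rather than O(1).
    """

    def __init__(self, window, eps):
        self.N = window
        self.levels = max(1, math.ceil(math.log2(2 * eps * window)))  # assumed l = log(2*eps*N)
        self.cap = int(round(1 / eps)) + 1        # assumed queue length per level
        self.queues = [deque() for _ in range(self.levels)]
        self.L = []                               # (pos, rank) entries, oldest first
        self.pos = 0
        self.rank = 0
        self.r1 = None                            # largest 1-rank discarded so far

    def _level(self, rank):
        # Maximal j < l such that rank is a multiple of 2^j (each entry is
        # stored only at its maximal level, per the slide-22 improvement).
        return min((rank & -rank).bit_length() - 1, self.levels - 1)

    def process(self, bit):
        self.pos += 1
        # Step 2: expire the head of L if it has left the window.
        if self.L and self.L[0][0] <= self.pos - self.N:
            p, r = self.L.pop(0)
            self.queues[self._level(r)].remove((p, r))
            self.r1 = r
        if bit == 1:
            self.rank += 1
            j = self._level(self.rank)
            q = self.queues[j]
            if len(q) == self.cap:                # queue full: drop its oldest entry
                self.L.remove(q.pop())
            q.appendleft((self.pos, self.rank))   # newest at the head of the queue
            self.L.append((self.pos, self.rank))  # and at the tail of L

    def query(self):
        """Estimate the number of 1's among the last N stream bits."""
        if self.r1 is None:
            return self.rank                      # nothing discarded: exact answer
        if not self.L:
            return 0
        r2 = self.L[0][1]                         # 1-rank at the head of L
        r = r2 if r2 - self.r1 == 1 else (self.r1 + r2) / 2.0
        return self.rank - r + 1
```

For example, DeterministicWave(window=48, eps=1/3) reproduces the l = 5, four-entries-per-level configuration of the earlier example.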

27 Space - O((1/ε) log²(εN)) bits. Process time for each item - O(1). Estimate time - O(1). In related work (Datar et al): Space - O((1/ε) log²(εN)) bits. Process time for each item - O(log(εN)).

28 Sum of Bounded Integers The sum over a sliding window can range from 0 to NR. Let N' be the smallest power of 2 greater than or equal to 2RN. Counters (modulo N'): pos - the current length of the stream; total - the running sum. There are l = log(2εNR) levels. For each stored item, keep a triple (p, v, z): p is the position, v is the value of the data item, and z is the partial sum through this item.

29 The answer to a query is the midpoint of the interval [total - z2 + v2, total - z1].

30 The algorithm for the sum of the last N items in a data stream Upon receiving a stream value v between 0 and R:
1. Increment pos (modulo 2N).
2. If the head (p, v', z) of the linked list L has expired (p <= pos - N), then discard it from L and from its queue, and store z as the largest partial sum discarded.
3. If v > 0 then do:
(a) Determine the largest j such that some number in (total, total+v] is a multiple of 2^j. Add v to total.
(b) If the level j queue is full, discard the tail of the queue and splice it out of L.
(c) Add (pos, v, total) to the head of the level j queue and the tail of L.

31 Step 3a The desired wave level is the largest j such that some number y in the interval (total, total+v] has 0's in all bit positions less than j. Such a y and y-1 differ in bit position j. If bit j changes from 1 to 0 at some point in [total, total+v], it must also change back from 0 to 1, so that j would not be the largest such position. Hence j is the position of the most significant bit that is 0 in total and 1 in total+v; equivalently, j is the most significant bit that is 1 in the bitwise XOR of total and total+v.
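The XOR observation translates into a single line of code; a tiny sketch (the function name and the example values are ours):

```python
def sum_wave_level(total, v):
    """Level for a new item of value v > 0: the position of the most
    significant bit that is 1 in (total XOR (total + v))."""
    return (total ^ (total + v)).bit_length() - 1

# Example: total = 5 (0b101), v = 3 gives the interval (5, 8]; 8 = 0b1000 is a
# multiple of 2^3, and indeed (5 ^ 8).bit_length() - 1 == 3.
assert sum_wave_level(5, 3) == 3
```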

32 Answering a query for a sliding window of size N:
1. Let z1 be the largest partial sum discarded from L. (If there is no such z1, return total as the exact answer.) Let (p2, v2, z2) be the head of the linked list L. (If L is empty, return 0.)
2. Return total - (z1 + z2 - v2)/2.
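Analogously to the counting wave, the sum algorithm of slides 28-32 can be sketched as follows, with the same simplifications as before (plain counters, list-based splicing, and an assumed queue capacity of 1/ε + 1 per level).

```python
import math
from collections import deque

class SumWave:
    """Sketch of the wave for sums of integers in [0..R] (slides 28-32)."""

    def __init__(self, window, R, eps):
        self.N = window
        self.levels = max(1, math.ceil(math.log2(2 * eps * window * R)))  # l = log(2*eps*N*R)
        self.cap = int(round(1 / eps)) + 1
        self.queues = [deque() for _ in range(self.levels)]
        self.L = []                      # (p, v, z) triples, oldest first
        self.pos = 0
        self.total = 0
        self.z1 = None                   # largest partial sum discarded so far

    def _level(self, before, v):
        # Largest j such that some number in (before, before+v] is a multiple
        # of 2^j: the most significant 1 in (before XOR (before+v)), capped at l-1.
        return min((before ^ (before + v)).bit_length() - 1, self.levels - 1)

    def process(self, v):
        self.pos += 1
        # Expire the head of L if it has left the window.
        if self.L and self.L[0][0] <= self.pos - self.N:
            p, v_old, z = self.L.pop(0)
            self.queues[self._level(z - v_old, v_old)].remove((p, v_old, z))
            self.z1 = z                              # largest partial sum discarded
        if v > 0:
            j = self._level(self.total, v)
            self.total += v
            q = self.queues[j]
            if len(q) == self.cap:
                self.L.remove(q.pop())               # drop the oldest entry of level j
            entry = (self.pos, v, self.total)
            q.appendleft(entry)
            self.L.append(entry)

    def query(self):
        """Estimate the sum of the last N stream values."""
        if self.z1 is None:
            return self.total                        # nothing discarded: exact answer
        if not self.L:
            return 0
        _, v2, z2 = self.L[0]
        return self.total - (self.z1 + z2 - v2) / 2.0
```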

33 Space - O((1/ε)(log N + log R)) memory words of O(log N + log R) bits each. Process time for each item - O(1). Estimate time - O(1). In related work (Datar et al): Space - O((1/ε)(log N + log R)) buckets of log N + log(log N + log R) bits each. Process time for each item - O(log N + log R).

34 Distributed Streams Three definitions of a sliding window over a collection of t > 1 distributed streams:
1. Seek the total number of 1's in the last N items of each of the t streams (tN items in total).
2. A single logical stream has been split arbitrarily among the parties; each party receives items tagged with a sequence number in the logical stream. Seek the total number of 1's in the last N items of the logical stream.
3. Seek the total number of 1's in the last N items of the position-wise union of the t streams.

35 Solution for the First Scenario: Apply the single-stream algorithm to each stream. To answer a query, each party sends its count to the Referee, and the Referee sums the answers. Because each individual count is within ε relative error, so is the total.

36 Solution for the Second Scenario: To answer a query, each party sends its wave to the Referee. The Referee computes the maximum sequence number over all the parties, uses each wave to obtain an estimate over the resulting window, and sums the results. Because each individual estimate is within ε relative error, so is the total.
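A tiny sketch of the Referee's combining step for this scenario. The per-wave interface (a pos attribute and an estimate_in_window method) is hypothetical, standing in for whichever single-stream structure each party maintains.

```python
def referee_estimate_scenario2(waves, window_size):
    """Scenario 2 (slide 36): the Referee fixes the window relative to the
    maximum sequence number observed by any party and sums the per-wave
    estimates. `pos` and `estimate_in_window(s, e)` are a hypothetical
    per-wave interface, not part of the structures sketched earlier."""
    latest = max(w.pos for w in waves)           # maximum sequence number seen
    s = max(0, latest - window_size + 1)         # window [s, latest] in the logical stream
    return sum(w.estimate_in_window(s, latest) for w in waves)
```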

37 Randomized Waves Contain the positions of the recent 1's in the data stream, stored at different levels. Each level i contains the most recently selected positions of 1-bits, where a position is selected into level i with probability 2^-i. The deterministic wave selects 1 out of every 2^i 1-bits at regular intervals; a randomized wave selects an expected 1 out of every 2^i 1-bits at random intervals. The randomized wave therefore retains more positions per level.

38 The Basic Randomized Wave Let N' be the power of 2 that is at least 2N, and let d = log N'. Let δ < 1 be the desired error probability. Each party Pj maintains a basic randomized wave for its stream consisting of d+1 queues, Qj(0), .., Qj(d), one for each level, using a pseudo-random hash function h to map positions to levels according to an exponential distribution.

39 The steps for maintaining the randomized wave. Party Pj, upon receiving a stream bit b:
1. Increment pos (modulo N' = 2N).
2. Discard any position p at the tail of a queue that has expired (p <= pos - N).
3. If b = 1 then for l = 0, .., h(pos) do:
(a) If the level l queue Qj(l) is full, then discard the tail of Qj(l).
(b) Add pos to the head of Qj(l).
The sample for each level, stored in a queue, contains the most recent positions selected into the level; the queue capacity is proportional to 1/ε² (with constant c = 36).
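A minimal sketch of one party's randomized wave. The concrete hash construction (trailing zero bits of a seeded SHA-256 value) and the queue capacity passed in by the caller are assumptions; what matters is that positions reach level i with probability about 2^-i and that all parties share the same hash function.

```python
import hashlib
from collections import deque

class RandomizedWave:
    """Sketch of one party's basic randomized wave (slides 38-39)."""

    def __init__(self, window, levels, capacity, seed=b"shared-seed"):
        self.N = window
        self.d = levels                   # d = log N' levels, plus level 0
        self.cap = capacity               # e.g. on the order of 1/eps^2 positions
        self.seed = seed                  # every party must use the same hash function
        self.queues = [deque() for _ in range(self.d + 1)]
        self.pos = 0

    def h(self, p):
        # Hash position p to a level with P[h(p) >= i] ~ 2^-i: count the
        # trailing zero bits of a seeded SHA-256 value, capped at d.
        x = int.from_bytes(hashlib.sha256(self.seed + p.to_bytes(8, "big")).digest(), "big")
        return min((x & -x).bit_length() - 1, self.d) if x else self.d

    def process(self, bit):
        self.pos += 1
        # Step 2: discard expired positions from the tails of the queues.
        for q in self.queues:
            while q and q[-1] <= self.pos - self.N:
                q.pop()
        if bit == 1:
            # Step 3: insert pos into every level from 0 up to h(pos).
            for lvl in range(self.h(self.pos) + 1):
                q = self.queues[lvl]
                if len(q) == self.cap:
                    q.pop()               # queue full: drop the oldest position
                q.appendleft(self.pos)    # newest position at the head
```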

40 Consider a queue Qj(l), and let i be the earliest position it contains. Then Qj(l) contains all the 1-bits in the interval [i, pos] whose positions hash to a value greater than or equal to l. As we move from level l to l+1, this range can only increase. The queues at lower-numbered levels may have ranges that fail to contain the window, but as we move to higher levels, we will find a level whose range contains the window.

41 Answering a query for a sliding window of size n <= N, after each party has observed pos bits:
1. Each party j sends its wave, {Qj(0), .., Qj(log N')}, to the Referee. Let s = max(0, pos - n + 1); then W = [s, pos] is the desired window.
2. For j = 1, .., t let lj be the minimum level such that the tail of Qj(lj) is a position p <= s.
3. Let l* = max{lj : j = 1, .., t}, and let U be the union of all positions in Q1(l*), .., Qt(l*).
4. Return 2^l* · |U ∩ W|.
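A sketch of the Referee's side, written against the RandomizedWave sketch above. The treatment of empty queues and the 2^l* scaling in the last line are our reading of the elided details.

```python
def referee_union_estimate(waves, n):
    """Estimate the number of 1's in the last n <= N positions of the
    position-wise union of the parties' streams (slide 41)."""
    pos = max(w.pos for w in waves)              # each party has observed pos bits
    s = max(0, pos - n + 1)                      # desired window W = [s, pos]

    def min_level(w):
        # l_j: minimum level whose queue reaches back to position s or earlier.
        # An empty queue is treated as covering the window (a simplification).
        for lvl, q in enumerate(w.queues):
            if not q or q[-1] <= s:
                return lvl
        return len(w.queues) - 1

    l_star = max(min_level(w) for w in waves)
    union = set()                                # U: union of the level-l* samples
    for w in waves:
        union.update(p for p in w.queues[l_star] if p >= s)
    return len(union) * (2 ** l_star)            # scale by the sampling rate 2^-l*
```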

42 The algorithm returns an estimate for the Union Counting Problem, for any sliding window of size n <= N, that is within relative error ε with probability greater than 2/3. Space per party - O((1/ε²) log N) stored positions (one queue of Θ(1/ε²) positions per level).