1 Approximating Quantiles over Sliding Windows Srimathi Harinarayanan CMPS 565.


1 Approximating Quantiles over Sliding Windows Srimathi Harinarayanan CMPS 565

2 Streams Here, There, Everywhere! Network Traffic Engineering. Call Record Analysis. Sensor Data Analysis. Medical, Financial Monitoring. Etc, etc, etc.

3 Problem Definition
• Data stream environment
• One pass
• Each data element is a value
• Φ-quantile (Φ ∈ (0, 1]): the element with rank ⌈ΦN⌉ in an ordered sequence of N data elements.

4 [Figure: timeline of 16 elements arriving at t0 … t15, then sorted]
Sorted: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12 (N = 16)
• The 0.5-quantile returns the element ranked 8 (0.5 × 16), which is 8
• The 0.75-quantile returns the element ranked 12 (0.75 × 16), which is 10
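The rank-based definition can be checked directly on the sorted sequence from this slide. The following is a minimal sketch; the function name `exact_quantile` is illustrative, and Φ is assumed to lie in (0, 1].

```python
import math

def exact_quantile(values, phi):
    """Return the element with 1-based rank ceil(phi * N) in sorted order."""
    if not 0 < phi <= 1:
        raise ValueError("phi must lie in (0, 1]")
    ordered = sorted(values)
    rank = math.ceil(phi * len(ordered))
    return ordered[rank - 1]

# Sorted sequence shown on the slide (N = 16).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12]
print(exact_quantile(data, 0.5))   # element of rank 8
print(exact_quantile(data, 0.75))  # element of rank 12
```

This exact version stores and sorts all N elements, which is precisely what the one-pass summaries in the rest of the talk avoid.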

5 Three Models
• Data Stream Model: compute the Φ-quantile over all data items seen so far
• Sliding Window Model: compute the Φ-quantile over the N most recent elements of the data stream
• n-of-N Model: for any n ≤ N, compute the Φ-quantile over the n most recent elements of the data stream

6 Sliding Window Model
[Figure: timeline with time increasing toward the current time; a window of size N covers the most recent N elements]

7 Sliding window model
[Figure: elements arriving at t0 … t15]
Window size = 12: the 0.5-quantile returns 10 at time t11 and 6 at time t15.

8 n-of-N model
[Figure: elements arriving at t0 … t15]
N = 12: the 0.5-quantile returns 8 at time t11 for n = 8, and 3 at time t15 for n = 4.
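The sliding window and n-of-N answers can be reproduced with a brute-force computation. The element values in the figure did not survive, so the arrival order `STREAM` below is a hypothetical one, chosen only because it agrees with the sorted list on slide 4 and with the answers quoted on slides 7 and 8. The helper name `recent_quantile` is also illustrative.

```python
import math

# Hypothetical arrival order for t0..t15 (assumption: the figure's values were lost).
STREAM = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]

def recent_quantile(stream, t, n, phi):
    """Exact phi-quantile of the n most recent elements at time t (0-based, inclusive)."""
    window = sorted(stream[max(0, t - n + 1): t + 1])
    return window[math.ceil(phi * len(window)) - 1]

# Sliding window model: n is fixed to the window size N = 12.
print(recent_quantile(STREAM, 11, 12, 0.5), recent_quantile(STREAM, 15, 12, 0.5))
# n-of-N model: any n <= N may be queried.
print(recent_quantile(STREAM, 11, 8, 0.5), recent_quantile(STREAM, 15, 4, 0.5))
```

The n-of-N model is the more general of the two: a sliding window query is simply the case n = N.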

9 Applications - Sliding Window Model in Data Streams
• Useful for network traffic management and sensor data.
• Finding the top-ranked web pages among the N most recently accessed pages.
• In the financial market, investors are often interested in the N most recent bids.

10 Previous Work on Approximating Quantiles in One Scan of Data
• G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. [space: O((1/ε) log²(εN))]
• G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets.
• M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. [space: O((1/ε) log(εN))] {GK Algorithm}
• The GK algorithm is the most efficient: it uses the least space and does not require advance knowledge of N.

11 Definitions
• Φ-quantile: a Φ-quantile (Φ ∈ (0, 1]) of an ordered sequence of N data elements is the element with rank ⌈ΦN⌉.
• Quantile query: given Φ, find the data element with rank ⌈ΦN⌉ among all elements in the stream. Variation: among the N most recent elements (sliding window model).
• ε-approximate: for a target rank r, find an element whose rank lies within the interval [r − εN, r + εN].

12 Computation of Quantile Summaries over Sliding Windows – 2 Methods
• Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream. Xuemin Lin, Hongjun Lu, Jian Xu, Jeffrey Xu Yu. IEEE, 2004.
• Approximate Counts and Quantiles over Sliding Windows. Arvind Arasu, Gurmeet Singh Manku, Stanford University, 2004.

13 Computation of Quantile Summaries over Sliding Windows – LLXY04
• GK algorithm + concept of aging (computing quantiles over a sliding window of the N most recent elements)
• Under the sliding window model, a summary is maintained for the N most recently seen data elements.
• Eliminating exactly the outdated elements requires O(N) space.

14 ε-approximate
• A quantile summary for a data sequence is ε-approximate if, for any given rank r, it returns a value whose rank r′ is guaranteed to lie within the interval [r − εN, r + εN].
• Example: for a data stream with 100 elements, a 0.5-quantile query with ε = 0.1 returns a value v whose true rank lies within [40, 60].
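The interval in this example follows from the definition by simple arithmetic. A minimal sketch, where the function name and the clamping of the interval to [1, N] are choices made here rather than something stated on the slide:

```python
import math

def allowed_rank_interval(phi, n, eps):
    """Ranks that an eps-approximate answer to a phi-quantile query may have."""
    r = math.ceil(phi * n)          # target rank
    slack = eps * n                 # permitted rank error
    low = max(1, math.ceil(r - slack))
    high = min(n, math.floor(r + slack))
    return low, high

# Slide example: N = 100, phi = 0.5, eps = 0.1.
print(allowed_rank_interval(0.5, 100, 0.1))
```

Halving ε halves the width of the interval, which is why the space bounds later in the talk grow as ε shrinks.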

15 Quantile Sketch
• Data structure: $\{(v_i, r_i^-, r_i^+) : 1 \le i \le m\}$
• Each value $v_i$ is one of the elements seen so far
• $r_i^-$ is a lower bound on the rank of $v_i$
• $r_i^+$ is an upper bound on the rank of $v_i$
• $v_i \le v_{i+1}$ for $1 \le i \le m-1$
• $r_i^- \le r_{i+1}^-$ for $1 \le i \le m-1$
• $r_i^- \le r_i \le r_i^+$, where $r_i$ is the rank of $v_i$

16 Example
[Figure: elements arriving at t0 … t15]
Quantile sketch consisting of 6 tuples: {(1,1,1), (2,2,9), (3,3,10), (5,4,10), (10,10,10), (12,16,16)}
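The sketch properties can be verified mechanically against the sorted data from slide 4. This is an illustrative checker only: the name `is_valid_sketch` is made up here, and for duplicated values it accepts a tuple if any rank the value can occupy falls inside the stated bounds.

```python
from bisect import bisect_left, bisect_right

def is_valid_sketch(sketch, data):
    """Check value order, lower-bound order, and that rank bounds enclose a true rank."""
    ordered = sorted(data)
    prev_v = prev_lo = None
    for v, lo, hi in sketch:
        first = bisect_left(ordered, v) + 1   # smallest rank v can occupy
        last = bisect_right(ordered, v)       # largest rank v can occupy
        if last < first:                      # v does not occur in the data
            return False
        if lo > hi or lo > last or hi < first:
            return False
        if prev_v is not None and (v < prev_v or lo < prev_lo):
            return False
        prev_v, prev_lo = v, lo
    return True

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12]
sketch = [(1, 1, 1), (2, 2, 9), (3, 3, 10), (5, 4, 10), (10, 10, 10), (12, 16, 16)]
print(is_valid_sketch(sketch, data))
```

Note that the bounds are allowed to be loose: (2, 2, 9) is valid even though the rank of 2 is exactly 2.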

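17 ε-approximate sketch
• Theorem: if a sketch S satisfies
  1. $r_1^+ \le \varepsilon N + 1$,
  2. $r_m^- \ge (1-\varepsilon) N$,
  3. $r_i^+ \le r_{i-1}^- + 2\varepsilon N$ for $2 \le i \le m$,
  then S is ε-approximate.
• That is, for each Φ ∈ (0, 1] there is a tuple $(v_i, r_i^-, r_i^+)$ in S such that $\lceil \Phi N \rceil - \varepsilon N \le r_i^- \le r_i^+ \le \lceil \Phi N \rceil + \varepsilon N$.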

18 Query
[Figure: elements arriving at t0 … t15]
• Quantile sketch consisting of 6 tuples, ε = 0.25: {(1,1,1), (2,2,9), (3,3,10), (5,4,10), (10,10,10), (12,16,16)}
• A 0.5-quantile query asks for the element of rank r = 8; εN = 4
• Find the first tuple satisfying $r - \varepsilon N \le r_i^- \le r_i^+ \le r + \varepsilon N$ and return its $v_i$: here (5,4,10), so return 5
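The query step amounts to a linear scan over the sketch. A minimal sketch follows, assuming the selection rule is r − εN ≤ r_i^- and r_i^+ ≤ r + εN with r = ⌈ΦN⌉; the function name `query` is illustrative.

```python
import math

def query(sketch, phi, n, eps):
    """Return v_i of the first tuple whose rank bounds fit inside [r - eps*n, r + eps*n]."""
    r = math.ceil(phi * n)
    for v, lo, hi in sketch:
        if r - eps * n <= lo and hi <= r + eps * n:
            return v
    return None  # cannot happen for a sketch that is truly eps-approximate

sketch = [(1, 1, 1), (2, 2, 9), (3, 3, 10), (5, 4, 10), (10, 10, 10), (12, 16, 16)]
print(query(sketch, phi=0.5, n=16, eps=0.25))
```

Because both bounds of the chosen tuple lie inside the allowed interval, the true rank of the returned value does too, whatever it is.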

19 One-Pass summary for sliding windows
• Continuously divide the stream into buckets according to the arrival order of data elements
• The capacity of each bucket is ⌊εN/2⌋
• For each bucket, maintain an ε/4-approximate sketch continuously with the GK algorithm
• Once a bucket is full, its ε/4-approximate sketch is compressed into an ε/2-approximate sketch
• The oldest bucket expires when the total number of elements reaches N + 1
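The bucket bookkeeping alone can be simulated without any quantile sketch. In the sketch below each bucket stores its raw elements instead of a GK summary, purely to make the expiry rule visible. The class name is hypothetical, and the bucket capacity ⌊εN/2⌋ is treated as an assumption of this example.

```python
from collections import deque

class BucketedWindow:
    """Bucket bookkeeping only: raw elements stand in for per-bucket GK sketches."""

    def __init__(self, n, eps):
        self.n = n
        self.capacity = max(1, int(eps * n / 2))  # assumed capacity floor(eps * N / 2)
        self.buckets = deque([[]])
        self.total = 0

    def insert(self, x):
        if len(self.buckets[-1]) == self.capacity:
            self.buckets.append([])               # current bucket is full, open a new one
        self.buckets[-1].append(x)
        self.total += 1
        if self.total == self.n + 1:              # oldest bucket expires as a whole
            self.total -= len(self.buckets.popleft())

w = BucketedWindow(n=8, eps=1)
for x in range(1, 10):
    w.insert(x)
print(list(w.buckets), w.total)
```

Since a whole bucket is dropped at once, the summarized elements can number fewer than N by up to one bucket's worth, which is the gap the later lift step has to account for.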

20 [Figure: the most recent N elements divided into buckets: an expired bucket, full buckets each holding a compressed approximate sketch, and the current bucket maintained with the GK summary technique]

21 Approximate sketch example: N = 8, ε = 1, bucket capacity ⌊εN/2⌋ = 4
[Figure: the current bucket fills and its sketch is compressed; the oldest bucket expires]

22 Compress
• Compress an ε/2-approximate sketch into an ε-approximate sketch
• Memory space is at most O(1/ε) tuples
• Why not use an ε-approximate sketch in each bucket directly? The compress technique takes about half the number of tuples given by an ε-approximate sketch

23 Merge
• There are h data streams $D_i$ ($1 \le i \le h$), and each $D_i$ has $N_i$ data elements. Suppose each $S_i$ is an ε-approximate sketch of $D_i$.
• $S_{merge}$ is a sketch of $\bigcup_{i=1}^{h} D_i$
• $|S_{merge}| = \sum_{i=1}^{h} |S_i|$
• If each $S_i$ is an ε-approximate sketch, then $S_{merge}$ is also an ε-approximate sketch
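One common way to merge rank-bounded sketches is to add, for every tuple, the rank bounds contributed by each of the other sketches. The code below is a sketch of that general idea under stated assumptions, not necessarily the exact procedure of the paper: each input is a `(tuples, n)` pair where `n` is the element count of its stream, and the function name `merge` is illustrative.

```python
from bisect import bisect_left, bisect_right

def merge(sketches):
    """sketches: list of (tuples, n) with tuples sorted by value; returns (tuples, n)."""
    merged = []
    for i, (tuples_i, _) in enumerate(sketches):
        for v, lo, hi in tuples_i:
            for j, (tuples_j, n_j) in enumerate(sketches):
                if j == i:
                    continue
                values_j = [t[0] for t in tuples_j]
                below = bisect_left(values_j, v)    # tuples in S_j with value < v
                above = bisect_right(values_j, v)   # index of first tuple with value > v
                if below > 0:
                    lo += tuples_j[below - 1][1]    # at least this many elements precede v
                # at most this many elements of D_j are <= v
                hi += tuples_j[above][2] - 1 if above < len(tuples_j) else n_j
            merged.append((v, lo, hi))
    merged.sort()
    return merged, sum(n for _, n in sketches)

s1 = ([(1, 1, 1), (4, 2, 2), (7, 3, 3), (10, 4, 4)], 4)   # stream 1, 4, 7, 10
s2 = ([(2, 1, 1), (8, 3, 3)], 3)                          # stream 2, 3, 8 (3 not kept)
print(merge([s1, s2]))
```

The output keeps every input tuple, so its size is the sum of the input sizes, as stated on the slide. The bounds widen only where the other sketch is missing information, for example around the unrecorded value 3.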

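24 Another Problem
[Figure: stream elements 5, 6, 7, 8, 1, 2, 3, 4, 9 split into buckets, with one bucket marked expired and 9 in the current bucket; ε = 1 and N = 8]
• The first tuple in the approximate sketch $S_{merge}$ is …, but the rank of 5 is 4
• So $S_{merge}$ is not an …-approximate sketch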

25 Lift
• To solve the previous problem, a "lift" operation raises $r_i^+$ by ⌊εN/2⌋ for each tuple i
• If S is an ε/2-approximate sketch, then $S_{lift}$ is an ε-approximate sketch
• That is why the bucket size is ⌊εN/2⌋ and an ε/2-approximate sketch is maintained for each bucket summary
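The effect of lifting can be seen on a small hypothetical stream. The example below assumes the lift amount is ⌊εN/2⌋ added to each upper bound, and uses a made-up stream 1, 2, …, 9 with N = 8 and ε = 1, where the bucket holding 1 to 4 has expired and only 5 to 9 are summarized. The names `lift` and `bounds_hold` are illustrative.

```python
import math
from bisect import bisect_left, bisect_right

def lift(sketch, n, eps):
    """Raise every upper rank bound by floor(eps * n / 2) (assumed lift amount)."""
    delta = math.floor(eps * n / 2)
    return [(v, lo, hi + delta) for v, lo, hi in sketch]

def bounds_hold(sketch, window):
    """True if each tuple's bounds enclose a rank that v can occupy in the window."""
    ordered = sorted(window)
    return all(lo <= bisect_right(ordered, v) and bisect_left(ordered, v) + 1 <= hi
               for v, lo, hi in sketch)

window = [2, 3, 4, 5, 6, 7, 8, 9]                 # the true N = 8 most recent elements
summarized = [(5, 1, 1), (6, 2, 2), (7, 3, 3), (8, 4, 4), (9, 5, 5)]  # exact ranks in 5..9
print(bounds_hold(summarized, window))            # bounds ignore the unsummarized 2, 3, 4
print(bounds_hold(lift(summarized, 8, 1), window))
```

Only the upper bounds move: elements missing from the summary can push a rank up in the full window but never down, so the lower bounds stay valid as they are.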

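26 Query
• Step 1: merge the local sketches of the buckets into $S_{merge}$
• Step 2: lift $S_{merge}$ to obtain $S_{lift}$
• Step 3: for a given rank r = ⌈ΦN⌉, find the first tuple in $S_{lift}$ such that $r - \varepsilon N \le r_i^- \le r_i^+ \le r + \varepsilon N$, and return $v_i$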

27 Space – Sliding Window LLXY '04
• Space: O(1/ε² + log(ε²N)/ε)
• Reason: the sketch in each bucket produced by the GK algorithm takes O(log(ε²N)/ε) space, which is compressed to O(1/ε) once the bucket is full
• There are O(1/ε) buckets

28 Performance Studies
• Sliding window model: compared with the ARS-algorithm
  - Average errors
  - Space consumption
  - Distributions
• n-of-N model: compared with the heuristic algorithm nN'
  - Average errors
  - Space consumption
  - Query performance

29 Conclusion
• The work presented is among the first attempts to develop space-efficient, one-pass, deterministic quantile summary algorithms with performance guarantees under the sliding window model of data streams.

30 Approximating quantiles using sliding window model - Manku's Approximating Quantiles
• GK algorithm + concept of aging
• Improves over [LLXY '04]
  - [LLXY '04] space: O(1/ε² + log(ε²N)/ε)
  - Manku's space: O((1/ε) log(1/ε) log N)
• At any point in time, ε-approximate Φ-quantiles, for any Φ ∈ (0, 1], over the current contents of the sliding window can be computed from the maintained state.
• The goal is to minimize the space required for maintaining this state.
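To get a feel for how the two bounds compare, the expressions inside the O(·) can be evaluated for concrete ε and N. This is only a rough illustration: it ignores the hidden constants, assumes base-2 logarithms, and reads the second bound as (1/ε) log(1/ε) log N. The function names are made up here.

```python
import math

def llxy_space(eps, n):
    """Expression inside the LLXY '04 bound: 1/eps^2 + log(eps^2 * N) / eps."""
    return 1 / eps**2 + math.log2(eps**2 * n) / eps

def am_space(eps, n):
    """Expression inside the Arasu-Manku bound: (1/eps) * log(1/eps) * log(N)."""
    return (1 / eps) * math.log2(1 / eps) * math.log2(n)

for eps, n in [(0.1, 10**6), (0.01, 10**9), (0.001, 10**9)]:
    print(eps, n, round(llxy_space(eps, n)), round(am_space(eps, n)))
```

The 1/ε² term dominates the first expression as ε shrinks, so the advantage of the second bound shows up at small ε; at coarse ε the first expression can be the smaller one.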

31–38 Overview
[Figure: eight animation frames over a window of size N]

39 Details
[Figure: a window of size N partitioned into blocks over about log(1/ε) levels; labels include εN/4 and "= O(εN)"]

40 Space Requirement: O((1/ε) log(1/ε) log N)
• Space required for level-ℓ blocks = (size of a quantile sketch) × (number of "active" blocks) = $(1/\varepsilon_\ell) \times (N / N_\ell)$
• Number of levels: log(1/ε)
• Space required for the GK algorithm: O((1/ε) log(εN))
• Total: O((1/ε) log(1/ε) log N)

41 Conclusion
• The work presented uses less space than the first method [LLXY '04].
• The paper also provides a randomized quantile-finding algorithm that reduces space further.

42 Any Questions?