1 Online Computation and Continuous Maintaining of Quantile Summaries Tian Xia Database CCIS Northeastern University April 16, 2004.

Presentation transcript:

1 Online Computation and Continuous Maintaining of Quantile Summaries Tian Xia Database CCIS Northeastern University April 16, 2004

2 References
M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. In SIGMOD, pages 58-66, 2001.
X. Lin, H. Lu, J. Xu, and J. X. Yu. Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream. In ICDE, 2004.

3 Outline of this talk: Quantile Estimation Overview; GK-quantile Summary Algorithm (Data Structure, Operations, Space Complexity Analysis); Sliding Window Model

4 Problem Definitions
$\phi$-quantile: A $\phi$-quantile ($\phi \in (0,1]$) of an ordered sequence of $N$ data elements is the element with rank $\lceil \phi N \rceil$.
Quantile query: Given $\phi$, find the data element with rank $\lceil \phi N \rceil$ among all elements in the stream.
- Variation: the $N$ most recent elements (sliding window model).
$\epsilon$-approximate: Find an element whose rank lies within the interval $[r - \epsilon N, r + \epsilon N]$, where $r = \lceil \phi N \rceil$.

5 Example of a Quantile Query
Data stream in arrival order ($t_0$ to $t_{15}$): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3.
The sorted order of the sequence is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12.
A 0.5-quantile query returns the element ranked 8, which is 8.
A 0.25-approximate 0.5-quantile query returns one of the elements in {4,5,6,7,8,9,10}.
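The example can be checked by brute force. The sketch below sorts the whole stream, which is exactly what a streaming summary is meant to avoid, so it only illustrates the definitions. The stream values are read off the slide, with $t_{11} = 11$ inferred from the sorted order; the function names are illustrative, not taken from the papers.

```python
import math

# Example stream from the slide (t_11 = 11 is inferred from the sorted order).
stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]

def exact_quantile(data, phi):
    """Return the element with rank ceil(phi * N) in sorted order (ranks start at 1)."""
    ordered = sorted(data)
    rank = math.ceil(phi * len(ordered))
    return ordered[rank - 1]

def approximate_answers(data, phi, eps):
    """Return every value that is an acceptable eps-approximate phi-quantile."""
    ordered = sorted(data)
    n = len(ordered)
    r = math.ceil(phi * n)
    lo = max(1, math.ceil(r - eps * n))   # smallest acceptable rank
    hi = min(n, math.floor(r + eps * n))  # largest acceptable rank
    return sorted(set(ordered[lo - 1:hi]))

print(exact_quantile(stream, 0.5))
print(approximate_answers(stream, 0.5, 0.25))
```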

6 Why Approximation? Munro and Paterson (Theoretical Computer Science, 1980) showed that any algorithm that exactly computes the $\phi$-quantile of $N$ data elements in $p$ passes requires $\Omega(N^{1/p})$ space. Approximate quantile techniques are therefore necessary to achieve sub-linear space in a single pass.

7 Quantile Summary Quantile Summary: A small number of objects from the input data sequence, which could be used (by quantile estimator) to answer quantile queries. Other summary methods of large data sets include average, standard deviation, histogram, counting sketch (FM-sketch), etc.

8 Properties of a Good Quantile Estimator
- Provides tunable and explicit a priori guarantees on the precision of the approximation, e.g. it is $\epsilon$-approximate.
- Is data independent.
- Uses as small a memory footprint as possible, including temporary storage.

9 Previous Work
Manku, Rajagopalan, and Lindsay (SIGMOD, 1998) proposed a single-pass algorithm that constructs an $\epsilon$-approximate quantile summary.
- Space complexity: $O(\tfrac{1}{\epsilon} \log^2(\epsilon N))$.
- It requires advance knowledge of $N$, the size of the data set, so it does not work in a data stream environment.

10 Outline of this talk: Quantile Estimation Overview; GK-quantile Summary Algorithm (Data Structure, Operations, Space Complexity Analysis); Sliding Window Model

11 Contributions of GK-algorithm
- Dynamically adjusts the quantile summary as $N$, the total number of data elements in the data stream, grows.
- Reduces the space complexity to $O(\tfrac{1}{\epsilon} \log(\epsilon N))$.

12 Assumptions A new data element arrives after each unit of time, so $n$ denotes both the number of elements seen so far and the current time. A data element is represented by its value $v$. $r_{\min}(v)$ and $r_{\max}(v)$ denote, respectively, the lower and upper bounds on the actual rank $r$ of $v$ among the elements seen so far.

13 The Summary Data Structure GK-algorithm maintains a summary data structure S=S(n) at any point in time n. S(n) consists of an ordered (non-decreasing) sequence of tuples which corresponds to a subset of the elements seen thus far.

14 The Summary Data Structure
$S = \{t_0, t_1, \ldots, t_{s-1}\}$, where $t_i = (v_i, g_i, \Delta_i)$.
- $v_i$ is the value of one of the elements seen so far.
- $g_i = r_{\min}(v_i) - r_{\min}(v_{i-1})$
- $\Delta_i = r_{\max}(v_i) - r_{\min}(v_i)$
$v_0$ and $v_{s-1}$ always correspond to the minimum and the maximum elements seen so far.

15 The Summary Data Structure
Given $g_i = r_{\min}(v_i) - r_{\min}(v_{i-1})$ and $\Delta_i = r_{\max}(v_i) - r_{\min}(v_i)$:
- $r_{\min}(v_i) = \sum_{j \le i} g_j$
- $r_{\max}(v_i) = \sum_{j \le i} g_j + \Delta_i$
$g_i + \Delta_i - 1$ is an upper bound on the total number of elements that may have fallen between $v_{i-1}$ and $v_i$.
$r_{\min}(v_{s-1}) = \sum_{j \le s-1} g_j = n$.

16 Example of a Quantile Summary
{(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is a quantile summary consisting of 6 tuples $(v_i, g_i, \Delta_i)$ for the 16-element example stream 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3.
For clarity, rewrite the tuples of the above summary in the form $t_i = (v_i, r_{\min}(v_i), r_{\max}(v_i))$: {(1,1,1), (2,2,9), (3,3,10), (4,4,10), (10,10,10), (12,16,16)}.
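The rewrite from $(v_i, g_i, \Delta_i)$ to rank bounds is a running sum over the $g$ values. A minimal sketch, with `rank_bounds` as an illustrative name:

```python
from itertools import accumulate

# Example summary from the slide, as (v, g, delta) tuples.
summary = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]

def rank_bounds(summary):
    """Convert (v, g, delta) tuples into (v, r_min, r_max) tuples."""
    r_mins = accumulate(g for _, g, _ in summary)  # r_min(v_i) = sum of g_j for j <= i
    return [(v, r_min, r_min + delta)              # r_max(v_i) = r_min(v_i) + delta_i
            for (v, _, delta), r_min in zip(summary, r_mins)]

print(rank_bounds(summary))
```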

17 Error Rate?
Proposition 1: Given a quantile summary $S$, a $\phi$-quantile can always be identified to within an error of $\max_i (g_i + \Delta_i)/2$.
Corollary 1: If at any time $n$ the summary $S(n)$ satisfies $\max_i (g_i + \Delta_i) \le 2 \epsilon n$, then we can answer any $\phi$-quantile query to within $\epsilon n$ precision.

18 QUANTILE($\phi$)
To compute an $\epsilon$-approximate $\phi$-quantile from the summary $S(n)$ after $n$ data elements, compute the rank $r = \lceil \phi n \rceil$. Find $i$ such that both $r - r_{\min}(v_i) \le \epsilon n$ and $r_{\max}(v_i) - r \le \epsilon n$, and return $v_i$.
- That is, $r - \epsilon n \le r_{\min}(v_i) \le r_{\max}(v_i) \le r + \epsilon n$.

19 Example of a Quantile Summary
{(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is 0.25-approximate with respect to the 16-element data stream 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3, since $\max_i (g_i + \Delta_i) = 8 = 2 \epsilon n$. A 0.25-approximate 0.5-quantile query returns the value of the tuple (4,1,6) or (10,6,0), i.e. 4 or 10.
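The QUANTILE operation can be sketched as a single scan that accumulates $r_{\min}$ and tests both rank conditions. The function names are illustrative; `quantile` returns the first qualifying value, and `quantile_candidates` lists all of them.

```python
import math

# Example summary from the slide, as (v, g, delta) tuples, built over n = 16 elements.
summary = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]

def quantile_candidates(summary, phi, eps, n):
    """Return all v_i with r - r_min(v_i) <= eps*n and r_max(v_i) - r <= eps*n."""
    r = math.ceil(phi * n)
    found = []
    r_min = 0
    for v, g, delta in summary:
        r_min += g                # r_min(v_i) is the running sum of g
        r_max = r_min + delta     # r_max(v_i) = r_min(v_i) + delta_i
        if r - r_min <= eps * n and r_max - r <= eps * n:
            found.append(v)
    return found

def quantile(summary, phi, eps, n):
    """Return one eps-approximate phi-quantile, or None if no tuple qualifies."""
    candidates = quantile_candidates(summary, phi, eps, n)
    return candidates[0] if candidates else None

print(quantile_candidates(summary, 0.5, 0.25, 16))
```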

20 Outline of this talk: Quantile Estimation Overview; GK-quantile Summary Algorithm (Data Structure, Operations, Space Complexity Analysis); Sliding Window Model

21 How does their algorithm work?
Insert a tuple into the summary for each new incoming element.
Periodically sweep over the summary to "merge" some of the tuples into their neighbors.
- This bounds the space requirement.
At all times, $\max_i (g_i + \Delta_i) \le 2 \epsilon n$.
What to merge, and how to merge?

22 INSERT($v$)
Find the smallest $i$ such that $v_{i-1} \le v < v_i$, and insert the tuple $(v, 1, \lfloor 2 \epsilon n \rfloor)$ between $t_{i-1}$ and $t_i$. Increment $s$. As a special case, if $v$ is the new minimum or maximum element seen, insert $(v, 1, 0)$.

23 Example of INSERT
Elements shown on the slide: $t_0 = 12$, $t_3 = 10$, $t_4 = 1$, $t_8 = 6$.
S={(12, 1, 0)}, n=1
S={(6, 1, 0), (12, 1, 0)}, n=2
S={(6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=3
S={(1, 1, 0), (6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=4
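INSERT can be sketched with a binary search over the stored values. Two assumptions are made here: $\epsilon = 0.25$, which is consistent with the $\Delta = 1$ shown for the third insertion, and $n$ counts the elements seen before the insertion, as in the algorithm's pseudo-code. The function name `insert` is illustrative.

```python
import bisect
import math

def insert(summary, v, eps, n):
    """Insert v into a sorted list of (v, g, delta) tuples; n = elements seen before v."""
    values = [t[0] for t in summary]
    i = bisect.bisect_right(values, v)      # smallest i with v_{i-1} <= v < v_i
    if i == 0 or i == len(summary):
        delta = 0                           # new minimum or maximum
    else:
        delta = math.floor(2 * eps * n)
    summary.insert(i, (v, 1, delta))

# Replay the slide's example (insertion order 12, 6, 10, 1 is read off the listed states).
summary = []
for n, v in enumerate([12, 6, 10, 1]):
    insert(summary, v, eps=0.25, n=n)
print(summary)
```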

24 Merge Space grows with insertions. Intuitively, two tuples $(v_i, g_i, \Delta_i)$ and $(v_j, g_j, \Delta_j)$ can be merged into a new tuple $(v_k, g_k, \Delta_k)$ as long as $g_k + \Delta_k \le 2 \epsilon n$. An individual tuple is full if $g_k + \Delta_k = \lfloor 2 \epsilon n \rfloor$. Capacity and band are introduced to decide which tuples to merge.

25 Capacity and Band
The capacity of a tuple is the maximum number of elements that can be counted by $g_i$ before the tuple becomes full ($g_i \le 2 \epsilon n - \Delta_i$).
- The merge phase frees up space by merging tuples with small capacities into tuples with similar or larger capacities.
Bands: Roughly speaking, divide the $\Delta$s into bands that lie between elements of $(0, \tfrac{1}{2} 2 \epsilon n, \tfrac{3}{4} 2 \epsilon n, \ldots, \tfrac{2^i - 1}{2^i} 2 \epsilon n, \ldots, 2 \epsilon n - 1, 2 \epsilon n)$. The larger the capacity (i.e. the smaller the $\Delta$), the higher the band.

26 Example of a Quantile Summary
{(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is a quantile summary consisting of 6 tuples for the example stream 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3. (2,1,7) and (3,1,7) are in the lowest band; (1,1,0), (10,6,0) and (12,6,0) are in the highest band.

27 Band
Strictly: given $\alpha$ from 1 to $\lceil \log 2 \epsilon n \rceil$ and $p = \lfloor 2 \epsilon n \rfloor$, band $\alpha$ is the set of all $\Delta$ such that $p - 2^{\alpha} - (p \bmod 2^{\alpha}) < \Delta \le p - 2^{\alpha - 1} - (p \bmod 2^{\alpha - 1})$.
- If two $\Delta$s are ever in the same band, they never appear in different bands as $n$ increases.
- In band 0, $\Delta = \lfloor 2 \epsilon n \rfloor$.
A tree structure is imposed to facilitate merges between bands.
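The band definition translates directly into a small function. This is a sketch under one stated assumption: the loop runs up to `p.bit_length()`, which can be one step beyond $\lceil \log 2 \epsilon n \rceil$ when $p$ is a power of two, so that $\Delta = 0$ always lands in the highest band. The function name `band` is illustrative.

```python
def band(delta, p):
    """Band index of delta, where p = floor(2 * eps * n) and 0 <= delta <= p."""
    if delta == p:
        return 0                                   # band 0 holds only delta == p
    for alpha in range(1, p.bit_length() + 1):
        lower = p - 2**alpha - (p % 2**alpha)
        upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lower < delta <= upper:
            return alpha
    raise ValueError("delta must satisfy 0 <= delta <= p")

# Bands of the deltas in the example summary, with eps = 0.25 and n = 16, so p = 8.
print([band(d, 8) for d in (0, 7, 7, 6, 0, 0)])
```

With $p = 8$, the tuples with $\Delta = 7$ fall in band 1, the tuple with $\Delta = 6$ in band 2, and the tuples with $\Delta = 0$ in the highest band.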

28 Tree Representation Given a summary $S = \{t_0, t_1, \ldots, t_{s-1}\}$, the tree $T$ associated with $S$ contains a node $V_i$ for each $t_i$ and a special root node $R$. The parent of a node $V_i$ is the node $V_j$ such that $j$ is the least index greater than $i$ with $\mathrm{band}(t_j) > \mathrm{band}(t_i)$. If no such index exists, $R$ is the parent.
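The parent rule can be sketched as a forward scan for the next tuple in a strictly higher band. The band values below are those of the example summary {(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} with $p = 8$, under the assumption that $\Delta = 0$ sits in the highest band (numbered 4 here). `None` stands for the root $R$, and `parents` is an illustrative name.

```python
def parents(bands):
    """For each index i, return the least j > i with bands[j] > bands[i], or None for root R."""
    result = []
    for i, b in enumerate(bands):
        parent = next((j for j in range(i + 1, len(bands)) if bands[j] > b), None)
        result.append(parent)
    return result

# Bands of the six example tuples, in summary order.
example_bands = [4, 1, 1, 2, 4, 4]
print(parents(example_bands))
```

In this example (2,1,7) and (3,1,7) are children of (4,1,6), which is a child of (10,6,0); the remaining nodes hang directly off $R$.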

29 Tree Representation
Proposition 3: The children of any node in $T$ are always arranged in non-increasing order of band in $S$.
Proposition 4: For any node $V$, the set of all its descendants in $T$ forms a contiguous segment in $S$.
[Figure: tree $T$ for the summary (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0), with root $R$]

30 Merge The GK-algorithm actually merges a node together with all its descendants into either its parent node or its right sibling. The tuple that results from the merge must not be full, i.e. $g_i + \Delta_i < 2 \epsilon n$. The operation is called COMPRESS().

31 COMPRESS()
The operation COMPRESS tries to merge a node and all its descendants into either its parent node or its right sibling.
COMPRESS()
  for i from s-2 to 0 do
    if ((BAND($\Delta_i$, $2 \epsilon n$) $\le$ BAND($\Delta_{i+1}$, $2 \epsilon n$)) && ($g^* + g_{i+1} + \Delta_{i+1} < 2 \epsilon n$)) then
      DELETE all descendants of $t_i$ and the tuple $t_i$ itself;
    end if
  end for
end COMPRESS
$g^*$ denotes the sum of the $g$-values of the tuple $t_i$ and all its descendants in $T$.

32 DELETE($v_i$) To delete the tuple $(v_i, g_i, \Delta_i)$ from $S$, replace $(v_i, g_i, \Delta_i)$ and $(v_{i+1}, g_{i+1}, \Delta_{i+1})$ by the new tuple $(v_{i+1}, g_i + g_{i+1}, \Delta_{i+1})$, and decrement $s$.

33 Example of COMPRESS and DELETE
Stream so far ($t_0$ to $t_5$): 12, 10, 11, 10, 1, 10.
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)}, s=6, n=6
Compress tuples (11, 1, 1) and (12, 1, 0) into a new tuple (12, 2, 0).
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)}, s=5, n=6

34 Pseudo-Code for the whole algorithm
Initial state: $S \leftarrow \emptyset$; $s \leftarrow 0$; $n \leftarrow 0$.
Algorithm: to add the $(n+1)$st element, $v$, to summary $S(n)$:
  if ($n \equiv 0 \bmod \tfrac{1}{2\epsilon}$) then
    COMPRESS();
  end if
  INSERT($v$);
  $n = n + 1$;
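The pseudo-code, together with INSERT, COMPRESS and DELETE, can be put together as a runnable sketch. It is not the authors' implementation, and it rests on several assumptions: the class name `GKSummary` is hypothetical; COMPRESS stops at index 1 instead of 0 so that the minimum $t_0$ is never deleted (the slides state that $v_0$ is always the minimum); descendants of $t_i$ are taken to be the contiguous run of lower-band tuples directly to its left (Proposition 4); and $\Delta = 0$ is placed in the highest band.

```python
import bisect
import math

def band(delta, p):
    """Band index of delta for p = floor(2*eps*n); band 0 holds only delta == p."""
    if delta == p:
        return 0
    for alpha in range(1, p.bit_length() + 1):
        lower = p - 2**alpha - (p % 2**alpha)
        upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lower < delta <= upper:
            return alpha
    raise ValueError("delta must satisfy 0 <= delta <= p")

class GKSummary:
    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []                      # [v, g, delta] lists, sorted by v

    def _compress(self):
        p = math.floor(2 * self.eps * self.n)
        limit = 2 * self.eps * self.n
        i = len(self.tuples) - 2
        while i >= 1:                         # assumption: t_0 (the minimum) is kept
            b_i = band(self.tuples[i][2], p)
            if b_i <= band(self.tuples[i + 1][2], p):
                # g* = g of t_i plus g of its descendants (lower-band run to the left)
                start, g_star = i, self.tuples[i][1]
                while start - 1 >= 1 and band(self.tuples[start - 1][2], p) < b_i:
                    start -= 1
                    g_star += self.tuples[start][1]
                nxt = self.tuples[i + 1]
                if g_star + nxt[1] + nxt[2] < limit:
                    nxt[1] += g_star          # DELETE: fold the g values into t_{i+1}
                    del self.tuples[start:i + 1]
                    i = start
            i -= 1

    def insert(self, v):
        period = max(1, math.floor(1 / (2 * self.eps)))
        if self.n % period == 0:
            self._compress()
        i = bisect.bisect_right([t[0] for t in self.tuples], v)
        is_extreme = i == 0 or i == len(self.tuples)
        delta = 0 if is_extreme else math.floor(2 * self.eps * self.n)
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1

gk = GKSummary(eps=0.25)
for value in [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]:
    gk.insert(value)
print([tuple(t) for t in gk.tuples])
```

Run on the 16-element example stream with $\epsilon = 0.25$, a hand trace of this sketch ends with the six tuples (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0), the same final summary the slides use in their examples.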

35 A Complete Example
Stream shown ($t_0$ to $t_6$): 12, 10, 11, 10, 1, 10, 11.
S={(10, 1, 0), (12, 1, 0)}, n=2
S={(10, 1, 0), (10, 1, 1), (11, 1, 1), (12, 1, 0)}, n=4
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)}, n=6, s=6
Perform compress when $t_6$ comes.
S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)}, n=6, s=5

36 A Complete Example
Stream shown ($t_0$ to $t_8$): 12, 10, 11, 10, 1, 10, 11, 9, 6.
S={(1, 1, 0), (9, 1, 3), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 3), (12, 2, 0)}, n=8, s=7
Perform compress when $t_8$ comes.
S={(1, 1, 0), (10, 2, 0), (10, 1, 1), (10, 1, 2), (12, 3, 0)}, n=8, s=5

37 A Complete Example
Stream shown ($t_0$ to $t_{15}$): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3.
S={(1, 1, 0), (4, 1, 6), (5, 1, 6), (10, 5, 0), (12, 6, 0)}, n=14, s=5
Perform compress.
S={(1, 1, 0), (4, 1, 6), (10, 6, 0), (12, 6, 0)}, n=14, s=4
Finally, S={(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)}, n=16, s=6

38 Outline of this talk: Quantile Estimation Overview; GK-quantile Summary Algorithm (Data Structure, Operations, Space Complexity Analysis); Sliding Window Model

39 Band Property
Observe that the number of bands and the number of elements in a band determine the space complexity.
Proposition 2: At any point in time $n$ and for any $\alpha \ge 1$, $\mathrm{band}_\alpha(n)$ contains either $2^{\alpha}$ or $2^{\alpha-1}$ distinct values of $\Delta$.
- Since no more than $\tfrac{1}{2\epsilon}$ elements with any given $\Delta$ are inserted, $\mathrm{band}_\alpha$ is a summary of at most $\tfrac{2^{\alpha}}{2\epsilon}$ elements in the stream.

40 Lemmas
Lemma 3: At any time $n$ and for any given $\alpha$, there are at most $\tfrac{3}{2\epsilon}$ nodes in $T(n)$ that have a child with band value $\alpha$.
- Only a small number of nodes can have a child with band $\alpha$. See Proposition 3.

41 Lemmas
A full pair of tuples $(t_{i-1}, t_i)$: $\mathrm{band}(t_{i-1}) \le \mathrm{band}(t_i)$. The tuple $t_{i-1}$ is the left partner and $t_i$ is the right partner in this full pair.
Lemma 4: At any time $n$ and for any given $\alpha$, there are at most $\tfrac{4}{\epsilon}$ tuples from $\mathrm{band}_\alpha(n)$ that are right partners in a full tuple pair.

42 Full Pair Example
{(2,1,7), (3,1,7)} is a full pair. {(1,1,0), (2,1,7)} is not a full pair. (2,1,7) can only be a left partner!
[Figure: tree $T$ for the summary (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0), with root $R$]

43 Space Efficiency
Any $\mathrm{band}_\alpha(n)$ node either is a right partner of a full pair or can only be a left partner.
By Proposition 3, a $\mathrm{band}_\alpha(n)$ node that can only be a left partner occurs only once for every parent of nodes from $\mathrm{band}_\alpha(n)$.
By Lemmas 3 and 4, the number of nodes in any band is bounded by $\tfrac{3}{2\epsilon} + \tfrac{4}{\epsilon} = \tfrac{11}{2\epsilon}$.

44 Space Efficiency
The number of bands is at most $\lceil \log 2 \epsilon n \rceil + 1$.
Theorem: At any time $n$, the total number of tuples stored in $S(n)$ is at most $\tfrac{11}{2\epsilon} \log(2 \epsilon n)$.
The GK-algorithm's space complexity is $O(\tfrac{1}{\epsilon} \log(\epsilon N))$.
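As a worked instance of the bound, take the running example with $n = 16$ and $\epsilon = 0.25$, and assume the logarithm is base 2:

\[
\frac{11}{2\epsilon} \log_2(2 \epsilon n) = \frac{11}{0.5} \log_2 8 = 22 \cdot 3 = 66 .
\]

The example summary holds only 6 tuples, so for a stream this small the worst-case bound is far from tight; it matters for large $n$, where it grows only logarithmically.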

45 Outline of this talk: Quantile Estimation Overview; GK-quantile Summary Algorithm (Data Structure, Operations, Space Complexity Analysis); Sliding Window Model

46 Sliding Window Model Under the sliding window model, a summary is maintained for the $N$ most recently seen data elements. Eliminating out-dated elements exactly requires $O(N)$ space. Lin et al. (ICDE 2004) proposed a space-efficient one-pass summary algorithm for the sliding window model. Their underlying summary algorithm is the GK-algorithm.

47 n-of-N Model A summary is maintained for the $N$ most recently seen data elements, but quantile queries can be issued against any $n \le N$. That is, for any $\phi \in (0,1]$ and any $n \le N$, we can return $\phi$-quantiles among the $n$ most recent elements seen so far in the data stream. Lin et al. (ICDE 2004) proposed a one-pass summary algorithm that combines the EH partitioning technique (Datar et al., ACM-SIAM 2002) with the GK-algorithm to solve the n-of-N model.

48 Example of n-of-N model
Assume the sliding window size is 16 in an n-of-N model, over the stream 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3. A quantile query can be answered for any $1 \le n \le 16$. A 0.5-quantile query returns 6 for n=12 and 3 for n=4.
FYI: The sorted order of the whole sequence is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12.
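The n-of-N answers in this example can be checked with an exact brute-force query over the last $n$ elements. This is only a reference computation that stores the whole window; it is not the space-efficient algorithm of Lin et al. The function name `n_of_N_quantile` is illustrative.

```python
import math

# Example stream from the slides, in arrival order.
stream = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]

def n_of_N_quantile(stream, phi, n):
    """Exact phi-quantile among the n most recent elements of the stream."""
    window = sorted(stream[-n:])
    return window[math.ceil(phi * n) - 1]

print(n_of_N_quantile(stream, 0.5, 12))
print(n_of_N_quantile(stream, 0.5, 4))
```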

49 Thank you!