1 Online Computation and Continuous Maintaining of Quantile Summaries Tian Xia Database CCIS Northeastern University April 16, 2004
2 References M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. In SIGMOD, pages 58-66. X. Lin, H. Lu, J. Xu, and J. X. Yu. Continuously Maintaining Quantile Summaries of the Most Recent N Elements over a Data Stream. In ICDE, 2004.
3 Outline of this talk: Quantile Estimation Overview; GK-quantile Summary Algorithm (Data Structure, Operations, Space Complexity Analysis); Sliding Window Model.
4 Problem Definitions φ-Quantile: A φ-quantile (φ ∈ (0,1]) of an ordered sequence of N data elements is the element with rank ⌈φN⌉. Quantile Query: Given φ, find the data element with rank ⌈φN⌉ among all elements in the stream. Variation: the N most recent elements (sliding window model). ε-approximate: Find an element whose rank lies within the interval [r - εN, r + εN], where r = ⌈φN⌉.
5 Example of A Quantile Query The sorted order of the sequence is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12. A 0.5-quantile query returns the element ranked 8, which is 8. A 0.25-approximate 0.5-quantile query returns one of the elements in {4,5,6,7,8,9,10}. Stream (t_0 to t_15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3.
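The two definitions can be checked by brute force on the example stream. The following is only an illustrative sketch: the function names are hypothetical, and the value t_11 = 11 is an assumption inferred from the sorted order given on the slide.

```python
import math

# Example stream from the slide; t_11 = 11 is inferred from the sorted order.
STREAM = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]

def exact_quantile(data, phi):
    """Return the element of rank ceil(phi * N); ranks start at 1."""
    ordered = sorted(data)
    rank = math.ceil(phi * len(ordered))
    return ordered[rank - 1]

def acceptable_answers(data, phi, eps):
    """Return every value whose rank lies in [r - eps*N, r + eps*N]."""
    ordered = sorted(data)
    n = len(ordered)
    r = math.ceil(phi * n)
    lo = max(1, math.ceil(r - eps * n))   # lowest admissible rank
    hi = min(n, math.floor(r + eps * n))  # highest admissible rank
    return sorted(set(ordered[lo - 1:hi]))
```

Sorting the whole stream takes O(N) space, which is exactly what the summary techniques in the rest of the talk avoid.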
6 Why Approximation? Munro and Paterson (Theoretical Computer Science, 1980) showed that any algorithm which exactly computes a φ-quantile of N data elements in p passes requires Ω(N^(1/p)) space. Approximate quantile techniques are necessary to achieve sub-linear space efficiency.
7 Quantile Summary Quantile Summary: A small number of objects from the input data sequence, which could be used (by quantile estimator) to answer quantile queries. Other summary methods of large data sets include average, standard deviation, histogram, counting sketch (FM-sketch), etc.
8 Properties of A Good Quantile Estimator Provide tunable and explicit a priori guarantees on the precision of the approximation, e.g. it is ε-approximate. Data independent. Use as small a memory footprint as possible, including temporary storage.
9 Previous Work Manku, Rajagopalan, and Lindsay (SIGMOD, 1998) proposed a single-pass algorithm that constructs an ε-approximate quantile summary. Space complexity: O((1/ε) log²(εN)). It requires advance knowledge of N, the size of the data set, so it does not work in a data stream environment.
11 Contributions of GK-algorithm Dynamically adjusts the quantile summary as N, the total number of data elements in the data stream, grows. Space complexity is reduced to O((1/ε) log(εN)).
12 Assumptions A new data element arrives after each unit of time. n denotes both the number of elements seen so far and the current time. A data element is represented by its value v. r_min(v) and r_max(v) denote, respectively, the lower and upper bounds on the actual rank r of v among the elements seen so far.
13 The Summary Data Structure GK-algorithm maintains a summary data structure S=S(n) at any point in time n. S(n) consists of an ordered (non-decreasing) sequence of tuples which corresponds to a subset of the elements seen thus far.
14 The Summary Data Structure S = {t_0, t_1, …, t_{s-1}}, where t_i = (v_i, g_i, Δ_i). v_i is the value of one of the elements seen so far. g_i = r_min(v_i) - r_min(v_{i-1}). Δ_i = r_max(v_i) - r_min(v_i). v_0 and v_{s-1} always correspond to the minimum and the maximum elements seen so far.
15 The Summary Data Structure Given g_i = r_min(v_i) - r_min(v_{i-1}) and Δ_i = r_max(v_i) - r_min(v_i): r_min(v_i) = Σ_{j≤i} g_j, and r_max(v_i) = Σ_{j≤i} g_j + Δ_i. g_i + Δ_i - 1 is an upper bound on the total number of elements that may have fallen between v_{i-1} and v_i. r_min(v_{s-1}) = Σ_j g_j = n.
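The prefix-sum relations above translate directly into code. This is a minimal sketch; the function name `rank_bounds` and the list-of-tuples representation are assumptions made for illustration, and the sample summary is the one used in the deck's examples.

```python
from itertools import accumulate

def rank_bounds(summary):
    """Map (v, g, delta) tuples to (v, r_min, r_max) using prefix sums of g."""
    r_mins = accumulate(g for _, g, _ in summary)
    return [(v, r_min, r_min + delta)
            for (v, _, delta), r_min in zip(summary, r_mins)]

# Summary from the deck's running example (n = 16).
S = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]
BOUNDS = rank_bounds(S)
```

The last r_min equals n, as the slide states, because the g-values sum to the number of elements seen.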
16 Example of A Quantile Summary {(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is a quantile summary consisting of 6 tuples. For clarity, re-write the tuples of the above summary in the form t_i = (v_i, r_min(v_i), r_max(v_i)) as follows: {(1,1,1), (2,2,9), (3,3,10), (4,4,10), (10,10,10), (12,16,16)}. Stream (t_0 to t_15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3.
17 Error Rate? PROPOSITION 1: Given a quantile summary S, a φ-quantile can always be identified to within an error of max_i (g_i + Δ_i)/2. COROLLARY 1: If at any time n the summary S(n) satisfies the property that max_i (g_i + Δ_i) ≤ 2εn, then we can answer any φ-quantile query to within εn precision.
18 QUANTILE(φ) QUANTILE(φ): To compute an ε-approximate φ-quantile from the summary S(n) after n data elements, compute the rank r = ⌈φn⌉. Find i such that both r - r_min(v_i) ≤ εn and r_max(v_i) - r ≤ εn, and return v_i. That is, r - εn ≤ r_min(v_i) ≤ r_max(v_i) ≤ r + εn.
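A minimal sketch of the QUANTILE(φ) lookup, assuming the summary is held as a Python list of (v, g, Δ) tuples in sorted order. The function name and the choice to return the first matching tuple are assumptions; the slide allows any i that satisfies both conditions.

```python
import math

def quantile(summary, n, phi, eps):
    """Return some v_i with r - eps*n <= r_min(v_i) and r_max(v_i) <= r + eps*n."""
    r = math.ceil(phi * n)
    r_min = 0
    for v, g, delta in summary:
        r_min += g             # r_min(v_i) is the running sum of g
        r_max = r_min + delta  # r_max(v_i) = r_min(v_i) + delta_i
        if r - r_min <= eps * n and r_max - r <= eps * n:
            return v
    return None  # reached only if the summary is not eps-approximate

S = [(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)]
```

For this summary with n = 16 and ε = 0.25, both (4,1,6) and (10,6,0) satisfy the conditions for φ = 0.5; the sketch returns the first one.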
19 Example of A Quantile Summary {(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is 0.25-approximate with respect to the data stream. A 0.25-approximate 0.5-quantile query returns the element of (4,1,6) or (10,6,0), i.e. 4 or 10. Stream (t_0 to t_15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3.
21 How does their algorithm work? Insert a tuple into the summary for each new incoming element. Periodically sweep over the summary to "merge" some of the tuples into their neighbors; this keeps the space requirement bounded. At all times max_i (g_i + Δ_i) ≤ 2εn. What to merge, and how to merge?
22 INSERT(v) INSERT(v): Find the smallest i such that v_{i-1} ≤ v < v_i, and insert the tuple (v, 1, ⌊2εn⌋) between t_{i-1} and t_i. Increment s. As a special case, if v is the new minimum or the new maximum element seen, then insert (v, 1, 0).
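A minimal sketch of INSERT(v), assuming the summary is a sorted Python list of (v, g, Δ) tuples and that n counts the elements seen before v arrives. The function name is hypothetical, and the handling of a value equal to the current maximum (placing it just before the maximum tuple) is an assumption, since the slide does not cover that case.

```python
import bisect
import math

def insert(summary, v, n, eps):
    """Insert v into a sorted list of (v, g, delta); n = elements seen before v."""
    values = [t[0] for t in summary]
    is_extreme = not values or v < values[0] or v > values[-1]
    # New minimum or maximum gets delta = 0; otherwise delta = floor(2*eps*n).
    delta = 0 if is_extreme else math.floor(2 * eps * n)
    # bisect_right gives the smallest i with v_{i-1} <= v < v_i.
    i = bisect.bisect_right(values, v)
    if not is_extreme:
        i = min(i, len(values) - 1)  # assumption: a tie with the max goes before it
    summary.insert(i, (v, 1, delta))

S = []
for n, v in enumerate([12, 6, 10, 1]):
    insert(S, v, n, eps=0.25)
```

With ε = 0.25 this reproduces the four-step INSERT example: 12 and 6 enter as extremes, 10 enters with Δ = 1, and 1 becomes the new minimum.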
23 Example of INSERT S={(12, 1, 0)}, n=1. S={(6, 1, 0), (12, 1, 0)}, n=2. S={(6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=3. S={(1, 1, 0), (6, 1, 0), (10, 1, 1), (12, 1, 0)}, n=4. Stream elements used: t_0 = 12, t_3 = 10, t_4 = 1, t_8 = 6.
24 Merge Space will increase with insertions. Intuitively, two tuples (v_i, g_i, Δ_i) and (v_j, g_j, Δ_j) can be merged into a new tuple (v_k, g_k, Δ_k) as long as g_k + Δ_k ≤ 2εn. An individual tuple is full if g_k + Δ_k = ⌊2εn⌋. Capacity and Band are introduced.
25 Capacity and Band The capacity of a tuple is the maximum number of elements that can be counted by g_i before the tuple becomes full (g_i ≤ 2εn - Δ_i). The merge phase frees up space by merging tuples with small capacities into tuples with similar or larger capacities. Bands: Roughly speaking, divide the Δs into bands that lie between elements of (0, (1/2)·2εn, (3/4)·2εn, …, ((2^i - 1)/2^i)·2εn, …, 2εn - 1, 2εn). The larger the capacity (i.e. the smaller the Δ), the larger the band.
26 Example of A Quantile Summary {(1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0)} is a quantile summary consisting of 6 tuples. (2,1,7) and (3,1,7) are in the lowest band. (1,1,0), (10,6,0) and (12,6,0) are in the highest band. Stream (t_0 to t_15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3.
27 Band Strictly: given α from 1 to ⌈log 2εn⌉ and p = ⌊2εn⌋, band α is the set of all Δ such that p - 2^α - (p mod 2^α) < Δ ≤ p - 2^(α-1) - (p mod 2^(α-1)). If two Δs are ever in the same band, they never appear in different bands as n increases. In band 0, Δ = ⌊2εn⌋. A tree structure is imposed to facilitate merges between bands.
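The band definition can be sketched as a small function. The name `band` and the open-ended search over α are assumptions: the slide bounds α by ⌈log 2εn⌉, while this sketch keeps increasing α until an interval matches, so that Δ = 0 always lands in the highest band.

```python
def band(delta, p):
    """Band index of delta for p = floor(2*eps*n); requires 0 <= delta <= p."""
    if not 0 <= delta <= p:
        raise ValueError("delta must lie in [0, p]")
    if delta == p:
        return 0  # band 0 holds only delta = p
    alpha = 1
    while True:
        lower = p - 2**alpha - (p % 2**alpha)
        upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lower < delta <= upper:
            return alpha
        alpha += 1  # intervals tile downward, so some alpha eventually matches

# Bands for every delta when p = 8 (n = 16, eps = 0.25).
BANDS_P8 = {delta: band(delta, 8) for delta in range(9)}
```

For p = 8 this places Δ = 7 in band 1, Δ = 6 in band 2 and Δ = 0 in band 4, consistent with the deck's statement that (2,1,7) and (3,1,7) are in the lowest band of the example summary.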
28 Tree Representation Given a summary S = {t_0, t_1, …, t_{s-1}}, the tree T associated with S contains a node V_i for each t_i and a special root node R. The parent of a node V_i is the node V_j such that j is the least index greater than i with band(t_j) > band(t_i). If no such index exists, R is the parent.
29 Tree Representation PROPOSITION 3: The children of any node in T are always arranged in non-increasing order of band in S. PROPOSITION 4: For any node V, the set of all its descendants in T forms a contiguous segment in S. (Figure: tree T over the tuples (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0), with root R.)
30 Merge In practice, the GK-algorithm merges a node together with all its descendants into either its parent node or its right sibling. The tuple that results from the merge must not be full, i.e. g_i + Δ_i < 2εn. The operation is called COMPRESS().
31 COMPRESS() The operation COMPRESS tries to merge a node and all its descendants into either its parent node or its right sibling.
COMPRESS()
    for i from s-2 to 0 do
        if (BAND(Δ_i, 2εn) ≤ BAND(Δ_{i+1}, 2εn)) && (g*_i + g_{i+1} + Δ_{i+1} < 2εn) then
            DELETE all descendants of t_i and the tuple t_i itself;
        end if
    end for
end COMPRESS
g*_i denotes the sum of the g-values of the tuple t_i and all its descendants in T.
32 DELETE(v_i) DELETE(v_i): To delete the tuple (v_i, g_i, Δ_i) from S, replace (v_i, g_i, Δ_i) and (v_{i+1}, g_{i+1}, Δ_{i+1}) by the new tuple (v_{i+1}, g_i + g_{i+1}, Δ_{i+1}), and decrement s.
33 Example of COMPRESS and DELETE S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)}, s=6, n=6. Compress tuples (11, 1, 1) and (12, 1, 0) into a new tuple (12, 2, 0). S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)}, s=5, n=6. Stream so far (t_0 to t_5): 12, 10, 11, 10, 1, 10.
34 Pseudo-Code for the whole algorithm
Initial State
    S ← ∅; s = 0; n = 0;
Algorithm
    To add the (n+1)st element, v, to summary S(n):
    if (n ≡ 0 mod 1/(2ε)) then
        COMPRESS();
    end if
    INSERT(v);
    n = n+1;
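The pseudo-code can be turned into a short runnable sketch. This is one possible reading of the slides, not a reference implementation: the class name `GKSummary`, the list representation, the open-ended band search, the rule that the minimum tuple (index 0) is never deleted, and the placement of a value tied with the maximum are all assumptions. Descendants of t_i are taken to be the contiguous run of lower-band tuples immediately to its left, following Proposition 4.

```python
import bisect
import math

def band(delta, p):
    """Band of delta for p = floor(2*eps*n); smaller delta means higher band."""
    if delta == p:
        return 0
    alpha = 1
    while True:
        lower = p - 2**alpha - (p % 2**alpha)
        upper = p - 2**(alpha - 1) - (p % 2**(alpha - 1))
        if lower < delta <= upper:
            return alpha
        alpha += 1

class GKSummary:
    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []  # [v, g, delta] entries, sorted by v
        self.period = max(1, int(1 / (2 * eps)))  # COMPRESS every 1/(2*eps) items

    def add(self, v):
        if self.n % self.period == 0:
            self.compress()
        self.insert(v)
        self.n += 1

    def insert(self, v):
        values = [t[0] for t in self.tuples]
        is_extreme = not values or v < values[0] or v > values[-1]
        delta = 0 if is_extreme else math.floor(2 * self.eps * self.n)
        i = bisect.bisect_right(values, v)
        if not is_extreme:
            i = min(i, len(values) - 1)  # assumption: ties with the max go before it
        self.tuples.insert(i, [v, 1, delta])

    def compress(self):
        p = math.floor(2 * self.eps * self.n)
        s = self.tuples
        i = len(s) - 2
        while i >= 1:  # assumption: the minimum tuple at index 0 is kept
            b_i = band(s[i][2], p)
            if b_i <= band(s[i + 1][2], p):
                # Collect t_i and its descendants (lower-band run to the left).
                start, g_star = i, s[i][1]
                while start - 1 >= 1 and band(s[start - 1][2], p) < b_i:
                    start -= 1
                    g_star += s[start][1]
                if g_star + s[i + 1][1] + s[i + 1][2] < p:
                    s[i + 1][1] += g_star  # DELETE: fold g-values into t_{i+1}
                    del s[start:i + 1]
                    i = start              # merged tuple now sits at index start
            i -= 1

gk = GKSummary(eps=0.25)
for v in [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]:
    gk.add(v)
FINAL = [tuple(t) for t in gk.tuples]
```

Run on the deck's 16-element stream with ε = 0.25, this sketch ends with the same six tuples as the final summary in the deck's complete example. Floating-point values of ε that are not exact binary fractions may need extra care in the floor computations.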
35 A Complete Example S={(10, 1, 0), (12, 1, 0)}, n=2. S={(10, 1, 0), (10, 1, 1), (11, 1, 1), (12, 1, 0)}, n=4. S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 1), (12, 1, 0)}, n=6, s=6. Perform compress when t_6 arrives. S={(1, 1, 0), (10, 1, 0), (10, 1, 1), (10, 1, 2), (12, 2, 0)}, n=6, s=5. Stream so far (t_0 to t_6): 12, 10, 11, 10, 1, 10, 11.
36 A Complete Example S={(1, 1, 0), (9, 1, 3), (10, 1, 0), (10, 1, 1), (10, 1, 2), (11, 1, 3), (12, 2, 0)}, n=8, s=7. Perform compress when t_8 arrives. S={(1, 1, 0), (10, 2, 0), (10, 1, 1), (10, 1, 2), (12, 3, 0)}, n=8, s=5. Stream so far (t_0 to t_8): 12, 10, 11, 10, 1, 10, 11, 9, 6.
37 A Complete Example S={(1, 1, 0), (4, 1, 6), (5, 1, 6), (10, 5, 0), (12, 6, 0)}, n=14, s=5. Perform compress. S={(1, 1, 0), (4, 1, 6), (10, 6, 0), (12, 6, 0)}, n=14, s=4. Finally, S={(1, 1, 0), (2, 1, 7), (3, 1, 7), (4, 1, 6), (10, 6, 0), (12, 6, 0)}, n=16, s=6. Stream (t_0 to t_15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3.
39 Band Property Observe that the number of bands and the number of elements in a band determine the space complexity. PROPOSITION 2: At any point in time n and for any α ≥ 1, band_α(n) contains either 2^α or 2^(α-1) distinct values of Δ. Since no more than 1/(2ε) elements with any given Δ are inserted, band α is a summary of at most 2^α/(2ε) elements in the stream.
40 LEMMAs LEMMA 3: At any time n and for any given α, there are at most 3/(2ε) nodes in T(n) that have a child with band value of α. Only a small number of nodes can have a child with band α. See Proposition 3.
41 LEMMAs A full pair of tuples (t_{i-1}, t_i): band(t_{i-1}) ≤ band(t_i). The tuple t_{i-1} is the left partner and t_i is the right partner in this full pair. LEMMA 4: At any time n and for any given α, there are at most 4/ε tuples from band_α(n) that are right partners in a full tuple pair.
42 Full Pair Example {(2,1,7), (3,1,7)} is a full pair. {(1,1,0), (2,1,7)} is not a full pair. (2,1,7) can only be a left partner! (Figure: tree T over the tuples (1,1,0), (2,1,7), (3,1,7), (4,1,6), (10,6,0), (12,6,0), with root R.)
43 Space Efficiency Any band_α(n) node either is a right partner of a full pair, or can only be a left partner. By Proposition 3, a band_α(n) node that can only be a left partner occurs only once for every parent of nodes from band_α(n). By Lemmas 3 and 4, the number of nodes in any band is bounded by 3/(2ε) + 4/ε = 11/(2ε).
44 Space Efficiency The number of bands is ⌈log 2εn⌉ + 1. THEOREM: At any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn). The GK-algorithm's space complexity is O((1/ε) log(εN)).
46 Sliding Window Model Under the sliding window model, a summary is maintained for the N most recently seen data elements. Eliminating exactly the out-dated elements requires O(N) space. Lin et al. (ICDE 2004) proposed a space-efficient one-pass summary algorithm for the sliding window model. Their underlying summary algorithm is the GK-algorithm.
47 n-of-N Model A summary is maintained for the N most recently seen data elements. However, quantile queries can be issued against any n ≤ N. That is, for any φ ∈ (0,1] and any n ≤ N, we can return φ-quantiles among the n most recent elements in the data stream seen so far. Lin et al. (ICDE 2004) proposed a one-pass summary algorithm that combines the EH partitioning technique (Datar et al., ACM-SIAM 2002) with the GK-algorithm to solve the n-of-N model.
48 Example of n-of-N model Assume the sliding window is 16 in an n-of-N model. A quantile query can be answered for any 1 ≤ n ≤ 16. A 0.5-quantile query returns 6 for n=12 and 3 for n=4. FYI: The sorted order of the sequence is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 11, 11, 11, 12. Stream (t_0 to t_15): 12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3.
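The n-of-N answers in this example can be checked with an exact brute-force reference. This is not the algorithm of Lin et al.; it stores the whole window, which is precisely the O(N) cost the summary approach avoids. The function name is hypothetical, and t_11 = 11 is inferred from the sorted order.

```python
import math

# Example stream; t_11 = 11 is inferred from the sorted order on the slide.
STREAM = [12, 10, 11, 10, 1, 10, 11, 9, 6, 7, 8, 11, 4, 5, 2, 3]

def n_of_N_quantile(stream, n, phi):
    """Exact phi-quantile among the n most recent elements of the stream."""
    recent = sorted(stream[-n:])
    rank = math.ceil(phi * n)  # ranks start at 1
    return recent[rank - 1]
```

For n = 12 the window is t_4 to t_15, and for n = 4 it is t_12 to t_15.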
49 Thank you!