Space-Efficient Online Computation of Quantile Summaries Michael Greenwald & Sanjeev Khanna University of Pennsylvania Presented by nir levy
Introduction
The problem: we are given a very large data set and wish to compute Φ-quantiles in a single pass using as little space as possible.
Def: The Φ-quantile of an ordered sequence of N data items is the value with rank ⌈ΦN⌉ (the element in position ΦN).
We are going to see an online algorithm for computing ε-approximate quantile summaries of a very large data sequence.
Def: An ε-approximate quantile summary of a sequence of N elements is a data structure that can answer quantile queries about the sequence to within a precision of εN.
Def: A quantile summary consists of a small number of points from the input data sequence, and uses those quantile estimates to give approximate responses to any arbitrary quantile query.
Introduction cont…
EXAMPLE
Input data: 14, 2, 12, 5, 6, 19, 1, 14, 4, 9, 12, 3, 8, 11, 15, 4
Ordered: 19, 15, 14, 14, 12, 12, 11, 9, 8, 6, 5, 4, 4, 3, 2, 1
Rank: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
What is the 2nd biggest number? (15) What is the 25% quantile? (14)
Summary: 19, 14, 11, 6, 4, 1
Rank: 1, 4, 7, 10, 13, 16
What is the 2nd biggest number? 2nd ≈ 1st (19)
What is the 25% quantile? 16 × 0.25 = 4, so the 4th (14)
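The example above can be replayed with a few lines of Python. This is only an illustrative sketch: the function names are made up, ranks count from the largest item as on the slide, and the summary is assumed to keep every third item (ranks 1, 4, 7, 10, 13, 16).

```python
import math

def exact_quantile(data, phi):
    """Item of rank ceil(phi * N), counting from the largest (rank 1) as on the slide."""
    ordered = sorted(data, reverse=True)
    rank = max(1, math.ceil(phi * len(ordered)))
    return ordered[rank - 1]

def summary_quantile(summary, phi, n):
    """Answer from (value, rank) pairs by picking the stored rank closest to ceil(phi * n)."""
    rank = max(1, math.ceil(phi * n))
    value, _ = min(summary, key=lambda pair: abs(pair[1] - rank))
    return value

data = [14, 2, 12, 5, 6, 19, 1, 14, 4, 9, 12, 3, 8, 11, 15, 4]
ordered = sorted(data, reverse=True)
# Assumed summary: every third item together with its rank (1, 4, 7, 10, 13, 16).
summary = [(ordered[r - 1], r) for r in range(1, len(ordered) + 1, 3)]

print(exact_quantile(data, 0.25))                    # exact 25% quantile
print(summary_quantile(summary, 0.25, len(data)))    # same query answered from the summary
print(summary_quantile(summary, 2 / 16, len(data)))  # rank 2 falls back to the stored rank 1
```

The summary answers the rank-2 query with the rank-1 item, an error of one position, while storing 6 of the 16 items.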
Quantile estimation for Database Applications
Estimate the size of intermediate results, so that query optimizers can estimate the cost of competing plans for resolving database queries.
Partition data into roughly equal partitions for parallel databases.
Prevent expensive and incorrect queries from being issued, by estimating result sizes and giving feedback to the users.
Characterize the distribution of real-world data sets for database users.
Properties
Properties for quantile estimators:
1. Provide tunable and explicit guarantees on the precision of the approximation. That is, for any given rank r, an ε-approximate quantile summary returns a value whose rank r' is guaranteed to be within the interval [r-εN, r+εN].
2. Be data independent. That is, the guarantee should not be affected by the arrival order or distribution of the values, nor should the estimator require a priori knowledge of the size of the data set.
3. Execute in a single pass over the data.
4. Have as small a memory footprint as possible (this applies to temporary storage during the computation).
Previous Work
Manku, Rajagopalan and Lindsay (MRL) presented a single-pass algorithm for an ε-approximate quantile summary that requires $O(\frac{1}{\varepsilon} \log^2(\varepsilon N))$ space but needs advance knowledge of N (otherwise it provides only a probabilistic guarantee on the precision).
Gibbons, Matias and Poosala presented a multiple-pass algorithm with a probabilistic guarantee.
Munro and Paterson showed that any algorithm that exactly computes the Φ-quantile in only p passes requires $\Omega(N^{1/p})$ space.
This algorithm has a worst-case space requirement of $O(\frac{1}{\varepsilon} \log(\varepsilon N))$, thus improving upon the previous best result of $O(\frac{1}{\varepsilon} \log^2(\varepsilon N))$.
In contrast to earlier algorithms, it does not require a priori knowledge of the length of the input sequence.
It is based on a novel data structure that effectively maintains the range of possible ranks for each quantile that it stores.
The behavior rests on the fact that no input sequence can be “bad” across the entire distribution; that is, the input sequence cannot keep presenting new observations that must be stored without allowing old stored observations to be deleted.
The Data Structure
Assume w.l.o.g. that a new observation arrives after each unit of time.
Denote by n the number of observations seen so far, which is also the current time.
Denote by ε the given precision requirement.
Denote by S = S(n) the summary data structure at time n.
S(n) consists of an ordered sequence of elements corresponding to a subset of the observations seen thus far.
For each observation v in S, maintain implicit bounds on the minimum and the maximum possible rank of v among the first n observations (denoted by $R_{\min}(v)$ and $R_{\max}(v)$).
Data structure cont…
More formally, let S(n) be the set of tuples $t_0, t_1, \dots, t_{s-1}$ where $t_i = (V_i, g_i, \Delta_i)$:
$V_i$ is one of the elements of the data stream
$g_i = R_{\min}(V_i) - R_{\min}(V_{i-1})$
$\Delta_i = R_{\max}(V_i) - R_{\min}(V_i)$
The sum of the g values telescopes (with $g_0 = R_{\min}(V_0)$):
$\sum_{j \le i} g_j = [R_{\min}(V_i) - R_{\min}(V_{i-1})] + [R_{\min}(V_{i-1}) - R_{\min}(V_{i-2})] + \dots + [R_{\min}(V_1) - R_{\min}(V_0)] + R_{\min}(V_0) = R_{\min}(V_i)$
$\left(\sum_{j \le i} g_j\right) + \Delta_i = R_{\min}(V_i) + R_{\max}(V_i) - R_{\min}(V_i) = R_{\max}(V_i)$
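As a small sketch of these identities, the rank bounds can be recovered from the stored (V, g, ∆) tuples with a running sum. The tuple values below are invented for illustration and the function name is hypothetical.

```python
from itertools import accumulate

def rank_bounds(summary):
    """summary: list of (v, g, delta) sorted by v. Returns (v, r_min, r_max) per tuple."""
    r_mins = accumulate(g for _, g, _ in summary)   # r_min(v_i) = g_0 + ... + g_i
    return [(v, r_min, r_min + delta)               # r_max(v_i) = r_min(v_i) + delta_i
            for (v, _, delta), r_min in zip(summary, r_mins)]

# Made-up summary of n = 7 observations.
summary = [(1, 1, 0), (4, 2, 1), (9, 3, 2), (19, 1, 0)]
bounds = rank_bounds(summary)
for v, r_min, r_max in bounds:
    print(v, r_min, r_max)
```

Only g and ∆ are stored; the bounds are implicit and shift automatically when a tuple is inserted to the left.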
Data structure cont…
At all times ensure that $V_0$ and $V_{s-1}$ correspond to the minimum and maximum elements seen so far.
$g_i + \Delta_i - 1$ is an upper bound on the total number of observations that may have fallen between $V_{i-1}$ and $V_i$.
$\sum_i g_i$ is the number of observations seen so far.
Answering Quantile Queries
Proposition 1: Given a quantile summary S in the above form, a Φ-quantile can always be identified to within an error of $\max_i (g_i + \Delta_i)/2$.
Proof. Let $r = \lceil \Phi n \rceil$ and let $e = \max_i (g_i + \Delta_i)/2$. Search for an index i such that $r - e \le R_{\min}(V_i)$ and $R_{\max}(V_i) \le r + e$.
[Figure: number line from $V_0$ to $V_{s-1}$ marking Φn, $V_i$, $R_{\min}(V_i)$, $R_{\max}(V_i)$ and the width $\max_i (g_i + \Delta_i)$]
Such a $V_i$ approximates the Φ-quantile within the claimed error bound.
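Answering Quantile Queries cont…
All that is left is to show that such an index i must always exist.
Case $r > n - e$: we have $R_{\min}(V_{s-1}) = R_{\max}(V_{s-1}) = n$, and therefore $i = s-1$ is valid.
[Figure: number line from $V_0$ to $V_{s-1}$ with r to the right of n − e]
Otherwise $r \le n - e$: choose the smallest j such that $R_{\max}(V_j) > r + e$. It follows that $R_{\min}(V_{j-1}) \ge r - e$, since if $R_{\min}(V_{j-1}) < r - e$ then $R_{\max}(V_j) > r + e > R_{\min}(V_{j-1}) + 2e$, that is $g_j + \Delta_j = R_{\max}(V_j) - R_{\min}(V_{j-1}) > 2e$, a contradiction to the assumption that $e = \max_i (g_i + \Delta_i)/2$.
[Figure: number line from $V_0$ to $V_{s-1}$ marking r − e, r, r + e, $R_{\min}(V_{j-1})$ and $R_{\max}(V_j)$]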
Answering Quantile Queries cont…
By the choice of j, $R_{\max}(V_{j-1}) \le r + e$; therefore j − 1 is an example of an index i with the desired property.
Corollary 1: If at any time n the summary S(n) satisfies the property that $\max_i (g_i + \Delta_i) \le 2\varepsilon n$, then we can answer any Φ-quantile query to within an εn precision.
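The search from Proposition 1 can be sketched as a single left-to-right scan. The function names and the sample tuples are assumptions made for illustration; the scan relies on the first and last tuples being the exact minimum and maximum, as the slides require.

```python
import math

def query_rank(summary, r):
    """Return a stored value whose rank bounds lie within e of rank r (Proposition 1)."""
    e = max(g + delta for _, g, delta in summary) / 2
    r_min = 0
    for v, g, delta in summary:
        r_min += g                      # running sum of g gives r_min(v_i)
        r_max = r_min + delta
        if r - e <= r_min and r_max <= r + e:
            return v
    raise LookupError("no tuple found; summary violates the assumed form")

def query(summary, phi):
    """Phi-quantile query, with n recovered as the sum of all g values."""
    n = sum(g for _, g, _ in summary)
    return query_rank(summary, max(1, math.ceil(phi * n)))

# Made-up summary of n = 7 observations; here e = (3 + 2) / 2 = 2.5.
summary = [(1, 1, 0), (4, 2, 1), (9, 3, 2), (19, 1, 0)]
print([query_rank(summary, r) for r in range(1, 8)])
print(query(summary, 0.5))
```

Every rank from 1 to n gets an answer here, which is what the existence argument in the proof guarantees.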
Data structure cont…
At a high level:
On a new observation, insert into the summary a tuple corresponding to this observation.
Periodically, perform a sweep over the summary to “merge” some of the tuples into their neighbors so as to free space.
Maintain several conditions in order to bound the space used by S at any time.
By Corollary 1 it suffices to ensure that at all times $\max_i (g_i + \Delta_i) \le 2\varepsilon n$.
Def: An individual tuple is full if $g_i + \Delta_i = \lfloor 2\varepsilon n \rfloor$.
Def: The capacity of an individual tuple is the maximum number of observations that can be counted by $g_i$ before the tuple becomes full.
BANDS
General strategy: delete tuples with small capacities and preserve tuples with large capacities.
In the merge phase, free up space by merging tuples with small capacities into tuples with “similar” or larger capacities.
We say that two tuples $t_i$ and $t_j$ have similar capacities if $\log \mathrm{capacity}(t_i) \approx \log \mathrm{capacity}(t_j)$.
This notion of similarity partitions the possible values of ∆ into bands.
We try to divide the ∆s into bands that lie between the boundaries $0, \frac{1}{2}(2\varepsilon n), \frac{3}{4}(2\varepsilon n), \dots, \frac{2^i - 1}{2^i}(2\varepsilon n), \dots, 2\varepsilon n - 1, 2\varepsilon n$.
These boundaries correspond to capacities of $2\varepsilon n, \varepsilon n, \frac{1}{2}\varepsilon n, \dots, \frac{1}{2^i}\varepsilon n, \dots, 8, 4, 2, 1$.
BANDS cont…
Define band α to be the set of all ∆ such that
$p - 2^{\alpha} - (p \bmod 2^{\alpha}) < \Delta \le p - 2^{\alpha-1} - (p \bmod 2^{\alpha-1})$
where $p = \lfloor 2\varepsilon n \rfloor$ and $\alpha = 1, \dots, \lceil \log 2\varepsilon n \rceil$.
This definition ensures that if two ∆s are ever in the same band, they never appear in different bands as n increases.
Define band 0 simply to be p.
Consider the first 1/(2ε) observations, with ∆ = 0, to be in a band of their own.
BANDS cont…
Example: consider ε = 1/8.
∆ = 0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4, 5,5,5,5, 6,6,6,6
N = 1, 2, 3, …, 27, …
Label the ∆ values a = 0, b = 1, c = 2, d = 3, e = 4, f = 5, g = 6. For decreasing p = ⌊2εn⌋ the bands contain:
Band 0 (p = 6, 5, 4, 3, 2, 1): {g}, {f}, {e}, {d}, {c}, {b}
Band 1 (p = 6, 5, 4, 3, 2): {f}, {d,e}, {d}, {b,c}, {b}
Band 2 (p = 6, 5): {b,c,d,e}, {b,c}
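The band definition can be evaluated directly from its inequality. The sketch below is one possible reading: the function name is made up, p stands for ⌊2εn⌋, and giving ∆ = 0 the number one above the highest regular band is an assumption (the slides only say it is in a band of its own).

```python
def band(delta, p):
    """Band of delta when floor(2*eps*n) == p, following the defining inequality."""
    if delta == p:
        return 0                              # band 0 is simply p
    top = (p - 1).bit_length()                # ceil(log2(p)) for p >= 1
    if delta == 0:
        return top + 1                        # assumed numbering for the band of its own
    for alpha in range(1, top + 1):
        lo = p - 2**alpha - p % 2**alpha
        hi = p - 2**(alpha - 1) - p % 2**(alpha - 1)
        if lo < delta <= hi:
            return alpha
    raise ValueError("delta must lie in [0, p]")

# Bands of delta = 1..p for a few values of p.
for p in (6, 5, 4, 3):
    print(p, [band(d, p) for d in range(1, p + 1)])
```

For p = 6 this places ∆ = 5 in band 1 and ∆ = 1..4 in band 2; for p = 5 it places ∆ = 3, 4 in band 1 and ∆ = 1, 2 in band 2.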
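BANDS cont…
Proposition 2: At any point in time n and for any α ≥ 1, band α (n) contains either $2^{\alpha}$ or $2^{\alpha-1}$ distinct values of ∆.
PROOF: According to the lower and upper bounds of band α,
$2\varepsilon n - 2^{\alpha} - (2\varepsilon n \bmod 2^{\alpha}) < \Delta \le 2\varepsilon n - 2^{\alpha-1} - (2\varepsilon n \bmod 2^{\alpha-1})$
If $(2\varepsilon n \bmod 2^{\alpha}) < 2^{\alpha-1}$ then $(2\varepsilon n \bmod 2^{\alpha}) = (2\varepsilon n \bmod 2^{\alpha-1})$ and $|\mathrm{band}_\alpha| = 2^{\alpha} - 2^{\alpha-1} = 2^{\alpha-1}$ distinct values of ∆.
If $(2\varepsilon n \bmod 2^{\alpha}) \ge 2^{\alpha-1}$ then $(2\varepsilon n \bmod 2^{\alpha}) = 2^{\alpha-1} + (2\varepsilon n \bmod 2^{\alpha-1})$ and $|\mathrm{band}_\alpha| = 2^{\alpha} - 2^{\alpha-1} + 2^{\alpha-1} = 2^{\alpha}$ distinct values of ∆.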
A tree representation
For $S = t_0, t_1, \dots, t_{s-1}$, impose a tree structure T over the tuples of S.
Assign a special root node R, and for every tuple $t_i$ assign a node $V_i$.
The parent of every node $V_i$ is the node $V_j$ such that j is the least index greater than i with band($t_j$) > band($t_i$). If no such j exists, then set R to be the parent.
All children (and all descendants) of a given node $V_i$ have ∆ values larger than $\Delta_i$.
A tree representation
Proposition 3: The children of any node in T are always arranged in non-increasing order of band in S.
Proposition 4: For any node V, the set of all its descendants in T forms a contiguous segment in S.
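The parent rule (least index to the right with a strictly larger band) is the classic "next greater element" scan. A minimal sketch, assuming the bands of the tuples are already known; the band values below are invented and the function name is hypothetical.

```python
def parents(bands):
    """parents[i] = least j > i with bands[j] > bands[i], or None when the root R is the parent."""
    result = [None] * len(bands)
    stack = []                                   # candidate parents, nearest on top
    for i in range(len(bands) - 1, -1, -1):
        while stack and bands[stack[-1]] <= bands[i]:
            stack.pop()                          # not strictly larger: cannot be the parent
        result[i] = stack[-1] if stack else None
        stack.append(i)
    return result

bands = [3, 1, 1, 2, 0, 3]                       # made-up band values for t_0 .. t_5
tree = parents(bands)
print(tree)

# Children of each node, listed in their order in S.
children = {}
for i, parent in enumerate(tree):
    children.setdefault(parent, []).append(i)
print(children)
```

In this example the children of node 5 are nodes 3 and 4 with bands 2 and 0, in non-increasing order, and the descendants of node 5 are the contiguous indices 1 to 4.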
Operations
The following operations compute an ε-approximate Φ-quantile from S(n) after n observations. During the operations we wish to maintain the correct relationship between $g_i$, $\Delta_i$, $R_{\min}$ and $R_{\max}$.
QUANTILE(Φ): compute the rank $r = \lceil \Phi n \rceil$; find i such that $r - R_{\min}(V_i) \le \varepsilon n$ and $R_{\max}(V_i) - r \le \varepsilon n$; return $V_i$.
INSERT(V): find the smallest i such that $V_{i-1} \le V < V_i$ and insert the tuple $(V, 1, \lfloor 2\varepsilon n \rfloor)$ between $t_{i-1}$ and $t_i$. If V is the new minimum or maximum seen, then insert (V, 1, 0).
Operations Cont…
INSERT(V) maintains the correct relationship between $g_i$, $\Delta_i$, $R_{\min}$ and $R_{\max}$:
If V is inserted before $V_i$, the value of $R_{\min}(V)$ may be as small as $R_{\min}(V_{i-1}) + 1$.
Similarly, $R_{\max}(V)$ may be as large as the current $R_{\max}(V_i)$, which in turn is within $\lfloor 2\varepsilon n \rfloor$ of $R_{\min}(V_{i-1})$.
Note that $R_{\min}(V_i)$ and $R_{\max}(V_i)$ are increased by 1 after the insertion.
Operations Cont…
DELETE($V_i$): replace the tuples $(V_i, g_i, \Delta_i)$ and $(V_{i+1}, g_{i+1}, \Delta_{i+1})$ with the new tuple $(V_{i+1}, g_i + g_{i+1}, \Delta_{i+1})$.
Deleting $V_i$ has no effect on $R_{\min}(V_{i+1})$ and $R_{\max}(V_{i+1})$, so the operation should simply preserve them.
The relationship between $R_{\min}(V_{i+1})$ and $R_{\max}(V_{i+1})$ is preserved as long as $\Delta_{i+1}$ is unchanged.
Since $R_{\min}(V_{i+1}) = \sum_{j \le i+1} g_j$ and we removed $g_i$, we must increase $g_{i+1}$ by $g_i$ to keep $R_{\min}(V_{i+1})$.
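INSERT and DELETE can be sketched on a plain Python list of `[v, g, delta]` entries kept sorted by value. The function names, the list representation and the sample stream are assumptions for illustration; n is the number of observations seen before the insertion.

```python
import bisect
import math

def insert(summary, v, n, eps):
    """INSERT(v): place (v, 1, floor(2*eps*n)), or (v, 1, 0) for a new minimum or maximum."""
    keys = [t[0] for t in summary]
    i = bisect.bisect_right(keys, v)             # smallest i with v < v_i
    is_extreme = i == 0 or i == len(summary)
    delta = 0 if is_extreme else math.floor(2 * eps * n)
    summary.insert(i, [v, 1, delta])

def delete(summary, i):
    """DELETE(v_i): fold g_i into the successor so that its r_min and r_max are unchanged."""
    _, g, _ = summary.pop(i)
    summary[i][1] += g                           # the old t_{i+1} now sits at index i

summary = []
for n, v in enumerate([5, 9, 7, 6]):             # made-up stream, eps = 1/4
    insert(summary, v, n, eps=0.25)
print(summary)

delete(summary, 1)                               # remove the tuple for value 6
print(summary)
```

After the deletion the tuple for 7 carries g = 2, so the sum of g values up to it is still 3, the true rank of 7 among 5, 6, 7, 9.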
COMPRESS
The operation COMPRESS tries to merge a node together with all its descendants into either its parent node or its right sibling (by deleting them).
During COMPRESS we must ensure that the tuple that results from the merge is not full.
Two adjacent tuples $t_i, t_{i+1}$ are mergeable if the resulting tuple is not full and band($t_i$, n) ≤ band($t_{i+1}$, n).
Note that a pair of tuples that is not mergeable at some point in time may become so at a later point, as the term 2εn increases over time.
Let $g_i^*$ denote the sum of the g-values of tuple $t_i$ and all its descendants in T.
Operations Cont…
COMPRESS()
  for i from s-2 to 0 do
    if (BAND(Δ_i, 2εn) ≤ BAND(Δ_{i+1}, 2εn)) && (g_i* + g_{i+1} + Δ_{i+1} < 2εn) then
      delete all descendants of t_i and the tuple t_i itself
    end if
  end for
COMPRESS inspects tuples from right (highest index) to left. It therefore first tries to combine children (and their entire subtrees of descendants) into their parents, and combines siblings only when no more children can be merged into the parent.
Operations Cont…
Initial State
  S ← ∅; s = 0; n = 0.
Algorithm
To add the (n+1)st observation, v, to summary S(n):
  if (n ≡ 0 mod 1/(2ε)) then
    COMPRESS();
  end if
  INSERT(v);
  n = n + 1;
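Putting the operations together, the whole procedure can be sketched in under a hundred lines. This is a hedged reading of the slides, not the authors' implementation: the class and method names are invented, ∆ = 0 is assumed to sit in the highest band, and the loop stops at index 1 (rather than 0) so that the minimum is never merged away.

```python
import bisect
import math
import random

def band(delta, p):
    """Band of delta when floor(2*eps*n) == p (delta = 0 assumed to be the top band)."""
    if delta == p:
        return 0
    top = (p - 1).bit_length()                   # ceil(log2(p)) for p >= 1
    if delta == 0:
        return top + 1
    for alpha in range(1, top + 1):
        lo = p - 2**alpha - p % 2**alpha
        hi = p - 2**(alpha - 1) - p % 2**(alpha - 1)
        if lo < delta <= hi:
            return alpha
    raise ValueError("delta must lie in [0, p]")

class GKSummary:
    def __init__(self, eps):
        self.eps, self.n, self.s = eps, 0, []    # s holds [v, g, delta], sorted by v

    def add(self, v):
        period = max(1, math.floor(1 / (2 * self.eps)))
        if self.n > 0 and self.n % period == 0:
            self._compress()
        i = bisect.bisect_right([t[0] for t in self.s], v)
        extreme = i == 0 or i == len(self.s)
        delta = 0 if extreme else math.floor(2 * self.eps * self.n)
        self.s.insert(i, [v, 1, delta])
        self.n += 1

    def _compress(self):
        limit = 2 * self.eps * self.n
        p = math.floor(limit)
        i = len(self.s) - 2
        while i >= 1:                            # index 0 (the minimum) is always kept
            b_i, nxt = band(self.s[i][2], p), self.s[i + 1]
            if b_i <= band(nxt[2], p):
                # Descendants of t_i: the run to its left with strictly smaller band.
                k, g_star = i, self.s[i][1]
                while k > 1 and band(self.s[k - 1][2], p) < b_i:
                    k -= 1
                    g_star += self.s[k][1]
                if g_star + nxt[1] + nxt[2] < limit:
                    nxt[1] += g_star             # merge t_k..t_i into t_{i+1}
                    del self.s[k:i + 1]
                    i = k - 1
                    continue
            i -= 1

    def query(self, phi):
        r = max(1, math.ceil(phi * self.n))
        e = max(g + d for _, g, d in self.s) / 2
        r_min = 0
        for v, g, d in self.s:
            r_min += g
            if r - e <= r_min and r_min + d <= r + e:
                return v
        raise LookupError("no tuple within the error bound")

random.seed(0)
n, eps = 2000, 0.05
data = random.sample(range(10**6), n)            # distinct values, random order
gk = GKSummary(eps)
for x in data:
    gk.add(x)
print(len(gk.s), "tuples kept for", n, "observations")
```

The query uses the error term from Proposition 1, so its answers stay within roughly εn of the requested rank as long as the merge test keeps every $g_i + \Delta_i$ near 2εn.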
Analysis
The insert and compress operations always ensure that $g_i + \Delta_i \le 2\varepsilon n$.
We will now see that the total number of tuples in the summary S(n) is bounded by $\frac{11}{2\varepsilon} \log(2\varepsilon n)$.
Def: coverage. We say that a tuple $t_i$ in S(n) covers an observation v at any time n if either the tuple for v has been directly merged into $t_i$, or a tuple t that covered v has been merged into $t_i$. A tuple always covers itself.
It is easy to see that the number of observations covered by $t_i$ is exactly $g_i = g_i(n)$.
Analysis Cont…
Lemma 1: At no point in time does a tuple from band α cover an observation from a band > α.
Lemma 2: At any point in time n, and for any integer α, the total number of observations covered cumulatively by all tuples with band value in [0..α] is bounded by $2^{\alpha}/\varepsilon$.
Lemma 3: At any time n and for any given α, there are at most $3/(2\varepsilon)$ nodes in T(n) that have a child with band value α. That is, there are at most $3/(2\varepsilon)$ parents of nodes from band α (n).
Analysis Cont…
PROOF of Lemma 3
Let $m_{\min}$, $m_{\max}$ denote the earliest and the latest time at which a node from band α could be seen:
$m_{\min} = (2\varepsilon n - 2^{\alpha} - (2\varepsilon n \bmod 2^{\alpha}))/(2\varepsilon)$
$m_{\max} = (2\varepsilon n - 2^{\alpha-1} - (2\varepsilon n \bmod 2^{\alpha-1}))/(2\varepsilon)$
Choose a parent-child pair $(V_i, V_j)$ in which the child $V_j$ is in band α.
Since $V_j$ exists we can show that:
Since at time $m_j$ (when $V_j$ showed up) we had: $g_i(m_j) + \Delta_i < 2\varepsilon m_j$
Analysis Cont…
For all pairs $(V'_i, V'_j)$ the observations counted are distinct.
The number of observations that came after $m_{\min}$ is $n - m_{\min}$.
We get at most $(n - m_{\min}) / (2\varepsilon (n - m_{\max})) \le 3/(2\varepsilon)$ such pairs, since $m_j$ is at most $m_{\max}$.
Analysis Cont…
Def: Given a full pair of tuples $(t_{i-1}, t_i)$, we say that $t_{i-1}$ is the left partner and $t_i$ is the right partner in this full pair.
Lemma 4: At any time n and for any given α, there are at most $4/\varepsilon$ tuples from band α (n) that are right partners in a full tuple pair.
PROOF: Let $t_i, t_{i+1}, \dots, t_{i+p-1}$ be the longest contiguous segment of tuples from band α (n) in S(n).
Since they survived the compress operation, it must be the case that $g^*_{j-1} + g_j + \Delta_j \ge 2\varepsilon n$ for all $i \le j < i+p$.
Analysis Cont…
Summing over all j: $\sum_j (g^*_{j-1} + g_j) + \sum_j \Delta_j \ge 2p\varepsilon n$.
According to Lemma 2 the first term is bounded by $2^{\alpha+1}/\varepsilon$.
The second term is bounded by $p(2\varepsilon n - 2^{\alpha-1})$.
Combining the two bounds gives $p \cdot 2^{\alpha-1} \le 2^{\alpha+1}/\varepsilon$, that is, $p \le 4/\varepsilon$.
Analysis Cont…
For non-contiguous segments, just consider the above summations over all such segments.
Lemma 5: At any time n and for any given α, the maximum number of tuples possible from each band α (n) is $11/(2\varepsilon)$.
Proof: Each node of band α (n) is either:
1. a right partner in a full pair
2. a left partner in a full pair
3. not a participant in any full pair
The first case is bounded by $4/\varepsilon$ (Lemma 4).
The last two together are bounded by $3/(2\varepsilon)$ (Lemma 3).
The claim follows, since $4/\varepsilon + 3/(2\varepsilon) = 11/(2\varepsilon)$.
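Analysis Cont…
Theorem 1: At any time n, the number of tuples stored in S(n) is at most $\frac{11}{2\varepsilon} \log(2\varepsilon n)$.
PROOF: There are at most $1 + \lceil \log(2\varepsilon n) \rceil$ bands at time n, and by Lemma 5 each holds at most $11/(2\varepsilon)$ tuples. Summing over their sizes we get $\frac{11}{2\varepsilon} \log(2\varepsilon n)$.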
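To get a feel for the size of this bound, it can be evaluated numerically. The sketch below assumes the logarithm is base 2 (the bands are defined through powers of two) and uses a made-up function name; ε = 0.001 and the sequence lengths are the ones used later in the experiments.

```python
import math

def gk_tuple_bound(eps, n):
    """Worst-case number of tuples from Theorem 1, assuming a base-2 logarithm."""
    return 11 / (2 * eps) * math.log2(2 * eps * n)

for n in (10**5, 10**6, 10**7):
    print(n, round(gk_tuple_bound(0.001, n)))
```

The bound grows only logarithmically in n: each tenfold increase in the sequence length adds about 5500 × log2(10) ≈ 18,300 tuples to the worst case for ε = 0.001.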
Experiments results
The experiments were done on 3 different classes of input data:
1. Hard case: a data sequence generated in an adversarial manner, that is, each new observation is placed in the largest current “gap” of the quantile summary.
2. Sorted input data: the data arrives in sorted order.
3. Random input data: each datum is selected (without replacement) from a uniform distribution over all remaining elements in the data set.
Experiments results cont…
Sorted and random input data are used following the MRL experimental results.
Random input data can give insight into the behavior of the algorithm on “average” inputs.
In general, the algorithm used less space than indicated by the analysis, and its space requirement turned out to be lower than MRL's.
Experiments results cont…
For each case we have 2 different kinds of experiment:
1. Adaptive: the regular algorithm (with a slight variation).
2. Pre-allocated: uses the same space as MRL.
We will see that in the latter case the observed error is significantly better than that of MRL.
Differences in the algorithm used for the experiments:
1. An observation is inserted as a tuple $(v, 1, g_i + \Delta_i - 1)$ and not $(v, 1, \lfloor 2\varepsilon n \rfloor)$. The latter is used strictly to simplify the theoretical analysis.
2. Rather than running COMPRESS after every 1/(2ε) observations, one tuple was deleted, when possible, for each observation inserted. If no tuple could be deleted without making its successor full, the size of S grew by 1.
Experiments results cont…
We apply the following measurements:
1. The maximum space used to produce the summary, measured by counting the number of stored tuples (multiplied by 3 for comparison with MRL, to account for the $R_{\min}$ and $R_{\max}$ values stored in each tuple).
2. The observed precision of the results.
Experiments results cont…
HARD INPUT
The required number of stored quantiles is approximately a factor of 11 less than the worst-case bound of the analysis.
We almost always require less space than MRL. The only exception is ε = .001 and N = 10^5, where MRL requires less space.
Experiments results cont…
SORTED INPUT
Fix ε = .001 and construct summaries of sorted sequences of size 10^5, 10^6 and 10^7.
Sample 15 quantiles at (q_i/16)·N for q_i = 1..15 and compute the maximum error over all possible quantile queries.
Compare 3 algorithms:
1. MRL: pre-allocates the storage required by MRL as a function of N and ε.
2. Pre-allocated: uses 1/3 as many stored quantiles as MRL.
3. Adaptive: storage is allocated for a new quantile only if no quantile could be deleted without exceeding a precision of .001n.
Experiments results cont…
|S|: the number of stored quantiles needed to achieve the desired precision.
Max ε: the maximum error over all possible quantile queries on the summaries.
The remaining rows list the approximation error of the response to the query for the (q_i/16)th quantile.
Experiments results cont… RANDOM INPUT Same measurements as in the sorted input (ε and sequence length) Run each experiment 50 times and report the max, min, mean and std for every measurement.
Conclusions
The algorithm improves upon the earlier results in two significant ways:
1. It improves the space complexity by a factor of $\Omega(\log(\varepsilon N))$.
2. It doesn't require a priori knowledge of the parameter N; that is, it allocates more space dynamically as the data sequence grows in size.