How to Summarize the Universe: Dynamic Maintenance of Quantiles
Gilbert, Kotidis, Muthukrishnan, Strauss
Presented by Itay Malinger, December 2003
Problem Definition
► The universe: $U = \{0, \dots, |U|-1\}$
► Number of records in the data set: $\|A\| = N$
► The data set can be thought of as an array: $A[i]$ is the number of records with value $i$
► $\|A_S\| = \sum_{i \in S} A[i]$ is the number of records with values in $S$
► The $\Phi$-quantiles of an ordered sequence of $N$ data items are the values with rank $k\Phi N$, for $k = 1, \dots, 1/\Phi - 1$
► Our goal is computing $\varepsilon$-approximate $\Phi$-quantiles: for each $k$, find a $j_k$ whose rank is within $\varepsilon N$ of $k\Phi N$
Transactions
► Insert(i): $A[i] \leftarrow A[i] + 1$
► Delete(i): $A[i] \leftarrow A[i] - 1$
► Let …
► Assume: the universe size $|U|$ is known
The Main Algorithmic Result
► The RSS (Random Subset Sums) algorithm
► Space complexity
► Update: on every transaction, in O(space) time
► Estimation: on demand, in O(space) time
► One pass over the data
Dyadic Intervals
► $\log|U| + 1$ resolution levels $j$
► $2|U| - 1$ dyadic intervals
Example with $|U| = 8$:
Level 3: I(3,0) I(3,1) I(3,2) I(3,3) I(3,4) I(3,5) I(3,6) I(3,7)
Level 2: I(2,0) I(2,1) I(2,2) I(2,3)
Level 1: I(1,0) I(1,1)
Level 0: I(0,0)
Arbitrary intervals
► Any interval can be written as a disjoint union of at most $2\log|U|$ dyadic intervals
► For example, A[0,6] = I(1,0) + I(2,2) + I(3,6)
► Intervals starting at 0 never use the same resolution twice, so they need at most $\log|U|$ dyadic intervals
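The decomposition above can be sketched in a few lines of Python. This is an illustrative sketch, not the presenters' code: the function name `dyadic_cover` and the half-open convention `[lo, hi)` are assumptions, and `I(j, k)` is taken to cover `[k * 2**(log_u - j), (k + 1) * 2**(log_u - j))`, matching the diagram where level 0 is the whole universe.

```python
def dyadic_cover(lo, hi, log_u):
    """Return the dyadic intervals (j, k) whose disjoint union is [lo, hi).

    Assumes 0 <= lo <= hi <= 2**log_u. Level 0 is the whole universe and
    level log_u holds single points.
    """
    cover = []
    while lo < hi:
        # Largest power-of-two block that is aligned at lo ...
        size = lo & -lo if lo else 1 << log_u
        # ... shrunk until it fits inside the remaining range.
        while size > hi - lo:
            size >>= 1
        j = log_u - size.bit_length() + 1   # resolution level of a block of this size
        cover.append((j, lo // size))
        lo += size
    return cover


# The slide's example: A[0,6] (i.e. [0, 7)) in a universe of size 8.
print(dyadic_cover(0, 7, 3))
```

For the slide's example this yields I(1,0), I(2,2), I(3,6). An interval that does not start at 0, such as `[1, 7)`, can need more pieces than a prefix interval.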
Computing quantiles
► Assuming we have the number of records in each dyadic interval, we can efficiently compute the count of any arbitrary interval in $A$
► To compute the $\Phi$-quantile for any $k$, we need a $j_k$ such that $A[0, j_k) < k\Phi N \le A[0, j_k + 1)$
► Use binary search to find it
► Keeping exact counts for all dyadic intervals is costly: $O(|U|)$ space
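A minimal sketch of the binary search, assuming a callable `prefix_count(j)` that returns $A[0, j)$. The names `find_quantile` and `prefix_count` and the sample array are hypothetical. Here the prefix counts are exact; with estimated counts the same search applies, but monotonicity holds only approximately.

```python
def find_quantile(prefix_count, universe_size, target_rank):
    """Return the smallest j with prefix_count(j) < target_rank <= prefix_count(j + 1).

    Assumes 0 < target_rank <= N and a non-decreasing prefix_count.
    """
    lo, hi = 0, universe_size - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_count(mid + 1) >= target_rank:
            hi = mid          # the answer is mid or to its left
        else:
            lo = mid + 1      # too few records up to mid; go right
    return lo


A = [2, 0, 1, 3, 0, 1, 0, 1]          # hypothetical data, |U| = 8
N = sum(A)

def exact_prefix(j):
    return sum(A[:j])                 # A[0, j), computed exactly

median = find_quantile(exact_prefix, len(A), 0.5 * N)
```

The search makes about $\log|U|$ calls to `prefix_count`, which is the number of checks counted in the analysis later in the deck.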
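Random Subset Sums
► Consider the finest resolution level, $j = \log|U|$ (single points)
► Let $S$ be a random subset of $U$
► Each $u \in U$ is in $S$ with probability $p = \tfrac12$
► $E(|S|) = \tfrac12 |U|$
► Define: $\|A_S\| = \sum_{i \in S} A[i]$
► $E(\|A_S\|) = \tfrac12 \|A\| = \tfrac12 N$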
Estimating A[i]
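One standard way to estimate $A[i]$ from a random subset sum, stated here as an assumption rather than as the slide's exact formula: for a random $S$ that contains $i$, the expected value of $\|A_S\|$ is $A[i] + \tfrac12 (N - A[i])$, so $2\|A_S\| - N$ is an unbiased estimate of $A[i]$. The sketch below checks this exactly on a tiny universe by enumerating every subset that contains $i$. The function name and the data are hypothetical.

```python
def exact_expected_estimate(A, i):
    """Average of 2*||A_S|| - N over every subset S of the universe that contains i."""
    universe_size = len(A)
    N = sum(A)
    estimates = []
    for mask in range(1 << universe_size):      # each bitmask encodes one subset S
        if not (mask >> i) & 1:                 # keep only subsets that contain i
            continue
        a_s = sum(A[u] for u in range(universe_size) if (mask >> u) & 1)
        estimates.append(2 * a_s - N)
    return sum(estimates) / len(estimates)


A = [2, 0, 1, 3, 0, 1, 0, 1]                    # hypothetical data, |U| = 8
averages = [exact_expected_estimate(A, i) for i in range(len(A))]
```

A single estimate can be far from $A[i]$; only the average over many subsets is accurate, which is why many copies are kept and combined.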
Improvement
► Instead of keeping random sets of point intervals only, keep random sets of dyadic intervals at every resolution
► We need a way to keep a random set of $j$-resolution dyadic intervals (keeping it explicitly takes $O(|U|)$ space)
► Instead of keeping the sets themselves, keep a small representation of them
Pseudorandom set generator
► We need to keep a small representation of a random set $S$ (each $u_i \in S$ with $p = \tfrac12$)
► Given a seed of $\log|U| + 1$ bits
► Represent a set $S$ of size $O(|U|)$
► Quickly test whether $i \in S$
► Use the extended Hamming code
Extended Hamming Code
► Given a seed, tells whether $i \in S$
► For example: $|U| = 8$, seed size $\log|U| + 1 = 4$ bits
► $G(\text{seed}, i) = \text{seed} \cdot (i\text{-th column of the generator matrix}) \bmod 2$
► Efficient to compute
► 3-wise independent
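A small sketch of the membership test, assuming the usual generator matrix of the extended Hamming code: column $i$ is a leading 1 followed by the binary digits of $i$. The function names and the bit layout of the seed are assumptions. The second function checks 3-wise independence by brute force: over all seeds, every triple of distinct elements shows each of the 8 in/out patterns equally often.

```python
from collections import Counter
from itertools import combinations


def in_subset(seed, i, log_u):
    """G(seed, i): 1 if i belongs to the set generated by `seed`, else 0.

    `seed` has log_u + 1 bits; the result is the dot product mod 2 of the
    seed with column i = (1, binary digits of i).
    """
    column = (1 << log_u) | i
    return bin(seed & column).count("1") % 2


def is_three_wise_independent(log_u):
    """Check that any 3 distinct elements are jointly uniform over all seeds."""
    seeds = range(1 << (log_u + 1))
    for triple in combinations(range(1 << log_u), 3):
        patterns = Counter(
            tuple(in_subset(s, i, log_u) for i in triple) for s in seeds
        )
        # All 8 patterns must occur, each the same number of times.
        if len(patterns) != 8 or len(set(patterns.values())) != 1:
            return False
    return True


print(is_three_wise_independent(3))   # the slide's example size, |U| = 8
```

Only the seed is stored, so a set over the whole universe costs $\log|U| + 1$ bits, and each membership test is one AND plus a parity.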
The Data Structure
► For each resolution level $j$, keep num_copies random subsets $S$ of the dyadic intervals at that level (we keep only the seed that represents each subset)
► Keep $\|A_S\|$ for each subset $S$
► Maintain $N = \|A\|$
► This gives $S_1, \dots, S_{\text{num\_copies}}$ per level
Upon Transactions
► Insert(i) / Delete(i): for each resolution level $j$
  ► Locate the single $I_{j,k}$ into which $i$ falls (the high-order bits of $i$)
  ► Determine all $S_\ell$ containing $I_{j,k}$
  ► For each such $S_\ell$, increase/decrease $\|A_{S_\ell}\|$ by 1
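The update steps above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the class name `RSSSketch`, its fields, the use of Python's `random` module for seeds, and the extended-Hamming-style `in_subset` test are all hypothetical choices.

```python
import random


def in_subset(seed, k, bits):
    """Membership test: dot product mod 2 of seed with (1, binary digits of k)."""
    return bin(seed & ((1 << bits) | k)).count("1") % 2


class RSSSketch:
    def __init__(self, log_u, num_copies, rng=None):
        rng = rng or random.Random(0)
        self.log_u = log_u
        self.n = 0                                   # N = ||A||
        # seeds[j][l]: (j + 1)-bit seed of copy l at resolution level j
        self.seeds = [[rng.getrandbits(j + 1) for _ in range(num_copies)]
                      for j in range(log_u + 1)]
        # counts[j][l]: ||A_S|| for the set generated by seeds[j][l]
        self.counts = [[0] * num_copies for _ in range(log_u + 1)]

    def update(self, i, delta):
        self.n += delta
        for j in range(self.log_u + 1):
            k = i >> (self.log_u - j)                # j high-order bits of i give I(j, k)
            for l, seed in enumerate(self.seeds[j]):
                if in_subset(seed, k, j):            # does S_l contain I(j, k)?
                    self.counts[j][l] += delta

    def insert(self, i):
        self.update(i, +1)

    def delete(self, i):
        self.update(i, -1)


sketch = RSSSketch(log_u=3, num_copies=4)
for value in (5, 2, 2):
    sketch.insert(value)
```

Each transaction touches every counter at most once, so the update time is proportional to the space used, as stated in the main result.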
Estimating Quantiles: Dyadic Intervals
► Given a dyadic interval $I = I_{j,k}$
► There are num_copies sets of resolution $j$
► Quickly test each $S_\ell$: if $I \in S_\ell$, compute an estimate of $\|A_I\|$ from $\|A_{S_\ell}\|$
► Group all estimates into $G$ groups of $E$ elements
► For each group $g$, calculate the average of its estimates, $A_{g,j,k}$
Estimating Quantiles: Arbitrary intervals
► Given an interval $I$, write it as a disjoint union of dyadic intervals $I_{j,k}$ (at most $\log|U|$ of them when $I$ starts at 0, as in the quantile search)
► Form $G$ groups: for each group $g$, sum the averages $A_{g,j,k}$ over all $I_{j,k}$ comprising $I$
► Take the median of the $G$ group sums as the final estimate of $\|A_I\|$
► It is more convenient to treat the result as an overestimate: $\|A_I\| \le \widetilde{\|A_I\|} \le \|A_I\| + \varepsilon N$
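A sketch of the combining step only, assuming the raw per-set estimates have already been computed and arranged into groups. The function name `combine_estimates`, the nested-list layout, and the numbers are hypothetical; the sizes (3 dyadic intervals, $G = 3$ groups, $E = 4$ elements per group) are chosen to be small.

```python
from statistics import mean, median


def combine_estimates(grouped):
    """Median over groups of (sum over dyadic intervals of the group average).

    grouped[d][g] is the list of E raw estimates of dyadic interval d
    that fall in group g.
    """
    num_groups = len(grouped[0])
    group_sums = []
    for g in range(num_groups):
        # A_{g,j,k}: average within the group, then sum across the dyadic intervals.
        group_sums.append(sum(mean(interval[g]) for interval in grouped))
    return median(group_sums)


# Hypothetical raw estimates: 3 dyadic intervals x 3 groups x 4 elements.
example = [
    [[4, 4, 4, 4], [5, 5, 5, 5], [3, 3, 3, 3]],
    [[2, 2, 2, 2], [2, 2, 2, 2], [10, 2, 2, 2]],
    [[1, 1, 1, 1], [0, 2, 0, 2], [1, 1, 1, 1]],
]
estimate = combine_estimates(example)
```

Averaging inside a group reduces variance, and the median across groups limits the influence of a group that happens to be far off.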
[Figure: example with 3 dyadic intervals, E = 4 elements per group, and G = 3 groups. Labels: SUM, AVERAGE, MEDIAN, The Interval's Estimate.]
Analysis
► Lemma: The algorithm estimates each quantile to within $\varepsilon N$ with probability $p > 1 - \delta$
► Proof: For a fixed resolution level $j$, let … Then: …
Analysis (cont.)
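► We take $G$ copies of $Z$ and take the median
► By the Chernoff inequality, the median is a good estimate except with probability exponentially small in $G$
► The binary search looked for a $j_k$ such that $A[0, j_k) < k\Phi N \le A[0, j_k + 1)$
► We made $\log|U|$ checks in the binary search
► By the union bound, the probability that any of them failed is at most $\log|U|$ times the failure probability of a single check, i.e. $\delta$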
RSS Properties
► The algorithm may return a quantile value that was not seen in the input
► Changing the order of insertions and deletions does not affect the results
► The RSSs are composable: $U$ can be split into many disjoint ranges that are summarized separately, using pre-agreed common random subsets, and then combined
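The last two properties follow from the counters being plain sums. A minimal sketch at the point level only, with hypothetical seeds and data, and an extended-Hamming-style membership test assumed for the shared random subsets:

```python
def in_subset(seed, i, log_u):
    """Membership test: dot product mod 2 of seed with (1, binary digits of i)."""
    return bin(seed & ((1 << log_u) | i)).count("1") % 2


def point_sketch(values, seeds, log_u):
    """Counters ||A_S||, one per seed, after inserting every value in `values`."""
    return [sum(in_subset(seed, v, log_u) for v in values) for seed in seeds]


seeds = [0b1011, 0b0110, 0b1101]      # pre-agreed common seeds (hypothetical)
low = [0, 1, 1, 3]                    # records falling in the range [0, 3]
high = [4, 6, 6, 7]                   # records falling in the range [4, 7]

# Summaries built separately, then added counter by counter.
merged = [a + b for a, b in zip(point_sketch(low, seeds, 3),
                                point_sketch(high, seeds, 3))]
# Summary built directly over all records, in two different orders.
direct = point_sketch(low + high, seeds, 3)
reordered = point_sketch(high + low, seeds, 3)
```

Here `merged`, `direct`, and `reordered` are the same list of counters, which is what makes it possible to join separately maintained summaries at query time.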
Extension: U is unknown
► Predict a range $[0, u-1]$ for $U$
► Upon insertion of $i > u - 1$, add another instance of RSS with range $[u, u^2 - 1]$, and so on
► Because RSS is composable, we only have to join the results upon query
► Increased cost factor: log 2 log(|U|)
Experiments
► What is the median length of all active AT&T calls?
► When a call starts: insert its start timestamp
► When a call ends: delete its start timestamp
► 4 KB used for RSS
► Compared: RSS, GK, GK2
Number of Active Phone Calls Over Time
Error in Computation of Median Over Time
Average Error for Last 50 Snapshots, For Deciles
The End