Published by Arabella Long. Modified over 9 years ago.
1
How to Summarize the Universe: Dynamic Maintenance of Quantiles
Gilbert, Kotidis, Muthukrishnan, Strauss
Presented by Itay Malinger, December 2003
2
Problem Definition
► The Universe: U = {0, …, |U|−1}
► Number of records in the data set: ||A|| = N
► The data set can be thought of as an array: A[i] is the number of records with value i
► A_S is the number of records with values in S
► The Φ-quantiles of an ordered sequence of N data items are the values with rank kΦN (k = 1, 2, …)
► Our goal is to compute ε-approximate Φ-quantiles: find a j_k such that (kΦ − ε)N ≤ A[0, j_k] ≤ (kΦ + ε)N
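As a baseline for these definitions, exact quantiles can be read directly off the count array A. The sketch below is illustrative only: the function name `exact_quantiles`, the parameter `m` (standing for 1/Φ, assumed to be an integer), and the choice to report ranks for k = 1, …, m are assumptions, not part of the slides.

```python
from itertools import accumulate

def exact_quantiles(A, m):
    """Return [j_1, ..., j_m], where j_k is the smallest value whose
    prefix count A[0..j_k] reaches rank ceil(k * N / m), i.e. Phi = 1/m."""
    N = sum(A)
    prefix = list(accumulate(A))        # prefix[j] = records with value <= j
    result = []
    for k in range(1, m + 1):
        rank = -(-k * N // m)           # integer ceiling of k * N / m
        j = next(v for v, c in enumerate(prefix) if c >= rank)
        result.append(j)
    return result

# Hypothetical data set over U = {0, ..., 7} with N = 8 records.
A = [2, 0, 3, 1, 0, 1, 0, 1]
quartiles = exact_quantiles(A, 4)
```

This takes O(|U|) space and time, which is the cost the rest of the talk tries to avoid.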
4
Transactions
► Insert(i): A[i] ← A[i] + 1
► Delete(i): A[i] ← A[i] − 1
► Let
► ASSUME: The Universe size |U| is known
5
The Main Algorithmic Result
► The RSS Algorithm
► Space Complexity
► Update: on every transaction, in O(space) time
► Estimation: on demand, in O(space) time
► A single pass over the data
6
Dyadic Intervals
► log(|U|)+1 resolution levels j
► 2|U|−1 dyadic intervals
Figure (|U| = 8, values 0 … 7), one row per resolution level:
I(3,0) I(3,1) I(3,2) I(3,3) I(3,4) I(3,5) I(3,6) I(3,7)
I(2,0) I(2,1) I(2,2) I(2,3)
I(1,0) I(1,1)
I(0,0)
7
Arbitrary intervals
► Any interval can be written as a disjoint union of at most 2·log(|U|) dyadic intervals
► For example, A[0,6] = I(1,0) + I(2,2) + I(3,6)
► Intervals starting at 0 never use the same resolution level twice, so they need at most log(|U|) dyadic intervals
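The decomposition can be sketched with a greedy scan: at each step, take the largest aligned power-of-two block that starts at the current position and still fits. The function name `dyadic_decompose` and the `(j, k)` tuple encoding of I(j,k) are assumptions made for this illustration.

```python
def dyadic_decompose(lo, hi, log_u):
    """Split the inclusive range [lo, hi] into disjoint dyadic intervals.
    Each result (j, k) stands for I(j,k), which covers the values
    k * 2^(log_u - j) ... (k + 1) * 2^(log_u - j) - 1."""
    pieces = []
    while lo <= hi:
        # Largest power-of-two block aligned at lo (the whole universe if lo == 0).
        size = 1 << log_u if lo == 0 else lo & -lo
        # Shrink the block until it fits inside what is left of [lo, hi].
        while size > hi - lo + 1:
            size >>= 1
        level = log_u - (size.bit_length() - 1)
        pieces.append((level, lo // size))
        lo += size
    return pieces

# The slide's example with |U| = 8 (log_u = 3): [0, 6] -> I(1,0), I(2,2), I(3,6)
example = dyadic_decompose(0, 6, 3)
```

A range that does not start at 0, such as [1, 6], needs four pieces here, which is why the general bound is larger than the bound for prefixes.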
8
Computing quantiles
► Assuming we have the number of records in each dyadic interval, we can efficiently compute the count for any arbitrary interval of A.
► To compute the k-th Φ-quantile, we need a j_k such that A[0, j_k) < kΦN ≤ A[0, j_k + 1)
► Use binary search to find it.
► Keeping a count for every interval is costly: O(|U|) space
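A minimal sketch of this exact (non-approximate) version follows, assuming the dyadic counts are stored in a dictionary keyed by `(j, k)`. All function names are hypothetical. The prefix count A[0, x) is assembled from the binary expansion of x, using at most one dyadic interval per level, and the binary search looks for the smallest j whose prefix count reaches the target rank.

```python
def dyadic_counts(A, log_u):
    """Exact count for every dyadic interval; key (j, k) means I(j,k)."""
    counts = {}
    for j in range(log_u + 1):
        width = 1 << (log_u - j)           # values per interval at level j
        for k in range(1 << j):
            counts[(j, k)] = sum(A[k * width:(k + 1) * width])
    return counts

def prefix_count(counts, x, log_u):
    """A[0, x): number of records with value < x."""
    if x == 1 << log_u:
        return counts[(0, 0)]
    total, start = 0, 0
    for j in range(1, log_u + 1):
        width = 1 << (log_u - j)
        if x & width:                       # this bit of x contributes one block
            total += counts[(j, start // width)]
            start += width
    return total

def find_quantile(counts, target_rank, log_u):
    """Smallest j with A[0, j) < target_rank <= A[0, j + 1).
    Assumes 1 <= target_rank <= N."""
    lo, hi = 0, (1 << log_u) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_count(counts, mid + 1, log_u) >= target_rank:
            hi = mid
        else:
            lo = mid + 1
    return lo

A = [2, 0, 3, 1, 0, 1, 0, 1]               # hypothetical data, |U| = 8, N = 8
counts = dyadic_counts(A, 3)
median_value = find_quantile(counts, 4, 3)  # rank k * Phi * N with k * Phi = 1/2
```

The dictionary holds 2|U|−1 counters, so this only illustrates the search; the RSS structure replaces the exact counts with estimates.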
9
Random Subset Sums
► Consider the case j = log(|U|) (point intervals)
► Let S be a random subset of U
► Each u ∈ U is in S with probability p = ½
► E(|S|) = ½|U|
► Define: ||A_S|| = Σ_{i∈S} A[i]
► E(||A_S||) = ½||A|| = ½N
10
Estimating A[i]
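The body of this slide did not survive extraction. One way to see how a random subset sum yields an estimate of A[i] is the following short derivation. It is a reconstruction under the assumption that, given whether i is in S, every other element is still in S with probability ½ (pairwise independence is enough); it is not necessarily the exact form shown on the slide.

```latex
\begin{align*}
E\big[\,\|A_S\| \;\big|\; i \in S\,\big]
  &= A[i] + \tfrac12\,\big(N - A[i]\big) = \tfrac12\,\big(N + A[i]\big) \\
\Longrightarrow\quad A[i] &= 2\,E\big[\,\|A_S\| \;\big|\; i \in S\,\big] - N \\[4pt]
E\big[\,\|A_S\| \;\big|\; i \notin S\,\big]
  &= \tfrac12\,\big(N - A[i]\big) \\
\Longrightarrow\quad A[i] &= N - 2\,E\big[\,\|A_S\| \;\big|\; i \notin S\,\big]
\end{align*}
```

So a single set S containing i suggests the estimate 2·||A_S|| − N, and averaging over many independent sets reduces its variance.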
11
Improvement
► Instead of keeping random sets of points only, keep random sets of dyadic intervals at all resolution levels
► We need a way to keep a random set of j-resolution dyadic intervals (keeping it explicitly takes O(|U|) space)
► Instead of keeping the sets themselves, keep a small representation of them
12
Pseudorandom set generator
► We need to keep a small representation of a random set S (each i ∈ U is in S with p = ½)
► Given a seed of size log(|U|)+1
► Represent a set S of size O(|U|)
► Quickly test whether i ∈ S
► Use the Extended Hamming Code
13
Extended Hamming Code
► Given a seed, tells whether i ∈ S
► For example: |U| = 8, seed size: log|U|+1 = 4
► G(seed, i) = seed · (i-th column) mod 2
► Efficient to compute
► 3-wise independent
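A small sketch of such a generator, assuming the i-th column of the generator matrix is a leading 1 followed by the binary digits of i (the usual extended Hamming construction; the slide's matrix is not shown in the transcript). The function name `in_set` and the integer encoding of the seed are assumptions. For |U| = 8, the check below enumerates all 16 seeds and counts the membership patterns of every triple of distinct elements.

```python
from collections import Counter
from itertools import combinations

def in_set(seed, i, log_u):
    """G(seed, i): inner product mod 2 of the seed bits with column i,
    where column i is a leading 1 followed by the log_u bits of i."""
    column = (1 << log_u) | i
    return bin(seed & column).count("1") & 1

LOG_U = 3                                   # |U| = 8, seeds have 4 bits

def triple_patterns(a, b, c):
    """How often each membership pattern of (a, b, c) occurs over all seeds."""
    return Counter(
        (in_set(s, a, LOG_U), in_set(s, b, LOG_U), in_set(s, c, LOG_U))
        for s in range(1 << (LOG_U + 1))
    )

# Every triple of distinct elements sees each of the 8 patterns equally often.
uniform = all(
    sorted(triple_patterns(a, b, c).values()) == [2] * 8
    for a, b, c in combinations(range(1 << LOG_U), 3)
)
```

Each membership test is one AND and one parity computation, so the set never has to be stored explicitly.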
14
The Data Structure
► For each resolution level j, keep num_copies random subsets S of the dyadic intervals in that level (we keep only the representing seed)
► Keep ||A_S|| for each such S
► Maintain N = ||A||
► This gives S_1, …, S_num_copies per level
15
Upon Transactions
► Insert(i) / Delete(i): for each resolution level j
  ► Locate the single I_{j,k} into which i falls (the high-order bits of i)
  ► Determine all S_ℓ containing I_{j,k}
  ► For each such S_ℓ, increase/decrease ||A_{S_ℓ}|| by 1
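The update path can be sketched as follows. The class name `RSS`, its fields, and the use of Python's `random` module for seeds are assumptions made for illustration; membership uses a parity test in the style of the extended Hamming code, with a (j+1)-bit seed per set at level j.

```python
import random

class RSS:
    """Sketch of the update path only: one counter ||A_S|| per random set."""

    def __init__(self, log_u, num_copies, rng=None):
        rng = random.Random(0) if rng is None else rng
        self.log_u = log_u
        self.n = 0                                   # N = ||A||
        # Level j has 2^j intervals, so a seed of j + 1 bits per set.
        self.seeds = [[rng.getrandbits(j + 1) for _ in range(num_copies)]
                      for j in range(log_u + 1)]
        self.sums = [[0] * num_copies for _ in range(log_u + 1)]

    @staticmethod
    def _member(seed, k, j):
        """1 if interval I(j,k) belongs to the set generated by seed, else 0."""
        return bin(seed & ((1 << j) | k)).count("1") & 1

    def _update(self, i, delta):
        self.n += delta
        for j in range(self.log_u + 1):
            k = i >> (self.log_u - j)                # high-order j bits of i
            for c, seed in enumerate(self.seeds[j]):
                if self._member(seed, k, j):
                    self.sums[j][c] += delta

    def insert(self, i):
        self._update(i, +1)

    def delete(self, i):
        self._update(i, -1)

# Inserting 5 and 2, then deleting 5, leaves the same state as inserting 2 alone.
a = RSS(3, 5)
a.insert(5); a.insert(2); a.delete(5)
b = RSS(3, 5)
b.insert(2)
```

Because every counter is a plain sum, the state depends only on the current contents of A, not on the order of the transactions.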
16
Estimating Quantiles: Dyadic Intervals
► Given a dyadic interval I = I_{j,k}
► There are num_copies sets of resolution j
► Quickly test each S_ℓ to check whether I ∈ S_ℓ, and if so, form an estimate
► Group all estimates into G groups of E elements
► For each group g, calculate the average of its estimates, A_{g,j,k}
17
Estimating Quantiles: Arbitrary intervals
► Given an interval I, write it as a disjoint union of at most 2·log(|U|) dyadic intervals I_{j,k}
► Form G groups; for each group g, sum the averages A_{g,j,k} over all the I_{j,k} that make up I
► Take the median of the G group sums as the final estimate of A_I
► It's more convenient to treat the result as an overestimate: |A_I| ≤ |Ã_I| ≤ |A_I| + εN
18
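Figure: an example with 3 dyadic intervals, E = 4 elements per group, and G = 3 groups. The estimates are averaged within each group, summed across the dyadic intervals, and the median of the group sums is the interval's estimate.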
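The combining step in this example can be sketched as below. The function name `combine` and the nested-list layout are assumptions, and the numbers are made up purely to show the arithmetic; they are not the values from the slide's figure.

```python
from statistics import mean, median

def combine(estimates):
    """estimates[d][g] is the list of E raw estimates for dyadic interval d
    in group g. Average within each group, sum across the dyadic intervals,
    then take the median over the groups."""
    num_groups = len(estimates[0])
    group_sums = [
        sum(mean(per_interval[g]) for per_interval in estimates)
        for g in range(num_groups)
    ]
    return median(group_sums)

# Hypothetical raw estimates: 3 dyadic intervals, G = 3 groups, E = 4 elements.
estimates = [
    [[4, 6, 5, 5], [8, 4, 6, 6], [3, 5, 4, 4]],      # group averages 5, 6, 4
    [[2, 2, 2, 2], [1, 3, 2, 2], [9, 11, 10, 10]],   # group averages 2, 2, 10
    [[1, 1, 1, 1], [0, 2, 1, 1], [1, 1, 1, 1]],      # group averages 1, 1, 1
]
interval_estimate = combine(estimates)               # group sums 8, 9, 15
```

Averaging reduces the variance inside each group, and the median across groups protects the final answer from a single group that is far off, such as the third group here.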
19
Analysis ► Lemma: The algorithm estimates each quantile to within εN with probability greater than 1−δ ► Proof: For a fixed resolution level j, Let Then:
21
Analysis (cont.)
22
► We take G copies of Z and take the median.
► By the Chernoff inequality,
► The binary search looked for a j_k such that
► We made log|U| checks in the binary search
► By the union bound, the probability that any of them failed is at most log|U| times the failure probability we achieved per check, i.e. δ
23
RSS Properties
► The algorithm may return a quantile value that was never seen in the input
► Changing the order of insertions and deletions doesn't affect the results
► The RSSs are composable: U can be split into many disjoint ranges that share some pre-agreed common random subsets
24
Extension: U is unknown
► Predict a range [0, u−1] for U.
► Upon insertion of i > u−1, add another instance of RSS with range [u, u²−1], and so on…
► Because RSS is composable, we only have to join the results at query time
► Increased cost factor: log 2 log(|U|).
25
Experiments
► What is the median length of all active AT&T calls?
► When a call starts: insert its start timestamp. When it ends: delete its start timestamp
► 4 KB used for RSS
► Compared: RSS, GK, GK2
26
Number of Active Phone Calls Over Time
27
Error in Computation of Median Over Time
28
Average Error for Last 50 Snapshots, For Deciles
29
The End