How to Summarize the Universe: Dynamic Maintenance of Quantiles
Gilbert, Kotidis, Muthukrishnan, Strauss
Presented by Itay Malinger, December 2003


Problem Definition
► The Universe: U = {0, …, |U|−1}
► Number of records in the data set: ||A|| = N
► The data set can be thought of as an array: A[i] is the number of records with value i
► A_S is the number of records with values in S
► The Φ-quantiles of an ordered sequence of N data items are the values with rank kΦN (k = 1, 2, …)
► Our goal is computing ε-approximate Φ-quantiles: find a j_k whose rank is within εN of kΦN, i.e. |A[0, j_k] − kΦN| ≤ εN

Transactions
► Insert(i): A[i] ← A[i] + 1
► Delete(i): A[i] ← A[i] − 1
► Let
► ASSUME: the universe size |U| is known

The Main Algorithmic Result
► The RSS Algorithm
► Space Complexity
► Update: on every transaction, in O(space) time
► Estimation: on demand, in O(space) time
► Single pass over the data

Dyadic Intervals
► log(|U|)+1 resolution levels j
► 2|U|−1 dyadic intervals

Values:   0      1      2      3      4      5      6      7
Level 3:  I(3,0) I(3,1) I(3,2) I(3,3) I(3,4) I(3,5) I(3,6) I(3,7)
Level 2:  I(2,0) I(2,1) I(2,2) I(2,3)
Level 1:  I(1,0) I(1,1)
Level 0:  I(0,0)

Arbitrary intervals
► Any interval can be written as a disjoint union of at most 2·log(|U|) dyadic intervals (at most log(|U|) when the interval starts at 0)
► For example, with |U| = 8: A[0,6] = I(1,0) + I(2,2) + I(3,6)
► Intervals starting at 0 will not use the same resolution twice
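The decomposition can be sketched in a few lines. This is a minimal illustration, not code from the paper or the slides: the function name `dyadic_cover` and the greedy left-to-right strategy are assumptions, and I(j, k) is taken to cover the values k·2^(log|U|−j) through (k+1)·2^(log|U|−j) − 1, as in the |U| = 8 example.

```python
def dyadic_cover(lo: int, hi: int, log_u: int) -> list[tuple[int, int]]:
    """Split the inclusive range [lo, hi] into disjoint dyadic intervals (j, k).

    Level 0 is the whole universe; level log_u holds the single points.
    """
    pieces = []
    end = hi + 1                      # work with the half-open range [lo, end)
    while lo < end:
        # Largest power-of-two block aligned at lo (anything is aligned at 0).
        size = lo & -lo if lo else 1 << log_u
        # Shrink the block until it fits inside the remaining range.
        while size > end - lo:
            size >>= 1
        j = log_u - size.bit_length() + 1   # size == 2 ** (log_u - j)
        pieces.append((j, lo // size))
        lo += size
    return pieces


print(dyadic_cover(0, 6, 3))   # the slide's example: I(1,0), I(2,2), I(3,6)
print(dyadic_cover(1, 6, 3))   # an interval not starting at 0 needs four pieces
```

The second call shows why intervals that do not start at 0 can need more than log(|U|) pieces: [1, 6] uses two intervals at level 2 and two at level 3.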

Computing quantiles
► Assuming we have the number of records in each dyadic interval, we can efficiently compute the count for any arbitrary interval in A.
► To compute the Φ-quantile for any k, we need a j_k s.t.: A[0, j_k) < kΦN ≤ A[0, j_k + 1)
► Use binary search to find it.
► Keeping all intervals is costly (O(|U|))
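As a baseline, the exact (and costly) version of this procedure might look like the sketch below. It stores every dyadic count explicitly, which is the O(|U|) cost that RSS is meant to avoid. The helper names `build_counts`, `prefix_count`, and `quantile` are hypothetical, and the target rank is assumed to satisfy 1 ≤ rank ≤ N.

```python
def build_counts(a: list[int], log_u: int) -> list[list[int]]:
    """counts[j][k] = number of records whose value lies in I(j, k)."""
    counts = [[] for _ in range(log_u + 1)]
    counts[log_u] = list(a)                       # finest level: single values
    for j in range(log_u - 1, -1, -1):
        finer = counts[j + 1]
        counts[j] = [finer[2 * k] + finer[2 * k + 1] for k in range(1 << j)]
    return counts


def prefix_count(counts: list[list[int]], x: int, log_u: int) -> int:
    """Records with value in [0, x), using at most one interval per level."""
    if x >= 1 << log_u:
        return counts[0][0]
    total = 0
    for j in range(1, log_u + 1):
        k = x >> (log_u - j)          # level-j interval that contains x
        if k & 1:                     # right child: left sibling lies below x
            total += counts[j][k - 1]
    return total


def quantile(counts: list[list[int]], rank: int, log_u: int) -> int:
    """Smallest j with prefix_count(j) < rank <= prefix_count(j + 1)."""
    lo, hi = 0, (1 << log_u) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_count(counts, mid + 1, log_u) >= rank:
            hi = mid
        else:
            lo = mid + 1
    return lo


a = [2, 0, 1, 0, 3, 0, 0, 2]          # A[i] for |U| = 8, N = 8
counts = build_counts(a, 3)
print(quantile(counts, 4, 3))          # value holding the record of rank 4
```

The binary search makes log(|U|) prefix queries, and each query touches at most one interval per level.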

Random Subset Sums
► Consider the finest resolution level, j = log(|U|)
► Let S be a random subset of U
► Each u ∈ U has probability p = ½ of being in S
► E(|S|) = ½|U|
► Define: A_S = Σ_{i ∈ S} A[i]
► E(A_S) = ½||A|| = ½N

Estimating A[i]
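The body of this slide did not survive in the transcript. One standard way to turn a random subset sum into a point estimate, stated here as an assumption rather than as the slide's content, is to use 2·A_S − N for sets S that contain i: conditioned on i ∈ S, every other value is still in S with probability ½, so E(A_S) = A[i] + ½(N − A[i]) and 2·A_S − N has expectation A[i]. The sketch below checks this by brute force over all subsets of a tiny universe; the array `a` and the function names are illustrative only.

```python
from itertools import combinations


def subset_sum(a: list[int], subset: tuple[int, ...]) -> int:
    """A_S: number of records whose value lies in the subset."""
    return sum(a[v] for v in subset)


def average_estimate(a: list[int], i: int) -> float:
    """Average of 2*A_S - N over every subset S of the universe containing i."""
    n = sum(a)
    universe = range(len(a))
    estimates = [
        2 * subset_sum(a, s) - n
        for size in range(len(a) + 1)
        for s in combinations(universe, size)
        if i in s
    ]
    return sum(estimates) / len(estimates)


a = [3, 0, 5, 1]                       # toy data set with N = 9
for i in range(len(a)):
    print(i, a[i], average_estimate(a, i))
```

Each printed average matches A[i], which is the unbiasedness property the estimator relies on. A single set gives a very noisy estimate, hence the averaging over many copies later in the deck.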

Improvement
► Instead of keeping random sets of points only (the finest dyadic intervals), keep random sets at all resolutions
► We need a method of keeping a random set of j-resolution dyadic intervals (keeping it explicitly costs O(|U|))
► Instead of keeping the sets, keep a small representation of them

Pseudorandom set generator
► We need to keep a small representation of a random set S (each i ∈ U is in S with p = ½)
► Given a seed of size log(|U|)+1
► Represent a set S of size O(|U|)
► Quickly test whether i ∈ S or not
► Use the Extended Hamming Code

Extended Hamming Code
► Given a seed, tells whether i ∈ S
► For example:
   ► |U| = 8
   ► Seed size: log|U|+1 = 4
   ► G(seed, i) = seed × i'th column, mod 2
► Efficient to compute
► 3-wise independent
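The generator matrix itself is missing from the transcript. A common construction, assumed here, takes the i'th column to be a constant 1 followed by the binary digits of i, so that G(seed, i) is the parity of the bitwise AND of the seed and that column. The bit ordering and the names `in_set`, `members`, and `is_three_wise_independent` are illustrative choices, not taken from the slides.

```python
from itertools import combinations

LOG_U = 3   # |U| = 8, as in the example above


def in_set(seed: int, i: int) -> bool:
    """G(seed, i): inner product mod 2 of the seed with the i'th column."""
    column = (i << 1) | 1             # constant 1 bit, then the bits of i
    return bin(seed & column).count("1") % 2 == 1


def members(seed: int) -> list[int]:
    """The set S represented by a (log|U| + 1)-bit seed."""
    return [i for i in range(1 << LOG_U) if in_set(seed, i)]


def is_three_wise_independent() -> bool:
    """Over all 16 seeds, every triple of distinct values must show each of
    the 8 membership patterns equally often (twice)."""
    seeds = range(1 << (LOG_U + 1))
    for trio in combinations(range(1 << LOG_U), 3):
        tally = {}
        for seed in seeds:
            key = tuple(in_set(seed, i) for i in trio)
            tally[key] = tally.get(key, 0) + 1
        if len(tally) != 8 or any(c != 2 for c in tally.values()):
            return False
    return True


print(members(2))                      # the odd values
print(is_three_wise_independent())
```

With this column layout any three distinct columns are linearly independent over GF(2), because the constant bits of three columns sum to 1; that is what the exhaustive check confirms for |U| = 8.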

The Data Structure
► For each resolution level j, keep num_copies random subsets S of the dyadic intervals in that level (only the representing seed is kept)
► Keep ||A_S|| for each subset S
► Maintain N = ||A||
► This gives S_1, …, S_num_copies per level

Upon Transactions
► Insert(i) / Delete(i)
   ► For each resolution level j
      ► Locate the single I(j,k) into which i falls (the high-order bits of i)
      ► Determine all S_ℓ containing I(j,k)
      ► For each such S_ℓ, increase/decrease ||A_Sℓ|| by 1
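The update path can be sketched as follows. This is an illustrative layout under stated assumptions, not the authors' implementation: the class name `RSSketch`, the per-level seed width of j + 1 bits, and the parity-based membership test `in_set` (a constant 1 followed by the bits of the interval index) are all hypothetical choices.

```python
import random


def in_set(seed: int, k: int) -> bool:
    """Assumed membership test: parity of seed AND (1 followed by bits of k)."""
    return bin(seed & ((k << 1) | 1)).count("1") % 2 == 1


class RSSketch:
    def __init__(self, log_u: int, num_copies: int, rng: random.Random):
        self.log_u = log_u
        self.n = 0                                   # N = ||A||
        # One seed and one counter ||A_S|| per (level, copy).
        self.seeds = [[rng.getrandbits(j + 1) for _ in range(num_copies)]
                      for j in range(log_u + 1)]
        self.sums = [[0] * num_copies for _ in range(log_u + 1)]

    def _update(self, i: int, delta: int) -> None:
        self.n += delta
        for j in range(self.log_u + 1):
            k = i >> (self.log_u - j)                # high-order j bits of i
            for copy, seed in enumerate(self.seeds[j]):
                if in_set(seed, k):                  # I(j, k) belongs to this S
                    self.sums[j][copy] += delta

    def insert(self, i: int) -> None:
        self._update(i, +1)

    def delete(self, i: int) -> None:
        self._update(i, -1)


sketch = RSSketch(log_u=3, num_copies=4, rng=random.Random(0))
for value in [1, 6, 6, 3]:
    sketch.insert(value)
sketch.delete(6)
print(sketch.n, sketch.sums[0])
```

Each transaction touches (log|U| + 1) · num_copies counters, and because every counter is a plain sum the final state depends only on the multiset of surviving records, not on the order of operations.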

Estimating Quantiles: Dyadic Intervals
► Given a dyadic interval I = I(j,k)
► There are num_copies = G·E sets of resolution j
► Quickly test each S_ℓ to check whether I ∈ S_ℓ, and if so form an estimate of ||A_I|| from ||A_Sℓ||
► Group all estimates into G groups of E elements
► For each group g, calculate the average of its estimates, A_{g,j,k}

Estimating Quantiles: Arbitrary intervals
► Given an interval I, write it as a disjoint union of at most 2·log(|U|) dyadic intervals I(j,k) (at most log(|U|) when I starts at 0)
► Form G groups; for each group g, sum A_{g,j,k} over all I(j,k) comprising I
► Take the median over the G groups as the final estimate Ã_I of ||A_I||
► It's more convenient to treat the result as an overestimate: ||A_I|| ≤ Ã_I ≤ ||A_I|| + εN
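The average-then-sum-then-median combination can be illustrated on its own, separately from how the per-copy estimates are produced. The sketch below assumes, for simplicity, that every copy yields an estimate for every dyadic piece; the function name `combine` and the sample numbers are made up for illustration.

```python
from statistics import mean, median


def combine(estimates: list[list[float]], groups: int, per_group: int) -> float:
    """estimates[p][c] is copy c's estimate for dyadic piece p (G*E copies).

    Average within each group per piece, sum the averages over the pieces,
    then take the median over the groups.
    """
    group_totals = []
    for g in range(groups):
        window = slice(g * per_group, (g + 1) * per_group)
        group_totals.append(sum(mean(piece[window]) for piece in estimates))
    return median(group_totals)


# Two dyadic pieces, G = 3 groups of E = 2 copies each.
piece_a = [10, 12, 9, 11, 50, 10]      # group averages: 11, 10, 30
piece_b = [4, 6, 5, 5, 3, 7]           # group averages: 5, 5, 5
print(combine([piece_a, piece_b], groups=3, per_group=2))
```

The third group contains an outlier (50) that drags its total to 35, but the median over the group totals 16, 15, 35 ignores it. Averaging reduces variance within a group; the median then guards against the occasional bad group.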

[Figure: worked example with 3 dyadic intervals, E = 4 elements per group, and G = 3 groups; stages labeled SUM, AVERAGE, MEDIAN, ending in the interval's estimate]

Analysis
► Lemma: The algorithm estimates each quantile to within εN with probability > 1 − δ
► Proof:
   ► For a fixed resolution level j, Let
   ► Then:

Analysis (cont.)

► We take G copies of Z and take the median.
► By the Chernoff inequality,
► The binary search looked for a j_k such that
► We made log|U| checks in the binary search
► By the union bound, the probability that any of them failed is at most log|U| times the per-check failure probability, i.e. δ
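The last step can be written out explicitly. The per-check bound is missing from the transcript, so the line below assumes that G is chosen so that a single check fails with probability at most δ/log|U|, which is what the final bullet implies:

```latex
\Pr[\text{some check fails}]
  \;\le\; \sum_{t=1}^{\log|U|} \Pr[\text{check } t \text{ fails}]
  \;\le\; \log|U| \cdot \frac{\delta}{\log|U|}
  \;=\; \delta .
```

No independence between the checks is needed, since the union bound holds for arbitrary events.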

RSS Properties
► The algorithm may return a quantile value that was never seen in the input
► Changing the order of insertions and deletions doesn't affect the results
► The RSSs are composable: U can be split into many disjoint ranges that are summarized separately with pre-agreed common random subsets

Extension: U is unknown
► Predict a range [0, u−1] for U.
► Upon insertion of i > u−1, add another instance of RSS with range [u, u²−1], and so on…
► Because RSS is composable, we only have to combine the results upon query
► Increased cost factor: log² log(|U|).

Experiments
► What is the median length of all active AT&T calls?
► When a call
   ► Starts: add its start timestamp
   ► Ends: delete its start timestamp
► 4 KB used for RSS
► Compared
   ► RSS
   ► GK
   ► GK2

Number of Active Phone Calls Over Time

Error in Computation of Median Over Time

Average Error for Last 50 Snapshots, For Deciles

The End
