1 How to Summarize the Universe: Dynamic Maintenance of Quantiles By: Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin J. Strauss

2 Quantiles Median, quartiles, … The general case: the k-th Ф-quantile of a dataset of N items is the item of rank kФN. Uses: statistics, estimating result-set sizes, partitioning, …

3 Computing static quantiles Blum, Floyd, Pratt, Rivest & Tarjan: find the i'th smallest element. Comparison-based, similar to QuickSort, O(n) worst-case time.
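
For reference, a minimal selection sketch in Python. It uses a randomized pivot, so it gives expected rather than worst-case O(n); BFPRT's contribution is the median-of-medians pivot that makes the bound worst-case.

```python
import random

def quickselect(items, i):
    """Return the i-th smallest element (0-based) of items in expected O(n) time.

    BFPRT replaces the random pivot with a median-of-medians pivot to make the
    O(n) bound worst-case; this sketch keeps the simpler randomized pivot.
    """
    pivot = random.choice(items)
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    if i < len(smaller):
        return quickselect(smaller, i)
    if i < len(smaller) + len(equal):
        return pivot
    return quickselect(larger, i - len(smaller) - len(equal))

# e.g. the median of n items is quickselect(data, (len(data) - 1) // 2)
```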

4 Problems with massive data sets O(n) time – not good enough. O(n) space – usually not affordable. Dynamic environment: cancellations (deletions) are especially troublesome. Quantiles are usually recomputed periodically and may be very inaccurate until recomputed. Some kind of approximation is the only choice!

5 Common approaches Deterministically chosen samples. Randomization – with some probability of failure. Maintaining a backing sample. Wavelets. Most of the above approaches work well for the incremental case, but deletions may cause inaccuracy.

6 GK – Greenwald-Khanna ('01) Fill the available memory with values and maintain rank ranges on the values in memory. When a new value is inserted, kick a value out of memory. An insert-only algorithm; it can be extended to support deletes ("GK2") by maintaining two instances – one for insertions and one for deletions.

7 Maintenance of Equi-Depth Histograms (using a backing sample) Gibbons, Matias, Poosala – '97. Scan the dataset and choose values for the sample using the "reservoir" method. Treat insertions as a "continuous" scan. When a deletion from the sample is necessary, rescan only if the number of items drops below a specified minimum. Works well for a mostly-insertions environment.
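
For concreteness, a minimal sketch of the standard reservoir method (Algorithm R) for maintaining a fixed-size uniform sample over a stream of insertions; the sample size k here is a free parameter, not a value from the paper.

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of size k over a stream of insertions."""
    sample = []
    for n, item in enumerate(stream):      # n = number of items seen so far (0-based)
        if n < k:
            sample.append(item)            # fill the reservoir first
        else:
            j = random.randint(0, n)       # item kept with probability k / (n + 1)
            if j < k:
                sample[j] = item
    return sample
```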

8 The authors' main result: the RSS algorithm RSS – Random Subset Sum. Space polylogarithmic in the universe size; time proportional to the space. An a priori guarantee of accuracy within a user-specified error ε, with a user-specified probability of failure δ.

9 Some formalism … The universe: U = {0, …, |U|−1}. Number of tuples in the data set: ||A|| = N. The data set can be thought of as an array where A[i] is the number of tuples with value i. Our goal for computing Ф-quantiles: find j_k such that A[0, j_k) < kФN ≤ A[0, j_k+1).

10 Some assumptions The universe's size is known (we'll drop that assumption later). Update = Delete + Insert.

11 Computing quantiles Suppose A[i] is known (kept explicitly) for every i. It is easy to maintain through updates, but summing up array items to answer a query takes O(|U|) time – not a very good complexity.
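
A sketch of this naive baseline, with illustrative names not taken from the paper: updates are constant-time, but each query scans the whole universe.

```python
def naive_quantile(A, phi, k):
    """Return j_k, the k-th phi-quantile, by scanning the counts A[0..|U|-1].

    Updates are trivial (A[i] += 1 on insert, A[i] -= 1 on delete), but each
    query costs O(|U|) -- the complexity this slide objects to.
    """
    N = sum(A)
    target = k * phi * N
    running = 0
    for j, count in enumerate(A):
        running += count
        if running >= target:
            return j
    return len(A) - 1
```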

12 Computing quantiles (cont.) We need a method of reducing the summation overhead: we should be able to compute the sum of any range of items of A in logarithmic time. The solution: keep precomputed sums of intervals.

13 Dyadic intervals - defined An atomic dyadic interval is a single point. I(j,k) = [k·2^(log|U|−j), (k+1)·2^(log|U|−j) − 1], where j is the resolution level. Example for |U| = 8: level 3 holds the atomic intervals I(3,0), …, I(3,7); level 2 holds I(2,0), …, I(2,3); level 1 holds I(1,0) and I(1,1); level 0 holds the single interval I(0,0) = [0,7].

14 Let ’ s say we have sums for all dyadic intervals as in the above example. We want to compute A[0,6]. A[0,6] = I(1,0) + I(2,2) + I(3,6) Computing an arbitrary interval I(3,0)I(3,1)I(3,2)I(3,3)I(3,4)I(3,5)I(3,6)I(3,7) I(2,0)I(2,1)I(2,2)I(2,3) I(1,0)I(1,1) I(0,0)

15 Dyadic intervals - observations log(|U|) + 1 resolution levels; 2|U| − 1 dyadic intervals altogether; O(|U|) space needed to keep them all; O(log|U|) time needed to compute any arbitrary interval.

16 Computing quantiles (Cont.) We can now efficiently compute any arbitrary interval of A. The k-th Ф-quantile can then be computed: we need j_k such that A[0, j_k) < kФN ≤ A[0, j_k+1). Use binary search to find it!
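
A sketch of that search step, assuming a hypothetical prefix_count(j) oracle that returns (an estimate of) A[0, j) built from the dyadic sums; since it is monotone in j, binary search over the universe needs O(log|U|) oracle calls.

```python
def find_quantile(prefix_count, universe_size, N, phi, k):
    """Binary-search for j_k with A[0, j_k) < k*phi*N <= A[0, j_k + 1).

    prefix_count(j) must return (an estimate of) A[0, j), i.e. the number of
    items with value < j, and is assumed monotone nondecreasing in j.
    """
    target = k * phi * N
    lo, hi = 0, universe_size - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_count(mid + 1) < target:   # fewer than target items are <= mid
            lo = mid + 1
        else:
            hi = mid
    return lo
```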

17 But … Keeping O(|U|) data presents a real space-complexity problem. We need a way of estimating A[i] on demand – and also of estimating any dyadic interval on demand.

18 Introducing random sets Let S be a random set of values from U, where each value is in S with probability ½. The expected number of items in S is ½|U|.

19 Random subset sums Define ||A_S|| as the number of items in A with values in S. The expectation of ||A_S|| is ½||A|| = ½N. Now consider only the subsets S containing a certain value i.

20 Random subset sums (cont.) Suppose we keep a number of random sets S, each containing values from U independently with probability ½, and we maintain ||A_S|| for each such set. These counts are easy to maintain during updates. How can we now estimate A[i]?

21 Random subset sums (cont.) Using the sets S that contain i, we can estimate A[i] for any i as A[i] ≈ 2||A_S|| − ||A||. Proof: conditioned on i ∈ S, every other value is in S with probability ½, so E[||A_S||] = A[i] + ½(||A|| − A[i]), and hence E[2||A_S|| − ||A||] = A[i]. The authors prove that repeating the process O(1/ε²) times yields the required accuracy.
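
An illustrative simulation of this estimator with explicit random sets (the real algorithm maintains the counters incrementally and represents each set by a short seed, as the next slides explain); the number of repetitions below is a placeholder standing in for the O(1/ε²) bound.

```python
import random

def estimate_count(A, i, num_sets=200):
    """Estimate A[i] with the RSS estimator 2*||A_S|| - ||A||.

    Each trial draws a random subset S of the universe (every value kept with
    probability 1/2), conditions on i being in S, and averages the estimator
    over the trials.  This is an illustrative simulation with explicit sets;
    the actual algorithm never materializes S.
    """
    universe = range(len(A))
    total = sum(A)                        # ||A|| = N
    estimates = []
    for _ in range(num_sets):
        S = {v for v in universe if random.random() < 0.5}
        S.add(i)                          # condition on i being in S
        A_S = sum(A[v] for v in S)        # ||A_S||
        estimates.append(2 * A_S - total)
    return sum(estimates) / len(estimates)
```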

22 Random subset sums (cont.) We can also estimate any dyadic interval I(j,k) using the same method. Improvement: we can compute the sums for dyadic intervals from a certain level. We can now estimate any arbitrary interval in the universe.

23 Space Considerations Keeping a set of expected size ½|U| still takes O(|U|) space. We need a method of "keeping" a set without actually storing it. The technique: instead of sets, keep random seeds of O(log|U|) bits and compute whether a given i ∈ S on demand.

24 Extended Hamming Code Used for generating the random sets; provides sufficient "randomness". For example, with |U| = 8 the seed size is log|U| + 1 = 4 bits, and G(seed, i) is the product (mod 2) of the seed with the i'th column of the generator matrix.
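
A minimal sketch of one plausible reading of this construction, assuming the i-th column of the generator matrix is a leading 1 followed by the binary digits of i, so that membership is the seed/column inner product mod 2.

```python
def in_set(seed, i, log_u):
    """Decide membership of value i in the pseudorandom set defined by `seed`.

    Assumption (one plausible reading of the slide): the i-th column of the
    generator matrix is the bit 1 followed by the log|U| binary digits of i,
    and G(seed, i) is the inner product of the seed with that column mod 2.
    The seed has log|U| + 1 bits, so the whole "set" costs O(log|U|) space.
    """
    column = (1 << log_u) | i              # leading 1 bit, then the bits of i
    return bin(seed & column).count("1") % 2 == 1

# Example for |U| = 8 (log|U| = 3, 4-bit seeds): enumerate one pseudorandom set.
seed = 0b1011
print([i for i in range(8) if in_set(seed, i, 3)])   # [0, 3, 4, 7]
```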

25 RSS Algorithm Summary To estimate a dyadic interval: compute 2||A_S|| − ||A|| over the sets S containing the given dyadic interval. To estimate an arbitrary interval: write it as a disjoint union of dyadic intervals, estimate each of them, and take a median over the possible results (simplified). To compute the quantiles: use binary search, estimating intervals until the answer is found.

26 Algorithm Complexity Claim: the RSS algorithm's space complexity (for t quantile queries) is polylogarithmic in the universe size, and the time complexity for inserts, deletes and computing each quantile on demand is proportional to the space used.

27 Proof Outline Declare random variables X_k = 2||A_{I_k}|| if I_k is in S and 0 otherwise; X – the sum of all the X_k's in a certain set; Y – the sum of all the X's in a given interval; Z – a number of repetitions of X.

28 Proof Outline (Cont.) In a similar fashion to the previous slides, show that Y and ||A|| can be used to compute ||A_I||. Compute the variance, then use Chebyshev's and Chernoff's inequalities, together with the computed variance, to achieve the required result.

29 What If U Is Unknown? In practice, the universe U is not always known in advance. Predict a range [0, u−1] for U; given an inserted (or updated) value i with i > u−1, add another instance of RSS with range [u, u²−1], and so on. Estimating dyadic intervals can be done within a single instance of RSS. Increased cost factor: log₂(log|U|).
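
A sketch of how such a chain of instances might be managed, assuming a hypothetical RSS class over a fixed range with an insert method; the range progression [0, u−1], [u, u²−1], … follows the slide.

```python
class GrowingUniverseRSS:
    """Route updates to a chain of RSS instances with growing ranges.

    Sketch only: the objects built by rss_factory(lo, hi) are assumed to be
    the sketch of the previous slides over a fixed range [lo, hi], supporting
    insert().  Ranges follow the slide: [0, u-1], then [u, u^2 - 1], and so on.
    """

    def __init__(self, rss_factory, u):
        self.rss_factory = rss_factory
        self.instances = [(0, u - 1, rss_factory(0, u - 1))]

    def insert(self, value):
        # Grow the chain until some instance's range covers the new value.
        while value > self.instances[-1][1]:
            last_hi = self.instances[-1][1]
            new_lo, new_hi = last_hi + 1, (last_hi + 1) ** 2 - 1
            self.instances.append((new_lo, new_hi, self.rss_factory(new_lo, new_hi)))
        # Each value belongs to exactly one instance's range.
        for lo, hi, inst in self.instances:
            if lo <= value <= hi:
                inst.insert(value)
                return
```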

30 Some RSS Properties RSS may return as a quantile a value that is not actually present in the dataset. The order of insertions and deletions does not affect the result or its accuracy. RSS can be parallelized quite easily (as long as the random subsets are pre-agreed).

31 Experimental Results Experiments: static artificial dataset, dynamic artificial dataset, dynamic real dataset. Participants: Naïve[l], RSS[l], GK, and GK2 (an improvement of GK).

32 Static Artificial Dataset |U| = 2^20. Compute 15 quantiles at positions (1/16)k for k = 1, 2, …, 15, under 3 different distributions: Uniform, Zipf, Normal[m,v]. Algorithm used: RSS[7] (11K footprint).

33 Errors for Zipf data

34 Errors for Normal[U/2, U/50] Distribution

35 Dynamic Artificial Dataset Insert N = 104,858 items from the uniform distribution D1 = Uni[1, U], U = 2^20. Insert αN more items from the uniform distribution D2 = Uni[U/2 − U/32, U/2 + U/32]. Delete all values from the first insertion. The parameter α controls the mass of the second insertion with respect to the first.

36 Dynamic Artificial Dataset Results

37 Dynamic Real Dataset Based on true Call Detail Records (CDRs) from AT&T. The dataset used includes 4.42 million CDRs covering a period of 18 hours. Objective: find the median length of the currently active calls. Probe for estimates every 10,000 records. Algorithm used: RSS[6] (4K footprint).

38 Number of Active Phone Calls Over Time

39 Error in Computation of Median Over Time

40 Average Error for Last 50 Snapshots, For Deciles

41 Conclusions – RSS An algorithm for maintaining dynamic quantiles. Works well (within a user-defined precision) for both insertions AND deletions. Polylogarithmic (in the universe size) space and time complexity.
