End-biased Samples for Join Cardinality Estimation Cristian Estan, Jeffrey F. Naughton Computer Sciences Department University of Wisconsin-Madison.

1 End-biased Samples for Join Cardinality Estimation Cristian Estan, Jeffrey F. Naughton Computer Sciences Department University of Wisconsin-Madison

2 Problem description
- Estimating join size
  - Not restricted to key-foreign key joins
  - Based on summaries of the two tables computed separately
- Two main contributions of this paper
  - Proposing a new type of summary based on a special type of sampling
  - Extensive experimental comparison of many types of summaries

3 We can get more accurate estimates!
- [AGMS99] showed that on certain data sets
  - All summaries give inaccurate estimates
  - Estimates based on random sampling are within a constant factor of the bound
- We show that
  - On other data sets, our estimates are significantly more accurate than those obtained with random sampling
  - No known summaries give estimates more accurate than all others for every data set

4 Overview
- End-biased samples
- Theoretical comparison against other sampling-based methods
- Experimental comparison against sketches and histograms

5 Building the end-biased samples
- If the frequency of every value is known for both tables → exact join size
- We keep a sample of this data
  - Sampling probability proportional to frequency [DLT01]
  - Sampling decisions correlated by using a shared hash function [F90],[DG00],[EKMV04]
- Example with sampling threshold T=4; pairs are (value, frequency) for the join attribute, and a value with frequency f is sampled with probability p = min(1, f/T)
  - Table A: (c,10) p=1; (g,1) p=0.25; (m,2) p=0.5; (s,5) p=1; (t,1) p=0.25
  - Table B: (d,1) p=0.25; (g,1) p=0.25; (m,1) p=0.25; (r,7) p=1; (z,1) p=0.25
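The sampling step on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the helper names `hash01` and `end_biased_sample`, the seed string, and the choice of SHA-256 as the shared hash are all hypothetical. A value v with frequency f is kept when its hash, mapped to [0, 1), falls below f/T, so values with f ≥ T are always kept and the same hash drives the decision in both tables.

```python
import hashlib

def hash01(value, seed="shared-seed"):
    """Map a value to a pseudo-random number in [0, 1).

    Both tables call this with the same seed, which is what
    correlates their sampling decisions (hypothetical helper).
    """
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def end_biased_sample(freqs, threshold):
    """Keep (value, exact frequency) with probability min(1, f / threshold)."""
    return {v: f for v, f in freqs.items() if hash01(v) < f / threshold}

# Frequencies from the slide's example, threshold T = 4.
table_a = {"c": 10, "g": 1, "m": 2, "s": 5, "t": 1}
table_b = {"d": 1, "g": 1, "m": 1, "r": 7, "z": 1}
sample_a = end_biased_sample(table_a, 4)
sample_b = end_biased_sample(table_b, 4)
```

Because the hash is shared, a value with the same frequency in both tables (g here) is either in both samples or in neither, and a value selected in the table where it is rarer (m in table B) is also selected in the other table.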

6 Estimating join size
- Let a_v be the frequency of value v in table A, b_v its frequency in table B, and p_v the probability that v is selected into both samples
- Sum the contributions a_v·b_v/p_v of the values present in both samples to estimate the join size
- If a_v ≥ T_a and b_v ≥ T_b, p_v = 1
- If a_v ≥ T_a and b_v < T_b, p_v = b_v/T_b
- If a_v < T_a and b_v ≥ T_b, p_v = a_v/T_a
- If a_v < T_a and b_v < T_b, p_v = min(a_v/T_a, b_v/T_b)
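The four cases for p_v collapse into a single expression, p_v = min(1, a_v/T_a, b_v/T_b), which makes the estimator short to write down. The Python sketch below is an illustration only; the function names are hypothetical, and the samples are assumed to be dictionaries mapping each sampled value to its exact frequency.

```python
def selection_probability(a_v, b_v, t_a, t_b):
    """Probability that a value lands in both samples (covers all four cases)."""
    return min(1.0, a_v / t_a, b_v / t_b)

def estimate_join_size(sample_a, sample_b, t_a, t_b):
    """Sum a_v * b_v / p_v over the values present in both samples."""
    total = 0.0
    for v, a_v in sample_a.items():
        b_v = sample_b.get(v)
        if b_v is not None:
            total += a_v * b_v / selection_probability(a_v, b_v, t_a, t_b)
    return total

# Suppose g and m happened to be sampled in both tables, with T_a = T_b = 4.
example = estimate_join_size({"g": 1, "m": 2}, {"g": 1, "m": 1}, 4, 4)
```

In this example g contributes 1·1/0.25 = 4 and m contributes 2·1/0.25 = 8, so the estimate is 12. Dividing by p_v compensates for the values that were not sampled, which is what keeps the estimate unbiased.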

7 Why correlate the samples?
- Example: two tables, each with 1000 values appearing once; 50 values are common to both tables
- We sample with probability 1/10
- Sample size ~100 for each table

Comparison               Correlated      Uncorrelated
p_v                      0.1             0.01
Common values sampled    ~4, 5 or 6      ~0 or 1
Join size estimate       40, 50 or 60    0 or 100
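A small simulation makes the difference in spread concrete. The sketch below is an illustration under simplifying assumptions, not an experiment from the paper: it tracks only the 50 common values (the others cannot contribute to the join), uses Python's `random` module in place of a hash function, and the function name `simulate` and its parameters are hypothetical.

```python
import random
import statistics

def simulate(correlated, runs=2000, common=50, p=0.1, seed=0):
    """Return join size estimates for tables whose common values all have frequency 1."""
    rng = random.Random(seed)
    # A common value is in both samples with probability p when the
    # decisions are shared, and p * p when they are independent.
    p_v = p if correlated else p * p
    estimates = []
    for _ in range(runs):
        hits = 0
        for _ in range(common):
            in_a = rng.random() < p
            in_b = in_a if correlated else (rng.random() < p)
            hits += int(in_a and in_b)
        estimates.append(hits / p_v)  # each hit contributes 1 * 1 / p_v
    return estimates

corr = simulate(correlated=True)
uncorr = simulate(correlated=False)
spread_corr = statistics.pstdev(corr)
spread_uncorr = statistics.pstdev(uncorr)
```

Both variants average out near the true join size of 50, but the uncorrelated estimates jump in steps of 100 while the correlated ones move in steps of 10, so the correlated estimator has a much smaller standard deviation.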

8 Comparison of sampling methods
Accuracy of estimates of join size, by type of values dominating the join

Type of values dominating the join    Random sampling    Counting samples    End-biased samples
Frequent in both relations            Good               Very good           Perfect

Frequent in one relation: Bad
Infrequent in both relations: Bad, Good

9 Overview
- End-biased samples
- Theoretical comparison against other sampling-based methods
- Experimental comparison against sketches and histograms

10 Experimental methodology
- Randomly generated tables with ~1,000,000 tuples
- Explored multiple configurations
  - Varied the "peakedness" of the distribution
  - Varied the memory budget from 204 to 659,456 words
  - Varied the amount of correlation between tables
    - Uncorrelated: tables generated independently
    - Positively correlated: frequent values are likely to be the same in both tables
    - Negatively correlated: frequent values are unlikely to be the same in the two tables
- 1,000 runs for each configuration

11 Summaries compared
- End-biased samples
- End-biased equi-depth histograms [PC84]
- Sketches [AGMS99],[DGGR02],[GGR04]
- Concise samples [GM98]
- Counting samples [GM98]

12 Comparison with histograms

13 Comparison with sketches

14 Memory comparison

15 Qualitative comparison
Columns: Advantage, Sketches, End-biased samples (one X per row)
- Streaming updates: X
- Simple configuration: X
- Selection on join attribute: X

16 Conclusions
- End-biased samples and sketches are the best summaries for the join size estimation problem addressed in this paper
- End-biased samples are compelling if
  - Selections on the join attribute are required
  - Summaries must be very concise
  - The frequencies of join attributes in the two tables are strongly correlated

17 Questions? Thank you! Scripts and results for experiments available at http://www.cs.wisc.edu/~estan/ebs.tar.gz

18 Estimating the join size

19 Related work: sampling methods
- [GM98] concise samples, counting samples
- [DLT01] smart sampling
- [F90],[EKMV04] using a hash function to select the values used as a summary of the data

20 Related work: join size estimation
- Histograms
- Multidimensional histograms
- [GG02],[GK04] Wavelets
- [AGMS99],[DGGR02],[GGR04] Sketches

21 Variance of join size estimate No slide, point to the paper.

