End-biased Samples for Join Cardinality Estimation. Cristian Estan, Jeffrey F. Naughton. Computer Sciences Department, University of Wisconsin-Madison.

Problem description. Estimating join size: not restricted to key-foreign key joins; based on summaries of the two tables computed separately. Two main contributions of this paper: proposing a new type of summary based on a special type of sampling; an extensive experimental comparison of many types of summaries.

We can get more accurate estimates! [AGMS99] showed that on certain data sets all summaries give inaccurate estimates, and that estimates based on random sampling are within a constant factor of the bound. We show that on other data sets our estimates are significantly more accurate than those with random sampling. No known summaries give estimates more accurate than all others for every data set.

Overview: End-biased samples; Theoretical comparison against other sampling-based methods; Experimental comparison against sketches and histograms.

Building the end-biased samples. If the frequency of every value is known for both tables, the join size can be computed exactly. We keep a sample of this data. Sampling probability is proportional to frequency [DLT01]. Sampling decisions are correlated by using a shared hash function [F90],[DG00],[EKMV04].
Example with sampling threshold T=4:
Frequency of values of join attribute in table A: (c,10) p=1; (g,1) p=0.25; (m,2) p=0.5; (s,5) p=1; (t,1) p=0.25.
Frequency of values of join attribute in table B: (d,1) (g,1) (m,1) (r,7) (z,1), with p=1 for r and p=0.25 for the values of frequency 1.
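
The construction on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `hash_unit` and `end_biased_sample`, the use of SHA-256, and the seed string are all assumptions made for the example. The only ideas taken from the slide are that a value with frequency f is kept with probability min(1, f/T), and that both tables use the same hash function so their sampling decisions are correlated.

```python
import hashlib

def hash_unit(value, seed="shared-seed"):
    """Map a value to a pseudo-random number in [0, 1).

    Hypothetical helper: any hash shared by both tables would do.
    Only 53 bits are used so the float result is always below 1.0.
    """
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

def end_biased_sample(freqs, threshold, seed="shared-seed"):
    """Keep (value, frequency) pairs with probability min(1, freq / threshold)."""
    sample = {}
    for value, freq in freqs.items():
        p = min(1.0, freq / threshold)
        # The decision depends only on the value and the shared hash,
        # so another table hashing the same value makes a correlated choice.
        if hash_unit(value, seed) < p:
            sample[value] = freq
    return sample

# Frequencies from the slide, sampling threshold T=4.
table_a = {"c": 10, "g": 1, "m": 2, "s": 5, "t": 1}
table_b = {"d": 1, "g": 1, "m": 1, "r": 7, "z": 1}
sample_a = end_biased_sample(table_a, threshold=4)
sample_b = end_biased_sample(table_b, threshold=4)
```

Values at or above the threshold (c and s in table A, r in table B) are always kept. Which of the infrequent values survive depends on the hash chosen, so no particular outcome is shown here.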

Estimating join size. Let $a_v$ be the frequency of value $v$ in table A, $b_v$ its frequency in table B, and $p_v$ the probability that $v$ is selected into both samples. Sum the contributions $a_v b_v / p_v$ of the values present in both samples to estimate the join size.
If $a_v \ge T_a$ and $b_v \ge T_b$, then $p_v = 1$.
If $a_v \ge T_a$ and $b_v < T_b$, then $p_v = b_v / T_b$.
If $a_v < T_a$ and $b_v \ge T_b$, then $p_v = a_v / T_a$.
If $a_v < T_a$ and $b_v < T_b$, then $p_v = \min(a_v / T_a,\, b_v / T_b)$.
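
The four cases on this slide collapse into the single expression min(1, a_v/T_a, b_v/T_b), which the sketch below uses. The function name `estimate_join_size` and the choice to represent each sample as a dict from value to frequency are assumptions made for illustration, not part of the original slides.

```python
def estimate_join_size(sample_a, sample_b, t_a, t_b):
    """Estimate the join size from two correlated end-biased samples.

    sample_a and sample_b map each sampled value to its exact frequency;
    t_a and t_b are the sampling thresholds used for the two tables.
    """
    estimate = 0.0
    # Only values present in both samples contribute.
    for value in sample_a.keys() & sample_b.keys():
        a_v, b_v = sample_a[value], sample_b[value]
        # Probability that the value was selected into both samples.
        p_v = min(1.0, a_v / t_a, b_v / t_b)
        estimate += a_v * b_v / p_v
    return estimate

# Hand-made samples: "c" is frequent in both tables, "m" is infrequent in both.
example = estimate_join_size({"c": 10, "m": 2}, {"c": 8, "m": 1}, t_a=4, t_b=4)
```

In this example "c" contributes 10 * 8 / 1 = 80 and "m" contributes 2 * 1 / 0.25 = 8, so `example` is 88.0. Dividing by $p_v$ scales up each sampled value to stand in for the similar values that were not sampled.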

Why correlate the samples? Example: two tables, each with 1000 values appearing once, and 50 values common to both tables. We sample with probability 1/10, so the sample size is ~100 for each table.
Comparison:
Correlated: $p_v$ = 0.1; common values sampled ~ 4, 5 or 6; join size estimate 40, 50 or 60.
Uncorrelated: $p_v$ = 0.01; common values sampled ~ 0 or 1; join size estimate 0 or 100.
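
The numbers in this comparison follow from simple arithmetic, reproduced below as a worked example. Exact fractions are used to avoid floating-point noise. The variable and function names are illustrative assumptions; the quantities come from the slide.

```python
from fractions import Fraction

def contribution(a_v, b_v, p_v):
    """Contribution a_v * b_v / p_v of one value found in both samples."""
    return Fraction(a_v * b_v) / p_v

p = Fraction(1, 10)    # per-table sampling probability for a value of frequency 1
common_values = 50     # values present in both tables (true join size is 50)

# Shared hash: one decision per value, so it is in both samples with probability p.
p_correlated = p
# Independent decisions: the value must be picked twice, with probability p * p.
p_uncorrelated = p * p

expected_hits_correlated = common_values * p_correlated      # 5
expected_hits_uncorrelated = common_values * p_uncorrelated  # 1/2

# Estimates for the typical numbers of common values that end up in both samples.
correlated_estimates = [k * contribution(1, 1, p_correlated) for k in (4, 5, 6)]
uncorrelated_estimates = [k * contribution(1, 1, p_uncorrelated) for k in (0, 1)]
```

With correlated samples each hit adds 10 to the estimate, giving 40, 50 or 60 around the true value of 50. With uncorrelated samples each hit adds 100, so the estimate jumps between 0 and 100.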

Comparison of sampling methods. Accuracy of estimates of join size, by type of values dominating the join (Random sampling / Counting samples / End-biased samples):
Frequent in both relations: Good / Very good / Perfect.
Frequent in one relation: Bad.
Infrequent in both relations: Bad, Good.

Overview: End-biased samples; Theoretical comparison against other sampling-based methods; Experimental comparison against sketches and histograms.

Experimental methodology. Randomly generated tables with ~1,000,000 tuples. Explored multiple configurations: varied the “peakedness” of the distribution; varied the memory budget from 204 to 659,456 words; varied the amount of correlation between tables. Uncorrelated: tables generated independently. Positively correlated: frequent values likely to be the same in both tables. Negatively correlated: frequent values unlikely to be the same in the two tables. 1,000 runs for each configuration.

Summaries compared: End-biased samples; End-biased equi-depth histograms [PC84]; Sketches [AGMS99],[DGGR02],[GGR04]; Concise samples [GM98]; Counting samples [GM98].

Comparison with histograms

Comparison with sketches

Memory comparison

Qualitative comparison (Advantage / Sketches / End-biased samples): Streaming updates: X. Simple configuration: X. Selection on join attribute: X.

Conclusions. End-biased samples and sketches are the best summaries for the join size estimation problem addressed in this paper. End-biased samples are compelling if: selections on the join attribute are required; summaries must be very concise; the frequencies of join attributes in the two tables are strongly correlated.

Questions? Thank you! Scripts and results for experiments available at http://www.cs.wisc.edu/~estan/ebs.tar.gz

Estimating the join size

Related work – sampling methods: [GM98] concise samples, counting samples; [DLT01] smart sampling; [F90],[EKMV04] using a hash function to select the values used as a summary of the data.

Related work – join size estimation: Histograms; Multidimensional histograms; [GG02],[GK04] Wavelets; [AGMS99],[DGGR02],[GGR04] Sketches.

Variance of join size estimate: no slide, point to the paper.
