Published by Edwin Greene. Modified over 8 years ago.
1
End-biased Samples for Join Cardinality Estimation
Cristian Estan, Jeffrey F. Naughton
Computer Sciences Department
University of Wisconsin-Madison
2
Problem description
Estimating join size
  Not restricted to key-foreign key joins
  Based on summaries of the two tables computed separately
Two main contributions of this paper
  Proposing a new type of summary based on a special type of sampling
  Extensive experimental comparison of many types of summaries
3
We can get more accurate estimates!
[AGMS99] showed that on certain data sets
  All summaries give inaccurate estimates
  Estimates based on random sampling are within a constant factor of the bound
We show that
  On other data sets, our estimates are significantly more accurate than those obtained with random sampling
  No known summary gives estimates more accurate than all others for every data set
4
Overview
  End-biased samples
  Theoretical comparison against other sampling-based methods
  Experimental comparison against sketches and histograms
5
Building the end-biased samples
If the frequency of every value is known for both tables → exact join size
We keep a sample of this data
  Sampling probability proportional to frequency [DLT01]
  Sampling decisions correlated by using a shared hash function [F90],[DG00],[EKMV04]
Example with sampling threshold T=4

Frequency of values of join attribute in table A
  Value   Frequency   p
  c       10          1
  g       1           0.25
  m       2           0.5
  s       5           1
  t       1           0.25

Frequency of values of join attribute in table B
  Value   Frequency   p
  d       1           0.25
  g       1           0.25
  m       1           0.25
  r       7           1
  z       1           0.25
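The construction on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the helper names `hash_unit` and `end_biased_sample`, the seed string, and the use of SHA-256 as the shared hash are all assumptions made for the example. A value with frequency f is kept when its hash, mapped to [0, 1), falls below min(1, f/T), so both tables make the same decision for values with the same sampling probability.

```python
import hashlib

def hash_unit(value, seed="shared-seed"):
    """Map a value to a number in [0, 1) with a hash shared by both tables.

    SHA-256 and the seed string are illustrative choices (assumptions).
    """
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    # Keep 53 bits so the result is exactly representable and strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

def end_biased_sample(freqs, threshold, seed="shared-seed"):
    """Keep (value, frequency) pairs with probability min(1, freq / threshold)."""
    sample = {}
    for value, freq in freqs.items():
        p = min(1.0, freq / threshold)
        # Values with freq >= threshold have p == 1 and are always kept.
        if hash_unit(value, seed) < p:
            sample[value] = freq
    return sample

# Frequencies from the slide's example, with sampling threshold T = 4.
table_a = {"c": 10, "g": 1, "m": 2, "s": 5, "t": 1}
table_b = {"d": 1, "g": 1, "m": 1, "r": 7, "z": 1}
sample_a = end_biased_sample(table_a, threshold=4)
sample_b = end_biased_sample(table_b, threshold=4)
```

Because the same hash drives both samples, a value that survives in the table where its sampling probability is lower is guaranteed to survive in the other table as well.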
6
Estimating join size
Let $a_v$ be the frequency of value $v$ in table A, $b_v$ its frequency in table B, and $p_v$ the probability that $v$ is selected into both samples
Sum the contributions $a_v b_v / p_v$ of the values present in both samples to estimate the join size
  If $a_v \ge T_a$ and $b_v \ge T_b$, $p_v = 1$
  If $a_v \ge T_a$ and $b_v < T_b$, $p_v = b_v / T_b$
  If $a_v < T_a$ and $b_v \ge T_b$, $p_v = a_v / T_a$
  If $a_v < T_a$ and $b_v < T_b$, $p_v = \min(a_v / T_a, b_v / T_b)$
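The estimator above can be written compactly, since the four cases collapse to min(1, a_v/T_a, b_v/T_b). The following is a hedged sketch: the function names and the representation of each sample as a dictionary from value to frequency are assumptions made for illustration.

```python
def inclusion_probability(a_v, b_v, t_a, t_b):
    """Probability that a value is selected into both samples.

    Equivalent to the four cases on the slide: each ratio is capped at 1.
    """
    return min(1.0, a_v / t_a, b_v / t_b)

def estimate_join_size(sample_a, sample_b, t_a, t_b):
    """Sum a_v * b_v / p_v over the values present in both samples."""
    total = 0.0
    for value in sample_a.keys() & sample_b.keys():
        a_v, b_v = sample_a[value], sample_b[value]
        total += a_v * b_v / inclusion_probability(a_v, b_v, t_a, t_b)
    return total

# Hypothetical samples: only "m" appears in both, with a_v = 2 and b_v = 1.
# With T_a = T_b = 4, p_v = min(2/4, 1/4) = 0.25, so the estimate is 2 * 1 / 0.25.
estimate = estimate_join_size({"c": 10, "m": 2}, {"m": 1, "r": 7}, 4, 4)
```

Values that appear in only one sample contribute nothing, which is why dividing by $p_v$ is needed to compensate for the values that were not sampled.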
7
Why correlate the samples?
Example: tables with 1000 values, each appearing once; 50 values common to both tables
We sample with probability 1/10
  Sample size ~ 100 for each table

Comparison               Correlated      Uncorrelated
$p_v$                    0.1             0.01
Common values sampled    ~ 4, 5 or 6     ~ 0 or 1
Join size estimate       40, 50 or 60    0 or 100
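The arithmetic behind this example can be checked with a short calculation. The sketch below assumes that each of the 50 common values is selected into both samples independently with probability p (a binomial model, which is a simplifying assumption of this illustration), and that each selected value contributes 1/p to the estimate. The function name is hypothetical.

```python
import math

def estimate_mean_and_std(common_values, p_both):
    """Mean and standard deviation of the join size estimate.

    Assumes each common value lands in both samples independently with
    probability p_both and then contributes 1 * 1 / p_both to the estimate.
    """
    contribution = 1.0 / p_both
    mean = common_values * p_both * contribution
    std = math.sqrt(common_values * p_both * (1.0 - p_both)) * contribution
    return mean, std

# Correlated samples: one shared decision per value, so p = 1/10.
correlated = estimate_mean_and_std(50, 0.1)
# Uncorrelated samples: two independent decisions, so p = 1/10 * 1/10.
uncorrelated = estimate_mean_and_std(50, 0.1 * 0.1)
```

Under these assumptions both estimates have mean 50, the true join size, but the standard deviation is about 21 with correlated samples and about 70 with uncorrelated ones.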
8
Comparison of sampling methods
Accuracy of estimates of join size, by type of values dominating the join
Methods compared: random sampling, counting samples, end-biased samples
  Frequent in both relations: Good, Very good, Perfect
  Frequent in one relation: Bad
  Infrequent in both relations: Bad, Good
9
Overview
  End-biased samples
  Theoretical comparison against other sampling-based methods
  Experimental comparison against sketches and histograms
10
Experimental methodology
Randomly generated tables with ~ 1,000,000 tuples
Explored multiple configurations
  Varied the “peakedness” of the distribution
  Varied memory budget from 204 to 659,456 words
  Varied the amount of correlation between tables
    Uncorrelated: tables generated independently
    Positively correlated: frequent values likely to be the same in both tables
    Negatively correlated: frequent values unlikely to be the same in the two tables
1,000 runs for each configuration
11
Summaries compared
  End-biased samples
  End-biased equi-depth histograms [PC84]
  Sketches [AGMS99],[DGGR02],[GGR04]
  Concise samples [GM98]
  Counting samples [GM98]
12
Comparison with histograms
13
Comparison with sketches
14
Memory comparison
15
Qualitative comparison
Advantage (columns: Sketches, End-biased samples)
  Streaming updates: X
  Simple configuration: X
  Selection on join attribute: X
16
Conclusions
End-biased samples and sketches are the best summaries for the join size estimation problem addressed in this paper
End-biased samples are compelling if
  Selections on the join attribute are required
  Summaries must be very concise
  The frequencies of join attributes in the two tables are strongly correlated
17
Questions?
Thank you!
Scripts and results for experiments available at http://www.cs.wisc.edu/~estan/ebs.tar.gz
18
Estimating the join size
19
Related work: sampling methods
  [GM98] concise samples, counting samples
  [DLT01] smart sampling
  [F90],[EKMV04] using a hash function to select values used as summary of data
20
Related work: join size estimation
  Histograms
    Multidimensional histograms
  [GG02],[GK04] Wavelets
  [AGMS99],[DGGR02],[GGR04] Sketches
21
Variance of join size estimate
No slide, point to the paper.