Towards Estimating the Number of Distinct Value Combinations for a Set of Attributes
  Xiaohui Yu (1), Calisto Zuzarte (2), Ken Sevcik (1)
  (1) University of Toronto
  (2) IBM Toronto Lab
November 3, 2005, CIKM
Distinct value combinations
  Country   City       Hotel Name
  Germany   Bremen     Hilton
  Germany   Bremen     Best Western
  Germany   Frankfurt  InterCity
  Canada    Toronto    Four Seasons
  Canada    Toronto    Intercontinental
  (Country, City) has 3 distinct value combinations: COLSCARD (COLumn Set CARDinality) = 3
  The problem: estimating COLSCARD for a given set of attributes
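For reference, the exact COLSCARD of the example table can be computed by collecting the projected tuples into a set. This is only a sketch of the quantity being estimated; the names `ROWS` and `colscard` are illustrative and not from the talk, and the point of the talk is precisely to avoid this kind of full scan.

```python
# Rows of the example table: (Country, City, Hotel Name).
ROWS = [
    ("Germany", "Bremen", "Hilton"),
    ("Germany", "Bremen", "Best Western"),
    ("Germany", "Frankfurt", "InterCity"),
    ("Canada", "Toronto", "Four Seasons"),
    ("Canada", "Toronto", "Intercontinental"),
]


def colscard(rows, columns):
    """Exact number of distinct value combinations over the given column indexes."""
    return len({tuple(row[c] for c in columns) for row in rows})


print(colscard(ROWS, (0, 1)))     # (Country, City) -> 3
print(colscard(ROWS, (0, 1, 2)))  # all three columns -> 5
```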
Motivation
  Cardinality estimation for query optimization, e.g.,
    Estimating the size of [...]
    Estimating the size of the aggregation
  Approximate query answering, e.g., COUNT queries
  Example query:
    SELECT sales_date, sales_person, SUM(sales_quantity) AS unit_sold
    FROM sales
    GROUP BY sales_date, sales_person
Roadmap
  Related work
  Estimation with known marginal distributions
    Upper/lower bounds
    An estimator
  Estimation with histograms
  Experimental results
  Conclusions
Related work
  Previous work has focused on the single-attribute case: [HÖT88], [HÖT89], [HNSS'95], [HS'98], [CCMN'00]
    A sampling approach is used
    Estimation through sampling is difficult [CCMN'00]
    No existing statistical information is exploited
Our solution
  Considering multiple attributes
  Utilizing existing statistics on individual attributes
    Readily available in most database systems
    Does not require access to the data
  Granularity of statistics
    Exact marginal frequency distributions
    Approximate distributions: histograms, etc.
Estimation with known marginals
  For each attribute A_i: number of distinct values, frequency vector
  Country   City       Hotel Name
  Germany   Bremen     Hilton
  Germany   Bremen     Best Western
  Germany   Frankfurt  InterCity
  Canada    Toronto    Four Seasons
  Canada    Toronto    Intercontinental
The naïve estimator
  COLSCARD = $\prod_i d_i$ = number of possible value combinations
    d_i: the number of distinct values in attribute A_i
  Sanity bound: COLSCARD cannot be greater than the table size
  The problem: some value combinations with low occurrence probabilities may not appear in the table!
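A minimal sketch of the naïve estimator with the sanity bound applied, assuming the per-attribute distinct counts d_i and the table size are already known. The function name `naive_colscard` is illustrative only.

```python
import math


def naive_colscard(distinct_counts, table_size):
    """Product of the per-attribute distinct counts, capped at the table size."""
    possible_combinations = math.prod(distinct_counts)
    return min(possible_combinations, table_size)


# Three attributes with 3, 2 and 2 distinct values in a large table.
print(naive_colscard([3, 2, 2], 1000))  # 12
# Two attributes with 2 and 3 distinct values in a 5-row table: capped at 5.
print(naive_colscard([2, 3], 5))        # 5
```

The second call mirrors the (Country, City) columns of the hotel table, where the capped naïve value of 5 still overshoots the true COLSCARD of 3.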
Upper/lower bounds
  Trivial bounds
    Upper bound: $\prod_i d_i$ (the naïve estimator)
    Lower bound: $\max_i d_i$
  Tighter bounds?
    In the case of two attributes, tighter bounds are available.
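Tighter bounds
  Example with two attributes: A_1 (values a, b, c) and A_2 (values d, e, f)
  [Figure: value/frequency tables for A_1 and A_2, with the annotation [2, 3]; N and the frequencies are not recoverable]
  Naïve bounds: [3, 9]
  Tighter bounds shown: lower bound [expression not recoverable], upper bound = 5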
Expected number of combinations
  Assumptions
    1. The data distributions of individual columns are independent
    2. The occurrence of each combination in the table is independent
  Each element of f represents the frequency of a specific value combination
  An estimate of the probability of occurrence
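Estimator
  Let $p_i$ denote the estimated probability of occurrence of the i-th combination
  The probability of the i-th combination not appearing in a particular tuple is $1 - p_i$
  The probability of the i-th combination not appearing in the table (of size N) is $(1 - p_i)^N$
  The expected number of value combinations is $\sum_i \left[ 1 - (1 - p_i)^N \right]$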
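The estimator can be sketched in a few lines under the two independence assumptions. This is an illustration, not the authors' implementation: it assumes each marginal is given as a list of relative frequencies summing to 1, takes the probability of a combination to be the product of its marginal probabilities, and uses the hypothetical name `expected_colscard`.

```python
import itertools
import math


def expected_colscard(marginals, table_size):
    """Expected number of distinct combinations in a table of `table_size` tuples.

    `marginals` holds one list of relative frequencies per attribute
    (assumed format; each list should sum to 1).
    """
    expected = 0.0
    for combination in itertools.product(*marginals):
        # Independence across attributes: multiply the marginal probabilities.
        p = math.prod(combination)
        # Probability that this combination shows up in at least one tuple.
        expected += 1.0 - (1.0 - p) ** table_size
    return expected


uniform_pair = [[0.5, 0.5], [0.5, 0.5]]
print(expected_colscard(uniform_pair, 1))     # one tuple holds one combination
print(expected_colscard(uniform_pair, 1000))  # approaches the 4 possible combinations
```

The loop enumerates every possible combination, so its cost grows with the product of the distinct counts; it is meant to show the formula rather than to scale.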
Example revisited
  Estimate the COLSCARD for attribute set (A_1, A_2, A_3), given [...]
  New estimate: 5.94
  Naïve estimate: 3 × 2 × 2 = 12
Estimation with histograms
  Histograms exist on individual attributes
  Two classes of histograms
    Partition-based
    End-biased
  Marginals can be (approximately) reconstructed from histograms
  Optimal histograms in each class?
Optimal histograms
  Minimizing the error incurred by histograms
    ERR = $|EST_{hist} - EST_{exact}|$
  Partition-based histograms
    A dynamic programming algorithm similar to that for V-optimal histogram construction [Jagadish et al. 98] can be used
Optimal end-biased histograms
  An end-biased histogram with B buckets stores
    The exact frequencies of B-1 attribute values
    The average frequency of the remaining values
  Which B-1 values to store exactly?
    The most widely used end-biased histograms store the frequencies of the most frequent values
    Not always optimal for COLSCARD estimation!
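To make the structure concrete, here is a small sketch of how a marginal could be approximately reconstructed from an end-biased histogram: the chosen values keep their exact frequencies and every other value gets the average of the rest. The function name and the index-based interface are assumptions made for illustration.

```python
def end_biased_reconstruct(frequencies, exact_indexes):
    """Approximate a frequency vector as an end-biased histogram would store it.

    Values at `exact_indexes` keep their exact frequency (B-1 buckets);
    all remaining values share one bucket holding their average frequency.
    """
    exact = set(exact_indexes)
    rest = [f for i, f in enumerate(frequencies) if i not in exact]
    average = sum(rest) / len(rest) if rest else 0.0
    return [f if i in exact else average for i, f in enumerate(frequencies)]


# B = 2 buckets: store the most frequent value exactly, average the other three.
approx = end_biased_reconstruct([6, 2, 1, 1], exact_indexes=[0])
print(approx)
```

The reconstruction preserves the total frequency, so the table size implied by the histogram is unchanged.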
Example
  Attributes (A_1, A_2), N = 10
  Choose 1 frequency to store exactly
  [Error table: error by index of the frequency stored; values not recoverable]
Optimal end-biased histograms
  Exhaustive search takes time proportional to $\binom{d_j}{B-1}$, the number of ways to choose the B-1 values stored exactly
  We prove that the optimal choice is one of the following
    Most frequent values
    Least frequent values
    A combination of most frequent and least frequent values
  Only need to search both ends
    Cost is linear in B, independent of d_j!
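The "search both ends" idea can be sketched as follows: for each k from 0 to B-1, take the k most frequent values together with the B-1-k least frequent ones, which gives only B candidates to evaluate. This is an illustrative sketch; the function names are hypothetical, and `error_fn` stands in for whatever error measure is used (for instance the ERR defined earlier), which the caller must supply.

```python
def both_ends_candidates(frequencies, num_buckets):
    """Yield index sets of k most frequent plus (B-1-k) least frequent values."""
    exact_count = num_buckets - 1
    if exact_count > len(frequencies):
        raise ValueError("more exact buckets than distinct values")
    # Value indexes ordered from most frequent to least frequent.
    order = sorted(range(len(frequencies)),
                   key=lambda i: frequencies[i], reverse=True)
    for k in range(exact_count + 1):
        top = order[:k]
        bottom = order[len(order) - (exact_count - k):]
        yield sorted(top + bottom)


def best_end_biased(frequencies, num_buckets, error_fn):
    """Pick the candidate index set with the smallest caller-supplied error."""
    return min(both_ends_candidates(frequencies, num_buckets), key=error_fn)


# B = 3 buckets over 5 values: only 3 candidates instead of C(5, 2) = 10.
print(list(both_ends_candidates([5, 1, 3, 2, 4], 3)))
# [[1, 3], [0, 1], [0, 4]]
```

The number of candidates depends only on B, which matches the claim that the search cost is independent of the number of distinct values.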
Experiments – Data sets
  Synthetic data
    Skew: Zipfian parameter z = 0 (uniform) to 4 (highly skewed)
    Number of tuples: 10K to 1M
  Real data
    Cover Type: 581,012 tuples, 10 attributes
    Census Income: 32,561 tuples, 14 attributes
  Error measure: ratio error
    ERR = max{true/est - 1, est/true - 1}
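The ratio error is symmetric in over- and underestimation, which a short sketch makes easy to check. The function name `ratio_error` is illustrative, and the guard against non-positive inputs is an added assumption, not something stated on the slide.

```python
def ratio_error(true_value, estimate):
    """max(true/est - 1, est/true - 1); 0 means a perfect estimate."""
    if true_value <= 0 or estimate <= 0:
        raise ValueError("both values must be positive")
    return max(true_value / estimate - 1, estimate / true_value - 1)


print(ratio_error(12, 6))  # underestimate by a factor of 2 -> 1.0
print(ratio_error(6, 12))  # overestimate by a factor of 2  -> 1.0
print(ratio_error(5, 5))   # exact estimate                 -> 0.0
```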
Effect of data skew
  [Figure: N = 100K, d_i = 1K]
Effect of number of tuples
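Results on real data
  [Figure, two panels: 45 pairs; 91 pairs]
  45 and 91 equal the number of attribute pairs among 10 attributes (Cover Type) and 14 attributes (Census Income), respectively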
Accuracy of end-biased histograms
  Results on the "capital-gain" attribute of the Census Income data set
Conclusions
  Utilizing existing knowledge maintained in database systems
  Proposed upper/lower bounds as well as an estimator
  Considered two cases
    Exact marginal frequencies
    Histograms: optimal histograms
  Experimental results show the effectiveness of the proposed method