HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries. Xiaohui Yu, University of Toronto. Joint work with Nick Koudas (University of Toronto) and Calisto Zuzarte (IBM Toronto Lab).
Outline: Background (Motivation, Related Work); HASE (Estimator, Algorithms, Bounds); Experiments; Conclusions
Query Optimization. Execution plans differ in cost, and the difference can be huge (1 second vs. 1 hour). Which plan to choose? Query optimization: estimate the costs of the different plans and choose the plan with the least cost. Cost estimation factors: run-time environment, data properties, …
Selectivity. An important factor in costing is selectivity: the fraction of records satisfying the predicate (s). E.g., if 100 out of 10,000 records have salary > 3000, then s = 100/10000 = 0.01. Selectivity can make a big difference. [Figure: cost vs. selectivity (s). Plan 1, table scan: cost = const1. Plan 2, index scan: cost = s * const2. Marked point: s = 0.01.]
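The plan comparison on this slide can be sketched in a few lines. This is a minimal illustration, not a real optimizer cost model: the values of `CONST1` and `CONST2` are made up, since the slide only names the constants.

```python
# Hypothetical cost constants; the slide names them const1 and const2
# but gives no values.
CONST1 = 1_000.0    # table scan: cost does not depend on selectivity
CONST2 = 20_000.0   # index scan: cost grows linearly with selectivity


def selectivity(matching: int, total: int) -> float:
    """Fraction of records satisfying the predicate."""
    return matching / total


def plan_costs(s: float, const1: float = CONST1, const2: float = CONST2) -> dict:
    """Cost of each plan under the two simple cost formulas."""
    return {"table scan": const1, "index scan": s * const2}


def choose_plan(s: float, const1: float = CONST1, const2: float = CONST2) -> str:
    """Pick the plan with the least estimated cost."""
    costs = plan_costs(s, const1, const2)
    return min(costs, key=costs.get)


s = selectivity(100, 10_000)   # the salary > 3000 example
print(s, choose_plan(s))
print(0.5, choose_plan(0.5))
```

With these assumed constants the index scan wins at s = 0.01 and the table scan wins at s = 0.5, so a poor selectivity estimate can flip the choice of plan.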
Related Work. Two streams: synopsis-based and sampling-based. Synopses capture the characteristics of the data; they are obtained off-line and used on-line. E.g., histograms.
Histograms. [Figure: histogram over Salary.] Q: What is the selectivity of salary > 3000? A: (number of records shown in red) / (total number of records). Estimate = (…) / 5000 = …
Synopses: pros and cons. Pros: built offline and usable many times; minimal overhead at selectivity estimation time. Cons: difficult to capture all useful information in limited space, in particular correlation between attributes.
Sampling. Number of records in the table: 10,000. Sample size: 100. Number of sample records having age > 50 and salary > 5500: 12. Selectivity estimate = 12/100 = 0.12. True selectivity = 0.09.
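As a rough sketch of the arithmetic above: the estimate is the fraction of sample records that satisfy the whole conjunction. The sample contents below are hypothetical, constructed only so that 12 of the 100 records satisfy both predicates.

```python
def sample_selectivity(sample, predicate):
    """Fraction of sample records that satisfy the predicate."""
    if not sample:
        raise ValueError("sample must not be empty")
    hits = sum(1 for record in sample if predicate(record))
    return hits / len(sample)


# Hypothetical sample of 100 (age, salary) records; 12 satisfy both predicates.
sample = (
    [(55, 6000)] * 12    # age > 50 and salary > 5500
    + [(55, 4000)] * 30  # age > 50 only
    + [(40, 6000)] * 20  # salary > 5500 only
    + [(40, 4000)] * 38  # neither
)

estimate = sample_selectivity(sample, lambda r: r[0] > 50 and r[1] > 5500)
print(estimate)
```

Because both predicates are evaluated on the same record, the estimate reflects whatever correlation between age and salary is present in the sample.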
Sampling: pros and cons. The good: provides correlation information through the sample. The bad: cost, cost, … Accurate results require a large portion of the data to be accessed, and random access is much slower than sequential access.
Summary. Synopses: low runtime cost, but little correlation information. Sampling: captures correlation information, but high runtime cost. Take the best of both worlds? Capture correlation + reduce sampling rate.
Outline: Background (Motivation, Related Work); Our approach: HASE (Estimator, Algorithms, Bounds); Experiments; Conclusions
HASE: hybrid approach to selectivity estimation. Goal: consistent utilization of both sources of information. Benefits: 1. Correlation is captured (sampling). 2. Sample size can be significantly smaller (histograms).
Problem setting. Data: a table of size N. Query: a conjunction of predicates, $Q = P_1 \wedge P_2 \wedge P_3 \wedge \ldots$, e.g., (age > 50) ∧ (salary > 5500) ∧ (hire_date > "…") with these three conjuncts as $P_1$, $P_2$, $P_3$. Available info: the selectivities of the individual predicates (obtained from synopses), e.g., $s_1 = 0.1$, $s_2 = 0.2$, $s_3 = 0.05$; and a sample S of n records, where the inclusion probability of record j is $\pi_j$. For simple random sampling (SRS), $\pi_j = n/N$. Goal: estimate the selectivity s of the query Q.
Example. Table R with 10,000 records. Query $Q = P_1 \wedge P_2$ on two attributes. Suppose 500 records satisfy both predicates. True selectivity s = 500/10000 = 0.05.
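Histogram-based estimate. Assuming independence between attributes, the selectivity estimate is the product of the individual selectivities: $\hat{s} = s_1 \cdot s_2$. Based on the histograms, $s_1 = 0.6$ and $s_2 = 0.3$, so $\hat{s} = 0.6 \times 0.3 = 0.18$. Relative error = |0.18 - 0.05| / 0.05 = 260%.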
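The independence-based calculation can be checked with a short sketch. The function names are illustrative and not taken from the talk.

```python
import math


def independence_estimate(selectivities):
    """Selectivity of a conjunction, assuming the predicates are independent."""
    return math.prod(selectivities)


def relative_error(estimate, truth):
    """Absolute error as a fraction of the true value."""
    return abs(estimate - truth) / truth


est = independence_estimate([0.6, 0.3])   # s1 and s2 from the histograms
err = relative_error(est, 0.05)           # true selectivity is 500/10000
print(f"estimate = {est:.2f}, relative error = {err:.0%}")
```

The error comes entirely from the independence assumption: the per-attribute selectivities are taken as exact here.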
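Sampling-based estimate. Sample weight of record j: $d_j = 1/\pi_j$. Indicator variable: $I_j = 1$ if record j satisfies Q, and 0 otherwise. Selectivity estimate (Horvitz-Thompson, HT, estimator): $\hat{s} = \frac{1}{N} \sum_{j \in S} d_j I_j$. Take an SRS of size 100: $d_j = 10000/100 = 100$, and 9 sample records satisfy Q. Estimate = 9 * 100 / 10000 = 0.09. Relative error = |0.05 - 0.09| / 0.05 = 80%.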
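A minimal sketch of the weighted (HT) estimate for this example follows. The indicator list is hypothetical apart from its count of 9 matching records, and the variable names are not from the talk.

```python
def ht_selectivity(indicators, weights, table_size):
    """Weighted estimate: sum of d_j * I_j over the sample, divided by N."""
    return sum(d * i for d, i in zip(weights, indicators)) / table_size


N, n = 10_000, 100
weights = [N / n] * n                  # SRS: d_j = 1 / (n / N) = 100
indicators = [1] * 9 + [0] * (n - 9)   # 9 sample records satisfy Q

est = ht_selectivity(indicators, weights, N)
rel_err = abs(est - 0.05) / 0.05
print(est, rel_err)
```

Under SRS all weights are equal, so this reduces to the plain sample fraction 9/100; the weighted form matters once the weights are adjusted.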
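A new estimator. Known selectivities (through histograms): $s_1, s_2, \ldots$ Original weights $d_j$: $\hat{s} = \frac{1}{N} \sum_{j \in S} d_j I_j$. New weights $w_j$: (1) reproduce the known selectivities of the individual predicates; (2) are as close to $d_j$ as possible. Calibration estimator: $\hat{s}_{cal} = \frac{1}{N} \sum_{j \in S} w_j I_j$, where $I_j$ is 1 if sample record j satisfies Q and 0 otherwise.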
Consistency with known selectivities. [Table: observed frequencies from the sample, rows P1 = true / P1 = false, columns P2 = true / P2 = false.] 100 sample records from the 10,000 records in the table; $d_j = 100$; $s_1 = 0.6$.
Calibration estimator. Why do we want $w_j$ to be as close to $d_j$ as possible? The weights $d_j$ have the property of producing unbiased estimates; if $w_j$ stays close to $d_j$, estimates based on $w_j$ remain nearly unbiased. So: keep $w_j$ as close to $d_j$ as possible.
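Constrained optimization problem. Distance function D(x), with $x = w_j/d_j$. Minimize the total distance between the $w_j$ and the $d_j$, w.r.t. $w_j$ (as close to $d_j$ as possible), subject to the weighted sample reproducing each known selectivity $s_i$ (reproduce known selectivities). The constraints use an indicator for each record and predicate: does j satisfy $P_i$? Yes: 1. No: 0.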
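The formulas on this slide did not survive extraction. One common way to write such a problem, following the calibration literature in survey sampling, is sketched below. The $d_j$ factor in the objective, the symbol $x_{ji}$ for the indicator, and $m$ for the number of predicates are assumptions, not taken from the slide.

```latex
\begin{aligned}
\min_{\{w_j\}} \quad & \sum_{j \in S} d_j \, D\!\left(\frac{w_j}{d_j}\right) \\
\text{subject to} \quad & \frac{1}{N} \sum_{j \in S} w_j \, x_{ji} = s_i, \qquad i = 1, \dots, m, \\
\text{where} \quad & x_{ji} =
  \begin{cases}
    1 & \text{if record } j \text{ satisfies } P_i, \\
    0 & \text{otherwise.}
  \end{cases}
\end{aligned}
```

In this form the objective is zero when $w_j = d_j$ for all j (since D(1) = 0), and each constraint says that the reweighted sample gives exactly the known selectivity $s_i$ for predicate $P_i$.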
An algorithm based on Newton's method. Method of Lagrange multipliers: minimize the Lagrangian of the constrained problem w.r.t. $w_j$. This can be solved using Newton's method via an iterative procedure.
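A minimal sketch of such a Newton iteration for two predicates follows. It makes several assumptions: the multiplicative distance D(x) = x log x - x + 1, for which the Lagrange conditions give $w_j = d_j \exp(\lambda_1 x_{j1} + \lambda_2 x_{j2})$; only the two selectivity constraints, with no constraint on the total weight; and made-up sample cell counts. It is not the authors' implementation.

```python
import math


def calibrate_multiplicative(x, d, s, table_size, tol=1e-12, max_iter=50):
    """Newton's method on the multipliers (lam1, lam2) for two predicates.

    x: list of (x1, x2), x_i = 1 if the sample record satisfies P_i, else 0
    d: original sample weights d_j
    s: known selectivities (s1, s2)
    Returns calibrated weights w_j = d_j * exp(lam1 * x1 + lam2 * x2).
    """
    lam1 = lam2 = 0.0  # lam = 0 corresponds to w_j = d_j
    for _ in range(max_iter):
        w = [dj * math.exp(lam1 * a + lam2 * b) for (a, b), dj in zip(x, d)]
        # Constraint residuals, in selectivity units.
        g1 = sum(wj * a for wj, (a, b) in zip(w, x)) / table_size - s[0]
        g2 = sum(wj * b for wj, (a, b) in zip(w, x)) / table_size - s[1]
        if max(abs(g1), abs(g2)) < tol:
            return w
        # Jacobian of the residuals with respect to (lam1, lam2).
        j11 = sum(wj * a * a for wj, (a, b) in zip(w, x)) / table_size
        j12 = sum(wj * a * b for wj, (a, b) in zip(w, x)) / table_size
        j22 = sum(wj * b * b for wj, (a, b) in zip(w, x)) / table_size
        det = j11 * j22 - j12 * j12  # assumed nonzero for this sketch
        # Newton step: solve the 2x2 system J * step = g by Cramer's rule.
        lam1 -= (g1 * j22 - g2 * j12) / det
        lam2 -= (g2 * j11 - g1 * j12) / det
    raise RuntimeError("Newton iteration did not converge")


# Hypothetical sample of 100 records, given as counts per (P1, P2) outcome.
N = 10_000
cells = {(1, 1): 9, (1, 0): 56, (0, 1): 24, (0, 0): 11}
x = [cell for cell, count in cells.items() for _ in range(count)]
d = [N / len(x)] * len(x)

w = calibrate_multiplicative(x, d, (0.6, 0.3), N)
estimate = sum(wj * a * b for wj, (a, b) in zip(w, x)) / N
print(estimate)
```

In this made-up sample both predicates are over-represented (65% and 33% against the known 60% and 30%), so calibration shrinks the weights of matching records and pulls the estimate below the unweighted 0.09.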
An alternative algorithm
Example. [Table: observed frequencies from the sample, rows P1 = true / P1 = false, columns P2 = true / P2 = false.]
Distance measures. Requirements on the distance function: (1) D is positive and strictly convex; (2) D(1) = D'(1) = 0; (3) D''(1) = 1. Linear function: only one iteration required (fast), but $w_j < 0$ is possible, which can give negative estimates. Multiplicative function: converges after a few iterations (typically two); $w_j > 0$ always.
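The one-iteration property of the linear case can be illustrated with a small sketch. It assumes the quadratic distance D(x) = (x - 1)^2 / 2, for which $w_j = d_j (1 + \lambda_1 x_{j1} + \lambda_2 x_{j2})$ and the constraints become a linear system in the multipliers. Both samples below are hypothetical, and the function name is illustrative.

```python
def calibrate_linear(x, d, s, table_size):
    """One-step calibration for two predicates under the linear distance.

    x: list of (x1, x2) indicators; d: original weights; s: (s1, s2).
    Returns w_j = d_j * (1 + lam1 * x1 + lam2 * x2), which may be negative.
    """
    j11 = sum(dj * a * a for dj, (a, b) in zip(d, x)) / table_size
    j12 = sum(dj * a * b for dj, (a, b) in zip(d, x)) / table_size
    j22 = sum(dj * b * b for dj, (a, b) in zip(d, x)) / table_size
    # Gap between the known selectivities and the unadjusted sample estimates.
    r1 = s[0] - sum(dj * a for dj, (a, b) in zip(d, x)) / table_size
    r2 = s[1] - sum(dj * b for dj, (a, b) in zip(d, x)) / table_size
    det = j11 * j22 - j12 * j12  # assumed nonzero for this sketch
    lam1 = (r1 * j22 - r2 * j12) / det
    lam2 = (r2 * j11 - r1 * j12) / det
    return [dj * (1 + lam1 * a + lam2 * b) for dj, (a, b) in zip(d, x)]


def expand(cells):
    """Turn counts per (P1, P2) outcome into one indicator pair per record."""
    return [cell for cell, count in cells.items() for _ in range(count)]


N = 10_000
# Hypothetical sample close to the known selectivities: weights stay positive.
x = expand({(1, 1): 9, (1, 0): 56, (0, 1): 24, (0, 0): 11})
d = [N / len(x)] * len(x)
w = calibrate_linear(x, d, (0.6, 0.3), N)
estimate = sum(wj * a * b for wj, (a, b) in zip(w, x)) / N

# Hypothetical sample far from the known selectivities: a weight goes negative.
x_far = expand({(1, 1): 40, (1, 0): 40, (0, 1): 10, (0, 0): 10})
w_far = calibrate_linear(x_far, d, (0.05, 0.05), N)
print(estimate, min(w), min(w_far))
```

The second sample shows the drawback named on the slide: when the sample disagrees strongly with the known selectivities, the linear adjustment can push weights below zero, which the multiplicative form avoids by construction.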
Error bounds. Probabilistic bounds. $\pi_{jl}$ = Pr(both j and l are in the sample).
Experiments. Synthetic data: skew via Zipfian distribution (z = 0, 1, 2, 3); correlation: correlation coefficient between attributes in [0, 1]. Real data: Census-Income data from the UCI KDD Archive (population surveys by the US Census Bureau); ~200,000 records, 40 attributes. Queries: range queries (attribute <= constant) and equality queries (attribute = constant).
Effect of correlation
Effect of data skew
Effect of sample rate
Effect of number of attributes
Conclusions. Selectivity estimation: HASE combines synopsis-based estimation and sampling-based estimation. The calibrated estimator; algorithms; probabilistic bounds on errors; experimental results. Benefits: 1. Capturing correlation (sampling). 2. Sample size can be significantly smaller (histograms).