HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries Xiaohui Yu University of Toronto Joint work with Nick Koudas.


1 HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries Xiaohui Yu University of Toronto xhyu@cs.toronto.edu Joint work with Nick Koudas (University of Toronto) and Calisto Zuzarte (IBM Toronto Lab)

2 Outline
- Background
  - Motivation
  - Related Work
- HASE
  - Estimator
  - Algorithms
  - Bounds
  - Experiments
- Conclusions

3 Query Optimization
- Execution plans differ in cost
  - The difference can be huge (1 sec vs. 1 hour)
  - Which plan to choose?
- Query optimization
  - Estimate the costs of different plans
  - Choose the plan with the least cost
- Cost estimation
  - Factors: run-time environment, data properties, …

4 Selectivity
- Important factor in costing: selectivity (s)
  - Fraction of records satisfying the predicate
  - E.g., 100 out of 10,000 records have salary > 3000: s = 100/10000 = 0.01
- Selectivity can make a big difference
  - Plan 1: Table scan, cost = const1
  - Plan 2: Index scan, cost = s * const2
  - [Figure: cost vs. selectivity (s) for the two plans, with s = 0.01 marked]
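A minimal sketch of why the selectivity drives the plan choice, assuming the two cost models on this slide. The function name and the constants 100 and 1000 are hypothetical and chosen only for illustration; real optimizers use far richer cost models.

```python
def cheaper_plan(s, const1, const2):
    """Return the cheaper plan under the slide's two cost models.

    const1: cost of a full table scan (independent of selectivity).
    const2: index-scan cost factor, so the index scan costs s * const2.
    """
    table_scan_cost = const1
    index_scan_cost = s * const2
    return "index scan" if index_scan_cost < table_scan_cost else "table scan"


# Hypothetical constants: the two plans break even at s = const1 / const2 = 0.1.
CONST1, CONST2 = 100.0, 1000.0
plan_low = cheaper_plan(0.01, CONST1, CONST2)   # selective predicate
plan_high = cheaper_plan(0.5, CONST1, CONST2)   # unselective predicate
```

With these made-up constants, a misestimated selectivity on the wrong side of 0.1 flips the optimizer to the more expensive plan.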

5 Related Work
- Two streams
  - Synopsis-based
  - Sampling-based
- Synopses
  - Capture the characteristics of the data
  - Obtained off-line, used on-line
  - E.g., histograms

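6 Histograms
- Q: Selectivity of salary > 3000?
- A: # of records in red / total # of records
- Estimate = (500 + 800 + 1700) / 5000 = 0.6
- [Figure: histogram on Salary with bucket boundaries 2500, 3500, 5000, 6000 and bucket counts 1000, 1500, 800, 1700; the query point 3000 is marked and the qualifying region is shaded red. Next to it, a table with a Salary column holding the sample values 3200.00, 2500.00, 4000.00, 6000.00.]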
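The arithmetic on this slide can be sketched as follows, assuming values are spread uniformly inside each bucket. The assignment of counts to buckets is inferred from the slide's sum (500 is half of the 1000-record bucket that contains 3000), and the lower edge of the first bucket is not shown on the slide, so the value 1000 used below is purely hypothetical. The function name is also an illustration, not part of the original.

```python
def histogram_selectivity_gt(boundaries, counts, threshold):
    """Estimate the selectivity of `attribute > threshold` from a histogram.

    boundaries: bucket edges, len(counts) + 1 values in increasing order.
    counts: number of records in each bucket.
    Assumes values are uniformly distributed within each bucket.
    """
    total = sum(counts)
    qualifying = 0.0
    for lo, hi, count in zip(boundaries, boundaries[1:], counts):
        if threshold <= lo:
            qualifying += count  # whole bucket lies above the threshold
        elif threshold < hi:
            # threshold falls inside the bucket: take the upper fraction
            qualifying += count * (hi - threshold) / (hi - lo)
    return qualifying / total


# The first edge (1000) is a hypothetical placeholder; it does not affect the result.
boundaries = [1000, 2500, 3500, 5000, 6000]
counts = [1500, 1000, 800, 1700]
estimate = histogram_selectivity_gt(boundaries, counts, 3000)
```

Under these assumptions the function reproduces the slide's (500 + 800 + 1700) / 5000.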

7 Synopses: pros and cons
- Pros:
  - Built offline; can be used many times
  - Minimal overhead at selectivity estimation time
- Cons:
  - Difficult to capture all useful information in limited space
  - Correlation between attributes

8 Sampling
- Number of records in the table: 10,000
- Sample size: 100
- Number of sampled records having age > 50 and salary > 5500: 12
- Selectivity estimate = 12/100 = 0.12
- True selectivity = 0.09

9 Sampling: pros and cons
- The good:
  - Provides correlation info through the sample
- The bad:
  - Cost, cost …
  - Accurate results require a large portion of the data to be accessed
  - Random access is much slower than sequential access

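10 Summary
                          Synopses              Sampling
Runtime cost              low                   high
Correlation information   hard to capture       captured
(The cell marks were lost in the transcript; the entries above restate slides 7 and 9.)
- Take the best of both worlds?
  - Capture correlation + reduce sampling rate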

11 Outline
- Background
  - Motivation
  - Related Work
- Our approach: HASE
  - Estimator
  - Algorithms
  - Bounds
  - Experiments
- Conclusions

12 HASE
- Hybrid approach to selectivity estimation
- Goal: consistent utilization of both sources of information
- Benefits:
  1. Correlation is captured (sampling)
  2. Sample size can be significantly smaller (histograms)
- [Figure: Salary histogram with bucket boundaries 2500, 3500, 5000, 6000 and counts 1000, 1500, 800, 1700]

13 Problem setting
- Data: table of size N
- Query: conjuncts of predicates, Q = P1 ^ P2 ^ P3 ^ …
  - E.g., (age > 50) ^ (salary > 5500) ^ (hire_date > "01-01-05"), labelled P1, P2, P3
- Available info:
  - Selectivities of individual predicates (obtained from synopses), e.g., s1 = 0.1, s2 = 0.2, s3 = 0.05
  - A sample S of n records
    - Inclusion probability of record j: π_j
    - For simple random sampling (SRS), π_j = n/N
- Goal: estimate the selectivity s of the query Q

14 Example
- Table R with 10,000 records
- Query Q = P1 ^ P2 on two attributes
- Suppose 500 records satisfy both predicates
- True selectivity s = 500/10000 = 0.05

15 Histogram-based estimate
- Assuming independence between attributes
- Based on the histograms, s1 = 0.6, s2 = 0.3
- Selectivity estimate = s1 * s2 = 0.6 * 0.3 = 0.18
- Relative error = |0.18 - 0.05| / 0.05 = 260%
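The independence-based estimate and its relative error can be checked with a short sketch. The function names are illustrative only; the numbers are the ones on slides 14 and 15.

```python
def independence_estimate(selectivities):
    """Multiply per-predicate selectivities, i.e. assume the predicates are independent."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result


def relative_error(estimate, truth):
    """Absolute deviation from the true selectivity, relative to the truth."""
    return abs(estimate - truth) / truth


hist_estimate = independence_estimate([0.6, 0.3])      # s1 * s2
hist_error = relative_error(hist_estimate, 0.05)       # true selectivity from slide 14
```

The error of 2.6 (260%) comes entirely from the independence assumption: the per-predicate selectivities themselves are taken as exact here.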

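16 Sampling-based estimate
- Sample weight of j: d_j = 1/π_j
- Indicator variable: 1 if record j satisfies Q, 0 otherwise
- Selectivity estimate (Horvitz-Thompson, HT, estimator): (1/N) * sum of d_j over the sampled records j that satisfy Q
- Take a SRS of size 100 → d_j = 10000/100 = 100
- 9 records satisfy Q
- Estimate = 9*100/10000 = 0.09
- Relative error = |0.05 - 0.09| / 0.05 = 80%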
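A minimal sketch of the Horvitz-Thompson estimate on this slide, assuming a simple random sample so that every inclusion probability equals n/N. The function and variable names are hypothetical; the sample is synthesized to match the slide's counts (9 of 100 sampled records satisfy Q).

```python
def ht_selectivity(satisfies, inclusion_probs, table_size):
    """Horvitz-Thompson selectivity estimate.

    satisfies: one boolean per sampled record (does it satisfy the query Q?).
    inclusion_probs: inclusion probability pi_j of each sampled record.
    table_size: number of records N in the table.
    """
    weighted_count = 0.0
    for hit, pi in zip(satisfies, inclusion_probs):
        if hit:
            weighted_count += 1.0 / pi  # sample weight d_j = 1 / pi_j
    return weighted_count / table_size


N, n = 10_000, 100
satisfies = [True] * 9 + [False] * 91      # 9 sampled records satisfy Q
inclusion_probs = [n / N] * n              # SRS: pi_j = n / N for every record
ht_estimate = ht_selectivity(satisfies, inclusion_probs, N)
ht_error = abs(ht_estimate - 0.05) / 0.05  # true selectivity from slide 14
```

Under SRS this reduces to the plain sample fraction 9/100; the weights only matter when inclusion probabilities differ between records.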

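17 A new estimator
- Known selectivities (through histograms): s1, s2, …
- Original weights: d_j
- New weights: w_j, chosen to
  (1) reproduce the known selectivities of individual predicates
  (2) be as close to d_j as possible
- Calibration estimator: the sampling-based estimate computed with the new weights w_j in place of d_j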

18 Consistency with known selectivities
Observed frequencies from sample:
            P2=true   P2=false   Total
P1=true     0.09      0.56       0.65
P1=false    0.24      0.11       0.35
Total       0.33      0.67
- 100 sample records from 10,000 records in the table → d_j = 100
- s1 = 0.6

19 Calibration estimator
- Why do we want w_j to be as close to d_j as possible?
  - The d_j have the property of producing unbiased estimates
  - Keeping w_j as close to d_j as possible lets the w_j remain nearly unbiased

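20 Constrained optimization problem
- Distance function D(x), where x = w_j/d_j
- Minimize the total distance w.r.t. w_j (as close to d_j as possible)
- Subject to reproducing the known selectivities, using an indicator per predicate: does j satisfy Pi? Yes: 1, No: 0
- [The formulas on this slide were not preserved in the transcript]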
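The formulas on this slide did not survive extraction. A plausible reconstruction, following the usual form of calibration estimators in survey sampling, is given below. The symbols x_{ji} (indicator that record j satisfies P_i) and m (number of predicates) are notation introduced here, and weighting each distance term by d_j is the common convention rather than something the transcript confirms.

```latex
\min_{w_j,\; j \in S} \;\; \sum_{j \in S} d_j \, D\!\left(\frac{w_j}{d_j}\right)
\qquad \text{subject to} \qquad
\frac{1}{N} \sum_{j \in S} w_j \, x_{ji} = s_i , \quad i = 1, \dots, m ,
\qquad
x_{ji} =
\begin{cases}
1 & \text{if record } j \text{ satisfies } P_i \\
0 & \text{otherwise.}
\end{cases}
```

The objective keeps each w_j close to d_j, and the constraints force the reweighted sample to reproduce each known selectivity s_i.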

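21 An algorithm based on Newton's method
- Method of Lagrange multipliers
- Minimize the Lagrangian of the constrained problem [formula not preserved in the transcript]
- Can be solved using Newton's method via an iterative procedure [diagram of the iteration, which produces the weights w_j, not preserved]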
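Since the slide's formulas are missing, the following is only a sketch of how such a Newton iteration could look, not the author's algorithm. It assumes the multiplicative distance D(x) = x log x - x + 1 (a standard choice in the calibration literature), for which the calibrated weights take the form w_j = d_j * exp(x_j · λ), and it assumes the only constraints are the two predicate selectivities. The data are the sample from slide 18 with s1 = 0.6 and s2 = 0.3 from slide 15; all names are hypothetical.

```python
import math


def calibrate_multiplicative(records, design_weights, targets, table_size,
                             tol=1e-12, max_iter=50):
    """Newton's method on the two Lagrange multipliers (two predicates only).

    records: (x1, x2) pairs of 0/1 indicators, one pair per sampled record.
    Returns weights w_j = d_j * exp(lam1 * x1 + lam2 * x2).
    """
    lam = [0.0, 0.0]
    for _ in range(max_iter):
        w = [d * math.exp(lam[0] * x[0] + lam[1] * x[1])
             for x, d in zip(records, design_weights)]
        # Constraint residuals: weighted selectivity minus known selectivity.
        g = [sum(wj * x[i] for wj, x in zip(w, records)) / table_size - targets[i]
             for i in range(2)]
        if max(abs(g[0]), abs(g[1])) < tol:
            break
        # Jacobian of the residuals with respect to the multipliers.
        J = [[sum(wj * x[i] * x[k] for wj, x in zip(w, records)) / table_size
              for k in range(2)] for i in range(2)]
        det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
        # Newton step: solve J * step = g with Cramer's rule.
        step0 = (J[1][1] * g[0] - J[0][1] * g[1]) / det
        step1 = (J[0][0] * g[1] - J[1][0] * g[0]) / det
        lam = [lam[0] - step0, lam[1] - step1]
    return [d * math.exp(lam[0] * x[0] + lam[1] * x[1])
            for x, d in zip(records, design_weights)]


# Sample from slide 18: 9, 56, 24, 11 records in the four P1/P2 cells, d_j = 100.
records = [(1, 1)] * 9 + [(1, 0)] * 56 + [(0, 1)] * 24 + [(0, 0)] * 11
design_weights = [100.0] * 100
weights = calibrate_multiplicative(records, design_weights, (0.6, 0.3), 10_000)
newton_estimate = sum(w for w, x in zip(weights, records) if x == (1, 1)) / 10_000
```

Under these assumptions the calibrated estimate lands between the true selectivity (0.05) and the plain sampling estimate (0.09), while the reweighted sample reproduces s1 and s2.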

22 An alternative algorithm

23 Example
Observed frequencies from sample:
            P2=true   P2=false
P1=true     0.09      0.56
P1=false    0.24      0.11
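The content of the "alternative algorithm" slide did not survive in the transcript, so the following is a guess at a scheme in that spirit, not a reconstruction. One simple procedure that fits multiplicative reweighting is raking (iterative proportional fitting): repeatedly rescale the weight of every record satisfying P_i so that the weighted fraction matches s_i. The sketch works directly on the four cell frequencies of this example and uses s1 = 0.6, s2 = 0.3 from slide 15; the function name is hypothetical.

```python
def rake(cells, targets, max_sweeps=100, tol=1e-12):
    """Rescale cell frequencies until each predicate's total matches its target.

    cells: {(p1_true, p2_true): weighted fraction of the table in that cell}.
    targets: known selectivities (s1, s2) of the two predicates.
    """
    cells = dict(cells)  # work on a copy
    for _ in range(max_sweeps):
        worst = 0.0
        for i, target in enumerate(targets):
            current = sum(v for key, v in cells.items() if key[i])
            worst = max(worst, abs(current - target))
            factor = target / current
            for key in cells:
                if key[i]:           # only cells where predicate i holds
                    cells[key] *= factor
        if worst < tol:
            break
    return cells


observed = {
    (True, True): 0.09, (True, False): 0.56,
    (False, True): 0.24, (False, False): 0.11,
}
calibrated = rake(observed, targets=(0.6, 0.3))
raking_estimate = calibrated[(True, True)]   # calibrated selectivity of P1 ^ P2
```

Because only cells where a predicate holds are rescaled, the cell where both predicates are false keeps its observed frequency; forcing the four cells to sum to 1 would be an extra constraint not stated on the slides.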

24 Distance measures
- Requirements on the distance function
  (1) D is positive and strictly convex
  (2) D(1) = D'(1) = 0
  (3) D''(1) = 1
- Linear function
  - Only one iteration required → fast!
  - w_j < 0 possible → negative estimates
- Multiplicative function
  - Converges after a few iterations (typically two)
  - w_j > 0 always
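To illustrate why the linear case needs only one step, here is a sketch assuming the "linear function" is the standard quadratic distance D(x) = (x - 1)^2 / 2, which satisfies the three requirements above. Under that assumption the weights have the closed form w_j = d_j * (1 + x_j · λ), and λ solves a single linear system. The data are the slide 18 sample with s1 = 0.6 and s2 = 0.3; names are hypothetical.

```python
def calibrate_linear(records, design_weights, targets, table_size):
    """Closed-form calibration for the quadratic distance (two predicates only).

    Solves T * lam = r, where T = sum_j d_j x_j x_j^T / N and
    r = targets - sum_j d_j x_j / N, then returns w_j = d_j * (1 + x_j . lam).
    """
    pairs = list(zip(records, design_weights))
    T = [[sum(d * x[i] * x[k] for x, d in pairs) / table_size
          for k in range(2)] for i in range(2)]
    r = [targets[i] - sum(d * x[i] for x, d in pairs) / table_size
         for i in range(2)]
    det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
    lam0 = (T[1][1] * r[0] - T[0][1] * r[1]) / det   # Cramer's rule
    lam1 = (T[0][0] * r[1] - T[1][0] * r[0]) / det
    # Nothing prevents 1 + x_j . lam from being negative in general.
    return [d * (1.0 + lam0 * x[0] + lam1 * x[1]) for x, d in pairs]


sample = [(1, 1)] * 9 + [(1, 0)] * 56 + [(0, 1)] * 24 + [(0, 0)] * 11
base_weights = [100.0] * 100
linear_weights = calibrate_linear(sample, base_weights, (0.6, 0.3), 10_000)
linear_estimate = sum(w for w, x in zip(linear_weights, sample) if x == (1, 1)) / 10_000
```

In this particular example all weights stay positive; the slide's warning about negative weights applies when the correction 1 + x_j · λ drops below zero, which the multiplicative form exp(x_j · λ) rules out by construction.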

25 Error bounds
- Probabilistic bounds [formula not preserved in the transcript]
- π_jl = Pr(both j and l are in the sample)

26 Experiments
- Synthetic data
  - Skew: Zipfian distribution (z = 0, 1, 2, 3)
  - Correlation: correlation coefficient between attributes in [0, 1]
- Real data
  - Census-Income data from the UCI KDD Archive
  - Population surveys by the US Census Bureau
  - ~200,000 records, 40 attributes
- Queries
  - Range queries: attribute <= constant
  - Equality queries: attribute = constant

27 Effect of correlation

28 Effect of data skew

29 Effect of sample rate

30 Effect of number of attributes

31 Conclusions
- [Diagram: selectivity estimation, with synopsis-based estimation and sampling-based estimation combined into HASE]
- The calibrated estimator
- Algorithms
- Probabilistic bounds on errors
- Experimental results
- Benefits:
  1. Capturing correlation (sampling)
  2. Sample size can be significantly smaller (histograms)

