HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries. Xiaohui Yu, University of Toronto (xhyu@cs.toronto.edu). Joint work with Nick Koudas (University of Toronto) and Calisto Zuzarte (IBM Toronto Lab).
Outline: Background; Motivation; Related Work; HASE (Estimator, Algorithms, Bounds); Experiments; Conclusions.
Query Optimization. Execution plans differ in cost, and the difference can be huge (1 sec vs. 1 hour). Which plan to choose? Query optimization estimates the costs of different plans and chooses the plan with the least cost. Cost estimation depends on many factors: the run-time environment, data properties, …
4
Selectivity Important factor in costing: selectivity Fraction of records satisfying the predicate (s) E.g., 100 out of 10,000 records having salary > 3000 s = 100/10000 = 0.01 Selectivity can make a big difference Selectivity (s) cost Plan 1: Table scan Plan 2: Index scan Cost = s * const 2 Cost = const 1 s = 0.01
Related Work. Two streams: synopsis-based and sampling-based. Synopses capture the characteristics of the data; they are obtained off-line and used on-line. E.g., histograms.
Histograms. Example: a histogram on Salary with bucket boundaries at 2500, 3500, 5000, and 6000 and bucket counts 1000, 1500, 800, and 1700, over 5000 records in total. Q: What is the selectivity of salary > 3000? A: the number of records with salary above 3000 divided by the total number of records. Assuming values are uniform within each bucket, the bucket containing 3000 contributes half of its 1000 records, and the buckets entirely above 3000 contribute 800 and 1700: estimate = (500 + 800 + 1700) / 5000 = 0.6.
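To make the bucket arithmetic concrete, here is a minimal sketch (mine, not from the talk) of histogram-based range selectivity under the uniform-within-bucket assumption; the function name and the exact bucket boundaries are illustrative, chosen so that the bucket holding 3000 contributes half of its 1000 records as on the slide.

```python
def histogram_selectivity(boundaries, counts, threshold):
    """Estimate the selectivity of `value > threshold` from a histogram.

    boundaries: sorted bucket edges (len(counts) + 1 of them)
    counts: number of records falling in each bucket
    Assumes values are uniformly distributed within each bucket.
    """
    total = sum(counts)
    selected = 0.0
    for i, count in enumerate(counts):
        lo, hi = boundaries[i], boundaries[i + 1]
        if threshold <= lo:       # bucket lies entirely above the threshold
            selected += count
        elif threshold < hi:      # bucket straddles the threshold
            selected += count * (hi - threshold) / (hi - lo)
        # otherwise the bucket lies entirely below and contributes nothing
    return selected / total

# Illustrative buckets totalling 5000 records, mirroring the slide's numbers:
print(histogram_selectivity([2000, 2500, 3500, 5000, 6000],
                            [1500, 1000, 800, 1700], 3000))  # -> 0.6
```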
Synopses: pros and cons. Pros: built offline and reusable many times, so the overhead at selectivity-estimation time is minimal. Cons: it is difficult to capture all useful information in a limited space, notably the correlation between attributes.
Sampling. Number of records in the table: 10,000. Sample size: 100. Number of sampled records having age > 50 and salary > 5500: 12. Selectivity estimate = 12/100 = 0.12, whereas the true selectivity is 0.09.
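As a sketch of this procedure (illustrative only; the table contents, predicate thresholds, and seed are made up), one can draw a simple random sample and use the in-sample fraction as the estimate:

```python
import random

random.seed(7)
N, n = 10_000, 100

# A hypothetical table of (age, salary) records.
table = [(random.randint(20, 70), random.gauss(4000, 1500)) for _ in range(N)]

def satisfies(rec):
    return rec[0] > 50 and rec[1] > 5500

true_s = sum(map(satisfies, table)) / N
sample = random.sample(table, n)            # SRS without replacement
estimate = sum(map(satisfies, sample)) / n  # in-sample fraction satisfying Q
print(f"true={true_s:.3f} estimate={estimate:.3f}")
```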
Sampling: pros and cons. The good: the sample itself provides correlation information. The bad: cost. Accurate results require a large portion of the data to be accessed, and random access is much slower than sequential access.
Summary. Synopses have low runtime cost but capture little correlation information; sampling captures correlation but has high runtime cost. Can we take the best of both worlds: capture correlation while reducing the sampling rate?
Outline: Background; Motivation; Related Work; Our approach: HASE (Estimator, Algorithms, Bounds); Experiments; Conclusions.
HASE: a hybrid approach to selectivity estimation, using histograms and a sample together. Goal: consistent utilization of both sources of information. Benefits: (1) correlation is captured (from the sample); (2) the sample size can be significantly smaller (thanks to the histograms).
Problem setting. Data: a table of size N. Query: a conjunction of predicates, Q = P1 ^ P2 ^ P3 ^ …, e.g., (age>50) ^ (salary>5500) ^ (hire_date>"01-01-05"). Available information: the selectivities of the individual predicates, obtained from synopses (e.g., s1 = 0.1, s2 = 0.2, s3 = 0.05), and a sample S of n records, where record j has inclusion probability π_j; for simple random sampling (SRS), π_j = n/N. Goal: estimate the selectivity s of the query Q.
Example. Table R with 10,000 records; query Q = P1 ^ P2 on two attributes. Suppose 500 records satisfy both predicates; then the true selectivity is s = 500/10000 = 0.05.
Histogram-based estimate. Based on the histograms, s1 = 0.6 and s2 = 0.3. Assuming independence between the attributes, the selectivity estimate is s1 * s2 = 0.18. Relative error = |0.18 - 0.05| / 0.05 = 260%.
Sampling-based estimate. Sample weight of record j: d_j = 1/π_j. Indicator variable: I_j = 1 if record j satisfies Q, and 0 otherwise. The Horvitz-Thompson (HT) estimator is s_hat = (1/N) * Σ_{j in S} d_j * I_j. Take an SRS of size 100, so d_j = 10000/100 = 100; 9 sampled records satisfy Q, giving estimate = 9 * 100 / 10000 = 0.09. Relative error = |0.09 - 0.05| / 0.05 = 80%.
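A minimal sketch of the HT computation with the slide's numbers (9 of the 100 sampled records satisfy Q, and every record carries weight d_j = 1/π_j = 100):

```python
N, n = 10_000, 100
pi = n / N              # inclusion probability under SRS: pi_j = n/N
d = 1 / pi              # design weight d_j = 1/pi_j = 100

# Indicator values I_j for the 100 sampled records: 9 satisfy Q.
indicators = [1] * 9 + [0] * 91

# HT estimator: s_hat = (1/N) * sum_{j in S} d_j * I_j
s_hat = sum(d * I for I in indicators) / N
print(s_hat)            # 0.09
```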
A new estimator. The selectivities s1, s2, … of the individual predicates are known through histograms. Replace the original weights d_j with new weights w_j chosen to (1) reproduce the known selectivities of the individual predicates and (2) stay as close to d_j as possible. Using the new weights in the HT form gives the calibration estimator: s_hat = (1/N) * Σ_{j in S} w_j * I_j.
Consistency with known selectivities. Observed frequencies from the sample (100 sample records out of the 10,000 records in the table, so d_j = 100), with s1 = 0.6 known from the histogram:

            P2=true   P2=false   (row sum)
P1=true     0.09      0.56       0.65
P1=false    0.24      0.11       0.35
(col sum)   0.33      0.67

The sample puts the frequency of P1 at 0.65, inconsistent with the known s1 = 0.6; calibrating the weights removes this inconsistency.
Calibration estimator. Why do we want w_j to be as close to d_j as possible? The d_j have the property of producing unbiased estimates; if the w_j stay close to d_j, the resulting estimates remain nearly unbiased. Hence: keep w_j as close to d_j as possible.
Constrained optimization problem. Let x_ij = 1 if record j satisfies predicate Pi and 0 otherwise, and let D(x) be a distance function evaluated at x = w_j/d_j. Minimize Σ_{j in S} d_j * D(w_j/d_j) with respect to the w_j (keeping them as close to d_j as possible), subject to Σ_{j in S} w_j * x_ij = N * s_i for every predicate Pi (reproducing the known selectivities).
An algorithm based on Newton's method. Apply the method of Lagrange multipliers: minimizing the Lagrangian with respect to the w_j gives weights of the form w_j = d_j * F(x_j^T λ), where F is the inverse function of D' and λ holds one multiplier per constraint. Substituting this form back into the constraints yields a system of equations in λ that can be solved using Newton's method via an iterative procedure.
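Below is a sketch of this procedure for the multiplicative distance (see the distance-measures slide further on). It is not the authors' code: `calibrate_multiplicative` is a name I invented, and I add a total-size constraint Σ_j w_j = N alongside the two selectivity constraints, a common convention in the calibration literature that the paper may or may not follow.

```python
import numpy as np

def calibrate_multiplicative(d, X, t, iters=50, tol=1e-10):
    """Calibration weights under the multiplicative distance (raking).

    Solves sum_j d_j * exp(X_j . lam) * X_j = t for the Lagrange
    multipliers lam by Newton's method; the calibrated weights are
    w_j = d_j * exp(X_j . lam).

    d: (J,) design weights d_j = 1/pi_j
    X: (J, m) constraint indicators, one column per known total
    t: (m,) population totals the new weights must reproduce
    """
    lam = np.zeros(X.shape[1])
    for _ in range(iters):
        w = d * np.exp(X @ lam)         # current weights
        phi = X.T @ w - t               # constraint residuals
        if np.max(np.abs(phi)) < tol:
            break
        J = X.T @ (w[:, None] * X)      # Jacobian: sum_j w_j x_j x_j^T
        lam -= np.linalg.solve(J, phi)  # Newton step
    return d * np.exp(X @ lam)

# The running example: 100 records sampled from N = 10,000 (d_j = 100),
# cell counts (P1, P2) = (T,T): 9, (T,F): 56, (F,T): 24, (F,F): 11,
# and known selectivities s1 = 0.6, s2 = 0.3 from the histograms.
N = 10_000
cells = [(1, 1, 9), (1, 0, 56), (0, 1, 24), (0, 0, 11)]
X = np.array([[1, p1, p2] for p1, p2, c in cells for _ in range(c)], float)
d = np.full(len(X), 100.0)
t = np.array([N, 0.6 * N, 0.3 * N])     # total, N*s1, N*s2

w = calibrate_multiplicative(d, X, t)
I = X[:, 1] * X[:, 2]                   # I_j = 1 iff record j satisfies P1 ^ P2
print((w @ I) / N)                      # ~0.06, vs HT 0.09 and true s 0.05
```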
An alternative algorithm
Example. Observed frequencies from the sample:

            P2=true   P2=false
P1=true     0.09      0.56
P1=false    0.24      0.11
Distance measures. Requirements on the distance function: (1) D is positive and strictly convex; (2) D(1) = D'(1) = 0; (3) D''(1) = 1. A linear distance function needs only one iteration (fast!), but allows w_j < 0 and hence possible negative estimates. A multiplicative distance function converges after a few iterations (typically two) and guarantees w_j > 0 always.
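For reference, the standard choices from the survey-calibration literature that fit these two descriptions (my assumption; the slides do not spell the formulas out):

```latex
% Linear (chi-square) distance: one Newton step suffices, but w_j may go negative.
D(x) = \tfrac{1}{2}(x-1)^2, \quad D'(x) = x - 1, \quad w_j = d_j\,(1 + x_j^{\top}\lambda)

% Multiplicative (raking) distance: iterative, but w_j = d_j e^{x_j^{\top}\lambda} > 0 always.
D(x) = x\log x - x + 1, \quad D'(x) = \log x, \quad w_j = d_j\,e^{x_j^{\top}\lambda}
```

Both satisfy the three requirements: they are positive and strictly convex, vanish together with their first derivative at x = 1, and have D''(1) = 1.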
Error bounds. Probabilistic bounds on the estimation error can be derived for the estimator; they involve the second-order inclusion probabilities π_jl = Pr(both j and l are in the sample).
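The π_jl enter through the variance of the estimator. A classical form for the HT case, which I assume is the starting point for the bounds here (probabilistic bounds then follow, e.g., via Chebyshev's inequality):

```latex
\operatorname{Var}(\hat{s}) = \frac{1}{N^2} \sum_{j}\sum_{l}
  \left(\pi_{jl} - \pi_j \pi_l\right)\frac{I_j}{\pi_j}\,\frac{I_l}{\pi_l},
\qquad \pi_{jj} = \pi_j .
```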
Experiments. Synthetic data: skew following a Zipfian distribution (z = 0, 1, 2, 3); correlation coefficient between attributes ranging over [0, 1]. Real data: the Census-Income data set from the UCI KDD Archive (population surveys by the US Census Bureau; ~200,000 records, 40 attributes). Queries: range queries (attribute <= constant) and equality queries (attribute = constant).
Effect of correlation
Effect of data skew
Effect of sample rate
Effect of number of attributes
Conclusions. HASE combines synopsis-based and sampling-based selectivity estimation. Contributions: the calibrated estimator, algorithms to compute it, probabilistic bounds on errors, and experimental results. Benefits: (1) correlation is captured (from the sample); (2) the sample size can be significantly smaller (thanks to the histograms).