Histograms for Selectivity Estimation, Part II: Global Optimization of Histograms
Speaker: Ho Wai Shing
Contents
- Supplements to the previous talk
- Introduction to histograms for multi-dimensional data
- Global optimization of histograms
- Experimental results
- Conclusion
Summary
- Histograms approximate the frequency distribution of an attribute (or a set of attributes): they group attribute values into "buckets" and approximate the actual frequencies by the statistical information stored in each bucket.
- A taxonomy of 1-D histograms was discussed.
Summary of the 1-D Histogram Taxonomy
[figure: taxonomy chart relating the data distribution to equi-width, equi-depth, V-optimal(F,F), V-optimal(V,F), Max-Diff(V,F), and Compressed(V,F) histograms]
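As a concrete instance of one taxonomy entry, the Max-Diff(V,F) construction can be sketched as follows. This is a minimal sketch, not the paper's code; the function name and input layout are illustrative. It sorts the (value, frequency) pairs by value (the sort parameter V) and places bucket boundaries at the B-1 largest differences between adjacent frequencies (the source parameter F).

```python
def maxdiff_vf(values, freqs, num_buckets):
    """Sketch of Max-Diff(V,F): sort by value, then cut at the
    (num_buckets - 1) largest adjacent-frequency differences."""
    pairs = sorted(zip(values, freqs))           # sort parameter: value (V)
    diffs = [abs(pairs[i + 1][1] - pairs[i][1])  # source parameter: frequency (F)
             for i in range(len(pairs) - 1)]
    # positions of the (num_buckets - 1) largest differences become boundaries
    cuts = sorted(sorted(range(len(diffs)), key=lambda i: diffs[i],
                         reverse=True)[:num_buckets - 1])
    buckets, start = [], 0
    for c in cuts:
        buckets.append(pairs[start:c + 1])
        start = c + 1
    buckets.append(pairs[start:])
    return buckets  # each bucket: a list of (value, frequency) pairs
```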
Estimation Procedures
- Find the histogram buckets that overlap the query range.
- Estimate the counts from the overlapping portion of the query and each bucket, under one of three assumptions: the continuous values assumption, the point value assumption, or the uniform spread assumption.
Uniform Frequency Assumptions
[figure: the same bucket interpreted under the continuous values, point value, and uniform spread assumptions]
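To make the uniform spread assumption concrete, here is a minimal sketch of estimating one bucket's contribution to a range query; the bucket field names (min, max, d, f) are assumptions for illustration. The d distinct values are assumed to lie at equal spreads across the bucket, each carrying an equal share of the bucket's total frequency.

```python
def estimate_range(bucket, a, b):
    """Estimate |{x : a < x <= b}| for one bucket under the
    uniform spread assumption. `bucket` holds min, max, the number
    of distinct values d, and total frequency f (illustrative names)."""
    lo, hi, d, f = bucket["min"], bucket["max"], bucket["d"], bucket["f"]
    if d == 1:
        return f if a < lo <= b else 0.0
    spread = (hi - lo) / (d - 1)        # values assumed at lo, lo+s, ..., hi
    # count the assumed value positions that fall inside (a, b]
    inside = sum(1 for i in range(d) if a < lo + i * spread <= b)
    return inside * (f / d)             # each value gets an equal frequency share

# the total estimate for a query is the sum over all overlapping buckets
```

Note that a narrow query falling between two assumed value positions correctly estimates 0, which the continuous values assumption cannot do.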
Uniform Frequency Assumptions: Experiments
- Results show that the uniform spread assumption gives the best estimates.
- Surprisingly, the uniform spread assumption also beats the continuous values assumption on "non-uniform spread" data: e.g., when spreads are large, many queries should return 0, which the uniform spread assumption can capture (the continuous values assumption returns a nonzero count for any overlap).
Experiments: Datasets
- Datasets are in the form of (value, frequency) pairs.
- Value distributions: uniform, zipf_incr, zipf_dec, etc.
- Frequencies: Zipf, with different skew factors.
- Different value-frequency correlations: positive, negative, random.
Experiments
[figure: the data value distributions]
Experiments: Queries
- Queries have the form a < X ≤ b. Five types:
1. a = -1, b ranges over the value domain
2. a = -1, b is one of the values appearing in the data
3. random a and b, s.t. selectivity in [0, 0.2]
4. random a and b, s.t. selectivity in [0.8, 1]
5. random a and b, s.t. selectivity in [0, 0.2] ∪ [0.8, 1]
Experiments: Histograms
- All histograms described in the taxonomy are compared.
- The histograms are of the same size in bytes (not the same number of buckets).
- Built from 2000 samples (except the trivial and equi-depth histograms, which are exact, and P²).
- Built by scanning the data once.
Experiments
[figure: errors on the cusp_max value distribution, random value-frequency correlation, z = 1]
- Sort parameter V usually means better accuracy.
Experiments
[figure: error of the V-optimal(V,A) histogram vs. sample size]
- Increasing the sample size gives better results.
End of Part I
Part II: Global Optimization of Histograms (GOH)
Histograms for n-D data
- The histograms discussed previously are on a single attribute (1-D data).
- Two main approaches for n-D data: use n 1-D histograms, or use one n-D histogram.
Histograms for n-D data: n 1-D histograms
- Need the "attribute value independence assumption" (AVIA).
- Can reuse all the 1-D histogram techniques.
- Already give quite good accuracy.
- Representative: GOH [Jagadish et al. SIGMOD'01].
Histograms for n-D data: n-D histograms
- Do not need the AVIA, which is usually untrue in practice.
- But a "good" partition of the n-D space into buckets is difficult to compute, store, and maintain.
- Representatives: MHIST [Poosala & Ioannidis VLDB'97], H-Tree [Muralikrishna & DeWitt SIGMOD'88].
Global Optimization of Histograms (GOH)
- Given a space budget of B buckets, find an optimal assignment of buckets among the dimensions that minimizes the error.
- I.e., give more buckets to attributes that are used frequently or have skewed distributions.
GOH -- example
- E.g., if A1 is nearly uniform while A2 is highly skewed and we have 4 buckets, an A1:1, A2:3 assignment is better than an A1:2, A2:2 assignment.
Computing GOH
- Exhaustive Algorithm (GOHEA)
- Based on Dynamic Programming (GOHDP)
- Greedy Approach (GOHGA)
- Greedy Approach with Remedy (GOHGR)
GOHEA
- For every possible bucket assignment, calculate the error metric and pick the minimum.
- Clearly too inefficient.
GOHDP
- Define E(b, k) = the minimum error of using b buckets to store the first k histograms, and error(b, k) = the error of using b buckets for the k-th histogram alone.
- Observation: E(b, k) = min over x of ( E(b - x, k - 1) + error(x, k) ).
GOHDP
- Calculating all error(b, k) naively takes O(N²BM); a DP algorithm can compute all error(b, k) for a given k in O(N²B) [Jagadish et al. VLDB'98].
- Filling the E(b, k) table takes O(B²M): O(BM) entries to fill, and O(B) computations per entry.
GOHDP
- Note that if we knew the allocation beforehand, constructing the histograms would take only O(N²B).
- So GOHDP is still inefficient when M (the number of attributes) is large.
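A minimal sketch of the E(b, k) table fill and backtracking described above, assuming the error(b, k) table has already been computed (e.g., by the VLDB'98 algorithm); the `error[k][b]` layout is an assumption for illustration.

```python
def gohdp(error, M, B):
    """DP fill for E(b, k). error[k][b] = minimum error of a b-bucket
    histogram on attribute k (1-based indices, illustrative layout).
    Returns the minimum total error and the per-attribute allocation."""
    INF = float("inf")
    E = [[INF] * (B + 1) for _ in range(M + 1)]
    choice = [[0] * (B + 1) for _ in range(M + 1)]
    for b in range(1, B + 1):                # base case: one attribute
        E[1][b], choice[1][b] = error[1][b], b
    for k in range(2, M + 1):
        for b in range(k, B + 1):            # at least 1 bucket per attribute
            for x in range(1, b - k + 2):    # x buckets go to attribute k
                cand = E[k - 1][b - x] + error[k][x]
                if cand < E[k][b]:
                    E[k][b], choice[k][b] = cand, x
    # backtrack to recover the allocation
    alloc, b = [0] * (M + 1), B
    for k in range(M, 0, -1):
        alloc[k] = choice[k][b]
        b -= alloc[k]
    return E[M][B], alloc[1:]
```

The triple loop is exactly the O(B²M) table fill noted above: O(BM) entries, O(B) candidates each.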
GOHGA
- The greedy approach is O(N²B), i.e., nearly no penalty compared with direct construction (using the same number of buckets on all attributes).
- Define the marginal gain m_k(i, j) = error(i, k) - error(j, k), i.e., the reduction in error if we use j buckets instead of i buckets on attribute k.
GOHGA
1. Assign 1 bucket to each dimension.
2. Allocate buckets one by one, each to the dimension with the greatest marginal gain from the new bucket.
3. Repeat until all B buckets are assigned.
GOHGA
- O(B) steps, O(N²) per marginal gain calculation (error(b, k) can be computed incrementally from b = 1 in O(N²b)), so O(N²B) overall.
- But GOHGA does not always return the optimum.
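A minimal sketch of the greedy allocation, reusing the assumed `error[k][b]` table layout from the DP sketch above:

```python
def gohga(error, M, B):
    """Greedy bucket allocation (GOHGA sketch): start with one bucket
    per attribute, then repeatedly give the next bucket to the attribute
    with the greatest marginal gain m_k(i, i+1) = error(i,k) - error(i+1,k)."""
    alloc = [1] * (M + 1)                    # alloc[k], 1-based; index 0 unused
    for _ in range(B - M):                   # B - M buckets left to hand out
        best_k = max(range(1, M + 1),
                     key=lambda k: error[k][alloc[k]] - error[k][alloc[k] + 1])
        alloc[best_k] += 1
    return alloc[1:]
```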
GOHGR
- Greedy looks ahead 1 step to see which allocation has the greatest marginal gain; the remedy looks ahead 2 steps to see if it can find a better allocation.
- E.g., with m_1(3,4)=30, m_1(4,5)=130, m_2(3,4)=40, m_2(4,5)=40: plain greedy gives the next bucket to dimension 2 (40 > 30), but over two steps dimension 1 gains 30 + 130 = 160 versus dimension 2's 40 + 40 = 80.
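Working the slide's numbers through in code makes the remedy's effect concrete; the dictionaries below merely encode the example's marginal gains, not the full GOHGR procedure.

```python
# m_k(i, j) = error(i, k) - error(j, k), numbers from the slide's example
m = {1: {(3, 4): 30, (4, 5): 130},
     2: {(3, 4): 40, (4, 5): 40}}

one_step = {k: m[k][(3, 4)] for k in m}                 # what plain greedy sees
two_step = {k: m[k][(3, 4)] + m[k][(4, 5)] for k in m}  # what the remedy sees
print(one_step)  # {1: 30, 2: 40}  -> plain greedy gives the bucket to dim 2
print(two_step)  # {1: 160, 2: 80} -> looking 2 steps ahead favours dim 1
```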
Experiments
- Aim: to show that GOH really achieves a smaller error by allocating more buckets to more skewed data, and that GOH can be computed efficiently.
Experiment types
- Absolute/relative errors of different attributes
- Absolute/relative errors for different bucket budgets
- Absolute/relative errors for different distribution skews
- Running time
Experiments
- Dataset (synthetic data): 5 attributes, … tuples, 500 values per attribute; frequencies follow a Zipf distribution, with random association between frequencies and values.
- The 5 attributes have skew z = 0, 0.01, 0.1, 1, …
- Queries: X ≤ a.
Experiments
- Dataset (TPC-D data with skew), as a more realistic dataset: shows results similar to the synthetic data.
- Dataset (2 attributes): to evaluate the gain of GOH due to the skew difference between attributes.
Experiments
- TG3: z = 0, 2
- TG4: z = 0.02, 1.8
- TG5: z = 1.8, 2
(the skew parameters of the two attributes in each dataset)
Conclusions
- GOH achieves smaller errors because it assigns more buckets to skewed or frequently used attributes.
- With GOHGR, there is nearly no time penalty for building GOH.
Future Work
- The methods presented do not solve the n-D histogram problem completely.
- Try applying the SF-Tree to store and retrieve the buckets of a multi-dimensional histogram efficiently.
References
[Jagadish et al. SIGMOD'01] H. V. Jagadish, Hui Jin, Beng Chin Ooi, Kian-Lee Tan, Global Optimization of Histograms, SIGMOD'01
[Jagadish et al. VLDB'98] H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Ken Sevcik, Torsten Suel, Optimal Histograms with Quality Guarantees, VLDB'98
[Poosala et al. SIGMOD'96] Viswanath Poosala, Yannis Ioannidis, Peter Haas, Eugene Shekita, Improved Histograms for Selectivity Estimation of Range Predicates, SIGMOD'96
[Poosala & Ioannidis VLDB'97] Viswanath Poosala and Yannis Ioannidis, Selectivity Estimation Without the Attribute Value Independence Assumption, VLDB'97
References [Muralikrishna & DeWitt SIGMOD'88] M. Muralikrishna and D. DeWitt, Equi-Depth Histograms for Estimating Selectivity Factors for Multi-Dimensional Queries, SIGMOD’88