Wavelet-based histograms for selectivity estimation

1 Wavelet-based histograms for selectivity estimation
Paper by Yossi Matias, Jeffrey Scott Vitter, and Min Wang. Presentation by Michael Ernst.

2 Executive Summary
- Histograms aid query optimization: they estimate the sizes of relations and of query results
- Problem: histograms are bulky
- Solution: compress histograms
- Technique: wavelet-based compression
- Claim: it works

Surajit Chaudhuri described histograms as "a substantial [space] cost", especially when there are very many attributes. Compression is not a new idea; here it is lossy compression. Wavelets are a compact, computationally efficient approximation to a function. Despite pages of numbers, the authors provide precious little evidence to back up their claim. The talk follows this outline.

3 Histograms for query optimization
Estimate the sizes of relations and of range-query results, to guide:
- join ordering
- join implementation
- placement in the operator tree

Some other techniques, like sampling, are good for arbitrary queries, not just range queries. Join ordering: keep relations as small as possible, for as long as possible. Join implementation: e.g., hash-join if one relation fits in memory. Operator-tree placement: filters, group-by, and other operations (eager vs. lazy).

4 Types of histogram
- Sampling
- Equi-width
- Equi-depth
- Maxdiff
- Cumulative frequency (Σ frequency): splines, wavelets

[Slide figure: example data 5 2 3 6 4 4 4 4 7 2 1 6, labeled "16 values in [1..20]"; only 12 of the values survive in this transcript.]

All the data is itself a histogram: the perfect histogram. We must approximate to reduce the storage cost. Equi-width: buckets have the same width. Equi-depth: buckets have the same number of elements (so bucket boundaries must be stored instead of sizes). Maxdiff: split buckets at gaps in the data (so both bucket boundaries and sizes must be stored). Cumulative frequency is much smoother, hence easier to approximate, than raw frequency, but suffers no information loss.
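To make the cumulative-frequency idea concrete, here is a minimal sketch (names and data are illustrative, not from the paper) of answering a range query with two lookups into a prefix-sum array:

```python
# Minimal sketch: answering a range query from cumulative frequencies.
# freq[v] = number of tuples whose attribute equals value v (illustrative data).
freq = [5, 2, 3, 6, 4, 4, 4, 4, 7, 2, 1, 6]

# Cumulative frequency: cum[v] = freq[0] + ... + freq[v].
cum, running = [], 0
for f in freq:
    running += f
    cum.append(running)

def range_count(lo, hi):
    """Tuples with attribute value in [lo, hi] (0-indexed): two lookups."""
    return cum[hi] - (cum[lo - 1] if lo > 0 else 0)

print(range_count(2, 5))   # 3 + 6 + 4 + 4 = 17
```

This shows why the cumulative form loses no information: the original frequencies are recoverable as adjacent differences, yet the running sum is much smoother and therefore easier to compress.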

5 The wavelet transform
- Lossless function representation (like Fourier)
- Haar wavelets: built from scaled and shifted square-wave basis functions [the slide's basis-shape icons did not survive transcription]
- Result: a sequence of coefficients
- Reconstruction: find and add the relevant components

Just as the Fourier transform decomposes a waveform into its constituent sine waves, the wavelet transform decomposes a waveform into its constituent finite square waves. Example: show the Fourier transform of a square wave. Example: show the Haar wavelet transform of an exponential decay.
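A minimal sketch of the Haar decomposition and its inverse (function names are mine; the input length is assumed to be a power of 2). Each pass replaces pairs by their average and half-difference; keeping every coefficient makes the transform lossless:

```python
# Sketch of the 1-D Haar wavelet transform (input length a power of 2).
# Each pass halves the signal: pairwise averages continue to the next level,
# pairwise half-differences become detail coefficients.

def haar_decompose(values):
    level, details = list(values), []
    while len(level) > 1:
        avgs = [(level[i] + level[i + 1]) / 2 for i in range(0, len(level), 2)]
        dets = [(level[i] - level[i + 1]) / 2 for i in range(0, len(level), 2)]
        details = dets + details        # keep coarsest details first
        level = avgs
    return level[0], details            # overall average + detail coefficients

def haar_reconstruct(avg, details):
    level, pos = [avg], 0
    while pos < len(details):
        dets = details[pos:pos + len(level)]
        pos += len(level)
        level = [x for a, d in zip(level, dets) for x in (a + d, a - d)]
    return level

avg, dets = haar_decompose([5, 7, 3, 1])
print(avg, dets)                        # 4.0 [2.0, -1.0, 1.0]
print(haar_reconstruct(avg, dets))      # [5.0, 7.0, 3.0, 1.0]
```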

6 Wavelet compression
- Lossless: no compression
- Thresholding: discard some coefficients. Keep:
  - the first m coefficients
  - the biggest m/2 coefficients (half as many, because each kept coefficient must store its index as well as its value)
  - greedy: select some coefficients, then iterate, adding and deleting coefficients
- Further compression may be possible

There are as many coefficients as original function values (the transform is exact if the length is a power of 2). Keeping the biggest m/2 coefficients is best for 2-norm error (i.e., a least-squares fit). Further compression has nothing to do with wavelets and is generally applicable: quantization, entropy encoding.
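A minimal sketch of magnitude thresholding, reusing haar_decompose and haar_reconstruct from the sketch above (names are mine). One caveat: with the unnormalized transform above, picking the largest raw magnitudes is only a proxy; the 2-norm optimality claim applies to coefficients normalized by level:

```python
# Sketch of thresholding: keep the m detail coefficients of largest magnitude,
# stored sparsely as (index, value) pairs; all others are treated as zero.
# (Storing indices is why "biggest m/2" competes with "first m" for the same space.)

def threshold(details, m):
    order = sorted(range(len(details)), key=lambda i: abs(details[i]), reverse=True)
    return [(i, details[i]) for i in sorted(order[:m])]

def densify(sparse, n):
    full = [0.0] * n
    for i, v in sparse:
        full[i] = v
    return full

avg, dets = haar_decompose([5, 7, 3, 1, 2, 2, 8, 6])
compressed = threshold(dets, 3)                       # keep 3 of the 7 details
approx = haar_reconstruct(avg, densify(compressed, len(dets)))
print(approx)                                         # lossy approximation
```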

7 Multidimensional histograms
- Avoids the assumption that attributes are independent
- To estimate selectivity from the 2-D cumulative frequency C (origin at (0,0)), combine the four corners of the query rectangle with signs +1, -1, -1, +1:
  count([a1..b1] x [a2..b2]) = C(b1,b2) - C(a1-1,b2) - C(b1,a2-1) + C(a1-1,a2-1)
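A minimal sketch of that corner rule (names are mine): build the cumulative table by inclusion-exclusion, then answer any rectangle query with four signed lookups:

```python
# Sketch of 2-D range counting from a 2-D cumulative frequency table.

def cumulate(F):
    """C[i][j] = sum of F over the rectangle [0..i] x [0..j]."""
    rows, cols = len(F), len(F[0])
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            C[i][j] = (F[i][j]
                       + (C[i - 1][j] if i else 0)
                       + (C[i][j - 1] if j else 0)
                       - (C[i - 1][j - 1] if i and j else 0))
    return C

def range_count_2d(C, a1, b1, a2, b2):
    """Tuples with first attribute in [a1..b1] and second in [a2..b2]."""
    def c(i, j):
        return C[i][j] if i >= 0 and j >= 0 else 0
    return c(b1, b2) - c(a1 - 1, b2) - c(b1, a2 - 1) + c(a1 - 1, a2 - 1)
```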

8 Empirical evaluation
Synthetic benchmarks:
- smooth cumulative frequency
- no "equals" queries
Compare:
- sampling
- maxdiff
- Haar wavelets
- linear wavelets

Are these benchmarks representative? Sampling: how many tries? Is this result typical? Maxdiff performs worst when the cumulative frequency is smooth. Linear wavelets are best by far. Haar wavelets often outperform maxdiff, but are sometimes worse.

9 The bottom line
Wavelets improve histogram accuracy, thus improving selectivity estimation. But does this affect query execution time? That is the key issue: it doesn't matter how much accuracy is improved if the additional accuracy can't be used. Maybe current schemes are already good enough. The authors don't even mention this as an issue!

