Wavelet-based histograms for selectivity estimation

Presentation transcript:

Wavelet-based histograms for selectivity estimation
Paper: Yossi Matias, Jeffrey Scott Vitter, and Min Wang
Presentation: Michael Ernst

Executive Summary
Histograms aid query optimization.
Problem: histograms are bulky.
Solution: compress histograms.
Technique: wavelet-based compression.
Claim: it works.

Notes: Histograms estimate the sizes of relations and query results. Surajit Chaudhuri mentioned histograms as “a substantial [space] cost”, especially if there are very many attributes. Compression is not a new idea; here the compression is lossy. Wavelets give a compact, computationally efficient approximation to a function. Despite pages of numbers, the authors provide precious little evidence to back up their claim. The talk will follow this outline.

Histograms for query optimization
Estimate sizes of relations and of range query results. The estimates feed into:
join ordering
join implementation
placement in operator tree

Notes: Some other techniques, like sampling, are good for arbitrary queries, not just range queries.
Join ordering: keep relations as small as possible, for as long as possible.
Join implementation: e.g., use a hash join if one relation fits in memory.
Operator tree placement: where to put filters, group-by, and other operations (eager vs. lazy).

Types of histogram
Sampling
Equi-width
Equi-depth
Maxdiff
Cumulative frequency (running sum of frequency), approximated by: splines, wavelets
Example data (figure not preserved): 5 2 3 6 4 4 4 4 7 2 1 6 (16 values in [1..20])

Notes: All the data is itself a histogram: the perfect histogram. We must approximate to reduce the storage costs.
Equi-width: buckets have the same width.
Equi-depth: buckets have the same number of elements (must store bucket boundaries instead of sizes).
Maxdiff: split buckets at gaps in the data (must store both bucket boundaries and sizes).
Cumulative frequency is much smoother (easier to approximate) than frequency, but suffers no information loss.
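
The bucketing rules above can be sketched in a few lines of Python. This is a minimal illustration, not the construction used in the paper: the function names and the sample values are hypothetical, and maxdiff is left out.

```python
from itertools import accumulate


def equi_width(values, lo, hi, buckets):
    """Count values per bucket when the integer domain [lo, hi] is split
    into `buckets` ranges of equal width. Only the counts need storing."""
    width = (hi - lo + 1) / buckets
    counts = [0] * buckets
    for v in values:
        idx = min(int((v - lo) / width), buckets - 1)
        counts[idx] += 1
    return counts


def equi_depth(values, buckets):
    """Return the upper boundary of each bucket so that every bucket holds
    roughly the same number of values. Only the boundaries need storing."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[min(n - 1, (i + 1) * n // buckets - 1)] for i in range(buckets)]


def cumulative(freq):
    """Running sum of a frequency vector (the cumulative frequency)."""
    return list(accumulate(freq))


# Hypothetical sample: 12 values drawn from the domain [1..20].
sample = [1, 1, 2, 3, 3, 3, 7, 8, 12, 15, 15, 20]
widths = equi_width(sample, 1, 20, 4)
depths = equi_depth(sample, 4)
running = cumulative([5, 2, 3, 6])
```

With the cumulative form, the number of values in a range [a, b] is the difference of two entries, which is why a smooth, accurate approximation of the cumulative frequency is enough for range-query estimates.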

The wavelet transform
Lossless function representation (like Fourier).
Haar wavelets: made up of finite square-wave basis functions (the basis-function figures are not preserved in this transcript).
Result: a sequence of coefficients.
Reconstruction: find and add the relevant components.

Notes: Just as the Fourier transform decomposes a waveform into its constituent sine waves, the wavelet transform decomposes a waveform into its constituent finite square waves.
Example: show the Fourier transform for a square wave.
Example: show the Haar wavelet transform for an exponential decay.
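
As a rough sketch of the decomposition and reconstruction described above, the following Python uses the unnormalized averaging-and-differencing form of the Haar transform. The function names are hypothetical, the input length is assumed to be a power of 2, and the paper's own normalization may differ.

```python
def haar_decompose(data):
    """Return [overall average] followed by detail coefficients ordered
    from coarsest to finest. The input length must be a power of 2."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    current = list(data)
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        avgs = [(current[i] + current[i + 1]) / 2 for i in pairs]
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = diffs + details  # coarser levels end up in front
        current = avgs
    return current + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose by adding and subtracting details level by level."""
    current = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(current)]
        nxt = []
        for avg, diff in zip(current, diffs):
            nxt.extend([avg + diff, avg - diff])
        pos += len(current)
        current = nxt
    return current


signal = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_decompose(signal)
restored = haar_reconstruct(coeffs)
```

The transform yields exactly as many coefficients as input values, and reconstruction recovers the input, so on its own it saves no space.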

Wavelet compression
Lossless: no compression.
Thresholding: discard some coefficients. Options for which to keep:
first m coefficients
biggest m/2 coefficients
greedy: select some coefficients, then iterate, adding and deleting coefficients
Further compression may be possible.

Notes: The transform produces as many coefficients as there are original function values (exactly so if the number of values is a power of 2). Keeping the biggest m/2 coefficients is best for 2-norm error (also known as a least-squares fit). Further compression is generally applicable and has nothing to do with wavelets: quantization, entropy encoding.
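
The first two thresholding options can be sketched as follows. The helper names are hypothetical, the coefficients are assumed to be already normalized so that magnitudes are comparable across levels, and the greedy variant is omitted. One plausible reading of "m/2" (an assumption, not stated in the transcript) is that each kept coefficient costs two stored numbers, its index and its value, whereas the first m coefficients need no indexes.

```python
def keep_first(coeffs, m):
    """Keep the first m coefficients; positions are implicit."""
    return {i: c for i, c in enumerate(coeffs[:m])}


def keep_largest(coeffs, m):
    """Keep the m coefficients with the largest absolute value.
    The index of each survivor must be stored alongside its value."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    return {i: coeffs[i] for i in sorted(ranked[:m])}


def densify(sparse, n):
    """Expand a sparse {index: value} synopsis to length n, treating every
    discarded coefficient as zero, ready for reconstruction."""
    return [sparse.get(i, 0.0) for i in range(n)]


# Hypothetical coefficient vector of length 8.
coeffs = [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
first = keep_first(coeffs, 3)
largest = keep_largest(coeffs, 4)
dense = densify(largest, len(coeffs))
```

Feeding the densified vector to the inverse transform gives the approximate function; the error comes entirely from the coefficients that were set to zero.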

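Multidimensional histograms
Avoid the assumption of independence.
To estimate selectivity from 2-D cumulative frequency: combine the cumulative frequency at the four corners of the query rectangle with weights 1, -1, -1, 1 (figure with origin (0,0) not preserved).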
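
The four-corner rule can be sketched in Python as below. This uses an exact 2-D cumulative frequency table rather than a wavelet approximation of it, and the function names and the small frequency grid are hypothetical.

```python
def cumulative_2d(freq):
    """Build a 2-D cumulative frequency table with one extra row and column
    of zeros, so cum[r][c] is the count of all cells above and left of (r, c)."""
    rows, cols = len(freq), len(freq[0])
    cum = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            cum[r + 1][c + 1] = (freq[r][c] + cum[r][c + 1]
                                 + cum[r + 1][c] - cum[r][c])
    return cum


def range_count(cum, r1, c1, r2, c2):
    """Count tuples with r1 <= row <= r2 and c1 <= col <= c2 using the
    corner weights +1, -1, -1, +1."""
    return (cum[r2 + 1][c2 + 1] - cum[r1][c2 + 1]
            - cum[r2 + 1][c1] + cum[r1][c1])


# Hypothetical joint frequency of two attributes, each with 3 distinct values.
freq = [[1, 2, 0],
        [0, 3, 1],
        [4, 0, 2]]
cum = cumulative_2d(freq)
total = range_count(cum, 0, 0, 2, 2)
selectivity = range_count(cum, 1, 1, 2, 2) / total
```

Because the table is built from the joint distribution, the estimate does not rely on the two attributes being independent.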

Empirical evaluation
Synthetic benchmarks.
Compare: smooth cumulative frequency; no “equals” queries.
Compare: sampling, maxdiff, Haar wavelets, linear wavelets.

Notes: Are these benchmarks representative?
Sampling: how many tries? Is this result typical?
Maxdiff: performs worst when the cumulative frequency is smooth.
Linear wavelets are best by far.
Haar wavelets often outperform maxdiff, but are sometimes worse.

The bottom line
Wavelets improve histogram accuracy, thus improving selectivity estimation.
Does this affect query execution time?

Notes: This is the key issue. It doesn’t matter how much accuracy is improved if that additional accuracy can’t be used. Maybe current schemes are good enough. The authors don’t even mention this as an issue!