APPROXIMATE QUERY PROCESSING USING WAVELETS
Kaushik Chakrabarti (Univ. of Illinois), Minos Garofalakis (Bell Labs), Rajeev Rastogi (Bell Labs), Kyuseok Shim (KAIST and AITrc)

Kaushik Chakrabarti (Univ. of Illinois), Minos Garofalakis (Bell Labs), Rajeev Rastogi (Bell Labs), Kyuseok Shim (KAIST and AITrc). Presented at the 26th VLDB Conference.

APPROXIMATE QUERY PROCESSING USING WAVELETS
Kaushik Chakrabarti (Univ. of Illinois), Minos Garofalakis (Bell Labs), Rajeev Rastogi (Bell Labs), Kyuseok Shim (KAIST and AITrc)
Presented By: Charanmai Koorapati Ramesh Harika Guniganti

AGENDA
• Introduction
• Motivation
• Prior Work
• Wavelet Decomposition
• Building Wavelet Synopses
• Processing Relational Queries
• Experimental Study
• Quality Metrics
• Query Execution Times
• Conclusion

DECISION SUPPORT SYSTEMS
• Comparative sales figures between one week and the next
• Projected revenue figures based on new product sales assumptions
• The consequences of different decision alternatives, given past experience in a context that is described

MOTIVATION
DSS users pose very complex queries to the underlying DBMS. These queries require complex operations over gigabytes or terabytes of disk-resident data.
[Diagram: SQL Query, Decision Support Systems, Exact Answer; long response times]

• Exact answers are NOT always required.
• The user may prefer a fast, approximate answer.
[Diagram: the SQL query over the GB/TB decision support system yields an exact answer with long response times; the "transformed" query over KB/MB compact data synopses yields a fast approximate answer]

APPROXIMATE QUERY PROCESSING
A viable solution for dealing with:
• Huge amounts of data
• High query complexities
• Increasingly stringent response-time requirements

PRIOR WORK
• Sampling-based techniques. Limitations:
  - Join operator on two uniform samples (the result is not a uniform sample of the join output)
  - Non-aggregate queries
• Histogram-based techniques. Limitations:
  - Storage overhead and construction cost needed to achieve reasonable error rates for high-dimensional data sets

WAVELET-BASED TECHNIQUES
Wavelet: a mathematical function used to divide a given function or continuous-time signal into different frequency components and to study each component with a resolution that matches its scale.
This paper extends the scope of earlier work, establishing the viability and effectiveness of wavelets as a generic approximate query processing tool for modern high-dimensional DSS applications.

APPROXIMATE QUERY PROCESSING USING WAVELETS
A novel approach consisting of two steps:
• Multidimensional Haar wavelets: effective, compact synopses
• Novel query processing algorithms: fast and accurate approximate query answers

WAVELET DECOMPOSITION/TRANSFORM
One-dimensional Haar wavelets
Data vector A = [2, 2, 5, 7]
Wavelet transform W_A = [4, -2, 0, -1]

Resolution   Averages       Detail Coefficients
2            [2, 2, 5, 7]   -
1            [2, 6]         [0, -1]
0            [4]            [-2]

Each entry of W_A is called a wavelet coefficient.
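The table above can be reproduced with a few lines of Python. This is only a minimal sketch of pairwise averaging and differencing, not the authors' implementation; the function name `haar_transform` is hypothetical, and the input length is assumed to be a power of two.

```python
def haar_transform(data):
    """Unnormalized 1-D Haar transform (input length assumed to be a power of two)."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    averages = [float(x) for x in data]
    details = []  # detail coefficients, coarsest resolution first
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        level_details = [(a - b) / 2 for a, b in pairs]  # half the pairwise difference
        averages = [(a + b) / 2 for a, b in pairs]       # pairwise average
        details = level_details + details                # coarser levels go in front
    return averages + details


print(haar_transform([2, 2, 5, 7]))  # [4.0, -2.0, 0.0, -1.0]
```

The first entry is the overall average; the remaining entries are the detail coefficients ordered from resolution 0 to the finest resolution, matching W_A on the slide.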

NORMALIZED WAVELET TRANSFORM
To equalize the importance of all the wavelet coefficients, we normalize the final entries of W_A by dividing each wavelet coefficient by (√2)^l, where l is the level of resolution.
Thus W_A = [4, -2, 0, -1/√2]
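A small sketch of this normalization follows. It assumes the coefficient ordering used in the example (overall average first, then details from coarsest to finest), so that the coefficient at index i > 0 sits at resolution level floor(log2(i)); the function name `normalize` is hypothetical.

```python
import math


def normalize(coeffs):
    """Divide each coefficient by sqrt(2)**l, where l is its resolution level."""
    out = []
    for i, c in enumerate(coeffs):
        # Index 0 (overall average) and index 1 are at level 0;
        # indices 2-3 are at level 1, indices 4-7 at level 2, and so on.
        level = 0 if i == 0 else i.bit_length() - 1
        out.append(c / math.sqrt(2) ** level)
    return out


normalized = normalize([4, -2, 0, -1])
print(normalized)  # last entry equals -1/sqrt(2)
```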

MULTIDIMENSIONAL HAAR WAVELETS
Standard decomposition: first fix an ordering for the data dimensions (say 1, 2, ..., d), then apply the complete one-dimensional wavelet transform to each one-dimensional "row" of array cells along dimension k, for all k = 1, 2, ..., d.
Non-standard decomposition: given an ordering for the data dimensions (1, 2, ..., d), perform one step of pairwise averaging and differencing for each one-dimensional row of array cells along dimension k, for each k = 1, ..., d. This process is then repeated recursively only on the quadrant containing the averages across all dimensions.
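The difference between the two decompositions is easiest to see in two dimensions. The sketch below is an illustration under simplifying assumptions (a square array whose side is a power of two, unnormalized coefficients, plain Python lists); the function names and the sample `grid` are hypothetical and not taken from the paper.

```python
def haar_step(v):
    """One level of pairwise averaging and differencing: averages first, then details."""
    avg = [(v[i] + v[i + 1]) / 2 for i in range(0, len(v), 2)]
    det = [(v[i] - v[i + 1]) / 2 for i in range(0, len(v), 2)]
    return avg + det


def haar_full(v):
    """Complete 1-D transform: repeat haar_step on the shrinking prefix of averages."""
    v = list(v)
    n = len(v)
    while n > 1:
        v[:n] = haar_step(v[:n])
        n //= 2
    return v


def transpose(m):
    return [list(col) for col in zip(*m)]


def standard(a):
    """Full 1-D transform along every row, then along every column."""
    rows = [haar_full(r) for r in a]
    return transpose([haar_full(c) for c in transpose(rows)])


def nonstandard(a):
    """One step along rows, one step along columns, then recurse on the averages quadrant."""
    a = [list(r) for r in a]
    n = len(a)
    while n > 1:
        block = [haar_step(r[:n]) for r in a[:n]]
        block = transpose([haar_step(c) for c in transpose(block)])
        for i in range(n):
            a[i][:n] = block[i]
        n //= 2
    return a


grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(standard(grid)[0][0], nonstandard(grid)[0][0])  # both equal the overall mean, 8.5
```

Both variants place the overall average in the top-left cell, but the remaining coefficients differ because the non-standard form interleaves the dimensions at every resolution level.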

NON-STANDARD DECOMPOSITION

EXAMPLE DECOMPOSITION OF A 4×4 ARRAY

MULTIDIMENSIONAL HAAR COEFFICIENTS - SEMANTICS AND REPRESENTATION

SUPPORT REGIONS AND SIGNS FOR 16 NONSTANDARD 2-DIMENSIONAL HAAR BASIS FUNCTIONS

A Haar wavelet coefficient can be represented by the triple W = <R, S, v>, where:
1) W.R is the d-dimensional support hyper-rectangle of W. Along each dimension j, 1 <= j <= d, it has a low boundary value W.R.bound[j].lo and a high boundary value W.R.bound[j].hi. Coefficient W contributes to each data cell A[i_1, ..., i_d] satisfying W.R.bound[j].lo <= i_j <= W.R.bound[j].hi for all dimensions j, 1 <= j <= d.

2) W.S stores the sign information for all d-dimensional quadrants of W.R. The two elements of the sign vector of coefficient W along dimension j are denoted W.S.sign[j].lo and W.S.sign[j].hi, corresponding to the lower and upper halves of W.R's extent along dimension j. The sign of a quadrant is computed as the product of the d sign entries that map to that quadrant.
3) W.v is the (scalar) magnitude of coefficient W. This is exactly the quantity that W contributes, with the appropriate sign, to all data array cells enclosed in W.R.
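The triple can be pictured as a small record type. The sketch below is a hypothetical rendering rather than the paper's data structure: the class and field names are invented for illustration, `value` is assumed to carry the coefficient's own sign, and the sign change is assumed to fall at the midpoint of the support along each dimension.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Coefficient:
    bounds: List[Tuple[int, int]]  # W.R: inclusive (lo, hi) per dimension
    signs: List[Tuple[int, int]]   # W.S: (sign of lower half, sign of upper half) per dimension
    value: float                   # W.v (assumed here to be signed)

    def contribution(self, cell):
        """Signed amount this coefficient adds to the given data cell."""
        total_sign = 1
        for idx, (lo, hi), (s_lo, s_hi) in zip(cell, self.bounds, self.signs):
            if not lo <= idx <= hi:
                return 0.0  # cell lies outside the support region
            mid = (lo + hi + 1) // 2  # first index of the upper half
            total_sign *= s_lo if idx < mid else s_hi
        return total_sign * self.value


# The four coefficients of W_A = [4, -2, 0, -1] for A = [2, 2, 5, 7].
coeffs = [
    Coefficient([(0, 3)], [(1, 1)], 4.0),    # overall average
    Coefficient([(0, 3)], [(1, -1)], -2.0),
    Coefficient([(0, 1)], [(1, -1)], 0.0),
    Coefficient([(2, 3)], [(1, -1)], -1.0),
]
print([sum(c.contribution((i,)) for c in coeffs) for i in range(4)])  # [2.0, 2.0, 5.0, 7.0]
```

Summing the contributions of all coefficients whose support covers a cell recovers the original data value, which is the property the query operators rely on.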

BUILDING WAVELET-COEFFICIENT SYNOPSES
Capturing the d-dimensional joint data distribution array A_R (joint frequency distribution) from the relational table R (the "set of tuples" ROLAP representation).
[Figure: relation (ROLAP) representation with attribute columns alongside the corresponding joint data distribution array]

What is the size of the wavelet-coefficient synopsis?

PROCESSING RELATIONAL QUERIES IN WAVELET-COEFFICIENT DOMAIN
Reduce relations into compact wavelet-coefficient synopses.
[Diagram: Wavelet Synopses, querying in wavelet domain, Query Results in Wavelet Domain, render, Final Approximate Results (compressed domain, FAST); versus Approximate Relations, querying in relation domain, Final Approximate Results (relation domain, SLOW)]

WAVELET QUERY PROCESSING
Each operator (e.g., select, project, join, aggregates):
• input: set of wavelet coefficients
• output: set of wavelet coefficients
Finally, the rendering step:
• input: set of wavelet coefficients
• output: (multi)set of tuples
[Diagram: select, project, and join operators chained over sets of coefficients, followed by render]

QUICK REVIEW OF NOTATIONS

SELECTION OPERATOR (SELECT)

SELECTION -- RELATIONAL DOMAIN
• In the relational domain, we are interested only in those cells inside the query range.
• In the wavelet domain, we are interested only in the coefficients that contribute to those cells.
[Figure: relation and its joint data distribution array over dimensions D1 and D2, with the query range marked]

APPROXIMATE QUERY EXECUTION ENGINE PROCESS FOR SELECT

SELECTION -- WAVELET DOMAIN
[Figure: query range over dimensions D1 and D2, before and after selection]
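The idea that only coefficients contributing to cells inside the query range matter can be sketched as an overlap test on support regions. This is a simplified illustration, not the paper's select operator: coefficients are plain dictionaries with hypothetical keys, and a fuller operator would presumably also clip each support region to the query range, which is omitted here.

```python
def overlaps(bounds, query):
    """True if the support hyper-rectangle intersects the query range in every dimension."""
    return all(lo <= q_hi and q_lo <= hi
               for (lo, hi), (q_lo, q_hi) in zip(bounds, query))


def select(coefficients, query):
    """Keep only the coefficients that can contribute to some cell in the query range."""
    return [c for c in coefficients if overlaps(c["bounds"], query)]


# Hypothetical 2-D coefficients over a 4x4 array; bounds are inclusive (lo, hi) pairs.
coefficients = [
    {"name": "W0", "bounds": [(0, 3), (0, 3)], "value": 8.5},
    {"name": "W1", "bounds": [(0, 1), (0, 1)], "value": -0.5},
    {"name": "W2", "bounds": [(2, 3), (2, 3)], "value": 1.0},
]
query = [(0, 1), (0, 3)]  # D1 in [0, 1], D2 in [0, 3]
print([c["name"] for c in select(coefficients, query)])  # ['W0', 'W1']
```

The work done is proportional to the number of retained coefficients rather than to the number of data cells, which is where the speed of the compressed domain comes from.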

PROJECTION OPERATOR (PROJECT)

PROJECTION - WAVELET DOMAIN

JOIN OPERATOR (JOIN)

EQUI-JOIN -- RELATIONAL DOMAIN
Relational domain: join count = 7*3 = (A1 - A3)*(B2 + B3)
Wavelet domain: A1*B2 + A1*B3 - A3*B2 - A3*B3
Consider all pairs of coefficients: (1) check joinability (overlap in the join dimension(s)), (2) compute output coefficients.
[Figure: joint data distributions of Relation 1 and Relation 2 over dimensions D1, D2, and D3, joined along D1. Coefficients A1 (+) and A3 (-) contribute to the Relation 1 cell with count 7; coefficients B2 (+) and B3 (+) contribute to the Relation 2 cell with count 3.]

EQUI-JOIN -- WAVELET DOMAIN
[Figure: two input coefficients with values v1 and v2 and their signed support regions over dimensions D1, D2, and D3]
Join output coefficient: v = v1 * v2
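The pairwise rule v = v1 * v2 follows from expanding the product of the two cell counts. The sketch below checks that identity numerically for a single pair of joining cells. The coefficient values are hypothetical, chosen only so that the cell counts come out to 7 and 3 as on the slide, and the joinability check is skipped because every coefficient here is assumed to cover the joining cell.

```python
from itertools import product

# (sign, value) pairs for the coefficients covering each cell; values are made up.
rel1_terms = [(+1, 9.0), (-1, 2.0)]  # A1 (+) and A3 (-): 9 - 2 = 7
rel2_terms = [(+1, 1.0), (+1, 2.0)]  # B2 (+) and B3 (+): 1 + 2 = 3


def cell_value(terms):
    """Reconstruct a cell count as the signed sum of its coefficients."""
    return sum(sign * value for sign, value in terms)


def pairwise_join(terms1, terms2):
    """One output term per pair of input coefficients: signs multiply, values multiply."""
    return [(s1 * s2, v1 * v2) for (s1, v1), (s2, v2) in product(terms1, terms2)]


relational = cell_value(rel1_terms) * cell_value(rel2_terms)
wavelet = cell_value(pairwise_join(rel1_terms, rel2_terms))
print(relational, wavelet)  # 21.0 21.0
```

Two coefficients on each side produce four output terms, A1*B2 + A1*B3 - A3*B2 - A3*B3, whose signed sum equals the join count computed directly from the cells.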

EXPERIMENTAL STUDY
• Improved answer quality
• Low synopsis construction costs
• Fast query execution

ERROR METRICS FOR SET-VALUED QUERY ANSWERS
Need an error metric for (multi)sets that accounts for both:
• differences in element frequencies
• differences in element values
Proposed solutions:
• MAC (Match-And-Compare) Error [IP99]: based on perfect bipartite graph matching
• EMD (Earth Mover's Distance) Error [CGR00, RTG98]: based on bipartite network flows

QUERY EXECUTION TIMES

SELECT-JOIN-SUM QUERY ERRORS ON REAL-LIFE DATA

SELECT QUERY ERRORS ON REAL-LIFE DATA

SELECT-SUM QUERY ERRORS ON REAL-LIFE DATA

CONCLUSION
• Multidimensional wavelets are an effective tool for general-purpose approximate query processing in modern, high-dimensional applications.
• The query processing algorithms operate directly on the wavelet-coefficient synopses of relational data, allowing very fast processing of arbitrarily complex queries entirely in the wavelet-coefficient domain.
• An extensive experimental study with synthetic as well as real-life data sets verifies the effectiveness of the wavelet-based approach compared to both sampling and histograms.

Questions??? THANK YOU