Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang.

Guidelines * Overview * Preliminaries * The New Approach * The Construction of the Algorithm * Experiments and Results * Summary

The Problem Computing multidimensional aggregates in high dimensions is a performance bottleneck for many On-Line Analytical Processing (OLAP) applications. Obtaining the exact answer to an aggregation query can be prohibitively expensive in terms of time and/or storage space in a data warehouse environment. Obviously, it is advantageous to have fast, approximate answers to OLAP aggregation queries.

Processing Methods There are two classes of methods for processing OLAP queries. Exact methods focus on how to compute the exact data cube. Approximate methods are becoming attractive in OLAP applications, and have been used in DBMSs for a long time. In choosing a proper approximation technique, there are two concerns: efficiency and accuracy.

Histograms and Sampling Methods Histograms and sampling are used in a variety of important applications where quick approximations of an array of values are needed. Advantages: they are simple and natural, and the construction procedure is very efficient. Disadvantages: they are inefficient to construct in high dimensions and cannot fit in internal memory. The use of wavelet-based techniques to construct analogs of histograms in databases has shown substantial improvements in accuracy over random sampling and other histogram-based approaches.

The Intended Solution Traditional histograms are infeasible for massive high dimensional data sets. Previously developed wavelet techniques are efficient only for dense data. Previous approximation techniques do not give accurate enough results for typical queries. The proposed method provides approximate answers to high dimensional OLAP aggregation queries over MASSIVE SPARSE DATA SETS in a time efficient and space efficient manner.

The Compact Data Cube The performance of this method depends on the compact data cube, which is an approximate and space efficient representation of the underlying multidimensional array, based upon multiresolution wavelet decomposition. In the on-line phase, each aggregation query can generally be answered using the compact data cube in one I/O or a small number of I/Os, depending upon the desired accuracy.

The Data Set A particular characteristic of the data sets is that they are MASSIVE AND SPARSE. D = {D1, ..., Dd} denotes the set of dimensions. S denotes the d-dimensional array which represents the underlying data. N = |D1| x |D2| x ... x |Dd| denotes the total size of array S, where |Di| is the size of dimension Di. Each entry of S contains the value of the measure attribute for the corresponding combination of the functional attributes. Nz is defined to be the number of populated (nonzero) entries in S.

Range Sum Queries An important class of aggregation queries is the so-called range sum queries, which are defined by applying the sum operation over a selected continuous range in the domain of some of the attributes. A range sum query can generally be formulated as the sum of S(x1, ..., xd) over all cells satisfying li <= xi <= hi in each queried dimension i.

The d'-Dimensional Range Sum Queries An interesting subset of the general range sum queries are the d'-dimensional range sum queries, in which d' << d. In this case ranges are specified for only d' dimensions, and the ranges for the other d - d' dimensions are implicitly set to be the entire domain.

Traditional vs. New Approach In traditional approaches to answering range sum queries using the data cube, all the subcubes of the data cube need to be precomputed and stored. When a query is given, a search is conducted in the data cube and the relevant information is fetched. In the new approach, as usual some preprocessing work is done on the original arrays, but instead of computing and storing all the subcubes, only one, much smaller compact data cube is stored. The compact data cube usually fits in one or a small number of disk blocks.

Approximation Advantages This approach is preferable to the traditional approaches in two important respects. It saves storage space for both the precomputation and the storage of the precomputed data cube. Furthermore, even when a huge amount of storage space is available and all the data cubes can be stored comfortably, it may take too long to answer a range sum query, since all cells covered by the range need to be accessed.

I/O Model The conventional parallel disk model. Restriction: I = 1.

The Method Outline The method can be divided into three sequential phases: 1. Decomposition 2. Thresholding 3. Reconstruction

Decomposition Compute the wavelet decomposition of the multidimensional array S, obtaining a set of C' wavelet coefficients (C' ~ Nz). As is common in practice, the array is assumed to be very sparse.

Thresholding and Ranking Keep only C (<< C') wavelet coefficients, corresponding to the desired storage usage and accuracy. Rank the C wavelet coefficients according to their importance in the context of accurately answering typical aggregation queries. The C ordered coefficients compose the compact data cube.

Reconstruction In the on-line phase, an aggregation query is processed by using the K most significant coefficients to reconstruct an approximate answer. The choice of K depends upon the time the user is willing to spend. Notes: more accurate answers can be provided upon request, and efficiency is crucial, since it affects the query response time directly.

Wavelet Decomposition Wavelets are a mathematical tool for the hierarchical decomposition of functions in a space efficient manner. HAAR wavelets are conceptually very simple wavelet basis functions: fast to compute and easy to implement.

HAAR Wavelet - Example Suppose we have a one-dimensional signal of N = 8 data items: S = [2, 2, 0, 2, 3, 5, 4, 4]. Averaging neighboring pairs gives the lower-resolution signal [2, 1, 4, 4]; the information lost is captured by the detail coefficients [0, -1, -1, 0] (half the difference of each pair). By repeating this process recursively on the averages, we get the full wavelet transform: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0].
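The pairwise averaging and differencing can be sketched in a few lines of Python (a minimal illustration, not the authors' implementation; it assumes the average/half-difference convention used in the example and a signal length that is a power of two):

```python
def haar_decompose(signal):
    """One-dimensional Haar wavelet transform (average/difference form).

    Each pass replaces the current working prefix with pairwise
    averages followed by pairwise half-differences (the detail
    coefficients), then recurses on the averages.
    """
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        avgs = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(half)]
        dets = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(half)]
        coeffs[:n] = avgs + dets  # details stay in place; next pass works on avgs
        n = half
    return coeffs

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# first pass gives [2, 1, 4, 4 | 0, -1, -1, 0]; the full transform is
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```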

Wavelet Transform The wavelet transform consists of a single coefficient representing the overall average of the original signal, followed by the detail coefficients in order of increasing resolution. The individual entries are called the wavelet coefficients. Coefficients at a lower resolution are weighted more than the ones at a higher resolution. The decomposition is very efficient: O(N) CPU time and O(N/B) I/Os.

Building The Compact Data Cube The goal of this step is to compute the wavelet decomposition of the multidimensional array S, obtaining a set of C' wavelet coefficients. 1. Partition the d dimensions into g groups G1, ..., Gg, for some 1 <= g <= d, where i0 = 0 and ig = d; each group Gj must satisfy a memory constraint, so that each pass fits in internal memory. 2. The algorithm for constructing the compact data cube consists of g passes: Gj is read into memory, the multidimensional decomposition is performed, and the results are written out to be used for the next pass.

Eliminating Intermediate Results One problem is that the density of the intermediate results will increase from pass to pass, since performing wavelet decomposition on sparse data usually produces more nonzero data. The natural solution is truncation, keeping roughly only Nz entries. Learning process: during each pass, on-line statistics of the wavelet coefficients are kept to maintain a cutoff value. Any entry whose absolute value is below the cutoff value is thrown away on the fly.
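The on-the-fly cutoff can be sketched with a bounded min-heap: keep only the (roughly Nz) largest-magnitude coefficients seen so far, so the smallest retained magnitude acts as the running cutoff value. This is a simplified stand-in for the on-line statistics the algorithm maintains, not the paper's exact procedure:

```python
import heapq

def truncate_stream(coeff_stream, keep):
    """Retain only the `keep` largest-magnitude coefficients seen so far.

    coeff_stream yields (position, value) pairs.  The heap minimum is
    the current cutoff: anything with a smaller absolute value is
    dropped on the fly instead of being written out.
    """
    heap = []  # min-heap of (|value|, position, value)
    for pos, val in coeff_stream:
        if len(heap) < keep:
            heapq.heappush(heap, (abs(val), pos, val))
        elif abs(val) > heap[0][0]:
            heapq.heapreplace(heap, (abs(val), pos, val))
        # else: val falls below the current cutoff and is discarded
    return {pos: val for _, pos, val in heap}

print(truncate_stream(enumerate([0.1, 5, -3, 0.2, 4]), keep=3))
# keeps the three largest-magnitude entries: {1: 5, 2: -3, 4: 4}
```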

Thresholding and Ranking Given the storage limitation for the compact data cube, it is possible to keep only a limited number of wavelet coefficients. Let C' be the number of wavelet coefficients produced by the decomposition, and C the number of wavelet coefficients that can be stored. Since C << C', the goal is to determine which are the best C coefficients to keep, so as to minimize the error of approximation.

P-norm Once the error measure is decided for individual queries, it is meaningful to choose a norm by which to measure the error of a collection of queries. Let e = (e1, ..., eQ) be the vector of errors over a sequence of Q queries; its p-norm is ||e||p = (|e1|^p + ... + |eQ|^p)^(1/p).
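The p-norm of an error vector is straightforward to compute directly; for instance the 2-norm, which the thresholding step below provably optimizes:

```python
def p_norm(errors, p):
    """p-norm of an error vector: (sum_i |e_i|^p) ** (1/p)."""
    return sum(abs(e) ** p for e in errors) ** (1 / p)

print(p_norm([3, 4], 2))  # 2-norm of (3, 4) is 5.0
```

The 1-norm measures total absolute error, the 2-norm penalizes large individual errors more heavily, and as p grows the norm approaches the maximum error over the query collection.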

Choosing the Coefficients Choosing the C largest (in absolute value) wavelet coefficients after normalization is provably optimal in minimizing the 2-norm. But if a coefficient ci is more likely than another to contribute to typical queries, its weight w(ci) will be greater. Finally: 1. Pick the C'' (C < C'' < C') largest wavelet coefficients. 2. Among the C'' coefficients, choose the C with the largest weights. 3. Order the C coefficients in decreasing order to get the compact data cube.
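The normalization step can be sketched as follows. This assumes the one-dimensional average/half-difference Haar layout from the earlier example, where normalizing for the 2-norm amounts to scaling each coefficient by the square root of its support width; the query-likelihood weighting w(ci) is omitted for simplicity:

```python
import math

def top_c_coefficients(coeffs, c):
    """Keep the c coefficients that are largest after 2-norm normalization.

    In the average/difference Haar convention, the basis vector of a
    coefficient with support width w has 2-norm sqrt(w), so we rank
    by |value| * sqrt(w).  Returns (index, value) pairs.
    """
    n = len(coeffs)
    ranked = []
    for j, val in enumerate(coeffs):
        if j == 0:
            width = n                       # overall average spans the whole signal
        else:
            level = j.bit_length() - 1      # resolution level of detail coefficient j
            width = n // (2 ** level)       # support shrinks as resolution grows
        ranked.append((abs(val) * math.sqrt(width), j, val))
    ranked.sort(reverse=True)
    return [(j, val) for _, j, val in ranked[:c]]

# full transform of the example signal S = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
print(top_c_coefficients(coeffs, 3))
```

Note how normalization matters: the coarse coefficients 2.75 and -1.25 dominate because their support covers the whole signal, even though some finer coefficients have comparable raw magnitudes.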

Answering On-Line Queries The error tree is built based upon the wavelet transform procedure: it mirrors the wavelet transform and is constructed bottom-up. S(l:h) denotes the range sum between s(l) and s(h).

Constructing The Original Signal The original signal S can be reconstructed from the tree nodes: each entry s(i) is a signed sum of the coefficients on the path from the root of the error tree down to leaf i. Not all terms are always evaluated; only the true contributors are quickly evaluated when answering a query.

Answering A Query To answer a query from the compact data cube R, using k coefficients, the following algorithm is used:

AnswerQuery(R, k, l1, h1, ..., ld', hd')
  answer = 0
  for i = 1, 2, ..., k do
    if Contribute(R[i], l1, h1, ..., ld', hd') then
      answer = answer + Compute_Contribute(R[i], l1, h1, ..., ld', hd')
  for j = d'+1, ..., d do
    answer = answer * |Dj|
  return answer
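As a concrete one-dimensional illustration of the Contribute/Compute_Contribute idea (a sketch assuming the average/half-difference Haar transform from the earlier example, not the paper's general d-dimensional procedure): the overall average contributes its value times the query length, and each detail coefficient contributes its value times the difference between the query's overlap with its left and right half-intervals.

```python
def haar_decompose(signal):
    """Average/difference Haar transform, as in the earlier example."""
    coeffs, n = list(signal), len(signal)
    while n > 1:
        half = n // 2
        avgs = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(half)]
        dets = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(half)]
        coeffs[:n], n = avgs + dets, half
    return coeffs

def overlap(lo, hi, a, b):
    """Number of integer positions in [lo, hi] that also lie in [a, b]."""
    return max(0, min(hi, b) - max(lo, a) + 1)

def range_sum(coeffs, lo, hi):
    """Range sum S(lo:hi) evaluated directly from the coefficient vector.

    A coefficient contributes nothing when it does not overlap the
    query, or when its two halves overlap it equally - this is the
    Contribute test; the signed-overlap product is Compute_Contribute.
    """
    n = len(coeffs)
    total = coeffs[0] * (hi - lo + 1)   # overall average covers every cell
    j, per_level = 1, 1
    while j < n:
        width = n // per_level           # support width at this resolution level
        for k in range(per_level):
            c = coeffs[j + k]
            if c == 0:
                continue                 # thresholded-away coefficients cost nothing
            start = k * width
            mid = start + width // 2
            left = overlap(lo, hi, start, mid - 1)
            right = overlap(lo, hi, mid, start + width - 1)
            total += c * (left - right)
        j += per_level
        per_level *= 2
    return total

coeffs = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
print(range_sum(coeffs, 2, 5))  # exact answer: 0 + 2 + 3 + 5 = 10.0
```

With the full coefficient vector the answer is exact; dropping all but the k most significant coefficients (as AnswerQuery does) turns this into the approximate evaluation, and in the d-dimensional case the unqueried dimensions are handled by the final multiplication by |Dj|.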

Experiments Description The experimental results were obtained using real-world data from the U.S. Census Bureau. The data file contains 372 attributes. The measure attribute is income. The functional attributes include, among others: age, sex, education, race, origin. Although the dimension sizes are generally small, the high dimensionality results in a 10-dimensional array with more than 16,000,000 cells, density ~ 0.001, Nz = 15,985. Platform: Digital Alpha workstation running Digital Unix, internal memory of which only 1-10 MB are used by the program, logical block transfer size 2 x 4 KB.

Experiments Sets - Variable Density Dimension groups were partitioned to satisfy the M/2B condition; for all data sets g = 2. The small differences in running time were mainly caused by the on-line cutoff effect.

Experiments Sets - Fixed Density The running time scales almost linearly with respect to the input data size.

Accuracy of the Approximate Answers Comparison with traditional histograms is not meaningful, because they are too inefficient to construct for high dimensional data. Comparison with random sampling algorithms depends on the distribution of the nonzero entries (random sampling performs better for uniform distributions).

Summary A new wavelet technique for approximate answers to OLAP range sum queries was presented. Four important issues were discussed and resolved: I/O efficiency of the data cube construction, especially when the underlying multidimensional array is very sparse; response time in answering an on-line query; accuracy in answering typical OLAP queries; and progressive refinement.