Dense-Region Based Compact Data Cube

Dense-Region Based Compact Data Cube Presented by: Kan Kin Fai

Outline Background Introduction to Compact Data Cube Pros and cons of the Compact Data Cube method Dense-Region Based Compact Data Cube

Background What is a data cube? A set of pre-computed aggregates on the underlying data warehouse. There are system constraints on materializing data cube(s): disk space, maintenance cost, etc. Common approach: materialize only parts of a data cube. Alternative: use an approximation technique. Reason: OLAP applications accept approximate answers in many scenarios.

Introduction to Compact Data Cube Compact Data Cube was proposed by Vitter and Wang in Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets (SIGMOD 99). Main Ideas Offline phase: perform Haar wavelet transform on the underlying data (i.e. the base cuboid) and store the k most significant coefficients. Online phase: process any given query based on the k most significant coefficients.

Introduction to Compact Data Cube Basics of Haar wavelet transform Building Compact Data Cube Thresholding and Ranking Answering On-Line Queries

Introduction to Compact Data Cube Basics of Haar wavelet transform, e.g. S = [2, 2, 0, 2, 3, 5, 4, 4]
Resolution   Averages                    Detail Coefficients
8            [2, 2, 0, 2, 3, 5, 4, 4]    -
4            [2, 1, 4, 4]                [0, -1, -1, 0]
2            [1.5, 4]                    [0.5, 0]
1            [2.75]                      [-1.25]
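The averaging-and-differencing table above can be reproduced with a short sketch (plain Python, unnormalized coefficients; the function name is ours):

```python
def haar_decompose(signal):
    """One-dimensional Haar wavelet transform (unnormalized).

    Returns the overall average followed by the detail
    coefficients, coarsest resolution first. The length of
    the signal is assumed to be a power of two.
    """
    details_acc = []
    s = list(signal)
    while len(s) > 1:
        averages = [(s[i] + s[i + 1]) / 2 for i in range(0, len(s), 2)]
        details = [(s[i] - s[i + 1]) / 2 for i in range(0, len(s), 2)]
        details_acc = details + details_acc  # coarser details go in front
        s = averages
    return s + details_acc  # [overall average, detail coefficients...]

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

The output lines up with the table: 2.75 is the overall average, -1.25 the coarsest detail, then [0.5, 0] and [0, -1, -1, 0] at the finer resolutions.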

Introduction to Compact Data Cube Basics of Haar wavelet transform For compression reasons, the detail coefficients are normalized; coefficients at the lower resolutions are weighted more heavily. The original signal is approximated by keeping only the most significant coefficients. The transform requires only O(N) CPU time and O(N/B) I/Os (B being the disk block size) for a signal of N values. A multidimensional wavelet transform is a series of one-dimensional wavelet transforms.
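One common way to realize this weighting is to rescale to the orthonormal Haar basis; the sketch below assumes that scheme (it may differ in detail from the paper's normalization):

```python
import math

def normalize_haar(coeffs):
    """Rescale unnormalized Haar coefficients to the orthonormal basis.

    Layout assumed: [overall average, details coarsest-first], length
    N a power of two. A level with 2**j detail coefficients is scaled
    by sqrt(N / 2**j), so coarser levels (small j) get larger weights.
    """
    n = len(coeffs)
    out = [coeffs[0] * math.sqrt(n)]  # the overall average scales by sqrt(N)
    idx, level_size = 1, 1
    while idx < n:
        factor = math.sqrt(n / level_size)
        out.extend(c * factor for c in coeffs[idx:idx + level_size])
        idx += level_size
        level_size *= 2
    return out
```

With the running example S = [2, 2, 0, 2, 3, 5, 4, 4], the normalized coefficients satisfy Parseval's identity: the sum of their squares equals the energy of the original signal (78 here), which is why magnitude-based thresholding bounds the approximation error.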

Introduction to Compact Data Cube Building the Compact Data Cube Problem 1: the multidimensional array representing the underlying data is too large, even though the data are very sparse. Solution: divide the wavelet transform process into multiple passes.

Introduction to Compact Data Cube Building the Compact Data Cube Problem 2: the density of the intermediate results increases from pass to pass. Solution: truncate the intermediate multidimensional array by cutting off entries with small magnitude. I/O complexity:
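On a sparse {cell: value} representation, the per-pass truncation amounts to the following sketch (the threshold epsilon is our own illustrative parameter):

```python
def truncate_small(entries, epsilon):
    """Drop intermediate entries whose magnitude falls below epsilon,
    keeping the sparse representation from growing denser each pass.

    entries: dict mapping a cell coordinate tuple to its value.
    """
    return {cell: v for cell, v in entries.items() if abs(v) >= epsilon}

print(truncate_small({(0, 0): 5.0, (0, 1): 0.01, (1, 1): -2.0}, 0.1))
# {(0, 0): 5.0, (1, 1): -2.0}
```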

Introduction to Compact Data Cube Thresholding and Ranking Choice 1: keep the C largest (in absolute value) wavelet coefficients. Choice 2: keep the C wavelet coefficients with the largest weights among the C’ largest coefficients (C < C’). The weight of a coefficient equals the number of its dimensions with value zero.
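Choice 1 can be sketched as follows (the function name and sparse-dict output format are ours):

```python
def keep_largest(coeffs, c):
    """Choice 1: retain the c largest-magnitude wavelet coefficients.

    Returns a sparse {index: value} synopsis; dropped coefficients
    are implicitly zero at query time.
    """
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]),
                    reverse=True)
    return {i: coeffs[i] for i in ranked[:c]}

print(keep_largest([2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0], 3))
# {0: 2.75, 1: -1.25, 5: -1.0}
```

Here the three retained coefficients come from the running example: the overall average, the coarsest detail, and the first of the two tied -1.0 details (Python's sort is stable, so ties keep their original order).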

Introduction to Compact Data Cube Answering On-Line Queries Space: O((d+1)k), CPU time:
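A point query can be answered by running the inverse transform against the sparse synopsis, treating missing coefficients as zero. A one-dimensional sketch, using the unnormalized layout from the earlier table:

```python
def reconstruct(synopsis, n):
    """Inverse 1-D Haar transform from a sparse {index: coeff} synopsis.

    n must be a power of two; coefficients are assumed unnormalized and
    laid out as [average, details coarsest-first]. Absent indices read
    as zero, which is what makes the answer approximate.
    """
    values = [synopsis.get(0, 0.0)]
    idx = 1
    while len(values) < n:
        count = len(values)
        details = [synopsis.get(idx + k, 0.0) for k in range(count)]
        # each average a and detail d expand into the pair (a + d, a - d)
        values = [v for a, d in zip(values, details) for v in (a + d, a - d)]
        idx += count
    return values
```

With all eight coefficients of the running example, the original signal [2, 2, 0, 2, 3, 5, 4, 4] comes back exactly; with only the most significant few, the result is an approximation whose quality improves as more coefficients are kept.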

Pros and cons of the compact data cube method Requires little disk space (a small number of disk blocks). Responds to on-line queries quickly. Answers OLAP queries more accurately than other approximation techniques such as histograms and random sampling. Can progressively refine the approximate answer with no added overhead.

Pros and cons of the compact data cube method Approximates a vast number of useless empty cells in the base cuboid together with the useful non-empty cells. Needs to cut off entries with small magnitude at the end of each pass in order to keep the number of I/O operations constant from pass to pass.

Dense-Region Based Compact Data Cube Aim: to enhance the ability of the compact data cube method to handle datasets with the dense-regions-in-a-sparse-cube property. Main idea: exclude empty cells in the base cuboid from approximation. Two-phase approach: Compute dense regions in the base cuboid. Approximate each dense region independently.

Dense-Region Based Compact Data Cube Question 1: how can we find the dense regions efficiently? Use the Efficient DEnse region Mining (EDEM) algorithm, proposed by Cheung et al. in DROLAP: A Dense-Region-Based Approach to On-line Analytical Processing (DEXA 99).

Dense-Region Based Compact Data Cube Basic ideas of EDEM: Build a k-d tree to store the valid cells. Grow dense region covers along boundaries. Search for dense regions among the covers. Complexity of EDEM: linear in the number of dimensions and sub-quadratic in the number of data points.

Dense-Region Based Compact Data Cube Question 2: how should we allocate disk space in approximating the dense regions? Choice 1: allocate disk space equally to each dense region. Choice 2: allocate disk space according to the sizes of the dense regions. Choice 3: order the wavelet coefficients of all the dense regions and keep the most significant ones (in absolute value).
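Choice 2 amounts to proportional allocation; a sketch (the remainder-handling tie-break is our own illustrative addition, not prescribed by the slides):

```python
def allocate_budget(region_sizes, budget):
    """Split a total coefficient budget across dense regions in
    proportion to their sizes (Choice 2)."""
    total = sum(region_sizes)
    alloc = [budget * size // total for size in region_sizes]
    leftover = budget - sum(alloc)
    # hand any leftover units to the largest regions first (illustrative)
    order = sorted(range(len(region_sizes)),
                   key=lambda i: region_sizes[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc

print(allocate_budget([100, 250, 650], 10))
# [1, 2, 7]
```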

Dense-Region Based Compact Data Cube Question 3: how should we treat the data points outside the dense regions? Keep all of them, or keep only the significant ones. Question 4: how do we answer on-line queries using the dense-region based approach? Check whether each dense region is covered by the given query. Check whether the stored coefficients contribute to the range sum, and compute the amount of contribution if needed.
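The coverage check in Question 4 is plain box containment; a sketch, representing both the region and the query as per-dimension (low, high) ranges, inclusive:

```python
def region_covered(region, query):
    """True if a dense region lies entirely inside the query box.

    region, query: sequences of (low, high) pairs, one per dimension.
    """
    return all(q_lo <= r_lo and r_hi <= q_hi
               for (r_lo, r_hi), (q_lo, q_hi) in zip(region, query))

print(region_covered([(2, 4), (1, 3)], [(0, 5), (0, 5)]))  # True
print(region_covered([(2, 6), (1, 3)], [(0, 5), (0, 5)]))  # False
```

A fully covered region can contribute its total (stored with the region) directly; only partially overlapped regions need to be estimated from their wavelet coefficients.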

Dense-Region Based Compact Data Cube One favorable side effect: we may parallelize the construction of the compact data cube. More questions: How can we handle updates to the underlying data? How can we approximate an iceberg cube? Can we apply the idea of the compact data cube to iceberg cubes? Can the compact data cube be used to answer other types of OLAP queries besides range-sums?