PrefixCube: Prefix-sharing Condensed Data Cube. Jianlin Feng, Qiong Fang, Hulin Ding. Huazhong Univ. of Sci. & Tech. Nov 12, 2004.



DOLAP Jianlin Feng

Outline
• Introduction
• Related Work
• ODM: Ordered Datacube Model
• BST-Condensed Cube
• Prefix-sharing Condensed Cube
• Comparisons
• Conclusions

Introduction
• Data Cube (ICDE’96)
  – N-dimensional cube(A1, A2, …, AN)
  – 2^N cuboids, i.e. GROUP-BYs
• The Huge Size Problem
  – When the source relation R is sparse, the size of a cuboid can be close to the size of R.
  – Even the I/O cost of storing the cube result tuples becomes dominant.
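A minimal sketch of the size problem, assuming SUM as the aggregate and a tiny made-up relation R(A, B, C, m). The function name `full_cube` and the data are illustrative only and do not come from the slides.

```python
from itertools import combinations
from collections import defaultdict

def full_cube(rows, n_dims):
    """Return {cuboid: {group_key: SUM(measure)}} for all 2^n_dims cuboids.

    Each row is (d1, ..., dN, measure); a cuboid is named by the tuple of
    dimension indexes it groups by.
    """
    cube = {}
    for k in range(n_dims + 1):
        for dims in combinations(range(n_dims), k):
            groups = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                groups[key] += row[n_dims]
            cube[dims] = dict(groups)
    return cube

# Hypothetical sparse relation: no two rows share a value on any dimension.
R = [(1, 1, 1, 10), (2, 2, 2, 20), (3, 3, 3, 30)]
cube = full_cube(R, 3)
print(len(cube))                              # 8 cuboids = 2^3
print({d: len(g) for d, g in cube.items()})   # tuples per cuboid
```

With this sparse R, every cuboid except the grand total holds as many tuples as R itself, which is the situation the slide describes.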

Related Work
• Condensed Cube (ICDE’02)
• Dwarf (SIGMOD’02)
• Quotient Cube (VLDB’02)
• QC-Tree (SIGMOD’03)
• Basic idea: remove redundancies existing among cube tuples.
  – prefix redundancy
  – suffix redundancy

Prefix Redundancy
• Given an example cube(A, B, C):
  – Each value of dimension A occurs in 4 cuboids: cuboid(A), (AB), (AC) and (ABC).
  – It possibly occurs many times within each of these cuboids except cuboid(A).
• Hence there is both inter-cuboid and intra-cuboid prefix redundancy.
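The counts on this slide can be checked with a short sketch. The rows below are hypothetical and only serve to show how often one value of A is written within each cuboid.

```python
from itertools import combinations

dims = ("A", "B", "C")
cuboids = [c for k in range(len(dims) + 1) for c in combinations(dims, k)]

# Inter-cuboid: the cuboids whose GROUP-BY list contains A.
with_a = [c for c in cuboids if "A" in c]
print(with_a)  # [('A',), ('A', 'B'), ('A', 'C'), ('A', 'B', 'C')]

# Intra-cuboid: how often the value A=1 is written inside each of them.
# Hypothetical rows (A, B, C); not taken from the slides.
rows = [(1, 1, 1), (1, 2, 1), (1, 2, 2)]
repeats = {}
for c in with_a:
    idx = [dims.index(d) for d in c]
    groups = {tuple(r[i] for i in idx) for r in rows}
    repeats[c] = sum(1 for g in groups if g[0] == 1)
print(repeats)
```

Only cuboid(A) stores the value once; the other three repeat it once per distinct group.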

Suffix Redundancy
• Occurs when cube tuples belonging to different cuboids are actually aggregated from the same group of base relation tuples.
• An extreme case:
  – Let the source relation R have only one single tuple r(a1, a2, …, an, m);
  – the 2^n cube tuples can be condensed into one physical tuple (a1, a2, …, an, V), where V = aggr(r);
  – together with some information indicating that it is a representative tuple.
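The extreme case can be reproduced in a few lines. This is a sketch that assumes SUM as the aggregate and writes ALL as "*"; the helper name is illustrative.

```python
from itertools import product

ALL = "*"

def cube_tuples_of_single_row(row):
    """Enumerate the 2^n cube tuples produced by a one-tuple relation."""
    *dims, measure = row
    out = []
    for mask in product((True, False), repeat=len(dims)):
        key = tuple(v if keep else ALL for v, keep in zip(dims, mask))
        out.append((key, measure))  # SUM over a single row is its measure
    return out

r = ("a1", "a2", "a3", 7)
tuples = cube_tuples_of_single_row(r)
print(len(tuples))                 # 8
print({agg for _, agg in tuples})  # {7}
```

All eight cube tuples carry the same aggregate, so a single stored tuple plus a marker is enough to regenerate them.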

Thinking…
• Condensed cube
  – It condenses the cube tuples that are aggregated from one single base tuple into one physical tuple in order to reduce the cube’s size.
• Dwarf
  – Besides suffix coalescing, i.e. multi-base-tuple condensing, it also realizes full prefix-sharing, which makes its cube size reduction highly effective.

Motivation
• How can the condensed cube’s size be reduced further while taking into account the kind of query we intend to answer, namely range queries?
• By augmenting BST condensing with the removal of intra-cuboid prefix redundancy!

Ordered Datacube Model
• The value ALL (or *) is encoded as 0.
• For a dimension D with cardinality C:
  – each dimension value is mapped one-to-one to an integer between 1 and C inclusive.
• N dimensions form an N-dimensional space.
• The origin O(0, 0, …, 0) represents the grand total.
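A minimal sketch of the encoding. It assumes that codes are assigned in sorted order of the dimension values, which the slide does not state; the dimension names and values are made up.

```python
ALL = 0  # the slide reserves 0 for ALL

def build_encoder(values):
    """Map the distinct values of one dimension to 1..C."""
    ordered = sorted(set(values))  # sorted order is an assumption
    return {v: i for i, v in enumerate(ordered, start=1)}

def encode(point, encoders):
    """Encode one cube tuple; None stands for ALL on that dimension."""
    return tuple(ALL if v is None else enc[v]
                 for v, enc in zip(point, encoders))

city = build_encoder(["Wuhan", "Beijing", "Shanghai", "Wuhan"])
year = build_encoder([2003, 2004])

print(city)                                   # {'Beijing': 1, 'Shanghai': 2, 'Wuhan': 3}
print(encode(("Wuhan", None), (city, year)))  # (3, 0)
print(encode((None, None), (city, year)))     # (0, 0), the grand total
```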

Ordered Datacube Model
• Under ODM, a range query against a data cube can be reduced to a sub-query against only one particular cuboid in the cube, or to a union of such sub-queries.
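One way to read this reduction is sketched below. It assumes a range query is a box in ODM space given as one inclusive (lo, hi) pair per dimension with 0 <= lo <= hi; this interpretation and the function name are assumptions, not details taken from the slides.

```python
from itertools import product

def split_by_cuboid(ranges):
    """Split a box query in ODM space into per-cuboid sub-queries.

    A dimension whose range includes 0 contributes an ALL slice (0, 0);
    a range reaching 1 or more contributes a grouped slice. Every
    returned box lies entirely inside one cuboid.
    """
    per_dim = []
    for lo, hi in ranges:
        parts = []
        if lo == 0:
            parts.append((0, 0))            # ALL on this dimension
        if hi >= 1:
            parts.append((max(lo, 1), hi))  # real dimension values
        per_dim.append(parts)
    return [tuple(box) for box in product(*per_dim)]

# A in [2, 5], B = ALL: a single sub-query against cuboid(A).
print(split_by_cuboid([(2, 5), (0, 0)]))
# A in [0, 3], B in [1, 4]: a union of sub-queries on cuboid(B) and cuboid(AB).
print(split_by_cuboid([(0, 3), (1, 4)]))
```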

BST-Condensed Cube
• Base Single Tuple (BST)
  – t1 is a BST on SD {A} and {B}
  – t2 is a BST on SD {B}
• Fully exploiting each BST with all of its SDs yields a unique minimal BST-condensed cube, called MinCube.
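The following sketch assumes the usual condensed-cube reading of these terms: a base tuple is a BST on a dimension set SD if it is the only tuple in its partition when the relation is partitioned on SD. The relation below is hypothetical and is not the table shown on the slide.

```python
from collections import Counter
from itertools import combinations

def single_dimension_sets(rows, n_dims):
    """For each row index, list the dimension sets on which it is alone."""
    result = {i: [] for i in range(len(rows))}
    for k in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), k):
            counts = Counter(tuple(r[d] for d in dims) for r in rows)
            for i, r in enumerate(rows):
                if counts[tuple(r[d] for d in dims)] == 1:
                    result[i].append(dims)
    return result

# Hypothetical relation R(A, B, m); dimension 0 is A, dimension 1 is B.
R = [(1, 1, 10),   # t1: unique on A and on B
     (2, 2, 20),   # t2: shares A=2 with t3, unique on B
     (2, 3, 30)]   # t3: shares A=2 with t2, unique on B
sds = single_dimension_sets(R, 2)
print(sds[0])  # [(0,), (1,), (0, 1)]
print(sds[1])  # [(1,), (0, 1)]
```

In this made-up relation t1 is a BST on {A} and on {B}, while t2 is a BST on {B} but not on {A}, mirroring the pattern named on the slide.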

BU-BST Condensed Cube
• BottomUpBST algorithm (ICDE’02)
• Each BST corresponds to only one SD.
• Compared with MinCube, it is easier to compute, and normal cube tuples are easier to restore from the condensed cube.
Note: BST condensing is a special kind of prefix-sharing! A group of cube tuples sharing a prefix is represented by one BST.

A BU-BST Condensed Cube Example
Note:
• Intra-cuboid prefix redundancy: ct3 and ct4
• Inter-cuboid prefix redundancy: ct2, ct3 and ct5

Prefix-sharing Condensed Cube: PrefixCube
• PrefixCube = BST condensing + intra-cuboid prefix-sharing
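The saving from intra-cuboid prefix-sharing can be illustrated by counting stored dimension values. This sketch uses a nested dictionary as a stand-in for shared prefixes; it is not the physical layout used by PrefixCube, and the cuboid contents are made up.

```python
def flat_cells(tuples):
    """Dimension values stored when every tuple is written out in full."""
    return sum(len(t) for t in tuples)

def shared_cells(tuples):
    """Dimension values stored when tuples with a common prefix share it."""
    trie = {}
    cells = 0
    for t in tuples:
        node = trie
        for v in t:
            if v not in node:   # a new value is stored only once per prefix
                node[v] = {}
                cells += 1
            node = node[v]
    return cells

# Hypothetical tuples of cuboid(AB), already encoded as integers.
cuboid_ab = [(1, 1), (1, 2), (1, 3), (2, 1)]
print(flat_cells(cuboid_ab))    # 8
print(shared_cells(cuboid_ab))  # 6
```

The three tuples starting with A=1 store that value once instead of three times.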

A PrefixCube Example

Corresponding Dwarf

PrefixCube vs. Dwarf

                              PrefixCube        Dwarf
Prefix-sharing                Intra-cuboid      Inter- and intra-cuboid
Suffix coalescing             BST condensing    Multi-tuple condensing
Compression ratio             Lower             Higher
Saving extra value ALL?       No                Yes
Tuples clustered by cuboid?   Yes               No

PrefixCube does not blindly pursue the best compression ratio; it is intended to strike a good compromise among cube size reduction ratio, restoring and updating costs, and query characteristics!

Effectiveness of Size Reduction
• Datasets
  – synthetic datasets with uniform distribution
  – # of tuples: 1,000,000
Figure panels: (a) Cardinality = 100; (b) Cardinality = 1000

Effectiveness of Size Reduction
• PrefixBUC
  – Full Cube (computed by BUC)
  – Prefix-sharing

Impact of Data Density
• Datasets
  – Uniform distribution
  – # of dimensions: 6
  – Cardinality of dimensions: 100
  – # of tuples: ranging from 1,000 to 1,000,000

Impact of Data Skewness
• Datasets
  – Zipf distribution
  – # of tuples: 1,000,000
  – Cardinality of dimensions: from 1,000 down to 500 in steps of 100
  – Zipf factor: from 0 to 0.8 in steps of 0.2
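The slides do not say how the skewed data was generated. As a rough sketch under a common convention, the value of rank k is drawn with probability proportional to 1/k^theta, so a Zipf factor of 0 gives a uniform column. The function name, the seed and the small sizes are assumptions for illustration.

```python
import random

def zipf_column(n, cardinality, theta, seed=0):
    """Draw n values from 1..cardinality with P(k) proportional to 1/k^theta."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** theta) for rank in range(1, cardinality + 1)]
    return rng.choices(range(1, cardinality + 1), weights=weights, k=n)

# Small illustrative sizes; the experiments on the slide use 1,000,000 tuples.
uniform = zipf_column(10_000, 500, theta=0.0)
skewed = zipf_column(10_000, 500, theta=0.8)
print(uniform.count(1), skewed.count(1))  # value 1 is far more frequent when skewed
```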

Real-world Dataset
• Datasets
  – Weather datasets
  – # of tuples: 1,015,367

Conclusion
• A new cube structure, PrefixCube, was proposed by augmenting BU-BST condensing with intra-cuboid prefix-sharing.
  – It can greatly reduce the data cube’s size compared with the BU-BST condensed cube.
  – It can also reduce the impact of data skew on BU-BST condensing.
  – It achieves a fairly stable size reduction on both dense and sparse datasets.

The End
Thank you! Any questions?