SF-Tree and Its Application to OLAP Speaker: Ho Wai Shing.

Overview: Introduction; Basic OLAP technologies (ROLAP and MOLAP); Structure of an SF-Tree; Using SF-Tree for OLAP; Conclusion.

Introduction: On-line Analytical Processing (OLAP) is an important tool for decision making. Users may ask for aggregated measure attributes over different combinations of dimension attributes [GrayBLP96].

Introduction -- OLAP: For example, take the CarSales table of a data warehouse: CarSales(TransID, Buyer, Date, Shop, Color, Price). Users may want to know the total sales of the yellow cars sold in Sept. 2002.
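The query on this slide can be sketched in a few lines of Python. The rows below are hypothetical and are not the data behind the slide's figure; the function name `total_sales` and the shop name "MK" are likewise assumptions made for illustration.

```python
from datetime import date

# Hypothetical CarSales rows: (TransID, Buyer, Date, Shop, Color, Price).
# These values are made up for illustration; they are not the slide's data.
car_sales = [
    (1, "Ann", date(2002, 9, 3), "TST", "Yellow", 10_000),
    (2, "Bob", date(2002, 9, 17), "MK", "Yellow", 20_000),
    (3, "Cat", date(2002, 9, 20), "TST", "Red", 15_000),
    (4, "Dan", date(2002, 10, 1), "MK", "Yellow", 30_000),
]


def total_sales(rows, color, year, month):
    """Sum Price over the rows that match the given color and calendar month."""
    return sum(
        price
        for (_tid, _buyer, day, _shop, row_color, price) in rows
        if row_color == color and day.year == year and day.month == month
    )


# Total sales of yellow cars sold in Sept. 2002 (over the hypothetical rows).
print(total_sales(car_sales, "Yellow", 2002, 9))
```

Scanning the fact table like this on every query is what precomputation is meant to avoid.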

Introduction -- OLAP

i.e., answer = $60k. To answer such queries efficiently, the answers are usually precomputed and stored. A popular data model for this is the data cube [GrayBLP96].

Introduction -- OLAP

Note that we are referring to the model only: not every entry is materialized. Every combination of dimensions is included in the model, so the cuboid shown here is just one example among several.
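As a rough illustration of "every combination of dimensions", the sketch below builds one cuboid per subset of the dimensions and sums the measure in each. The fact rows, the helper name `build_cube`, and the choice of only two dimensions are assumptions for this example, and it materializes everything, which the slide points out a real system need not do.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("Shop", "Color")  # dimensions chosen for this sketch

# Hypothetical fact rows: dimension values plus the Price measure.
facts = [
    {"Shop": "TST", "Color": "Yellow", "Price": 10_000},
    {"Shop": "TST", "Color": "Red", "Price": 15_000},
    {"Shop": "MK", "Color": "Yellow", "Price": 20_000},
]


def build_cube(rows, dimensions, measure):
    """Return {cuboid dimensions: {dimension values: summed measure}}."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for group_by in combinations(dimensions, size):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group_by)
                cuboid[key] += row[measure]
            cube[group_by] = dict(cuboid)
    return cube


cube = build_cube(facts, DIMENSIONS, "Price")
print(sorted(cube))        # the cuboids: one per subset of the dimensions
print(cube[("Color",)])    # totals grouped by Color only
```

With d dimensions there are 2^d cuboids, which is why storage efficiency becomes the central concern.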

Introduction -- OLAP: The research issue is how to store this information with high space efficiency and/or high query speed.

ROLAP and MOLAP

In a data cube, we have to store the combinations of dimension values and the associated aggregate values. ROLAP (Relational OLAP) stores the entries of the mapping in relational tables. MOLAP (Multi-dimensional OLAP) stores the entries in a multi-dimensional array.

ROLAP and MOLAP A materialized MOLAP multidimensional array (for colour and shop)

ROLAP and MOLAP A ROLAP table that stores the entries (for colour and shop)

ROLAP and MOLAP -- MOLAP advantages: Quick: one retrieval answers one point query (i.e., the dimension values are used to calculate the address of the stored aggregate value). May be space efficient if the cube is very dense (since the dimension values are not stored explicitly with every entry).
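The address calculation mentioned above can be sketched as a row-major offset into a flat array. The dimension domains, the shop names other than TST, and the helper names `cell_offset`, `put` and `get` are assumptions for this example; the slides do not say which layout a particular MOLAP system uses.

```python
# Hypothetical dimension domains; a value's list position is its coordinate.
SHOPS = ["TST", "MK", "CWB"]
COLORS = ["Yellow", "Red", "Blue", "White"]


def cell_offset(coords, sizes):
    """Row-major offset of a cell, given its coordinates and dimension sizes."""
    offset = 0
    for coord, size in zip(coords, sizes):
        if not 0 <= coord < size:
            raise IndexError("coordinate out of range")
        offset = offset * size + coord
    return offset


sizes = (len(SHOPS), len(COLORS))
cells = [0] * (sizes[0] * sizes[1])  # every cell is allocated, even empty ones


def put(shop, color, value):
    cells[cell_offset((SHOPS.index(shop), COLORS.index(color)), sizes)] = value


def get(shop, color):
    # One address calculation and one array read per point query.
    return cells[cell_offset((SHOPS.index(shop), COLORS.index(color)), sizes)]


put("TST", "Yellow", 10_000)
print(get("TST", "Yellow"))
```

The dimension values never appear in `cells`, which is why a dense cube is compact in this layout, while a sparse one wastes a slot per empty cell.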

ROLAP and MOLAP -- MOLAP disadvantages: Space inefficient if the cube is sparse (has many zero entries), especially in high-dimensional cases; chunking eases this. May need a lot of scans for a large range query (i.e., one that involves many dimension values).

ROLAP and MOLAP -- ROLAP: Indexes are built on the table to improve query performance, e.g., a B+-Tree on each dimension, or an R-Tree over all points. ROLAP advantage: space efficient (zero entries are not stored).
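A minimal sketch of the ROLAP side, under loud assumptions: the table rows are hypothetical, and a Python dict of sets stands in for each per-dimension B+-Tree. It shows that only non-zero entries are stored and that a point query has to combine the results of the single-dimension indexes.

```python
from collections import defaultdict

# Hypothetical ROLAP table: one row per non-zero (Shop, Color) entry.
rows = [
    ("TST", "Yellow", 10_000),
    ("TST", "Red", 15_000),
    ("MK", "Yellow", 20_000),
]

# One index per dimension, mapping a value to the ids of the matching rows.
# A dict of sets is only a stand-in for a B+-Tree here.
shop_index = defaultdict(set)
color_index = defaultdict(set)
for row_id, (shop, color, _total) in enumerate(rows):
    shop_index[shop].add(row_id)
    color_index[color].add(row_id)


def point_query(shop, color):
    """Intersect the per-dimension row-id sets, then read the aggregate."""
    matches = shop_index.get(shop, set()) & color_index.get(color, set())
    return sum(rows[row_id][2] for row_id in matches)


print(point_query("TST", "Yellow"))
print(point_query("MK", "Red"))  # no stored row, so the total is 0
```

The intersection step plays the role of the join: each index may return many row ids that the other index then discards.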

ROLAP and MOLAP -- ROLAP disadvantages: Indexes such as the R-Tree may not be effective on high-dimensional data. Many joins are required to produce the result if single-dimension indexes are used, and the intermediate results may be large. Can we do better?

SF-Tree

SF-Tree stands for Signature File Tree. It stores a mapping from objects to integers, and it is flexible, efficient, and has a statistical accuracy guarantee.

SF-Tree -- Basic idea: Divide the objects into groups that share the same (or a similar) associated number. Checking the associated number of an object is then the same as checking which group the object belongs to. Signature files are used to improve the efficiency of existence checking; trees are used to improve accuracy and speed.
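The grouping idea can be illustrated with a deliberately simplified, flat structure: one Bloom-filter-style signature file per associated number, with no tree on top. This is not the SF-Tree itself; the class names, the hashing scheme, and the default sizes are all assumptions made for the sketch.

```python
import hashlib


class SignatureFile:
    """A Bloom-filter-style signature file over a fixed number of bits."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, obj):
        # Derive num_hashes bit positions from the object's repr.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{obj!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, obj):
        for pos in self._positions(obj):
            self.bits |= 1 << pos

    def might_contain(self, obj):
        return all((self.bits >> pos) & 1 for pos in self._positions(obj))


class GroupedSignatures:
    """One signature file per associated number (a flat stand-in, no tree)."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.groups = {}

    def insert(self, obj, number):
        if number not in self.groups:
            self.groups[number] = SignatureFile(self.num_bits, self.num_hashes)
        self.groups[number].add(obj)

    def candidates(self, obj):
        """Numbers whose group may contain obj; false positives are possible."""
        return [n for n, sig in self.groups.items() if sig.might_contain(obj)]


store = GroupedSignatures()
store.insert(("TST", "Yellow"), 10_000)
store.insert(("MK", "Yellow"), 20_000)
print(store.candidates(("TST", "Yellow")))
```

The objects themselves are never stored, only their bit patterns, so the space used does not grow with the size of an object. A lookup can return a wrong extra group (a false positive) but never misses the right one.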

SF-Tree

Properties (advantages): Space efficient, independent of object size. Flexible: space, speed and accuracy can be traded off against one another. Speed is independent of the number of objects.

Using SF-Tree for OLAP

The information in OLAP can be modeled as a mapping from objects (dimension values) to numbers (aggregate values). Thus we can use SF-Tree to store this mapping.

Using SF-Tree for OLAP: For example, (TST, Yellow) is an object and its associated number is 10k, so we can insert it into an SF-Tree. Space requirement: m/ln 2 bits per object per level. This is independent of object size (i.e., dimensionality) and smaller than ROLAP, especially for high-dimensional data.
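To get a feel for the space claim, the sketch below plugs numbers into the slide's m/ln 2 estimate and compares it with an assumed ROLAP tuple layout. The slide does not define m here, so it is treated simply as the parameter of the formula; the sample values of m and the number of levels, and the 32-bit fixed-width ROLAP fields, are assumptions chosen for illustration.

```python
import math


def sf_tree_bits_per_object(m, levels):
    """The slide's estimate: m / ln 2 bits per object per level."""
    return m / math.log(2) * levels


def rolap_bits_per_tuple(num_dimensions, bits_per_value=32):
    """Assumed ROLAP cost: one fixed-width field per dimension plus the aggregate."""
    return (num_dimensions + 1) * bits_per_value


# Hypothetical settings: m = 8 and 3 levels, against 2- and 10-dimensional tuples.
sf_bits = sf_tree_bits_per_object(m=8, levels=3)
for dims in (2, 10):
    print(dims, round(sf_bits, 1), rolap_bits_per_tuple(dims))
```

Under these assumptions the SF-Tree figure stays the same as the number of dimensions grows, while the ROLAP tuple grows linearly, which is the sense in which the estimate is independent of dimensionality.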

Using SF-Tree for OLAP -- Advantages: more space efficient than ROLAP (and definitely much better than MOLAP); quicker than ROLAP for point queries (no joins are needed). Disadvantage: range queries require scanning all possible points in the query range (as in MOLAP).

Using SF-Tree for OLAP: To avoid this disadvantage, we borrow an idea from the MRA-Tree [LazM01]. MRA-Tree (Multi-Resolution Aggregate Tree): in a data- or space-partitioning tree, store aggregates in all internal nodes. One example is a quad-tree with aggregates.
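The "quad-tree + aggregates" example can be sketched as follows. This is a minimal illustration and not the structure from [LazM01]: it assumes integer coordinates, a root square whose side is a power of two, SUM as the only aggregate, and one cell per leaf; the class and method names are made up.

```python
class Node:
    """Quad-tree node over the half-open square [x, x+size) x [y, y+size)."""

    def __init__(self, x, y, size):
        self.x, self.y, self.size = x, y, size
        self.total = 0        # aggregate (sum) of every point below this node
        self.children = None  # created on first insert when size > 1

    def contains(self, px, py):
        return (self.x <= px < self.x + self.size
                and self.y <= py < self.y + self.size)

    def insert(self, px, py, value):
        if not self.contains(px, py):
            raise ValueError("point lies outside this node")
        self.total += value
        if self.size == 1:
            return
        if self.children is None:
            half = self.size // 2
            self.children = [Node(self.x + dx, self.y + dy, half)
                             for dx in (0, half) for dy in (0, half)]
        for child in self.children:
            if child.contains(px, py):
                child.insert(px, py, value)
                return

    def range_sum(self, x1, y1, x2, y2):
        """Sum over points with x1 <= x < x2 and y1 <= y < y2 (integer bounds)."""
        x_end, y_end = self.x + self.size, self.y + self.size
        if x2 <= self.x or x_end <= x1 or y2 <= self.y or y_end <= y1:
            return 0           # disjoint: skip the whole subtree
        if x1 <= self.x and x_end <= x2 and y1 <= self.y and y_end <= y2:
            return self.total  # fully covered: answer from the stored aggregate
        if self.children is None:
            return 0           # partially covered but no data below this node
        return sum(c.range_sum(x1, y1, x2, y2) for c in self.children)


root = Node(0, 0, 4)
for px, py, value in [(0, 0, 10), (1, 1, 20), (3, 3, 30), (2, 0, 5)]:
    root.insert(px, py, value)
print(root.range_sum(0, 0, 2, 2))
```

A query that covers a whole node is answered from that node's `total` without descending further, which is where the reduction in accesses comes from.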

MRA-Tree

[Figure: MRA-Tree example with aggregate values in its nodes; surviving labels: 580k, 60k, 160k]

MRA-Tree: For answering range queries, the number of accesses is reduced. Extra space is required. Leaf nodes need not contain only one record: the tree size drops significantly if we increase the number of points in a leaf node page.

SF-Tree with MRA-Tree: SF-Tree is more space efficient than ROLAP. Use SF-Tree to store the leaf nodes => each page can thus store more points => tree size/depth is reduced => fewer page accesses per query.
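The chain of implications above can be checked with a back-of-the-envelope calculation. The sketch assumes a perfectly balanced tree with a fixed fanout, which a real quad-tree or MRA-Tree need not be, and the point counts and page capacities are hypothetical.

```python
import math


def leaf_pages(num_points, points_per_page):
    """Number of leaf pages needed if every page is filled to capacity."""
    return math.ceil(num_points / points_per_page)


def tree_height(num_leaves, fanout):
    """Levels of internal nodes above num_leaves leaf pages (balanced tree)."""
    height = 0
    nodes = num_leaves
    while nodes > 1:
        nodes = math.ceil(nodes / fanout)
        height += 1
    return height


# Hypothetical numbers: one million points, fanout 4, and a leaf page that
# holds four times as many points once its contents are stored compactly.
for capacity in (100, 400):
    leaves = leaf_pages(1_000_000, capacity)
    print(capacity, leaves, tree_height(leaves, fanout=4))
```

With these numbers, quadrupling the page capacity cuts the leaf count by a factor of four and removes one level from the tree.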

SF-Tree with MRA-Tree -- Advantage: more space efficient, i.e., the structure may be small enough to fit in memory and reduce page accesses, especially for high-dimensional data. Disadvantage: we still need to scan the area covered by leaf nodes (vs. scanning data points when using ROLAP).

Conclusion: SF-Tree is space efficient and can be used to store a data cube (or may be used as a ROLAP index). Although analysis shows that SF-Tree is slow for range queries, we try to incorporate the MRA-Tree idea into SF-Tree to increase the speed.

Reference
[GrayBLP96] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In ICDE'96.
[LazM01] I. Lazaridis and S. Mehrotra. Progressive Approximate Aggregate Queries with a Multi-Resolution Tree Structure. In SIGMOD'01.