Chapt. 7 Multidimensional Hierarchical Clustering

Fig. 3.1 Hierarchies in the "Juice and More" schema
PRODUCT: All Products > Type (5) > Brand (8) > Category (19) > Container (10)
CUSTOMER: All Customer > Region (8) > Nation (7) > Trade Type (2) > Business Type (7)
DISTRIBUTION: All Distributions > Sales Organization (5) > Distribution Channel (3)
TIME: All Time > Year (3) > Month (12)
Prof. Bayer, DWH, Ch.7, SS2000

(b) Tables of the schema:
FACT (26M rows): PRODKEY, CUSTKEY, DISTKEY, TIMEKEY, SALES, DISTCOST
PRODUCT (2180 rows): TYPE, BRAND, CATEGORY, CONTAINER, ...
CUSTOMER (7064 rows): REGION, NATION, TRADE-TYPE, BUSINESS-TYPE
DISTRIBUTION (12 rows): SALESORG, CHANNEL
TIME (36 rows): YEAR, MONTH

Size of completely aggregated Cube
[(6*9*20*11)*(9*8*3*8)*(6*4)*(4*13)] / [(5*8*19*10)*(8*7*2*7)*(5*3)*(3*12)]
= (4*6*6*9*11*13) / (5*5*7*7*19)
= 185,328 / 23,275
= 7.96 times larger than the base cube
Base Cube has 2,245,024,000 cells * 4 B ~ 9 GB
Number of available facts: 26 million
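The ratio above can be checked mechanically. The sketch below assumes that the numbers in the denominator are the branching degrees per hierarchy level and that the numerator adds one extra value per level; the names `levels` and `growth_factor` are illustrative and not part of the slides.

```python
from fractions import Fraction
from math import prod

# Branching degrees per hierarchy level, copied from the denominator of the slide.
levels = {
    "PRODUCT": [5, 8, 19, 10],
    "CUSTOMER": [8, 7, 2, 7],
    "DISTRIBUTION": [5, 3],
    "TIME": [3, 12],
}

def growth_factor(levels):
    """Ratio of the completely aggregated cube to the base cube.

    Assumption: every level contributes one additional value,
    which is how the numerator of the slide is read here.
    """
    aggregated = prod(d + 1 for degrees in levels.values() for d in degrees)
    base = prod(d for degrees in levels.values() for d in degrees)
    return Fraction(aggregated, base)

ratio = growth_factor(levels)
print(ratio, round(float(ratio), 2))
```

Using `Fraction` keeps the result exact, so the reduced fraction can be compared directly with the one on the slide.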

Sparsity: (26 * 10^6) / (2.245 * 10^9) = 0.0116, i.e. 1.16 % of the base cells are occupied
100 - 1.16 = 98.84 % sparsity

Hierarchically aggregated Cube
PRODUCT: (1+5+40+760+7600) = 8406
CUSTOMER: (1+8+56+112+784) = 961
DISTRIBUTION: (1+5+15) = 21
TIME: (1+3+24) = 28
P = 8406 * 961 * 21 * 28 = 4,749,961,608
Size of base cube: 7600 * 784 * 15 * 24 = 2,145,024,000
Number of aggregate cells: 4,749,961,608 - 2,145,024,000 = 2,604,937,608
==> The Juice and More database has about 100 times more hierarchically aggregated cells than occupied base cells (26 million facts)!
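As a cross-check of P and of the base cube size, the following sketch multiplies the per-dimension sums given on this slide. The member counts per level are copied from those sums; the variable names are illustrative.

```python
from math import prod

# Members per hierarchy level (ALL first, leaf level last), copied from the sums on the slide.
level_sizes = {
    "PRODUCT": [1, 5, 40, 760, 7600],
    "CUSTOMER": [1, 8, 56, 112, 784],
    "DISTRIBUTION": [1, 5, 15],
    "TIME": [1, 3, 24],
}

# A cell of the hierarchically aggregated cube picks one member of any level in each dimension.
P = prod(sum(sizes) for sizes in level_sizes.values())

# A base cell picks one leaf member in each dimension.
base_cells = prod(sizes[-1] for sizes in level_sizes.values())

print(P, base_cells)
```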

Star-Joins
Restrictions on several dimension tables, which are then joined with the fact table. In addition: grouping, computation of aggregates, sorting of results.
Example:
  Select <MEASURE AGGREGATION>
  From Fact F, Customer C, Distribution D, Product P, Time T
  Where F.PRODKEY = P.PRODKEY AND F.CUSTKEY = C.CUSTKEY
    AND F.TIMEKEY = T.TIMEKEY AND F.DISTKEY = D.DISTKEY
    AND <CUSTOMER RESTRICTION> AND <DISTRIBUTION RESTRICTION>
    AND <PRODUCT RESTRICTION> AND <TIME RESTRICTION>

Target form, a range query (query box) on the fact table alone:
  Select <MEASURE AGGREGATION>
  From Fact F
  Where F.ProdKey BETWEEN Pkey1 AND Pkey2
    AND F.DistKey BETWEEN Dkey1 AND Dkey2
    AND F.CustKey BETWEEN Ckey1 AND Ckey2
    AND F.TimeKey BETWEEN Tkey1 AND Tkey2

Key Question: How to compute star-joins efficiently?
- Secondary indexes on foreign keys of fact table (standard B-trees), see chapter 5 for details
  - intersect result lists
  - retrieve tuples from fact table randomly
- Bitmaps

Bitmap Index Intersection
                                       Page 1     Page 2     Page 3     Page 4     Page 5
bitmap for organization = "TM":        1.....1.11 1.1...1.1. 1.1...1.1. ...1.1.... ..1.1...1.   34 % of tuples
bitmap for region = "Asia":            11.1...... 1.11.....1 .1.1..1... 1.1.1..... .1..1.1...   32 % of tuples
result of bitmap intersection:         1......... 1.1....... ......1... .......... ....1.....   10 % of tuples
accessed disk pages (shaded in the figure): 80 % of pages
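The figures on this slide can be reproduced with a few lines of code. The sketch below takes the two bitmaps exactly as printed (one 10-tuple string per page, "1" for a hit), intersects them position by position, and counts hits and touched pages. The helper name `intersect` is illustrative.

```python
# Bitmaps copied from the slide: one string per disk page, "1" marks a qualifying tuple.
org_tm = "1.....1.11 1.1...1.1. 1.1...1.1. ...1.1.... ..1.1...1.".split()
region_asia = "11.1...... 1.11.....1 .1.1..1... 1.1.1..... .1..1.1...".split()

def intersect(bitmap_a, bitmap_b):
    """Bitwise AND of two paged bitmaps given as lists of '1'/'.' strings."""
    return [
        "".join("1" if x == "1" and y == "1" else "." for x, y in zip(page_a, page_b))
        for page_a, page_b in zip(bitmap_a, bitmap_b)
    ]

result = intersect(org_tm, region_asia)
total_tuples = sum(len(page) for page in result)
hits = sum(page.count("1") for page in result)
pages_touched = sum(1 for page in result if "1" in page)

print(result)
print(f"{100 * hits / total_tuples:.0f} % of tuples, "
      f"{100 * pages_touched / len(result):.0f} % of pages")
```

Although only 5 of 50 tuples qualify, 4 of the 5 pages contain at least one hit, which is the effect the slide illustrates.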

Problem: even for small result sets of a few percent, almost all pages of the fact table must be fetched from disk if the hits in the result set are not clustered on disk.
Ex: 8 KB pages hold 20 to 400 tuples, i.e. already at 0.25% to 5% hits in the result almost all pages must be fetched.
At least tuple clustering, preferably page clustering, is desirable, but how?
Goal: Code the hierarchies in such a way that a star-join with the Fact table only requires a query box on the Fact table.
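To get a feeling for the numbers in this example, one can estimate the share of pages that contain at least one hit. The sketch below assumes that hits are scattered independently and uniformly over the tuples; this model and the function name are assumptions for illustration, not part of the slides.

```python
def expected_page_fraction(selectivity, tuples_per_page):
    """Expected fraction of pages with at least one hit.

    Assumption: every tuple qualifies independently with probability
    `selectivity`, i.e. the hits are not clustered at all.
    """
    return 1.0 - (1.0 - selectivity) ** tuples_per_page

for selectivity, tuples_per_page in [(0.0025, 400), (0.05, 20), (0.05, 400)]:
    share = expected_page_fraction(selectivity, tuples_per_page)
    print(f"{selectivity:.2%} hits, {tuples_per_page} tuples/page: {share:.1%} of pages")
```

Under this model, one expected hit per page already touches roughly two thirds of all pages, and the share approaches 100 % as soon as the selectivity or the number of tuples per page grows further.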

Basic Idea for Multidimensional Clustering
Example Hierarchy in Member Set Representation:
Level 0 (All Products): m_1^0 = {OJ 0.33L; OJ 0.7L; OJ 1L; Apple Juice 0.5L; Apple Juice 1L}
Level 1 (Product Category): m_1^1 = {OJ 0.33L; OJ 0.7L; OJ 1L} (Orange Juice), m_2^1 = {Apple Juice 0.5L; Apple Juice 1L} (Apple Juice)
Level 2: m_1^2 = {OJ 0.33L}, m_2^2 = {OJ 0.7L}, m_3^2 = {OJ 1L}, m_4^2 = {A-Juice 0.5L}, m_5^2 = {A-Juice 1L}
Legend: each node shows Level Label, Member Ordinal (e.g., 1), Member Label (e.g., 0.7L)

Dimension D consists of
- Value Set V = {v_1, v_2, ..., v_n}
- Hierarchy H of height h consisting of h+1 hierarchy levels H = {L_0, L_1, ..., L_h}
- Level L_i is a set of sets L_i = {m_1^i, ..., m_j^i} with m_k^i ⊆ V
- the m_k^i get names, e.g. "Orange Juice" as label(m_1^1), in general label(m_k^i)
- Constraint: every m_l^(i+1) must be a subset of some m_k^i

Hierarchic Relationships
The children of m_k^i are all those sets m_l^(i+1) of the lower level i+1 with the property m_l^(i+1) ⊆ m_k^i, formally:
children(m_k^i) := { m_l^(i+1) ∈ L_(i+1) : m_l^(i+1) ⊆ m_k^i }
parent(m_k^i) := { m_l^(i-1) ∈ L_(i-1) : m_l^(i-1) ⊇ m_k^i }
Principle: the children of m are numbered by the bijective function ord_m starting at 1 or 0
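The two definitions translate directly into subset tests on the member sets. The sketch below uses the juice hierarchy of the earlier example; the dictionary layout, the abbreviation "AJ" for Apple Juice and the function names are illustrative assumptions.

```python
# One dict per hierarchy level: member label -> member set (a subset of the value set V).
levels = [
    {"All Products": frozenset({"OJ 0.33L", "OJ 0.7L", "OJ 1L", "AJ 0.5L", "AJ 1L"})},
    {
        "Orange Juice": frozenset({"OJ 0.33L", "OJ 0.7L", "OJ 1L"}),
        "Apple Juice": frozenset({"AJ 0.5L", "AJ 1L"}),
    },
    {v: frozenset({v}) for v in ["OJ 0.33L", "OJ 0.7L", "OJ 1L", "AJ 0.5L", "AJ 1L"]},
]

def children(i, label):
    """Members of level i+1 whose set is contained in the given member of level i."""
    if i + 1 >= len(levels):
        return []
    member = levels[i][label]
    return [name for name, s in levels[i + 1].items() if s <= member]

def parent(i, label):
    """Members of level i-1 whose set contains the given member of level i."""
    if i == 0:
        return []
    member = levels[i][label]
    return [name for name, s in levels[i - 1].items() if s >= member]

def ord_of(i, label):
    """Bijective numbering of the children of a member, here starting at 0."""
    return {child: n for n, child in enumerate(children(i, label))}

print(children(1, "Orange Juice"))
print(parent(2, "AJ 1L"))
print(ord_of(1, "Apple Juice"))
```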

Enumeration and Surrogate Functions
Let A be an enumeration type A = {a_0, a_1, ..., a_k}
f : A --> {0, 1, ..., k} defined as f(a_i) = i
then i is called the surrogate of a_i

Hierarchies and composite Surrogates
Basic Idea: concatenate the surrogates of successive hierarchy levels (compound surrogates cs)
Note: the root ALL of the hierarchy is not encoded
Def: compound surrogate cs for hierarchy H
ord_m : children(m) --> {0, 1, ..., |children(m)| - 1}
cs(H, m^i) := ord_father(m^i)(m^i)                          if i = 1
           := cs(H, father(m^i)) ∘ ord_father(m^i)(m^i)     otherwise
where ∘ denotes concatenation
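The recursive definition can be sketched as follows. The `father` table is a hypothetical, simplified stand-in for the hierarchy H (in a real hierarchy, labels such as Retail occur under several nations and would need a unique key); the ordinals are those of the path example North America, USA, Retail, Bar given later in this chapter.

```python
# Hypothetical excerpt of a hierarchy: member -> (father, ordinal among the father's children).
father = {
    "North America": ("ALL", 4),
    "USA": ("North America", 1),
    "Retail": ("USA", 1),
    "Bar": ("Retail", 2),
}

def cs(member):
    """Compound surrogate as a tuple of ordinals, one per hierarchy level below ALL."""
    parent, ordinal = father[member]
    if parent == "ALL":          # level 1: the root ALL itself is not encoded
        return (ordinal,)
    return cs(parent) + (ordinal,)   # concatenate the father's cs with the own ordinal

print(cs("Bar"))
print(cs("USA"))
```

Returning a tuple keeps the concatenation explicit; packing the ordinals into a fixed number of bits is a separate step.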

Example (a):
REGION             f(REGION)
South Europe       0
Middle Europe      1
Northern Europe    2
Western Europe     3
North America      4
Latin America      5
Asia               6
Australia          7

[Figure: Surrogates for Region and the entire Customer Hierarchy. Nodes shown below CUSTOMER: South Europe, North America (4), Asia (6), Australia (7); USA, Canada; Retail, Wholesale; Kana's Sushi Bar, Joe's Sports Bar, ...]

Example: the path North America --> USA --> Retail --> Bar has the compound surrogate 4∘1∘1∘2 (∘ = concatenation)
Next Idea: for every hierarchy level determine the highest branching degree (plus a safety margin for future extensions) and code the surrogates of that level with a fixed number of bits.
surrogates(H, i) := max { cardinality(children(H, m)) : m ∈ level(H, i-1) }
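A minimal sketch of the fixed-length coding: it assumes the maximum branching degrees 8, 7, 2, 7 of the CUSTOMER hierarchy from Fig. 3.1, uses no safety margin, and packs the ordinals of the example path into one integer. The function names are illustrative.

```python
def bits_per_level(max_branching):
    """Bits needed per level to store ordinals 0 .. branching-1 (at least 1 bit)."""
    return [max(1, (b - 1).bit_length()) for b in max_branching]

def encode(path, widths):
    """Pack the ordinals of a root-to-leaf path into one fixed-length integer."""
    code = 0
    for ordinal, width in zip(path, widths):
        if not 0 <= ordinal < (1 << width):
            raise ValueError(f"ordinal {ordinal} does not fit into {width} bits")
        code = (code << width) | ordinal
    return code

# Assumed branching degrees: Region 8, Nation 7, Trade Type 2, Business Type 7.
widths = bits_per_level([8, 7, 2, 7])
key = encode([4, 1, 1, 2], widths)   # North America, USA, Retail, Bar

print(widths, key, format(key, "010b"))
```

Because every level occupies a fixed bit field, comparing the packed integers is the same as comparing the ordinal tuples lexicographically.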

Handwritten page 6.6 ??? Problem with duplicate indices?

Properties of MHC Encoding
- very compact coding of fixed length
- lexicographic order of composite keys remains, i.e. isomorphic to integer ordering
- point restrictions on arbitrary hierarchy levels lead to interval restrictions on the compound surrogates

Example: the path to USA is North America --> USA
4 = 100_2, 1 = 001_2
leads to the range on cs: 100 001 0 000_2 to 100 001 1 111_2, and to the decimal range 528 to 543, or [528 : 543]
==> star join with restriction North America.USA leads to an interval restriction on the fact table
==> point restrictions on arbitrary hierarchy levels of several dimensions lead to Query Boxes on the fact table.
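The step from a point restriction to an interval can be sketched as follows, assuming the bit widths 3, 3, 1, 3 used in the example above: the restricted levels fix the leading bits, and the remaining bits range from all zeros to all ones. The names `WIDTHS` and `prefix_interval` are illustrative.

```python
WIDTHS = [3, 3, 1, 3]   # assumed bits for Region, Nation, Trade Type, Business Type

def prefix_interval(prefix, widths=WIDTHS):
    """Interval [lo, hi] of compound surrogates that start with the given ordinals."""
    code = 0
    for ordinal, width in zip(prefix, widths):
        code = (code << width) | ordinal
    free_bits = sum(widths[len(prefix):])     # bits of the unrestricted lower levels
    lo = code << free_bits                    # lower levels all zeros
    hi = lo | ((1 << free_bits) - 1)          # lower levels all ones
    return lo, hi

print(prefix_interval([4, 1]))   # North America, USA
print(prefix_interval([4]))      # North America only
```

One such interval per restricted dimension yields the query box on the fact table.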

Complex Hierarchies
- Multiple hierarchies, e.g. time with months and weeks: both restrictions lead to intervals on the level of days (example of Fig. 4-4). Proposal: choose the most useful hierarchy (depending on the query profile) or consider multiple hierarchies as several independent hierarchies. Caution: this increases the number of dimensions!
- Time-variant hierarchies: extend by the time interval of validity, see example Fig. 4-5.
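That month and week restrictions both become intervals on the day level can be illustrated with the standard library. The sketch below assumes ISO 8601 week numbering, which the slides do not specify; the function names are illustrative.

```python
import calendar
from datetime import date, timedelta

def month_interval(year, month):
    """First and last day of a calendar month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)

def iso_week_interval(year, week):
    """Monday and Sunday of an ISO 8601 week (assumed week definition)."""
    monday = date.fromisocalendar(year, week, 1)
    return monday, monday + timedelta(days=6)

print(month_interval(1998, 2))
print(iso_week_interval(1998, 6))
```

Both restrictions end up as a pair of day values, so either hierarchy can be answered by an interval on the same day-level key.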

[Figure: Complex Hierarchy Graphs, panels (a) and (b). One graph over the time levels YEAR, MONTH, WEEK, DAY; one graph over the customer levels REGION, NATION, CUSTOMER TYPE, TRADE TYPE, CUSTOMER SIZE, CUSTOMER.]

[Figure: Change of a hierarchy over time. CUSTOMER hierarchy with South Europe, North America, ...; Canada, USA; Retail, Wholesale; Bar, Restaurant; Joe's Sports Bar is shown with the validity intervals Year <= 1997 and Year > 1997.]

[Figure with labels: Orange Juice, Asia]

[Figure: Processing a query box in sort order with the Tetris algorithm (labels: Apple Juice, Asia)]