Prof. Bayer, DWH, Ch. 7, SS 2002: Chapt. 7 Multidimensional Hierarchical Clustering

Presentation transcript:

Chapt. 7 Multidimensional Hierarchical Clustering

Fig. 3.1 Hierarchies in the "Juice and More" schema (fan-out per level in parentheses):
PRODUCT: All Products, Type (5), Brand (8), Category (19), Container (10)
CUSTOMER: All Customer, Region (8), Nation (7), TradeType (2), BusinessType (7)
DISTRIBUTION: All Distributions, Sales Organization (5), Distribution Channel (3)
TIME: All Time, Year (3), Month (12)


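Size of completely aggregated Cube

$$\frac{(6 \cdot 9 \cdot 20 \cdot 11)(9 \cdot 8 \cdot 3 \cdot 8)(6 \cdot 4)(4 \cdot 13)}{(5 \cdot 8 \cdot 19 \cdot 10)(8 \cdot 7 \cdot 2 \cdot 7)(5 \cdot 3)(3 \cdot 12)} = \frac{4 \cdot 6 \cdot 6 \cdot 9 \cdot 11 \cdot 13}{5 \cdot 5 \cdot 7 \cdot 7 \cdot 19} = 7.96$$

i.e. the completely aggregated cube is 7.96 times larger than the base cube.

Base cube: number of cells * 4 B ~ 9 GB
Number of available facts: 26 million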
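The factor 7.96 can be checked with a few lines of Python. This is only a sketch: the dictionary `levels` and its keys are names chosen here, and the fan-outs are read from Fig. 3.1 on the assumption that each number is the fan-out of its hierarchy level.

```python
from math import prod

# Fan-out per hierarchy level, read from Fig. 3.1 (assumed interpretation).
levels = {
    "PRODUCT": [5, 8, 19, 10],
    "CUSTOMER": [8, 7, 2, 7],
    "DISTRIBUTION": [5, 3],
    "TIME": [3, 12],
}

# Base cube: one cell per combination of leaf members.
base_cells = prod(prod(degrees) for degrees in levels.values())

# Completely aggregated cube: every level gets one additional ALL member.
full_cells = prod(prod(d + 1 for d in degrees) for degrees in levels.values())

ratio = full_cells / base_cells
print(round(ratio, 2))
```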

Sparsity: $\frac{26 \cdot 10^{6}}{2.245 \cdot 10^{9}} = 0.0116$, i.e. only 1.16 % of the base cube cells are occupied ==> 98.84 % sparsity

Hierarchically aggregated Cube

Number of members per dimension, summed over all hierarchy levels:
PRODUCT: (1+5+40+760+7600) = 8406
CUSTOMER: (1+8+56+112+784) = 961
DISTRIBUTION: (1+5+15) = 21
TIME: (1+3+24) = 28

Π = …
Size of base cube: …
Number of aggregate cells: …

==> The Juice and More database has 96 times more hierarchically aggregated cells than occupied base cells!
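The per-dimension member counts follow from the fan-outs of Fig. 3.1 by summing the running products of the levels plus one ALL member. The sketch below illustrates this; the function name `members_per_dimension` is made up for the example. TIME is left out because the slide uses 24 rather than 3*12 members on the month level.

```python
from itertools import accumulate
from operator import mul

def members_per_dimension(degrees):
    """Count the members on all levels of one hierarchy, including ALL."""
    # accumulate(degrees, mul) yields the number of members on each level.
    return 1 + sum(accumulate(degrees, mul))

product_members = members_per_dimension([5, 8, 19, 10])
customer_members = members_per_dimension([8, 7, 2, 7])
distribution_members = members_per_dimension([5, 3])

print(product_members, customer_members, distribution_members)
```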

Star-Joins

Restrictions on several dimension tables, which are then joined with the fact table.
In addition: grouping, computation of aggregates, sorting of results.

Example:
    Select ...
    From   Fact F, Customer C, DISTRIBUTION D, Product P, Time T
    Where  F.ProdKey = P.ProdKey AND
           F.CustKey = C.CustKey AND
           F.TIMEKEY = T.TIMEKEY AND
           F.DISTKEY = D.DISTKEY AND
           ... AND
           ...

    Select ...
    From   Fact F
    Where  F.ProdKey BETWEEN Pkey1 AND Pkey2 AND
           F.DistKey BETWEEN Dkey1 AND Dkey2 AND
           F.CustKey BETWEEN Ckey1 AND Ckey2 AND
           F.TimeKey BETWEEN Tkey1 AND Tkey2

Key Question: How to compute star-joins efficiently?

Secondary indexes on foreign keys of fact table (standard B-trees), see chapter 5 for details
- intersect result lists
- retrieve tuples from fact table randomly
Bitmaps

Bitmap Index Intersection

[Figure: the bitmap for organization = "TM" and the bitmap for region = "Asia" are intersected over the tuples stored on pages 1 to 5; accessed disk pages are shaded. The two bitmaps select 34 % and 32 % of the tuples; the result of the bitmap intersection contains 10 % of the tuples, yet 80 % of the pages are accessed.]
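The effect can be reproduced on a toy example. The sketch below uses made-up bitmaps (20 tuples, 4 tuples per page); the data and the helper `pages_touched` are assumptions for illustration and do not reproduce the percentages of the figure.

```python
def pages_touched(bitmap, tuples_per_page):
    """Return the page numbers that hold at least one set bit."""
    return {pos // tuples_per_page for pos, bit in enumerate(bitmap) if bit}

# Hypothetical toy data: 20 tuples stored on 5 pages of 4 tuples each.
org_tm      = [1, 0, 0, 1,  0, 1, 0, 0,  1, 0, 1, 0,  0, 0, 0, 0,  0, 1, 0, 1]
region_asia = [1, 1, 0, 0,  0, 1, 1, 0,  0, 0, 1, 0,  0, 0, 0, 0,  1, 0, 0, 1]

# Bitmap intersection is a position-wise AND.
result = [a & b for a, b in zip(org_tm, region_asia)]

hits = sum(result)
pages = pages_touched(result, tuples_per_page=4)
print(f"{hits / len(result):.0%} of tuples, {len(pages) / 5:.0%} of pages")
```

Because the hits are scattered, a small share of the tuples still forces a read of most pages.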

Problem: for small result sets of a few %, almost all pages of the fact table must be fetched from disk if the hits in the result set are not clustered on disk.

Ex: 8 KB pages hold 20 to 400 tuples per page, i.e. at 0.25 % to 5 % hits in the result almost all pages must be fetched.

At least tuple clustering, preferably page clustering, is desirable, but how?

Goal: code hierarchies in such a way that for star-joins with the fact table we only have to process a query box on the fact table.
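A rough estimate of the share of pages that must be read can be sketched as follows. It assumes that hits are spread independently and uniformly over all tuples, which is a simplification introduced here and not part of the slides; `expected_page_fraction` is a hypothetical helper.

```python
def expected_page_fraction(selectivity, tuples_per_page):
    """Probability that a page holds at least one hit.

    Assumes hits are independent and uniformly distributed (simplification).
    """
    return 1.0 - (1.0 - selectivity) ** tuples_per_page

for tuples_per_page in (20, 400):
    for selectivity in (0.0025, 0.05):
        fraction = expected_page_fraction(selectivity, tuples_per_page)
        print(f"{tuples_per_page:3d} tuples/page, "
              f"{selectivity:.2%} hits -> {fraction:.1%} of pages")
```

Under this model, a selectivity of one hit per page's worth of tuples (1/20 = 5 % or 1/400 = 0.25 %) already touches roughly two thirds of the pages, and the fraction approaches 1 quickly for larger result sets.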

Basic Idea for Multidimensional Clustering

Example hierarchy in member set representation:

Level 0 (All): $m^0_1$ "All Products" = {OJ 0.33L; OJ 0.7L; OJ 1L; Apple Juice 0.5L; Apple Juice 1L}
Level 1 (Product Category): $m^1_1$ "Orange Juice" = {OJ 0.33L; OJ 0.7L; OJ 1L}, $m^1_2$ "Apple Juice" = {Apple Juice 0.5L; Apple Juice 1L}
Level 2: $m^2_1$ "0.33L" = {OJ 0.33L}, $m^2_2$ "0.7L" = {OJ 0.7L}, $m^2_3$ "1L" = {OJ 1L}, $m^2_4$ "0.5L" = {A-Juice 0.5L}, $m^2_5$ "1L" = {A-Juice 1L}

Legend: level label; member ordinal (e.g., 1); member label (e.g., 0.7L)

Dimension D consists of
- Value set $V = \{v_1, v_2, \ldots, v_n\}$
- Hierarchy H of height h, consisting of h+1 hierarchy levels: $H = \{L_0, L_1, \ldots, L_h\}$
- Level $L_i$ is a set of sets: $L_i = \{m^i_1, \ldots, m^i_j\}$ with $m^i_k \subseteq V$
- The $m^i_k$ get names, e.g. "Orange Juice" as $\mathrm{label}(m^1_1)$, in general $\mathrm{label}(m^i_k)$
- Constraint: every $m^{i+1}_l$ must be a subset of some $m^i_k$

Hierarchic Relationships

The children of $m^i_k$ are all those sets $m^{i+1}_l$ of the lower level i+1 with the property $m^{i+1}_l \subseteq m^i_k$, formally:

$$\mathrm{children}(m^i_k) := \{ m^{i+1}_l \in L_{i+1} : m^{i+1}_l \subseteq m^i_k \}$$
$$\mathrm{parent}(m^i_k) := \{ m^{i-1}_l \in L_{i-1} : m^{i-1}_l \supseteq m^i_k \}$$

Principle: the children of m are numbered by the bijective function $\mathrm{ord}_m$, starting at 1 or 0.
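The set-based definitions can be tried out directly with Python sets. The sketch below models the juice example as frozensets; the abbreviated labels ("OJ", "AJ") and the layout of `levels` are choices made for this illustration.

```python
# Member sets of the example product hierarchy (labels abbreviated here).
ALL = frozenset({"OJ 0.33L", "OJ 0.7L", "OJ 1L", "AJ 0.5L", "AJ 1L"})

levels = [
    [ALL],                                                   # L0
    [frozenset({"OJ 0.33L", "OJ 0.7L", "OJ 1L"}),            # L1: Orange Juice
     frozenset({"AJ 0.5L", "AJ 1L"})],                       # L1: Apple Juice
    [frozenset({value}) for value in sorted(ALL)],           # L2: single items
]

def children(member, i):
    """All members of level i+1 that are subsets of `member` (level i)."""
    return [c for c in levels[i + 1] if c <= member]

def parent(member, i):
    """All members of level i-1 that are supersets of `member` (level i)."""
    return [p for p in levels[i - 1] if member <= p]

orange_juice, apple_juice = levels[1]
print(children(orange_juice, 1))
print(parent(frozenset({"AJ 1L"}), 2))
```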


Enumeration and Surrogate Functions

Let A be an enumeration type $A = \{a_0, a_1, \ldots, a_k\}$ and
$f : A \to \{0, 1, \ldots, k\}$ defined as $f(a_i) = i$;
then i is called the surrogate of $a_i$.

Hierarchies and Composite Surrogates

Basic idea: concatenate the surrogates of successive hierarchy levels (compound surrogates cs).
Note: the root ALL of the hierarchy is not encoded.

Def: compound surrogate cs for hierarchy H

$$\mathrm{ord}_m : \mathrm{children}(m) \to \{0, 1, \ldots, |\mathrm{children}(m)| - 1\}$$

$$\mathrm{cs}(H, m^i) := \begin{cases} \mathrm{ord}_{\mathrm{father}(m^i)}(m^i) & \text{if } i = 1 \\ \mathrm{cs}(H, \mathrm{father}(m^i)) \circ \mathrm{ord}_{\mathrm{father}(m^i)}(m^i) & \text{otherwise} \end{cases}$$
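The recursive definition can be sketched in Python, with concatenation modelled as tuple concatenation. The toy hierarchy, the dictionaries `children` and `father`, and the helper `ord_in_father` are assumptions made for this example.

```python
# Hypothetical toy hierarchy: each member maps to its ordered children.
children = {
    "ALL": ["Orange Juice", "Apple Juice"],
    "Orange Juice": ["OJ 0.33L", "OJ 0.7L", "OJ 1L"],
    "Apple Juice": ["AJ 0.5L", "AJ 1L"],
}
father = {child: parent for parent, kids in children.items() for child in kids}

def ord_in_father(member):
    """Position of `member` among the children of its father, starting at 0."""
    return children[father[member]].index(member)

def cs(member):
    """Compound surrogate as a tuple of per-level surrogates."""
    if father[member] == "ALL":
        # Level 1: the root ALL itself is not encoded.
        return (ord_in_father(member),)
    return cs(father[member]) + (ord_in_father(member),)

print(cs("OJ 0.7L"), cs("AJ 1L"))
```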

Example:

REGION            f(REGION)
South Europe      0
Middle Europe     1
Northern Europe   2
Western Europe    3
North America     4
Latin America     5
Asia              6
Australia         7

(a)
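A surrogate function of this kind is simply the position in an ordered enumeration. A minimal sketch for the REGION example, where the names `REGIONS`, `surrogate` and `region_of` are chosen here:

```python
REGIONS = [
    "South Europe", "Middle Europe", "Northern Europe", "Western Europe",
    "North America", "Latin America", "Asia", "Australia",
]

# f(a_i) = i: map every region to its position in the enumeration.
surrogate = {region: i for i, region in enumerate(REGIONS)}

# The inverse mapping recovers the region from its surrogate.
region_of = dict(enumerate(REGIONS))

print(surrogate["North America"], region_of[7])
```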

[Figure: CUSTOMER hierarchy annotated with surrogates. Labels that survive: South Europe, North America, Asia, Australia (7); USA, Canada; Retail, Wholesale (0); Bar; Kana's Sushi Bar, Joe's Sports Bar, ...]

Surrogates for Region and the entire Customer Hierarchy

Example: the path North America --> USA --> Retail --> Bar has the compound surrogate $4 \circ 1 \circ 1 \circ 2$.

Next idea: for every hierarchy level determine the highest branching degree (plus a safety margin for future extensions) and code by a fixed number of bits.

$$\mathrm{surrogates}(H, i) := \max \{ \mathrm{cardinality}(\mathrm{children}(H, m)) : m \in \mathrm{level}(H, i-1) \}$$

Let $l_i := \lceil \log_2 \mathrm{surrogates}(H, i) \rceil$; then $l_i$ bits are needed for the surrogates of level i.

Let $\pi$ be a path $\pi = m^0 \to m^1 \to m^2 \to \ldots \to m^h$ to a leaf $m^h$ of hierarchy H:
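The bit widths per level can be computed as sketched below. The child counts per parent are hypothetical; only their maxima (8, 7, 2, 7) are taken from the CUSTOMER hierarchy of Fig. 3.1. No safety margin is added in this sketch.

```python
def surrogates(child_counts):
    """Highest branching degree among the members of the level above."""
    return max(child_counts)

def bits_needed(n):
    """ceil(log2(n)) for n >= 1, computed with integers only."""
    return (n - 1).bit_length()

# Hypothetical child counts per parent for the four CUSTOMER levels;
# only the maxima 8, 7, 2, 7 come from Fig. 3.1.
child_counts_per_level = [[8], [7, 5, 3], [2, 2, 1], [7, 4]]

bits = [bits_needed(surrogates(counts)) for counts in child_counts_per_level]
print(bits)
```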

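$$\mathrm{cs}(H, \pi) = \mathrm{cs}(H, m^h) := \mathrm{ord}_{m^0}(m^1) \circ \mathrm{ord}_{m^1}(m^2) \circ \ldots \circ \mathrm{ord}_{m^{h-1}}(m^h)$$
$$:= \sum_{i=1}^{h} \mathrm{ord}_{m^{i-1}}(m^i) \cdot 2^{\sum_{j=i+1}^{h} l_j}$$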

Example: cs(H, Bar) = 100 001 1 010 (binary) = 538

$l_1 = 3$, $l_2 = 3$, $l_3 = 1$, $l_4 = 3$: number of bits needed at each level
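The fixed-length encoding amounts to shifting and OR-ing the per-level surrogates. A minimal sketch, where the function name `encode` is chosen for this example:

```python
def encode(ordinals, bits):
    """Pack per-level surrogates into one integer, most significant level first."""
    cs = 0
    for ordinal, width in zip(ordinals, bits):
        if ordinal >= (1 << width):
            raise ValueError(f"surrogate {ordinal} does not fit into {width} bits")
        cs = (cs << width) | ordinal
    return cs

# Path North America --> USA --> Retail --> Bar with surrogates 4, 1, 1, 2.
cs_bar = encode([4, 1, 1, 2], bits=[3, 3, 1, 3])
print(cs_bar, format(cs_bar, "010b"))
```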

Properties of MHC Encoding
- very compact coding of fixed length
- lexicographic order of composite keys is preserved, i.e. it is isomorphic to the integer ordering
- point restrictions on arbitrary hierarchy levels lead to interval restrictions on the compound surrogates

Example: the path to USA is North America --> USA with $4 \circ 1$ = 100 001 (binary).
This leads to the range on cs: 100 001 0 000 to 100 001 1 111, i.e. the decimal range 528 to 543 or [528 : 543].

==> a star join with restriction North America.USA leads to an interval restriction on the fact table
==> point restrictions on arbitrary hierarchy levels of several dimensions lead to query boxes on the fact table.
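A point restriction on an inner hierarchy level fixes a bit prefix, and the remaining low-order bits range over all values. The sketch below derives the interval for North America --> USA; `prefix_range` is a hypothetical helper name.

```python
def prefix_range(prefix_ordinals, bits):
    """Interval [lo, hi] of compound surrogates sharing the given path prefix."""
    k = len(prefix_ordinals)
    prefix = 0
    for ordinal, width in zip(prefix_ordinals, bits[:k]):
        prefix = (prefix << width) | ordinal
    free_bits = sum(bits[k:])          # bits of the unrestricted lower levels
    lo = prefix << free_bits           # all free bits set to 0
    hi = lo | ((1 << free_bits) - 1)   # all free bits set to 1
    return lo, hi

# North America (4) --> USA (1) with level widths 3, 3, 1, 3.
lo, hi = prefix_range([4, 1], bits=[3, 3, 1, 3])
print(lo, hi)
```

Applying the same step to every restricted dimension yields one interval per dimension, which together form the query box on the fact table.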

Complex Hierarchies
- Time with months and weeks: both restrictions lead to intervals on the level of days (example of Fig. 4-4).
- Proposal for multiple hierarchies: choose the most useful one (depending on the query profile) or consider multiple hierarchies as several independent hierarchies. Caution: this increases the number of dimensions!
- Time-variant hierarchies: extend by the time interval of validity, see example Fig. 4-5.

[Figure: (a) hierarchy graph with YEAR, MONTH, WEEK, DAY; (b) hierarchy graph with REGION, NATION, TRADE TYPE, CUSTOMER TYPE, CUSTOMER SIZE, CUSTOMER]

Fig. 4-4 Complex Hierarchy Graphs

[Figure: CUSTOMER hierarchy with South Europe, North America, ...; USA, Canada; Retail, Wholesale; Bar, Restaurant; Joe's Sports Bar; shown for Year <= 1997 and Year > 1997]

Fig. 4-5 Change of a hierarchy over time

[Figure: panels "Orange Juice, Asia" and "Apple Juice, Asia"]

Processing a query box in sort order with the Tetris algorithm