1 Dr. Panagiotis Symeonidis Data Engineering Laboratory Data Warehouse implementation: Part B.

Presentation transcript:

1 Dr. Panagiotis Symeonidis Data Engineering Laboratory Data Warehouse implementation: Part B

2 Cuboids Materialization as an Optimization Problem. Minimize: the average time taken to evaluate a view. Constraint: materialize a fixed number k of views. Greedy algorithm: the best choice at each step is made based on what has been selected before. It is not guaranteed to give the optimal solution.

3 Example of lattice of views diagram. Top level: psc. Middle level: pc, ps, sc. Bottom level: p, s, c. Legend: p: part, s: supp, c: cust.

4 The lattice of views framework. If view V2 can be answered using the results of view V1, then V2 is a descendant of V1 and V1 is an ancestor of V2 (denoted V2 ≼ V1). E.g. (part) ≼ (part, cust).

5 Some Definitions. k is the number of views to be materialized. C(v) is the cost of view v. Given a view v and a set S of views already selected to be materialized, the benefit of selecting v for materialization is B(v, S) = C(S) − C(S ∪ {v}), where C(S) is read as the total cost of answering every view from its cheapest materialized ancestor in S.

6 Greedy Algorithm. S ← {top view}; for i = 1 to k do: select the view v not in S such that B(v, S) is maximized; S ← S ∪ {v}; return S.
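The greedy selection can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the lecture's own code: views are modelled as sets of attribute letters, a view is assumed answerable from any view whose attributes include its own, and the names `greedy_select`, `benefit` and `cheapest_cost` are hypothetical. The sizes (in thousands of rows) are those of the part/supp/cust cube used in the deck's worked example.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

View = FrozenSet[str]

# View sizes in thousands of rows (psc = 6M, ps = 0.8M, ...), as in the worked example.
SIZE: Dict[View, int] = {
    frozenset("psc"): 6000,
    frozenset("pc"): 6000,
    frozenset("ps"): 800,
    frozenset("sc"): 6000,
    frozenset("p"): 200,
    frozenset("s"): 10,
    frozenset("c"): 100,
}
TOP: View = frozenset("psc")


def cheapest_cost(v: View, selected: Set[View], size: Dict[View, int]) -> int:
    """Cost of answering v from its smallest materialized ancestor (v <= w)."""
    return min(size[w] for w in selected if v <= w)


def benefit(v: View, selected: Set[View], size: Dict[View, int]) -> int:
    """B(v, S): total cost saved over all views if v is materialized on top of S."""
    total = 0
    for w in size:
        if w <= v:  # w is v itself or a descendant of v
            saving = cheapest_cost(w, selected, size) - size[v]
            total += max(saving, 0)
    return total


def greedy_select(size: Dict[View, int], top: View, k: int) -> List[Tuple[View, int]]:
    """Pick k views besides the top view, one at a time, by maximum benefit."""
    selected: Set[View] = {top}
    picks: List[Tuple[View, int]] = []
    for _ in range(k):
        candidates = [v for v in size if v not in selected]
        best = max(candidates, key=lambda v: benefit(v, selected, size))
        picks.append((best, benefit(best, selected, size)))
        selected.add(best)
    return picks


if __name__ == "__main__":
    for view, gain in greedy_select(SIZE, TOP, k=2):
        print("".join(sorted(view)), gain)
```

With k = 2 this sketch picks ps (benefit 15,600) and then c (benefit 5,900), i.e. a total gain of 21.5M rows.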

7 1.1 Data Cube. View sizes: psc 6M; pc 6M, ps 0.8M, sc 6M; p 0.2M, s 0.01M, c 0.1M. k = 2. Benefits are in millions (M). 1st choice, benefit from pc: (6M − 6M) × 3 = 0 × 3 = 0.

8 1.1 Data Cube. 1st choice, benefit from ps: (6M − 0.8M) × 3 = 5.2 × 3 = 15.6.

9 1.1 Data Cube. 1st choice, benefit from sc: (6M − 6M) × 3 = 0 × 3 = 0.

10 1.1 Data Cube. 1st choice, benefit from p: (6M − 0.2M) × 1 = 5.8 × 1 = 5.8.

11 1.1 Data Cube. 1st choice, benefit from s: (6M − 0.01M) × 1 = 5.99 × 1 = 5.99.

12 1.1 Data Cube. 1st choice, benefit from c: (6M − 0.1M) × 1 = 5.9 × 1 = 5.9.

13 1.1 Data Cube. 2nd choice (ps already selected), benefit from pc: (6M − 6M) × 2 = 0 × 2 = 0.

14 1.1 Data Cube. 2nd choice, benefit from sc: (6M − 6M) × 2 = 0 × 2 = 0.

15 2nd choice, benefit from p: (0.8M − 0.2M) × 1 = 0.6 × 1 = 0.6.

16 1.1 Data Cube. 2nd choice, benefit from s: (0.8M − 0.01M) × 1 = 0.79 × 1 = 0.79.

17 1.1 Data Cube. 2nd choice, benefit from c: (6M − 0.1M) × 1 = 5.9 × 1 = 5.9.

18 Summary of benefits (M). 1st choice: pc 0 × 3 = 0; ps 5.2 × 3 = 15.6; sc 0 × 3 = 0; p 5.8 × 1 = 5.8; s 5.99 × 1 = 5.99; c 5.9 × 1 = 5.9. 2nd choice: pc 0 × 2 = 0; sc 0 × 2 = 0; p 0.6 × 1 = 0.6; s 0.79 × 1 = 0.79; c 5.9 × 1 = 5.9. The two views to be materialized are 1. ps and 2. c, so V = {ps, c}. Gain(V ∪ {top view}, {top view}) = 15.6 + 5.9 = 21.5. k = 2.

19 2nd Example of greedy algorithm. Initially, S = {a}; k = 4 (select 3 more). Lattice nodes, top to bottom: a; b, c; d, e, f; g, h.

20 2nd Example of greedy algorithm. First choice: b: 50 × 5 = 250; c: 25 × 5 = 125; d: 80 × 2 = 160; e: 70 × 3 = 210; f: 60 × 2 = 120; g: 99 × 1 = 99; h: 90 × 1 = 90. The largest benefit is 250, so b is selected.

21 2nd Example of greedy algorithm. Second choice: c: 25 × 2 = 50; d: 30 × 2 = 60; e: 20 × 3 = 60; f: (100 − 40) × 1 + (50 − 40) × 1 = 60 + 10 = 70; g: 49 × 1 = 49; h: 40 × 1 = 40. The largest benefit is 70, so f is selected.

22 2nd Example of greedy algorithm. Third choice: c: 25 × 1 = 25; d: 30 × 2 = 60; e: (50 − 30) × 2 + (40 − 30) × 1 = 20 × 2 + 10 × 1 = 50; g: 49 × 1 = 49; h: 30 × 1 = 30. The largest benefit is 60, so d is selected.

23 2nd Example of greedy algorithm. If we materialize only a, the cost would be 8 × 100 = 800. Now, with S = {a, b, d, f}, the cost is 100 + 50 + 100 + 20 + 50 + 40 + 20 + 40 = 420 (views a through h in order).
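The two totals can be checked with a short Python sketch. The lattice edges and the individual view costs below are not printed in the transcript; they are assumptions inferred from the first-choice benefits (for example, b saves 50 per view over a's cost of 100, so b is taken to cost 50, and b's multiplier of 5 is taken to mean it covers b, d, e, g, h). The names `PARENTS`, `COST` and `total_cost` are hypothetical.

```python
from typing import Dict, List, Set

# Assumed lattice: each view maps to its direct parents (inferred, not given in the slides).
PARENTS: Dict[str, List[str]] = {
    "a": [],
    "b": ["a"], "c": ["a"],
    "d": ["b"], "e": ["b", "c"], "f": ["c"],
    "g": ["d", "e"], "h": ["e", "f"],
}

# Assumed view costs, inferred from the first-choice benefits against a = 100.
COST: Dict[str, int] = {"a": 100, "b": 50, "c": 75, "d": 20, "e": 30, "f": 40, "g": 1, "h": 10}


def ancestors_or_self(view: str) -> Set[str]:
    """All views from which `view` can be answered, including itself."""
    found = {view}
    for parent in PARENTS[view]:
        found |= ancestors_or_self(parent)
    return found


def total_cost(materialized: Set[str]) -> int:
    """Sum, over every view, of the cost of its cheapest materialized ancestor."""
    return sum(
        min(COST[w] for w in ancestors_or_self(v) if w in materialized)
        for v in PARENTS
    )


if __name__ == "__main__":
    print(total_cost({"a"}))                 # only the top view
    print(total_cost({"a", "b", "d", "f"}))  # top view plus the greedy picks
```

Under these assumptions the sketch reproduces the slide's figures: 800 with only a materialized and 420 with {a, b, d, f}.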

24 Performance Study. How badly can the Greedy Algorithm perform?

25 1.1 Data Cube. View sizes: a 200; b 100, c 99, d 100; below them four groups of 20 nodes each (p, q, r, s), each node shown with size 97. Each of b, c and d covers itself plus two groups (b: p and q; c: q and r; d: r and s), i.e. 41 views. k = 2. 1st choice, benefit from b: (200 − 100) × 41 = 41 × 100 = 4100.

26 1.1 Data Cube. 1st choice, benefit from c: (200 − 99) × 41 = 41 × 101 = 4141.

27 1.1 Data Cube. 1st choice, benefit from d: (200 − 100) × 41 = 41 × 100 = 4100. The largest benefit is 4141, so greedy selects c.

28 1.1 Data Cube. 2nd choice (c already selected), benefit from b: (200 − 100) × 21 = 21 × 100 = 2100, because the q nodes are already served by c at cost 99.

29 1.1 Data Cube. 2nd choice, benefit from d: (200 − 100) × 21 = 21 × 100 = 2100. Greedy: V = {b, c}. Gain(V ∪ {top view}, {top view}) = 4141 + 2100 = 6241. k = 2.

30 1.1 Data Cube. If b is selected first instead (benefit 4100), the 2nd-choice benefits are c: 21 × 101 + 20 × 1 = 2141 and d: 41 × 100 = 4100. Greedy: V = {b, c}, Gain(V ∪ {top view}, {top view}) = 4141 + 2100 = 6241. Optimal: V = {b, d}, Gain(V ∪ {top view}, {top view}) = 4100 + 4100 = 8200. k = 2.

31 1.1 Data Cube. Greedy: V = {b, c}, Gain(V ∪ {top view}, {top view}) = 4141 + 2100 = 6241. Optimal: V = {b, d}, Gain(V ∪ {top view}, {top view}) = 4100 + 4100 = 8200. Greedy / Optimal = 6241 / 8200 ≈ 0.7611. If this ratio = 1, Greedy gives an optimal solution. If this ratio ≈ 0, Greedy may give a “bad” solution. Does this ratio have a lower bound? It is proved that this ratio is at least (e − 1)/e ≈ 0.63. k = 2.

32 Indexing OLAP Data: Bitmap Index. [Figure: a relation table with a bitmap index on Region and a bitmap index on Type.]
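A bitmap index keeps, for each distinct value of a column, one bit per row that says whether the row has that value; selections then become bitwise operations. The following Python sketch illustrates the idea only. The rows in `ROWS` are hypothetical (the slide's actual table is not in the transcript), and `build_bitmap_index` and `matching_rows` are illustrative names, with Python integers standing in for bit vectors.

```python
from typing import Dict, List

# Hypothetical relation table; only the Region and Type columns are indexed.
ROWS: List[Dict[str, str]] = [
    {"cust": "C1", "region": "Asia", "type": "Retail"},
    {"cust": "C2", "region": "Europe", "type": "Dealer"},
    {"cust": "C3", "region": "Asia", "type": "Dealer"},
    {"cust": "C4", "region": "America", "type": "Retail"},
    {"cust": "C5", "region": "Europe", "type": "Dealer"},
]


def build_bitmap_index(rows: List[Dict[str, str]], column: str) -> Dict[str, int]:
    """Map each distinct value of `column` to an int whose bit i is set iff row i has that value."""
    index: Dict[str, int] = {}
    for pos, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << pos)
    return index


def matching_rows(bitmap: int) -> List[int]:
    """Positions of the rows whose bit is set in `bitmap`."""
    return [pos for pos in range(bitmap.bit_length()) if (bitmap >> pos) & 1]


region_index = build_bitmap_index(ROWS, "region")
type_index = build_bitmap_index(ROWS, "type")

# Region = 'Europe' AND Type = 'Dealer' is a single bitwise AND of two bitmaps.
europe_dealers = matching_rows(region_index["Europe"] & type_index["Dealer"])
```

Here `europe_dealers` holds the row positions 1 and 4 (customers C2 and C5). The approach suits low-cardinality columns, where the number of bitmaps stays small.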

33 Determining which materialized cuboid(s) should be selected for OLAP operations. Query: find the total sales grouped by {product-category, province} with the condition “year = 2004”. Which of the following 4 materialized cuboids should be selected to process the query? 1) {year, product, city} 2) {year, product-category, country} 3) {year, product-category, province} 4) {product, province} where year = 2004

34 Solution. Let the query to be processed be on {product-category, province} with the condition “year = 2004”, with 4 materialized cuboids available: 1) {year, product, city}: it can be used. However, it costs the most because product and city are at lower (finer) levels. 2) {year, product-category, country}: it cannot be used because country is a more general concept than province. 3) {year, product-category, province}: it can be used. It could cost less than cuboid 4 if there are not many year values and there are many products for each product-category. 4) {product, province} where year = 2004: it can be used.
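The usability test behind this solution can be sketched in Python. This is an illustration under assumptions: the concept hierarchies in `HIERARCHY` (product below product-category; city below province below country) are the ones implied by the slide, while the `Cuboid` and `Query` classes and the `can_answer` function are hypothetical names. The sketch only decides whether a cuboid can be used, not which usable cuboid is cheapest.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Assumed concept hierarchies, finest level first.
HIERARCHY: Dict[str, list] = {
    "time": ["year"],
    "product": ["product", "product-category"],
    "location": ["city", "province", "country"],
}


@dataclass
class Cuboid:
    levels: Dict[str, str]  # dimension -> level kept in the cuboid
    prefilter: Dict[str, Tuple[str, int]] = field(default_factory=dict)  # baked-in selection


@dataclass
class Query:
    group_by: Dict[str, str]               # dimension -> level to group by
    condition: Dict[str, Tuple[str, int]]  # dimension -> (level, value)


def at_or_finer(dim: str, have: str, need: str) -> bool:
    """True if level `have` is the same as or finer than level `need`."""
    return HIERARCHY[dim].index(have) <= HIERARCHY[dim].index(need)


def can_answer(c: Cuboid, q: Query) -> bool:
    # A pre-filtered cuboid only holds data for exactly that selection.
    for dim, cond in c.prefilter.items():
        if q.condition.get(dim) != cond:
            return False
    # Every group-by level must be derivable by rolling up the cuboid.
    for dim, need in q.group_by.items():
        if dim not in c.levels or not at_or_finer(dim, c.levels[dim], need):
            return False
    # Every condition must be pre-applied or still checkable on the cuboid.
    for dim, (level, _value) in q.condition.items():
        if dim in c.prefilter:
            continue
        if dim not in c.levels or not at_or_finer(dim, c.levels[dim], level):
            return False
    return True


QUERY = Query({"product": "product-category", "location": "province"}, {"time": ("year", 2004)})
CUBOIDS = {
    1: Cuboid({"time": "year", "product": "product", "location": "city"}),
    2: Cuboid({"time": "year", "product": "product-category", "location": "country"}),
    3: Cuboid({"time": "year", "product": "product-category", "location": "province"}),
    4: Cuboid({"product": "product", "location": "province"}, {"time": ("year", 2004)}),
}
usable = [n for n, c in CUBOIDS.items() if can_answer(c, QUERY)]
```

In this sketch `usable` contains cuboids 1, 3 and 4, matching the solution: cuboid 2 is rejected because country cannot be refined back to province.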

35 Iceberg queries. Assume we want to find pairs of customers and items such that the customer has purchased the item more than 5 times in total: select P.custid, P.item, sum(P.qty) from Purchases P group by P.custid, P.item having sum(P.qty) > 5. What is the execution plan for the query? The number of groups is very large, but the answer to the query (the tip of the iceberg) is usually very small.

36 Iceberg queries. Original query: select P.custid, P.item, sum(P.qty) from Purchases P group by P.custid, P.item having sum(P.qty) > 5. Q1: select P.custid from Purchases P group by P.custid having sum(P.qty) > 5. Q2: select P.item from Purchases P group by P.item having sum(P.qty) > 5. Generate (custid, item) pairs only for custid from Q1 and item from Q2.
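The pruning idea of Q1 and Q2 can be sketched in Python on an in-memory list of purchases. This is a minimal illustration, assuming non-negative quantities (otherwise a pair could exceed the threshold while its customer or item total does not). The data in `PURCHASES` and the function name `iceberg_pairs` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Purchase = Tuple[str, str, int]  # (custid, item, qty)

# Hypothetical purchase records.
PURCHASES: List[Purchase] = [
    ("c1", "pen", 3), ("c1", "pen", 4), ("c1", "ink", 1),
    ("c2", "pen", 2), ("c2", "ink", 2), ("c3", "cap", 1),
]


def iceberg_pairs(purchases: List[Purchase], threshold: int) -> Dict[Tuple[str, str], int]:
    """(custid, item) pairs whose total qty exceeds threshold, counting only pruned candidates."""
    cust_total: Counter = Counter()
    item_total: Counter = Counter()
    for cust, item, qty in purchases:
        cust_total[cust] += qty  # Q1: total per customer
        item_total[item] += qty  # Q2: total per item
    heavy_custs = {c for c, total in cust_total.items() if total > threshold}
    heavy_items = {i for i, total in item_total.items() if total > threshold}

    # Second pass: keep a counter only for pairs that survive both filters.
    pair_total: Counter = Counter()
    for cust, item, qty in purchases:
        if cust in heavy_custs and item in heavy_items:
            pair_total[(cust, item)] += qty
    return {pair: total for pair, total in pair_total.items() if total > threshold}


result = iceberg_pairs(PURCHASES, threshold=5)
```

On this toy data only customer c1 and item pen pass Q1 and Q2, so a single pair counter is kept and `result` is the pair (c1, pen) with total 7. The saving comes from never allocating counters for the many pairs that cannot qualify.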

37 From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM). Why online analytical mining? High quality of data in data warehouses; OLAP-based exploratory data analysis; easy selection of data mining functions.

38 An OLAM System Architecture. [Figure: four-layer architecture. Layer 1, Data Repository: Databases and Data Warehouse, fed through data cleaning, data integration and filtering. Layer 2, MDDB: MDDB and Meta Data, reached through the Database API (filtering and integration). Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine, reached through the Data Cube API. Layer 4, User Interface: User GUI API, taking mining queries and returning mining results.]
