ICS 421 Spring 2010 Data Warehousing 3 Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 4/1/20101Lipyeow.

Slides:



Advertisements
Similar presentations
Outline  Introduction  Background  Distributed DBMS Architecture  Distributed Database Design  Semantic Data Control ➠ View Management ➠ Data Security.
Advertisements

Outline What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification.
IBM Software Group ® Recommending Materialized Views and Indexes with the IBM DB2 Design Advisor (Automating Physical Database Design) Jarek Gryz.
Incremental Maintenance for Non-Distributive Aggregate Functions work done at IBM Almaden Research Center Themis Palpanas (U of Toronto) Richard Sidle.
ICS 421 Spring 2010 Indexing (1) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 02/18/20101Lipyeow Lim.
ICS 421 Spring 2010 Data Warehousing 2 Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 3/30/20101Lipyeow.
ICS 421 Spring 2010 Data Warehousing (1) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 3/18/20101Lipyeow.
Data Warehouse IMS5024 – presented by Eder Tsang.
CS 286, UC Berkeley, Spring 2007, R. Ramakrishnan 1 Decision Support Chapter 25.
1 ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis) The Data Warehouse Lifecycle Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential.
Database Systems More SQL Database Design -- More SQL1.
Data Warehouse View Maintenance Presented By: Katrina Salamon For CS561.
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 157 Database Systems I SQL Constraints and Triggers.
Lecture-8/ T. Nouf Almujally
DAY 21: MICROSOFT ACCESS – CHAPTER 5 MICROSOFT ACCESS – CHAPTER 6 MICROSOFT ACCESS – CHAPTER 7 Akhila Kondai October 30, 2013.
Data Cube Computation Model dependencies among the aggregates: most detailed “view” can be computed from view (product,store,quarter) by summing-up all.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Decision Support Chapter 23.
CPSC 404, Laks V.S. Lakshmanan 1 Data Warehousing & OLAP Chapter 25, Ramakrishnan & Gehrke (Sections )
Database Design - Lecture 1
Data Administration & Database Administration
Data Warehousing Seminar Chapter 5. Data Warehouse Design Methodology Data Warehousing Lab. HyeYoung Cho.
The Relational Model. Review Why use a DBMS? OS provides RAM and disk.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25, Part B.
Data Warehouse Overview September 28, 2012 presented by Terry Bilskie.
CODD’s 12 RULES OF RELATIONAL DATABASE
311: Management Information Systems Database Systems Chapter 3.
© 2007 by Prentice Hall 1 Introduction to databases.
Query Optimization Arash Izadpanah. Introduction: What is Query Optimization? Query optimization is the process of selecting the most efficient query-evaluation.
Part Two: - The use of views. 1. Topics What is a View? Why Views are useful in Data Warehousing? Understand Materialised Views Understand View Maintenance.
1 Data Warehouses BUAD/American University Data Warehouses.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Decision Support Chapter 23.
Data Warehouse & OLAP Kuliah 1 Introduction Slide banyak mengambil dari acuan- acuan yang dipakai.
The Data Warehouse “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of “all” an organisation’s data in support.
Data Warehousing.
5-1 McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved.
ICS 321 Fall 2011 Constraints, Triggers, Views & Indexes Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa.
The Relational Model1 ER-to-Relational Mapping and Views.
Data Management for Decision Support Session-3 Prof. Bharat Bhasker.
6.1 © 2010 by Prentice Hall 6 Chapter Foundations of Business Intelligence: Databases and Information Management.
ICS 321 Fall 2011 The Relational Model of Data (i) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 8/29/20111Lipyeow.
Sachin Goel (68) Manav Mudgal (69) Piyush Samsukha (76) Rachit Singhal (82) Richa Somvanshi (85) Sahar ( )
Methodology – Physical Database Design for Relational Databases.
Module Coordinator Tan Szu Tak School of Information and Communication Technology, Politeknik Brunei Semester
Ayyat IT Group Murad Faridi Roll NO#2492 Muhammad Waqas Roll NO#2803 Salman Raza Roll NO#2473 Junaid Pervaiz Roll NO#2468 Instructor :- “ Madam Sana Saeed”
AL-MAAREFA COLLEGE FOR SCIENCE AND TECHNOLOGY INFO 232: DATABASE SYSTEMS CHAPTER 7 (Part II) INTRODUCTION TO STRUCTURED QUERY LANGUAGE (SQL) Instructor.
Foundations of Business Intelligence: Databases and Information Management.
7 Strategies for Extracting, Transforming, and Loading.
OLAP & Data Warehousing. R. Ramakrishnan and J. Gehrke1 Decision Support Chapter 23.
By N.Gopinath AP/CSE.  The data warehouse architecture is based on a relational database management system server that functions as the central repository.
SCALING AND PERFORMANCE CS 260 Database Systems. Overview  Increasing capacity  Database performance  Database indexes B+ Tree Index Bitmap Index 
MIS 451 Building Business Intelligence Systems Data Staging.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 The Relational Model Chapter 3.
1 Data Warehousing Data Warehousing. 2 Objectives Definition of terms Definition of terms Reasons for information gap between information needs and availability.
Copyright © 2004 Pearson Education, Inc.. Chapter 24 Enhanced Data Models for Advanced Applications.
Data Mining and Data Warehousing: Concepts and Techniques What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP Conceptual Modeling.
Plan for Populating a DW
CS122A: Introduction to Data Management Lecture #11: SQL (4) -- Triggers, Views, Access Control Instructor: Chen Li.
Pertemuan <<13>> Data Warehousing dan Decision Support
Views, Stored Procedures, Functions, and Triggers
Data Warehouse.
Views and Security views and security.
CS 440 Database Management Systems
View and Index Selection Problem in Data Warehousing Environments
Data Warehousing and Decision Support
Views 1.
Data Warehousing Concepts
Best Practices in Higher Education Student Data Warehousing Forum
Implementing ETL solution for Incremental Data Load in Microsoft SQL Server Ganesh Lohani SR. Data Analyst Lockheed Martin
Presentation transcript:

ICS 421 Spring 2010 Data Warehousing 3 Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 4/1/20101Lipyeow Lim -- University of Hawaii at Manoa

Implementation Issues Recall requirements of a data warehouse: – Read only (updates via ETL) – Ad hoc queries – Interactive response times How do we support fast response times ? – Indexing, new indexes – Pre-compute results aka materialization – Views 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa2

Bitmap Indexes What are the possible QEPs for this query ? What if there are indexes on gender and rating ? How does bitmap indexes help ? Why bitmap indexes and not B+ trees ? 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa3 CustIDNameGenderRating 112JoeM3 115RamM5 119SueF5 117WooM4 How many male customers have a rating of 5? SELECT COUNT(*) FROM Customer WHERE Gender=‘M’ AND Rating=5 MF Bitmap Index for Gender Bitmap Index for Rating

Join Indexes How do we speed up joins with dimension tables ? A join index stores the RIDs of all the join tuples: – [RID(Sales), RID(Products), RID(Times), RID(Locations)] Variant: if many join queries have predicates on state – [Value(Location.state). RID(Sales), RID(Products), RID(Times), RID(Locations)] – B+ tree with Location.state as key and the tuple of RIDs as the data entry. 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa4 timeIDDateWeekMonthQuaterYear PidTimeidLocidSales PidPnameCategoryPriceLocidCityStateCountry Times Sales (Fact Table) Products Locations SELECT S.Sales, T.*, P.*, L.* FROM Sales S, Times T, Products P, Locations L WHERE S.pid=P.pid AND S.Timeid=T.timeID AND S.locid=L.locid AND L.State=‘HI’

Bitmap Join Index (Oracle) Create a bitmap index where – One bitvector per L.state – Each bitvector encodes RIDs of Sales Index ANDing of multiple of these bitmap join indexes are efficient! What is the difference with a regular bitmap index ? 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa5 SELECT S.Sales FROM Sales S, Locations L WHERE S.locid=L.locid AND L.State=‘HI’ CREATE BITMAP INDEX myidx ON Sales(st.state) FROM Sales S, Locations L WHERE S.locid=L.locid

Views (Evaluate on Demand) A view is conceptually the same as a relation, but we store a definition, rather than a set of tuples. Views can be dropped using the DROP VIEW command.  How to handle DROP TABLE if there’s a view on the table? DROP TABLE command has options to let the user specify this. 4/1/20106Lipyeow Lim -- University of Hawaii at Manoa CREATE VIEW RegionalSales(category,sales,state) AS SELECT P.category, S.sales, L.state FROM Products P, Sales S, Locations L WHERE P.pid=S.pid AND S.locid=L.locid

Query Rewriting using Views 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa7 CREATE VIEW RegionalSales(category,sales,state) AS SELECT P.category, S.sales, L.state FROM Products P, Sales S, Locations L WHERE P.pid=S.pid AND S.locid=L.locid View Definition SELECT R.category, R.state, SUM(R.sales) FROM RegionalSales as R GROUP BY R.category, R.state Query SELECT R.category, R.state, SUM(R.sales) FROM (SELECT P.category, S.sales, L.state FROM Products P, Sales S, Locations L WHERE P.pid=S.pid AND S.locid=L.locid ) as R GROUP BY R.category, R.state Re-written Query

View Materialization (Precomputation) Suppose we precompute RegionalSales and store it with a clustered B+ tree index on [category,state,sales]. – Then, previous query can be answered by an index-only scan. 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa8 SELECT R.state, SUM (R.sales) FROM RegionalSales R WHERE R.category=“Laptop” GROUP BY R.state SELECT R.state, SUM (R.sales) FROM RegionalSales R WHERE R. state=“Wisconsin” GROUP BY R.category

Materialized Views A view whose tuples are stored in the database is said to be materialized. – Provides fast access, like a (very high-level) cache. – Need to maintain the view as the underlying tables change. – Ideally, we want incremental view maintenance algorithms. Close relationship to data warehousing, OLAP, (asynchronously) maintaining distributed databases, checking integrity constraints, and evaluating rules and triggers. 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa9

Issues in View Materialization What views should we materialize, and what indexes should we build on the precomputed results? Given a query and a set of materialized views, can we use the materialized views to answer the query? How frequently should we refresh materialized views to make them consistent with the underlying tables? (And how can we do this incrementally?) 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa10

View Maintenance Two steps: – Propagate: Compute changes to view when data changes. – Refresh: Apply changes to the materialized view table. Maintenance policy: Controls when we do refresh. – Immediate: As part of the transaction that modifies the underlying data tables. (+ Materialized view is always consistent; - updates are slowed) – Deferred: Some time later, in a separate transaction. (- View becomes inconsistent; + can scale to maintain many views without slowing updates) 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa11

Deferred Maintenance Three flavors: – Lazy: Delay refresh until next query on view; then refresh before answering the query. – Periodic (Snapshot): Refresh periodically. Queries possibly answered using outdated version of view tuples. Widely used, especially for asynchronous replication in distributed databases, and for warehouse applications. – Event-based: E.g., Refresh after a fixed number of updates to underlying data tables. 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa12

Inc. View Maintenance: Inserts What information is available? – Only materialized view available: Add p5 if it isn’t there. – Parts table is available: If there isn’t already a parts tuple p5 with cost >1000, add p5 to view. May not be available if the view is in a data warehouse! – If we know pno is key for parts: Can infer that p5 is not already in view, must insert it. 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa13 expensive_parts(pno) :- parts(pno, cost), cost > 1000 CREATE VIEW expensive_parts(pno) AS SELECT pno FROM parts WHERE cost > 1000 Suppose parts(p5,5000) is inserted

Inc. View Maintenance: Deletes What information is available? – Only materialized view available: If p1 is in view, no way to tell whether to delete it. (Why?) If count(#derivations) is maintained for each view tuple, can tell whether to delete p1 (decrement count and delete if = 0). – Parts table is available: If there is no other tuple p1 with cost >1000 in parts, delete p1 from view. – If we know pno is key for parts: Can infer that p1 is currently in view, and must be deleted. 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa14 expensive_parts(pno) :- parts(pno, cost), cost > 1000 Suppose parts(p1,3000) is delerted

Inc. Maintenance Algorithm: Inserts Step 0: For each tuple in the materialized view, store a “derivation count”. Step 1: Rewrite this rule using Seminaive rewriting, set “delta_old” relations for Rel1 and Rel2 to be the inserted tuples. Step 2: Compute the “delta_new” relations for the view relation. – Important: Don’t remove duplicates! For each new tuple, maintain a “derivation count”. Step 3: Refresh the stored view by doing “multiset union” of the new and old view tuples. (I.e., update the derivation counts of existing tuples, and add the new tuples that weren’t in the view earlier.) 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa15 View(X,Y) :- Rel1(X,Z), Rel2(Z,Y)

Inc. Maintenance Algorithm: Deletes Steps 0 - 2: As for inserts. Step 3: Refresh the stored view by doing “multiset difference ” of the new and old view tuples. – To update the derivation counts of existing tuples, we must now subtract the derivation counts of the new tuples from the counts of existing tuples. The “counting” algorithm can be generalized to views defined by multiple rules. In fact, it can be generalized to SQL queries with duplicate semantics, negation, and aggregation. 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa16 View(X,Y) :- Rel1(X,Z), Rel2(Z,Y)

Maintaining Warehouse Views Main twist: The views are in the data warehouse, and the source tables are somewhere else (operational DBMS, legacy sources, …). 1)Warehouse is notified whenever source tables are updated. (e.g., when a tuple is added to r2) 2)Warehouse may need additional information about source tables to process the update (e.g., what is in r1 currently?) 3)The source responds with the additional info, and the warehouse incrementally refreshes the view. 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa17 Data Warehouse DB Extract, Transform, Load, Refresh Metadata view(sno) :- r1(sno, pno), r2(pno, cost) What happens if source is updated between Steps 1 and 3?

Example: Warehouse View Maintenance Initially, we have r1(1,2), r2 empty insert r2(2,3) at source; notify warehouse Warehouse asks ?r1(sno,2) – Checking to find sno’s to insert into view insert r1(4,2) at source; notify warehouse Warehouse asks ?r2(2,cost) – Checking to see if we need to increment count for view(4) Source gets first warehouse query, and returns sno=1, sno=4; these values go into view (with derivation counts of 1 each) Source gets second query, and says Yes, so count for 4 is incremented in the view – But this is wrong! Correct count for view(4) is 1. 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa18 view(sno) :- r1(sno, pno), r2(pno, cost)

Warehouse View Maintenance Approaches Alternative 1: Evaluate view from scratch – On every source update, or periodically Alternative 2: Maintain a copy of each source table at warehouse Alternative 3: More fancy algorithms – Generate queries to the source that take into account the anomalies due to earlier conflicting updates. 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa19

Data Warehousing Architecture 4/1/2010Lipyeow Lim -- University of Hawaii at Manoa20 Data Warehouse Extract Transform Load Refresh OLAP Engine Analysis Query Reports Data mining Monitor & Integrator Metadata Data Sources Front-End Tools Serve Data Marts Operational DBs other sources Data Storage OLAP Server