CSD305 Data Warehouse Design

CSD305 Data Warehouse Design
An introduction to data warehousing
- Connolly and Begg, Database Systems, 4th edition, chapters 31 and 32
- Tan, Steinbach and Kumar, Introduction to Data Mining, Pearson Education

Agenda
- Business Intelligence
- The architecture of data warehouse and data mart
- Data integration and cleansing
- Data extraction and OLAP

Data warehousing is no longer seen as optional for many businesses. Database vendors now include data warehousing capabilities, and warehouses serve not only internal users but also external ones, e.g. customers and suppliers. Driven by:
- government regulation that requires maintenance of transactional histories
- cheaper, more reliable data storage
- the emergence of real-time data warehousing for time-critical BI applications

Business Intelligence (BI)
Data Mining (DM) and Knowledge Discovery in Databases (KDD)
Recent advances in technology have resulted in an explosion in the amount of data generated and collected by businesses and organizations:
- point-of-sale data in retail
- smart card technology (e.g. Oyster cards)
- satellite data
Extracting useful information from these data sets can be challenging.

Benefits
- Potential high returns on investment (ROI)
  - A data warehouse can cost from tens of thousands to millions to implement
  - A 1996 study by International Data Corporation (IDC) found that DW projects delivered an average three-year ROI of 401%
  - A later study, in 2002, found that analytical tools delivered an average one-year ROI of 431%
- Competitive advantage
  - The huge ROI is evidence of an enormous competitive advantage
  - Taps into previously unknown, untapped information on customers, trends and demands
- Increased productivity of corporate decision makers
  - Consistent, subject-oriented, historical data
  - Integrates data from multiple incompatible systems, providing a consistent view
  - Allows for more substantive, accurate and consistent analysis

Knowledge Discovery in Databases (KDD)
Stages shown in the diagram: input data, data preprocessing, data mining, post-processing, knowledge.
Other diagram labels: filtering, pattern analysis, visualization, cleansing.
Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from a collection of data. The process includes data preparation and selection, data cleansing, incorporating prior knowledge of the data sets, and interpreting accurate solutions from the observed results.

How to get to BI?
- Data mining techniques, technologies and algorithms are being developed to provide BI
- Before we get to data mining we must first preprocess the data to get it into the right form
- This will often involve the design of a data warehouse:
  - Design the data warehouse architecture
  - Extract and cleanse the data
  - Integrate the data into a common data mart schema

Data warehousing - General Architecture
Operational data sources:
- Mainframe data held in first-generation hierarchical and network databases
- Departmental data in proprietary file systems and RDBMSs: VSAM (Virtual Storage Access Method) on IBM operating systems, and the RMS file system for other vendors
- Private data on workstations and private servers
- External systems, e.g. the internet, commercial databases, or databases associated with business customers or suppliers

Data Warehousing concepts
- Subject-oriented: organised around the major subjects of the business, e.g. customers, products, sales, rather than application areas, e.g. customer invoicing, stock control, product sales.
- Integrated: source data is often inconsistent, e.g. using different formats. The integrated data source has to be made consistent.
- Time-variant: data in the warehouse is accurate and valid only at some point in time or over some time interval. Data is held over an extended time, with an explicit or implicit association of time with the data. The data represents a series of snapshots.
- Non-volatile: data is not updated in real time but refreshed on a regular basis. New data is added as a supplement rather than a replacement, integrating incrementally with previous data and continuing to build historical data.
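The time-variant and non-volatile properties can be illustrated with a minimal Python sketch. It assumes a warehouse modelled as an append-only list of dated snapshot rows; the names `refresh` and `as_of` and the customer data are hypothetical and only serve the illustration.

```python
# Append-only store of (snapshot_date, customer_id, city) rows.
# Existing rows are never updated or deleted (non-volatile).
warehouse = []

def refresh(store, snapshot_date, source_rows):
    """Periodic refresh: add the new snapshot alongside the old ones."""
    for customer_id, city in source_rows:
        store.append((snapshot_date, customer_id, city))

def as_of(store, customer_id, date):
    """Return the city recorded in the latest snapshot on or before `date`."""
    matches = [row for row in store if row[1] == customer_id and row[0] <= date]
    # ISO dates sort correctly as strings, so max() picks the latest snapshot.
    return max(matches)[2] if matches else None

refresh(warehouse, "2005-01-31", [(1, "Glasgow")])
refresh(warehouse, "2005-02-28", [(1, "London")])   # customer moved; history is kept

city_mid_february = as_of(warehouse, 1, "2005-02-15")
city_in_march = as_of(warehouse, 1, "2005-03-01")
print(city_mid_february, city_in_march)
```

Because each refresh supplements rather than replaces earlier rows, a query can be answered as of any past date.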

Data Integration Component
Systems storing data may differ in:
- Data types
- Values, e.g. Colour: Black stored as 0, BL, or Black
- Semantics (the same term with different interpretations), e.g. the column 'Title' means 'Job Title' in one database and 'Person Title' in another
- Missing values (NULLs)
- Schemata
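A value difference such as the colour example can be resolved with a lookup that maps each source coding to one canonical value. The sketch below is only an illustration: the mapping table and the function name `normalise_colour` are assumptions, not part of the slides.

```python
# Hypothetical mapping from source-specific codings to one canonical value.
COLOUR_CODES = {"0": "Black", "BL": "Black", "BLACK": "Black"}

def normalise_colour(raw):
    """Return the canonical colour, or None for missing or unrecognised values."""
    if raw is None:
        return None                     # missing value (NULL) stays missing
    key = str(raw).strip().upper()      # tolerate type, case and spacing differences
    return COLOUR_CODES.get(key)

source_values = [0, "BL", "Black", " black ", None, "??"]
cleaned = [normalise_colour(value) for value in source_values]
print(cleaned)
```

Unrecognised codes come back as None here so that they can be reviewed rather than silently guessed.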

Data Integration Conflicts
- Schema-level conflicts are due to:
  - different perceptions
  - different focus
- Value-level conflicts are due to:
  - different representations, coding, etc.
  - different precision
  - incorrect information
  - data entry errors
- Data cleaning

The Warehousing Approach
- Information is integrated in advance
- Stored in the warehouse (WH) for direct querying and analysis
ETL (Extract, Transform, Load)
- Extract: reading and converting data from various sources, usually flat files and RDBMSs, but could be any data storage.
- Transform: using rules, filtering, sorting, aggregating, joining data, cleaning data, generating calculated data, validating data.
- Load: loading the transformed data to the target database, into an Operational Data Store (ODS).
Operational Data Store (ODS)
- A repository of current and integrated data used for analysis
- Often structured in the same way as the data warehouse, but may simply be used as a staging area
- Holds data already extracted from sources and cleaned
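The three ETL stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the flat-file contents, the column names and the `staged_sale` table are all hypothetical, and an in-memory SQLite database stands in for the target store.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source; contents and column names are assumptions.
FLAT_FILE = """sale_id,branch,price,sale_date
1,Glasgow,120000,2004-03-01
2,glasgow ,95000,2004-03-02
3,London,,2004-03-02
"""

def extract(text):
    """Extract: read rows from the flat file into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: validate, clean branch names and cast types."""
    cleaned = []
    for row in rows:
        if not row["price"]:
            continue  # validation rule: a sale must have a price
        cleaned.append((int(row["sale_id"]), row["branch"].strip().title(),
                        float(row["price"]), row["sale_date"]))
    return cleaned

def load(rows, conn):
    """Load: write the transformed rows to the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS staged_sale "
                 "(sale_id INTEGER PRIMARY KEY, branch TEXT, price REAL, sale_date TEXT)")
    conn.executemany("INSERT INTO staged_sale VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(FLAT_FILE)), conn)
staged = conn.execute(
    "SELECT branch, SUM(price) FROM staged_sale GROUP BY branch").fetchall()
print(staged)
```

Keeping the stages as separate functions makes it easy to swap the source (another file, an RDBMS query) without touching the transform rules.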

Dimensional modelling
- Dimensional modelling: a logical design technique that aims to present data in a standard, intuitive form that allows for high-performance access
- Each dimensional model has:
  - one fact table
  - multiple dimension tables
- Each dimension gives a different focus to the data
- Star schema or star join: a fact table in the centre, surrounded by denormalized dimension tables

Star schema
Fact table
- One table with a composite primary key
- Factual data generated by events that occurred in the past; unlikely to change, regardless of how it is analysed
- Can be large relative to dimension tables
- Holds numerical measures or 'facts' that occur for each record
- Other examples: offer price, selling price, sale commission, sale revenue
Dimension table
- Each has a simple primary key (PK)
- The PK corresponds to exactly one component of the composite key in the fact table
- All natural keys are replaced by surrogate keys
  - A natural key has a relationship with the data, e.g. ISBN, vehicle registration number
  - A surrogate key has no relationship with the data; it is a generated value that makes the record unique
  - Surrogate keys generally have a structure based on integers
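As a rough illustration of the structure described above, the following sketch builds a two-dimension star schema in an in-memory SQLite database. The table and column names (`time_dim`, `property_dim`, `property_sale_fact`, `property_no`) are assumptions chosen for the example, not the schema used in the course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces REFERENCES only when enabled
conn.executescript("""
CREATE TABLE time_dim (
    time_id   INTEGER PRIMARY KEY,      -- surrogate key (generated integer)
    sale_date TEXT,
    year      INTEGER
);
CREATE TABLE property_dim (
    property_id INTEGER PRIMARY KEY,    -- surrogate key (generated integer)
    property_no TEXT,                   -- natural key kept as an ordinary attribute
    city        TEXT
);
CREATE TABLE property_sale_fact (
    time_id       INTEGER REFERENCES time_dim(time_id),
    property_id   INTEGER REFERENCES property_dim(property_id),
    offer_price   REAL,                 -- numerical facts
    selling_price REAL,
    PRIMARY KEY (time_id, property_id)  -- composite key: one component per dimension
);
""")

conn.execute("INSERT INTO time_dim VALUES (1, '2004-03-01', 2004)")
conn.execute("INSERT INTO property_dim VALUES (1, 'PG4', 'Glasgow')")
conn.execute("INSERT INTO property_sale_fact VALUES (1, 1, 100000.0, 98000.0)")

# A second fact row with the same composite key is rejected.
try:
    conn.execute("INSERT INTO property_sale_fact VALUES (1, 1, 99000.0, 97000.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Columns that make up the fact table's primary key (pk position > 0).
pk_columns = [row[1] for row in conn.execute("PRAGMA table_info(property_sale_fact)")
              if row[5] > 0]
print(pk_columns, duplicate_rejected)
```

Each dimension table has a simple integer key, and the fact table's composite key is made up of exactly those keys.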

Star schema essentials
- One fact table
- Multiple dimension tables
- Each dimension gives a different focus to the data

Star join

Star join essentials
- Again, there are fact tables and dimension tables.
- A structure is said to be a star join when one large central table is joined to two or more dimension tables.
- The fact table is the structure that holds the majority of the occurrences of the data.
- Fact tables typically combine data and cross-reference keys from a variety of other tables.
- Dimension tables contain data that is not especially voluminous.
- Dimension tables are related to fact tables by means of a foreign key relationship.
- Typical use would be for an end-user query on any fact table.
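A typical end-user query over such a structure joins the central fact table to each dimension through its foreign key and then aggregates. The sketch below assumes a small hypothetical schema (`sale_fact`, `time_dim`, `branch_dim`) with made-up rows, held in an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim   (time_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE branch_dim (branch_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sale_fact (
    time_id       INTEGER REFERENCES time_dim(time_id),
    branch_id     INTEGER REFERENCES branch_dim(branch_id),
    selling_price REAL,
    PRIMARY KEY (time_id, branch_id)
);
INSERT INTO time_dim   VALUES (1, 2004), (2, 2005);
INSERT INTO branch_dim VALUES (1, 'Glasgow'), (2, 'London');
INSERT INTO sale_fact  VALUES (1, 1, 100.0), (1, 2, 200.0), (2, 1, 50.0);
""")

# Star join: the fact table joined to two dimension tables via foreign keys.
STAR_JOIN = """
SELECT t.year, b.city, SUM(f.selling_price) AS total_sales
FROM sale_fact AS f
JOIN time_dim   AS t ON t.time_id = f.time_id
JOIN branch_dim AS b ON b.branch_id = f.branch_id
GROUP BY t.year, b.city
ORDER BY t.year, b.city
"""
rows = conn.execute(STAR_JOIN).fetchall()
for row in rows:
    print(row)
```

The dimension tables supply the descriptive context (year, city), while the numbers being summed all come from the fact table.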

Why star joins?
- By building star joins, the designer creates a structure for efficient access to large volumes of data and for natural end-user viewing.
Problem with star joins
- To know how to create the star join, the designer must make assumptions about the usage of the data.
- One department will look at data very differently from another department.
- The star join for finance will be very different from the star join for production, for example.

Star Schema development steps
1. Choose the process
2. Choose the grain
3. Identify the dimensions
4. Choose the facts

Choose Business Process
- Choose the business process for the data mart
- A data mart contains a subset of corporate data to support the requirements of a business unit, e.g. the sales department
- DreamHome processes include:
  - Property sales
  - Property rentals
  - Property viewing
  - Property advertising
  - Property maintenance
- The best choice for a first data mart is sales and finance: the data source is likely to be accessible and of high quality

Choose the Grain
- A balance between meeting business requirements and what is possible given the data source
- The grain determines what the fact table represents, in this case sales facts for each product
- It is best to build the model using the lowest level of detail available
- Only when the grain is chosen can we identify the dimensions
- Time is included as a core dimension and is always present in dimensional models
- The grain decision also determines the grain of the dimension tables, e.g. if the grain of saleFacts is each product sold, StoreDimension holds details of the store each sale took place in
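One way to see why the lowest level of detail is preferred: fine-grained facts can always be rolled up to a coarser grain, but coarse totals cannot be broken back down. The sketch below uses made-up fact rows and a hypothetical helper, `roll_up`.

```python
from collections import defaultdict

# Hypothetical fact rows at the finest grain: one row per product per sale.
sale_facts = [
    {"sale_date": "2005-01-10", "store": "S1", "product": "P1", "amount": 20.0},
    {"sale_date": "2005-01-10", "store": "S1", "product": "P2", "amount": 5.0},
    {"sale_date": "2005-01-25", "store": "S2", "product": "P1", "amount": 20.0},
    {"sale_date": "2005-02-03", "store": "S1", "product": "P1", "amount": 22.0},
]

def roll_up(rows, key_func):
    """Aggregate fine-grained rows to a coarser grain chosen by key_func."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_func(row)] += row["amount"]
    return dict(totals)

# Coarser grain: one total per month per store (the month is the first 7 characters).
monthly_by_store = roll_up(sale_facts, lambda r: (r["sale_date"][:7], r["store"]))
print(monthly_by_store)
```

Had only the monthly totals been stored, a question about individual products could no longer be answered.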

Choose Dimensions
- Dimensions set the context for asking questions about the facts.
- Identify dimensions in sufficient detail to describe things at the correct grain.
- A dimension can be used in more than one dimensional model (data mart); it is then referred to as being conformed.
- Conformed dimensions must be exactly the same, or one must be a subset of the other.
- This allows individual data marts to combine to form the enterprise data warehouse.
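The "exactly the same or a subset" rule can be checked mechanically. The sketch below treats each dimension as a set of rows and is only an illustration; the function name `is_conformed` and the branch data are hypothetical.

```python
def is_conformed(dim_a, dim_b):
    """True if the two dimensions are identical or one is a subset of the other."""
    a, b = set(dim_a), set(dim_b)
    return a <= b or b <= a

# Hypothetical branch dimension rows: (surrogate key, city).
sales_branches = [(1, "Glasgow"), (2, "London"), (3, "Aberdeen")]
advertising_branches = [(1, "Glasgow"), (2, "London")]    # a subset of the sales rows
rentals_branches = [(1, "Glasgow"), (2, "Leeds")]         # key 2 means something else here

sales_vs_advertising = is_conformed(sales_branches, advertising_branches)
sales_vs_rentals = is_conformed(sales_branches, rentals_branches)
print(sales_vs_advertising, sales_vs_rentals)
```

In the second comparison the same key refers to different cities, so results from the two data marts could not be safely combined.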

Choose Facts
- The grain of the fact table determines the facts to be used.
- All facts must be expressed at the level implied by the grain, e.g. if the grain is the sale of individual products, all numerical facts must refer to that particular sale.
- Facts must also be numerical and additive (they can be summed across any dimension).
- Additional facts can be added to the fact table at any time, provided they are consistent with the grain.
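Additivity can be demonstrated with a small example: an additive fact gives the same grand total whichever dimension it is summed across, whereas a derived value such as an average does not combine in that way. The fact rows below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows at the grain "one row per property sale".
facts = [
    {"branch": "Glasgow", "year": 2004, "selling_price": 100.0},
    {"branch": "Glasgow", "year": 2005, "selling_price": 50.0},
    {"branch": "London", "year": 2004, "selling_price": 200.0},
]

def total_by(rows, dimension, measure):
    """Sum a measure across one dimension."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)

by_branch = total_by(facts, "branch", "selling_price")
by_year = total_by(facts, "year", "selling_price")

# Non-additive contrast: averaging the per-branch averages is not the overall average.
counts = defaultdict(int)
for row in facts:
    counts[row["branch"]] += 1
branch_means = {b: by_branch[b] / counts[b] for b in by_branch}
mean_of_branch_means = sum(branch_means.values()) / len(branch_means)
overall_mean = sum(r["selling_price"] for r in facts) / len(facts)

print(by_branch, by_year)
print(mean_of_branch_means, overall_mean)
```

Storing the additive selling price and computing averages at query time avoids the inconsistency shown in the last two values.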

Data extractions
- Data mart software will often include extraction and visualization tools
- These are given the umbrella term OLAP (Online Analytical Processing)

Data extractions Image from http://projects.cs.dal.ca/panda/olap.html

OLAP and Data Mining We will see how a data warehouse could be examined in different dimensions In the following week we shall look at Data Mining and begin investigating various data mining models

DreamHome Star Schema from SQL Server

Dimensional Modelling
Step 1: Select business process
- The process (function) refers to the subject matter of a particular data mart.
- The first data mart built should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions.

ER model of extended version of DreamHome

ER model of property sales business process

Step 2: Declare grain
- Decide what a record of the fact table is to represent.
- Identify the dimensions of the fact table.
- The grain decision for the fact table also determines the grain of each dimension table.
- Also include time as a core dimension, which is always present in star schemas.

Step 3: Choose dimensions
- Dimensions set the context for asking questions about the facts in the fact table.
- If any dimension occurs in two data marts, they must be exactly the same dimension, or one must be a mathematical subset of the other.
- A dimension used in more than one data mart is referred to as being conformed.

Star schemas for property sales and property advertising

Step 4: Identify facts
- The grain of the fact table determines which facts can be used in the data mart.
- Facts should be numeric and additive.
- Unusable facts include:
  - non-numeric facts
  - non-additive facts
  - facts at a different granularity from the other facts in the table
- Once the facts have been selected, each should be re-examined to determine whether there are opportunities to use pre-calculations.