DIMENSIONAL MODELING MIS2502 Data Analytics

So we know…
Relational databases are good for storing transactional data, but bad for analytical data.
What we can do is design an analytical data store based on the operational data store.
That architecture gives us the advantages of both:
  Relational database for operational use
  Analytical database for analysis (Online Analytical Processing)

Why have a separate ADS?
Issue 1: Performance
  The structure is built to handle analysis.
  You keep the load off the operational data store.
Issue 2: Usability
  We can structure the data in an intuitive way.
  You keep the load off of your IT department.

Some terminology
Data Warehouse: Takes many forms. Really is just a repository for data.
Data Mart: More focused. Specially designed for analysis.
Data Cube: Organization of data as a “multidimensional matrix.” Implementation of a data mart.

How they all relate
The data in the operational database is put into a data warehouse, which feeds the data mart, and is analyzed as a cube.
We’ll start with the cube.

The Data Cube
Core component of Online Analytical Processing and Multidimensional Data Analysis.
Made up of “facts” and “dimensions.”
[Figure: data cube in which every cell holds quantity & total price. Dimensions: Product (M&Ms, Diet Coke, Doritos, Famous Amos); Store (Ardmore, PA; Temple Main; Cherry Hill, NJ; King of Prussia, PA); Time (Jan, Feb, Mar).]
Quantity sold and total price are measured facts. Why isn’t product price a measured fact?

The Data Cube
[Figure: Product × Store × Time cube of quantity & total price, with one cell highlighted.]
The highlighted element represents all the M&Ms sold in Ardmore, PA in January 2011.
A single summary record representing a business event (monthly sales).

The Data Cube
[Figure: Product × Store × Time cube of quantity & total price, with a row of cells highlighted.]
The highlighted elements represent Famous Amos cookies sold on Temple’s Main campus from January to March 2011.
This is called “slicing the data.”
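
As a rough sketch of the two ideas above (a single cell versus a slice), the cube can be pictured as a lookup keyed by its three dimensions. The numbers below are invented for illustration, and `slice_cube` is a hypothetical helper, not part of any OLAP tool.

```python
# Hypothetical cube cells: (product, store, month) -> (quantity sold, total price).
# All numbers are made up for illustration.
cube = {
    ("M&Ms", "Ardmore, PA", "Jan"): (120, 150.00),
    ("M&Ms", "Ardmore, PA", "Feb"): (90, 112.50),
    ("Famous Amos", "Temple Main", "Jan"): (40, 80.00),
    ("Famous Amos", "Temple Main", "Feb"): (55, 110.00),
    ("Famous Amos", "Temple Main", "Mar"): (60, 120.00),
    ("Doritos", "Cherry Hill, NJ", "Mar"): (30, 45.00),
}


def slice_cube(cube, product=None, store=None, month=None):
    """Return the cells that match every dimension value that was given."""
    wanted = (product, store, month)
    return {
        key: facts
        for key, facts in cube.items()
        if all(w is None or w == k for w, k in zip(wanted, key))
    }


# A single cell: all M&Ms sold in Ardmore, PA in January.
single_cell = cube[("M&Ms", "Ardmore, PA", "Jan")]

# A slice: Famous Amos at Temple Main, across every month in the cube.
famous_amos_slice = slice_cube(cube, product="Famous Amos", store="Temple Main")
slice_quantity = sum(quantity for quantity, _ in famous_amos_slice.values())
```

Fixing two of the three dimensions and leaving the third open is what produces the row of highlighted cells.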

The Data Cube
[Figure: Product × Store × Time cube of quantity & total price, with one group of cells highlighted in orange and another in blue.]
What do the orange highlighted elements represent?
What do the blue highlighted elements represent?

The n-dimensional cube
Could you have a data mart with five dimensions? If so, give an example.
Then why does our cube example (and most others you will see) only have three?

Designing the Cube: The Star Schema
We can’t reasonably store the original data as a single table:
  Summarization would be too slow.
  A lot of redundancy.
So it is stored as a star schema.
Fact table
  Sales: Sales_ID, Product_ID, Store_ID, Time_ID, Quantity Sold, Total Price
Dimension tables
  Product: Product_ID, Product_Name, Product_Price, Product_Weight
  Store: Store_ID, Store_Address, Store_City, Store_State, Store_Type
  Time: Time_ID, Day, Month, Year
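
One way to picture the star schema is as table definitions. The sketch below uses Python's built-in sqlite3 module rather than the MySQL used in the course, and the column types and underscore spellings (Quantity_Sold, Total_Price) are assumptions; the slide gives only the column names.

```python
import sqlite3

# In-memory database, used here only to illustrate the shape of the schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, one row per product/store/time.
CREATE TABLE Product (
    Product_ID     INTEGER PRIMARY KEY,
    Product_Name   TEXT,
    Product_Price  REAL,
    Product_Weight REAL
);
CREATE TABLE Store (
    Store_ID      INTEGER PRIMARY KEY,
    Store_Address TEXT,
    Store_City    TEXT,
    Store_State   TEXT,
    Store_Type    TEXT
);
CREATE TABLE "Time" (
    Time_ID INTEGER PRIMARY KEY,
    Day     INTEGER,
    Month   INTEGER,
    Year    INTEGER
);
-- Fact table: one key per dimension plus the measured facts.
CREATE TABLE Sales (
    Sales_ID      INTEGER PRIMARY KEY,
    Product_ID    INTEGER REFERENCES Product(Product_ID),
    Store_ID      INTEGER REFERENCES Store(Store_ID),
    Time_ID       INTEGER REFERENCES "Time"(Time_ID),
    Quantity_Sold INTEGER,
    Total_Price   REAL
);
""")

tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
# Each foreign key on the fact table points at one dimension (one "point" of the star).
referenced_dimensions = {
    row[2] for row in conn.execute("PRAGMA foreign_key_list(Sales)")
}
```

The fact table sits in the middle and holds only keys and measures; everything descriptive lives in the dimension tables.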

Revisiting Usability: Why a Cube?
So you have your star schema, now what?
Non-IT folks probably won’t understand data normalization and table relations in MySQL.
  They won’t know how to do table JOINs.
  They probably cannot work with raw table data.
Two options to produce usable data (i.e., a cube):
  #1: Perform one big table JOIN from the star schema.
  #2: Calculate meaningful values and store them in a data cube.

Option 1: One Big JOIN
Columns of the joined table:
  Sales Fact: Sales ID, Qty. Sold, Total Price
  Product Dimension: Prod. ID, Prod. Name, Prod. Price, Prod. Weight
  Store Dimension: Store ID, Store Address, Store City, Store State, Store Type
  Time Dimension: Time ID, Day, Month, Year
Storing the entire join would generate many, many rows!
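
A minimal sketch of the one big JOIN, again using sqlite3 in place of MySQL. The tables are trimmed to a few columns and the rows are invented; the point is only that every sales row comes back carrying a full copy of its product, store, and time attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (Product_ID INTEGER PRIMARY KEY, Product_Name TEXT, Product_Price REAL);
CREATE TABLE Store (Store_ID INTEGER PRIMARY KEY, Store_City TEXT, Store_State TEXT);
CREATE TABLE "Time" (Time_ID INTEGER PRIMARY KEY, Day INTEGER, Month INTEGER, Year INTEGER);
CREATE TABLE Sales (
    Sales_ID INTEGER PRIMARY KEY, Product_ID INTEGER, Store_ID INTEGER,
    Time_ID INTEGER, Quantity_Sold INTEGER, Total_Price REAL
);
-- Invented sample rows.
INSERT INTO Product VALUES (1, 'M&Ms', 1.25), (2, 'Doritos', 1.50);
INSERT INTO Store VALUES (1, 'Ardmore', 'PA'), (2, 'Cherry Hill', 'NJ');
INSERT INTO "Time" VALUES (1, 3, 1, 2011), (2, 4, 1, 2011);
INSERT INTO Sales VALUES
    (1, 1, 1, 1, 2, 2.50),
    (2, 2, 1, 1, 1, 1.50),
    (3, 1, 2, 2, 4, 5.00);
""")

# One row per sale, widened with the attributes of all three dimensions.
big_join = conn.execute("""
    SELECT s.Sales_ID, s.Quantity_Sold, s.Total_Price,
           p.Product_Name, p.Product_Price,
           st.Store_City, st.Store_State,
           t.Day, t.Month, t.Year
    FROM Sales AS s
    JOIN Product AS p ON p.Product_ID = s.Product_ID
    JOIN Store AS st ON st.Store_ID = s.Store_ID
    JOIN "Time" AS t ON t.Time_ID = s.Time_ID
    ORDER BY s.Sales_ID
""").fetchall()
```

The product name and price are repeated on every row for that product, which is the redundancy the star schema was designed to avoid.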

It adds up fast… (see the Sales Person example)
1,000 products × 300 stores × 365 days × 100 daily product purchases = 10,950,000,000 records per year!
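
The arithmetic behind that total can be checked directly. The store, day, and daily purchase counts come from the slide; the product count did not survive in the transcript, so the sketch derives it from the stated total rather than assuming it.

```python
stores = 300
days = 365
daily_purchases_per_product = 100
stated_records_per_year = 10_950_000_000

# Work backwards from the stated total to the product count it implies.
rows_per_product = stores * days * daily_purchases_per_product
implied_products = stated_records_per_year // rows_per_product

# Forward check: the implied count reproduces the stated total exactly.
records_per_year = implied_products * rows_per_product
```

The stated total is consistent with 1,000 products.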

Option 2: Cube of Summary Stats
Summarize the data and store it in the cube.
Retrieve only the summary, not the raw data.
Much more efficient, but we are “locked in.”
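
A sketch of what summarizing into the cube could look like: raw sales rows are aggregated once into a summary table, and later questions read only that table. The table names RawSales and SalesCube and all the rows are hypothetical, and sqlite3 stands in for whatever database actually holds the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RawSales (
    Product TEXT, Store TEXT, Month TEXT,
    Quantity_Sold INTEGER, Total_Price REAL
);
-- Invented transaction-level rows.
INSERT INTO RawSales VALUES
    ('M&Ms',    'Ardmore, PA', 'Jan', 2, 2.50),
    ('M&Ms',    'Ardmore, PA', 'Jan', 4, 5.00),
    ('M&Ms',    'Ardmore, PA', 'Feb', 1, 1.25),
    ('Doritos', 'Temple Main', 'Jan', 3, 4.50);

-- The "cube": one pre-computed summary row per product/store/month.
CREATE TABLE SalesCube AS
    SELECT Product, Store, Month,
           SUM(Quantity_Sold) AS Quantity_Sold,
           SUM(Total_Price)   AS Total_Price
    FROM RawSales
    GROUP BY Product, Store, Month;
""")

# Analysis reads the summary, never the raw rows.
cell = conn.execute(
    "SELECT Quantity_Sold, Total_Price FROM SalesCube "
    "WHERE Product = ? AND Store = ? AND Month = ?",
    ("M&Ms", "Ardmore, PA", "Jan"),
).fetchone()
cube_rows = conn.execute("SELECT COUNT(*) FROM SalesCube").fetchone()[0]
```

Four raw rows collapse to three cube rows here. Being “locked in” means a question the GROUP BY did not anticipate, such as sales on a particular day, cannot be answered from SalesCube alone.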

Demo – Foodmart
A pre-created data cube that can be read in Excel. Summaries are already created.
Dimensions: Gender, Houseowner, Marital status, Media type, Product name, Sales region, Store name, Total children, Yearly income
Measured Facts: Cost, Sales

Updating the cube
Data marts are non-volatile (i.e., they can’t be changed).
  Logically: It’s a record of what has happened.
  Practically: Would require constant re-computation of the cube.
The cube is refreshed periodically from the transactional database:
  Overnight
  Daily
  Weekly

Designing the Star Schema
Ralph Kimball’s Four Step Process for Data Cube Design (Kimball et al., 2008):
  1. Choose the business process
  2. Decide on the level of granularity
  3. Identify the dimensions
  4. Identify the fact
From where do you get the data? Can be in house, but might come from elsewhere too…

Choose the business process
What your data cube is “about.”
Determined by the questions you want to answer about your organization.
  Question | Business Process
  Who is my best customer? | Sales
  What are my highest selling products? | Sales
  Which teachers have the best student performance? | Standardized testing
  Which supplier is offering us the best deals? | Purchasing
Note that a “business process” is not always about business.

Decide on the level of granularity
Level of detail for each event (row in the table).
Will determine the data in the dimensions.
Example: Who is my best customer?
  The “event” is a sales transaction.
  Choices for time: yearly, quarterly, monthly, daily
  Choices for store: store, city, state
How would you select the right granularity?
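
To make the granularity choice concrete, the sketch below rolls the same handful of invented sales events up at two different grains. The `roll_up` helper, the grain names, and the data are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales events: (date, store, city, state, quantity sold).
events = [
    (date(2011, 1, 3), "Store 1", "Ardmore", "PA", 2),
    (date(2011, 1, 3), "Store 1", "Ardmore", "PA", 1),
    (date(2011, 1, 17), "Store 2", "King of Prussia", "PA", 5),
    (date(2011, 2, 2), "Store 1", "Ardmore", "PA", 4),
    (date(2011, 2, 9), "Store 3", "Cherry Hill", "NJ", 3),
]

# Each time grain maps a date to the label used for the time dimension.
TIME_GRAINS = {
    "daily": lambda d: d.isoformat(),
    "monthly": lambda d: f"{d.year}-{d.month:02d}",
    "quarterly": lambda d: f"{d.year}-Q{(d.month - 1) // 3 + 1}",
    "yearly": lambda d: str(d.year),
}
# Each store grain picks which field of the event is the store dimension.
STORE_GRAINS = {"store": 1, "city": 2, "state": 3}


def roll_up(events, time_grain, store_grain):
    """Total quantity sold per (time label, store label) at the chosen grain."""
    totals = defaultdict(int)
    for event in events:
        time_label = TIME_GRAINS[time_grain](event[0])
        store_label = event[STORE_GRAINS[store_grain]]
        totals[(time_label, store_label)] += event[4]
    return dict(totals)


daily_by_store = roll_up(events, "daily", "store")
monthly_by_state = roll_up(events, "monthly", "state")
```

A finer grain keeps more rows and can answer more questions; a coarser grain is smaller but cannot be broken back down.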

Identify the dimensions
Determined by the business process.
Refined by the level of granularity.
The key elements of the process needed to answer the question.
Example: Sales transaction
  Our example schema defines a “sale” as taking place for a particular product, in a particular store, at a particular time.
Could this data mart tell you:
  The best selling product?
  The best customer?
Try it for the “student performance” example.

Identify the fact
The data associated with the business event.
Keys
  Unique identifier for each row.
  Unique identifiers from dimensions.
  Associates a combination of the dimensions to a unique business event.
  Example: Sales has Product_ID, Store_ID, and Time_ID.
Measured, numeric data
  Quantifiable information for each business event.
  Does not describe any particular dimension.
  Describes a particular combination of dimensional data.
  Example: Sales has quantity_sold and total_price.
Try it for the “student performance” example.

Data cube caveats
You have to choose your aggregations in advance, so choose wisely!
Consider a sales data cube with product, store, time, salesperson.
  If quantity_sold and total_price are the facts, you can’t figure out the average number of people working in a store.
  All people might not have sold all products and therefore wouldn’t be in the joined table.
Granularity is also an issue.
  Can’t track daily sales if “date” is monthly (pre-aggregated).
So why not include every single sale and do no aggregation beforehand?
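
The salesperson caveat can be illustrated with a small sketch. It assumes a hypothetical Salesperson dimension that records each person's store, which the slides do not define, and uses invented rows: three people work at store 1, but only two of them appear in the sales facts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical dimension: the Store_ID column here is an assumption.
CREATE TABLE Salesperson (Salesperson_ID INTEGER PRIMARY KEY, Name TEXT, Store_ID INTEGER);
CREATE TABLE Sales (
    Sales_ID INTEGER PRIMARY KEY, Product_ID INTEGER, Store_ID INTEGER,
    Time_ID INTEGER, Salesperson_ID INTEGER,
    Quantity_Sold INTEGER, Total_Price REAL
);
-- Three people work at store 1...
INSERT INTO Salesperson VALUES (1, 'Ana', 1), (2, 'Ben', 1), (3, 'Cy', 1);
-- ...but only two of them have any recorded sales.
INSERT INTO Sales VALUES
    (1, 10, 1, 1, 1, 2, 2.50),
    (2, 11, 1, 1, 2, 1, 1.50),
    (3, 10, 1, 2, 1, 3, 3.75);
""")

# Head count as it appears from the fact table alone.
seen_in_facts = conn.execute(
    "SELECT COUNT(DISTINCT Salesperson_ID) FROM Sales WHERE Store_ID = 1"
).fetchone()[0]

# Actual head count, which the sales facts cannot provide.
actual_staff = conn.execute(
    "SELECT COUNT(*) FROM Salesperson WHERE Store_ID = 1"
).fetchone()[0]
```

Counting from the facts gives 2 while the store employs 3, because a person with no sales never appears in a fact row.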

OGE Energy (Oklahoma) Example
Trying to reduce peak power demand.
A few strategies: Variable Pricing; Smart Meters; Customer Notifications (Text, Email, Twitter); Customer Rewards
Sources: OGE overview, http://energy.gov/

OGE Energy (Oklahoma) Example
Business question: “How can we reduce peak power demand?”
What are the relevant facts (performance measures)?
  Power consumption (kWh)
  Power outages
What are relevant dimensions?
  Time (hours or minutes), location, weather_emergency, price, smart_meter, communications, rebates

OGE Energy Star Schema
Fact table
  PowerDraw: PowerDrawID, Customer_id, Comms_id, Time_id, Location_id, Consumption, Outages
Dimension tables
  Customer: customer_ID, Attribute 1, Attribute 2, Attribute 3
  Location: _ID, Attribute 1, Attribute 2, Attribute 3
  Time: Time_ID, Day, Month, Year, Hour
  Communicat.: comms_ID, Num_texts, Num_tweets, Num_ s
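
As an illustration of how this schema might serve the peak demand question, the sketch below joins the PowerDraw fact to the Time dimension and totals consumption by hour. Only those two tables are built, the column types are assumptions, and every row is invented; it is not OGE's actual data or method.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "Time" (
    Time_ID INTEGER PRIMARY KEY, Day INTEGER, Month INTEGER, Year INTEGER, Hour INTEGER
);
CREATE TABLE PowerDraw (
    PowerDrawID INTEGER PRIMARY KEY, Customer_id INTEGER, Comms_id INTEGER,
    Time_id INTEGER, Location_id INTEGER, Consumption REAL, Outages INTEGER
);
-- Invented readings for one day at 14:00, 17:00, and 22:00.
INSERT INTO "Time" VALUES (1, 15, 7, 2011, 14), (2, 15, 7, 2011, 17), (3, 15, 7, 2011, 22);
INSERT INTO PowerDraw VALUES
    (1, 100, 1, 1, 1, 3.0, 0),
    (2, 101, 1, 1, 1, 2.5, 0),
    (3, 100, 1, 2, 1, 6.0, 1),
    (4, 101, 1, 2, 1, 5.5, 0),
    (5, 100, 1, 3, 1, 1.5, 0);
""")

# Total consumption and outages per hour, highest consumption first.
by_hour = conn.execute("""
    SELECT t.Hour,
           SUM(p.Consumption) AS total_consumption,
           SUM(p.Outages)     AS total_outages
    FROM PowerDraw AS p
    JOIN "Time" AS t ON t.Time_ID = p.Time_id
    GROUP BY t.Hour
    ORDER BY total_consumption DESC
""").fetchall()

peak_hour = by_hour[0][0]
```

Grouping by other dimensions (location, communications) in the same way would show whether notifications or pricing changes coincide with lower draw at the peak hour.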