Presentation is loading. Please wait.

Presentation is loading. Please wait.

DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What.

Similar presentations


Presentation on theme: "DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What."— Presentation transcript:

1 DIMENSIONAL MODELING MIS2502 Data Analytics

2 So we know… Relational databases are good for storing transactional data But bad for analytical data What we can do is design an analytical data store based on the operational data store That architecture gives us the advantages of both Relational database for operational use Analytical database for analysis (Online Analytical Processing)

3 Why have a separate ADS? Issue 1: Performance The structure is built to handle analysis You keep the load off the operational data store Issue 2: Usability We can structure the data in an intuitive way You keep the load off of your IT department

4 Some terminology Takes many forms Really is just a repository for data Data Warehouse More focused Specially designed for analysis Data Mart Organization of data as a “multidimensional matrix” Implementation of a Data Mart Data Cube

5 How they all relate The data in the operational database… …is put into a data warehouse… …which feeds the data mart… …and is analyzed as a cube. We’ll start here.

6 The Data Cube Core component of Online Analytical Processing and Multidimensional Data Analysis Made up of “facts” and “dimensions” quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price Product Store Time M&Ms Diet Coke Doritos Famous Amos Ardmore, PA Temple Main Cherry Hill, NJ King of Prussia, PA Jan. 2011 Feb. 2011 Mar. 2011 Quantity sold and total price are measured facts. Why isn’t product price a measured fact? Quantity sold and total price are measured facts. Why isn’t product price a measured fact?

7 The Data Cube quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price Product Store Time M&Ms Diet Coke Doritos Famous Amos Ardmore, PA Temple Main Cherry Hill, NJ King of Prussia, PA Jan. 2011 Feb. 2011 Mar. 2011 The highlighted element represents all the M&Ms sold in Ardmore, PA in January, 2011 A single summary record representing a business event (monthly sales).

8 The Data Cube quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price Product Store Time M&Ms Diet Coke Doritos Famous Amos Ardmore, PA Temple Main Cherry Hill, NJ King of Prussia, PA Jan. 2011 Feb. 2011 Mar. 2011 The highlighted elements represent Famous Amos cookies sold on Temple’s Main campus from January to March, 2011 This is called “slicing the data.”

9 The Data Cube quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price quantity & total price Product Store Time M&Ms Diet Coke Doritos Famous Amos Ardmore, PA Temple Main Cherry Hill, NJ King of Prussia, PA Jan. 2011 Feb. 2011 Mar. 2011 What do the orange highlighted elements represent? What do the blue highlighted elements represent?

10 The n-dimensional cube Could you have a data mart with five dimensions? If so, give an example Then why does our cube example (and most others you will see) only have three?

11 Designing the Cube: The Star Schema We can’t reasonably store the original data as a single table Summarization would be too slow A lot of redundancy So it is stored as a star schema Sales Sales_ID Product_ID Store_ID Time_ID Quantity Sold Total Price Product Product_ID Product_Name Product_Price Product_Weight Store Store_ID Store_Address Store_City Store_State Store_Type Time Time_ID Day Month Year Fact Dimension

12 A join to make the cube? Sales ID Qty. Sold Total Price Prod. ID Prod. Name Prod. Price Prod. Weight Store ID Store Address Store City Store State Store Type Time ID DayMonthYear 1000 1001 1002 Storing the entire join would generate many, many rows! Product Dimension Store Dimension Time Dimension Sales Fact

13 It adds up fast 1000 products300 stores365 days100 daily product purchases=10,950,000,000 records per year!

14 From Star Schema to Cubes Summarize the data and store it in the cubeRetrieve only the summary, not the raw data.Much more efficient, but we are “locked in”

15 Demo – Foodmart Dimensions Gender Houseowner Marital status Media type Product name Sales region Store name Total children Yearly income Measured Facts Cost Sales A pre-created data cube that can be read in Excel Summaries are already created

16 Updating the cube Data marts are non-volatile (i.e., they can’t be changed) Logically: It’s a record of what has happened Practically: Would require constant re-computation of the cube The cube is refreshed periodically from the transactional database Overnight Daily Weekly

17 Designing the Star Schema Kimball’s Four Step Process for Data Cube Design (Kimball et al., 2008) Choose the business process Decide on the level of granularity Identify the dimensions Identify the fact From where do you get the data?

18 Choose the business process What your data cube is “about” Determined by the questions you want to answer about your organization QuestionBusiness Process Who is my best customer?Sales What are my highest selling products?Sales Which teachers have the best student performance? Standardized testing Which supplier is offering us the best deals? Purchasing Note that a “business process” is not always about business.

19 Decide on the level of granularity Level of detail for each event (row in the table) Will determine the data in the dimensions Example: Who is my best customer? The “event” is a sales transaction Choices for time: yearly, quarterly, monthly, daily Choices for store: store, city, state How would you select the right granularity?

20 Identify the dimensions Determined by the business process Refined by the level of granularity The key elements of the process needed to answer to the question Example: Sales transaction Our example schema defines a “sale” as taking place for a particular product, in a particular store, at a particular time Could this data mart tell you The best selling product? The best customer? Try it for the “student performance” example.

21 Identify the fact The data associated with the business event Keys Unique identifier for each row Unique identifiers from dimensions Associates a combination of the dimensions to a unique business event Example: Sales has Product_ID, Store_ID, and Time_ID Measured, numeric data Quantifiable information for each business event Does not describe any particular dimension Describes a particular combination of dimensional data Example: Sales has quantity_sold and total_price. Try it for the “student performance” example.

22 Data cube caveats You have to choose your aggregations in advance So choose wisely! Consider a sales data cube with product, store, time, salesperson If quantity_sold and total_price are the facts, you can’t figure out the average number of people working in a store All people might not have sold all products and therefore wouldn’t be in the joined table Granularity is also an issue Can’t track daily sales if “date” is monthly (pre-aggregated) So why not include every single sale and do no aggregation beforehand?


Download ppt "DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What."

Similar presentations


Ads by Google