MIS 451 Building Business Intelligence Systems Logical Design (1)
2 Project Planning Requirements Analysis Physical Design Logical Design Data Staging Data Analysis (OLAP)
3 Introduction to Dimensional Modeling Dimensional Modeling is a DW logical design technique that seeks to present data in a standard framework that is intuitive for data access and allows for high performance data access. Intuitive: easy to write SQL High performance: high performance SQL
4 ER Model Dimensional Model (Star Schema) For detailed information, please refer handout 1.
5 Introduction to Dimensional Modeling Analytical Report: 2-dimension January sales report by customer state and product category Query: list sales in Jan. by customer state and product category?
6 Introduction to Dimensional Modeling Query based on ER Model: Select State, PCName, SUM(Price*Quantity) From OrderLine OL, Customer C, Product_Category PC, Product P, Order O Where OL.OID = O.OID and OL.PID = P.PID and O.CID = C.CID and to_char(O.OrderDate,’MON’) = ’JAN’ and P.PCID = PC.PCID Group by State, PCName Join: 5 tables Query based on Dimensional Model: Select State, PCName, SUM(Sales) From Sales S, Customer C, Product P, Time T Where S.Time_ Key = T.Time_Key and S.Product_ Key = P.Product_Key and S.Customer_Key = C.Customer_Key and T.Month= ’JAN’ Group by State, PCName Join: 4 tables
7 Fact and Dimension Fact table Dimension table
8 Fact and Dimension There are two types of tables in dimensional modeling: Fact table: attributes in fact tables are measurements for analysis or contents in reports. Dimension table: attributes in dimension tables are constraints for the measurements or headers in reports. Dimensions Facts
9 Facts and Dimensions CriteriaFact AttributesDimension Attributes PurposeMeasurements for analysisConstraints for the measurements Reporting useReport contentRow or column report headers Data typeMost facts are numeric and additive. There are semi-additive or no-additive facts. Textual, descriptive SizeLarger number of recordsSmaller number of records
10 Facts and Dimensions How to identify facts and dimensions? Requirements Analysis: Analytical requirements: Marketing managers want to know sales performance for different product category in different states? Information requirements: quantity of product sold, sales amount, product category, and customer states ER Model
11
12 F1: Calculation F: refers to special considerations for fact table or special type of fact table
13 F1: Calculation Normalization in RDB 1NF 2NF 3NF Non-volatile property of data warehouse enables DW design to resist normalization and improve query performance.
14 D1: Slowly changing dimension D: refers to special considerations for dimension table or special type of dimension table
15 D1: Slowly changing dimension Values of attributes in dimension tables may evolve over time. For example, customers moved from one city to another city. CID CNameStateCity 101JonArizonaTucson 102TomArizonaTucson 103MarkArizonaPhoenix Tom moved from Tucson to Phoenix Phoenix
16 D1: Slowly changing dimension There are three ways to handle slowly changing dimension. Method 1: Overwrite old values with new values CID CNameStateCity 101JonArizonaTucson 102TomArizonaTucson 103MarkArizonaPhoenix CID CNameStateCity 101JonArizonaTucson 102TomArizonaPhoenix 103MarkArizonaPhoenix
17 D1: Slowly changing dimension Drawbacks of method 1: Historical information is totally lost. We will never know that customer 102 lived in Tucson before. Moreover, when listing sales by city, all the sales of customer 102 will be counted as part of Phoenix sales, although 102 was in Tucson before.
18 D1: Slowly changing dimension Method 2: Add a new attribute to record current value of the changing attribute. CID CNameStateCity 101JonArizonaTucson 102TomArizonaTucson 103MarkArizonaPhoenix CIDCNameStateOriginal CityCurrent City 101JonArizonaTucson 102TomArizonaTucsonPhoenix 103MarkArizonaPhoenix
19 D1: Slowly changing dimension Drawbacks of method 2: Only partial Historical information (original & current) is kept. Considering that customer 102 moved from Tucson to Flagstaff then to Phoenix, the customer information of customer 102 only includes Tucson and Phoenix.
20 D1: Slowly changing dimension Method 3: Add a record whenever a dimension attribute changes. CID CNameStateCity 101JonArizonaTucson 102TomArizonaTucson 103MarkArizonaPhoenix
21 D1: Slowly changing dimension Method 3 keep all the information. However, Is there any problem?
22 D1: Slowly changing dimension Method 4: warehouse key + method 3 Warehouse key is a sequence of non-negative integers served as primary keys of tables in data warehouse. CID CNameStateCity 101JonArizonaTucson 102TomArizonaTucson 103MarkArizonaPhoenix Warehouse key
23 D1: Slowly changing dimension Why warehouse key is needed in data warehouse? Solve slowly changing dimension problem Compared with natural keys (i.e., primary keys of tables in RDB, such as CID of customer table), warehouse keys have high join performance.
24 D1: Slowly changing dimension Warehouse key Primary keys in dimensional tables are warehouse keys. Primary key in fact table is a collection of warehouse keys of all/part of its associated dimensions.
25 D1: Slowly changing dimension Notation: Primary key
26 D2: Time Dimension D: refers to special considerations for dimension table or special type of dimension table
27 D2: Time Dimension Data warehouse needs an explicit time dimension table instead of just a time attribute (e.g, ORDERDATE). Besides the time attribute, time dimension table includes the following additional attributes: Day_of_week (1-7); Day_number_in_month (1-31); Day_number_in_year (1-365) Week_number (1-52); month (1-12), Quarter (1-4) Holiday_flag (y/n) Fiscal_quarter, Fiscal_year
28 D2: Time Dimension Time dimension can: Save computation effort and improve query performance Complex queries regarding calendar calculation are hidden from end users of data warehouse.
29 D3: Snowflake D: refers to special considerations for dimension table or special type of dimension table
30 D3: Snowflake Snowflake structure
31 D3: Snowflake Snowflake structure should be avoided in data warehouse design Tradeoff of avoiding snowflake Advantage: improve query performance Disadvantage: require more storage space