ISQS 3358, Business Intelligence
Dimensional Modeling
Zhangxi Lin
Texas Tech University
Outline
- Principles of Dimensional Modeling
- The WPC example
- Illustrative Example: Adventure Works Cycles (AWC)
- Dimensional Modeling Process
PRINCIPLES OF DIMENSIONAL MODELING
E-R model
Dimensional Model
- Also called a star schema (a snowflake schema is an acceptable variant).
- The fact table sits in the middle, with the dimensions serving as the points of the star.
- A normalized fact table plus denormalized dimension tables.

Reference: database normalization
Edgar F. Codd, the inventor of the relational model, introduced the concept of normalization and what we now know as the First Normal Form (1NF) in 1970. Codd went on to define the Second Normal Form (2NF) and Third Normal Form (3NF) in 1971, and Codd and Raymond F. Boyce defined the Boyce-Codd Normal Form (BCNF) in 1974. Informally, a relational database table is often described as "normalized" if it is in the Third Normal Form. Most 3NF tables are free of insertion, update, and deletion anomalies.
Star Schema Model
Central fact table with denormalized dimensions:
  Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...
  Product Table: Product_id, Product_desc, ...
  Store Table: Store_id, District_id, ...
  Time Table: Day_id, Month_id, Year_id, ...
  Item Table: Item_id, Item_desc, ...

A star schema model can be depicted as a simple star: a central table contains fact data, and multiple tables radiate out from it, connected by primary and foreign keys. Unlike other database structures, a star schema has denormalized dimensions.
A star model:
- Is easy for users to understand because the structure is simple and straightforward
- Provides fast response to queries by reducing the number of physical joins required between fact and dimension tables
- Contains simple metadata
- Is supported by many front-end tools
- Is slow to build because of the level of denormalization
The star schema is emerging as the predominant model for data warehouses and data marts.
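The join pattern described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the course's implementation: the table names (sales_fact, product_dim, store_dim, time_dim) and every row are hypothetical, and the Item table is omitted for brevity.

```python
import sqlite3

# Hypothetical miniature star schema; all names and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE store_dim   (store_id   INTEGER PRIMARY KEY, district_id INTEGER);
CREATE TABLE time_dim    (day_id     INTEGER PRIMARY KEY, month_id INTEGER, year_id INTEGER);
CREATE TABLE sales_fact (
    product_id   INTEGER REFERENCES product_dim(product_id),
    store_id     INTEGER REFERENCES store_dim(store_id),
    day_id       INTEGER REFERENCES time_dim(day_id),
    sales_amount REAL,
    sales_units  INTEGER
);
INSERT INTO product_dim VALUES (1, 'Road bike'), (2, 'Helmet');
INSERT INTO store_dim   VALUES (10, 100), (11, 100);
INSERT INTO time_dim    VALUES (20240101, 1, 2024), (20240201, 2, 2024);
INSERT INTO sales_fact VALUES
    (1, 10, 20240101, 1200.0, 1),
    (2, 10, 20240101,   80.0, 2),
    (1, 11, 20240201, 1200.0, 1);
""")

# One join per dimension used in the question: sales by product and month.
rows = conn.execute("""
    SELECT p.product_desc, t.month_id, SUM(f.sales_amount) AS amount
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_id = f.product_id
    JOIN time_dim    AS t ON t.day_id     = f.day_id
    GROUP BY p.product_desc, t.month_id
    ORDER BY p.product_desc, t.month_id
""").fetchall()
print(rows)
```

Each dimension is reached from the fact table in a single join, which is the property the star layout is designed to provide.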
Snowflake Schema Model
  Sales Fact Table: Item_id, Store_id, Product_id, Week_id, Sales_amount, Sales_units
  Product Table: Product_id, Product_desc
  Store Table: Store_id, Store_desc, District_id
  District Table: District_id, District_desc
  Time Table: Week_id, Period_id, Year_id
  Item Table: Item_id, Item_desc, Dept_id
  Dept Table: Dept_id, Dept_desc, Mgr_id
  Mgr Table: Dept_id, Mgr_id, Mgr_name

According to Ralph Kimball, "a dimension is said to be snowflaked when the low cardinality fields in the dimension have been removed to separate tables and linked back into the original table with artificial keys." A snowflake model is closer to an entity relationship diagram than the classic star model because the dimension data is more normalized. Developing a snowflake model means building class hierarchies out of each dimension (normalizing the data).
A snowflake model:
- Can degrade query performance severely because of its greater number of table joins
- Provides a structure that is easier to change as requirements change
- Is quicker at loading data into its smaller normalized tables, compared to loading into a star schema's larger denormalized tables
- Allows using history tables for changing data, rather than level fields (indicators)
- Has a complex metadata structure that is harder for end-user tools to support
One of the major reasons why the star schema has become more predominant than the snowflake model is its query performance advantage. In a warehouse environment, the snowflake's quicker load performance matters much less than its slower query performance.
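The extra join cost of a snowflaked dimension can be seen in a small sqlite3 sketch. The Store and District tables follow the slide; the table names with a _dim suffix and all rows are hypothetical.

```python
import sqlite3

# Hypothetical snowflaked Store dimension: the district description lives in its own table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE district_dim (district_id INTEGER PRIMARY KEY, district_desc TEXT);
CREATE TABLE store_dim (
    store_id    INTEGER PRIMARY KEY,
    store_desc  TEXT,
    district_id INTEGER REFERENCES district_dim(district_id)
);
CREATE TABLE sales_fact (
    store_id     INTEGER REFERENCES store_dim(store_id),
    sales_amount REAL
);
INSERT INTO district_dim VALUES (100, 'North'), (200, 'South');
INSERT INTO store_dim VALUES (10, 'Store A', 100), (11, 'Store B', 100), (12, 'Store C', 200);
INSERT INTO sales_fact VALUES (10, 500.0), (11, 250.0), (12, 100.0);
""")

# Reaching the district description takes two joins (fact -> store -> district).
# In a star schema, district_desc would sit in store_dim and need only one.
by_district = conn.execute("""
    SELECT d.district_desc, SUM(f.sales_amount)
    FROM sales_fact AS f
    JOIN store_dim    AS s ON s.store_id    = f.store_id
    JOIN district_dim AS d ON d.district_id = s.district_id
    GROUP BY d.district_desc
    ORDER BY d.district_desc
""").fetchall()
print(by_district)
```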
Snowflake Schema Model (continued)
- Direct use by some tools
- More flexible to change
- Provides for speedier data loading
- Can become large and unmanageable
- Degrades query performance
- More complex metadata
Example hierarchy: Country > State > County > City

Besides the star and snowflake schemas, there are other models that can be considered.
Constellation
A constellation model (also called a galaxy model) comprises a series of star models. Constellations are a useful design feature if you have a primary fact table and summary tables of a different dimensionality. A constellation can simplify design by allowing you to share dimensions among many fact tables.
Third Normal Form Warehouse
Some data warehouses consist of a set of relational tables that have been normalized to third normal form (3NF). Their data can be accessed directly with SQL. They may store data more efficiently, at the price of slower query performance due to extensive table joins. Some large companies build a 3NF central data warehouse that feeds dependent star data marts for specific lines of business.
Facts
Definition
- Measure: a numeric quantity expressing some aspect of the organization's performance.
- Aggregate: formed by combining values from a given dimension or set of dimensions to create a single value.
Facts are measurements associated with a specific business process.
Most facts are additive (calculative); others are semi-additive, non-additive, or descriptive (e.g. a factless fact table).
Many facts can be derived from other facts, so non-additive facts can often be avoided by calculating them from additive facts.
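A small sketch of deriving a non-additive fact from additive ones, assuming a hypothetical margin percentage and made-up order lines. Storing the additive components (sales and cost) and computing the ratio after aggregation avoids the error of averaging row-level ratios.

```python
# Hypothetical order lines: (sales_amount, cost). Both columns are additive facts.
lines = [(100.0, 60.0), (200.0, 180.0), (700.0, 350.0)]

total_sales = sum(sales for sales, _ in lines)
total_cost = sum(cost for _, cost in lines)

# Derived, non-additive fact: computed from the additive totals at query time.
margin_pct = (total_sales - total_cost) / total_sales

# Averaging the per-row ratios instead gives a different, misleading figure,
# because it ignores how much each row contributes to total sales.
avg_of_row_margins = sum((s - c) / s for s, c in lines) / len(lines)

print(round(margin_pct, 4), round(avg_of_row_margins, 4))
```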
Fact Table Characteristics
Fact tables:
- Contain numerical metrics of the business
- Can hold large volumes of data
- Can grow quickly
- Can contain base, derived, and summarized data
- Are typically additive
- Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables
Example: Sales Fact Table (Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...)

Facts are the numerical measures of the business. The fact table is the largest table in the star schema and is composed of large volumes of data, usually making up 90% or more of the total database size. It can be viewed in two parts:
- Multipart primary key
- Business metrics, which are numeric and (usually) additive
A measure is often required in the warehouse even though it is not fully additive. Such measures are known as semi-additive facts. Inventory and room temperature are two examples. It does not make sense to add these measurements over time, but they can be aggregated with an SQL function other than SUM, for example AVG.
Although a star schema typically contains one fact table, other DSS schemas can contain multiple fact tables.
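The inventory case can be sketched as follows, with hypothetical daily snapshot rows. Units on hand add up across stores within one day, but across days the sketch averages the daily totals rather than summing them.

```python
from collections import defaultdict

# Hypothetical inventory snapshot rows: (day, store, units_on_hand).
snapshots = [
    ("Mon", "A", 10), ("Mon", "B", 20),
    ("Tue", "A", 12), ("Tue", "B", 18),
    ("Wed", "A", 8),  ("Wed", "B", 22),
]

# Adding across the store dimension within a single day is meaningful.
per_day = defaultdict(int)
for day, _store, units in snapshots:
    per_day[day] += units

# Across time the measure is not additive: average the daily totals instead.
avg_on_hand = sum(per_day.values()) / len(per_day)

print(dict(per_day), avg_on_hand)
```

Summing the three daily totals would report 90 units, although at no point were more than 30 units in stock.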
The Three Fact Table Types
Transaction fact table
- The most basic and fundamental type
- "One row per line in a transaction", e.g. every line on a receipt
- Holds data at the most detailed level
- Usually has a great number of dimensions associated with it
Periodic snapshot fact table
- Takes a "picture of the moment"
- Shows cumulative performance over specific time intervals
- Depends on the transaction fact table
- Valuable for combining data across several business processes in the value chain
Accumulating snapshot fact table
- Used to show the activity of a process that has a well-defined beginning and end
- Constantly updated over time
Types of facts

Week  Trans#  Change  OldBal  NewBal
1     A1-1      100    1000    1100
1     A1-2      -50    1100    1050
1     A1-3      200    1050    1250
2     A2-1     -120    1250    1130
2     A2-2      200    1130    1330
3     A3-1     -300    1330    1030
4     A4-1      -20    1030    1010
4     A4-2      100    1010    1110
4     A4-3      250    1110    1360
4     A4-4     -220    1360    1140

Transaction fact: each row of the table, i.e. one transaction with its Change.
Periodic snapshot fact: the state at the end of each period, such as the closing balance of each week, together with per-period summaries such as average balance, number of transactions, average transaction amount, and the total amount traded in the period.
Accumulating snapshot fact: a single row per process that has a well-defined beginning and end, updated as the process progresses.
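Following the definitions in the fact table types slide, a periodic snapshot can be derived from transaction-grain rows. The sketch below uses the ledger's Trans# and Change values with an opening balance of 1000. Two assumptions are made: the week is read from the Trans# prefix ("A<week>-<sequence>"), and the Change of A2-2 and A4-2 is inferred from the surrounding balances.

```python
# Transaction-grain rows: (trans_id, change).
# Assumption: the Change for A2-2 (200) and A4-2 (100) is inferred from the balances.
transactions = [
    ("A1-1", 100), ("A1-2", -50), ("A1-3", 200),
    ("A2-1", -120), ("A2-2", 200),
    ("A3-1", -300),
    ("A4-1", -20), ("A4-2", 100), ("A4-3", 250), ("A4-4", -220),
]
OPENING_BALANCE = 1000


def weekly_snapshots(rows, opening):
    """Roll transaction rows up into one periodic-snapshot row per week."""
    balance = opening
    result = {}
    for trans_id, change in rows:
        # Assumption: the id has the form "A<week>-<sequence>".
        week = int(trans_id[1:].split("-")[0])
        balance += change
        snap = result.setdefault(week, {"n_trans": 0, "net_change": 0})
        snap["n_trans"] += 1
        snap["net_change"] += change
        snap["closing_balance"] = balance  # balance after the week's last transaction
    return result


snapshots = weekly_snapshots(transactions, OPENING_BALANCE)
for week, snap in sorted(snapshots.items()):
    print(week, snap)
```

The closing balance is semi-additive: it can be compared week to week, but adding the four closing balances together has no business meaning.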
Dimensions
- Definition: a categorization used to spread out an aggregate measure to reveal its constituent parts.
- The foundation of the dimensional model, describing the objects of the business.
- Dimensions are the nouns of the DW/BI system; business processes (facts) are the verbs of the business.
- Dimension tables link to the business processes. A dimension shared across business processes is called a conformed dimension.
- Analysis involving data from more than one business process is called drill-across.
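Drill-across can be sketched in a few lines: each business process is aggregated separately, and the results are then aligned on the key of the shared (conformed) dimension. The two fact lists and the product keys below are hypothetical.

```python
# Hypothetical facts from two business processes that share a conformed Product dimension.
sales_fact = [("bike", 5), ("helmet", 12), ("bike", 3)]   # (product_key, units_sold)
returns_fact = [("helmet", 2), ("bike", 1)]               # (product_key, units_returned)


def total_by_product(rows):
    """Aggregate one fact table to the grain of the shared dimension."""
    totals = {}
    for product, qty in rows:
        totals[product] = totals.get(product, 0) + qty
    return totals


sold = total_by_product(sales_fact)
returned = total_by_product(returns_fact)

# Drill-across: align the two aggregated result sets on the conformed key.
drill_across = {
    product: {"sold": sold.get(product, 0), "returned": returned.get(product, 0)}
    for product in sorted(set(sold) | set(returned))
}
print(drill_across)
```

This only works because both processes identify products the same way, which is what conforming the dimension guarantees.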
Attributes
- An attribute is an additional piece of information pertaining to a dimension member that is neither the unique identifier nor the description of the member.
- Attributes can be used to describe dimension members more fully.
Dimension Table Characteristics
Dimension tables:
- Contain textual information that represents the attributes of the business
- Contain relatively static data
- Are joined to a fact table through a foreign key reference

Dimensions are the textual descriptions of the business. Dimension tables are typically smaller than fact tables, and their data changes much less frequently. Dimension tables give perspective on the whys and hows of the business and its individual transactions. Although dimensions generally contain relatively static data, customer dimensions are updated more frequently.
Dimensions Are Essential for Analysis
The key to a powerful dimensional model lies in the richness of the dimension attributes, because they determine how facts can be analyzed. Dimensions can be considered the entry point into "fact space." Always name attributes in the users' vocabulary. That way, the dimension documents itself and its expressive power is apparent.
Star Dimensional Model Characteristics
- The model is easy for users to understand.
- Each foreign key in the fact table represents a dimension.
- Non-key columns in the fact table are values.
- Facts are usually highly normalized.
- Dimensions are completely denormalized.
- Fast response to queries is provided.
- Performance is improved by reducing table joins.
- End users can express complex queries.
- Support is provided by many front-end tools.

Each foreign key column in the fact table represents a dimension. The non-key columns in the fact table are values that can be aggregated. Fact tables do not contain character values; these belong with the dimensions.
The star model structure is similar to how users understand the information. The model provides better performance for analytical queries by reducing the number of joins. It allows end users to express complex queries, because the data is arranged in a way that is easy to understand and the relationships between entities are clear. It restricts the numerical measurements of the business to the fact table.
Note: The definitions of star and snowflake models vary among practitioners. Here the assumption is that the star model contains a fact table with one level of related dimensions, for example a sales fact and a product dimension. The snowflake, on the other hand, has more than one level of dimension, that is, a hierarchy such as Sales Fact, Product Dimension, and Product Group.
The Time Dimension
- Time is critical to the data warehouse.
- A consistent representation of time is required for extensibility.
- Where should the element of time be stored: in a time dimension or in the sales fact?

Because online transaction data, typically the source data for the warehouse, often lacks an explicit time element, you apply an element of time in the extraction, transformation, and transportation process. For example, you might assign a week identifier to all the airline tickets that were sold within that week. The transaction may not have a time or date stamp on it, but you know the date of the sale from when the transaction file was generated.
Storing the Time Dimension
Typically there is a time dimension table in the data warehouse, although time elements may be stored in the fact table. Before deciding where to store time, consider the following:
- Almost every data warehouse has a time dimension.
- Organizations use a variety of time periods for data analysis.
- A row whose key is an SQL date may be populated with additional time qualifiers needed for business analysis, such as workday, fiscal period, and special events.
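One way to populate a time dimension is to generate one row per calendar day and attach the qualifiers that analysis needs. The sketch below is an assumption about layout, not the course's design: the column names (date_key, iso_week, is_workday) are hypothetical, and "workday" is simplified to Monday through Friday with no holiday calendar.

```python
from datetime import date, timedelta


def build_time_dimension(start, end):
    """Return one dimension row (a dict) per day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),   # hypothetical integer key, e.g. 20240301
            "date": day.isoformat(),
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,       # calendar quarter
            "month": day.month,
            "iso_week": day.isocalendar()[1],
            "is_workday": day.weekday() < 5,           # simplification: Mon-Fri, holidays ignored
        })
        day += timedelta(days=1)
    return rows


time_dim = build_time_dimension(date(2024, 3, 1), date(2024, 3, 7))
for row in time_dim[:2]:
    print(row)
```

Fiscal periods and special events could be added as further columns in the same way, so that queries filter on them without date arithmetic.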
Hierarchies
- Meaningful, standard ways to group the data within a dimension
- Variable-depth hierarchies
- Frequently changing hierarchies
Examples of hierarchy in a dimension
- Address: street, city, state, country
- Organization: section, division, branch, region
- Time: year, quarter, month, date
Data Cube
Data cubes are multidimensional extensions of 2-D tables, just as in geometry a cube is a three-dimensional extension of a square. The word cube brings to mind a 3-D object, and we can think of a 3-D data cube as a set of similarly structured 2-D tables stacked on top of one another.
Data cubes aren't restricted to three dimensions. Most OLAP systems can build data cubes with many more dimensions; some allow up to 64.
In practice, we often construct data cubes with many dimensions, but we tend to look at just three at a time. What makes data cubes so valuable is that we can index the cube on one or more of its dimensions.
Data Cube (figure: a cube with axes Region, Product, and Time)
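A data cube with the three axes shown (Region, Product, Time) can be sketched as a dictionary keyed by coordinate tuples. The cell values and the helper names slice_cube and roll_up are hypothetical; a real OLAP engine stores and indexes cubes very differently.

```python
from collections import defaultdict

DIMS = ("region", "product", "time")

# Hypothetical cube cells: (region, product, quarter) -> sales.
cube = {
    ("East", "Bike", "Q1"): 100,
    ("East", "Bike", "Q2"): 120,
    ("East", "Helmet", "Q1"): 30,
    ("West", "Bike", "Q1"): 80,
    ("West", "Helmet", "Q2"): 25,
}


def slice_cube(cells, **fixed):
    """Keep only the cells whose coordinates match the fixed dimension values."""
    wanted = {DIMS.index(name): value for name, value in fixed.items()}
    return {
        key: value for key, value in cells.items()
        if all(key[i] == v for i, v in wanted.items())
    }


def roll_up(cells, keep):
    """Sum out every dimension that is not named in keep."""
    positions = [DIMS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[i] for i in positions)] += value
    return dict(totals)


q1 = slice_cube(cube, time="Q1")              # one 2-D "sheet" of the cube
by_region = roll_up(cube, keep=("region",))   # collapse product and time
print(q1)
print(by_region)
```

Fixing one coordinate yields one of the stacked 2-D tables described above, and rolling up collapses the dimensions that are not of interest.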
OLAP system
OLAP allows users to retrieve information from data quickly for analysis purposes.
Features
- Multidimensional database
- Easily understood
Videos
- What is OLAP? (5'04")
- SQL OLAP Tutorial - Data Warehouse Schema Design (9'45")
Wedgewood Pacific Corporation
Data Mart Planning
Fact table: collections of measures
  Measures: MaxHours, HoursWorked
Dimension tables
  Time: StartDate, EndDate
  Project
  Employee
  Department
Data Mart Design
Fact Table
  PROJ_ASSIGN (ProjectID, EmployeeNumber, HoursWorked, StartDate, EndDate)
Dimension Tables
  TIME(Year-Quarter-Month-Date)
  PROJECT(ProjectID, ProjectName, Department, MaxHours)
  DEPARTMENT(DepartmentName, BudgetCode, OfficeNumber, Phone)
  EMPLOYEE(EmployeeNumber, FirstName, LastName, Department, Phone, Email)
Dimensional Model
  PROJ_ASSIGN (fact): EmployeeNumber, ProjectID, HoursWorked, RequiredByEmp, StartDate, EndDate
  DIM_PROJECT: ProjectID, ProjectName, Department, MaxHours, StartDate, EndDate, ...
  DIM_DEPARTMENT: DepartmentName, BudgetCode, OfficeNumber, Phone
  DIM_EMPLOYEE: EmployeeNumber, FirstName, LastName, Department, Phone, Email
  DIM_TIME: Year, Quarter, Month, Week, Date

Question: How can the data mart tables be created conveniently?
Answer: By using SQL Server SSIS functions for ETL.
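The slide recommends SSIS for building the data mart. As a portable stand-in, the sketch below creates the same tables with Python's sqlite3 module. The column types, key choices, and sample rows are assumptions, and the fact table follows the column list on the Data Mart Design slide (RequiredByEmp is left out).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DIM_DEPARTMENT (
    DepartmentName TEXT PRIMARY KEY, BudgetCode TEXT, OfficeNumber TEXT, Phone TEXT
);
CREATE TABLE DIM_PROJECT (
    ProjectID INTEGER PRIMARY KEY, ProjectName TEXT,
    Department TEXT REFERENCES DIM_DEPARTMENT(DepartmentName),
    MaxHours REAL, StartDate TEXT, EndDate TEXT
);
CREATE TABLE DIM_EMPLOYEE (
    EmployeeNumber INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT,
    Department TEXT REFERENCES DIM_DEPARTMENT(DepartmentName),
    Phone TEXT, Email TEXT
);
CREATE TABLE DIM_TIME (
    "Date" TEXT PRIMARY KEY, Year INTEGER, Quarter INTEGER, Month INTEGER, Week INTEGER
);
CREATE TABLE PROJ_ASSIGN (
    EmployeeNumber INTEGER REFERENCES DIM_EMPLOYEE(EmployeeNumber),
    ProjectID      INTEGER REFERENCES DIM_PROJECT(ProjectID),
    HoursWorked    REAL,
    StartDate      TEXT REFERENCES DIM_TIME("Date"),
    EndDate        TEXT REFERENCES DIM_TIME("Date"),
    PRIMARY KEY (EmployeeNumber, ProjectID)
);

-- Hypothetical sample rows, for illustration only.
INSERT INTO DIM_DEPARTMENT (DepartmentName) VALUES ('Finance'), ('Marketing');
INSERT INTO DIM_PROJECT (ProjectID, ProjectName, Department, MaxHours) VALUES
    (1000, 'Project A', 'Marketing', 120.0),
    (1100, 'Project B', 'Finance', 145.0);
INSERT INTO DIM_EMPLOYEE (EmployeeNumber, FirstName, LastName, Department) VALUES
    (1, 'Ann', 'Lee', 'Marketing'),
    (2, 'Bo', 'Kim', 'Finance');
INSERT INTO PROJ_ASSIGN (EmployeeNumber, ProjectID, HoursWorked) VALUES
    (1, 1000, 30.0), (2, 1000, 75.0), (2, 1100, 55.0);
""")

# Hours worked, analyzed by the department that owns each project.
hours_by_department = conn.execute("""
    SELECT p.Department, SUM(a.HoursWorked)
    FROM PROJ_ASSIGN AS a
    JOIN DIM_PROJECT AS p ON p.ProjectID = a.ProjectID
    GROUP BY p.Department
    ORDER BY p.Department
""").fetchall()
print(hours_by_department)
```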
Data Mart Design – Scheme 2
Fact Table
  PROJECT (ProjectID, ProjectName, Department, MaxHours, StartDate, EndDate)
Dimension Tables
  TIME(Year-Quarter-Month-Date)
  DEPARTMENT(DepartmentName, BudgetCode, OfficeNumber, Phone)
This is a simplified version of the previous model.
ILLUSTRATIVE EXAMPLE : ADVENTURE WORKS CYCLES (AWC)
Adventure Works Cycles (AWC)
- A fictitious multinational manufacturer and seller of bicycles and accessories
- Based in Bothell, Washington, USA, with regional sales offices in several countries
- http://www.msftdwtoolkit.com/
Basic Business Information
- Product Orders by Category
- Product Orders by Country/Region
- Product Orders by Sales Channel
- Customers by Sales Channel Snapshot
Business Processes
- Purchase Orders
- Distribution Center Deliveries
- Distribution Center Inventory
- Store Deliveries
- Store Inventory
- Store Sales
Analytic Themes
See the Excel file \\TechShare\coba\d\isqs3358\Repository\AWC\AW_Analytic_Themes_List.xls
Video: SQL Server 2008 R2 – Data Warehousing Scaling and Performance (41'28")
AWC's Bus Matrix
Dimensions (columns): Date, Product, Employee, Customer (Reseller), Customer (Internet), Sales Territory, Currency, Channel, Promotion, Call Reason, Facility
Business processes (rows): Sales Forecasting, Orders, Call Tracking, Returns
In the matrix, an X marks each dimension that a business process uses.
Prioritization Grid (figure)
Vertical axis: Business Value / Impact (Low to High)
Horizontal axis: Feasibility (Low to High)
Items plotted: Customer Profitability, Orders, Product Forecast, Call Tracking, Exchange Rates, Returns, Manufacturing Costs
Unified Dimensional Model (UDM)
- A SQL Server 2008 technology
- A UDM is a structure that sits on top of a data mart and looks exactly like an OLAP system to the end user.
Advantages
- No need for a data mart: a UDM can be built over one or more OLTP systems
- Can mix data mart and OLTP system data
- Can include data from other vendors' databases and XML-formatted data
- Allows OLAP cubes to be built directly on top of transactional data
- Low latency
- Ease of creation and maintenance
Features
- Data sources
- Data views
- Proactive caching for preprocessed aggregates
Dimensional Modeling Process
Dimensional Modeling Process
- High-level dimensional model design
  - Choosing the business process in accordance with the analytic theme
  - Declaring the grain
  - Choosing dimensions
  - Identifying the facts
- Detailed dimensional model development
- Dimensional model review and validation
  - IS
  - Core users
  - Business community
- Final design iteration
Identifying Measures and Dimensions
- Measures: the attribute varies continuously, e.g. Balance, Units Sold, Cost, Sales.
- Dimensions: the attribute is perceived as constant or discrete, e.g. Product, Location, Time, Size.

Measures
A measure (or fact) contains a numeric value that measures an aspect of the business. Typical examples are gross sales dollars, total cost, profit, margin dollars, or quantity sold. A measure can be additive or partially additive across dimensions.
Dimensions
A dimension is an attribute by which measures can be characterized or analyzed. Dimensions bring meaning to raw data. Typical examples are customer name, date of order, or product brand.
Ultimately, the business requirements document should contain a list of the business measures and a detailed list of all dimensions, down to the lowest level of detail for each dimension. The example on the slide is for a retail customer sales process.
Using a Business Process Matrix
Sample business process matrix
  Business processes (columns): Sales, Returns, Inventory
  Business dimensions (rows): Customer, Date, Product, Channel, Promotion

A useful tool for understanding and quantifying business processes is the business process matrix (also called the process/dimension matrix). This matrix establishes a blueprint for the data warehouse database design to ensure that the design is extensible over time.
The business process matrix aids the strategic analysis task in two ways:
- It helps identify the high-level analytical information required to satisfy the analytical needs of each business process, and serves as a cross-check that you have all of the required business dimensions for each business process.
- It helps identify common business dimensions shared by different business processes. Business dimensions that are shared by more than one business process should be modeled with particular rigor, so that the analytical requirements of all processes that depend on them are supported. This is true even if one or more of the potential business processes are not selected for the first increment of the warehouse. Model the shared business dimensions to support all processes, so that later increments of the warehouse will not require a redesign of these crucial dimensions.
The sample matrix on the slide places business processes across the top and dimensions down the leftmost column.
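The second use of the matrix, finding dimensions shared by several processes, can be sketched directly. The process and dimension names come from the slide, but which cells are marked is an assumption made for illustration.

```python
# Hypothetical X marks: which dimensions each business process uses.
matrix = {
    "Sales":     {"Customer", "Date", "Product", "Channel", "Promotion"},
    "Returns":   {"Customer", "Date", "Product"},
    "Inventory": {"Date", "Product"},
}

# Invert the matrix: for each dimension, list the processes that depend on it.
usage = {}
for process, dimensions in matrix.items():
    for dimension in dimensions:
        usage.setdefault(dimension, []).append(process)

# Dimensions used by more than one process should be modeled with particular rigor.
shared = sorted(d for d, processes in usage.items() if len(processes) > 1)
print(shared)
```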
Determining Granularity
YEAR? QUARTER? MONTH? WEEK? DAY?

When gathering more specific information about measures and analytic parameters (dimensions), it is also important to understand the level of detail required for analysis and business decisions. Granularity is the level of summarization (or detail) that your warehouse maintains. The greater the level of detail, the finer the granularity. Grain is the lowest level of detail retained in the warehouse, such as the transaction level. Such data is highly detailed and can be summarized to any level that users require.
During your interviews, discern the level of detail that users need for near-term analysis. After that is determined, identify whether a finer grain is available in the source data. If so, design for at least one grain finer, and perhaps even for the lowest level of grain. Remember that you can always aggregate upward, but you cannot decompose an aggregate below the level of the data stored in the warehouse.
The level of granularity for each dimension determines the grain for the atomic level of the warehouse, which in turn is used for rollups.
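The asymmetry between aggregating upward and decomposing downward can be shown with a short sketch. The day-grain rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical day-grain facts: (ISO date, sales_amount).
daily = [("2024-01-03", 100.0), ("2024-01-17", 50.0), ("2024-02-02", 70.0)]


def roll_up_to_month(rows):
    """Aggregate day-grain rows to month grain using the 'YYYY-MM' prefix."""
    totals = defaultdict(float)
    for iso_date, amount in rows:
        totals[iso_date[:7]] += amount
    return dict(totals)


monthly = roll_up_to_month(daily)
print(monthly)
```

If only the monthly totals had been stored, nothing in them would say that the January figure came from the 3rd and the 17th, so the daily rows could not be recovered.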
Identifying Business Rules
Location
  Geographic proximity: 0 - 1 miles, 1 - 5 miles, > 5 miles
Product
  Type: PC, Server
  Monitor: 15 inch, 17 inch, 19 inch, None
  Status: New, Rebuilt, Custom
Time
  Month > Quarter > Year
Store
  Store > District > Region

Business model elements should also be documented with agreed-upon business rules and definitions. For example, the wholesale computer sales process might include the following business rules:
- All product items are grouped by status.
- March, April, and May make up the first quarter in the fiscal year.
- A store is in one and only one district.
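Documented business rules can be turned into small, testable functions. The sketch below encodes two of the example rules; the function names and the sample store assignments are hypothetical.

```python
def fiscal_quarter(month, first_month=3):
    """Fiscal quarter (1 to 4) of a calendar month when the fiscal year starts in March."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    months_into_fiscal_year = (month - first_month) % 12
    return months_into_fiscal_year // 3 + 1


def stores_violating_district_rule(assignments):
    """Return stores that are not in exactly one district."""
    districts_by_store = {}
    for store, district in assignments:
        districts_by_store.setdefault(store, set()).add(district)
    return sorted(s for s, d in districts_by_store.items() if len(d) != 1)


# Rule: March, April, and May make up the first fiscal quarter.
print([fiscal_quarter(m) for m in (3, 4, 5, 6, 2)])

# Rule: a store is in one and only one district (hypothetical sample data).
print(stores_violating_district_rule([("S1", "D1"), ("S2", "D1"), ("S2", "D2")]))
```

Checks like the second one are typically run during ETL, before rows reach the dimension tables.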
Steps in designing a fact table
- Identify a business process for analysis (like sales).
- Identify measures or facts (sales dollars) by asking questions like "What number of XX are relevant for the business process?", replacing XX with various options that make sense within the context of the business.
- Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension) by asking questions that make sense within the context of the business, like "Analyze by XX", where XX is replaced with the subject to test.
- List the columns that describe each dimension (region name, branch name, business unit name).
- Determine the lowest level (granularity) of summary in a fact table (e.g. sales dollars).
An alternative approach is the four-step design process described by Kimball: select the business process, declare the grain, choose the dimensions, and identify the facts.