ISQS 3358, Business Intelligence Dimensional Modeling


1 ISQS 3358, Business Intelligence Dimensional Modeling
Zhangxi Lin, Texas Tech University

2 Outline
Data Warehousing Approaches
Dimensional Modeling
Data Warehousing with Microsoft SQL Server 2005
Case: Adventure Works Cycles (AWC)
Data Warehouse Design Phases

3 Data Warehousing Approaches

4 Data Warehouse Development Approaches
Inmon model: EDW approach
Kimball model: data mart approach
Which model is better? There is no one-size-fits-all strategy to data warehousing; one alternative is the hosted warehouse.

5 General Data Warehouse Development Approaches
“Big bang” approach
Incremental approach: top-down incremental approach, or bottom-up incremental approach
The most challenging aspect of data warehousing lies not in its technical difficulty, but in choosing the approach best suited to your company’s structure and culture, and in dealing with the organizational and political issues that will inevitably arise during implementation.

6 “Big Bang” Approach
Analyze enterprise requirements
Build enterprise data warehouse
Report in subsets or store in data marts
Historically, IT departments attempted to deliver enterprise-wide data warehouse implementations as a single project. Data warehouse development is a huge task, and it is a mistake to assume that the solution can be built all at once. The time required to develop the warehouse often means that user requirements and technologies change before the project is completed. In this approach, you perform the following:
Analyze the entire information requirement for the organization
Build the enterprise data warehouse to support these requirements
Build access, as required, either directly or by subsetting to data marts

7 Incremental Approach to Warehouse Development
Multiple iterations
Shorter implementations
Validation of each phase
(Diagram: each increment iterates through strategy, definition, analysis, design, and build, then into iterative production.)
The incremental approach manages the growth of the data warehouse by developing incremental solutions that comply with the full-scale data warehouse architecture. Rather than building an entire enterprise-wide data warehouse as the first deliverable, start with just one or two subject areas, implement them as scalable data marts, and roll them out to your end users. Then, after observing how users actually use the warehouse, add the next subject area or the next increment of functionality. This is an iterative process, and it is this iteration that keeps the data warehouse in line with the needs of the organization.
Benefits:
Delivers a strategic data warehouse solution through incremental development efforts
Provides an extensible, scalable architecture
Supports the information needs of the enterprise
Quickly provides business benefit and ensures a much earlier return on investment
Allows the data warehouse to be built one subject or application area at a time
Allows the construction of an integrated data mart environment

8 Top-Down Approach
Analyze requirements at the enterprise level
Develop conceptual information model
Identify and prioritize subject areas
Complete a model of the selected subject area
Map to available data
Perform a source system analysis
Implement base technical architecture
Establish metadata, extraction, and load processes for the initial subject area
Create and populate the initial subject area data mart within the overall warehouse framework
Advantages: Provides a relatively quick implementation and payback; typically, the scoping, definition study, and initial implementation are scaled down so that they can be completed in six to seven months. Offers significantly lower risk because it avoids being as analysis-heavy as the “big bang” approach. Emphasizes high-level business needs. Achieves synergy among subject areas: maximum information leverage is achieved as cross-functional reporting and a single version of the truth are made possible.
Disadvantages: Requires an increase in up-front costs before the business sees any return on its investment. Makes it difficult to define the boundaries of the scoping exercise if the business is global. May not be suitable unless the client needs cross-functional reporting.

9 Bottom-Up Approach
Define the scope and coverage of the data warehouse and analyze the source systems within this scope
Define the initial increment based on political pressure, assumed business benefit, and data volume
Implement base technical architecture and establish metadata, extraction, and load processes as required by the increment
Create and populate the initial subject areas within the overall warehouse framework
This approach is similar to the top-down approach, but the emphasis is on the data rather than the business benefit. Here, IT is in charge of the project, either because IT wants to be in charge or because the business has deferred the project to IT.
Advantages: This is a “proof of concept” type of approach, so it is often appealing to IT, and IT buy-in is easier to obtain because the approach is focused on IT.
Disadvantages: Because the solution model is typically developed from source systems, which encapsulate the current business processes, the overall extensibility of the model is compromised. IT staff are often the last to know about business changes, so IT could be designing something that will be out of date before it is delivered. Because the framework of definition in this approach tends to be much narrower, a significant amount of reengineering work is often required for each increment.

10 Dimensional Modeling

11 Dimensional Model
Also called the star schema: the fact table sits in the middle, with the dimensions serving as the points of the star; it combines a normalized fact table with denormalized dimension tables.
Facts: measurements associated with a specific business process. Most facts are additive; others are semi-additive, non-additive, or descriptive (for example, in a factless fact table). Many facts can be derived from other facts, so non-additive facts can often be avoided by calculating them from additive facts.
Grain: the level of detail contained in the fact table. A fact table at the lowest level of detail is called an atomic fact table.

12 Dimensions
The foundation of the dimensional model, describing the objects of the business: dimensions are the nouns of the DW/BI system, while business processes (facts) are the verbs.
Dimension tables link to the business processes. A dimension shared across processes is called a conformed dimension.
Analysis involving data from more than one business process is called drill-across.

13 Data Cube
Data cubes are multidimensional extensions of 2-D tables, just as in geometry a cube is a three-dimensional extension of a square. The word cube brings to mind a 3-D object, and we can think of a 3-D data cube as a set of similarly structured 2-D tables stacked on top of one another.
Data cubes are not restricted to three dimensions: most OLAP systems can build data cubes with many more dimensions (some allow up to 64). In practice, we often construct data cubes with many dimensions, but we tend to look at just three at a time. What makes data cubes so valuable is that we can index the cube on one or more of its dimensions.
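To make the cube idea concrete, the following T-SQL sketch (table and column names are assumed for illustration, not taken from the slides) uses the CUBE grouping option to compute subtotals for every combination of three dimensions, which is essentially the set of aggregates an OLAP cube precomputes.

```sql
-- Hypothetical star-schema tables; names are illustrative only.
-- WITH CUBE (SQL Server 2005 syntax) aggregates Sales_amount over every
-- combination of the three grouping columns: (product, store, month),
-- (product, store), (product, month), ..., plus the grand total.
SELECT
    d.Product_desc,
    s.Store_id,
    t.Month_id,
    SUM(f.Sales_amount) AS Total_sales
FROM Sales_fact AS f
JOIN Product_dim AS d ON d.Product_id = f.Product_id
JOIN Store_dim   AS s ON s.Store_id   = f.Store_id
JOIN Time_dim    AS t ON t.Day_id     = f.Day_id
GROUP BY d.Product_desc, s.Store_id, t.Month_id
WITH CUBE;
-- On SQL Server 2008 and later the equivalent is
-- GROUP BY CUBE(d.Product_desc, s.Store_id, t.Month_id).
```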

14 Determining Granularity
YEAR? QUARTER? MONTH? WEEK? DAY?
When gathering more specific information about measures and analytic parameters (dimensions), it is also important to understand the level of detail required for analysis and business decisions. Granularity is the level of summarization (or detail) that will be maintained by your warehouse; the greater the level of detail, the finer the granularity. Grain is the lowest level of detail retained in the warehouse, such as the transaction level. Such data is highly detailed and can be summarized to any level required by the users.
During your interviews, discern the level of detail that users need for near-term analysis. After that is determined, identify whether a lower level of grain is available in the source data. If so, design for at least one grain finer, and perhaps even for the lowest level of grain. Remember that you can always aggregate upward, but you cannot decompose an aggregate below the level of data stored in the warehouse (see the T-SQL sketch below). The level of granularity for each dimension determines the grain for the atomic level of the warehouse, which in turn is used for rollups.
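As a sketch of why the atomic grain matters (reusing the hypothetical table names from the earlier example), a fact table stored at the day/product/store grain can always be rolled up to a coarser grain with a simple aggregate query:

```sql
-- Hypothetical tables at the atomic grain: one fact row per product,
-- per store, per day. Rolling up to monthly grain is a simple aggregation.
SELECT t.Year_id, t.Month_id, f.Product_id,
       SUM(f.Sales_amount) AS Month_sales,
       SUM(f.Sales_units)  AS Month_units
FROM Sales_fact AS f
JOIN Time_dim   AS t ON t.Day_id = f.Day_id
GROUP BY t.Year_id, t.Month_id, f.Product_id;
-- The reverse is impossible: a fact table loaded only at monthly grain
-- cannot be decomposed back into daily detail.
```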

15 Star Schema Model
(Diagram: a central Sales fact table with Product_id, Store_id, Item_id, Day_id, Sales_amount, and Sales_units, surrounded by denormalized dimension tables: Product (Product_id, Product_desc), Store (Store_id, District_id), Item (Item_id, Item_desc), and Time (Day_id, Month_id, Year_id).)
A star schema model can be depicted as a simple star: a central table contains fact data, and multiple tables radiate out from it, connected by database primary and foreign keys. Unlike other database structures, a star schema has denormalized dimensions. A star model:
Is easy for users to understand because the structure is so simple and straightforward
Provides fast response to queries, with optimization and a reduction in the physical number of joins required between fact and dimension tables
Contains simple metadata
Is supported by many front-end tools
Is slow to build because of the level of denormalization
The star schema is emerging as the predominant model for data warehouses and data marts.
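A minimal T-SQL sketch of this star schema, assuming the table and column names shown on the slide (data types are guesses): each dimension table carries a single-part primary key, and the fact table carries a multipart key made up of foreign keys plus the additive measures.

```sql
-- Dimension tables: one surrogate key each, denormalized attributes.
CREATE TABLE Product_dim (
    Product_id   INT         NOT NULL PRIMARY KEY,
    Product_desc VARCHAR(50) NOT NULL
);
CREATE TABLE Store_dim (
    Store_id    INT NOT NULL PRIMARY KEY,
    District_id INT NOT NULL            -- district kept in the dimension (denormalized)
);
CREATE TABLE Item_dim (
    Item_id   INT         NOT NULL PRIMARY KEY,
    Item_desc VARCHAR(50) NOT NULL
);
CREATE TABLE Time_dim (
    Day_id   INT NOT NULL PRIMARY KEY,
    Month_id INT NOT NULL,
    Year_id  INT NOT NULL
);
-- Fact table: multipart primary key composed of the dimension foreign keys,
-- plus the additive measures.
CREATE TABLE Sales_fact (
    Product_id   INT   NOT NULL REFERENCES Product_dim (Product_id),
    Store_id     INT   NOT NULL REFERENCES Store_dim (Store_id),
    Item_id      INT   NOT NULL REFERENCES Item_dim (Item_id),
    Day_id       INT   NOT NULL REFERENCES Time_dim (Day_id),
    Sales_amount MONEY NOT NULL,
    Sales_units  INT   NOT NULL,
    CONSTRAINT PK_Sales_fact PRIMARY KEY (Product_id, Store_id, Item_id, Day_id)
);
```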

16 Snowflake Schema Model
(Diagram: the Sales fact table joined to dimensions that are themselves normalized into hierarchies, for example Store (Store_id, Store_desc, District_id) linked to District (District_id, District_desc), and Item (Item_id, Item_desc, Dept_id) linked to Dept (Dept_id, Dept_desc, Mgr_id) linked to Mgr (Mgr_id, Mgr_name), plus Product and Time dimensions.)
According to Ralph Kimball, “a dimension is said to be snowflaked when the low cardinality fields in the dimension have been removed to separate tables and linked back into the original table with artificial keys.” A snowflake model is closer to an entity-relationship diagram than the classic star model because the dimension data is more normalized; developing a snowflake model means building class hierarchies out of each dimension (normalizing the data). A snowflake model:
Results in severe performance degradation because of its greater number of table joins
Provides a structure that is easier to change as requirements change
Is quicker at loading data into its smaller normalized tables, compared to loading into a star schema’s larger denormalized tables
Allows using history tables for changing data, rather than level fields (indicators)
Has a complex metadata structure that is harder for end-user tools to support
One of the major reasons the star schema has become more predominant than the snowflake model is its query performance advantage. In a warehouse environment, the snowflake’s quicker load performance is much less important than its slower query performance.
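For contrast with the star version, here is a sketch (names and types assumed) of how one dimension from the slide, Store, would be snowflaked by moving the low-cardinality district attributes into their own table:

```sql
-- Snowflaked version of the Store dimension: district attributes move to a
-- separate table and are linked back with a key, normalizing the dimension.
CREATE TABLE District_dim (
    District_id   INT         NOT NULL PRIMARY KEY,
    District_desc VARCHAR(50) NOT NULL
);
CREATE TABLE Store_dim_snowflaked (
    Store_id    INT         NOT NULL PRIMARY KEY,
    Store_desc  VARCHAR(50) NOT NULL,
    District_id INT         NOT NULL REFERENCES District_dim (District_id)
);
-- Reporting by district now needs an extra join
-- (Sales_fact -> Store_dim_snowflaked -> District_dim), which is the
-- query-performance cost described above.
```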

17 Snowflake Schema Model
Direct use by some tools
More flexible to change
Provides for speedier data loading
Can become large and unmanageable
Degrades query performance
More complex metadata
(Diagram: an example snowflaked hierarchy: Country > State > County > City.)
Besides the star and snowflake schemas, there are other models that can be considered.
Constellation: a constellation model (also called a galaxy model) simply comprises a series of star models. Constellations are a useful design feature if you have a primary fact table and summary tables of a different dimensionality. They can simplify design by allowing you to share dimensions among many fact tables.
Third normal form warehouse: some data warehouses consist of a set of relational tables that have been normalized to third normal form (3NF). Their data can be accessed directly using SQL. They may offer more efficient data storage, at the price of slower query performance due to extensive table joins. Some large companies build a 3NF central data warehouse feeding dependent star data marts for specific lines of business.

18 Dimensional Modeling Process
High-level dimensional model design: choosing the business process, declaring the grain, choosing the dimensions, identifying the facts
Detailed dimensional model development
Dimensional model review and validation with IS, core users, and the business community
Final design iteration

19 Example: Commrex Real Estate Data Warehousing
Analytic theme: how to encourage realtors to use the online ASP services
Value chain: listors create their accounts; listors post their real estate properties to the web-based database service and pay listing fees; property buyers search the web-based database and buy properties from listors. This is the incentive for listors to use the ASP services.
Business processes: listor sign-up, listor account management, property data posting, property search, property database maintenance

20 IMW’s Database ERD Model
(Diagram: an ERD linking a Property Listing database (Property ID, Listor ID, Property Type, Address, City, Subtype 1 through Subtype n, Feature) to a Membership database (Listor ID, Listor Name, Company ID, Chapter, Functions, Specializations), a Property Type table (Property Type, Type Name), and a Company table (Company ID, Comp Name, Address, Telephone #), with M:1 and M:M relationships. Legend: primary key, secondary key, link to a table.)

21 IMW’s Data Warehouse Dimensional Model
(Diagram: a Property Listing fact table (Property ID, Listor ID, Prop SubType, Property Type, Address, City, Feature) linked to a Membership dimension (Listor ID, Listor Name, Company ID, Chapter, Functions, Specializations), a Property SubType dimension (Prop SubType, SubType Name, Property Type), a Property Type dimension (Property Type, Type Name), and a Company dimension (Company ID, Comp Name, Address, Telephone #). Legend: primary key, secondary key, link to a table.)

22 Data Warehousing with Microsoft SQL Server 2005

23 Unified Dimensional Model (UDM)
A SQL Server 2005 technology. A UDM is a structure that sits on top of a data mart and looks exactly like an OLAP system to the end user.
Advantages:
No need for a data mart; can be built over one or more OLTP systems, or over mixed data mart and OLTP system data
Can include data from other vendors’ databases and from XML-formatted data
Allows OLAP cubes to be built directly on top of transactional data
Low latency
Ease of creation and maintenance

24 Microsoft BI Toolset
Relational engine (RDBMS): T-SQL, .NET Framework Common Language Runtime (CLR)
SQL Server Integration Services (SSIS), for ETL: Data Transformation Pipeline (DTP), Data Transformation Runtime (DTR)
SQL Server Analysis Services (SSAS), for queries, ad hoc use, OLAP, and data mining: Multi-Dimensional eXpressions (MDX), a language for retrieving data from dimensional databases; dimension design; cube design; data mining
SQL Server Reporting Services (SSRS), for ad hoc query and report building
Microsoft Visual Studio .NET is the fundamental tool for application development

25 Structure and Components of Business Intelligence
(Diagram: components surrounding MS SQL Server 2005: SSMS, SSIS, SSAS, SSRS, BIDS, SAS EG, and SAS EM.)

26 Understanding the Cube Designer Tabs
Cube Structure: use this tab to modify the architecture of a cube.
Dimension Usage: use this tab to define the relationships between dimensions and measure groups, and the granularity of each dimension within each measure group.
Calculations: use this tab to examine calculations defined for the cube, to define new calculations for the whole cube or for a subcube, to reorder existing calculations, and to debug calculations step by step using breakpoints.
KPIs: use this tab to create, edit, and modify the Key Performance Indicators (KPIs) in a cube.
Actions: use this tab to create or modify drillthrough, reporting, and other actions for the selected cube.
Partitions: use this tab to create and manage the partitions for a cube. Partitions let you store sections of a cube in different locations with different properties, such as aggregation definitions.
Perspectives: use this tab to create and manage the perspectives in a cube. A perspective is a defined subset of a cube, used to reduce the perceived complexity of a cube for the business user.
Translations: use this tab to create and manage translated names for cube objects, such as month or product names.
Browser: use this tab to view data in the cube.

27 Case: Adventure Works Cycles (AWC)

28 Case: Adventure Works Cycles (AWC)
A fictitious multinational manufacturer and seller of bicycles and accessories
Based in Bothell, Washington, USA, with regional sales offices in several countries

29 Basic Business Information
Product orders by category
Product orders by country/region
Product orders by sales channel
Customers by sales channel (snapshot)

30 AWC Business Requirements - Interview summary
Interviewee: Brian Welker, VP of Sales
Sales to resellers: $37 million last year; 17 people report to him, including 3 regional sales managers
Previous problem: hard to get information out of the company’s systems
Major analytic areas: sales planning, growth analysis, customer analysis, territory analysis, sales performance, basic sales reporting, price lists, special offers, customer satisfaction, international support
Success criteria: easy data access, flexible reporting and analysis, all data in one place
What’s missing? A lot; there is no indication of business value

31 Business Processes
Purchase orders
Distribution center deliveries
Distribution center inventory
Store deliveries
Store inventory
Store sales

32 Analytic Themes See the Excel file AW_Analytic_Themes_List.xls

33 AWC’s Bus Matrix
(Matrix: rows are the business processes Sales Forecasting, Orders, Call Tracking, and Returns; columns are the dimensions Date, Product, Employee, Customer (Reseller), Customer (Internet), Sales Territory, Currency, Channel, Promotion, Call Reason, and Facility. An X marks each dimension used by a process.)

34 Prioritization Grid
(Diagram: candidate processes plotted by business value/impact against feasibility: Orders, Customer Profitability, Product Forecast, Call Tracking, Exchange Rates, Returns, and Manufacturing Costs.)

35 Exercise 2 – A quick walk through an SSAS application
Learning objectives: how to design a data source view with SSAS based on an existing data warehouse; how to design and deploy a cube.
Tasks: Analysis Services Tutorial Lesson 1, Defining a Data Source View within an Analysis Services Project; Analysis Services Tutorial Lesson 2, Defining and Deploying a Cube.
Deliverable: a Word file with a screenshot of the star schema, to be e-mailed in. The subject of the e-mail is: “ISQS 3358 Exercise 2”.

36 Supplemental Slides: Data Warehouse Design Phases

37 Data Warehouse Database Design Phases
Phase 1: Defining the business model
Phase 2: Defining the dimensional model
Phase 3: Defining the physical model
Several methods for designing a data warehouse have been published over the years. Although these methods define certain terms differently, all include the same general tasks, which can be grouped into three phases:
Defining the business model: a strategic analysis is performed to identify business processes for implementation in the warehouse. Then a business requirements analysis is performed, in which the business measures and business dimensions for each business process are identified and documented.
Defining the dimensional model: the business model is transformed into a dimensional model. Warehouse schema tables and table elements are defined, relationships between schema tables are established, and sources for warehouse data elements are recorded.
Defining the physical model: the dimensional model is transformed into a physical model. This includes documenting data element formats, planning database size, and establishing partitioning, indexing, and storage strategies.

38 Phase 1: Defining the Business Model
Performing strategic analysis
Creating the business model
Documenting metadata
The first phase, business modeling, includes at least three tasks, each with associated deliverables: strategic analysis, business model creation, and metadata document creation.
Strategic analysis: the primary business process (or processes) is selected for implementation in the warehouse.
Business model creation: the business (conceptual) model is developed by uncovering detailed business requirements for a specific business process and verifying the existence of the source data needed to support the business analysis requirements.
Metadata creation: the metadata is created in this first phase of the design process. The results of the business model are summarized in the metadata tool, and this information serves as the essential resource for subsequent phases in the design process.

39 Performing Strategic Analysis
Identify crucial business processes
Understand business processes
Prioritize and select the business processes to implement
(Diagram: candidate processes plotted by business benefit against feasibility.)
Performed at the enterprise level, strategic analysis identifies, prioritizes, and selects the major business processes (also called business events or subject areas) that are most important to the overall corporate strategy. Strategic analysis includes the following steps:
Identify the business processes that are most important to the overall corporate strategy. Examples of business processes are orders, invoices, shipments, inventory, sales, account administration, and the general ledger.
Understand the business processes by drilling down on the dimensions that characterize each business process. The creation of a business process matrix can aid in this effort.
Prioritize and select the business process to implement in the warehouse, based on which one will provide the quickest and largest return on investment (ROI).

40 Creating the Business Model
Defining business requirements: identifying the business measures, identifying the dimensions, identifying the grain, identifying the business definitions and rules
Verifying data sources
The strategic analysis step produces a high-level definition of the chosen business process or processes. In this second step of the business modeling phase, a business model is created.
Defining business requirements: the business model is created by defining the business analysis requirements for each process. The previous lesson discussed interviewing end users to learn their query needs. You will also need to meet with the business managers and business analysts directly responsible for the specific business processes in order to: define specific business measures; create a detailed listing of the dimensions that characterize each measure; identify the granularity required to satisfy the analysis requirements; and clarify business definitions and business rules.
Verifying data sources: concurrently, you must perform an information systems (IS) data audit, a systematic exploration of the underlying legacy source systems to verify that the data required to support the business requirements is available.

41 Business Requirements Drive the Design Process
Business requirements are the primary input to the design process; existing metadata, production ERD models, and research are secondary inputs.
The entire scope of the data warehouse initiative must be driven by business requirements. Business requirements determine: what data must be available in the warehouse, how data is to be organized, how often data is updated, end-user application templates, and maintenance and growth.
Primary input: the business requirements are the primary input to the design of the data warehouse. Information requirements as defined by the business people, the end users, lay the foundation for the data warehouse content.

42 Identifying Measures and Dimensions
Measures: attributes that vary continuously, such as balance, units sold, cost, and sales.
Dimensions: attributes perceived as constant or discrete, such as product, location, time, and size.
A measure (or fact) contains a numeric value that measures an aspect of the business. Typical examples are gross sales dollars, total cost, profit, margin dollars, or quantity sold. A measure can be additive or partially additive across dimensions.
A dimension is an attribute by which measures can be characterized or analyzed. Dimensions bring meaning to raw data. Typical examples are customer name, date of order, or product brand.
Ultimately, the business requirements document should contain a list of the business measures and a detailed list of all dimensions, down to the lowest level of detail for each dimension. The slide shows an example for a retail customer sales process.

43 Using a Business Process Matrix
(Matrix: business processes Sales, Returns, and Inventory across the top; business dimensions Customer, Date, Product, Channel, and Promotion down the left side.)
A useful tool for understanding and quantifying business processes is the business process matrix (also called the process/dimension matrix). This matrix establishes a blueprint for the data warehouse database design, ensuring that the design is extensible over time. The business process matrix aids the strategic analysis task in two ways:
It helps identify the high-level analytical information required to satisfy the analytical needs of each business process, and serves as a method of cross-checking whether you have all of the required business dimensions for each business process.
It helps identify common business dimensions shared by different business processes. Business dimensions shared by more than one business process should be modeled with particular rigor, so that the analytical requirements of all processes that depend on them are supported. This is true even if one or more of the potential business processes are not selected for the first increment of the warehouse. Model the shared business dimensions to support all processes, so that later increments of the warehouse will not require a redesign of these crucial dimensions.

44 Determining Granularity
YEAR? QUARTER? MONTH? WEEK? DAY?
When gathering more specific information about measures and analytic parameters (dimensions), it is also important to understand the level of detail required for analysis and business decisions. Granularity is the level of summarization (or detail) that will be maintained by your warehouse; the greater the level of detail, the finer the granularity. Grain is the lowest level of detail retained in the warehouse, such as the transaction level. Such data is highly detailed and can be summarized to any level required by the users.
During your interviews, discern the level of detail that users need for near-term analysis. After that is determined, identify whether a lower level of grain is available in the source data. If so, design for at least one grain finer, and perhaps even for the lowest level of grain. Remember that you can always aggregate upward, but you cannot decompose an aggregate below the level of data stored in the warehouse. The level of granularity for each dimension determines the grain for the atomic level of the warehouse, which in turn is used for rollups.

45 Identifying Business Rules
(Table: example groupings. Location by geographic proximity: 0 - 1 miles, 1 - 5 miles, > 5 miles. Product: type (PC, server, custom), monitor (15 inch, 17 inch, 19 inch, none), status (new, rebuilt). Time: Month > Quarter > Year. Store: Store > District > Region.)
Business model elements should also be documented with agreed-upon business rules and definitions. For example, the wholesale computer sales process might include the following business rules:
All product items are grouped by status.
March, April, and May make up the first quarter of the fiscal year.
A store is in one and only one district.

46 Documenting Metadata
Documenting metadata should include: documenting the design process, documenting the development process, providing a record of changes, and recording enhancements over time.
Warehouse metadata is descriptive data about warehouse data and the processes used in creating the warehouse. It contains the information used to map data between the source systems and the warehouse and, additionally, contains transformation rules. The metadata repository (or document) should be created in the business modeling phase and used to record the first layer of business metadata. These business modeling results are summarized within the metadata and serve as the essential resource for subsequent phases in the design process. The metadata repository eventually contains detailed descriptions of the sources, content, structure, and physical attributes of the data warehouse. It is important to identify the business users who are the stewards or caretakers of the metadata. This keeps the business involved in the process while providing a clear, coherent understanding of metadata usage and definitions.

47 Metadata Documentation Approaches
Automated: data modeling tools, ETL tools, end-user tools
Manual
Regardless of the tools you use to create a data warehouse, metadata must play a central role in the design, development, and ongoing evolution of the warehouse.
Automated: three types of tools automatically create and store metadata. Data modeling tools record metadata as you perform modeling activities with the tool. ETL tools can also generate metadata, and they use the metadata repository as a resource to generate build and load scripts for the warehouse. End-user tools generally require the administrator to create a metadata layer that describes the structure and content of the data warehouse for that specific tool. Each of the tools used in your warehouse environment might generate its own set of metadata; the management and integration of different metadata repositories is one of the biggest challenges for the warehouse administrator.
Manual: you can also create and manage your own metadata repository, using a tool that does not dynamically interface with the warehouse, such as a spreadsheet, a word processor document, or a custom database. The manual approach provides flexibility; however, it is severely hampered by the labor-intensive nature of ongoing metadata maintenance.

48 Phase 2: Defining the Dimensional Model
Identify fact tables: translate business measures into fact tables; analyze source system information for additional measures
Identify dimension tables
Link fact tables to the dimension tables
Model the time dimension
The database design process begins with the enterprise view of the business and the specific subject areas to be implemented. These business needs determine the business model, which is a representation of business subjects and relationships. The dimensional modeling process is a top-down design approach. Use entity-relationship modeling to model your organization’s logical information requirements. Entity-relationship modeling involves identifying the things of importance (entities), the properties of these things (attributes), and how they are related to one another (relationships). While entity-relationship diagramming has traditionally been associated with highly normalized models such as OLTP applications, the technique is still useful for data warehouse design in the form of dimensional modeling. In dimensional modeling, instead of seeking to discover atomic units of information (such as entities and attributes) and all of the relationships between them, you identify which information belongs to a central fact table and which information belongs to its associated dimension tables. You identify business subjects or fields of data, define relationships between business subjects, and name the attributes for each subject.

49 Star Dimensional Modeling
(Diagram: the same star schema as before: a Sales fact table surrounded by Store, Item, Product, and Time dimension tables.)
Star dimensional modeling is a logical design technique that seeks to present the data in a standard framework that is intuitive and provides high performance. Every dimensional model is composed of one table called the fact table and a set of smaller tables called dimension tables. This denormalized, star-like structure is commonly known as a star model. Within this star model, redundant data is posted from one object to another for performance reasons. A fact table has a multipart primary key composed of two or more foreign keys and expresses a many-to-many relationship. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multipart key in the fact table.
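A typical star query against such a model, using the table and column names assumed in the earlier sketches: dimension attributes supply the filters and groupings, and the fact table's foreign keys provide the joins.

```sql
-- Filter on dimension attributes, join through the fact table's foreign keys,
-- and aggregate the additive measures.
SELECT s.District_id,
       p.Product_desc,
       SUM(f.Sales_amount) AS Total_sales,
       SUM(f.Sales_units)  AS Total_units
FROM Sales_fact  AS f
JOIN Store_dim   AS s ON s.Store_id   = f.Store_id
JOIN Product_dim AS p ON p.Product_id = f.Product_id
JOIN Time_dim    AS t ON t.Day_id     = f.Day_id
WHERE t.Year_id = 2005
GROUP BY s.District_id, p.Product_desc
ORDER BY Total_sales DESC;
```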

50 Fact Table Characteristics
Contain numerical metrics of the business
Can hold large volumes of data
Can grow quickly
Can contain base, derived, and summarized data
Are typically additive
Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables
Facts are the numerical measures of the business. The fact table is the largest table in the star schema and is composed of large volumes of data, usually making up 90% or more of the total database size. It can be viewed in two parts: a multipart primary key, and the business metrics (numeric and usually additive). Often a measure is required in the warehouse but does not appear to be additive; these are known as semiadditive facts. Inventory levels and room temperature are two such measurements: it does not make sense to add them over time, but they can be aggregated using an SQL function other than sum, for example average (see the sketch below). Although a star schema typically contains one fact table, other DSS schemas can contain multiple fact tables.
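A sketch of handling a semiadditive fact, using a hypothetical Inventory_fact table (not from the slides): quantity on hand can be summed across products or stores, but across the time dimension it is averaged rather than summed.

```sql
-- Hypothetical inventory fact at daily grain; Quantity_on_hand is
-- semiadditive (additive across products and stores, not across time),
-- so it is averaged over the days of the month instead of summed.
SELECT i.Product_id,
       AVG(CAST(i.Quantity_on_hand AS DECIMAL(18, 2))) AS Avg_on_hand
FROM Inventory_fact AS i
JOIN Time_dim       AS t ON t.Day_id = i.Day_id
WHERE t.Year_id = 2005
  AND t.Month_id = 6
GROUP BY i.Product_id;
```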

51 Dimension Table Characteristics
Dimension tables have the following characteristics:
Contain textual information that represents the attributes of the business
Contain relatively static data
Are joined to a fact table through a foreign key reference
Dimensions are the textual descriptions of the business. Dimension tables are typically smaller than fact tables, and their data changes much less frequently. Dimension tables give perspective on the whys and hows of the business and its transactions. Although dimensions generally contain relatively static data, customer dimensions are updated more frequently.
Dimensions are essential for analysis: the key to a powerful dimensional model lies in the richness of the dimension attributes, because they determine how facts can be analyzed. Dimensions can be considered the entry points into “fact space.” Always name attributes in the users’ vocabulary; that way, the dimension documents itself and its expressive power is apparent.

52 Star Dimensional Model Characteristics
The model is easy for users to understand.
Primary keys represent a dimension; non-foreign-key columns are values.
Facts are usually highly normalized; dimensions are completely denormalized.
Fast response to queries is provided, with performance improved by reducing table joins.
End users can express complex queries.
Support is provided by many front-end tools.
Each foreign key column on the fact table represents a dimension, and the nonkey columns in the fact table are values that can be aggregated. Fact tables do not contain character values; these belong with the dimensions. The star model structure is similar to how users understand the information. The model provides better performance for analytical queries by reducing the number of joins. It allows end users to express complex queries, because the data is arranged in a way that is easy to understand and the relationships between entities are very clear. It restricts the numerical measurements of the business to the fact table.
Note: the definitions of star and snowflake models vary among practitioners. Here the assumption is that the star model contains a fact table with one level of related dimensions (for example, a sales fact and a product dimension), whereas the snowflake has more than one level of dimension, that is, a hierarchy (for example, sales fact, product dimension, and product group).

53 Using Time in the Data Warehouse
Defining standards for time is critical, and aggregation based on time is complex.
Though it may seem obvious, real-life aggregations based on time can be quite complex. Which weeks roll up to which quarters? Is the first quarter the calendar months of January, February, and March, or the first 13 weeks of the year beginning on a Monday? Some causes of nonstandardization are:
Some countries start the work week on Monday, others on Sunday.
Weeks do not roll up cleanly to years, because a calendar year is one day longer than 52 weeks (two days longer in leap years).
There are differences between calendar and fiscal periods. Consider a warehouse that includes data from multiple organizations, each with its own calendars.
Holidays are not the same for all organizations and all locations.
Representing time is critical in the data warehouse. You may decide to store multiple hierarchies in the data warehouse to satisfy the varied definitions of units of time. If you are using external data, you may find that you create a hierarchy or translation table simply to be able to integrate the data. Matching the granularity of time defined in external data to the time dimension in your own warehouse may be quite difficult.

54 The Time Dimension
Time is critical to the data warehouse, and a consistent representation of time is required for extensibility. Where should the element of time be stored: in a time dimension, or on the sales fact?
Because online transaction data, typically the source data for the warehouse, often lacks an explicit time element, you apply an element of time in the extraction, transformation, and transportation process. For example, you might assign a week identifier to all the airline tickets sold within that week: the transaction may not carry a time or date stamp, but you know the date of the sale from the generation of the transaction file.
Storing the time dimension: typically there is a time dimension table in the data warehouse, although time elements may also be stored on the fact table. Before deciding where to store time, consider the following: almost every data warehouse has a time dimension; organizations use a variety of time periods for data analysis; and a row whose key is an SQL date may be populated with additional time qualifiers needed for business analysis, such as workday, fiscal period, and special events.
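A minimal sketch of populating a time (date) dimension with one row per calendar day, reusing the Time_dim columns assumed earlier; in practice, additional qualifiers such as workday, fiscal period, and holiday flags would be added as extra columns.

```sql
-- Populate Time_dim for one year, one row per day, using a yyyymmdd integer
-- surrogate as Day_id (e.g. 20050101). DATETIME is used because SQL Server
-- 2005 has no DATE type.
DECLARE @d DATETIME;
SET @d = '20050101';
WHILE @d <= '20051231'
BEGIN
    INSERT INTO Time_dim (Day_id, Month_id, Year_id)
    VALUES (CONVERT(INT, CONVERT(CHAR(8), @d, 112)),  -- style 112 = yyyymmdd
            MONTH(@d),
            YEAR(@d));
    SET @d = DATEADD(DAY, 1, @d);
END;
```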

55 Using Data Modeling Tools
Tools with a GUI enable definition, modeling, and reporting.
Avoid a mix of modeling techniques caused by development pressures, developers lacking knowledge, or the absence of a strategy. Determine a strategy, write it up and publish it formally, and make it available electronically.
Your logical design should result in: a set of entities and attributes corresponding to fact tables and dimension tables, and a model mapping operational data from your sources into subject-oriented information in your target data warehouse schema.
You can create the logical design using pen and paper, or you can use a design tool such as Oracle Warehouse Builder (specifically designed to support modeling the ETL process) or Oracle Designer (a general-purpose modeling tool). You can generally model the warehouse database using tools that provide a GUI for: entering metadata definitions of facts, dimensions, hierarchies, and relationships; drawing diagrams of star schemas containing the facts and dimensions; documenting business requirements; defining integrity rules and constraints; and generating reports about your metadata definitions.

56 Phase 3: Defining the Physical Model
Why: huge amounts of data must be processed and retrieved effectively, in real time.
How: translate the dimensional design to a physical model for implementation; define the storage strategy for tables and indexes; perform database sizing; define the initial indexing strategy; define the partitioning strategy; update the metadata document with physical information.
The physical model resides in the relational database (RDBMS) server. You need to ensure that each stored object (primarily tables) is held in the appropriate manner and has all the indexes necessary for optimal performance. There are other considerations to bear in mind for performance, such as data partitioning.
Dimensional model to physical model: the mapping of the dimensional model to the physical elements is accomplished by applying the following to the base dimensional model: add formats, such as data types and lengths, to the attributes of each entity; define the storage strategy for tables and indexes; perform database sizing; define the initial indexing strategy; define the partitioning strategy; and update the metadata document.

57 Storage and Performance Considerations
Database sizing
Data partitioning
Indexing
Star query optimization
One of the main challenges in data warehousing is to recognize that fact and detail tables will grow incredibly large, and to manage that growth successfully. Query performance continues to present challenges as these fact tables grow. Partitioning, indexing, and off-loading data that is no longer required are essential to sustaining a healthy data warehouse.

58 Database Sizing - Test Load Sampling
Analyze a representative sample of the data, chosen using proven statistical methods. Ensure that the sample reflects: test loads for different periods, day-to-day operations, seasonal data and worst-case scenarios, and indexes and summaries.
A good approach to sizing is based on the analysis of a representative sample of the data chosen using proven statistical methods. Test loads can be performed on data from a day, week, month, or any other period of time. You must ensure that the sample periods reflect the true day-to-day operations of your company, and that the results include any seasonal issues or other factors, such as worst-case scenarios, that may bias the results. After you have determined the number of transactions based on the sample, you calculate the size. You must also consider the following factors that can have an impact: indexing, because the amount of indexing can significantly affect the size of the database; and summary tables, which can be as large as the primary fact table, depending on the number of dimensions and the number of levels in the hierarchies associated with those dimensions.

59 Data Partitioning
Partitioning breaks data into separate physical units that can be handled independently. The two types of data partitioning are horizontal partitioning and vertical partitioning.
Partitioning can provide tremendous benefits to a wide variety of applications by improving manageability, performance, and availability. It is not unusual for partitioning to improve the performance of certain queries or maintenance operations by an order of magnitude. Moreover, partitioning can greatly simplify common administration tasks. [George Lumpkin, Oracle9i Partitioning, An Oracle White Paper, May 2002]
Partitioning enables you to break tables down into smaller, more manageable units, addressing the problem of supporting the large tables and indexes inherent in data warehouses. A large table is broken into many smaller physical tables or views, which are pulled together again for query actions that access data from more than one of them. The data can be partitioned horizontally or vertically. Partitioning helps in the following ways:
Improves the speed of access and data management by avoiding scans of partitions that are not needed during query and backup tasks
Increases availability by reducing the time needed to perform warehouse management tasks (such as loads) and by allowing one area of the database to be taken offline while others remain active
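A sketch of horizontal (range) partitioning in SQL Server (a partitioned-table feature of the Enterprise edition in SQL Server 2005); the function name, scheme name, and boundary values are illustrative, and the Day_id surrogate key assumed earlier serves as the partitioning column.

```sql
-- Range-partition the fact table by month, using the yyyymmdd Day_id values
-- as boundaries (illustrative boundary values).
CREATE PARTITION FUNCTION pf_SalesByMonth (INT)
    AS RANGE RIGHT FOR VALUES (20050201, 20050301, 20050401);

CREATE PARTITION SCHEME ps_SalesByMonth
    AS PARTITION pf_SalesByMonth ALL TO ([PRIMARY]);

-- The fact table is created (or rebuilt) on the partition scheme, with the
-- partitioning column included in the table definition.
CREATE TABLE Sales_fact_partitioned (
    Product_id   INT   NOT NULL,
    Store_id     INT   NOT NULL,
    Item_id      INT   NOT NULL,
    Day_id       INT   NOT NULL,
    Sales_amount MONEY NOT NULL,
    Sales_units  INT   NOT NULL
) ON ps_SalesByMonth (Day_id);
```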

60 Indexing Indexing is used for the following reasons:
It provides a huge cost saving, greatly improving performance and scalability.
It can replace a full table scan with a quick read of the index, followed by a read of only those disk blocks that contain the rows needed.
By intelligently indexing the data in your data warehouse, you can increase both the performance and the scalability of your warehouse solution. Using indexes, you can replace a full table scan with a quick read of the index followed by a read of only the disk blocks that contain the rows needed. The types of indexes supported by Oracle9i are described in the following slides.
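A sketch of a basic warehouse indexing pass in T-SQL, using the fact-table names assumed earlier: indexing each foreign key supports the star joins, and a covering index can answer a frequent filter-and-aggregate pattern from the index alone.

```sql
-- Index each foreign key on the fact table so star joins and dimension
-- filters can avoid full table scans (index names are illustrative).
CREATE INDEX IX_Sales_fact_Product ON Sales_fact (Product_id);
CREATE INDEX IX_Sales_fact_Store   ON Sales_fact (Store_id);
CREATE INDEX IX_Sales_fact_Item    ON Sales_fact (Item_id);
CREATE INDEX IX_Sales_fact_Day     ON Sales_fact (Day_id);

-- A covering index for queries that filter by day and return only measures:
-- the INCLUDE columns let the query be satisfied from the index.
CREATE INDEX IX_Sales_fact_Day_covering
    ON Sales_fact (Day_id)
    INCLUDE (Sales_amount, Sales_units);
```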

61 Parallelism
(Diagram: parallel processes P1, P2, P3 working against the Sales and Customers tables.)
Parallelism is the ability to apply multiple CPU and I/O resources to the execution of a single SQL command. Simply expressed, parallelism is the idea of breaking down a task so that, instead of one process doing all of the work in a query, many processes do parts of the work at the same time. An example is four processes handling four different quarters in a year instead of one process handling all four quarters by itself; the improvement in performance can be quite high. In this case, each quarter is a partition, a smaller and more manageable unit of an index or table.
Oracle’s parallel architecture allows any query to execute with any degree of parallelism. Oracle chooses the degree of parallelism for each query based on the complexity of the query, the size of the tables in the query, the hardware configuration, and the current level of activity on the system. Parallelism is a fundamental performance feature for executing queries over large volumes of data.
Note: for more information, refer to the Oracle9i Data Warehousing Guide, Release 2 (9.2), on parallel execution servers.
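The notes above describe Oracle's parallel execution; SQL Server likewise parallelizes large warehouse queries automatically, and the degree of parallelism can be capped per query with a hint, as in this sketch (table names assumed as before).

```sql
-- The optimizer chooses a parallel plan on its own for large scans; the
-- MAXDOP hint only caps the degree of parallelism for this statement.
SELECT t.Year_id, t.Month_id, SUM(f.Sales_amount) AS Total_sales
FROM Sales_fact AS f
JOIN Time_dim   AS t ON t.Day_id = f.Day_id
GROUP BY t.Year_id, t.Month_id
OPTION (MAXDOP 4);
```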

62 Using Summary Data
Designing summary tables offers the following benefits: fast access to precomputed data, and reduced use of I/O, CPU, and memory.
Another technique employed in data warehouses to improve performance is the creation of summaries. In Oracle, summaries are created using a schema object called a materialized view. In data warehouses, you can use materialized views to precompute and store aggregated data, such as the sum of sales; they can also be used to precompute joins, with or without aggregation. A materialized view eliminates the overhead associated with expensive joins and aggregations for a large or important class of queries. Having direct access to a summary table containing precomputed data reduces disk I/O, CPU sort, and memory swapping requirements. Materialized views in the data warehouse are transparent to the end user and to the database application: the database administrator creates one or more materialized views, the end user queries the tables and views at the detail level, and the query rewrite mechanism in the Oracle server automatically rewrites the SQL query to use the summary tables.
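The notes describe Oracle materialized views; the closest SQL Server analogue is an indexed view, sketched below with the table names assumed earlier (this presumes the measure columns are declared NOT NULL, as in the earlier DDL sketch). Automatic query rewrite to an indexed view is an Enterprise edition optimizer feature.

```sql
-- Summary definition: monthly sales totals. SCHEMABINDING, two-part table
-- names, and COUNT_BIG(*) are requirements for indexing a view.
CREATE VIEW dbo.v_Sales_by_month
WITH SCHEMABINDING
AS
SELECT t.Year_id,
       t.Month_id,
       SUM(f.Sales_amount) AS Total_sales,
       COUNT_BIG(*)        AS Row_count
FROM dbo.Sales_fact AS f
JOIN dbo.Time_dim   AS t ON t.Day_id = f.Day_id
GROUP BY t.Year_id, t.Month_id;
GO
-- The unique clustered index materializes the precomputed aggregate; in
-- Enterprise Edition the optimizer can rewrite matching detail-level queries
-- to read from it automatically.
CREATE UNIQUE CLUSTERED INDEX IX_v_Sales_by_month
    ON dbo.v_Sales_by_month (Year_id, Month_id);
```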

