Advanced Dimensional Modeling Concepts


1 Advanced Dimensional Modeling Concepts
Prof. Navneet Goyal Department of Computer Science & Information Systems BITS, Pilani

2 Topics
Mini-Dimensions; Outriggers; Drill-Across; Conformed Dimensions; Time Dimension; Multi-valued Dimensions; Helper Tables; Bridge Tables; Role-Playing Dimensions

3 Rapidly Changing Monster Dimensions
Multi-million-row customer dimensions present unique challenges that warrant special treatment: browsing or constraining takes too long; Type-II changes are not feasible; and business users want to track the myriad customer attribute changes. For example, insurance companies want accurate information about customers as of the time a policy is approved or a claim is made.

4 Mini-Dimensions Single technique to handle browsing-performance & change tracking problems Separate out frequently analyzed or frequently changing attributes into a separate dimension, called mini-dimension

5 Mini-Dimensions
Demographic Key   AGE     GENDER   INCOME LEVEL
1                 20-24   M        < 20000
2                                  20K-24999
3                                  25K-29999
18                25-29
10

6 Mini-Dimensions The mini-dimension itself cannot be allowed to grow very large. Suppose there are 5 demographic attributes and each attribute can take 10 distinct values. How many rows in the mini-dimension? Up to 10^5 = 100,000.

7 Mini-Dimensions Separate out a package of demographic attributes into a demographic mini-dimension: age, gender, marital status, no. of children, income level, etc. There is one row in the mini-dimension for each unique combination of these attributes.
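The row-count question and the "one row per unique combination" rule can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table and column names (customer_src, demographics_dim, demographics_key) and the sample rows are assumptions made for the example, not taken from the slides.

```python
import sqlite3

# Upper bound on mini-dimension size: 5 attributes x 10 values each.
max_rows = 10 ** 5  # 100,000 possible combinations

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_src (          -- hypothetical source customer data
    customer_id INTEGER PRIMARY KEY,
    age_band TEXT, gender TEXT, income_band TEXT);
INSERT INTO customer_src VALUES
    (1, '20-24', 'M', '<20000'),
    (2, '20-24', 'M', '<20000'),
    (3, '20-24', 'F', '20K-24999'),
    (4, '25-29', 'M', '20K-24999');

-- Mini-dimension: one row per distinct combination actually observed.
CREATE TABLE demographics_dim (
    demographics_key INTEGER PRIMARY KEY AUTOINCREMENT,
    age_band TEXT, gender TEXT, income_band TEXT);
INSERT INTO demographics_dim (age_band, gender, income_band)
    SELECT DISTINCT age_band, gender, income_band
    FROM customer_src
    ORDER BY age_band, gender, income_band;
""")

mini_dim_rows = conn.execute(
    "SELECT COUNT(*) FROM demographics_dim").fetchone()[0]
print(max_rows, mini_dim_rows)
```

Customers 1 and 2 share one demographics row, so four customers need only three mini-dimension rows here. A real load would also write the demographics key into the fact table.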

8 Dimension-focused Queries
Standard OLAP queries are fact-focused Query touches one fact table and its associated dimensions Some types of analysis are dimension-focused Bring together data from different fact tables that have a dimension in common Common dimension used to coordinate facts Sometimes referred to as “drilling across”

9 Drill-Across Example Example scenario:
Sales fact with dimensions (Date, Customer, Product, Store). CustomerSupport fact with dimensions (Date, Customer, Product, ServiceRep).
Question: How does frequency of support calls by California customers affect their purchases of Product X?
Step 1: Query CustomerSupport fact. Group by Customer SSN; filter on State = California; compute COUNT. Query result has schema (Customer SSN, SupportCallCount).
Step 2: Query Sales fact. Group by Customer SSN; filter on State = California, Product Name = Product X; compute SUM(TotalSalesAmt). Query result has schema (Customer SSN, TotalSalesAmt).
Step 3: Combine query results. Join Result 1 and Result 2 based on Customer SSN; group by SupportCallCount; compute COUNT, AVG(TotalSalesAmt).

10 A Problem with the Example
What if some customers don't make any support calls? Then there are no rows for these customers in the CustomerSupport fact, none in the result of Step 1, and no data for them in the result of Step 3. Solution: use an outer join in Step 3. Customers who are in Step 2 but not Step 1 will be included in the result of Step 3; attributes from the Step 1 result table will be NULL for these customers. Convert these NULLs to an appropriate value before presenting results, e.g., with Oracle's NVL() function (COALESCE() in standard SQL).
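The three drill-across steps and the outer-join fix can be sketched end to end. This is only an illustration run on SQLite through Python's sqlite3 module: the table names, column names, and sample rows are hypothetical, and COALESCE stands in for Oracle's NVL because SQLite has no NVL function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, ssn TEXT, state TEXT);
CREATE TABLE product  (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sales_fact   (customer_key INTEGER, product_key INTEGER,
                           total_sales_amt REAL);
CREATE TABLE support_fact (customer_key INTEGER, product_key INTEGER);

INSERT INTO customer VALUES (1,'111','California'),(2,'222','California'),
                            (3,'333','Nevada');
INSERT INTO product  VALUES (1,'Product X'),(2,'Product Y');
INSERT INTO sales_fact VALUES (1,1,100.0),(1,1,50.0),(2,1,80.0),
                              (2,2,999.0),(3,1,70.0);
INSERT INTO support_fact VALUES (1,1),(1,2);  -- customer 2 never called support
""")

drill_across = """
WITH calls AS (                       -- Step 1: support calls per customer
    SELECT c.ssn, COUNT(*) AS support_call_count
    FROM support_fact f JOIN customer c ON c.customer_key = f.customer_key
    WHERE c.state = 'California'
    GROUP BY c.ssn),
sales AS (                            -- Step 2: Product X sales per customer
    SELECT c.ssn, SUM(f.total_sales_amt) AS total_sales_amt
    FROM sales_fact f
    JOIN customer c ON c.customer_key = f.customer_key
    JOIN product  p ON p.product_key  = f.product_key
    WHERE c.state = 'California' AND p.product_name = 'Product X'
    GROUP BY c.ssn)
SELECT COALESCE(calls.support_call_count, 0) AS call_count,  -- Step 3
       COUNT(*) AS customers,
       AVG(sales.total_sales_amt) AS avg_sales
FROM sales LEFT JOIN calls ON calls.ssn = sales.ssn  -- keep non-callers
GROUP BY COALESCE(calls.support_call_count, 0)
ORDER BY 1
"""
rows = conn.execute(drill_across).fetchall()
print(rows)
```

With an inner join in Step 3, the customer who never called support would vanish from the result; the left outer join keeps that customer with a call count of 0.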

11 Outer Join An inner join selects only the rows common to the tables participating in the join. To select the rows of one table regardless of whether they have a match in the second table, OUTER JOIN is the solution. In Oracle, we place a "(+)" in the WHERE clause next to the column of the other table, i.e., the table opposite the one for which we want to include all the rows.

12 Outer Join
Table Store_Information:
store_name     Sales    Date
Los Angeles    $1500    Jan
San Diego      $250     Jan
Los Angeles    $300     Jan
Boston         $700
Table Geography:
region_name    store_name
East           Boston
East           New York
West           Los Angeles
West           San Diego
- We want to find out the sales amount for all of the stores.
- If we do a regular join, we will not get what we want because we will miss "New York," since it does not appear in the Store_Information table.
    SELECT A1.store_name, SUM(A2.Sales) SALES
    FROM Geography A1, Store_Information A2
    WHERE A1.store_name = A2.store_name (+)
    GROUP BY A1.store_name
Result:
store_name     SALES
Boston         $700
New York
Los Angeles    $1800
San Diego      $250

13 NVL Function In Oracle/PLSQL, the NVL function lets you substitute a value when a null value is encountered. Syntax: NVL(string1, replace_with), where string1 is the string to test for a null value and replace_with is the value returned if string1 is null.
Example #1: select NVL(supplier_city, 'n/a') from suppliers; This SQL statement returns 'n/a' if the supplier_city field contains a null value. Otherwise, it returns the supplier_city value.
Example #2: select supplier_id, NVL(supplier_desc, supplier_name) from suppliers; This SQL statement returns the supplier_name field if supplier_desc contains a null value. Otherwise, it returns supplier_desc.
Example #3: select NVL(commission, 0) from sales; This SQL statement returns 0 if the commission field contains a null value. Otherwise, it returns the commission field.

14 Conformed Dimensions The bottom-up data warehousing approach builds one data mart at a time. Drill-across between data marts requires common dimension tables, so common dimensions and attributes should be standardized across data marts: create a master copy of each common dimension table. Three types of “conformed” dimensions: (1) the dimension table is identical to the master copy; (2) the dimension table has a subset of rows from the master copy, which can improve performance when many dimension rows are not relevant to a particular process; (3) the dimension table has a subset of attributes from the master copy, which allows for roll-up dimensions at different grains.

15 Conformed Dimension Example
Monthly sales forecasts Predicted sales for each brand in each district in each month POS Sales fact recorded at finer-grained detail Product SKU vs. Brand Date vs. Month Store vs. District Use roll-up dimensions Brand dimension is rolled-up version of master Product dimension One row per brand Only include attributes relevant at brand level or higher Month dimension is rolled-up Date District dimension is rolled-up Store Schema Sales (Date, Product, Store, Promotion, Transaction ID) Forecast (Month, Brand, District)

16 Drill-Across Example Question: How did actual sales diverge from forecasted sales in Sept. '04? Drill-across between Forecast and Sales.
Step 1: Query Forecast fact. Group by Brand Name, District Name; filter on MonthAndYear = 'Sept 04'; calculate SUM(ForecastAmt). Query result has schema (Brand Name, District Name, ForecastAmt).
Step 2: Query Sales fact. Group by Brand Name, District Name; filter on MonthAndYear = 'Sept 04'; calculate SUM(TotalSalesAmt). Query result has schema (Brand Name, District Name, TotalSalesAmt).
Step 3: Combine query results. Join Result 1 and Result 2 on Brand Name and District Name. Result has schema (Brand Name, District Name, ForecastAmt, TotalSalesAmt).
An outer join is unnecessary assuming that a forecast exists for every brand, district, and month, and that every brand has some sales in every district during every month.
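The forecast-versus-actual drill-across can be sketched the same way. The schema below is a heavily simplified assumption: all table and column names and all sample values are hypothetical, and the forecast fact stores the month, brand, and district labels directly rather than surrogate keys into roll-up dimension tables, purely to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Master dimensions at the daily / SKU / store grain.
CREATE TABLE date_dim    (date_key INTEGER PRIMARY KEY, month_and_year TEXT);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, sku TEXT,
                          brand_name TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, store_name TEXT,
                          district_name TEXT);
CREATE TABLE sales_fact  (date_key INTEGER, product_key INTEGER,
                          store_key INTEGER, total_sales_amt REAL);
-- Forecast fact at the rolled-up grain; its labels conform to the masters.
CREATE TABLE forecast_fact (month_and_year TEXT, brand_name TEXT,
                            district_name TEXT, forecast_amt REAL);

INSERT INTO date_dim VALUES (1,'Sept 04'),(2,'Sept 04'),(3,'Oct 04');
INSERT INTO product_dim VALUES (1,'SKU-1','BrandA'),(2,'SKU-2','BrandA'),
                               (3,'SKU-3','BrandB');
INSERT INTO store_dim VALUES (1,'Store 1','North'),(2,'Store 2','North');
INSERT INTO sales_fact VALUES (1,1,1,10.0),(2,2,2,15.0),(1,3,1,7.0),
                              (3,1,1,99.0);
INSERT INTO forecast_fact VALUES ('Sept 04','BrandA','North',20.0),
                                 ('Sept 04','BrandB','North',10.0);
""")

query = """
WITH fc AS (                          -- Step 1: forecast by brand and district
    SELECT brand_name, district_name, SUM(forecast_amt) AS forecast_amt
    FROM forecast_fact
    WHERE month_and_year = 'Sept 04'
    GROUP BY brand_name, district_name),
act AS (                              -- Step 2: roll daily sales up to that grain
    SELECT p.brand_name, s.district_name,
           SUM(f.total_sales_amt) AS total_sales_amt
    FROM sales_fact f
    JOIN date_dim d    ON d.date_key    = f.date_key
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN store_dim s   ON s.store_key   = f.store_key
    WHERE d.month_and_year = 'Sept 04'
    GROUP BY p.brand_name, s.district_name)
SELECT fc.brand_name, fc.district_name,  -- Step 3: combine on conformed labels
       fc.forecast_amt, act.total_sales_amt
FROM fc JOIN act ON act.brand_name = fc.brand_name
                AND act.district_name = fc.district_name
ORDER BY fc.brand_name, fc.district_name
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The join in Step 3 only works because brand and district names mean the same thing in both marts, which is the point of conforming the dimensions.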

17 Time Dimension Time is a unique & powerful dimension in every DM & EDW
Time dimension is very special & should be treated differently from other dimensions Example of a Star Schema Fact table records daily orders received by a manufacturing company Time dimension designates calendar days

18 FAQs About Time Dimension
Why can’t I just leave out the time dimension? Dimension tables serve as the source of constraints and as the source of report row headers A data mart is only as good as its dimension tables SQL provides some minimal assistance in navigating dates SQL certainly doesn't know anything about your corporate calendar, your fiscal periods, or your seasons

19 FAQs About Time Dimension
If I have to have a time dimension, where do I get it? Build it in a spreadsheet Some data marts additionally track time of day to the nearest minute or even the nearest second. For these cases separate the time of day measure out as a separate "numeric fact." It should not be combined into one key with the calendar day dimension. This would make for an impossibly large time table.
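A spreadsheet is one way to build the calendar table; a short script is another. The sketch below generates one row per calendar day. The column names are illustrative assumptions, and company-specific attributes such as fiscal periods or seasons are left out because they cannot be derived from the date alone.

```python
from datetime import date, timedelta


def build_date_dimension(start, end):
    """Return one dict per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),  # e.g. 20040101
            "full_date": day.isoformat(),
            "day_of_week": day.strftime("%A"),
            "month_name": day.strftime("%B"),
            "calendar_quarter": (day.month - 1) // 3 + 1,
            "calendar_year": day.year,
            "is_weekend": day.weekday() >= 5,  # Saturday or Sunday
        })
        day += timedelta(days=1)
    return rows


date_dim = build_date_dimension(date(2004, 1, 1), date(2004, 12, 31))
print(len(date_dim), date_dim[0])
```

Even ten years of days is only a few thousand rows, which is why the calendar day dimension stays small as long as time of day is kept out of its key.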

20 Time Dimension Guard against incompatible rollups like weeks & months
Separate fact tables denominated in weeks and months should be avoided at all costs Uniform time rollups should be used in all the separate data marts Daily data rolls up to every possible calendar

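21 Time Dimension Be careful about aggregating facts that are non-additive with respect to time. Examples: inventory levels & account balances. Summing them across time periods is meaningless; we need an “average over time”. SQL AVG does not provide it: AVG divides by the number of rows, not by the number of time periods, and SQL supports no “avgperiodsum” operator.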
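A small worked example shows why plain AVG is not an average over time. The balance_snapshot table and its values are hypothetical; the technique is simply to sum the balances and divide by the number of distinct time periods.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Daily balance snapshots for two accounts over three days.
CREATE TABLE balance_snapshot (snapshot_date TEXT, account_id INTEGER,
                               balance REAL);
INSERT INTO balance_snapshot VALUES
  ('2004-09-01', 1, 100.0), ('2004-09-01', 2, 300.0),
  ('2004-09-02', 1, 200.0), ('2004-09-02', 2, 300.0),
  ('2004-09-03', 1, 300.0), ('2004-09-03', 2, 300.0);
""")

naive_sum, naive_avg, avg_over_time = conn.execute("""
SELECT SUM(balance),                                -- adds the same money 3 times
       AVG(balance),                                -- divides by 6 rows
       SUM(balance) / COUNT(DISTINCT snapshot_date) -- divides by 3 days
FROM balance_snapshot
""").fetchone()
print(naive_sum, naive_avg, avg_over_time)
```

The total held across both accounts is 400, 500, and 600 on the three days, so the average daily total is 500. SUM reports 1500 and AVG reports 250, and neither answers the question.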

22 Time Dimension Should the data mart be built around individual transactions or around month-end snapshots? Consider transaction-intensive businesses like insurance. A question such as "What is the average length of time between the original claim and the first payment to the claimant?" needs the individual transactions, but a record of the individual transactions is a poor basis for creating a month-end report for management. Build two versions of the data mart: a transaction version & a monthly snapshot version.

23 Time Dimension

24 Time Dimension

25 Multi-valued Dimensions
Declaring grain of the fact table is one of the important design decisions Grain declares the exact meaning of a single fact record If the grain of the FT is clear, choosing Dimensions becomes easy

26 Multi-valued Dimensions
John & Mary Smith are a single household. John has a current account; Mary has a savings account; John & Mary have a joint current account & a credit card. An account can have one, two, or more customers associated with it.

27 Multi-valued Dimensions
Customer as an account dimension attribute? Doing so violates the granularity of the dimension table as more than one customer could be associated with an account Customer as an additional dimension in the FT? Doing so violates the granularity of the FT (one row per account per month) Classic example of a multi-valued dimension How to model multi-valued dimensions?

28 Bridge Tables Account to Customer BRIDGE table
Fact Table (account_id) -> Account Dimension (account_id, account-related attributes) -> Bridge Table (account_id, customer_id, weight) -> Customer Dimension (customer_id, customer-related attributes)

29 Multi-valued Dimensions
Time of Day Dimension Time_key(PK) Morning Rush Hour Mid Morning Lunch Hour Mid Afternoon Afternoon rush Hour

30 Multi-valued Dimensions
In the classical Grocery Store Sales Data Mart, the following dimensions are obvious for the daily grain: Calendar Date Product Store Promotion (assuming the promotion remains in effect the entire day) What about the Customer & Check-out clerk dimensions? Many values at the daily grain!

31 Multi-valued Dimensions
We disqualified the customer and check-out clerk dimensions. Many dimensions can get disqualified if we are dealing with an aggregated FT: the more the FT is summarized, the fewer dimensions we can attach to the fact records. What about the converse? The more granular the data, the more dimensions make sense!

32 Multi-valued Dimensions
Single-valued dimensions are welcome Any legitimate exceptions? If yes, how to handle them? Example: Healthcare Billing Grain is individual line item on a doctor/hospital bill The individual line items guide us through to the dimensions

33 Multi-valued Dimensions: Healthcare Example
Calendar Date (of incurred charges) Patient Doctor (usually called ‘provider’) Location Service Performed Diagnosis Payer (insurance co., employer, self) In many healthcare situations, there may be multiple values for diagnosis Really sick people having 10 different diagnoses!! How to model the Diagnosis Dimension?

34

35 Multi-valued Dimensions: Healthcare Example
Modeling the Diagnosis dimension, 4 ways: (1) Disqualify the Diagnosis dimension because it is multi-valued (MV). (2) Choose the primary diagnosis & ignore others. (3) Extend the dimension list to have a fixed number of Diagnosis dimensions. (4) Put a helper table in between this fact table and the Diagnosis dimension table.

36 Multi-valued Dimensions: Healthcare Example
(1) Disqualify the Diagnosis dimension because it is MV: the easy way out, but not recommended.
(2) Choose the primary diagnosis & ignore others: use the primary or admitting diagnosis. The modeling problem is taken care of, but is the diagnosis information useful in any way?
(3) Extend the dimension list to have a fixed number of Diagnosis dimensions: create a fixed number of additional Diagnosis dimension slots in the fact table key. There will be some complicated example of a very sick patient who exceeds the number of Diagnosis slots you have allocated. Multiple separate Diagnosis dimensions cannot be queried easily: if "headache" is a diagnosis, which Diagnosis dimension should be constrained? Avoid the multiple-dimensions style of design, as logic across dimensions is notoriously slow on relational databases.

37 Multi-valued Dimensions: Healthcare Example
Helper Table Approach Place a "helper" table between the Diagnosis dimension and the fact table The Diagnosis key in the fact table is changed to be a Diagnosis Group key The helper table in the middle is the Diagnosis Group table It has one record for each diagnosis in a group of diagnoses If I walk into the doctor's office with three diagnoses, then I need a Diagnosis Group with three records in it Either build these Diagnosis Groups for each individual or a library of "known" Diagnosis Groups

38 Multi-valued Dimensions: Healthcare Example
A helper table for an open-ended number of diagnoses

39 Multi-valued Dimensions: Healthcare Example
Helper Table Approach The Diagnosis Group table contains a very important numeric attribute: the weighting factor The weighting factor allows reports to be created that don't double count the Billed Amount in the fact table For instance, if you constrain some attribute in the Diagnosis dimension such as "Contagious Indicator" with the values Contagious and Not Contagious, then you can group by the Contagious Indicator and produce a report with the correct totals. To get the correct totals, we must multiply the Billed Amount by the associated weighting factor Assign the weighting factors equally within a Diagnosis Group. If there are three diagnoses, then each gets a weighting factor of 1/3 All weight factors in a Diagnosis Group always add up to 1

40 Multi-valued Dimensions: Healthcare Example
Helper Table Approach Alternatively, you can deliberately omit the weighting factor and deliberately double count in the same report grouped by Contagious Indicator. This produces an "impact report" that shows the total Billed Amount implied partially or fully by both values of Contagious Indicator. The correctly weighted report is the most common and makes the most sense; the impact report is interesting and is requested from time to time. Such an impact report should be labeled so that the reader is not misled by any summary totals.
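The difference between the weighted report and the impact report can be seen in a small sketch. The table and column names and the sample diagnoses are assumptions for illustration (only "headache" and the Contagious Indicator come from the slides), and weights of 0.5 are used instead of 1/3 so the arithmetic is exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE diagnosis_dim (diagnosis_key INTEGER PRIMARY KEY, name TEXT,
                            contagious_indicator TEXT);
-- Helper table: one row per diagnosis in each group; weights sum to 1.
CREATE TABLE diagnosis_group (diagnosis_group_key INTEGER,
                              diagnosis_key INTEGER, weighting_factor REAL);
CREATE TABLE billing_fact (line_item_id INTEGER, diagnosis_group_key INTEGER,
                           billed_amount REAL);

INSERT INTO diagnosis_dim VALUES (1,'Influenza','Contagious'),
                                 (2,'Headache','Not Contagious'),
                                 (3,'Fracture','Not Contagious');
-- Group 10 has two diagnoses (0.5 each); group 20 has a single diagnosis.
INSERT INTO diagnosis_group VALUES (10,1,0.5),(10,2,0.5),(20,3,1.0);
INSERT INTO billing_fact VALUES (1,10,200.0),(2,20,100.0);
""")

report = conn.execute("""
SELECT d.contagious_indicator,
       SUM(f.billed_amount * g.weighting_factor) AS weighted_amount,
       SUM(f.billed_amount)                      AS impact_amount
FROM billing_fact f
JOIN diagnosis_group g ON g.diagnosis_group_key = f.diagnosis_group_key
JOIN diagnosis_dim d   ON d.diagnosis_key       = g.diagnosis_key
GROUP BY d.contagious_indicator
ORDER BY d.contagious_indicator
""").fetchall()
print(report)
```

The weighted column adds up to 300, the true total billed. The impact column adds up to 500 because the 200 line item is counted once under each indicator value.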

41 Multi-valued Dimensions
Helper Table Approach Helper table clearly violates the classic star join design where all the dimension tables have a simple one-to-many relationship to the fact table But it is the only viable solution to handling MV dimensions We can preserve the star join illusion in end-user interfaces by creating a view that prejoins the fact table to the helper table Other Applications: Retail Banks Standard Industry Classification

42 Helper Tables for Hierarchies
Helper Tables are useful for handling M:M relationship between FT & DT One more real world situation which can be modeled using helper tables Dimension with complex variable-depth hierarchy

43 Helper Tables for Hierarchies
Constant depth hierarchies: Store all levels of hierarchy in denormalized dimension table The preferred solution in almost all cases! Create “snowflake” schema with hierarchy captured in separate outrigger table Only recommended for huge dimension tables Storage savings have negligible impact in most cases

44 Variable Depth Hierarchies
Examples: corporate organization chart. Consulting Invoices DM: consulting services are sold at different organizational levels. Reports are needed that show consulting sold not only to individual departments, but also to divisions, subsidiaries, and the overall enterprise. The report must still add up the separate consulting revenues for each organization structure.

45 Variable Depth Hierarchies
Examples Parts composed of subparts Part1 Subpart 2 Subpart 3 Subpart 4 Subpart 5 Subpart 6 Subpart 7

46 Handling Hierarchies Solutions for variable-depth hierarchies?
Creating recursive foreign key to parent row is a possibility Employee dimension has “boss” attribute which is FK to Employee The CEO has NULL value for boss This approach is not recommended Cannot be queried effectively using SQL Alternative approach: bridge table

47 Handling Hierarchies Creating recursive foreign key to parent row is a possibility

48 Handling Hierarchies Creating recursive foreign key to parent row is a possibility Employee dimension has “boss” attribute which is FK to Employee The CEO has NULL value for boss This approach is not recommended Cannot be queried effectively using SQL ORACLE does provide “start with” & “connect by”


50 Start With & Connect By
The start with connect by clause can be used to select data that has a hierarchical relationship.
    create table test_connect_by (
      parent number,
      child  number,
      constraint uq_tcb unique (child)
    );
    insert into test_connect_by values ( 5, 2);
    insert into test_connect_by values ( 5, 3);
    insert into test_connect_by values (18, 11);
    insert into test_connect_by values (18, 7);
    insert into test_connect_by values (17, 9);
    insert into test_connect_by values (17, 8);
    insert into test_connect_by values (26, 13);
    insert into test_connect_by values (26, 1);
    insert into test_connect_by values (26, 12);
    insert into test_connect_by values (15, 10);
    insert into test_connect_by values (15, 5);
    insert into test_connect_by values (38, 15);
    insert into test_connect_by values (38, 17);
    insert into test_connect_by values (38, 6);
    -- 38, 26, 16 have no parents
    insert into test_connect_by values (null, 38);
    insert into test_connect_by values (null, 26);
    insert into test_connect_by values (null, 16);
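START WITH and CONNECT BY are Oracle-specific. As a portable stand-in, the same walk over the test_connect_by rows can be sketched with a standard recursive common table expression, run here on SQLite through Python's sqlite3 module. This is a swapped-in technique, not the Oracle syntax, and the lvl column is an assumed analogue of Oracle's LEVEL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_connect_by (parent INTEGER, child INTEGER UNIQUE)")
conn.executemany("INSERT INTO test_connect_by VALUES (?, ?)", [
    (5, 2), (5, 3), (18, 11), (18, 7), (17, 9), (17, 8),
    (26, 13), (26, 1), (26, 12), (15, 10), (15, 5),
    (38, 15), (38, 17), (38, 6),
    (None, 38), (None, 26), (None, 16),  # rows with no parent
])

tree = conn.execute("""
WITH RECURSIVE walk(child, lvl) AS (
    -- anchor member: plays the role of START WITH parent IS NULL
    SELECT child, 1 FROM test_connect_by WHERE parent IS NULL
    UNION ALL
    -- recursive member: plays the role of CONNECT BY PRIOR child = parent
    SELECT t.child, w.lvl + 1
    FROM test_connect_by t JOIN walk w ON t.parent = w.child
)
SELECT child, lvl FROM walk ORDER BY lvl, child
""").fetchall()
print(tree)
```

The result is ordered level by level here, whereas an Oracle hierarchical query returns rows in depth-first order. With the rows exactly as listed, 18 never appears as a child, so its children 11 and 7 are not reached from the parentless rows.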

51 Bridge Tables
The customer dimension has one row for each customer entity at any level of the hierarchy. A separate bridge table has the schema: parent customer key, subsidiary customer key, depth of subsidiary, bottom flag, top flag. There is one row in the bridge table for every (ancestor, descendant) pair; a customer counts as its own depth-0 ancestor. This gives 16 rows for the hierarchy in the figure. The fact table can join directly to the customer dimension, or through the bridge table to the customer dimension.
[Figure: hierarchy of Customer 1 through Customer 7; tables Fact, Bridge, and Customer with keys cust_id, parent_id, child_id]

52 Bridge Table Example
[Table with columns parent_id, child_id, depth, top_flag, bottom_flag]

53 Using Bridge Tables in Queries
Two join directions:
Navigate up the hierarchy: fact joins to subsidiary customer key; dimension joins to parent customer key.
Navigate down the hierarchy: fact joins to parent customer key; dimension joins to subsidiary customer key.
Safe uses of the bridge table:
Filter on the customer dimension restricts the query to a single customer; use the bridge table to combine data about that customer's subsidiaries or parents.
Filter on the bridge table restricts the query to a single level: require Top Flag = Y; or require Depth = 1, for immediate parent / child organizations; or require (Depth = 1 OR (Depth < 1 AND Top Flag = Y)), which generalizes the previous example to properly treat top-level customers.
Other uses of the bridge table risk over-counting, because the bridge table is many-to-many between fact and dimension.
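Building the bridge table and using it safely can be sketched as follows. The seven-customer tree is hypothetical, chosen only so that the bridge comes out at 16 rows like the count quoted earlier; the lost figure may show a different tree. The top and bottom flags are omitted to keep the sketch short, and all table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tree: 1 -> (2, 3, 4), 2 -> (5, 6), 4 -> (7).
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, parent_cust_id INTEGER);
INSERT INTO customer VALUES (1,NULL),(2,1),(3,1),(4,1),(5,2),(6,2),(7,4);
CREATE TABLE sales_fact (cust_id INTEGER, amount REAL);
INSERT INTO sales_fact VALUES (1,10),(2,20),(3,30),(4,40),(5,50),(6,60),(7,70);

-- One bridge row per (ancestor, descendant) pair, including depth-0 self rows.
CREATE TABLE bridge (parent_id INTEGER, child_id INTEGER, depth INTEGER);
WITH RECURSIVE pairs(parent_id, child_id, depth) AS (
    SELECT cust_id, cust_id, 0 FROM customer
    UNION ALL
    SELECT p.parent_id, c.cust_id, p.depth + 1
    FROM pairs p JOIN customer c ON c.parent_cust_id = p.child_id
)
INSERT INTO bridge SELECT parent_id, child_id, depth FROM pairs;
""")

bridge_rows = conn.execute("SELECT COUNT(*) FROM bridge").fetchone()[0]

# Safe use 1: filter the dimension to a single customer, then combine the
# facts of that customer and all of its subsidiaries.
rollup = conn.execute("""
SELECT SUM(f.amount)
FROM customer d
JOIN bridge b     ON b.parent_id = d.cust_id   -- dimension joins to parent key
JOIN sales_fact f ON f.cust_id   = b.child_id  -- fact joins to subsidiary key
WHERE d.cust_id = 2
""").fetchone()[0]

# Safe use 2: filter the bridge to a single level (immediate children of 1).
direct = conn.execute("""
SELECT SUM(f.amount)
FROM bridge b JOIN sales_fact f ON f.cust_id = b.child_id
WHERE b.parent_id = 1 AND b.depth = 1
""").fetchone()[0]
print(bridge_rows, rollup, direct)
```

Dropping both filters would join every fact row once per ancestor, which is the over-counting risk noted above.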

54 Restricting to One Customer
[Table with columns parent_id, child_id, depth, top_flag, bottom_flag]

55 Restricting to One Depth
[Table with columns parent_id, child_id, depth, top_flag, bottom_flag]

56 Variable Depth Hierarchies
Examples: corporate organization chart. How many records in the bridge table? One record for each separate path from each node in the org. tree to itself & to every node below it: 13 nodes (to themselves) + 12 nodes below root + (4+6) nodes below 1st level + (2+4) nodes below 2nd level + 2 nodes below 3rd level. TOTAL = 43

57 Variable Depth Hierarchies

58 Variable-Depth Hierarchies
Advantages of bridge table approach: Any normal dimensional constraint against the Customer dimension table and the helper table will cause all the fact table records for the directly constrained customers plus all their subsidiaries to be correctly summarized. Standard relational databases and your standard query tools can be used to analyze the hierarchical structure.

59 Multi-Valued Dimensions
Fact Table (account_id) -> Account Dimension (account_id, account-related attributes) -> Bridge Table (account_id, customer_id, weight) -> Customer Dimension (customer_id, customer-related attributes). Weights for each account sum to 1, which allows for proper allocation of facts when using the Customer dimension.

60 Weighted Report vs. Impact Report
Two formulations for customer queries:
Weighted report: multiply all facts by weight before aggregating, SUM(DollarAmt * weight). Subtotals and totals are meaningful.
Impact report: don't use the weight column, SUM(DollarAmt). Some facts are double-counted in totals; each customer is fully credited for his/her activity. Most useful when grouping by customer.

61 Role-Playing Dimension
What to do when a single dimension appears several times in the same fact table? Consider a fact table to record the status and final disposition of a customer order Dimensions of this table could be Order Date, Packaging Date, Shipping Date, Delivery Date, Payment Date, Return Date, Refer to Collection Date, Order Status, Customer, Product, Warehouse, and Promotion.

62 Role-Playing Dimension
Note that the first 7 dimensions are all time dimensions: 7 FKs from the FT to the time dimension! We cannot join these 7 FKs to the same table: SQL would interpret such a seven-way simultaneous join as requiring that all of the dates be the same. Is this what we want?

63 Role-Playing Dimension
We need to make SQL believe that there are 7 independent time dimension tables The column labels in each of these tables should also be different! WHY? We will not be able to tell the columns apart if several of them have been dragged into a report How can we do this?

64 Role-Playing Dimension
We cannot literally use a single time table, but we still want to build & maintain a single time table behind the scenes and create an illusion for the user. Either make 7 identical physical copies of the time table, or make 7 “virtual” copies using a synonym (e.g., Oracle's CREATE SYNONYM statement). Once these clones are in place, we still need to define a SQL view on each copy in order to make the field names uniquely different.

65 Role-Playing Dimension
Now that we have seven differently described Time dimensions, they can be used as if they were independent Can have completely unrelated constraints, and they can play different roles in a report
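The role-playing idea can be sketched with views. SQLite has no SYNONYM, so this illustration defines the views directly on the single time table, which is a simplification of the approach described above. Only two of the seven roles are shown, and all table, view, and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The single time table maintained behind the scenes.
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, full_date TEXT,
                       month_name TEXT);
INSERT INTO time_dim VALUES (1,'2004-08-30','August'),
                            (2,'2004-09-02','September'),
                            (3,'2004-09-05','September');

-- One view per role, each with its own column labels.
CREATE VIEW order_date_dim AS
    SELECT time_key AS order_date_key, full_date AS order_date,
           month_name AS order_month FROM time_dim;
CREATE VIEW shipping_date_dim AS
    SELECT time_key AS shipping_date_key, full_date AS shipping_date,
           month_name AS shipping_month FROM time_dim;

CREATE TABLE order_fact (order_id INTEGER, order_date_key INTEGER,
                         shipping_date_key INTEGER);
INSERT INTO order_fact VALUES (100,1,2),(101,1,3),(102,2,3);
""")

# Each role is constrained independently: ordered in August, shipped in
# September.
rows = conn.execute("""
SELECT f.order_id, o.order_date, s.shipping_date
FROM order_fact f
JOIN order_date_dim o    ON o.order_date_key    = f.order_date_key
JOIN shipping_date_dim s ON s.shipping_date_key = f.shipping_date_key
WHERE o.order_month = 'August' AND s.shipping_month = 'September'
ORDER BY f.order_id
""").fetchall()
print(rows)
```

Because each view relabels its columns, order_date and shipping_date stay distinguishable when both are dragged into the same report.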

66 Role-Playing Dimension
Another example: a frequent flyer flight segment FT needs to include Flight Date, Segment Origin Airport, Segment Destination Airport, Trip Origin Airport, Trip Destination Airport, Flight, Fare Class, and Customer. The 4 Airport dimensions are 4 different roles played by a single underlying Airport table.

