CS 345: Topics in Data Warehousing Thursday, October 7, 2004.

Slides:



Advertisements
Similar presentations
The Organisation As A System An information management framework The Performance Organiser Data Warehousing.
Advertisements

Dimensional Modeling.
CHAPTER OBJECTIVE: NORMALIZATION THE SNOWFLAKE SCHEMA.
BY LECTURER/ AISHA DAWOOD DW Lab # 2. LAB EXERCISE #1 Oracle Data Warehousing Goal: Develop an application to implement defining subject area, design.
BY LECTURER/ AISHA DAWOOD DW Lab # 4 Overview of Extraction, Transformation, and Loading.
Copyright © Starsoft Inc, Data Warehouse Architecture By Slavko Stemberger.
Top Reasons why users call ECCA. Agenda Reason for the call: What is the question or problem? Reason for the call: What is the question or problem? Answer.
Inter-Warehouse Transfers An Enhancement For iSeries 400 DMAS from  Copyright I/O International, 2004, 2005, 2007, 2010 Skip Intro.
Dimensional Modeling Business Intelligence Solutions.
Dimensional Modeling CS 543 – Data Warehousing. CS Data Warehousing (Sp ) - Asim LUMS2 From Requirements to Data Models.
1 9 Ch3, Hachim Haddouti Adv. DBS and Data Warehouse CSC5301 Ch3 Hachim Haddouti Hachim Haddouti.
OLAP applications. Application areas OLAP most commonly used in the financial and marketing areas Data rich industries have been the most typical users.
Data Warehousing (Kimball, Ch.2-4) Dr. Vairam Arunachalam School of Accountancy, MU.
Data Warehousing DSCI 4103 Dr. Mennecke Introduction and Chapter 1.
ETL Design and Development Michael A. Fudge, Jr.
Agenda Common terms used in the software of data warehousing and what they mean. Difference between a database and a data warehouse - the difference in.
CS 345: Topics in Data Warehousing Tuesday, October 5, 2004.
Business Intelligence
Data Warehousing Seminar Chapter 5. Data Warehouse Design Methodology Data Warehousing Lab. HyeYoung Cho.
MSF Requirements Envisioning Phase Planning Phase.
CS 345: Topics in Data Warehousing Tuesday, October 19, 2004.
Contents: Sales Process Handling Issues in Sales – A/R Sales - A/R.
Dimensional model. What do we know so far about … FACTS? “What is the process measuring?” Fact types:  Numeric Additive Semi-additive Non-additive (avg,
Optimizing Marketing Spend Through Multi-Source Conversion Attribution David Jenkins.
Dimensional Modeling Chapter 2. The Dimensional Data Model An alternative to the normalized data model Present information as simply as possible (easier.
Program Pelatihan Tenaga Infromasi dan Informatika Sistem Informasi Kesehatan Ari Cahyono.
Data Warehousing Concepts, by Dr. Khalil 1 Data Warehousing Design Dr. Awad Khalil Computer Science Department AUC.
Needles Powers Principles of Financial Accounting 12e Accounting for Merchandising Operations 6 C H A P T E R ©human/iStockphoto.
DIMENSIONAL MODELLING. Overview Clearly understand how the requirements definition determines data design Introduce dimensional modeling and contrast.
Discovering Computers Fundamentals Fifth Edition Chapter 9 Database Management.
Slideshow 4A Accounts Payable Transaction Processing Part 1.
Chapter 1 Adamson & Venerable Spring Dimensional Modeling Dimensional Model Basics Fact & Dimension Tables Star Schema Granularity Facts and Measures.
CS 101 – Access notes Databases (Microsoft Access) 4 parts of a database database design –Try to understand the ideas behind database design, not just.
INVENTORY CASE STUDY. Introduction Optimized inventory levels in stores can have a major impact on chain profitability: minimize out-of-stocks reduce.
Decision Support and Date Warehouse Jingyi Lu. Outline Decision Support System OLAP vs. OLTP What is Date Warehouse? Dimensional Modeling Extract, Transform,
DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What.
Basic Model: Retail Grocery Store
1 Data Warehousing Lecture-15 Issues of Dimensional Modeling Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
More Dimensional Modeling. Facts Types of Fact Design Transactional Periodic Snapshot –Predictable time period –Ex. Monthly, yearly, etc. Accumulating.
UNIT-II Principles of dimensional modeling
2012.  Activate the Inventory function  Set up Inventory Items in the Item list  Use QuickBooks to calculate the average cost of inventory  Record.
Verification & Validation. Batch processing In a batch processing system, documents such as sales orders are collected into batches of typically 50 documents.
Premium 2011 Processing Transactions in the DIVISION (Project) Module.
CMPE 226 Database Systems October 21 Class Meeting Department of Computer Engineering San Jose State University Fall 2015 Instructor: Ron Mak
1 Agenda – 04/02/2013 Discuss class schedule and deliverables. Discuss project. Design due on 04/18. Discuss data mart design. Use class exercise to design.
Data Warehousing (Kimball, Ch.5-12) Dr. Vairam Arunachalam School of Accountancy, MU.
 Shopping Basket  Stages to maintain shopping basket in framework  Viewing Shopping Basket.
Data warehousing theory and modelling techniques Graduate course on dimensional modelling.
June 08, 2011 How to design a DATA WAREHOUSE Linh Nguyen (Elly)
Chapter 24 Stock Handling and Inventory Control Section 24.1 The Stock Handling Process Section 24.2 Inventory Control Section 24.1 The Stock Handling.
Dr. Abdul Basit Siddiqui. The Process of Dimensional Modeling Four Step Method from ER to DM 1. Choose the Business Process 2. Choose the Grain 3. Choose.
I am Xinyuan Niu I am here because I love to give presentations. Data Warehousing.
Inventory Chapter 3. PAGE REF #CHAPTER 3: Inventory SLIDE # 2 2 Objectives Activate the Inventory function Set up Inventory Items in the Item list Set.
 TATA CONSULTANCY SERVICES MM - INVOICE VERIFICATION.
View Integration and Implementation Compromises
Inventory Transactions庫存交易
5 Accounting for Merchandising Operations
Star Schema.
Applying Data Warehouse Techniques
Overview and Fundamentals
Inventory is used to illustrate:
Retail Sales is used to illustrate a first dimensional model
Retail Sales is used to illustrate a first dimensional model
Retail Sales is used to illustrate a first dimensional model
Dimensional Model January 16, 2003
Applying Data Warehouse Techniques
Examines blended and separate transaction schemas
Review of Major Points Star schema Slowly changing dimensions Keys
Applying Data Warehouse Techniques
Review of Major Points Star schema Slowly changing dimensions Keys
Presentation transcript:

CS 345: Topics in Data Warehousing Thursday, October 7, 2004

Review of Thursday’s Class 4-step dimensional modeling process –Decide which process to model –Choose the grain for the fact table –Select the dimensions –Select the numeric measures for the facts Dimension topics –Date dimension –Surrogate keys –Degenerate dimensions –Snowflakes Fact topics –Additivity –Transactional vs. Snapshot

Outline of Today’s Class Facts –Semi-additive facts –“Factless” fact tables Slowly Changing Dimensions –Overwrite history –Preserve history –Hybrid schemes More dimension topics –Dimension roles –Junk dimension More fact topics –Multiple currencies –Master/Detail facts and fact allocation –Accumulating Snapshot fact tables

Transactional vs. Snapshot Facts Transactional –Each fact row represents a discrete event –Provides the most granular, detailed information Snapshot –Each fact row represents a point-in-time snapshot –Snapshots are taken at predefined time intervals Examples: Hourly, daily, or weekly snapshots –Provides a cumulative view –Used for continuous processes / measures of intensity –Examples: Account balance Inventory level Room temperature

Transactional vs. Snapshot Facts BrianOct. 1CREDIT+40 RajeevOct. 1CREDIT+10 BrianOct. 3DEBIT-10 RajeevOct. 3CREDIT+20 BrianOct. 4DEBIT-5 BrianOct. 4CREDIT+15 RajeevOct. 4CREDIT+50 BrianOct. 5DEBIT-20 RajeevOct. 5DEBIT-10 RajeevOct. 5DEBIT-15 BrianOct. 140 RajeevOct. 110 BrianOct. 240 RajeevOct. 210 BrianOct. 330 RajeevOct. 330 BrianOct. 440 RajeevOct. 480 BrianOct. 540 RajeevOct. 555 TransactionalSnapshot

Transactional vs. Snapshot Facts Two complementary organizations Information content is similar –Snapshot view can be always derived from transactional fact –But not the other way around. Why use snapshot facts? –Sampling is the only option for continuous processes E.g. sensor readings –Data compression Recording all transactional activity may be too much data! Stock price at each trade vs. opening / closing price –Query expressiveness Some queries are much easier to ask/answer with snapshot fact Example: Average daily balance

Semi-Additive Facts Snapshot facts are semi-additive Additive across non-date dimensions Not additive across date dimension Example: –Total account balance on Oct 1 = OK –Total account balance for Brian = NOT OK Time averages –Example: Average daily balance –Can be computed from snapshot fact First compute sum across all time periods Then divide by the number of time periods Can’t just use the SQL AVG() operator

Factless Fact Tables Transactional fact tables don’t have rows for non-events –Example: No rows for products that didn’t sell This has good and bad points. –Good: Take advantage of sparsity Much less data to store if events are “rare” –Bad: No record of non-events Example: What products on promotion didn’t sell? “Factless” fact table –A fact table without numeric fact columns –Used to capture relationships between dimensions –Include a dummy fact column that always has value 1 Examples: –Promotion coverage fact table Which products were on promotion in which stores for which days? Sort of like a periodic snapshot fact –Student/department mapping fact table What is the major field of study for each student? Even for students who didn’t enroll in any courses…

Slowly Changing Dimensions Compared to fact tables, contents of dimension tables are relatively stable. –New sales transactions occur constantly. –New products are introduced rarely. –New stores are opened very rarely. Attribute values for existing dimension rows do occasionally change over time –Customer moves to a new address –Grouping of stores into districts, regions changes due to corporate re-org How to handle gradual changes to dimensions? –Option 1: Overwrite history –Option 2: Preserve history

Overwriting History Simplest option: update the dimension table –“Type 1” slowly changing dimension Example: –Product size incorrectly recorded as “8 oz” instead of “18 oz” due to clerical error –Error is detected and fixed in operational system –Error should also be corrected in data warehouse Update row in dimension table Update pre-computed aggregates Updating dimension table rewrites history –Brian lived in WI in 1993 –Later, Brian moved to CA –Suppose we update the customer dimension table –Query: “Total sales to WI customers in 1993?” –Sales to Brian are incorrectly omitted from the query answer

Preserving History Accurate historical reporting is usually important in a data warehouse How can we capture changes while preserving history? Answer: Create a new dimension row –Old fact table rows point to the old row –New fact table rows point to the new row –“Type 2” slowly changing dimension Cust_keyNameSexStateYOB 457BrianMaleWI1976 …………… 784BrianMaleCA1976 Customer Dimension Old dimension row New dimension row

Slowly Changing Dim. Example Cust_keyNameSexStateYOB 457BrianMaleWI1976 …………… 784BrianMaleCA1976 Customer Dimension Cust_key…Quantity ……… 457…5 ……… 784…4 Sales Fact Existing fact rows use old dimension row New fact rows use new dimension row

Pros and Cons Type 1: Overwrite existing value +Simple to implement Type 2: Add a new dimension row +Accurate historical reporting +Pre-computed aggregates unaffected -Dimension table grows over time Type 2 SCD requires surrogate keys –Store mapping from operational key to most current surrogate key in data staging area To report on Brian’s activity over time, constrain on natural key –WHERE name = ‘Brian’

Choosing Type 1 vs. Type 2 Both choices are commonly used Easy to “mix and match” –Preserve history for some attributes –Overwrite history for other attributes Questions to ask: –Will queries want to use the original attribute value or the new attribute value? In most cases preserving history is desirable –Does the performance impact of additional dimension rows outweigh the benefit of preserving history? Some fields like “customer phone number” are not very useful for reporting Adding extra rows to preserve phone number history may not be worth it

Hybrid SCD Solutions Suppose we want to be able to report using either old or new values –Mostly useful for corporate reorganizations! –Example: Sales districts are re-drawn annually Solution: Create two dimension columns Approach #1: “Previous District” and “Current District” –Allows reporting using either the old or the new scheme –Whenever district assignments change, all “Current District” values are moved to “Previous District” –“Type 3” Slowly Changing Dimension Approach #2: “Historical District” and “Current District” –Allows reports with the original scheme or the current scheme –When district assignment changes, do two things: Create a new dimension row with “Historical District” = new value Overwrite relevant dim. rows to set “Current District” = new value

Dimension Roles Let’s consider an online auction data mart We’ll model auction results –Grain: one fact row per auction. –Bidding history stored in a different fact Dimensions: –Auction Start Date –Auction Close Date –Selling User –Buying User –Product “Date” and “User” occur twice –Date and User dimensions play multiple roles –Don’t create separate “Auction Start Date” and “Auction End Date” dimension tables –Do create a single Date dimension table –Do create two separate foreign keys to Date in the fact table

Junk Dimension Sometimes certain attributes don’t fit nicely into any dimension –Payment method (Cash vs. Credit Card vs. Check) –Bagging type (Paper vs. Plastic vs. None) Create one or more “miscellaneous” dimensions –Group together several leftover attributes as a dimension even if they aren’t logically related –Reduces number of dimension tables, width of fact table –Works best if leftover attributes are Few in number Low in cardinality Correlated –Example: 10 binary flags → no more than 2 10 =1024 dim. rows Some alternatives –Each leftover attribute becomes its own dimension –Eliminate leftover attributes that are not useful

International Issues International organizations often have facts denominated in different currencies –Some transactions are in dollars, others in Euros, still others in yen, etc. Reporting requirements may be diverse –Standard currency vs. local currency –Historical exchange rate vs. current exchange rate Time zones cause a similar problem –Sometimes local time is most meaningful E.g. buying patterns are different in morning vs. afternoon –Sometimes standardized time (e.g. GMT) is better Correctly express relative order of events

Handling Multiple Currencies Add a Currency dimension to the fact table –Values are US Dollars, Yen, Euros, etc. Each currency-denominated fact gets 2 fact columns –One column uses the local currency of the transaction –The other column stores the equivalent value in standard currency –Currency dimension is used to indicate the units being used in the local currency column –Historical exchange rate in effect the day of the transaction is used for the conversion Create a special currency conversion table –Store current conversion factor between each pair of currencies –Used to generate reports in any currency of interest

Multi-Currency Example ProductDateCurrencyAmtLocalAmtUSD KeyNameAbrvCountry 1US DollarUSDUSA 2Japanese YenJPYJapan 3Pound SterlingGBPUK 4EuroEUREurope Currency Dimension Sales Fact FromToFactor ……… Conversion Table

Master-Detail Facts Consider order data from an e-commerce site Each Order consists of a series of Lineitems Each Lineitem represents one product that is purchased Measurements are calculated at different levels –Each Lineitem has Quantity and Price –Each Order has Tax, Discount, and ShippingFee Natural design: two fact tables, different grains –Orders fact table with 1 row per order –Lineitem fact table with 1 row per line item

Orders and Lineitems Dimensions –Date –Customer –OrderID (degenerate) Fact Columns –Tax –Discount –ShippingFee –TotalPrice Dimensions –Date –Customer –Product –OrderID (degenerate) Fact Columns –Quantity –Price Orders FactLineitem Fact

A Problem with the Design Difficult to report on revenue by product Orders fact lacks Product dimension –Adding Product would violate the grain Lineitem fact lacks important revenue data –Effects of discount, tax, shipping are important –But they are not captured at the lineitem level! Solution: allocation of master-level facts to detail-level –Add Tax, Discount, and ShippingFee columns to the Lineitem fact table –Distribute Tax, Discount, and ShippingFee for the order among its component line items –Sum of allocated Tax for all line items in an order = actual overall Tax for that order –Different allocation policies are possible

Fact Allocation Policies Consider an order consisting of –A pillow –A bowling ball –A diamond ring How should the shipping cost be allocated? –By weight –By volume –By value Different policies yield to quite different results –Organizational politics can come into play –If the org. has a standard allocation policy, use it. –Otherwise, try to agree on one –Otherwise, provide all alternatives! Activity-based costing is a related concept –Methodology for allocating costs of administrative overheads Data warehousing projects can have useful side-effects

Accumulating Snapshot Facts Accumulating Snapshot is a third type of fact table –Transactional and Snapshot were already discussed –Not as common as the other two Useful for pipelined processes –Process proceeds through a series of stages –1 fact row tracks an entire process through its lifetime –Best for short-lived processes with linear workflow –Example: Order fulfillment for custom manufacturing Order placed → Release to Mfg → Finished Goods Inventory → Shipped → Delivered → Invoiced → Returned Characteristics of accumulating snapshot facts –Fact row is updated multiple times during process lifetime Different from append-only Transactional and Snapshot facts –Separate date dimension roles for each milestone –Numeric fact columns corresponding to various stages

Querying Accumulating Snapshots Reporting based on lag –How long does a process spend in a given pipeline stage? –Calculated by time lapse between dates –Average lag as a measurement Report on current state of the process –How many orders are currently in each stage? Reporting on historical state –Combine the Periodic Snapshot and Accumulating Snapshot fact table types –Take a periodic snapshot of the “active” rows of the Accumulating Snapshot fact –How many unshipped orders were waiting in inventory now vs. three months ago vs. six months ago?