Published by Caroline Ray. Modified over 9 years ago.
1
© 2007 Robert T. Monroe Carnegie Mellon University ©2006 - 2008 Robert T. Monroe 45-875 BI Tools and Techniques Data Warehousing II: Extract, Transform, and Load (ETL) BI Tools and Techniques Robert Monroe March 27, 2008
2
Goals
Provide a quick review of fundamental relational database design principles
Understand key stages and challenges of ETL processing
– Data reconciliation and cleansing
– Data derivation
Understand how to create dimensional models (star schemas) and why they are useful in data warehousing
3
Quick Review: Relational Database Principles
4
The Relational Data Model
The Relational Model has become the de facto standard for managing operational business data
Core concepts in a relational model:
– Tables (relations)
– Records (rows)
– Data fields (columns)
– Primary keys
– Foreign keys

Products
Product ID  Description   Color  Size   Qty Available
52          Shoes (pair)  Blue   10     25
64          Socks (pair)  White  Large  200
145         Blouse        Green  7      14
158         Pants         Blue   32/34  0
5
Data, Information, Database Example

Data in Database Tables

Purchases
Order ID  Customer Name  Product ID  Quantity  Date
5623      Jimmy Hwang    52          3         12/15/2004
5624      Sue Smith      64          5         12/16/2004
5625      Jane Chen      145         1         12/16/2004

Products
Product ID  Description   Color  Size   Qty Available
52          Shoes (pair)  Blue   10     25
64          Socks (pair)  White  Large  200
145         Blouse        Green  7      14
158         Pants         Blue   32/34  0

Information
Jimmy Hwang purchased 3 pairs of size 10 shoes on 12/15/2004
What other information can we derive from these data tables?
6
Relational Data, Tables, Records, and Metadata Example

Data (Records) in Database Tables

Purchases
Order ID  Customer Name  Product ID  Quantity  Date
5623      Jimmy Hwang    52          3         12/15/2004
5624      Sue Smith      64          5         12/16/2004
5625      Jane Chen      145         1         12/16/2004

Products
Product ID  Description   Color  Size   Qty Available
52          Shoes (pair)  Blue   10     25
64          Socks (pair)  White  Large  200
145         Blouse        Green  7      14
158         Pants         Blue   32/34  0

Metadata

Table Name: Products
ProductID     Int (pkey)
Description   Text(50)
Color         Text(50)
Size          Text(20)
QtyAvailable  Int

Table Name: Purchases
OrderID       Int (pkey)
CustomerName  Text(75)
ProductID     Int (fkey)
Quantity      Decimal
Date          DateTime
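The metadata and records on this slide can be sketched as a small in-memory database. This is an illustrative sketch only, using Python's standard sqlite3 module. The SQLite column types are approximations of the slide's Int/Text/Decimal/DateTime types, the Date column is renamed OrderDate here as an assumption to keep the name unambiguous, and dates are stored as ISO text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (
    ProductID    INTEGER PRIMARY KEY,   -- pkey
    Description  TEXT,
    Color        TEXT,
    Size         TEXT,
    QtyAvailable INTEGER
);
CREATE TABLE Purchases (
    OrderID      INTEGER PRIMARY KEY,   -- pkey
    CustomerName TEXT,
    ProductID    INTEGER REFERENCES Products(ProductID),  -- fkey
    Quantity     NUMERIC,
    OrderDate    TEXT                   -- assumption: renamed from Date
);
""")

conn.executemany("INSERT INTO Products VALUES (?, ?, ?, ?, ?)", [
    (52, "Shoes (pair)", "Blue", "10", 25),
    (64, "Socks (pair)", "White", "Large", 200),
    (145, "Blouse", "Green", "7", 14),
    (158, "Pants", "Blue", "32/34", 0),
])
conn.executemany("INSERT INTO Purchases VALUES (?, ?, ?, ?, ?)", [
    (5623, "Jimmy Hwang", 52, 3, "2004-12-15"),
    (5624, "Sue Smith", 64, 5, "2004-12-16"),
    (5625, "Jane Chen", 145, 1, "2004-12-16"),
])

# Following the foreign key turns two tables of data into information:
# who bought how many of which product, in what size, and when.
rows = conn.execute("""
    SELECT pu.CustomerName, pu.Quantity, pr.Description, pr.Size, pu.OrderDate
    FROM Purchases AS pu
    JOIN Products AS pr ON pr.ProductID = pu.ProductID
    ORDER BY pu.OrderID
""").fetchall()

for row in rows:
    print(row)
```

The first row of the join corresponds to the slide's statement that Jimmy Hwang purchased 3 pairs of size 10 shoes on 12/15/2004.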
7
Normalization And Denormalization
Data normalization is the process of decomposing relations with anomalies to produce smaller, well-structured relations
– Basic idea: each table only holds data about one ‘thing’
Goals of normalization include:
– Minimize data redundancy
– Simplify the enforcement of referential integrity constraints
– Simplify data maintenance (inserts, updates, deletes)
– Improve representation model to match “the real world”
Normalization sometimes hurts query performance
8
Example: Denormalized Table

Employee
Emp_ID  Name              Dept_Name     Salary  Course_Title  Date_Completed
100     Margaret Simpson  Marketing     48000   SPSS          6/19/2005
100     Margaret Simpson  Marketing     48000   Surveys       10/7/2004
140     Alan Beeton       Accounting    52000   Tax Acc       12/8/2004
110     Chris Lucero      Info Systems  43000   SPSS          1/12/2004
110     Chris Lucero      Info Systems  43000   C++           4/22/2003
190     Lorenzo Davis     Finance       55000
150     Susan Martin      Marketing     42000   Java          8/12/2002
150     Susan Martin      Marketing     42000   SPSS          6/19/2005

Insertion anomaly: when an employee takes a new class we need to add duplicate data (Name, Dept_Name, and Salary)
Deletion anomaly: if we remove employee 140, we lose information about the existence of a Tax Acc class
Modification anomaly: a salary increase for employee 100 forces an update of multiple records
These anomalies exist because two themes (entity types), course and employee, are combined into one relation, resulting in duplication and an unnecessary dependency between the entities
Example Derived from Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
9
Normalizing Previous Employee/Class Table

Employee
Emp_ID  Name              Dept_Name     Salary
100     Margaret Simpson  Marketing     48000
140     Alan Beeton       Accounting    52000
110     Chris Lucero      Info Systems  43000
190     Lorenzo Davis     Finance       55000
150     Susan Martin      Marketing     42000

Course
Course_ID  Course_Title
1          SPSS
2          Surveys
3          Tax Acc
4          C++
5          Java

Course_Completion
Emp_ID  Course_ID  Date_Completed
100     1          6/19/2005
100     2          10/7/2004
140     3          12/8/2004
110     1          1/12/2004
110     4          4/22/2003
150     1          6/19/2005
150     5          8/12/2002

This seems more complicated
Why might this approach be superior to the previous one?
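One way to see the decomposition is to derive the three normalized tables from the denormalized rows of the previous slide. The sketch below is illustrative, not the course's method: the function name normalize is hypothetical, and Course_ID values are assumed to be assigned in order of first appearance.

```python
# Rows of the denormalized Employee table:
# (Emp_ID, Name, Dept_Name, Salary, Course_Title, Date_Completed)
denormalized = [
    (100, "Margaret Simpson", "Marketing", 48000, "SPSS", "6/19/2005"),
    (100, "Margaret Simpson", "Marketing", 48000, "Surveys", "10/7/2004"),
    (140, "Alan Beeton", "Accounting", 52000, "Tax Acc", "12/8/2004"),
    (110, "Chris Lucero", "Info Systems", 43000, "SPSS", "1/12/2004"),
    (110, "Chris Lucero", "Info Systems", 43000, "C++", "4/22/2003"),
    (190, "Lorenzo Davis", "Finance", 55000, None, None),
    (150, "Susan Martin", "Marketing", 42000, "Java", "8/12/2002"),
    (150, "Susan Martin", "Marketing", 42000, "SPSS", "6/19/2005"),
]


def normalize(rows):
    """Split the rows into Employee, Course, and Course_Completion."""
    employees = {}    # Emp_ID -> (Name, Dept_Name, Salary)
    courses = {}      # Course_Title -> Course_ID
    completions = []  # (Emp_ID, Course_ID, Date_Completed)
    for emp_id, name, dept, salary, title, completed in rows:
        # Each employee fact is stored once, however many courses were taken.
        employees[emp_id] = (name, dept, salary)
        if title is None:
            continue  # employee with no completed course
        # Assign the next Course_ID the first time a title is seen.
        course_id = courses.setdefault(title, len(courses) + 1)
        completions.append((emp_id, course_id, completed))
    return employees, courses, completions


employees, courses, completions = normalize(denormalized)
print(len(employees), "employees,", len(courses), "courses,",
      len(completions), "completions")
```

After the split, a salary change touches one Employee row, and deleting employee 140 no longer removes the Tax Acc course from the Course table.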
10
Indexing
An index is a table or other data structure used to determine the location of rows in a file that satisfy some condition
Indices reduce the time needed to retrieve records
… but increase the time and cost to insert, update, or delete
Indexing is critical for high performance in large, complex databases
– Especially data warehouses and data marts

Products
Product ID  Description   Color  Size
52          Shoes (pair)  Blue   10
145         Socks (pair)  White  Large
62          Blouse        Green  7
12          Pants         Blue   32/34
532         Skirt         Green  7
…           …             …      …

Product_Index
Product ID  Row
12          4
52          1
62          3
145         2
532         5
…           …
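The Product_Index idea can be sketched with a dictionary that maps each Product ID to its row position. This is a simplified illustration: real database indexes typically use structures such as B-trees, and the helper find_product is a hypothetical name.

```python
# Rows in storage order, as on the slide (row numbers start at 1).
products = [
    (52, "Shoes (pair)", "Blue", "10"),
    (145, "Socks (pair)", "White", "Large"),
    (62, "Blouse", "Green", "7"),
    (12, "Pants", "Blue", "32/34"),
    (532, "Skirt", "Green", "7"),
]

# Build the index once: Product ID -> 1-based row number.
product_index = {row[0]: pos for pos, row in enumerate(products, start=1)}


def find_product(product_id):
    """Look up a row through the index instead of scanning every row."""
    pos = product_index.get(product_id)
    return None if pos is None else products[pos - 1]


print(sorted(product_index.items()))
print(find_product(62))
```

The cost side of the trade-off shows up on writes: every insert, update of a Product ID, or delete must also maintain product_index.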
11
Alternative Data Models
The relational data model is the current de facto standard for storing and managing corporate data
There are other data storage models, usually associated with legacy systems
– The data you need for your analysis may be stored in them!
Four common alternative data models
– Flat file
– Hierarchical
– Network
– Object
12
Extract, Transform, and Load (ETL)
13
Quick Review: Operational Systems Feed Analytic Systems
Informational systems get their data from operational databases
This process generally requires significant processing (transformation) of the data stored in operational databases
This process is commonly known as ETL
– Extract, Transform, and Load (ETL)
14
The ETL Process
The process of creating analytic data stores from operational data stores is commonly described as the Extract, Transform, and Load process, or ETL
There are four basic steps to ETL
– Capture/Extract source data
– Cleanse (scrub)
– Transform
– Load and Index
15
The Three-Layer Data Architecture
Data goes through three common stages during ETL
Operational Data
– transactional data stored in individual systems of record throughout the organization
Reconciled Data
– detailed, current data intended to be the single, authoritative source for all decision support applications
Derived Data
– data that have been selected, formatted, and aggregated for end-user decision support applications
[Diagram: Operational Data, Reconciled Data, Derived Data]
16
Reconciling and Deriving Data
[Diagram: Reconcile Data, Derive Data]
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
17
In-Class Exercise: ETL
Form teams of 2-3 people
Complete exercise 1 on handout
18
Data Profiling
First step: understand your source data
– What is available? What is missing?
– What is ‘good’ quality data? What is of questionable quality?
– Data volumes, frequency, sparseness
– Embedded business rules
– Obvious (and subtle) data conflicts
    Ranges and formats
    Cardinality and uniqueness
    Key collisions
This is a long and often painful process that can require a lot of meticulous effort: budget and plan accordingly!
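A first profiling pass often amounts to a few counts per column. The sketch below is a minimal illustration under assumed inputs: the function profile_column and the sample values are hypothetical, and None or an empty string is assumed to mean "missing".

```python
def profile_column(values):
    """Summarize one column: volume, sparseness, cardinality, and range."""
    present = [v for v in values if v not in (None, "")]
    distinct = set(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(distinct),
        # A candidate key must have no repeated values among present rows.
        "is_unique": len(distinct) == len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


# Hypothetical sample: a supplier ID column with a repeat and a gap.
supplier_ids = ["5623", "14534", "14534", None]
print(profile_column(supplier_ids))
```

A repeated value in a column that is supposed to be a key, as in the sample, is the kind of key collision the slide warns about.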
19
Reconciling and Deriving Data
[Diagram: Reconcile Data, Derive Data]
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
20
Data Characteristics: Status vs. Event Data
[Diagram: Status]
Event: a database action (create/update/delete) that results from a transaction
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
21
Data Characteristics: Transient vs. Periodic Data
Transient data:
– Changes to existing records are written over previous records, thus destroying the previous data content
Periodic data:
– Never physically altered or deleted once they have been added to the store
22
Data Reconciliation
Typical operational data is:
– Transient – not historical
– Not always normalized (perhaps due to denormalization for performance)
– Restricted in scope – not comprehensive
– Sometimes poor quality – inconsistencies and errors
After reconciliation, data should be:
– Detailed – not summarized yet
– Historical – periodic
– Normalized – 3rd normal form or higher
– Comprehensive – enterprise-wide perspective
– Timely – data should be current enough to assist decision-making
– Quality controlled – accurate with full integrity
[Diagram: Operational Data, Reconciled Data, Derived Data]
23
Data Reconciliation: Capture/Extract
Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Static extract: capturing a snapshot of the source data at a point in time
Incremental extract: capturing changes that have occurred since the last static extract
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
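The difference between the two extract styles can be sketched as two small functions. This is an illustration under a stated assumption: every source row carries an updated_at timestamp (a hypothetical field), which is only one of several ways to detect changes.

```python
from datetime import datetime

# Hypothetical source rows with a last-modified timestamp.
source_rows = [
    {"order_id": 5623, "quantity": 3, "updated_at": datetime(2004, 12, 15, 9, 0)},
    {"order_id": 5624, "quantity": 5, "updated_at": datetime(2004, 12, 16, 10, 30)},
    {"order_id": 5625, "quantity": 1, "updated_at": datetime(2004, 12, 16, 14, 0)},
]


def static_extract(rows):
    """Static extract: copy every row as it stands right now."""
    return [dict(row) for row in rows]


def incremental_extract(rows, last_extract_time):
    """Incremental extract: copy only rows changed since the last extract."""
    return [dict(row) for row in rows if row["updated_at"] > last_extract_time]


snapshot = static_extract(source_rows)
changes = incremental_extract(source_rows, datetime(2004, 12, 16, 0, 0))
print(len(snapshot), "rows in snapshot;", len(changes), "rows changed")
```

A timestamp comparison like this cannot see rows that were deleted from the source, so a real incremental extract usually needs an additional mechanism for deletions.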
24
Extract Challenges / Issues
What data should be extracted, and from where?
How should it be extracted?
How frequently should it be extracted?
25
Data Reconciliation: Scrub/Cleanse
Scrub/Cleanse: use pattern recognition and AI techniques to upgrade data quality
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Rule of thumb: Automate where possible!
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
26
Common Data Cleansing Tasks

Suppliers
Supplier_ID  Supplier Name                    Contact Name
5623         International Business Machines  Joe Smith
14534        IBM                              Jim Hwang
qwq77dfs     Intl. Business Machines          Susan Chen
Quick exercise: How many suppliers are listed in this table?

Supplier_Orders_US
Order_ID  Item  Quantity_Tons
44253     Salt  100
14534     Salt  250

Supplier_Orders_Europe
Order_ID  Item       Quantity
44253     RoadSalt   25 Truckloads
14534     TableSalt  500 Cases
Quick exercise: How many pounds of salt were purchased?
???
27
Common Data Cleansing Tasks
Reconciling mismatched data fields across source databases
– E.g. CompanyName field in db1 = Comp_Name field in db2
Finding or fixing missing data or data fields
– Database 1 records “region” as part of address, database 2 does not
Mismatched data types
– Zip stored as a string in one source database and as an integer in another
Converting between different units of measure
– Kilograms in European division's database, pounds in US database
Resolving primary key collisions
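Several of these tasks can be combined in one cleansing function. The sketch below is illustrative only: the source names db1 and db2 follow the slide, but the field names other than CompanyName and Comp_Name, the target field names, the sample records, and the five-digit zero-padded zip format are all assumptions.

```python
KG_PER_LB = 0.45359237  # definition of the international pound

# Hypothetical mapping from each source's field names to common target names.
FIELD_MAPS = {
    "db1": {"CompanyName": "company_name", "Zip": "zip",
            "Region": "region", "WeightLb": "weight_lb"},
    "db2": {"Comp_Name": "company_name", "Zip": "zip",
            "WeightKg": "weight_kg"},
}


def cleanse(record, source):
    """Rename fields, unify types and units, and mark missing fields."""
    mapping = FIELD_MAPS[source]
    out = {mapping[name]: value for name, value in record.items() if name in mapping}
    # Mismatched types: store zip as a zero-padded string in every case.
    out["zip"] = str(out["zip"]).zfill(5)
    # Mismatched units: convert kilograms to pounds.
    if "weight_kg" in out:
        out["weight_lb"] = round(out.pop("weight_kg") / KG_PER_LB, 2)
    # Missing field: make the gap explicit rather than silently absent.
    out.setdefault("region", None)
    return out


us_record = cleanse({"CompanyName": "IBM", "Zip": "02139",
                     "Region": "Northeast", "WeightLb": 220.0}, "db1")
other_record = cleanse({"Comp_Name": "IBM", "Zip": 2139, "WeightKg": 100.0}, "db2")
print(us_record)
print(other_record)
```

Storing the zip as an integer in the second source dropped its leading zero, which is why the unified type here is a string.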
28
Data Quality
Goal of cleansing stage is to improve data quality
Common dimensions for measuring data quality:
– Accuracy
– Completeness
– Consistency
– Currency/Timeliness
[Los03]
Why is it so hard to achieve (and maintain) a high level of data quality in a data warehouse?
29
Data Reconciliation: Transform
Transform: convert data from format of operational system to format of data warehouse
Record-level transformation:
  Selection – data partitioning
  Joining – data combining
  Aggregation – data summarization
Field-level transformation:
  single-field – from one field to one field
  multi-field – from many fields to one, or one field to many
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
30
Transform Examples: Single Field Transform
General transformation:
– Maps and transforms individual fields in the source record directly to individual fields in the target record
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
31
Transform Examples: Single Field Transform
Algorithmic transformation:
– Uses a formula or logical expression to map and transform individual fields in the source record directly to individual fields in the target record
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
32
Transform Examples: Single Field Transform
Table look-up transformation:
– Uses a separate table, keyed by source-record codes, to map and transform individual fields in the source record directly to individual fields in the target record
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
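The three single-field styles (general, algorithmic, and table look-up) can be placed side by side in one function. This is a sketch with made-up fields: cust_name, temp_c, state_code, and the look-up table contents are assumptions chosen for illustration, not the examples in the cited diagrams.

```python
# Hypothetical look-up table keyed by the source code value.
STATE_NAMES = {"PA": "Pennsylvania", "NY": "New York"}


def transform_single_fields(source):
    """Apply one single-field transformation of each kind."""
    return {
        # General: the value is carried over under the target field name.
        "customer_name": source["cust_name"],
        # Algorithmic: a formula converts Celsius to Fahrenheit.
        "temp_f": source["temp_c"] * 9 / 5 + 32,
        # Table look-up: a code is replaced via a separate table;
        # unknown codes become None so they can be logged and fixed.
        "state_name": STATE_NAMES.get(source["state_code"]),
    }


target = transform_single_fields(
    {"cust_name": "Sue Smith", "temp_c": 100, "state_code": "PA"}
)
print(target)
```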
33
Transform Examples: Multi-Field Transform
M:1 transformation: maps many source fields to one target field
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
34
Transform Examples: Multi-Field Transform
1:M transformation: maps and transforms one source field to many target fields
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
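Both multi-field directions can be sketched in a few lines. The field names below (first_name, last_name, order_date, and the target names) are hypothetical, and the date format is assumed to be the month/day/year style used in the earlier Purchases table.

```python
from datetime import datetime


def many_to_one(source):
    """M:1: two source fields are combined into one target field."""
    return {"customer_name": f'{source["first_name"]} {source["last_name"]}'}


def one_to_many(source):
    """1:M: one source date string is split into three target fields."""
    parsed = datetime.strptime(source["order_date"], "%m/%d/%Y")
    return {
        "order_year": parsed.year,
        "order_month": parsed.month,
        "order_day": parsed.day,
    }


print(many_to_one({"first_name": "Jimmy", "last_name": "Hwang"}))
print(one_to_many({"order_date": "12/15/2004"}))
```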
35
Surrogate Keys
Reconciled data tables should use surrogate keys
– Surrogate keys are not business related
– Surrogate keys are independent of operational store’s primary keys
Surrogate keys are important because:
– Avoid primary key collisions
– Primary keys may change over time in source system
– Ability to properly track changes over time
– Consistency of key length/format/type
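A surrogate key generator can be sketched as a mapping from (source system, source key) to a warehouse-assigned integer. The class name SurrogateKeyMap and the source system labels are hypothetical; the colliding Order_ID 44253 is taken from the US and Europe order tables shown earlier.

```python
from itertools import count


class SurrogateKeyMap:
    """Assign a stable, meaningless integer to each (source, key) pair."""

    def __init__(self):
        self._next_key = count(1)
        self._assigned = {}

    def key_for(self, source_system, source_key):
        pair = (source_system, source_key)
        if pair not in self._assigned:
            self._assigned[pair] = next(self._next_key)
        return self._assigned[pair]


keys = SurrogateKeyMap()
us_key = keys.key_for("US", 44253)
europe_key = keys.key_for("Europe", 44253)  # same source key, different order
print(us_key, europe_key)
print(keys.key_for("US", 44253))            # repeat lookup returns the same key
```

Because the surrogate depends on the pair rather than on the source key alone, the two orders numbered 44253 no longer collide, and every warehouse key has the same type and format.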
36
Data Reconciliation: Load and Index
Load/Index: place transformed data into the warehouse and create indexes
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to data warehouse
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
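The two load modes can be contrasted with a dictionary standing in for the warehouse table. This is a simplified sketch with hypothetical function names: update_load overwrites rows by key for brevity, whereas a store holding periodic data would instead append a new timestamped version of each changed row.

```python
def refresh_load(target, source_rows):
    """Refresh mode: discard the target contents and rewrite them in bulk."""
    target.clear()
    for row in source_rows:
        target[row["order_id"]] = dict(row)


def update_load(target, changed_rows):
    """Update mode: write only the rows that changed in the source."""
    for row in changed_rows:
        target[row["order_id"]] = dict(row)


warehouse = {}
refresh_load(warehouse, [
    {"order_id": 5623, "quantity": 3},
    {"order_id": 5624, "quantity": 5},
])
update_load(warehouse, [
    {"order_id": 5624, "quantity": 6},  # changed row
    {"order_id": 5625, "quantity": 1},  # new row
])
print(sorted(warehouse.items()))
```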
37
Data Reconciliation Recap
After load/index, data reconciliation should be complete
After reconciliation, data should be:
– Detailed – not summarized yet
– Historical – periodic
– Comprehensive – enterprise-wide perspective
– Timely – data is current enough to assist decision-making
– Quality controlled – accurate with full integrity
[Diagram: Operational Data, Reconciled Data, Derived Data]
38
ETL Issue: Frequency Of Data Updates
How should an organization decide the frequency of updates from operational databases to data warehouses/marts?
What are the benefits and costs of frequent loads?
What are the benefits and costs of infrequent loads?
39
Derived Data
40
Quick Review: Typical Data Warehouse Structure
[Diagram: Reconcile Data, Derive Data]
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
41
Derived Data
Although reconciled data provides a consistent, high-quality collection of enterprise data, it is not necessarily in an efficient form for use by BI tools
Derived data objectives:
– Ease of use for decision support applications
– Fast response to predefined user queries
– Customized data for particular target audiences
– Ad-hoc query support
– Data mining capabilities
Characteristics
– Detailed (mostly periodic) data
– Aggregated (for summary)
– Processed
– Distributed (to data marts)
[Diagram: Operational Data, Reconciled Data, Derived Data]
42
Dimensional Modeling: Facts and Dimensions
Dimensional Modeling
– a simple database design in which dimensional data are separated from fact or event data
Dimensional models are also sometimes called star schemas
Dimensional models are a common way to represent derived data for informational data stores
– Well suited to ad-hoc queries and OLAP
– Poorly suited for transaction processing
– Commonly used for data warehouse/mart storage model
43
Star Schema Structure
Fact tables contain factual or quantitative data
Dimension tables contain descriptions about the subjects of the business
1:N relationship between dimension tables and fact tables
Dimension tables are denormalized to maximize performance
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
44
Star Schema Example
Fact table provides statistics for sales broken down by product, period, and store dimensions
Dimension tables provide details on stores, products, and time periods
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
45
Star Schema Example With Data
[Diagram: Product, Period, and Store dimension tables around the Sales fact table]
Diagram Source: Hoffer, Prescott, McFadden, Modern Database Management, 7th ed.
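A star schema with the same four table names (Product, Period, Store, Sales) can be sketched in SQLite as follows. Only the table names come from the slide: every column name and every data value below is an assumption made for illustration, not the content of the cited diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, one row per member.
CREATE TABLE Product (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE Period  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE Store   (store_key   INTEGER PRIMARY KEY, city TEXT);

-- Fact table: one row per product, period, and store, holding the measures.
CREATE TABLE Sales (
    product_key  INTEGER REFERENCES Product(product_key),
    period_key   INTEGER REFERENCES Period(period_key),
    store_key    INTEGER REFERENCES Store(store_key),
    units_sold   INTEGER,
    dollars_sold REAL,
    PRIMARY KEY (product_key, period_key, store_key)
);

INSERT INTO Product VALUES (1, 'Shoes (pair)'), (2, 'Socks (pair)');
INSERT INTO Period  VALUES (1, 2004, 4), (2, 2005, 1);
INSERT INTO Store   VALUES (1, 'Pittsburgh'), (2, 'Boston');
INSERT INTO Sales   VALUES
    (1, 1, 1, 30, 1500.0),
    (1, 2, 1, 20, 1000.0),
    (2, 1, 2, 100, 500.0),
    (2, 2, 2, 50, 250.0);
""")

# A typical ad-hoc query: join the fact table to its dimensions,
# filter on one dimension, and group by attributes of the others.
sales_2004 = conn.execute("""
    SELECT s.city, p.description, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM Sales AS f
    JOIN Store   AS s ON s.store_key   = f.store_key
    JOIN Product AS p ON p.product_key = f.product_key
    JOIN Period  AS t ON t.period_key  = f.period_key
    WHERE t.year = 2004
    GROUP BY s.city, p.description
    ORDER BY s.city
""").fetchall()

for row in sales_2004:
    print(row)
```

Every query against this shape follows the same pattern of fact-to-dimension joins, which is one reason the model suits ad-hoc analysis.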
46
Dimensional Model Benefits
Simple and predictable framework
– Well suited to ad-hoc analytical queries
– Relatively straightforward mapping from most transactional systems
Dimensional independence
– Query performance is somewhat independent of dimensions used in the query
Straightforward model extensions support evolution
47
ETL Issue: Fact Table Granularity
One of the biggest challenges in designing an effective star schema is deciding on the granularity of the fact data
Transactional grain – finest level
Aggregated grain – more summarized
– Finer grains provide
    More detailed analysis capability
    More dimension tables, more rows in fact table (much larger storage)
    Better “drill-down” capabilities
Rule of thumb: use the smallest granularity of fact data that is possible given your technical, storage, and computational constraints
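The storage and detail trade-off can be made concrete by rolling transactional-grain rows up to a daily grain. This is a small sketch: the first three rows reuse the earlier Purchases data, the fourth row is a hypothetical addition, and aggregate_daily is an illustrative name.

```python
from collections import defaultdict

# Transactional grain: one fact row per order line (date, product_id, quantity).
transactions = [
    ("2004-12-15", 52, 3),
    ("2004-12-16", 64, 5),
    ("2004-12-16", 145, 1),
    ("2004-12-16", 64, 2),  # hypothetical second order for product 64
]


def aggregate_daily(rows):
    """Aggregated grain: one fact row per (date, product)."""
    totals = defaultdict(int)
    for date, product_id, quantity in rows:
        totals[(date, product_id)] += quantity
    return dict(totals)


daily = aggregate_daily(transactions)
print(len(transactions), "transactional rows ->", len(daily), "daily rows")
print(daily)
```

The daily table is smaller, but the two separate orders for product 64 can no longer be told apart, so drilling down to individual orders is only possible at the finer grain.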
48
In-Class Exercise: Dimensional Modeling
Form teams of 2-3 people
Complete exercise 2, question #1 on handout
– Build a star schema to store grades at Millenium College