Copyright  Oracle Corporation, 1999. All rights reserved. 1111 Transforming Data.

Presentation transcript:

Transforming Data

11-2 Overview: Project Management (Methodology, Maintaining Metadata); Defining DW Concepts & Terminology; Planning for a Successful Warehouse; Analyzing User Query Needs; Choosing a Computing Architecture; Modeling the Data Warehouse; Planning Warehouse Storage; ETT (Building the Warehouse); Meeting a Business Need; Supporting End User Access; Managing the Data Warehouse.

11-3 Objectives. After completing this lesson, you should be able to do the following: Explain the importance of quality data; Define the term “transformation”; Identify transformation issues; Describe techniques for transforming data; List tools that can be used to transform data.

11-4 Importance of Data Quality. [Figure: Summit Sports, Hollywood, Speedy Pizza]

11-5 Benefits of Quality Data. Clean data is essential for: targeting customers; determining buying patterns; identifying householders (private and commercial); matching customers; identifying historical data. Dirty data must be removed.

11-6 Standards. Define a quality strategy. Decide on the optimal data-quality level.

11-7 Quality Improvements. Consider modifying rules for operational data. Document the sources. Create a data stewardship program. Design the cleanup process carefully. Initial cleanup and refresh routines may differ.

11-8 Guidelines. Operational data should not be used directly in the warehouse. Operational data must be cleaned for each increment. Operational data is not simply fixed by modifying applications.

11-9 Solutions. Conventional COBOL, 4GL; Specialized tools; Customized conversion process; Business experts. [Figure: Investigation, Conditioning, Standardization, Integration]

11-10 Management. Poor data quality: Own; Take responsibility; Resolve problems; Data quality manager.

11-11 Transformation. Transformation eliminates operational data anomalies: Cleans; Standardizes; Presents subject-oriented data. [Figure: Operational system, Extract, Data staging area (Transform: Clean up, Consolidate, Restructure), Transport (Load), Warehouse]

11-12 Source Data Anomalies. No unique key; Data naming and coding anomalies; Data meaning anomalies between groups; Spelling and text inconsistencies.
Example rows (columns CUSNUM, NAME, ADDRESS; CUSNUM values not preserved in the transcript):
NAME                 ADDRESS
Oracle Corp          100 NE 1st Street, Tampa
Oracle               100 NE. First St., Tampa
Oracle Services      100 North East 1st St., FLA
Oracle Limited       100 N.E. 1st St
Oracle Computing     15 Main Road, Ft. Lauderdale
Oracle Corp. UK      15 Main Road, Ft. Lauderdale, FLA
Oracle Corp UK Ltd   181 North Street, Key West, FLA

11-13 Transformation Routines: Cleaning data; Eliminating inconsistencies; Adding elements; Merging data; Integrating data; Transforming data before load.

11-14 Transforming Data: Problems and Solutions. Multipart keys: Country code, Sales territory, Product number, Salesperson code. Product code = 12 M

11-15 Transforming Data. Multiple encoding; must pick up erroneous data. Source encodings: m, f; 1, 0; male, female. Erroneous values: mle, female; 1, NULL. Target encoding: m, f. Rule: If field not in ('m', 1, 'male') then … else if field is NULL then …
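
The encoding rule on this slide can be sketched as a lookup table plus an exception path. This is a minimal illustration, not the course's own routine: the function name, the status labels, and the reading of 1 as male and 0 as female (taken from the order of the pairs on the slide) are assumptions.

```python
# Hypothetical mapping from the source encodings on the slide to one standard.
GENDER_MAP = {
    "m": "m", "1": "m", "male": "m",
    "f": "f", "0": "f", "female": "f",
}

def standardize_gender(value):
    """Return (standard_code, status) for one source value."""
    if value is None:
        return None, "missing"          # NULL in the source
    key = str(value).strip().lower()
    if key in GENDER_MAP:
        return GENDER_MAP[key], "ok"
    return None, "error"                # e.g. the misspelling 'mle'

source_values = ["m", 1, "male", "mle", None, "female", 0]
cleaned = [standardize_gender(v) for v in source_values]
```

Rows with status "error" or "missing" would be routed to an exception list rather than loaded silently.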

11-16 Transforming Data. Multiple local standards; tools or filters to preprocess. Examples: lengths (cm, inches, to cm); dates (DD/MM/YY, MM/DD/YY, to DD-Mon-YY); currencies (1,000 GBP; FF 9,990; USD 600).
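
A preprocessing filter for local standards might look like the following sketch. The function names and the format labels are hypothetical; currency conversion is left out because it would need exchange rates that the slide does not give.

```python
from datetime import datetime

CM_PER_INCH = 2.54
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
# Assumed labels for the two source date layouts shown on the slide.
SOURCE_FORMATS = {"DD/MM/YY": "%d/%m/%y", "MM/DD/YY": "%m/%d/%y"}

def to_cm(value, unit):
    """Convert a length in 'cm' or 'inches' to centimetres."""
    if unit == "cm":
        return float(value)
    if unit == "inches":
        return float(value) * CM_PER_INCH
    raise ValueError(f"unknown unit: {unit}")

def to_dd_mon_yy(text, source_format):
    """Rewrite a date string from a known local layout as DD-Mon-YY."""
    parsed = datetime.strptime(text, SOURCE_FORMATS[source_format])
    return f"{parsed.day:02d}-{MONTHS[parsed.month - 1]}-{parsed.year % 100:02d}"
```

The same input string yields different dates depending on the declared source layout, which is why the layout has to be known per source rather than guessed per row.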

11-17 Multiple Files Problem. Added complexity of multiple source files. Start simple. [Figure: Multiple source files, Logic to detect correct source, Extracted data]

11-18 Transforming Data from Multiple Files. [Figure: File]

11-19 Missing Values Problem. Solution: Ignore; Wait; Mark rows; Extract when time-stamped. Example: If NULL then field = ‘A’.
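
The "mark rows" option and the example rule can be combined in one pass, as in this sketch. The function name, the `was_missing` flag, and the sample field are illustrative assumptions; only the default value 'A' comes from the slide.

```python
def fill_missing(rows, field, default="A"):
    """Copy each row, replace a NULL in `field` with `default`, and mark the row."""
    filled = []
    for row in rows:
        row = dict(row)                      # leave the extracted row untouched
        missing = row.get(field) is None
        if missing:
            row[field] = default
        row["was_missing"] = missing         # keeps the substitution auditable
        filled.append(row)
    return filled

sample = [{"id": 1, "grade": "B"}, {"id": 2, "grade": None}]
result = fill_missing(sample, "grade")
```

Marking the row keeps a substituted value distinguishable from a real one when the data is analyzed later.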

11-20 Duplicate Value Problem. Solution: SQL self-join techniques; RDBMS constraint utilities. Example (ACME Inc):
SELECT … FROM table_a, table_b WHERE table_a.key (+) = table_b.key
UNION
SELECT … FROM table_a, table_b WHERE table_a.key = table_b.key (+)
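
A self-join for spotting duplicates can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the `customer` table, its columns, and all rows except the name "ACME Inc" are invented for illustration, and SQLite stands in for the warehouse RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cusnum INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "ACME Inc"), (2, "Oracle Corp"), (3, "ACME Inc")],
)

# Self-join: pair every row with any later row that carries the same name.
duplicates = conn.execute(
    """
    SELECT a.cusnum, b.cusnum, a.name
    FROM customer a
    JOIN customer b ON a.name = b.name AND a.cusnum < b.cusnum
    ORDER BY a.cusnum, b.cusnum
    """
).fetchall()
```

The `a.cusnum < b.cusnum` condition keeps a row from matching itself and reports each duplicate pair once.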

11-21 Element Names Problem. Solution: CTAS; SQL*Loader. [Figure: the same element under different names: Customer, Client, Contact, Name; sample values: ABC CO GMBH LTD GBUK INC FFR ASSOC MCD CO]

11-22 Element Meaning Problem. Customer_detail may hold: Customer’s name; All customer details; All details except name. Avoid misinterpretation. Complex solution. Document meaning in metadata.

11-23 Input Format Problem. Examples: ASCII, EBCDIC; 12373, “123-73”; ACME Co.; ברוכים הבאים; Beer (Pack of 8).

11-24 Referential Integrity Problem. Solution: SQL anti-join; Server constraints; Dedicated tools. [Figure: Department table and Emp table (Emp, Name, Department) with employees Smith, Jones, Doe, Harris]
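
The SQL anti-join option can be illustrated with sqlite3 from the Python standard library. The table layouts, key values, and department numbers here are assumptions made for the sketch; only the employee names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY);
    CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
    INSERT INTO department VALUES (10), (20);
    INSERT INTO emp VALUES
        (1, 'Smith', 10), (2, 'Jones', 20), (3, 'Doe', 10), (4, 'Harris', 60);
    """
)

# Anti-join: employees whose department has no matching parent row.
orphans = conn.execute(
    """
    SELECT e.name, e.dept_id
    FROM emp e
    WHERE NOT EXISTS (
        SELECT 1 FROM department d WHERE d.dept_id = e.dept_id
    )
    """
).fetchall()
```

Rows returned by the query are the ones that would violate a foreign-key constraint if it were enabled on load.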

11-25 Name and Address Problem. No unique key; Missing values; Personal and commercial names mixed; Different addresses for same member; Different names and spelling for same member; Many names on one line; One name on two lines.
Database 1 (NAME, LOCATION):
DIANNE ZIEFELD         N100
HARRY H. ENFIELD       D589
FRED AND SARA MULLEN   M300
Database 2 (NAME, LOCATION):
ZIEFLED, DIANNE        100
ENFIELD, HARRY H       589
MULLEN, SARA AND FRED  300

11-26 Name and Address Problem. Single-field format compared with multiple-field format.
Single-field format: Mr. J. Smith,100 Main St., Bigtown, County Luth,
Multiple-field format:
Name    Mr. J. Smith
Street  100 Main St.
Town    Bigtown
County  County Luth
Code    23565
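
Turning the single-field format into atomic fields can be sketched as a comma split, assuming every record has exactly the five parts shown in the multiple-field layout. The function name is hypothetical, and the sample string appends the code 23565 from the multiple-field example as an assumption; real name and address data usually needs far more robust parsing.

```python
FIELD_NAMES = ("name", "street", "town", "county", "code")

def split_address(single_field):
    """Split a comma-separated name-and-address string into named fields."""
    parts = [part.strip() for part in single_field.split(",")]
    if len(parts) != len(FIELD_NAMES):
        # Anything that does not fit the assumed layout goes to manual review.
        raise ValueError(f"expected {len(FIELD_NAMES)} parts, got {len(parts)}")
    return dict(zip(FIELD_NAMES, parts))

record = split_address("Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565")
```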

11-27 Clean and Organize. 1. Create atomic values. 2. Standardize formats. 3. Verify data accuracy. 4. Match with other records. 5. Identify private and commercial addresses and inhabitants. 6. Document in metadata. Requires sophisticated tools and techniques.

11-28 Merging Data. Operational transactions do not usually map one-to-one with warehouse data. Data for the warehouse is merged to provide information for analysis.
Pizza sales/returns by day, hour, seconds:
Sale    1/2/98 12:00:01  Ham Pizza      $10.00
Sale    1/2/98 12:00:02  Cheese Pizza   $15.00
Sale    1/2/98 12:00:02  Anchovy Pizza  $12.00
Return  1/2/98 12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/98 12:00:04  Sausage Pizza  $11.00

11-29 Merging Data.
Operational transactions:
Sale    1/2/98 12:00:01  Ham Pizza      $10.00
Sale    1/2/98 12:00:02  Cheese Pizza   $15.00
Sale    1/2/98 12:00:04  Sausage Pizza  $11.00
Sale    1/2/98 12:00:02  Anchovy Pizza  $12.00
Return  1/2/98 12:00:03  Anchovy Pizza  -$12.00
Merged warehouse data:
Sale    1/2/98 12:00:01  Ham Pizza      $10.00
Sale    1/2/98 12:00:02  Cheese Pizza   $15.00
Sale    1/2/98 12:00:04  Sausage Pizza  $11.00
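
One way to arrive at the merged rows is to cancel each return against an earlier sale of the same item and amount. The sketch below makes that assumption explicit; the matching rule and the function name are illustrative, not something the slides specify.

```python
transactions = [
    ("Sale", "1/2/98 12:00:01", "Ham Pizza", 10.00),
    ("Sale", "1/2/98 12:00:02", "Cheese Pizza", 15.00),
    ("Sale", "1/2/98 12:00:02", "Anchovy Pizza", 12.00),
    ("Return", "1/2/98 12:00:03", "Anchovy Pizza", -12.00),
    ("Sale", "1/2/98 12:00:04", "Sausage Pizza", 11.00),
]

def merge_returns(rows):
    """Drop each return together with the most recent matching sale."""
    merged = []
    for kind, stamp, item, amount in rows:
        if kind != "Return":
            merged.append((kind, stamp, item, amount))
            continue
        # Search backwards for a sale of the same item with the opposite amount.
        for i in range(len(merged) - 1, -1, -1):
            m_kind, _, m_item, m_amount = merged[i]
            if m_kind == "Sale" and m_item == item and m_amount == -amount:
                del merged[i]
                break
        else:
            merged.append((kind, stamp, item, amount))  # unmatched return is kept
    return merged

warehouse_rows = merge_returns(transactions)
```

An unmatched return is kept rather than discarded, so that it surfaces as an exception instead of disappearing.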

11-30 Adding a Date Stamp. Enables time analysis. Label loaded data with a date stamp. Add time to fact and dimension data.

11-31 Adding a Date Stamp.
Item Table: Item_id, Dept_id, Time_key
Time Table: Week_id, Period_id, Year_id, Time_key
Store Table: Store_id, District_id, Time_key
Product Table: Product_id, Time_key, Product_desc
Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units

11-32 Adding a Date Stamp. Fact table: Add triggers; Recode applications; Compare tables. Dimension table. Time representation: Point in time; Time span.

11-33 Adding Keys to Data. Data values or artificial keys.
Operational rows:
#1  Sale    1/2/98 12:00:01  Ham Pizza      $10.00
#2  Sale    1/2/98 12:00:02  Cheese Pizza   $15.00
#3  Sale    1/2/98 12:00:02  Anchovy Pizza  $12.00
#5  Sale    1/2/98 12:00:04  Sausage Pizza  $11.00
#4  Return  1/2/98 12:00:03  Anchovy Pizza  -$12.00
Warehouse rows:
#dw1  Sale  1/2/98 12:00:01  Ham Pizza      $10.00
#dw2  Sale  1/2/98 12:00:02  Cheese Pizza   $15.00
#dw3  Sale  1/2/98 12:00:04  Sausage Pizza  $11.00
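
Assigning artificial keys to merged rows can be sketched with a running counter. The `#dw` prefix follows the slide; the function name and the tuple layout are assumptions.

```python
def add_warehouse_keys(rows, prefix="#dw", start=1):
    """Prepend an artificial key such as '#dw1' to each row, in load order."""
    return [(f"{prefix}{n}",) + tuple(row) for n, row in enumerate(rows, start=start)]

merged_rows = [
    ("Sale", "1/2/98 12:00:01", "Ham Pizza", 10.00),
    ("Sale", "1/2/98 12:00:02", "Cheese Pizza", 15.00),
    ("Sale", "1/2/98 12:00:04", "Sausage Pizza", 11.00),
]
keyed_rows = add_warehouse_keys(merged_rows)
```

The artificial key is independent of the operational numbering (#1 to #5), so gaps left by merged-out rows do not carry into the warehouse.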

11-34 Summarizing Data. During extraction on staging area. After loading onto the warehouse server. [Figure: Operational databases, Staging area, Warehouse database]

11-35 Maintaining Transformation Metadata. Contains transformation rules, algorithms, and routines. [Figure: Sources, Extract, Stage, Transform, Rules, Load, Publish, Query]

11-36 Maintaining Transformation Metadata. Key restructuring; Coding differences; Multiple sources; Exception rules; Format differences; Referential integrity fixes; Aggregated data.

11-37 Data Ownership and Responsibilities. Operational and application development teams. Data warehouse development team. Business benefit gained with a one-team approach.

11-38 Transformation Timing and Location. Transformation is performed: before load; in parallel. May be initiated at different points. [Figure: candidate transformation points labeled Unlikely, Probable, Possible]

11-39 Choosing a Transformation Point. Workload; Environment impact; CPU use; Disk space; Network bandwidth; Parallel execution; Load window time; User information needs.

11-40 Monitoring and Tracking. Transforms should: Be self-documenting; Provide summary statistics; Handle process exceptions.
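
A transform wrapper that reports summary statistics and captures exceptions might look like this sketch. The function name, the counter names, and the choice of which exception types count as row-level rejects are all assumptions.

```python
def run_transform(rows, transform):
    """Apply `transform` to each row, collecting output, rejects, and counts."""
    stats = {"read": 0, "loaded": 0, "rejected": 0}
    output, exceptions = [], []
    for row in rows:
        stats["read"] += 1
        try:
            output.append(transform(row))
            stats["loaded"] += 1
        except (ValueError, KeyError, TypeError) as err:
            # Keep the offending row and the reason instead of aborting the run.
            exceptions.append((row, repr(err)))
            stats["rejected"] += 1
    return output, exceptions, stats

loaded, rejects, summary = run_transform(["1", "x", "3"], int)
```

Here `summary` ends up as `{"read": 3, "loaded": 2, "rejected": 1}`, and `rejects` holds the row `"x"` with the reason it failed.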

11-41 Designing Transformation Processes. Analysis: sources and target mappings, business rules; key users, metadata, grain. Design options: PL/SQL, replication, custom, third-party tools. Design issues: performance; size of the staging area; exception handling, integrity maintenance.

11-42 Transformation Tools. Purchased; SQL*Loader; In-house developed.

11-43 Data Management, Quality and Auditing Tools. Data management: Innovative Systems; Postalsoft; Vality Technology. Data quality and auditing: Innovative Systems; Vality Technology.

11-44 Summary. This lesson discussed the following topics: Importance of data quality; Transformation process; Data transformation issues; Data anomalies; Name and address management; Tools.

11-45 Practice 11-1 Overview. This practice covers the following topics: Answering a series of short questions; Specifying true or false to a series of statements.