6 The ETL Process: Transforming Data
Copyright © 2006, Oracle. All rights reserved.

Objectives
After completing this lesson, you should be able to do the following:
- Define transformation
- Identify possible staging models
- Identify data anomalies and eliminate them
- Explain the importance of quality data
- Describe techniques for transforming data
- Design a transformation process
- List Oracle’s enhanced features and tools that can be used to transform data

Transformation
Transformation eliminates anomalies from operational data:
- Cleans and standardizes
- Presents subject-oriented data
[Diagram: Operational systems -> Extract -> Data staging area (Transform: clean up, consolidate, restructure) -> Load -> Warehouse]

Possible Staging Models
- Remote staging model
- Onsite staging model

Remote Staging Model
- Data staging area within the warehouse environment
- Data staging area in its own environment
[Diagram, both variants: Operational system -> Extract -> Staging area (Transform) -> Load -> Warehouse]

Onsite Staging Model
- Data staging area within the operational environment, possibly affecting the operational system
[Diagram: Operational system -> Extract -> Staging area (Transform) -> Load -> Warehouse]

Data Anomalies
- No unique key
- Data naming and coding anomalies
- Data meaning anomalies between groups
- Spelling and text inconsistencies
CUSNUM  NAME                ADDRESS
        Oracle Limited      100 N.E. 1st St
        Oracle Computing    15 Main Road, Ft. Lauderdale
        Oracle Corp. UK     15 Main Road, Ft. Lauderdale, FLA
        Oracle Corp UK Ltd  181 North Street, Key West, FLA

Transformation Routines
- Cleaning data
- Eliminating inconsistencies
- Adding elements
- Merging data
- Integrating data
- Transforming data before load

Transforming Data: Problems and Solutions
- Multipart keys
- Multiple local standards
- Multiple files
- Missing values
- Duplicate values
- Element names
- Element meanings
- Input formats
- Referential integrity constraints
- Name and address

Multipart Keys Problem
- Multipart keys
Product code = 12 M
Key parts: country code, sales territory, product number, salesperson code

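One way to handle a multipart key is to split it into its parts during transformation. The sketch below assumes a hypothetical fixed-width layout (2-character country code, 1-character sales territory, 6-digit product number, 2-character salesperson code) and a made-up sample value. Neither comes from the slide; the real layout has to come from the source system's documentation.

```python
from typing import NamedTuple


class ProductKey(NamedTuple):
    country_code: str
    sales_territory: str
    product_number: str
    salesperson_code: str


# Hypothetical fixed-width layout: (field name, width). Not taken from the slide.
LAYOUT = [
    ("country_code", 2),
    ("sales_territory", 1),
    ("product_number", 6),
    ("salesperson_code", 2),
]


def split_multipart_key(raw: str) -> ProductKey:
    """Split a packed key into its component fields."""
    key = raw.replace(" ", "")  # tolerate spaces between the parts
    expected = sum(width for _, width in LAYOUT)
    if len(key) != expected:
        raise ValueError(f"expected {expected} characters, got {len(key)}: {raw!r}")
    parts = {}
    pos = 0
    for name, width in LAYOUT:
        parts[name] = key[pos:pos + width]
        pos += width
    return ProductKey(**parts)


# Made-up sample value for illustration only.
print(split_multipart_key("12 M 654313 45"))
```

Storing each part in its own column lets the warehouse filter and join on country, territory, product, or salesperson without string parsing at query time.
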
Multiple Local Standards Problem
- Multiple local standards
- Tools or filters to preprocess
Examples:
- Units: cm, inches
- Currency: USD 600; 1,000 GBP; FF 9,990
- Date formats: DD/MM/YY, MM/DD/YY, DD-Mon-YY

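A preprocessing filter for local standards can be sketched as a pair of small converters. The source-system names (`uk_sales`, `us_sales`) and the choice of DD-Mon-YY and centimeters as the warehouse standard are assumptions for illustration.

```python
from datetime import datetime

# Hypothetical mapping from source system to the date format it uses.
SOURCE_DATE_FORMATS = {
    "uk_sales": "%d/%m/%y",  # DD/MM/YY
    "us_sales": "%m/%d/%y",  # MM/DD/YY
}

CM_PER_INCH = 2.54


def standardize_date(value: str, source: str) -> str:
    """Convert a source-specific date string to DD-Mon-YY."""
    parsed = datetime.strptime(value, SOURCE_DATE_FORMATS[source])
    return parsed.strftime("%d-%b-%y")


def to_cm(value: float, unit: str) -> float:
    """Convert a length to centimeters."""
    if unit == "cm":
        return value
    if unit == "inches":
        return value * CM_PER_INCH
    raise ValueError(f"unknown unit: {unit}")


# The same string means different dates depending on where it came from.
print(standardize_date("03/04/02", "uk_sales"))
print(standardize_date("03/04/02", "us_sales"))
print(to_cm(10, "inches"))
```

The key design point is that the format is chosen by the source of the record, not guessed from the value, because a string such as 03/04/02 is valid under both conventions.
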
Multiple Files Problem
- Added complexity of multiple source files
- Start simple
[Diagram: Multiple source files -> Logic to detect correct source -> Transformed data]

Missing Values Problem
Solution:
- Ignore
- Wait
- Mark rows
- Extract when timestamped
Example rule: If NULL, then field = “A”

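The "mark rows" option and the rule that substitutes a value for NULL can be combined, as in this minimal sketch. The field names and the `was_missing` flag are hypothetical.

```python
# Placeholder written in place of a missing value, following the slide's "A".
DEFAULT_MARKER = "A"


def mark_missing(rows, field):
    """Return copies of rows with missing `field` values replaced and flagged."""
    result = []
    for row in rows:
        fixed = dict(row)  # copy so the extracted rows stay untouched
        missing = fixed.get(field) is None
        if missing:
            fixed[field] = DEFAULT_MARKER
        fixed["was_missing"] = missing  # marker so the row can be revisited later
        result.append(fixed)
    return result


rows = [
    {"cust_id": 1, "region": "N"},
    {"cust_id": 2, "region": None},
]
marked = mark_missing(rows, "region")
for row in marked:
    print(row)
```

Keeping the flag next to the substituted value makes it possible to tell a real "A" from a placeholder when the source is later corrected.
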
Duplicate Values Problem
Solution:
- SQL self-join techniques
- RDBMS constraints
    SELECT ...
    FROM table_a, table_b
    WHERE table_a.key (+) = table_b.key
    UNION
    SELECT ...
    FROM table_a, table_b
    WHERE table_a.key = table_b.key (+);

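A self-join for finding duplicated keys can be tried out with Python's built-in `sqlite3` module. This is a sketch against SQLite rather than Oracle, and the `customers` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (cust_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Smith"), (2, "Jones"), (2, "Jones"), (3, "Doe")],
)

# Self-join: a key is duplicated if two different physical rows share it.
duplicate_keys = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT a.cust_id
        FROM customers a
        JOIN customers b
          ON a.cust_id = b.cust_id
         AND a.rowid < b.rowid
        ORDER BY a.cust_id
        """
    )
]
print(duplicate_keys)
```

Once the duplicates are resolved, a primary key or unique constraint on the target table keeps new ones from being loaded.
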
Element Names Problem
Solution:
- Common naming conventions
[Diagram labels: Customer, Client, Contact, Name]

Element Meaning Problem
- Avoid misinterpretation
- Complex solution
- Document meaning in metadata
[Diagram: possible readings of Customer_detail: customer’s name; all customer details; all details except name]

Input Format Problem
- ASCII
- EBCDIC
Sample input values:
- ACME Co.
- ברוכים הבאים
- Beer (Pack of 8)

Referential Integrity Problem
Solution:
- SQL antijoin
- Server constraints
- Dedicated tools
[Figure: a Department table and an employee table with columns Emp, Name, Department; rows shown: 1099 Smith; Jones; Doe; Harris 60]

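The antijoin option can be illustrated with `NOT EXISTS`, which returns child rows that have no matching parent. The sketch runs on SQLite through Python's `sqlite3` module; the table names, column names, and sample rows are assumptions loosely modeled on the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY);
    CREATE TABLE employee (emp_id INTEGER, name TEXT, dept_id INTEGER);
    INSERT INTO department VALUES (10), (20), (30);
    INSERT INTO employee VALUES
        (1, 'Smith', 10), (2, 'Jones', 20), (3, 'Doe', 30), (4, 'Harris', 60);
    """
)

# Antijoin: employees whose department has no matching parent row.
orphans = conn.execute(
    """
    SELECT e.name, e.dept_id
    FROM employee e
    WHERE NOT EXISTS (
        SELECT 1 FROM department d WHERE d.dept_id = e.dept_id
    )
    """
).fetchall()
print(orphans)
```

Rows found this way can be routed to an exception table or held back until the missing parent arrives, depending on the load rules.
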
Name and Address Problem
- Single-field format
- Multiple-field format
Single-field format:
  Mr. J. Smith, 100 Main St., Bigtown, County Luth,
Multiple-field format:
  Name     Mr. J. Smith
  Street   100 Main St.
  Town     Bigtown
  Country  County Luth
  Code     23565
Database 1:
  NAME              LOCATION
  DIANNE ZIEFELD    N100
  HARRY H. ENFIELD  M300
Database 2:
  NAME              LOCATION
  ZIEFELD, DIANNE   100
  ENFIELD, HARRY H  300

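Matching the two databases above requires bringing both name styles to one form first. The following is a deliberately simple sketch that assumes ASCII names written either as FIRST LAST or as LAST, FIRST; real name and address cleansing needs far more rules than this.

```python
import re


def normalize_name(raw: str) -> str:
    """Normalize 'LAST, FIRST' or 'FIRST LAST' to upper-case 'FIRST LAST'."""
    name = raw.strip().upper()
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        name = f"{first} {last}"
    name = re.sub(r"[^A-Z ]", "", name)       # drop periods and other punctuation
    return re.sub(r"\s+", " ", name).strip()  # collapse repeated spaces


database_1 = ["DIANNE ZIEFELD", "HARRY H. ENFIELD"]
database_2 = ["ZIEFELD, DIANNE", "ENFIELD, HARRY H"]

matches = [
    (a, b)
    for a in database_1
    for b in database_2
    if normalize_name(a) == normalize_name(b)
]
print(matches)
```
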
Name and Address Processing in Oracle Warehouse Builder
Name and address mapping operator supports:
- Parsing
- Standardization
- Postal matching and geocoding

Quality Data: Importance and Benefits
Quality data:
- Key to a successful warehouse implementation
Quality data helps you in:
- Targeting the right customers
- Determining buying patterns
- Identifying householders: private and commercial
- Matching customers
- Identifying historical data

Quality: Standards and Improvements
Setting standards:
- Define a quality strategy.
- Decide on the optimal data-quality level.
Improving operational data quality:
- Consider modifying rules for operational data.
- Document the sources.
- Create a data stewardship program.
- Design the cleanup process carefully.
- Initial cleanup and refresh routines may differ.

Data Quality Guidelines
Operational data:
- Should not be used directly in the warehouse
- Must be cleaned for each increment
- Is not fixed by modifying applications

Data Quality: Solutions and Management
Solutions:
- COBOL, Java, 4GL
- Specialized tools
- Customized data conversion process:
  - Investigation
  - Conditioning and standardization
  - Integration
Management:
- Take responsibility.
- Resolve problems.
- Appoint a data quality manager.

Transformation Techniques
- Merging data
- Adding a date stamp
- Adding keys to data

Merging Data
- Operational transactions do not usually map one-to-one with warehouse data.
- Data for the warehouse is merged to provide information for analysis.
Pizza sales/returns by day, hour, seconds:
  Sale    1/2/02 12:00:01  Ham Pizza       $10.00
  Sale    1/2/02 12:00:02  Cheese Pizza    $15.00
  Sale    1/2/02 12:00:02  Anchovy Pizza   $12.00
  Return  1/2/02 12:00:03  Anchovy Pizza  -$12.00
  Sale    1/2/02 12:00:04  Sausage Pizza   $11.00

Merging Data
Pizza sales/returns by day, hour, seconds:
  Sale    1/2/02 12:00:01  Ham Pizza       $10.00
  Sale    1/2/02 12:00:02  Cheese Pizza    $15.00
  Sale    1/2/02 12:00:02  Anchovy Pizza   $12.00
  Return  1/2/02 12:00:03  Anchovy Pizza  -$12.00
  Sale    1/2/02 12:00:04  Sausage Pizza   $11.00
Pizza sales:
  Sale    1/2/02 12:00:01  Ham Pizza       $10.00
  Sale    1/2/02 12:00:02  Cheese Pizza    $15.00
  Sale    1/2/02 12:00:04  Sausage Pizza   $11.00

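The merge shown on this slide, where a sale and its return cancel out, can be sketched as follows. The matching rule (same product and same amount) is an assumption; an operational system would normally link a return to its sale through a transaction reference.

```python
from collections import Counter

transactions = [
    ("Sale", "1/2/02 12:00:01", "Ham Pizza", 10.00),
    ("Sale", "1/2/02 12:00:02", "Cheese Pizza", 15.00),
    ("Sale", "1/2/02 12:00:02", "Anchovy Pizza", 12.00),
    ("Return", "1/2/02 12:00:03", "Anchovy Pizza", -12.00),
    ("Sale", "1/2/02 12:00:04", "Sausage Pizza", 11.00),
]


def net_sales(rows):
    """Drop each return together with one sale of the same product and amount."""
    returned = Counter(
        (product, -amount) for kind, _, product, amount in rows if kind == "Return"
    )
    kept = []
    for kind, timestamp, product, amount in rows:
        if kind == "Return":
            continue
        if returned[(product, amount)] > 0:
            returned[(product, amount)] -= 1  # this sale is cancelled by a return
            continue
        kept.append((kind, timestamp, product, amount))
    return kept


pizza_sales = net_sales(transactions)
for row in pizza_sales:
    print(row)
```
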
Adding a Date Stamp
Time element can be represented as a:
- Single point in time
- Time span
Add time element to:
- Fact tables
- Dimension data

Adding a Date Stamp: Fact Tables and Dimensions
- Sales: Item_id, Store_id, Time_key, Sales_dollars, Sales_units
- Channels Table: Channel_id, Channel_name, Time_key
- Customers Table: Cust_id, Cust_first_name, Time_key
- Products Table: Product_id, Time_key, Product_desc
- Times Table: Week_id, Period_id, Year_id, Time_key

Adding Keys to Data
Source rows:
  #1    Sale    1/2/98 12:00:01  Ham Pizza       $10.00
  #2    Sale    1/2/98 12:00:02  Cheese Pizza    $15.00
  #3    Sale    1/2/98 12:00:02  Anchovy Pizza   $12.00
  #4    Return  1/2/98 12:00:03  Anchovy Pizza  -$12.00
  #5    Sale    1/2/98 12:00:04  Sausage Pizza   $11.00
Warehouse rows:
  #dw1  Sale    1/2/98 12:00:01  Ham Pizza       $10.00
  #dw2  Sale    1/2/98 12:00:02  Cheese Pizza    $15.00
  #dw3  Sale    1/2/98 12:00:04  Sausage Pizza   $11.00
Data values or artificial keys

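Assigning artificial keys such as #dw1 can be sketched with a simple counter. The `dw_key` and `source_key` field names are hypothetical; in an Oracle warehouse the numbers would more likely come from a database sequence.

```python
from itertools import count


def add_warehouse_keys(rows, prefix="dw", start=1):
    """Attach a generated key such as 'dw1' to each row, independent of the source key."""
    numbers = count(start)
    return [{"dw_key": f"{prefix}{next(numbers)}", **row} for row in rows]


merged_sales = [
    {"source_key": 1, "product": "Ham Pizza", "amount": 10.00},
    {"source_key": 2, "product": "Cheese Pizza", "amount": 15.00},
    {"source_key": 5, "product": "Sausage Pizza", "amount": 11.00},
]
keyed = add_warehouse_keys(merged_sales)
for row in keyed:
    print(row["dw_key"], row["source_key"], row["product"])
```

Keeping the source key alongside the generated one preserves the link back to the operational record while the warehouse key stays gap-free.
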
Summarizing Data
1. During extraction on staging area
2. After loading to the warehouse server
[Diagram: Operational databases -> Staging area -> Warehouse database]

Maintaining Transformation Metadata
Transformation metadata contains:
- Transformation rules
- Algorithms and routines
[Diagram labels: Sources, Extract, Stage, Transform, Rules, Load, Publish, Query]

Maintaining Transformation Metadata
- Restructure keys.
- Identify and resolve coding differences.
- Validate data from multiple sources.
- Handle exception rules.
- Identify and resolve format differences.
- Fix referential integrity inconsistencies.
- Identify summary data.

Data Ownership and Responsibilities
Data ownership and responsibilities should be shared by the:
- Operational team
- Data warehouse team
Business benefit is gained with the “work together” approach.

Transformation Timing and Location
Transformation is performed:
- Before load
- In parallel
Transformation can be initiated at different points:
- On the operational platform
- In a separate staging area

Choosing a Transformation Point
- Workload
- Impact on environment
- CPU usage
- Disk space
- Network bandwidth
- Parallel execution
- Load window time
- User information needs

Monitoring and Tracking
Transformations should:
- Be self-documenting
- Provide summary statistics
- Handle process exceptions

Designing Transformation Processes
Analysis:
- Sources and target mappings, business rules
- Key users, metadata, grain
Design options:
- Tools (OWB)
- Custom 3GL programs
- 4GLs such as SQL or PL/SQL
- Replication
Design issues:
- Performance
- Size of the staging area
- Exception handling, integrity maintenance

Transformation Tools
- SQL*Loader
- Oracle Warehouse Builder (OWB) supports:
  - Predefined transformations
  - Custom transformations

Oracle’s Enhanced Features for Transformation
Transformation methods:
- Multistage transformation
[Diagram: flat files are loaded into staging tables (staging table 1, staging table 2), where data is transformed and validated, and then merged into warehouse tables in the data warehouse]

Oracle’s Enhanced Features for Transformation
Transformation methods:
- Pipelined transformation
- External tables
[Diagram: Flat files -> External table -> Table functions (transform data, validate data) -> Merge into warehouse tables -> Warehouse tables]

Oracle’s Enhanced Features for Transformation
Transformation mechanisms using SQL:
- CREATE TABLE AS SELECT (CTAS)
- UPDATE
- MERGE
- Multitable INSERT
[Diagram: MERGE from Cust into Customer: existing row updated, new row inserted]

Application of the MERGE Statement in Data Warehousing
An example:
    MERGE INTO customers c
    USING cust_src s
    ON (c.cust_id = s.src_cust_id)
    WHEN MATCHED THEN
      UPDATE SET c.cust_address = s.cust_address
    WHEN NOT MATCHED THEN
      INSERT (cust_id, cust_first_name, …)
      VALUES (src_cust_id, src_first_name, …);

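The update-or-insert behavior of the MERGE example can be mimicked in plain Python to make the two branches concrete. This is an illustration of the semantics only, not how Oracle executes the statement. It assumes each source key appears once, and the columns written by the insert branch are a guess because the slide elides them.

```python
def merge_customers(customers, cust_src):
    """Update matched customers and insert unmatched ones, keyed by customer ID."""
    updated, inserted = 0, 0
    for src in cust_src:
        cust_id = src["src_cust_id"]
        if cust_id in customers:  # WHEN MATCHED THEN UPDATE
            customers[cust_id]["cust_address"] = src["cust_address"]
            updated += 1
        else:  # WHEN NOT MATCHED THEN INSERT
            customers[cust_id] = {
                "cust_first_name": src["src_first_name"],  # assumed column list
                "cust_address": src["cust_address"],
            }
            inserted += 1
    return updated, inserted


# Hypothetical sample data.
customers = {101: {"cust_first_name": "Dianne", "cust_address": "100 Main St."}}
cust_src = [
    {"src_cust_id": 101, "src_first_name": "Dianne", "cust_address": "15 Main Road"},
    {"src_cust_id": 102, "src_first_name": "Harry", "cust_address": "181 North Street"},
]
counts = merge_customers(customers, cust_src)
print(counts)
```
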
Multitable INSERT Statements
Types:
- Unconditional INSERT
- Pivoting INSERT
- Conditional ALL INSERT
- Conditional FIRST INSERT
[Diagram: Source table -> Condition -> Target table 1, Target table 2, Target table 3]

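The difference between the conditional ALL and conditional FIRST forms is easiest to see side by side. The sketch below imitates the routing logic in Python with lists standing in for target tables; the function name, the order data, and the thresholds are all hypothetical.

```python
def multitable_insert(rows, clauses, mode="ALL", else_target=None):
    """Route each source row to target lists in a single pass over the source.

    clauses: list of (condition, target_list) pairs, like WHEN ... THEN INTO.
    Mode "ALL" inserts into every matching target; "FIRST" stops at the first match.
    """
    if mode not in ("ALL", "FIRST"):
        raise ValueError(f"unknown mode: {mode}")
    for row in rows:
        matched = False
        for condition, target in clauses:
            if condition(row):
                target.append(row)
                matched = True
                if mode == "FIRST":
                    break
        if not matched and else_target is not None:
            else_target.append(row)  # like the ELSE branch


orders = [{"id": 1, "amount": 50}, {"id": 2, "amount": 500}, {"id": 3, "amount": 5000}]


def route(mode):
    large, medium, other = [], [], []
    multitable_insert(
        orders,
        [(lambda r: r["amount"] > 1000, large), (lambda r: r["amount"] > 100, medium)],
        mode=mode,
        else_target=other,
    )
    return [r["id"] for r in large], [r["id"] for r in medium], [r["id"] for r in other]


print(route("ALL"))    # ([3], [2, 3], [1])
print(route("FIRST"))  # ([3], [2], [1])
```

With ALL, order 3 satisfies both conditions and lands in both targets; with FIRST, it goes only to the first target whose condition it meets.
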
Advantages of Multitable INSERTs
- Eliminates the need for multiple INSERT ... SELECT statements to populate multiple tables
- Eliminates the need for a procedure to perform multiple INSERTs using IF…THEN…ELSE syntax
- Significantly improves performance over the preceding two methods because the cost of repeated scans on the source data is eliminated

Oracle’s Enhanced Features for Transformation
Transformation mechanisms:
- Using PL/SQL:
  - Used for complex transformations
- Using table functions. Table functions can:
  - Return multiple rows from a function
  - Accept results of multirow SQL subqueries as input
  - Take cursors as input
  - Be parallelized
  - Support incremental pipelining

Advantages of PL/SQL Table Functions
- Table functions “pipeline” the results to the consuming process as soon as they are produced.
- Table functions can return multiple rows during each invocation (pipelining of data).
- Pipelining eliminates the need for buffering the produced rows.

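Pipelining is easier to picture with an analogy. A Python generator is not a PL/SQL table function, but it shows the same idea: each row is handed to the consumer as soon as it is produced, so no intermediate buffer of results is built. The row contents and the `events` log are made up for the demonstration.

```python
events = []  # records the order in which rows are produced and consumed


def source():
    """Stand-in for a cursor feeding rows into the transformation."""
    for i, name in enumerate([" smith", "jones ", "doe"], start=1):
        events.append(f"produce {i}")
        yield {"id": i, "name": name}


def transform_pipelined(rows):
    """Yield each cleaned row as soon as it is ready (no intermediate list)."""
    for row in rows:
        yield {**row, "name": row["name"].strip().upper()}


cleaned = []
for row in transform_pipelined(source()):
    events.append(f"consume {row['id']}")
    cleaned.append(row)

print(events)
```

The log alternates between producing and consuming, which is the interleaving that pipelining provides; a version that returned a complete list would log all three "produce" entries before the first "consume".
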
Summary
In this lesson, you should have learned how to:
- Define transformation
- Identify possible staging models
- Identify data anomalies and eliminate them
- Explain the importance of quality data
- Describe techniques for transforming data
- Design a transformation process
- Describe Oracle’s enhanced features and tools that can be used to transform data

Practice 6-1: Overview
This practice covers the following topics:
- Identifying the suitable staging model for the RISD data warehouse
- Identifying the problems and the best-suited transformation techniques for the RISD data, based on the given scenario
- Exploring the viewlet-based demonstrations of the ETL features of Oracle Warehouse Builder
