Copyright © Oracle Corporation, 1999. All rights reserved. 11 Transforming Data


1 Transforming Data

2 Overview
– Project Management (Methodology, Maintaining Metadata)
– Defining DW Concepts & Terminology
– Planning for a Successful Warehouse
– Analyzing User Query Needs
– Choosing a Computing Architecture
– Modeling the Data Warehouse
– Planning Warehouse Storage
– ETT (Building the Warehouse)
– Meeting a Business Need
– Supporting End User Access
– Managing the Data Warehouse

3 Objectives
After completing this lesson, you should be able to do the following:
– Explain the importance of quality data
– Define the term “transformation”
– Identify transformation issues
– Describe techniques for transforming data
– List tools that can be used to transform data

4 Importance of Data Quality
[Figure labels: Summit Sports, Hollywood, Speedy Pizza, Customers]

5 Benefits of Quality Data
Clean data is essential for:
– Targeting customers
– Determining buying patterns
– Identifying householders: private and commercial
– Matching customers
– Identifying historical data
Dirty data must be removed.

6 Standards
– Define a quality strategy.
– Decide on optimal data-quality level.

7 Quality Improvements
– Consider modifying rules for operational data.
– Document the sources.
– Create a data stewardship program.
– Design the cleanup process carefully.
– Initial cleanup and refresh routines may differ.

8 Guidelines
– Operational data should not be used directly in the warehouse
– Operational data must be cleaned for each increment
– Operational data is not simply fixed by modifying applications

9 Solutions
– Conventional COBOL, 4GL
– Specialized tools
– Customized conversion process
– Business experts
[Figure labels: Investigation, Conditioning, Standardization, Integration]

10 Management
– Poor data quality
– Own
– Take responsibility
– Resolve problems
– Data quality manager

11 Transformation
Transformation eliminates operational data anomalies:
– Cleans
– Standardizes
– Presents subject-oriented data
[Figure labels: Operational system, Extract, Data staging area, Transform (Clean up, Consolidate, Restructure), Transport (Load), Warehouse]

12 Source Data Anomalies
– No unique key
– Data naming and coding anomalies
– Data meaning anomalies between groups
– Spelling and text inconsistencies
CUSNUM    NAME                ADDRESS
90328575  Oracle Corp         100 NE 1st Street, Tampa
90328575  Oracle              100 NE. First St., Tampa
90238475  Oracle Services     100 North East 1st St., FLA
90233479  Oracle Limited      100 N.E. 1st St.
90233489  Oracle Computing    15 Main Road, Ft. Lauderdale
90234889  Oracle Corp. UK     15 Main Road, Ft. Lauderdale, FLA
90345672  Oracle Corp UK Ltd  181 North Street, Key West, FLA

13 Transformation Routines
– Cleaning data
– Eliminating inconsistencies
– Adding elements
– Merging data
– Integrating data
– Transforming data before load

14 Transforming Data: Problems and Solutions
Multipart keys
Product code = 12 M65431345
Key components: Country code, Sales territory, Product number, Salesperson code
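
A multipart key like the one above can be split into its components before loading. The following is a minimal sketch, not the course's own routine: the slide does not state the width of each component, so the widths in FIELD_WIDTHS (2, 1, 6, 2) and the names ProductKey and split_multipart_key are assumptions made for illustration.

```python
from typing import NamedTuple


class ProductKey(NamedTuple):
    country_code: str
    sales_territory: str
    product_number: str
    salesperson_code: str


# Assumed component widths; the slide names the components but not their sizes.
FIELD_WIDTHS = (2, 1, 6, 2)


def split_multipart_key(raw: str) -> ProductKey:
    """Split a fixed-width multipart key into its four components."""
    compact = raw.replace(" ", "")
    if len(compact) != sum(FIELD_WIDTHS):
        raise ValueError(f"unexpected key length: {raw!r}")
    parts = []
    pos = 0
    for width in FIELD_WIDTHS:
        parts.append(compact[pos:pos + width])
        pos += width
    return ProductKey(*parts)
```

Rejecting keys of unexpected length, rather than guessing, keeps malformed source values visible for the exception process.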

15 Transforming Data
– Multiple encoding
– Must pick up erroneous data
Source encodings: m, f; 1, 0; male, female
Target encoding: m, f
Erroneous values: mle, female; 1, NULL
If field not in (‘m’, 1, ‘male’) then … else if field is NULL then …
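
The encoding rule above could be sketched as a lookup that maps every known source encoding to the target codes and refuses anything else. This is an illustrative sketch only: the pairing of 0 with "f" is inferred from the slide's "1, 0" example, and the names GENDER_CODES and normalize_gender are hypothetical.

```python
from typing import Optional

# Known source encodings mapped to the target codes 'm' and 'f'.
# Treating 0 as 'f' is an assumption based on the slide's "1, 0" pair.
GENDER_CODES = {
    "m": "m", "1": "m", "male": "m",
    "f": "f", "0": "f", "female": "f",
}


def normalize_gender(value: object) -> Optional[str]:
    """Return 'm' or 'f', None for a missing value, or raise for bad data."""
    if value is None:
        return None  # NULL is handled separately from erroneous values
    key = str(value).strip().lower()
    if key not in GENDER_CODES:
        raise ValueError(f"unrecognized gender encoding: {value!r}")
    return GENDER_CODES[key]
```

Raising on a value such as "mle" is one way to "pick up erroneous data" instead of loading it silently.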

16 Transforming Data
– Multiple local standards
– Tools or filters to preprocess
[Figure labels: units cm, inches, cm; date formats DD/MM/YY, MM/DD/YY, DD-Mon-YY; currency amounts 1,000 GBP, FF 9,990, USD 600]
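
A preprocessing filter for local standards might look like the sketch below. It assumes that DD-Mon-YY and centimetres are the warehouse targets, which the slide suggests but does not state; the source labels "uk" and "us" and both function names are hypothetical.

```python
from datetime import datetime

# Hypothetical labels for the two source date conventions on the slide.
SOURCE_DATE_FORMATS = {
    "uk": "%d/%m/%y",  # DD/MM/YY
    "us": "%m/%d/%y",  # MM/DD/YY
}

CM_PER_INCH = 2.54


def to_warehouse_date(text: str, source: str) -> str:
    """Convert a source date string to DD-Mon-YY (month name per current locale)."""
    parsed = datetime.strptime(text, SOURCE_DATE_FORMATS[source])
    return parsed.strftime("%d-%b-%y")


def inches_to_cm(inches: float) -> float:
    """Convert a length in inches to centimetres."""
    return inches * CM_PER_INCH
```

The same input string, "01/02/98", yields a different warehouse date depending on which source convention applies, which is why the source of each file has to be known before conversion.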

17 Multiple Files Problem
– Added complexity of multiple source files
– Start simple
[Figure labels: Multiple source files, Logic to detect correct source, Extracted data]

18 Transforming Data from Multiple Files
[Figure label: File]

19 Missing Values Problem
Solution:
– Ignore
– Wait
– Mark rows
– Extract when time-stamped
Example rule: If NULL then field = ‘A’
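
Two of the options above, substituting a default and marking rows, can be combined in one pass. The sketch below is an assumption-laden illustration: rows are modelled as dictionaries, and the function name fill_missing and the was_missing flag are hypothetical.

```python
from typing import Any, Dict, List


def fill_missing(rows: List[Dict[str, Any]], field: str,
                 default: Any = "A") -> List[Dict[str, Any]]:
    """Replace missing values with a default and mark the affected rows."""
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the source rows stay untouched
        missing = new_row.get(field) is None
        if missing:
            new_row[field] = default
        new_row["was_missing"] = missing  # marker column for later review
        filled.append(new_row)
    return filled
```

Keeping the marker lets analysts distinguish a genuine ‘A’ from one that was substituted during transformation.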

20 Duplicate Value Problem
Solution:
– SQL self-join techniques
– RDBMS constraint utilities
[Figure label: ACME Inc]
  SELECT …
  FROM table_a, table_b
  WHERE table_a.key (+) = table_b.key
  UNION
  SELECT …
  FROM table_a, table_b
  WHERE table_a.key = table_b.key (+)
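
A self-join for spotting duplicates can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions, not the slide's own query: the customer table, its columns, and the rule that names match after trimming and upper-casing are all hypothetical, and SQLite stands in for the Oracle server.

```python
import sqlite3
from typing import List, Tuple


def find_duplicate_pairs(rows: List[Tuple[int, str]]) -> List[Tuple[int, int]]:
    """Return pairs of customer numbers whose names match after normalization."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (cusnum INTEGER, name TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    # Self-join: compare every row with every later row of the same table.
    pairs = conn.execute(
        """
        SELECT a.cusnum, b.cusnum
        FROM customer a
        JOIN customer b
          ON UPPER(TRIM(a.name)) = UPPER(TRIM(b.name))
         AND a.cusnum < b.cusnum
        ORDER BY a.cusnum, b.cusnum
        """
    ).fetchall()
    conn.close()
    return pairs
```

The a.cusnum < b.cusnum condition keeps a row from matching itself and reports each duplicate pair once.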

21 Element Names Problem
Solution:
– CTAS
– SQL*Loader
Element names for the same data: Customer, Client, Contact, Name
[Figure: sample column values 12345.00, 12780.00, 2345787.00, 87877.98, 5678.00; 100%, 110%, 230%, 200%, -10%; ABC CO GMBH LTD GBUK INC FFR ASSOC MCD CO]

22 Element Meaning Problem
Customer_detail could mean:
– Customer’s name
– All customer details
– All details except name
Notes:
– Avoid misinterpretation
– Complex solution
– Document meaning in metadata

23 Input Format Problem
– Character sets: ASCII, EBCDIC
– Number formats: 12373, “123-73”
– Text samples: ACME Co.; ברוכים הבאים; Beer (Pack of 8)
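
Input format differences such as EBCDIC text and punctuated numbers can be handled with small conversion helpers. The sketch below assumes the EBCDIC source uses code page 037 (Python's "cp037" codec), which the slide does not specify; the function names are hypothetical.

```python
def ebcdic_to_text(raw: bytes, codec: str = "cp037") -> str:
    """Decode EBCDIC bytes to a Python string (code page 037 is an assumption)."""
    return raw.decode(codec)


def digits_only(value: str) -> str:
    """Strip quotes, hyphens, and other punctuation from a numeric code."""
    return "".join(ch for ch in value if ch.isdigit())


# "ACME Co." as it would arrive from a cp037 EBCDIC source.
EBCDIC_SAMPLE = b"\xC1\xC3\xD4\xC5\x40\xC3\x96\x4B"
```

For example, ebcdic_to_text(EBCDIC_SAMPLE) recovers the readable name, and digits_only brings “123-73” and 12373 to the same form.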

24 Referential Integrity Problem
Solution:
– SQL anti-join
– Server constraints
– Dedicated tools
Department table: 10, 20, 30, 40
Emp   Name    Department
1099  Smith   10
1289  Jones   20
1234  Doe     50
6786  Harris  60
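
An anti-join returns the child rows that have no matching parent. The sketch below runs one against the slide's sample data using Python's built-in sqlite3 module; the table and column names are assumptions, and the NOT EXISTS form is one common way to write an anti-join rather than the course's prescribed syntax.

```python
import sqlite3
from typing import List, Tuple

DEPARTMENTS = [(10,), (20,), (30,), (40,)]
EMPLOYEES = [(1099, "Smith", 10), (1289, "Jones", 20),
             (1234, "Doe", 50), (6786, "Harris", 60)]


def find_orphan_employees() -> List[Tuple[int, str, int]]:
    """Return employees whose department is missing from the department table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE department (deptno INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE emp (empno INTEGER, name TEXT, deptno INTEGER)")
    conn.executemany("INSERT INTO department VALUES (?)", DEPARTMENTS)
    conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", EMPLOYEES)
    # Anti-join: keep only emp rows with no matching department row.
    orphans = conn.execute(
        """
        SELECT e.empno, e.name, e.deptno
        FROM emp e
        WHERE NOT EXISTS (
            SELECT 1 FROM department d WHERE d.deptno = e.deptno
        )
        ORDER BY e.empno
        """
    ).fetchall()
    conn.close()
    return orphans
```

With the sample data, Doe (department 50) and Harris (department 60) are the rows that violate referential integrity.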

25 Name and Address Problem
– No unique key
– Missing values
– Personal and commercial names mixed
– Different addresses for same member
– Different names and spelling for same member
– Many names on one line
– One name on two lines
Database 1
NAME                  LOCATION
DIANNE ZIEFELD        N100
HARRY H. ENFIELD      D589
FRED AND SARA MULLEN  M300
Database 2
NAME                  LOCATION
ZIEFLED, DIANNE       100
ENFIELD, HARRY H      589
MULLEN, SARA AND FRED 300

26 Name and Address Problem
Single-field format:
Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565
Multiple-field format:
Name    Mr. J. Smith
Street  100 Main St.
Town    Bigtown
County  County Luth
Code    23565
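
Converting the single-field format into the multiple-field format can be sketched as a comma split. This is a deliberately naive illustration: it assumes exactly five comma-separated parts in the order shown on the slide, and the names ADDRESS_FIELDS and split_address are hypothetical. Real name-and-address data usually needs the dedicated tools the lesson mentions.

```python
from typing import Dict

# Field order taken from the slide's multiple-field example.
ADDRESS_FIELDS = ("name", "street", "town", "county", "code")


def split_address(line: str) -> Dict[str, str]:
    """Split a comma-separated address line into named fields."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != len(ADDRESS_FIELDS):
        raise ValueError(f"expected {len(ADDRESS_FIELDS)} parts, got {len(parts)}")
    return dict(zip(ADDRESS_FIELDS, parts))
```

Lines with more or fewer parts are rejected rather than guessed at, so they can be routed to manual review.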

27 Clean and Organize
1. Create atomic values.
2. Standardize formats.
3. Verify data accuracy.
4. Match with other records.
5. Identify private and commercial addresses and inhabitants.
6. Document in metadata.
Requires sophisticated tools and techniques

28 Merging Data
– Operational transactions do not usually map one-to-one with warehouse data
– Data for the warehouse is merged to provide information for analysis
Pizza sales/returns by day, hour, seconds:
Sale    1/2/98 12:00:01  Ham Pizza      $10.00
Sale    1/2/98 12:00:02  Cheese Pizza   $15.00
Sale    1/2/98 12:00:02  Anchovy Pizza  $12.00
Return  1/2/98 12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/98 12:00:04  Sausage Pizza  $11.00

29 Merging Data
Operational transactions:
Sale    1/2/98 12:00:01  Ham Pizza      $10.00
Sale    1/2/98 12:00:02  Cheese Pizza   $15.00
Sale    1/2/98 12:00:04  Sausage Pizza  $11.00
Sale    1/2/98 12:00:02  Anchovy Pizza  $12.00
Return  1/2/98 12:00:03  Anchovy Pizza  -$12.00
Merged warehouse data:
Sale    1/2/98 12:00:01  Ham Pizza      $10.00
Sale    1/2/98 12:00:02  Cheese Pizza   $15.00
Sale    1/2/98 12:00:04  Sausage Pizza  $11.00
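
The merge shown on this slide, where a sale and its later return cancel out, could be sketched as follows. The matching rule (same item, opposite amount, most recent sale first) is an assumption made for illustration, as are the tuple layout and the name net_out_returns.

```python
from decimal import Decimal
from typing import List, Tuple

Transaction = Tuple[str, str, str, Decimal]  # (kind, timestamp, item, amount)


def net_out_returns(transactions: List[Transaction]) -> List[Transaction]:
    """Drop each return together with the most recent matching sale."""
    kept: List[Transaction] = []
    for txn in transactions:
        kind, _, item, amount = txn
        if kind != "Return":
            kept.append(txn)
            continue
        # Search backwards for a sale of the same item and opposite amount.
        for i in range(len(kept) - 1, -1, -1):
            k_kind, _, k_item, k_amount = kept[i]
            if k_kind == "Sale" and k_item == item and k_amount == -amount:
                del kept[i]
                break
        else:
            kept.append(txn)  # unmatched return stays visible for review
    return kept
```

Applied to the five pizza transactions in time order, this leaves the Ham, Cheese, and Sausage sales, matching the three merged rows on the slide.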

30 Adding a Date Stamp
– Enables time analysis
– Label loaded data with a date stamp
– Add time to fact and dimension data

31 Adding a Date Stamp
Item Table: Item_id, Dept_id, Time_key
Time Table: Week_id, Period_id, Year_id, Time_key
Store Table: Store_id, District_id, Time_key
Product Table: Product_id, Time_key, Product_desc
Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units

32 Adding a Date Stamp
Fact table:
– Add triggers
– Recode applications
– Compare tables
Dimension table
Time representation:
– Point in time
– Time span

33 Adding Keys to Data
Operational data:
#1  Sale    1/2/98 12:00:01  Ham Pizza      $10.00
#2  Sale    1/2/98 12:00:02  Cheese Pizza   $15.00
#3  Sale    1/2/98 12:00:02  Anchovy Pizza  $12.00
#5  Sale    1/2/98 12:00:04  Sausage Pizza  $11.00
#4  Return  1/2/98 12:00:03  Anchovy Pizza  -$12.00
Warehouse data:
#dw1  Sale  1/2/98 12:00:01  Ham Pizza      $10.00
#dw2  Sale  1/2/98 12:00:02  Cheese Pizza   $15.00
#dw3  Sale  1/2/98 12:00:04  Sausage Pizza  $11.00
Data values or artificial keys
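
Assigning artificial keys to merged rows can be sketched with a simple counter. This is an illustration only: the "#dw" prefix follows the slide's example, while the function name and the tuple layout are assumptions. In a real warehouse the key would more likely come from a database sequence.

```python
from itertools import count
from typing import List, Tuple


def assign_warehouse_keys(rows: List[Tuple], prefix: str = "dw") -> List[Tuple]:
    """Prepend an artificial key (#dw1, #dw2, ...) to each merged row."""
    counter = count(1)
    return [(f"#{prefix}{next(counter)}", *row) for row in rows]
```

The new keys are independent of the operational keys (#1 to #5), so gaps left by merged-away rows do not carry over into the warehouse.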

34 Summarizing Data
– During extraction on staging area
– After loading onto the warehouse server
[Figure labels: Operational databases, Staging area, Warehouse database]

35 Maintaining Transformation Metadata
Contains transformation rules, algorithms, and routines
[Figure labels: Sources, Extract, Stage, Transform, Rules, Load, Publish, Query]

36 Maintaining Transformation Metadata
– Key restructuring
– Coding differences
– Multiple sources
– Exception rules
– Format differences
– Referential integrity fixes
– Aggregated data

37 Data Ownership and Responsibilities
– Operational and application development teams
– Data warehouse development team
– Business benefit gained with a one-team approach

38 Transformation Timing and Location
Transformation is performed:
– Before load
– In parallel
May be initiated at different points
[Figure: sample key values 12M65431, 12-m-65421, “12m65421”, “ ”; labels Unlikely, Probable, Possible]

39 Choosing a Transformation Point
– Workload
– Environment impact
– CPU use
– Disk space
– Network bandwidth
– Parallel execution
– Load window time
– User information needs

40 Monitoring and Tracking
Transforms should:
– Be self-documenting
– Provide summary statistics
– Handle process exceptions
[Figure: sample key values 12M65431, 12-m-65421, “12m65421”, “ ”; row counts 1,200, 1,400, 100, 6,001, 20,890 for steps 1 to 5]
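
The last two requirements above, summary statistics and exception handling, could be sketched as a small wrapper around any row-level transform. Everything here is hypothetical: the TransformStats fields, the run_transform name, and the choice to treat ValueError and KeyError as per-row rejects.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class TransformStats:
    rows_read: int = 0
    rows_loaded: int = 0
    rejects: List[Tuple[Any, str]] = field(default_factory=list)


def run_transform(rows: Iterable[Any],
                  transform: Callable[[Any], Any]) -> Tuple[List[Any], TransformStats]:
    """Apply a transform to every row, collecting counts and rejected rows."""
    stats = TransformStats()
    output = []
    for row in rows:
        stats.rows_read += 1
        try:
            output.append(transform(row))
            stats.rows_loaded += 1
        except (ValueError, KeyError) as exc:
            # Record the bad row and the reason instead of stopping the run.
            stats.rejects.append((row, str(exc)))
    return output, stats
```

After a run, rows_read, rows_loaded, and the rejects list give the kind of per-step counts the slide's figure appears to show.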

41 Designing Transformation Processes
Analysis:
– Sources and target mappings, business rules
– Key users, metadata, grain
Design options: PL/SQL, replication, custom, third-party tools
Design issues:
– Performance
– Size of the staging area
– Exception handling, integrity maintenance

42 Transformation Tools
– Purchased
– SQL*Loader
– In-house developed

43 Data Management, Quality and Auditing Tools
Data management:
– Innovative Systems
– Postalsoft
– Vality Technology
Data quality and auditing:
– Innovative Systems
– Vality Technology

44 Summary
This lesson discussed the following topics:
– Importance of data quality
– Transformation process
– Data transformation issues
– Data anomalies
– Name and address management
– Tools

45 Practice 11-1 Overview
This practice covers the following topics:
– Answering a series of short questions
– Specifying true or false to a series of statements

