Published by Douglas Neal. Modified over 8 years ago.
1
Copyright Oracle Corporation, 1999. All rights reserved.
11 Transforming Data
2
11-2 Overview
Project Management (Methodology, Maintaining Metadata)
Defining DW Concepts & Terminology
Planning for a Successful Warehouse
Analyzing User Query Needs
Choosing a Computing Architecture
Modeling the Data Warehouse
Planning Warehouse Storage
ETT (Building the Warehouse)
Meeting a Business Need
Supporting End User Access
Managing the Data Warehouse
3
11-3 Objectives
After completing this lesson, you should be able to do the following:
Explain the importance of quality data
Define the term “transformation”
Identify transformation issues
Describe techniques for transforming data
List tools that can be used to transform data
4
11-4 Importance of Data Quality
Figure labels: Summit Sports, Hollywood, Speedy Pizza
5
11-5 Benefits of Quality Data
Clean data is essential for:
–Targeting customers
–Determining buying patterns
–Identifying householders: private and commercial
–Matching customers
–Identifying historical data
Dirty data must be removed.
6
11-6 Standards
Define a quality strategy.
Decide on optimal data-quality level.
7
11-7 Quality Improvements
Consider modifying rules for operational data.
Document the sources.
Create a data stewardship program.
Design the cleanup process carefully.
Initial cleanup and refresh routines may differ.
8
11-8 Guidelines
Operational data should not be used directly in the warehouse
Operational data must be cleaned for each increment
Operational data is not simply fixed by modifying applications
9
11-9 Solutions
Conventional COBOL, 4GL
Specialized tools
Customized conversion process
Business experts
Figure labels: Investigation, Conditioning, Standardization, Integration
10
11-10 Management
Poor data quality: own, take responsibility, resolve problems
Data quality manager
11
11-11 Transformation
Transformation eliminates operational data anomalies
Cleans
Standardizes
Presents subject-oriented data
Figure labels: Operational system, Extract, Data staging area, Transform (Clean up, Consolidate, Restructure), Transport (Load), Warehouse
12
11-12 Source Data Anomalies
No unique key
Data naming and coding anomalies
Data meaning anomalies between groups
Spelling and text inconsistencies

CUSNUM    NAME                ADDRESS
90328575  Oracle Corp         100 NE 1st Street, Tampa
90328575  Oracle              100 NE. First St., Tampa
90238475  Oracle Services     100 North East 1st St., FLA
90233479  Oracle Limited      100 N.E. 1st St.
90233489  Oracle Computing    15 Main Road, Ft. Lauderdale
90234889  Oracle Corp. UK     15 Main Road, Ft. Lauderdale, FLA
90345672  Oracle Corp UK Ltd  181 North Street, Key West, FLA
13
11-13 Transformation Routines
Cleaning data
Eliminating inconsistencies
Adding elements
Merging data
Integrating data
Transforming data before load
14
11-14 Transforming Data: Problems and Solutions
Multipart keys
Key components: Country code, Sales territory, Product number, Salesperson code
Product code = 12 M65431345
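A multipart key like the one above can be broken into its components before load. The following is a minimal sketch only: the slide does not state where one component ends and the next begins, so the field widths (2, 1, 6, 2) and the names `ProductKey` and `split_multipart_key` are assumptions made for illustration.

```python
from typing import NamedTuple


class ProductKey(NamedTuple):
    country_code: str
    sales_territory: str
    product_number: str
    salesperson_code: str


# ASSUMPTION: fixed component widths, chosen only so the slide's sample
# value splits cleanly. Real widths come from the source system's documentation.
FIELD_WIDTHS = (2, 1, 6, 2)


def split_multipart_key(raw: str) -> ProductKey:
    """Split a concatenated product code into its four components."""
    compact = raw.replace(" ", "")
    if len(compact) != sum(FIELD_WIDTHS):
        raise ValueError(f"unexpected key length: {raw!r}")
    parts = []
    pos = 0
    for width in FIELD_WIDTHS:
        parts.append(compact[pos:pos + width])
        pos += width
    return ProductKey(*parts)


key = split_multipart_key("12 M65431345")
print(key.country_code, key.sales_territory, key.product_number, key.salesperson_code)
```

Rejecting keys of unexpected length, rather than guessing, keeps malformed source values visible to whoever maintains the load.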
15
11-15 Transforming Data
Multiple encoding
Must pick up erroneous data
Example encodings:
m, f
1, 0
male, female
Example source values:
m, f
mle, female
1, NULL
If field not in ('m', 1, 'male') then …
else if field is NULL then …
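The multiple-encoding check can be sketched as a lookup table plus explicit handling for NULL and unrecognized values. The target codes ("M", "F", "U") and the function name `normalize_gender` are assumptions for illustration; the slide only shows the source encodings.

```python
# Source encodings taken from the slide: m/f, 1/0, male/female.
# ASSUMPTION: the warehouse stores "M", "F", or "U" (unknown).
GENDER_CODES = {
    "m": "M", "1": "M", "male": "M",
    "f": "F", "0": "F", "female": "F",
}


def normalize_gender(value):
    """Return (warehouse_code, status) for one source value.

    status is "ok", "missing" (NULL in the source), or "error"
    (a value that matches none of the known encodings, such as "mle").
    """
    if value is None:
        return ("U", "missing")
    key = str(value).strip().lower()
    if key in GENDER_CODES:
        return (GENDER_CODES[key], "ok")
    return ("U", "error")


for raw in ["m", 1, "female", "mle", None]:
    print(repr(raw), normalize_gender(raw))
```

Returning a status alongside the code lets the load routine count and report erroneous rows instead of silently defaulting them.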
16
11-16 Transforming Data
Multiple local standards
Tools or filters to preprocess
Examples:
cm, inches, cm
DD/MM/YY, MM/DD/YY, DD-Mon-YY
1,000 GBP, FF 9,990, USD 600
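A preprocessing filter for local standards might look like the sketch below. It assumes each source system declares which date format and length unit it uses, because a value such as 01/02/98 cannot be disambiguated from the data alone. The helper names are hypothetical, and currency conversion is left out because it needs exchange rates the slide does not give.

```python
from datetime import datetime

# Map the slide's format labels to strptime patterns.
# Note: %b (month abbreviation) depends on the process locale; English is assumed.
DATE_FORMATS = {
    "DD/MM/YY": "%d/%m/%y",
    "MM/DD/YY": "%m/%d/%y",
    "DD-Mon-YY": "%d-%b-%y",
}

CM_PER_INCH = 2.54  # exact by definition


def to_iso_date(text: str, source_format: str) -> str:
    """Parse a date in the declared source format and return YYYY-MM-DD."""
    pattern = DATE_FORMATS[source_format]
    return datetime.strptime(text.strip(), pattern).date().isoformat()


def to_cm(value: float, unit: str) -> float:
    """Convert a length to centimetres; only 'cm' and 'inches' are handled."""
    if unit == "cm":
        return float(value)
    if unit == "inches":
        return float(value) * CM_PER_INCH
    raise ValueError(f"unknown unit: {unit!r}")


print(to_iso_date("01/02/98", "DD/MM/YY"))
print(to_iso_date("01/02/98", "MM/DD/YY"))
print(to_cm(10, "inches"))
```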
17
11-17 Multiple Files Problem
Added complexity of multiple source files
Start simple
Figure labels: Multiple source files, Logic to detect correct source, Extracted data
18
11-18 Transforming Data from Multiple Files
19
11-19 Missing Values Problem
Solution
Ignore
Wait
Mark rows
Extract when time-stamped
If NULL then field = 'A'
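The "mark rows" option can be sketched as follows: fill the missing field with a marker value (the slide uses 'A') and record that the value was filled, so later analysis can tell defaults from real data. The `_was_missing` flag column and the function name are assumptions for illustration.

```python
def mark_missing(rows, field, marker="A"):
    """Return copies of rows with NULL values in `field` replaced by `marker`.

    Each output row gains a boolean '<field>_was_missing' flag
    (a hypothetical column name) so that filled values stay traceable.
    """
    marked = []
    for row in rows:
        fixed = dict(row)  # do not mutate the caller's data
        was_missing = fixed.get(field) is None
        if was_missing:
            fixed[field] = marker
        fixed[field + "_was_missing"] = was_missing
        marked.append(fixed)
    return marked


source = [{"id": 1, "grade": "B"}, {"id": 2, "grade": None}]
for row in mark_missing(source, "grade"):
    print(row)
```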
20
11-20 Duplicate Value Problem
Solution
SQL self-join techniques
RDBMS constraint utilities
Figure label: ACME Inc
SELECT … FROM table_a, table_b
WHERE table_a.key (+) = table_b.key
UNION
SELECT … FROM table_a, table_b
WHERE table_a.key = table_b.key (+)
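The self-join technique can be tried out with Python's built-in `sqlite3` module. This sketch is not the slide's query: the slide uses Oracle's `(+)` outer-join notation, whereas the example below uses a plain inner self-join in SQLite. The `customer` table, its columns, and the matching rule (case-insensitive, trimmed name equality) are assumptions for illustration.

```python
import sqlite3


def find_duplicate_pairs(rows):
    """Return (id_a, id_b, name_a) for rows whose names match after cleanup.

    rows: iterable of (id, name) tuples, loaded into a scratch in-memory table.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)", rows)
    # Self-join: pair every row with every later row that has the same
    # normalized name. a.id < b.id avoids self-matches and mirrored pairs.
    pairs = conn.execute(
        """
        SELECT a.id, b.id, a.name
        FROM customer a
        JOIN customer b
          ON UPPER(TRIM(a.name)) = UPPER(TRIM(b.name))
         AND a.id < b.id
        ORDER BY a.id, b.id
        """
    ).fetchall()
    conn.close()
    return pairs


print(find_duplicate_pairs([(1, "ACME Inc"), (2, "acme inc "), (3, "Summit Sports")]))
```

Once duplicates are resolved, a unique constraint on the cleaned key is one way to stop them from reappearing on later loads.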
21
11-21 Element Names Problem
Solution
CTAS
SQL*Loader
Figure labels: Customer, Client, Contact, Name
22
11-22 Element Meaning Problem
Customer_detail:
Customer’s name
All customer details
All details except name
Avoid misinterpretation
Complex solution
Document meaning in metadata
23
11-23 Input Format Problem
ASCII, EBCDIC
12373, “123-73”
ACME Co., áøåëéí äáàéí, Beer (Pack of 8)
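Two of the input format issues above, character set (ASCII versus EBCDIC) and numbers stored as punctuated strings, can be sketched with the Python standard library. The code page `cp037` is an assumption; an actual mainframe feed may use a different EBCDIC variant, which should be confirmed with the source system owners. The function names are hypothetical.

```python
import re


def ebcdic_to_text(raw: bytes, codepage: str = "cp037") -> str:
    """Decode EBCDIC bytes to a Python string.

    ASSUMPTION: code page 037; other EBCDIC code pages map some
    characters differently.
    """
    return raw.decode(codepage)


def digits_only(value: str) -> int:
    """Turn a punctuated numeric string such as '123-73' into the integer 12373."""
    digits = re.sub(r"\D", "", value)
    if not digits:
        raise ValueError(f"no digits in {value!r}")
    return int(digits)


ebcdic_bytes = "ACME Co.".encode("cp037")   # simulate a mainframe extract
print(ebcdic_bytes)                          # not readable as ASCII
print(ebcdic_to_text(ebcdic_bytes))
print(digits_only("123-73"))
```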
24
11-24 Referential Integrity Problem
Solution
SQL anti-join
Server constraints
Dedicated tools

Department
10
20
30
40

Emp   Name    Department
1099  Smith   10
1289  Jones   20
1234  Doe     50
6786  Harris  60
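The anti-join can be demonstrated on the slide's own sample data using Python's built-in `sqlite3` module. This is a sketch in SQLite rather than Oracle syntax, with hypothetical table and column names; it finds employee rows whose department has no match in the department table.

```python
import sqlite3


def find_orphans(departments, employees):
    """Return employee rows whose dept_id is absent from the department table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE emp (emp_id INTEGER, name TEXT, dept_id INTEGER)")
    conn.executemany("INSERT INTO department VALUES (?)", [(d,) for d in departments])
    conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", employees)
    # Anti-join: keep only the rows for which no parent row exists.
    orphans = conn.execute(
        """
        SELECT e.emp_id, e.name, e.dept_id
        FROM emp e
        WHERE NOT EXISTS (
            SELECT 1 FROM department d WHERE d.dept_id = e.dept_id
        )
        ORDER BY e.emp_id
        """
    ).fetchall()
    conn.close()
    return orphans


departments = [10, 20, 30, 40]
employees = [(1099, "Smith", 10), (1289, "Jones", 20), (1234, "Doe", 50), (6786, "Harris", 60)]
print(find_orphans(departments, employees))
```

The two rows returned (departments 50 and 60) are the ones that would violate a foreign-key constraint if loaded unchanged.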
25
11-25 Name and Address Problem
No unique key
Missing values
Personal and commercial names mixed
Different addresses for same member
Different names and spelling for same member
Many names on one line
One name on two lines

DATABASE    NAME                  LOCATION
Database 1  DIANNE ZIEFELD        N100
Database 1  HARRY H. ENFIELD      D589
Database 1  FRED AND SARA MULLEN  M300
Database 2  ZIEFLED, DIANNE       100
Database 2  ENFIELD, HARRY H      589
Database 2  MULLEN, SARA AND FRED 300
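Matching names that differ in word order and spelling, such as DIANNE ZIEFELD and ZIEFLED, DIANNE, usually needs fuzzy comparison. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a simple stand-in; it is not the method of the dedicated tools the lesson refers to, and the normalization rules and any threshold are assumptions that need tuning on real data.

```python
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Uppercase, drop periods, and turn 'LAST, FIRST' into 'FIRST LAST'."""
    cleaned = name.upper().replace(".", "").strip()
    if "," in cleaned:
        last, first = cleaned.split(",", 1)
        cleaned = f"{first.strip()} {last.strip()}"
    return " ".join(cleaned.split())  # collapse repeated spaces


def name_similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0 of two names after normalization."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


print(name_similarity("DIANNE ZIEFELD", "ZIEFLED, DIANNE"))
print(name_similarity("HARRY H. ENFIELD", "ENFIELD, HARRY H"))
print(name_similarity("HARRY H. ENFIELD", "ZIEFLED, DIANNE"))
```

Candidate matches above a chosen threshold would normally go to a reviewer or a second check (for example, comparing the location code) rather than being merged automatically.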
26
11-26 Name and Address Problem
Single-field format:
Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565
Multiple-field format:
Name    Mr. J. Smith
Street  100 Main St.
Town    Bigtown
County  County Luth
Code    23565
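Converting the single-field format into the multiple-field format can be sketched as a comma split. This assumes every record has exactly five comma-separated parts in the order shown on the slide and that no part contains a comma itself; real address data rarely behaves this well, which is why the lesson points to specialized tools. The function name is hypothetical.

```python
ADDRESS_FIELDS = ("name", "street", "town", "county", "code")


def split_address(line: str) -> dict:
    """Split 'name,street,town,county,code' into a dict of atomic fields."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != len(ADDRESS_FIELDS):
        # Do not guess: route unexpected shapes to manual review.
        raise ValueError(f"expected {len(ADDRESS_FIELDS)} parts, got {len(parts)}: {line!r}")
    return dict(zip(ADDRESS_FIELDS, parts))


record = split_address("Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565")
for field, value in record.items():
    print(f"{field:7}{value}")
```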
27
11-27 Clean and Organize
1. Create atomic values.
2. Standardize formats.
3. Verify data accuracy.
4. Match with other records.
5. Identify private and commercial addresses and inhabitants.
6. Document in metadata.
Requires sophisticated tools and techniques
28
11-28 Merging Data
Operational transactions do not usually map one-to-one with warehouse data
Data for the warehouse is merged to provide information for analysis

Sale    1/2/98 12:00:01  Ham Pizza       $10.00
Sale    1/2/98 12:00:02  Cheese Pizza    $15.00
Sale    1/2/98 12:00:02  Anchovy Pizza   $12.00
Return  1/2/98 12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/98 12:00:04  Sausage Pizza   $11.00
Pizza sales/returns by day, hour, seconds
29
11-29 Merging Data
Sale    1/2/98 12:00:01  Ham Pizza       $10.00
Sale    1/2/98 12:00:02  Cheese Pizza    $15.00
Sale    1/2/98 12:00:04  Sausage Pizza   $11.00

Sale    1/2/98 12:00:02  Anchovy Pizza   $12.00
Return  1/2/98 12:00:03  Anchovy Pizza  -$12.00

Sale    1/2/98 12:00:01  Ham Pizza       $10.00
Sale    1/2/98 12:00:02  Cheese Pizza    $15.00
Sale    1/2/98 12:00:04  Sausage Pizza   $11.00
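One way to read the merge shown above is that a return cancels the matching earlier sale, leaving only net sales for the warehouse. The sketch below implements that reading; the matching rule (same item, equal and opposite amount, most recent sale first) is an assumption, since the slides do not spell out the rule.

```python
from decimal import Decimal


def merge_sales_and_returns(transactions):
    """Drop each return together with the sale it cancels.

    transactions: list of (kind, timestamp, item, amount) tuples in
    chronological order. Unmatched returns are kept so they can be reviewed.
    """
    kept = []
    for tx in transactions:
        kind, _timestamp, item, amount = tx
        if kind == "Return":
            # Search backwards for the most recent sale this return offsets.
            for i in range(len(kept) - 1, -1, -1):
                prior = kept[i]
                if prior[0] == "Sale" and prior[2] == item and prior[3] == -amount:
                    del kept[i]
                    break
            else:
                kept.append(tx)
        else:
            kept.append(tx)
    return kept


day = [
    ("Sale", "1/2/98 12:00:01", "Ham Pizza", Decimal("10.00")),
    ("Sale", "1/2/98 12:00:02", "Cheese Pizza", Decimal("15.00")),
    ("Sale", "1/2/98 12:00:02", "Anchovy Pizza", Decimal("12.00")),
    ("Return", "1/2/98 12:00:03", "Anchovy Pizza", Decimal("-12.00")),
    ("Sale", "1/2/98 12:00:04", "Sausage Pizza", Decimal("11.00")),
]
for row in merge_sales_and_returns(day):
    print(row)
```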
30
11-30 Adding a Date Stamp
Enables time analysis
Label loaded data with a date stamp
Add time to fact and dimension data
31
11-31 Adding a Date Stamp
Item Table: Item_id, Dept_id, Time_key
Time Table: Week_id, Period_id, Year_id, Time_key
Store Table: Store_id, District_id, Time_key
Product Table: Product_id, Time_key, Product_desc
Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units
32
11-32 Adding a Date Stamp
Fact table
–Add triggers
–Recode applications
–Compare tables
Dimension table
Time representation
–Point in time
–Time span
33
11-33 Adding Keys to Data
#1    Sale    1/2/98 12:00:01  Ham Pizza       $10.00
#2    Sale    1/2/98 12:00:02  Cheese Pizza    $15.00
#3    Sale    1/2/98 12:00:02  Anchovy Pizza   $12.00
#5    Sale    1/2/98 12:00:04  Sausage Pizza   $11.00
#4    Return  1/2/98 12:00:03  Anchovy Pizza  -$12.00

#dw1  Sale    1/2/98 12:00:01  Ham Pizza       $10.00
#dw2  Sale    1/2/98 12:00:02  Cheese Pizza    $15.00
#dw3  Sale    1/2/98 12:00:04  Sausage Pizza   $11.00
Data values or artificial keys
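Assigning artificial keys to the merged rows can be sketched with a simple counter. The `dw` prefix follows the slide's #dw1, #dw2, #dw3 labels; the function name and the choice to generate keys in Python rather than with a database sequence are assumptions for illustration.

```python
from itertools import count


def add_surrogate_keys(rows, prefix="dw", start=1):
    """Prepend an artificial warehouse key to each row, in input order."""
    sequence = count(start)
    return [(f"{prefix}{next(sequence)}",) + tuple(row) for row in rows]


merged = [
    ("Sale", "1/2/98 12:00:01", "Ham Pizza", "$10.00"),
    ("Sale", "1/2/98 12:00:02", "Cheese Pizza", "$15.00"),
    ("Sale", "1/2/98 12:00:04", "Sausage Pizza", "$11.00"),
]
for row in add_surrogate_keys(merged):
    print(row)
```

An artificial key keeps warehouse rows identifiable even when operational keys (#1 to #5 above) are dropped, reused, or changed by the source system.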
34
11-34 Summarizing Data
During extraction on staging area
After loading onto the warehouse server
Figure labels: Operational databases, Staging area, Warehouse database
35
11-35 Maintaining Transformation Metadata
Contains transformation rules, algorithms, and routines
Figure labels: Sources, Extract, Stage, Transform, Rules, Load, Publish, Query
36
11-36 Maintaining Transformation Metadata
Key restructuring
Coding differences
Multiple sources
Exception rules
Format differences
Referential integrity fixes
Aggregated data
37
11-37 Data Ownership and Responsibilities
Operational and application development teams
Data warehouse development team
Business benefit gained with a one-team approach
38
11-38 Transformation Timing and Location
Transformation is performed:
–Before load
–In parallel
May be initiated at different points
Figure: sample values 12M65431, 12-m-65421, “12m65421” split into components; labels Unlikely, Probable, Possible
39
11-39 Choosing a Transformation Point
Workload
Environment impact
CPU use
Disk space
Network bandwidth
Parallel execution
Load window time
User information needs
40
11-40 Monitoring and Tracking
Transforms should:
Be self-documenting
Provide summary statistics
Handle process exceptions
Figure: sample values 12M65431, 12-m-65421, “12m65421” split into components, with a summary table of counts (1,200; 1,400; 100; 6,001; 20,890)
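A transform that provides summary statistics and handles exceptions can be sketched as a small wrapper around any row-level function. The counter names and the choice of which exception types count as data errors are assumptions for illustration.

```python
def run_transform(rows, transform):
    """Apply `transform` to each row, collecting rejects and summary counts.

    Returns (output_rows, rejects, stats), where rejects is a list of
    (original_row, error_message) pairs and stats counts rows read,
    written, and rejected.
    """
    stats = {"read": 0, "written": 0, "rejected": 0}
    output, rejects = [], []
    for row in rows:
        stats["read"] += 1
        try:
            output.append(transform(row))
            stats["written"] += 1
        except (ValueError, KeyError, TypeError) as exc:
            # Keep the bad row and the reason instead of stopping the load.
            rejects.append((row, str(exc)))
            stats["rejected"] += 1
    return output, rejects, stats


clean, bad, summary = run_transform(["1200", "1,400", "100"], int)
print(clean)
print(bad)
print(summary)
```

Writing the `stats` dictionary and the reject list to a log table after every run gives operators the tracking the slide calls for.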
41
11-41 Designing Transformation Processes
Analysis:
–Sources and target mappings, business rules
–Key users, metadata, grain
Design options: PL/SQL, replication, custom, third-party tools
Design issues:
–Performance
–Size of the staging area
–Exception handling, integrity maintenance
42
11-42 Transformation Tools
Purchased
SQL*Loader
In-house developed
43
11-43 Data Management, Quality and Auditing Tools
Data management:
–Innovative Systems
–Postalsoft
–Vality Technology
Data quality and auditing:
–Innovative Systems
–Vality Technology
44
11-44 Summary
This lesson discussed the following topics:
Importance of data quality
Transformation process
Data transformation issues
Data anomalies
Name and address management
Tools
45
11-45 Practice 11-1 Overview
This practice covers the following topics:
Answering a series of short questions
Specifying true or false to a series of statements