Published by Christopher Strickland. Modified over 9 years ago.
1
Extraction, Transformation, and Loading (II)
ISQS 6339, Data Management & Business Intelligence
Zhangxi Lin, Texas Tech University
2
Agenda
I. Using SSIS for ETL
◦ Integration Services
◦ Learn by doing
◦ Package items
◦ Problem-oriented package development
II. The Principle of ETL
◦ Extraction
◦ Transformation
◦ Loading
3
II. The Principle of ETL
4
Structure and Components of Business Intelligence
[Figure: components labeled SSMS, SSIS, SSAS, SSRS, SAS EM, SAS EG]
5
Automating your routine information processing tasks
Your routine information processing tasks:
◦ Read online news at 8:00 a.m. and collect the few most important pieces.
◦ Retrieve data from a database to draft a short daily report at 10:00 a.m.
◦ View and reply to emails, and take notes that are saved in a database.
◦ View 10 companies' webpages to check for updates, and enter the summaries into a database.
◦ Browse three popular magazines twice a week, and enter the summaries into a database.
◦ Generate a few one-way and two-way frequency tables and put them on the web.
◦ Merge datasets collected by other people into a main database.
◦ Prepare a weekly report from the database at 4:00 p.m. every Monday and publish it to the internal portal site.
◦ Prepare a monthly report at 11:00 a.m. on the first day of each month, convert it to a PDF file, and upload it to the website.
Many things are going on at once. How can they be handled properly and at the right time?
An organizer helps with the schedule. But what about regular data processing tasks?
6
Information Processing and Information Flow
◦ Transaction processing: interactions between a user and a computer application system, with immediate responses from the application.
◦ Operational processing: using a computer to control a process.
◦ Batch processing: a series of executions, each of which is applied to a set of data and passes its result to the next one.
◦ Analytical processing: the interaction between analysts and collections of aggregated data that may have been reformulated into alternative representational forms for improved analytical performance.
7
Extraction, Transformation, Loading (ETL) Processes
◦ Extract source data
◦ Transform/clean data
◦ Index and summarize
◦ Load data into warehouse
◦ Detect changes
◦ Refresh data
[Figure: data flows from operational systems through ETL (programs, tools, gateways) to the data warehouse]
8
ETL: Tasks, Importance, and Cost
ETL tasks:
◦ Extract
◦ Clean up
◦ Consolidate
◦ Restructure
◦ Load
◦ Maintain
◦ Refresh
[Figure: ETL sits between operational systems and the data warehouse; labels: relevant, useful, quality, accurate, accessible]
9
Extracting Data
Source systems:
◦ Data from various data sources in various formats
Extraction routines:
◦ Developed to select data fields from sources
◦ Consist of business rules, audit trails, error correction facilities
[Figure: operational databases, data mapping and transform, data staging area, warehouse database]
10
Production Data
◦ Operating system platforms
◦ File systems
◦ Database systems and vertical applications
Examples: IMS, DB2, Oracle, Sybase, Informix, VSAM, SAP, Shared Medical Systems, Dun and Bradstreet Financials, Hogan Financials, Oracle Financials
11
Archive Data
Historical data:
◦ Useful for analysis over long periods of time
◦ Useful for first-time load
◦ May require unique transformations
[Figure: operational databases, warehouse database]
12
Internal Data
◦ Planning, sales, and marketing organization data
◦ Maintained in the form of:
  ◦ Spreadsheets (structured)
  ◦ Documents (unstructured)
◦ Treated like any other source data
[Figure: planning, accounting, and marketing data feeding the warehouse database]
13
External Data
◦ Information from outside the organization
◦ Issues of frequency, format, and predictability
◦ Described and tracked using metadata
Sources: A.C. Nielsen, IRI, IMS, Walsh America; Barron's; Dun and Bradstreet; purchased databases; Wall Street Journal; economic forecasts; competitive information
[Figure: external sources feeding the warehousing databases]
14
Possible ETL Failures
◦ A missing source file
◦ A system failure
◦ Inadequate metadata
◦ Poor mapping information
◦ Inadequate storage planning
◦ A source structural change
◦ No contingency plan
◦ Inadequate data validation
15
Maintaining ETL Quality
ETL must be:
◦ Tested
◦ Documented
◦ Monitored and reviewed
Disparate metadata must be coordinated.
16
Transformation
Transformation eliminates anomalies from operational data:
◦ Cleans and standardizes
◦ Presents subject-oriented data
Transform: clean up, consolidate, restructure
[Figure: operational systems, extract, data staging area (transform), load, warehouse]
17
Remote Staging Model
◦ Data staging area within the warehouse environment
◦ Data staging area in its own environment
[Figure: in both variants, data flows from the operational system through extract, the staging area (transform), and load to the warehouse]
18
On-site Staging Model
Data staging area within the operational environment, possibly affecting the operational system
[Figure: operational system with staging area (transform), extract, load, warehouse]
19
Data Anomalies
◦ No unique key
◦ Data naming and coding anomalies
◦ Data meaning anomalies between groups
◦ Spelling and text inconsistencies
CUSNUM    NAME                ADDRESS
90233479  Oracle Limited      100 N.E. 1st St.
90233489  Oracle Computing    15 Main Road, Ft. Lauderdale
90234889  Oracle Corp. UK     15 Main Road, Ft. Lauderdale, FLA
90345672  Oracle Corp UK Ltd  181 North Street, Key West, FLA
20
Transformation Routines
◦ Cleaning data
◦ Eliminating inconsistencies
◦ Adding elements
◦ Merging data
◦ Integrating data
◦ Transforming data before load
21
Transforming Data: Problems and Solutions
◦ Multipart keys
◦ Multiple local standards
◦ Multiple files
◦ Missing values
◦ Duplicate values
◦ Element names
◦ Element meanings
◦ Input formats
◦ Referential integrity constraints
◦ Name and address
22
Multipart Keys Problem
Multipart keys
Product code = 12 M 654313 45
◦ Country code
◦ Sales territory
◦ Product number
◦ Salesperson code
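A minimal sketch of how the multipart key on this slide could be split into separate warehouse columns. It assumes the four parts appear in the order the labels are listed (country code, sales territory, product number, salesperson code) and that they are separated by spaces; the slide does not state either explicitly. `ProductKey` and `split_product_code` are hypothetical names.

```python
from typing import NamedTuple


class ProductKey(NamedTuple):
    # Field order is an assumption taken from the order of the slide's labels.
    country_code: str
    sales_territory: str
    product_number: str
    salesperson_code: str


def split_product_code(code: str) -> ProductKey:
    """Split a space-delimited multipart key into its four components."""
    parts = code.split()
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts, got {len(parts)}: {code!r}")
    return ProductKey(*parts)


key = split_product_code("12 M 654313 45")
print(key.product_number)
```

Keeping each component in its own column lets the warehouse filter or join on, say, sales territory without parsing the key at query time.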
23
Multiple Local Standards Problem
Multiple local standards
◦ Tools or filters to preprocess
Examples of local standards to reconcile:
◦ Units of measure: cm, inches
◦ Currencies: USD 600; 1,000 GBP; FF 9,990
◦ Date formats: DD/MM/YY, MM/DD/YY, DD-Mon-YY
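A preprocessing filter of the kind mentioned on this slide might look like the sketch below. It assumes each source system is known to use one date format, that DD-Mon-YY is the warehouse standard, and that lengths are stored in centimetres; the source names `uk_sales` and `us_sales` are made up for illustration. Month abbreviations from `%b` depend on the process locale (English in the default C locale).

```python
from datetime import datetime

# Hypothetical source systems and the date format each one is assumed to use.
SOURCE_DATE_FORMATS = {
    "uk_sales": "%d/%m/%y",  # DD/MM/YY
    "us_sales": "%m/%d/%y",  # MM/DD/YY
}

CM_PER_INCH = 2.54


def to_warehouse_date(value: str, source: str) -> str:
    """Parse a date in the source's local format and emit DD-Mon-YY."""
    parsed = datetime.strptime(value, SOURCE_DATE_FORMATS[source])
    return parsed.strftime("%d-%b-%y")


def inches_to_cm(inches: float) -> float:
    """Convert a length in inches to centimetres."""
    return inches * CM_PER_INCH


# The same string means different days depending on where it came from.
print(to_warehouse_date("01/02/02", "uk_sales"))
print(to_warehouse_date("01/02/02", "us_sales"))
```

Currency conversion is left out on purpose: it needs an exchange rate for a specific date, which the slide does not supply.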
24
Multiple Files Problem
◦ Added complexity of multiple source files
◦ Start simple
[Figure: multiple source files pass through logic to detect the correct source, producing transformed data]
25
Missing Values Problem
Solution:
◦ Ignore
◦ Wait
◦ Mark rows
◦ Extract when time-stamped
If NULL then field = 'A'
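The "mark rows" option and the rule "If NULL then field = 'A'" can be sketched as follows. This is one possible reading: the function name `mark_missing` and the extra `was_missing` flag are assumptions added so that substituted values stay distinguishable from real ones.

```python
def mark_missing(rows, field, default="A"):
    """Return copies of rows with NULL (None) values replaced and flagged."""
    marked = []
    for row in rows:
        fixed = dict(row)  # copy so the source rows are left untouched
        missing = fixed.get(field) is None
        if missing:
            fixed[field] = default
        fixed["was_missing"] = missing
        marked.append(fixed)
    return marked


source_rows = [
    {"id": 1, "grade": "B"},
    {"id": 2, "grade": None},
]
cleaned = mark_missing(source_rows, "grade")
print(cleaned)
```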
26
Duplicate Values Problem
Example value: ACME Inc
Solution:
◦ SQL self-join techniques
◦ RDBMS constraint utilities
SELECT ...
FROM table_a, table_b
WHERE table_a.key (+) = table_b.key
UNION
SELECT ...
FROM table_a, table_b
WHERE table_a.key = table_b.key (+);
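As a complement to the SQL on this slide, the sketch below finds candidate duplicates in Python by grouping rows on a normalized name. It deliberately swaps in a plain grouping instead of a SQL self-join, and the normalization rule (lowercase, drop periods, collapse spaces) is an assumption chosen for illustration, not a rule from the slide.

```python
from collections import defaultdict


def find_duplicates(rows, key_field):
    """Group rows by a normalized key and return only groups with 2+ rows."""
    groups = defaultdict(list)
    for row in rows:
        normalized = " ".join(row[key_field].lower().replace(".", "").split())
        groups[normalized].append(row)
    return {key: members for key, members in groups.items() if len(members) > 1}


customers = [
    {"id": 1, "name": "ACME Inc"},
    {"id": 2, "name": "Acme  Inc."},
    {"id": 3, "name": "Bolt Ltd"},
]
duplicates = find_duplicates(customers, "name")
print(duplicates)
```

Rows flagged this way are only candidates; deciding which record survives is a business rule that belongs in the transformation metadata.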
27
Element Names Problem
Solution:
◦ Common naming conventions
Example: Customer, Client, Contact, Name
28
Element Meaning Problem
◦ Avoid misinterpretation
◦ Complex solution
◦ Document meaning in metadata
Example: Customer_detail could be read as the customer's name, all customer details, or all details except name
29
Input Format Problem
◦ Character encodings: ASCII, EBCDIC
◦ Number formats: 12373, "123-73"
◦ Text values: ACME Co., ברוכים הבאים, Beer (Pack of 8)
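The ASCII versus EBCDIC mismatch can be illustrated with Python's built-in codecs. This sketch assumes the mainframe source uses the `cp037` EBCDIC code page; real systems may use a different one, so the code page should come from the source system's documentation.

```python
# Simulate a field as it would arrive from an EBCDIC source (cp037 assumed).
ebcdic_bytes = "ACME Co.".encode("cp037")

# The same bytes are not valid ASCII: 'A' is 0xC1 in EBCDIC, 0x41 in ASCII.
try:
    ebcdic_bytes.decode("ascii")
    readable_as_ascii = True
except UnicodeDecodeError:
    readable_as_ascii = False

# Decoding with the correct code page recovers the text for the warehouse.
text = ebcdic_bytes.decode("cp037")
print(readable_as_ascii, text)
```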
30
Referential Integrity Problem
Solution:
◦ SQL anti-join
◦ Server constraints
◦ Dedicated tools
Department table:
Department
10
20
30
40
Employee table:
Emp   Name    Department
1099  Smith   10
1289  Jones   20
1234  Doe     50
6786  Harris  60
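The SQL anti-join named on this slide can be tried against the slide's own data using Python's built-in `sqlite3` module. The table and column names (`department`, `employee`, `dept_id`, `emp_id`) are assumptions; the slide only shows the values. `NOT EXISTS` is one common way to write an anti-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY);
    CREATE TABLE employee (emp_id INTEGER, name TEXT, dept_id INTEGER);
    INSERT INTO department VALUES (10), (20), (30), (40);
    INSERT INTO employee VALUES
        (1099, 'Smith', 10), (1289, 'Jones', 20),
        (1234, 'Doe', 50), (6786, 'Harris', 60);
""")

# Anti-join: employees whose department has no matching parent row.
orphans = conn.execute("""
    SELECT e.emp_id, e.name, e.dept_id
    FROM employee e
    WHERE NOT EXISTS (
        SELECT 1 FROM department d WHERE d.dept_id = e.dept_id
    )
    ORDER BY e.emp_id
""").fetchall()
conn.close()

print(orphans)
```

Doe and Harris point at departments 50 and 60, which do not exist in the department table, so they are the rows a load should reject or repair.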
31
Name and Address Problem
Single-field format:
Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565
Multiple-field format:
Name     Mr. J. Smith
Street   100 Main St.
Town     Bigtown
Country  County Luth
Code     23565
Database 1:
NAME              LOCATION
DIANNE ZIEFELD    N100
HARRY H. ENFIELD  M300
Database 2:
NAME              LOCATION
ZIEFELD, DIANNE   100
ENFIELD, HARRY H  300
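Both halves of this slide can be sketched in a few lines: splitting the single-field address into separate fields, and normalizing names so that the two databases can be matched. The comma-based split and the normalization rule (reorder "LAST, FIRST", drop periods, uppercase) are assumptions that fit the slide's examples; real name and address data usually needs a dedicated tool.

```python
ADDRESS_FIELDS = ("name", "street", "town", "country", "code")


def split_address(single_field: str) -> dict:
    """Split a comma-separated address line into named fields."""
    parts = [part.strip() for part in single_field.split(",")]
    if len(parts) != len(ADDRESS_FIELDS):
        raise ValueError(f"expected {len(ADDRESS_FIELDS)} parts: {single_field!r}")
    return dict(zip(ADDRESS_FIELDS, parts))


def normalize_name(raw: str) -> str:
    """Turn 'LAST, FIRST' or 'FIRST LAST' into one comparable form."""
    raw = raw.strip()
    if "," in raw:
        last, first = [part.strip() for part in raw.split(",", 1)]
        raw = f"{first} {last}"
    return " ".join(raw.replace(".", "").upper().split())


address = split_address("Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565")
same_person = normalize_name("HARRY H. ENFIELD") == normalize_name("ENFIELD, HARRY H")
print(address["town"], same_person)
```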
32
Quality Data: Importance and Benefits
Quality data:
◦ Key to a successful warehouse implementation
Quality data helps you in:
◦ Targeting the right customers
◦ Determining buying patterns
◦ Identifying householders: private and commercial
◦ Matching customers
◦ Identifying historical data
33
Data Quality Guidelines
Operational data:
◦ Should not be used directly in the warehouse
◦ Must be cleaned for each increment
◦ Is not simply fixed by modifying applications
34
Transformation Techniques
◦ Merging data
◦ Adding a Date Stamp
◦ Adding Keys to Data
35
Merging Data
◦ Operational transactions do not usually map one-to-one with warehouse data.
◦ Data for the warehouse is merged to provide information for analysis.
Pizza sales/returns by day, hour, seconds:
Sale    1/2/02  12:00:01  Ham Pizza      $10.00
Sale    1/2/02  12:00:02  Cheese Pizza   $15.00
Sale    1/2/02  12:00:02  Anchovy Pizza  $12.00
Return  1/2/02  12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/02  12:00:04  Sausage Pizza  $11.00
36
Merging Data
Pizza sales/returns by day, hour, seconds:
Sale    1/2/02  12:00:01  Ham Pizza      $10.00
Sale    1/2/02  12:00:02  Cheese Pizza   $15.00
Sale    1/2/02  12:00:02  Anchovy Pizza  $12.00
Return  1/2/02  12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/02  12:00:04  Sausage Pizza  $11.00
Pizza sales:
Sale    1/2/02  12:00:01  Ham Pizza      $10.00
Sale    1/2/02  12:00:02  Cheese Pizza   $15.00
Sale    1/2/02  12:00:04  Sausage Pizza  $11.00
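The merge shown on this slide, where a sale and its later return cancel out, could be implemented along these lines. The matching rule (a return cancels the most recent earlier sale of the same item for the same amount) is an assumption; the slide shows only the input and the result. Transactions are expected in time order.

```python
from decimal import Decimal


def merge_sales(transactions):
    """Drop each return together with the sale it cancels."""
    kept = []
    for kind, timestamp, item, amount in transactions:
        if kind == "Return":
            # Search backwards for the matching sale and remove it.
            for i in range(len(kept) - 1, -1, -1):
                if kept[i][2] == item and kept[i][3] == -amount:
                    del kept[i]
                    break
        else:
            kept.append((kind, timestamp, item, amount))
    return kept


transactions = [
    ("Sale", "1/2/02 12:00:01", "Ham Pizza", Decimal("10.00")),
    ("Sale", "1/2/02 12:00:02", "Cheese Pizza", Decimal("15.00")),
    ("Sale", "1/2/02 12:00:02", "Anchovy Pizza", Decimal("12.00")),
    ("Return", "1/2/02 12:00:03", "Anchovy Pizza", Decimal("-12.00")),
    ("Sale", "1/2/02 12:00:04", "Sausage Pizza", Decimal("11.00")),
]
pizza_sales = merge_sales(transactions)
for row in pizza_sales:
    print(row)
```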
37
Adding a Date Stamp
Time element can be represented as a:
◦ Single point in time
◦ Time span
Add time element to:
◦ Fact tables
◦ Dimension data
38
Adding a Date Stamp: Fact Tables and Dimensions
◦ Item Table: Item_id, Dept_id, Time_key
◦ Store Table: Store_id, District_id, Time_key
◦ Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units
◦ Time Table: Week_id, Period_id, Year_id, Time_key
◦ Product Table: Product_id, Time_key, Product_desc
39
Adding Keys to Data
Data values or artificial keys
Operational data:
#1  Sale    1/2/98  12:00:01  Ham Pizza      $10.00
#2  Sale    1/2/98  12:00:02  Cheese Pizza   $15.00
#3  Sale    1/2/98  12:00:02  Anchovy Pizza  $12.00
#5  Sale    1/2/98  12:00:04  Sausage Pizza  $11.00
#4  Return  1/2/98  12:00:03  Anchovy Pizza  -$12.00
Warehouse data:
#dw1  Sale  1/2/98  12:00:01  Ham Pizza      $10.00
#dw2  Sale  1/2/98  12:00:02  Cheese Pizza   $15.00
#dw3  Sale  1/2/98  12:00:04  Sausage Pizza  $11.00
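Assigning artificial keys such as #dw1, #dw2, #dw3 to rows that have already been merged can be sketched as below. The function name `add_surrogate_keys` and the in-memory counter are illustrative assumptions; a real warehouse would more likely use a database sequence or identity column.

```python
from itertools import count


def add_surrogate_keys(rows, prefix="dw"):
    """Prefix each row with a sequential warehouse key: dw1, dw2, ..."""
    counter = count(1)
    return [(f"{prefix}{next(counter)}", *row) for row in rows]


merged_rows = [
    ("Sale", "1/2/98 12:00:01", "Ham Pizza", "10.00"),
    ("Sale", "1/2/98 12:00:02", "Cheese Pizza", "15.00"),
    ("Sale", "1/2/98 12:00:04", "Sausage Pizza", "11.00"),
]
keyed_rows = add_surrogate_keys(merged_rows)
for row in keyed_rows:
    print(row)
```

The warehouse keys are independent of the operational numbers #1 to #5, so gaps left by merged-away rows do not carry over.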
40
Summarizing Data
1. During extraction on staging area
2. After loading to the warehouse server
[Figure: operational databases, staging area, warehouse database]
41
Maintaining Transformation Metadata
Transformation metadata contains:
◦ Transformation rules
◦ Algorithms and routines
[Figure: sources, extract, stage, transform (rules), load, publish, query]
42
Maintaining Transformation Metadata
◦ Restructure keys
◦ Identify and resolve coding differences
◦ Validate data from multiple sources
◦ Handle exception rules
◦ Identify and resolve format differences
◦ Fix referential integrity inconsistencies
◦ Identify summary data
43
Transformation Timing and Location
◦ Transformation is performed:
  ◦ Before load
  ◦ In parallel
◦ Can be initiated at different points:
  ◦ On the operational platform
  ◦ In a separate staging area
44
Monitoring and Tracking
Transformations should:
◦ Be self-documenting
◦ Provide summary statistics
◦ Handle process exceptions
45
Loading Data into the Warehouse
◦ Loading moves the data into the warehouse
◦ Loading can be time-consuming:
  ◦ Consider the load window
  ◦ Schedule and automate the loading
◦ Initial load moves large volumes of data
◦ Subsequent refresh moves smaller volumes of data
[Figure: operational databases, extract, staging area (transform), transport and load, warehouse database]
46
Initial Load and Refresh
Initial load:
◦ Single event that populates the database with historical data
◦ Involves large volumes of data
◦ Employs distinct ETL tasks
◦ Involves large amounts of processing after load
Refresh:
◦ Performed according to a business cycle
◦ Less data to load than first-time load
◦ Less-complex ETL tasks
◦ Smaller amounts of post-load processing
47
Data Refresh Models: Extract Processing Environment
◦ After each time interval, build a new snapshot of the database.
◦ Purge old snapshots.
[Figure: snapshots built from the operational databases at times T1, T2, T3]
48
Data Refresh Models: Warehouse Processing Environment
◦ Build a new database.
◦ After each time interval, add changes to database.
◦ Archive or purge oldest data.
[Figure: changes from the operational databases added at times T1, T2, T3]
49
Building the Loading Process
◦ Techniques and tools
◦ File transfer methods
◦ The load window
◦ Time window for other tasks
◦ First-time and refresh volumes
◦ Frequency of the refresh cycle
◦ Connectivity bandwidth
50
Building the Loading Process
◦ Test the proposed technique
◦ Document proposed load
◦ Monitor, review, and revise
51
Data Granularity
◦ Important design and operational issue
◦ Low-level grain: expensive, high level of processing, more disk space, more detail
◦ High-level grain: cheaper, less processing, less disk space, little detail
52
Loading Techniques
◦ Tools
◦ Utilities and 3GL
◦ Gateways
◦ Customized copy programs
◦ Replication
◦ FTP
◦ Manual
53
Loading Technique Considerations
◦ Tools are comprehensive, but costly.
◦ Data-movement utilities are fast and powerful.
◦ Gateways are suitable for specific instances:
  ◦ Access other databases
  ◦ Supply dependent data marts
  ◦ Support a distributed environment
  ◦ Provide real-time access if needed
◦ Use customized programs as a last resort.
◦ Replication is limited by data-transfer rates.
54
Post-Processing of Loaded Data
◦ Create indexes
◦ Generate keys
◦ Summarize
◦ Filter
[Figure: extract, transform (staging area), load, warehouse, followed by post-processing of loaded data]
55
Creating Derived Keys
The use of derived or generalized keys is recommended to maintain the uniqueness of a row.
Methods:
◦ Concatenate operational key with a number (example: 109908 becomes 109908 01)
◦ Assign a number sequentially from a list (example: 109908 becomes 100)
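The two methods on this slide can be sketched as follows. The two-digit suffix width and the starting value of 100 are assumptions read off the slide's example values, and `concatenated_key` and `SequentialKeyGenerator` are hypothetical names.

```python
def concatenated_key(operational_key: str, number: int) -> str:
    """Method 1: append a zero-padded number to the operational key."""
    return f"{operational_key}{number:02d}"


class SequentialKeyGenerator:
    """Method 2: hand out the next number in sequence per operational key."""

    def __init__(self, start: int = 100):
        self._next = start
        self._assigned = {}

    def key_for(self, operational_key: str) -> int:
        # Reuse the number if this operational key was already seen.
        if operational_key not in self._assigned:
            self._assigned[operational_key] = self._next
            self._next += 1
        return self._assigned[operational_key]


generator = SequentialKeyGenerator()
print(concatenated_key("109908", 1))
print(generator.key_for("109908"), generator.key_for("109909"))
```

Concatenation keeps the operational key visible inside the warehouse key, while sequential assignment hides it completely and needs a lookup table to trace rows back.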
56
Summary Management
◦ Summary tables
◦ Materialized views
[Figure: summary data]
57
Filtering Data
From warehouse to data marts
[Figure: warehouse and summary data filtered into data marts]
58
Verifying Data Integrity
◦ Load data into intermediate file.
◦ Compare target flash totals with totals before load.
[Figure: flash totals (counts and amounts) of the intermediate file are compared with those of the target; if equal, load; otherwise preserve, inspect, fix, then load]
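A flash-total comparison of the kind described here can be sketched as below. It assumes a flash total is a row count plus the sum of one amount column, which matches the "counts & amounts" labels on the slide; the function names and the `amount` field name are hypothetical.

```python
from decimal import Decimal


def flash_totals(rows, amount_field="amount"):
    """Return (row count, summed amount) for a set of rows."""
    total = sum((Decimal(str(row[amount_field])) for row in rows), Decimal("0"))
    return len(rows), total


def totals_match(before_load, after_load, amount_field="amount"):
    """True when counts and amounts agree before and after the load."""
    return flash_totals(before_load, amount_field) == flash_totals(after_load, amount_field)


intermediate = [{"amount": "10.00"}, {"amount": "15.00"}, {"amount": "11.00"}]
target_ok = [{"amount": "10.00"}, {"amount": "15.00"}, {"amount": "11.00"}]
target_bad = [{"amount": "10.00"}, {"amount": "15.00"}]

print(totals_match(intermediate, target_ok))
print(totals_match(intermediate, target_bad))
```

Matching totals do not prove every row is right, since offsetting errors can cancel, so this check works best alongside the other quality assurance checks in the deck.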
59
Steps for Verifying Data Integrity
[Figure: numbered steps 1 to 7 linking source files, extract, control file, SQL*Loader, the .log file, the .bad file, and the target]
60
Standard Quality Assurance Checks
◦ Load status
◦ Completion of the process
◦ Completeness of the data
◦ Data reconciliation
◦ Referential integrity violations
◦ Reprocessing
◦ Comparison of counts and amounts
[Figure: 1 + 1 = 3]