

Extraction, Transformation, and Loading (II)
ISQS 6339, Data Management & Business Intelligence
Zhangxi Lin, Texas Tech University

Agenda
I. Using SSIS for ETL
  ◦ Integration Services
  ◦ Learn by doing
  ◦ Package items
  ◦ Problem-oriented package development
II. The Principle of ETL
  ◦ Extraction
  ◦ Transformation
  ◦ Loading

II. The Principle of ETL

Structure and Components of Business Intelligence
[Diagram labels: SSMS, SSIS, SSAS, SSRS, SAS EM, SAS EG]

Automating your routine information processing tasks
Your routine information processing tasks:
• Read online news at 8:00 a.m. and collect a few of the most important pieces.
• Retrieve data from a database to draft a short daily report at 10 a.m.
• View and reply to emails, and take notes that are saved in a database.
• View 10 companies' webpages to check for updates, and enter the summaries into a database.
• Browse three popular magazines twice a week, and enter the summaries into a database.
• Generate a few one-way and two-way frequency tables and put them on the web.
• Merge datasets collected by other people into a main database.
• Prepare a weekly report from the database at 4 p.m. every Monday and publish it to the internal portal site.
• Prepare a monthly report at 11 a.m. on the first day of each month; it must be converted into a PDF file and uploaded to the website.
There are many things going on. How can they all be handled properly and on time? An organizer helps with personal tasks, but what about regular data processing tasks?

Information Processing and Information Flow
• Transaction processing: interactions between a user and a computer application system, with immediate responses from the application.
• Operational processing: use of a computer to control a process.
• Batch processing: a series of executions, each applied to a set of data and passing its result to the next.
• Analytical processing: interaction between analysts and collections of aggregated data that may have been reformulated into alternative representational forms for improved analytical performance.

Extraction, Transformation, Loading (ETL) Processes
• Extract source data
• Transform/clean data
• Index and summarize
• Load data into warehouse
• Detect changes
• Refresh data
[Diagram: Operational systems -> ETL (programs, tools, gateways) -> Data warehouse]

ETL: Tasks, Importance, and Cost
• ETL tasks: extract, clean up, consolidate, restructure, load, maintain, refresh
• Data qualities shown: relevant, useful, quality, accurate, accessible
[Diagram: Operational systems -> ETL -> Data warehouse]

Extracting Data
• Source systems
  ◦ Data from various data sources in various formats
• Extraction routines
  ◦ Developed to select data fields from sources
  ◦ Consist of business rules, audit trails, error correction facilities
[Diagram labels: Operational databases, Data staging area (data mapping, transform), Warehouse database]

Production Data
• Operating system platforms
• File systems
• Database systems and vertical applications
[Examples shown: IMS, DB2, Oracle, Sybase, Informix, VSAM, SAP, Shared Medical Systems, Dun and Bradstreet Financials, Hogan Financials, Oracle Financials]

Archive Data
• Historical data
• Useful for analysis over long periods of time
• Useful for first-time load
• May require unique transformations
[Diagram: Operational databases -> Warehouse database]

Internal Data
• Planning, sales, and marketing organization data
• Maintained in the form of:
  ◦ Spreadsheets (structured)
  ◦ Documents (unstructured)
• Treated like any other source data
[Diagram: Planning, Accounting, Marketing -> Warehouse database]

External Data
• Information from outside the organization
• Issues of frequency, format, and predictability
• Described and tracked using metadata
[Examples shown: A.C. Nielsen, IRI, IMS, Walsh America; Barron's; Dun and Bradstreet; purchased databases; Wall Street Journal; economic forecasts; competitive information]
[Diagram: external sources -> Warehousing databases]

Possible ETL Failures
• A missing source file
• A system failure
• Inadequate metadata
• Poor mapping information
• Inadequate storage planning
• A source structural change
• No contingency plan
• Inadequate data validation

Maintaining ETL Quality
• ETL must be:
  ◦ Tested
  ◦ Documented
  ◦ Monitored and reviewed
• Disparate metadata must be coordinated.

Transformation
Transformation eliminates anomalies from operational data:
• Cleans and standardizes
• Presents subject-oriented data
[Diagram: Operational systems -> Extract -> Data staging area (transform: clean up, consolidate, restructure) -> Load -> Warehouse]

Remote Staging Model
• Data staging area within the warehouse environment
• Data staging area in its own environment
[Diagrams: Operational system -> Extract -> Transform (staging area) -> Load -> Warehouse]

On-site Staging Model
• Data staging area within the operational environment, possibly affecting the operational system
[Diagram: Operational system -> Extract -> Transform (staging area) -> Load -> Warehouse]

Data Anomalies
• No unique key
• Data naming and coding anomalies
• Data meaning anomalies between groups
• Spelling and text inconsistencies
CUSNUM  NAME                 ADDRESS
        Oracle Limited       100 N.E. 1st St
        Oracle Computing     15 Main Road, Ft. Lauderdale
        Oracle Corp. UK      15 Main Road, Ft. Lauderdale, FLA
        Oracle Corp UK Ltd   181 North Street, Key West, FLA

Transformation Routines
• Cleaning data
• Eliminating inconsistencies
• Adding elements
• Merging data
• Integrating data
• Transforming data before load

Transforming Data: Problems and Solutions
• Multipart keys
• Multiple local standards
• Multiple files
• Missing values
• Duplicate values
• Element names
• Element meanings
• Input formats
• Referential integrity constraints
• Name and address

Multipart Keys Problem
• Problem: multipart keys
• Example: Product code = 12 M
• Components of the key:
  ◦ Country code
  ◦ Sales territory
  ◦ Product number
  ◦ Salesperson code
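A multipart key packs several attributes into one code, so the transformation step usually has to split it back into separate columns. The sketch below assumes a hypothetical fixed-width layout (2-character country code, 1-character sales territory, 6-digit product number, 2-digit salesperson code). The widths and the sample value are illustrative assumptions and are not taken from the slide.

```python
from typing import NamedTuple


class ProductKey(NamedTuple):
    country_code: str
    sales_territory: str
    product_number: str
    salesperson_code: str


# Hypothetical fixed-width layout; the widths are assumptions, not from the slide.
FIELD_WIDTHS = (2, 1, 6, 2)


def split_multipart_key(code: str) -> ProductKey:
    """Split a packed product code into its component fields."""
    expected = sum(FIELD_WIDTHS)
    if len(code) != expected:
        raise ValueError(f"expected {expected} characters, got {len(code)}")
    parts = []
    start = 0
    for width in FIELD_WIDTHS:
        parts.append(code[start:start + width])
        start += width
    return ProductKey(*parts)


key = split_multipart_key("01A00042307")  # made-up sample value
print(key)
```

Keeping the layout in one table makes it easy to record the same widths in the transformation metadata.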

Multiple Local Standards Problem
• Problem: multiple local standards
• Tools or filters to preprocess
[Examples shown: lengths in cm and inches; amounts as USD 600, 1,000 GBP, FF 9,990; dates as DD/MM/YY, MM/DD/YY, DD-Mon-YY]
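A preprocessing filter for local standards can be sketched as a pair of small converters. This is a minimal illustration, assuming each source system's date format is known in advance; the source names (`uk_sales`, `us_sales`, `warehouse`) are hypothetical. Currency conversion is left out because it needs exchange rates that the slide does not give.

```python
from datetime import datetime

# Assumed mapping from (hypothetical) source system to its local date format.
DATE_FORMATS = {
    "uk_sales": "%d/%m/%y",   # DD/MM/YY
    "us_sales": "%m/%d/%y",   # MM/DD/YY
    "warehouse": "%d-%b-%y",  # DD-Mon-YY (English month names, default locale)
}

CM_PER_INCH = 2.54


def to_iso_date(value: str, source: str) -> str:
    """Parse a date with the source's known format and return YYYY-MM-DD."""
    return datetime.strptime(value, DATE_FORMATS[source]).date().isoformat()


def to_cm(value: float, unit: str) -> float:
    """Convert a length to centimetres."""
    if unit == "cm":
        return value
    if unit == "inches":
        return value * CM_PER_INCH
    raise ValueError(f"unknown unit: {unit}")


print(to_iso_date("03/02/02", "uk_sales"))
print(to_iso_date("03/02/02", "us_sales"))
```

The same string, 03/02/02, yields two different dates depending on the source, which is why the format has to come from metadata about the source rather than from the value itself.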

Multiple Files Problem
• Added complexity of multiple source files
• Start simple
[Diagram: Multiple source files -> Logic to detect correct source -> Transformed data]

Missing Values Problem
Solution:
• Ignore
• Wait
• Mark rows
• Extract when time-stamped
[Example: If NULL then field = 'A']
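The "mark rows" option can be sketched as follows. The function name and the extra flag column are hypothetical; the default marker 'A' follows the slide's "If NULL then field = 'A'" example.

```python
def mark_missing(rows, field, default="A"):
    """Return copies of rows with missing `field` values replaced and flagged."""
    marked = []
    for row in rows:
        new_row = dict(row)  # copy so the extracted data is left untouched
        is_missing = new_row.get(field) is None
        if is_missing:
            new_row[field] = default
        # Keep a flag so analysts can tell real values from substituted ones.
        new_row[f"{field}_was_missing"] = is_missing
        marked.append(new_row)
    return marked


rows = [{"id": 1, "grade": "B"}, {"id": 2, "grade": None}]
print(mark_missing(rows, "grade"))
```

Recording the flag alongside the substituted value avoids silently mixing defaults with genuine data.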

Duplicate Values Problem
Solution:
• SQL self-join techniques
• RDBMS constraint utilities
[Example value shown: ACME Inc]
    SELECT ...
    FROM table_a, table_b
    WHERE table_a.key (+)= table_b.key
    UNION
    SELECT ...
    FROM table_a, table_b
    WHERE table_a.key = table_b.key (+);
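A self-join that finds duplicate rows can be sketched with Python's built-in sqlite3 module. The `customer` table, its columns, and the sample rows are hypothetical; the query uses standard join syntax rather than the Oracle `(+)` notation shown on the slide.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with one duplicated customer name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customer (id, name) VALUES (?, ?)",
        [(1, "ACME Inc"), (2, "Widget Co"), (3, "ACME Inc")],
    )
    return conn


def find_duplicate_customers(conn):
    """Return (id_a, id_b, name) for every pair of rows sharing a name."""
    query = """
        SELECT a.id, b.id, a.name
        FROM customer AS a
        JOIN customer AS b
          ON a.name = b.name
         AND a.id < b.id      -- each pair once, and never a row with itself
        ORDER BY a.id, b.id
    """
    return conn.execute(query).fetchall()


print(find_duplicate_customers(build_demo_db()))
```

An exact-match join only catches identical spellings; near-duplicates such as "ACME Inc" and "ACME Inc." need cleaning before the comparison.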

Element Names Problem
Solution:
• Common naming conventions
[Example names shown: Customer, Client, Contact, Name]

Element Meaning Problem
• Avoid misinterpretation
• Complex solution
• Document meaning in metadata
[Example: Customer_detail may be read as the customer's name, all customer details, or all details except name]

Input Format Problem
• Character sets: ASCII, EBCDIC
[Example values shown: ACME Co.; ברוכים הבאים; Beer (Pack of 8)]
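The same text is stored as different bytes in ASCII and EBCDIC, so the extract must be decoded with the source system's character set before it is loaded. The sketch below uses Python's built-in `cp037` codec, one common EBCDIC code page; which code page a real source uses is an assumption that has to be confirmed per system.

```python
def decode_field(raw: bytes, encoding: str) -> str:
    """Decode bytes from a source system's character set into a Python str."""
    return raw.decode(encoding)


# "ACME" as EBCDIC bytes (code page 037).
ebcdic_bytes = bytes([0xC1, 0xC3, 0xD4, 0xC5])
ascii_bytes = "ACME".encode("ascii")

print(decode_field(ebcdic_bytes, "cp037"))
print(decode_field(ascii_bytes, "ascii"))
print(ebcdic_bytes == ascii_bytes)  # same text, different bytes
```

Decoding with the wrong character set does not always raise an error; it can silently produce unreadable text, so the encoding belongs in the source metadata.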

Referential Integrity Problem
Solution:
• SQL anti-join
• Server constraints
• Dedicated tools
[Tables shown: Department; Emp (Name, Department) with employees Smith, Jones, Doe, and Harris. Surviving values: 1099, 60]
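An anti-join returns the child rows that have no matching parent row. The sketch below uses sqlite3 with hypothetical `department` and `emp` tables; the employee names echo the slide, but the ids and department numbers are made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create in-memory tables where one employee points at a missing department."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE department (dept_id INTEGER PRIMARY KEY);
        CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
        INSERT INTO department (dept_id) VALUES (10), (20);
        INSERT INTO emp (emp_id, name, dept_id) VALUES
            (1, 'Smith', 10), (2, 'Jones', 20), (3, 'Harris', 60);
    """)
    return conn


def find_orphan_employees(conn):
    """Anti-join: employees whose dept_id has no row in department."""
    query = """
        SELECT e.name, e.dept_id
        FROM emp AS e
        WHERE NOT EXISTS (
            SELECT 1 FROM department AS d WHERE d.dept_id = e.dept_id
        )
        ORDER BY e.emp_id
    """
    return conn.execute(query).fetchall()


print(find_orphan_employees(build_demo_db()))
```

Running a check like this in the staging area surfaces violations before server constraints reject the load.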

Name and Address Problem
• Single-field format: Mr. J. Smith,100 Main St., Bigtown, County Luth,
• Multiple-field format:
  Name     Mr. J. Smith
  Street   100 Main St.
  Town     Bigtown
  Country  County Luth
  Code
Database 1
NAME               LOCATION
DIANNE ZIEFELD     N100
HARRY H. ENFIELD   M300
Database 2
NAME               LOCATION
ZIEFELD, DIANNE    100
ENFIELD, HARRY H   300
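Both halves of this problem can be sketched in a few lines: splitting a single-field address into named fields, and rewriting "LAST, FIRST" names into "FIRST LAST" order so that two databases can be matched. The field order follows the slide's multiple-field example; the function names and the comma-based parsing rule are assumptions, and real name and address data usually needs a dedicated tool.

```python
FIELDS = ("name", "street", "town", "country", "code")


def split_address(line: str) -> dict:
    """Split a comma-separated address line into named fields.

    Missing trailing fields are returned as empty strings.
    """
    parts = [part.strip() for part in line.split(",")]
    if len(parts) > len(FIELDS):
        raise ValueError(f"too many fields: {len(parts)}")
    parts += [""] * (len(FIELDS) - len(parts))
    return dict(zip(FIELDS, parts))


def normalize_name(name: str) -> str:
    """Rewrite 'LAST, FIRST' as 'FIRST LAST'; leave other names unchanged."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        return f"{first} {last}"
    return name.strip()


print(split_address("Mr. J. Smith,100 Main St., Bigtown, County Luth"))
print(normalize_name("ZIEFELD, DIANNE"))
```

Reordering alone is not enough for every row: "ENFIELD, HARRY H" becomes "HARRY H ENFIELD", which still differs from "HARRY H. ENFIELD" by a period, so punctuation has to be standardized as well.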

Quality Data: Importance and Benefits
• Quality data:
  ◦ Key to a successful warehouse implementation
• Quality data helps you in:
  ◦ Targeting the right customers
  ◦ Determining buying patterns
  ◦ Identifying householders: private and commercial
  ◦ Matching customers
  ◦ Identifying historical data

Data Quality Guidelines
Operational data:
• Should not be used directly in the warehouse
• Must be cleaned for each increment
• Is not simply fixed by modifying applications

Transformation Techniques
• Merging data
• Adding a date stamp
• Adding keys to data

Merging Data
• Operational transactions do not usually map one-to-one with warehouse data.
• Data for the warehouse is merged to provide information for analysis.
Pizza sales/returns by day, hour, seconds:
Sale    1/2/02  12:00:01  Ham Pizza      $10.00
Sale    1/2/02  12:00:02  Cheese Pizza   $15.00
Sale    1/2/02  12:00:02  Anchovy Pizza  $12.00
Return  1/2/02  12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/02  12:00:04  Sausage Pizza  $

Merging Data
Pizza sales/returns by day, hour, seconds:
Sale    1/2/02  12:00:01  Ham Pizza      $10.00
Sale    1/2/02  12:00:02  Cheese Pizza   $15.00
Sale    1/2/02  12:00:02  Anchovy Pizza  $12.00
Return  1/2/02  12:00:03  Anchovy Pizza  -$12.00
Sale    1/2/02  12:00:04  Sausage Pizza  $
Pizza sales:
Sale    1/2/02  12:00:01  Ham Pizza      $10.00
Sale    1/2/02  12:00:02  Cheese Pizza   $15.00
Sale    1/2/02  12:00:04  Sausage Pizza  $11.00
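The merge shown in the pizza example, where a sale and its matching return cancel out, can be sketched as follows. The rule used here (each return cancels the first listed sale of the same item) is an assumption; the slides do not state how a return is matched to a sale. The transaction values come from the tables above.

```python
from collections import Counter
from decimal import Decimal

TRANSACTIONS = [
    ("Sale", "12:00:01", "Ham Pizza", Decimal("10.00")),
    ("Sale", "12:00:02", "Cheese Pizza", Decimal("15.00")),
    ("Sale", "12:00:02", "Anchovy Pizza", Decimal("12.00")),
    ("Return", "12:00:03", "Anchovy Pizza", Decimal("-12.00")),
    ("Sale", "12:00:04", "Sausage Pizza", Decimal("11.00")),
]


def merge_sales(transactions):
    """Drop each return together with the first listed sale of the same item."""
    pending_returns = Counter(
        item for kind, _, item, _ in transactions if kind == "Return"
    )
    merged = []
    for kind, time, item, amount in transactions:
        if kind != "Sale":
            continue  # returns themselves are not loaded
        if pending_returns[item] > 0:
            pending_returns[item] -= 1  # this sale is cancelled by a return
            continue
        merged.append((kind, time, item, amount))
    return merged


for row in merge_sales(TRANSACTIONS):
    print(row)
```

Five operational rows become three warehouse rows, and the total is unchanged because the cancelled sale and its return sum to zero.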

Adding a Date Stamp
• Time element can be represented as a:
  ◦ Single point in time
  ◦ Time span
• Add time element to:
  ◦ Fact tables
  ◦ Dimension data

Adding a Date Stamp: Fact Tables and Dimensions
• Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units
• Item Table: Item_id, Dept_id, Time_key
• Store Table: Store_id, District_id, Time_key
• Time Table: Week_id, Period_id, Year_id, Time_key
• Product Table: Product_id, Time_key, Product_desc

Adding Keys to Data
Operational rows:
#1    Sale    1/2/98  12:00:01  Ham Pizza      $10.00
#2    Sale    1/2/98  12:00:02  Cheese Pizza   $15.00
#3    Sale    1/2/98  12:00:02  Anchovy Pizza  $12.00
#4    Return  1/2/98  12:00:03  Anchovy Pizza  -$12.00
#5    Sale    1/2/98  12:00:04  Sausage Pizza  $11.00
Warehouse rows:
#dw1  Sale    1/2/98  12:00:01  Ham Pizza      $10.00
#dw2  Sale    1/2/98  12:00:02  Cheese Pizza   $15.00
#dw3  Sale    1/2/98  12:00:04  Sausage Pizza  $11.00
Keys can be data values or artificial keys.
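Assigning artificial keys such as dw1, dw2, dw3 to the merged rows can be sketched with a simple counter. The `dw_key` and `op_key` column names are hypothetical; a real warehouse would more likely use a database sequence or identity column.

```python
from itertools import count


def add_warehouse_keys(rows, prefix="dw"):
    """Attach a sequential artificial key (dw1, dw2, ...) to each row."""
    sequence = count(1)
    return [{"dw_key": f"{prefix}{next(sequence)}", **row} for row in rows]


merged_rows = [
    {"op_key": 1, "item": "Ham Pizza"},
    {"op_key": 2, "item": "Cheese Pizza"},
    {"op_key": 5, "item": "Sausage Pizza"},
]
for row in add_warehouse_keys(merged_rows):
    print(row)
```

Keeping the operational key next to the new key preserves the trail back to the source rows, which no longer number consecutively after the merge.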

Summarizing Data
1. During extraction on staging area
2. After loading to the warehouse server
[Diagram: Operational databases -> Staging area -> Warehouse database]

Maintaining Transformation Metadata
• Transformation metadata contains:
  ◦ Transformation rules
  ◦ Algorithms and routines
[Diagram labels: Sources, Extract, Stage, Transform, Rules, Load, Publish, Query]

Maintaining Transformation Metadata
• Restructure keys
• Identify and resolve coding differences
• Validate data from multiple sources
• Handle exception rules
• Identify and resolve format differences
• Fix referential integrity inconsistencies
• Identify summary data

Transformation Timing and Location
• Transformation is performed:
  ◦ Before load
  ◦ In parallel
• Can be initiated at different points:
  ◦ On the operational platform
  ◦ In a separate staging area

Monitoring and Tracking
Transformations should:
• Be self-documenting
• Provide summary statistics
• Handle process exceptions

Loading Data into the Warehouse
• Loading moves the data into the warehouse
• Loading can be time-consuming:
  ◦ Consider the load window
  ◦ Schedule and automate the loading
• Initial load moves large volumes of data
• Subsequent refresh moves smaller volumes of data
[Diagram: Operational databases -> Extract -> Staging area (transform) -> Transport, Load -> Warehouse database]

Initial Load and Refresh
• Initial load:
  ◦ Single event that populates the database with historical data
  ◦ Involves large volumes of data
  ◦ Employs distinct ETL tasks
  ◦ Involves large amounts of processing after load
• Refresh:
  ◦ Performed according to a business cycle
  ◦ Less data to load than first-time load
  ◦ Less-complex ETL tasks
  ◦ Smaller amounts of post-load processing

Data Refresh Models: Extract Processing Environment
• After each time interval, build a new snapshot of the database.
• Purge old snapshots.
[Diagram: Operational databases, with snapshots at T1, T2, T3]

Data Refresh Models: Warehouse Processing Environment
• Build a new database.
• After each time interval, add changes to database.
• Archive or purge oldest data.
[Diagram: Operational databases, with changes applied at T1, T2, T3]
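The warehouse processing model (add each interval's changes, then drop the oldest data) can be sketched with a dictionary keyed by interval. The function name, the `keep_periods` retention parameter, and the in-memory representation are all illustrative assumptions; the archive step is omitted.

```python
def refresh(warehouse, changes, period, keep_periods):
    """Add one interval's changes, then purge intervals outside the retention window.

    `warehouse` maps a sortable period label (for example "T1") to that
    period's rows. A new dictionary is returned; the input is not modified.
    """
    if keep_periods < 1:
        raise ValueError("keep_periods must be at least 1")
    updated = dict(warehouse)
    updated[period] = list(changes)
    for old_period in sorted(updated)[:-keep_periods]:
        del updated[old_period]  # a real system might archive before purging
    return updated


warehouse = {}
for label in ("T1", "T2", "T3"):
    warehouse = refresh(warehouse, [f"rows for {label}"], label, keep_periods=2)
print(sorted(warehouse))
```

In contrast to the snapshot model, only the changes for the new interval are written on each cycle, so the cost of a refresh tracks the volume of change rather than the size of the database.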

Building the Loading Process
• Techniques and tools
• File transfer methods
• The load window
• Time window for other tasks
• First-time and refresh volumes
• Frequency of the refresh cycle
• Connectivity bandwidth

Building the Loading Process
• Test the proposed technique
• Document proposed load
• Monitor, review, and revise

Data Granularity
• Important design and operational issue
• Low-level grain: expensive, high level of processing, more disk space, more detail
• High-level grain: cheaper, less processing, less disk space, less detail

Loading Techniques
• Tools
• Utilities and 3GL
• Gateways
• Customized copy programs
• Replication
• FTP
• Manual

Loading Technique Considerations
• Tools are comprehensive, but costly.
• Data-movement utilities are fast and powerful.
• Gateways are suitable for specific instances:
  ◦ Access other databases
  ◦ Supply dependent data marts
  ◦ Support a distributed environment
  ◦ Provide real-time access if needed
• Use customized programs as a last resort.
• Replication is limited by data-transfer rates.

Post-Processing of Loaded Data
• Create indexes
• Generate keys
• Summarize
• Filter
[Diagram: Extract -> Transform (staging area) -> Load -> Warehouse -> post-processing of loaded data]

Creating Derived Keys
The use of derived or generalized keys is recommended to maintain the uniqueness of a row.
Methods:
• Concatenate operational key with a number
• Assign a number sequentially from a list
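The first method, concatenating the operational key with a number, can be sketched as a small generator that keeps one counter per operational key. The class name, the hyphen separator, and the sample keys are hypothetical.

```python
from collections import defaultdict


class DerivedKeyGenerator:
    """Build warehouse keys of the form '<operational key>-<n>'."""

    def __init__(self):
        # Next number to hand out for each operational key, starting at 1.
        self._next_number = defaultdict(lambda: 1)

    def next_key(self, operational_key: str) -> str:
        number = self._next_number[operational_key]
        self._next_number[operational_key] += 1
        return f"{operational_key}-{number}"


generator = DerivedKeyGenerator()
print(generator.next_key("C100"))
print(generator.next_key("C100"))
print(generator.next_key("C200"))
```

This keeps rows unique when the same operational key is loaded more than once, for example when a customer record changes between refreshes.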

Summary Management
• Summary tables
• Materialized views
[Diagram label: Summary data]

Filtering Data
• From warehouse to data marts
[Diagram labels: Warehouse, Summary data, Data marts]

Verifying Data Integrity
• Load data into intermediate file.
• Compare target flash totals with totals before load.
• Preserve, inspect, fix, then load.
[Diagram: Intermediate file (counts & amounts flash totals) = Target (counts & amounts flash totals) -> Load]
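A flash-total comparison can be sketched as computing the row count and the summed amount on both sides and checking that they agree. The `amount` column and the function names are assumptions for illustration; a real check would cover every additive measure in the load.

```python
from decimal import Decimal


def flash_totals(rows):
    """Return (row count, total amount) for a batch of rows."""
    total = sum((row["amount"] for row in rows), Decimal("0"))
    return len(rows), total


def verify_load(intermediate_rows, target_rows):
    """True when counts and amounts match between intermediate file and target."""
    return flash_totals(intermediate_rows) == flash_totals(target_rows)


intermediate = [{"amount": Decimal("10.00")}, {"amount": Decimal("15.00")}]
target = [{"amount": Decimal("10.00")}, {"amount": Decimal("15.00")}]
print(flash_totals(intermediate))
print(verify_load(intermediate, target))
```

Matching totals do not prove every row is correct, but a mismatch is a cheap, early signal that the load should be held for inspection.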

Steps for Verifying Data Integrity
[Diagram labels: Source files, Extract, Control, SQL*Loader, Target, .log (step 4), .bad (step 7)]

Standard Quality Assurance Checks
• Load status
• Completion of the process
• Completeness of the data
• Data reconciliation
• Referential integrity violations
• Reprocessing
• Comparison of counts and amounts