ETL Concepts

Scope of the Training
- What is ETL?
- Need for ETL
- ETL Glossary
- The ETL Process
- Data Extraction and Preparation
- Data Cleansing
- Data Transformation
- Data Load
- Data Refresh Strategies
- ETL Solution Options
- Characteristics of ETL Tools
- Types of ETL Tools
- ETL Tool Selection Criteria
- Key Tools in the Market

What is ETL?
ETL stands for Extraction, Transformation and Load. It is the most challenging, costly and time-consuming step in building any type of data warehouse, and it usually determines the success or failure of the warehouse, because any analysis places great weight on the data and on the quality of the data being analyzed.

What is ETL?

What is ETL? - Extraction
- The process of culling out the data required for the Data Warehouse from the source system
- Can be to a file or to a database
- Could involve some degree of cleansing or transformation
- Can be automated, since it becomes repetitive once established
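
As a minimal illustration of the extraction step, the sketch below pulls rows from a source table into a flat staging file. The database, table and column names (source.db, orders, and so on) are hypothetical, invented for the example rather than taken from the slides.

```python
import csv
import sqlite3

# Connect to a (hypothetical) operational source database.
conn = sqlite3.connect("source.db")
cursor = conn.execute("SELECT order_id, customer_id, amount, order_date FROM orders")

# Extraction "can be to a file or to a database" -- here, a flat staging file.
with open("orders_extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)

conn.close()
```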

What is ETL? - Transformation & Cleansing
Transformation
- Modification or transformation of data being imported into the Data Warehouse
- Usually done to ensure 'clean' and 'consistent' data
Cleansing
- The process of removing errors and inconsistencies from data being imported into a data warehouse
- Could involve multiple stages
- Feedback can flow back to strengthen the OLTP data capture mechanism

What is ETL? - Loading
After extracting, scrubbing, cleansing, validating, etc., the data must be loaded into the warehouse.
Issues:
- Huge volumes of data to be loaded
- Small time window available when the warehouse can be taken offline (usually nights)
- When to build indexes and summary tables
- Allowing system administrators to monitor, cancel, resume and change load rates
- Recovering gracefully: restart after failure from where you were, without loss of data integrity

What is ETL? - Loading Techniques
- Batch load utility: sort input records on the clustering key and use sequential I/O; build indexes and derived tables afterwards
- Where sequential loads are still too long, use parallelism and incremental techniques
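
A rough sketch of a restartable batch load, under assumed names (the staging file from the earlier extraction sketch, a warehouse.db target, a fact_orders table): rows are sorted on the clustering key, inserted in batches, and a checkpoint is persisted so a failed load can resume from where it was.

```python
import csv
import json
import os
import sqlite3

BATCH = 1000
CHECKPOINT = "load_checkpoint.json"

# Resume point: how many rows were already committed in a previous run.
done = json.load(open(CHECKPOINT))["rows"] if os.path.exists(CHECKPOINT) else 0

with open("orders_extract.csv") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    rows = sorted(reader, key=lambda r: r[0])  # sort on the clustering key

dw = sqlite3.connect("warehouse.db")
dw.execute("CREATE TABLE IF NOT EXISTS fact_orders (order_id, customer_id, amount, order_date)")

for start in range(done, len(rows), BATCH):
    dw.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows[start:start + BATCH])
    dw.commit()  # commit per batch, then record the checkpoint
    json.dump({"rows": start + BATCH}, open(CHECKPOINT, "w"))

dw.close()
```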

The Need for ETL
- Facilitates integration of data from various data sources for building a data warehouse
  Note: mergers and acquisitions also create disparities in data representation and pose even more difficult ETL challenges.
- Businesses have data in multiple databases, with different codification and formats
- Transformation is required to convert and summarize operational data into a consistent, business-oriented format
- Pre-computation of derived information: summarization is also carried out to pre-compute summaries and aggregates
- Makes data available in a queryable format

The Need for ETL - Example
Different source applications feeding the Data Warehouse represent the same information differently:
- Encoding: appl A - m,f; appl B - 1,0; appl C - x,y; appl D - male,female
- Units: appl A - pipeline in cm; appl B - pipeline in inches; appl C - pipeline in feet; appl D - pipeline in yards
- Field names: appl A - balance; appl B - bal; appl C - currbal; appl D - balcurr
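
To make the disparity concrete, here is a small sketch (with made-up records) that harmonizes the four applications' gender encodings and pipeline units into one warehouse convention:

```python
# Per-application gender encodings, all mapped to a single warehouse standard.
GENDER_MAP = {
    "A": {"m": "male", "f": "female"},
    "B": {"1": "male", "0": "female"},
    "C": {"x": "male", "y": "female"},
    "D": {"male": "male", "female": "female"},
}

# Per-application pipeline length units, all converted to one unit (cm).
CM_PER_UNIT = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}  # cm, in, ft, yd

def harmonize(record):
    app = record["app"]
    return {
        "gender": GENDER_MAP[app][record["gender"]],
        "pipeline_cm": record["pipeline"] * CM_PER_UNIT[app],
    }

print(harmonize({"app": "B", "gender": "1", "pipeline": 12}))
# -> {'gender': 'male', 'pipeline_cm': 30.48}
```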

Data Integrity Problems - Scenarios
- Same person, different spellings: Agarwal, Agrawal, Aggarwal, etc.
- Multiple ways to denote a company name: Persistent Systems, PSPL, Persistent Pvt. Ltd.
- Use of different names: Mumbai, Bombay
- Different account numbers generated by different applications for the same customer
- Required fields left blank
- Invalid product codes collected at the point of sale: manual entry leads to mistakes ("in case of a problem use 9999999")

ETL Glossary
- Extracting
- Conditioning
- Householding
- Enrichment
- Scoring

ETL Glossary
Extracting
- Capture of data from the operational source in "as is" status
- Sources are generally legacy mainframes (VSAM, IMS, IDMS, DB2); more data today is in relational databases on Unix
Conditioning
- The conversion of data types from the source to the target data store (warehouse), which is almost always a relational database

ETL Glossary
Householding
- Identifying all members of a household (living at the same address)
- Ensures only one mailing is sent to each household
- Can result in substantial savings: 1 lakh catalogues at Rs. 50 each cost Rs. 50 lakhs; a 2% saving is worth Rs. 1 lakh
Enrichment
- Bringing in data from external sources to augment/enrich operational data; data sources include Dun & Bradstreet, A. C. Nielsen, CMIE, IMRA, etc.
Scoring
- Computation of the probability of an event, e.g. the chance that a customer will defect to AT&T from MCI, or that a customer is likely to buy a new product
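
A toy illustration of householding: group customer records on a normalized address so that each household receives a single mailing. The records and the crude normalization rule are invented for the example.

```python
from collections import defaultdict

customers = [
    {"name": "R. Agarwal", "address": "12, MG Road, Pune"},
    {"name": "S. Agarwal", "address": "12 M.G. Road Pune"},
    {"name": "K. Mehta",   "address": "7, FC Road, Pune"},
]

def normalize(address):
    # Crude normalization: lowercase, drop punctuation, collapse spaces.
    cleaned = "".join(ch for ch in address.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

households = defaultdict(list)
for c in customers:
    households[normalize(c["address"])].append(c["name"])

for address, members in households.items():
    print(address, "->", members)  # one catalogue per household, not per member
```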

The ETL Process
- Access data dictionaries defining the source files
- Build logical and physical data models for the target data
- Identify sources of data in existing systems
- Specify business and technical rules for data extraction, conversion and transformation
- Perform data extraction and transformation
- Load the target databases

The ETL Process – Push vs. Pull
Pull: a pull strategy is initiated by the target system. As part of the extraction process, the source data is pulled from the transactional system into a staging area by establishing a connection to the relational, flat-file or ODBC sources.
- Advantage: no additional space is required to store the data that is to be loaded into the staging database
- Disadvantage: places a burden on the transactional systems while data is being loaded into the staging database
Push: a push strategy is initiated by the source system. As part of the extraction process, the source data is pushed, exported or dumped to a file location, from where it can be loaded into the staging area.
- Advantage: no additional burden on the transactional systems while data is being loaded into the staging database
- Disadvantage: additional space is required to store the data that is to be loaded into the staging database

The ETL Process – Push vs. Pull
With a PUSH strategy, the source system area maintains the application that reads the source and creates an interface file presented to your ETL. With a PULL strategy, the DW team maintains the application that reads the source.
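
The difference is mainly in who runs the reader. A schematic sketch, with hypothetical paths and table names: pull connects to the source and queries it directly; push merely picks up whatever interface files the source system has already dropped.

```python
import glob
import shutil
import sqlite3

def pull_extract(source_db, staging_csv):
    """PULL: the warehouse side connects to the source and reads it directly."""
    conn = sqlite3.connect(source_db)
    rows = conn.execute("SELECT * FROM orders").fetchall()
    conn.close()
    with open(staging_csv, "w") as f:
        f.writelines(",".join(map(str, r)) + "\n" for r in rows)

def push_collect(drop_dir, staging_dir):
    """PUSH: the source system exports interface files; we only move them in."""
    for path in glob.glob(f"{drop_dir}/*.csv"):
        shutil.move(path, staging_dir)
```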

The ETL Process - Data Extraction and Preparation
(Diagram) Stage I: Extract -> Stage II: Analyze, Clean and Transform -> Stage III: Data Movement and Load, with periodic refresh/update.

The ETL Process – A Simplified Picture
(Diagram) OLTP systems -> Extract (Stage I) -> Staging area -> Transform (Stage II) -> Load (Stage III) -> Data Warehouse.

The ETL Process – Step 1
Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
- Static extract = capturing a snapshot of the source data at a point in time
- Incremental extract = capturing only the changes that have occurred since the last static extract
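
A common way to implement the incremental extract is a high-water mark: remember when the last extract ran and pull only rows modified since. A sketch, with assumed table and column names (orders, last_modified):

```python
import sqlite3

def static_extract(conn):
    """Full snapshot of the source table at a point in time."""
    return conn.execute("SELECT * FROM orders").fetchall()

def incremental_extract(conn, last_run):
    """Only rows changed since the previous extract (the high-water mark)."""
    return conn.execute(
        "SELECT * FROM orders WHERE last_modified > ?", (last_run,)
    ).fetchall()

conn = sqlite3.connect("source.db")
changes = incremental_extract(conn, "2005-06-01 00:00:00")
print(len(changes), "rows changed since the last run")
```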

The ETL Process – Step 2
Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality
- Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
- Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

The ETL Process – Step 3
Transform = convert data from the format of the operational system to the format of the data warehouse
Record-level:
- Selection: data partitioning
- Joining: data combining
- Aggregation: data summarization
Field-level:
- Single-field: from one field to one field
- Multi-field: from many fields to one, or from one field to many
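
These categories map to very ordinary operations. A sketch over invented records: selection is a filter, aggregation a group-and-sum, a single-field transform a code decode, and a multi-field transform a concatenation.

```python
from collections import defaultdict

sales = [
    {"region": "W", "prod": "A", "qty": 3, "first": "Asha", "last": "Rao"},
    {"region": "E", "prod": "A", "qty": 5, "first": "Ravi", "last": "Iyer"},
    {"region": "W", "prod": "B", "qty": 2, "first": "Meena", "last": "Shah"},
]

# Record-level selection: partition to the western region only.
west = [r for r in sales if r["region"] == "W"]

# Record-level aggregation: summarize quantity by product.
totals = defaultdict(int)
for r in sales:
    totals[r["prod"]] += r["qty"]

# Field-level, single-field: decode a region code into a description.
REGION = {"W": "West", "E": "East"}
decoded = [REGION[r["region"]] for r in sales]

# Field-level, multi-field: many fields to one (full name).
names = [f'{r["first"]} {r["last"]}' for r in sales]

print(len(west), dict(totals), decoded, names)
```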

The ETL Process – Step 4
Load/index = place the transformed data into the warehouse and create indexes
- Refresh mode: bulk rewriting of the target data at periodic intervals
- Update mode: only changes in the source data are written to the data warehouse
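
In SQL terms the two modes look roughly like this (SQLite syntax, hypothetical dim_customer table): refresh truncates and rewrites the target; update merges only the rows that changed.

```python
import sqlite3

dw = sqlite3.connect("warehouse.db")
dw.execute("CREATE TABLE IF NOT EXISTS dim_customer (id INTEGER PRIMARY KEY, name TEXT)")

def refresh_mode(rows):
    """Bulk rewrite: wipe the target and reload everything."""
    dw.execute("DELETE FROM dim_customer")
    dw.executemany("INSERT INTO dim_customer VALUES (?, ?)", rows)
    dw.commit()

def update_mode(changed_rows):
    """Write only changes: upsert the rows that differ since the last load."""
    dw.executemany("INSERT OR REPLACE INTO dim_customer VALUES (?, ?)", changed_rows)
    dw.commit()

refresh_mode([(1, "Asha Rao"), (2, "Ravi Iyer")])
update_mode([(2, "Ravi K. Iyer")])  # only customer 2 changed
```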

The ETL Process - Data Transformation
Transforms the data in accordance with the business rules and standards that have been established. Examples include: format changes, de-duplication, splitting up fields, replacement of codes, derived values, and aggregates.

Scrubbing/Cleansing Data
- Sophisticated transformation tools are used to improve the quality of the data
- Clean data is vital for the success of the warehouse
- Example: Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are all the same person

Reasons for “Dirty” Data
- Dummy values
- Absence of data
- Multipurpose fields
- Cryptic data
- Contradicting data
- Inappropriate use of address lines
- Violation of business rules
- Reused primary keys
- Non-unique identifiers
- Data integration problems

The ETL Process - Data Cleansing
- Source systems contain "dirty data" that must be cleansed
- ETL software contains only rudimentary data cleansing capabilities, so specialized data cleansing software is often used
- Important for performing name and address correction and householding functions
- Leading data cleansing/quality technology vendors include IBM (QualityStage & ProfileStage), Harte-Hanks (Trillium Software), SAS (DataFlux) and Firstlogic

Steps in Data Cleansing
- Parsing
- Correcting
- Standardizing
- Matching
- Consolidating

Parsing
Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files. Examples include parsing the first, middle, and last name; street number and street name; and city and state.
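
A toy name-and-address parser in the spirit of this step; real products use far richer grammars and reference data, and the record layout here ("name, street, city state") is invented.

```python
def parse_record(raw):
    """Split a free-form 'name, street, city state' line into labelled elements."""
    name_part, street_part, city_part = [p.strip() for p in raw.split(",")]
    first, *middle, last = name_part.split()
    street_no, street_name = street_part.split(" ", 1)
    city, state = city_part.rsplit(" ", 1)
    return {
        "first": first, "middle": " ".join(middle), "last": last,
        "street_no": street_no, "street_name": street_name,
        "city": city, "state": state,
    }

print(parse_record("Anita K. Desai, 42 Hill Road, Mumbai MH"))
```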

Correcting
Correcting fixes parsed individual data components using sophisticated data algorithms and secondary data sources. Examples include replacing a vanity address and adding a missing zip code.

Standardizing
Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules. Examples include adding a prename, replacing a nickname, and using a preferred street name.
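
A minimal standardizing pass, with made-up lookup tables: replace nicknames with formal names and street suffixes with their preferred forms.

```python
NICKNAMES = {"Bob": "Robert", "Liz": "Elizabeth", "Sam": "Samuel"}
STREET_SUFFIX = {"St.": "Street", "Rd.": "Road", "Ave.": "Avenue"}

def standardize(record):
    record["first"] = NICKNAMES.get(record["first"], record["first"])
    record["street_name"] = " ".join(
        STREET_SUFFIX.get(word, word) for word in record["street_name"].split()
    )
    return record

print(standardize({"first": "Bob", "street_name": "Hill Rd."}))
# -> {'first': 'Robert', 'street_name': 'Hill Road'}
```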

Matching
Matching searches for and matches records within and across the parsed, corrected and standardized data, based on predefined business rules, to eliminate duplicates. Examples include identifying similar names and addresses.
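
A crude match rule using only the standard library: treat two records as candidates for the same person when their name similarity exceeds a threshold. This would catch the Agarwal/Agrawal/Aggarwal variants from the earlier slide; the 0.8 threshold is an arbitrary choice for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

names = ["Agarwal", "Agrawal", "Aggarwal", "Mehta"]

def similar(a, b, threshold=0.8):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

for a, b in combinations(names, 2):
    if similar(a, b):
        print(f"possible duplicate: {a} / {b}")
```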

Consolidating
Consolidating analyzes and identifies the relationships between matched records and consolidates/merges them into ONE representation.
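
Consolidation needs a survivorship rule for each field. One simple (assumed) rule, sketched below: keep the longest non-empty value across the matched records.

```python
def consolidate(records):
    """Merge matched records into one, keeping the most complete value per field."""
    merged = {}
    for rec in records:
        for field, value in rec.items():
            if value and len(str(value)) > len(str(merged.get(field, ""))):
                merged[field] = value
    return merged

matched = [
    {"name": "S. Seshadri", "city": "", "phone": "98220 11111"},
    {"name": "Srinivasan Seshadri", "city": "Pune", "phone": ""},
]
print(consolidate(matched))
# -> {'name': 'Srinivasan Seshadri', 'city': 'Pune', 'phone': '98220 11111'}
```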

Data Quality Technology Tools (Vendors)
- DataFlux Integration Server & dfPower® Studio (www.DataFlux.com)
- Trillium Software Discovery & Trillium Software System (www.trilliumsoftware.com)
- ProfileStage & QualityStage (www.ascential.com)

MarketScope Update: Data Quality Technology ratings, 2005 (Source: Gartner - June 2005)

The ETL Process - Data Loading
- Data are physically moved to the data warehouse
- The loading takes place within a "load window"
- The trend is toward near-real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications

Data Loading - First-Time Load
- The first load is a complex exercise
- Data are extracted from tapes, files, archives, etc.
- The first-time load may take a long time to complete

Data Refresh
When to refresh?
- On every update: too expensive; only necessary if OLAP queries need current data (e.g., up-to-the-minute stock quotes)
- Periodically (e.g., every 24 hours, or every week), or after "significant" events
- The refresh policy is set by the administrator based on user needs and traffic, possibly with different policies for different sources
How to refresh? (see the approaches and techniques below)

Data Refresh
Data refreshing can follow two approaches:
- Complete data refresh: completely refresh the target table every time
- Data trickle load: replicate only the net changes and update the target database

Data Refresh Techniques
Snapshot approach - full extract from base tables:
- Read the entire source table or database: expensive
- May be the only choice for legacy databases or files
Incremental techniques (related to work on active databases) - detect and propagate changes on base tables:
- Replication servers (e.g., Sybase, Oracle, IBM Data Propagator)
- Snapshots and triggers (Oracle), as sketched below
- Transaction shipping (Sybase)
- Logical correctness: computing changes to star tables, and to derived and summary tables; optimization: propagate only significant changes
- Transactional correctness: incremental load
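
One incremental technique, sketched with SQLite triggers and hypothetical tables: the source keeps a change log, and the refresh job ships only the logged changes to the warehouse.

```python
import sqlite3

src = sqlite3.connect("source.db")
src.executescript("""
CREATE TABLE IF NOT EXISTS orders (order_id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE IF NOT EXISTS orders_changes (order_id INTEGER, amount REAL);
-- Trigger-based change capture: every insert is also logged for the trickle load.
CREATE TRIGGER IF NOT EXISTS orders_ins AFTER INSERT ON orders
BEGIN
    INSERT INTO orders_changes VALUES (NEW.order_id, NEW.amount);
END;
""")
src.execute("INSERT INTO orders VALUES (101, 250.0)")
src.commit()

# The refresh job reads only the change log, applies it, then clears it.
changes = src.execute("SELECT * FROM orders_changes").fetchall()
print("trickle-loading", changes)  # apply these rows to the warehouse star tables
src.execute("DELETE FROM orders_changes")
src.commit()
```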

ETL Solution Options
- Custom solution
- Generic (tool-based) solution

Custom Solution
- Using RDBMS staging tables and stored procedures
- Programming languages such as C, C++, Perl, Visual Basic, etc.
- Building a code generator

Custom Solution – Typical Components
Extract from source:
- Snapshots for dimension tables
- PL/SQL extraction procedures
- Complex views for transformation
- Control table and a highly parameterized/generic extraction process
Data quality:
- Control-table-driven, highly configurable PL/SQL procedure
- Checks performed: referential integrity, Y2K, elementary statistics, business rules
- Mechanism to flag records as bad/reject
Generate download files:
- Multiple stars extracted as separate groups
- Pro*C programs using embedded SQL
- Surrogate key generation mechanism
- ASCII file downloads generated for load into the warehouse
Control program:
- Time-window-based extraction
- Restart at point of failure
- High level of error handling
- Control metadata captured in Oracle tables
- Facility to launch failure recovery programs automatically

Generic Solution
- Addresses the limitations (in scalability and complexity) of manual coding
- Driven by the need to deliver quantifiable business value
- Functionality, reliability and viability are no longer major issues

Characteristics of ETL Tools
- Provide a GUI for specifying large numbers of transformation rules
- Generate programs to transform data
- Handle multiple data sources and data redundancy
- Generate and maintain centralized metadata as output
- Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
- Support data extraction, cleansing, aggregation, reorganization, transformation and load operations
- Closely integrated with various RDBMSs
- Filter data, convert codes, calculate derived values, map many source data fields to one target data field
- Automatic generation of data extract programs
- High-speed loading of target data warehouses

Types of ETL Tools
- First generation: code-generation products that generate source code
- Second generation: engine-driven products that generate directly executable code
Note: due to their more efficient architecture, second-generation tools have a significant advantage over first-generation tools.

Types of ETL Tools - First Generation
- The extraction/transformation/load process runs on a server or host
- A GUI is used to define the extraction/transformation processes
- Detailed transformations require coding in COBOL or C
- The extract program is generated automatically as source code, which is compiled, scheduled and run in batch mode
- Uses intermediate files
- The program is single-threaded and cannot use parallel processors
- Little metadata is generated automatically

First-Generation ETL Tools – Strengths and Limitations
Strengths:
- Tools are mature
- Programmers are familiar with code generation in COBOL or C
Limitations:
- High cost of products
- Complex training
- Extract programs have to be compiled from source
- Many transformations have to be coded manually
- Lack of parallel execution support
- Most metadata has to be generated manually

First-Generation ETL Tools – Examples
- SAS/Warehouse Administrator
- Prism from Prism Solutions
- Passport from Apertus Carleton Corp.
- ETI-EXTRACT Tool Suite from Evolutionary Technologies
- Copy Manager from Information Builders

Types of ETL Tools - Second Generation
- Extraction/transformation/load runs on a server
- Data is extracted directly from the source and processed on the server
- Data is transformed in memory and written directly to the warehouse database; high throughput, since intermediate files are not used
- Directly executable code
- Support for monitoring, scheduling, extraction, scrubbing, transformation, load, indexing, aggregation and metadata
- Multi-threading, with use of parallel processors
- Automatic generation of a high percentage of the metadata

Second-Generation ETL Tools – Strengths and Limitations
Strengths:
- Lower cost of suites, platforms and environment
- Fast, efficient and multi-threaded
- ETL functions highly integrated and automated
- Open, extensible metadata
Limitations:
- Not mature
- Initial tools oriented only to RDBMS sources

Second-Generation ETL Tools – Examples
- PowerMart from Informatica
- DataStage from Ardent
- Data Mart Solution from Sagent Technology
- Tapestry from D2K

ETL Tools - Examples
- DataStage from Ascential Software
- SAS System from SAS Institute
- PowerMart/PowerCenter from Informatica
- Sagent Solution from Sagent Software
- Hummingbird Genio Suite from Hummingbird Communications

ETL Tool - General Selection Criteria
- Business vision/considerations
- Overall IT strategy/architecture
- Overall cost of ownership
- Vendor positioning in the market
- Performance
- In-house expertise available
- User-friendliness
- Training requirements for existing users
- References from other customers

ETL Tool – Specific Selection Criteria
- Support to retrieve, cleanse, transform, summarize, aggregate and load data
- Engine-driven products for fast, parallel operation
- Generate and manage a central metadata repository
- Open metadata exchange architecture
- Provide end users with access to metadata in business terms
- Support development of logical and physical data models

ETL Tool - Selection Criteria
(Chart, Source: Gartner Report) Tools rated from high to low on: ease of use/development capabilities; target database loading; data transformation and repair complexity; operations management/process automation; metadata management and administration; data extraction & integration complexity. Tools rated: ETI Extract, SAS Warehouse Administrator, Informatica PowerCenter, Platinum Decision Base, Ardent DataStage, DataMirror Transformation Server, Ardent Warehouse Executive, Carleton Pureview.