ETL By Dr. Gabriel.

ETL Process
4 major components:
- Extracting: gathering raw data from source systems and storing it in the ETL staging environment
- Cleaning and conforming: processing data to improve its quality, format it, merge it from multiple sources, and enforce conformed dimensions
- Delivering: loading data into data warehouse tables
- Managing: management of the ETL environment

ETL: Extracting
- Data profiling
- Identifying data that changed since the last load
- Extraction
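One common way to identify data that changed since the last load is a high-water mark on a last-modified timestamp. The sketch below is illustrative only: the `orders` table, its `updated_at` column, and the function name `extract_changed_rows` are assumptions that do not come from the slides, and SQLite (through Python's standard `sqlite3` module) stands in for a real source system.

```python
import sqlite3

def extract_changed_rows(conn, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Advance the watermark only as far as the data actually seen.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Hypothetical source table with an ISO-8601 last-modified column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-01-01T08:00:00"),
        (2, 25.5, "2024-01-02T09:30:00"),
        (3, 7.25, "2024-01-03T11:15:00"),
    ],
)

changed, watermark = extract_changed_rows(conn, "2024-01-01T23:59:59")
print(changed)
print(watermark)
```

The returned watermark would be persisted by the ETL job and passed in on the next run. This only works when the source reliably maintains the timestamp column, which is itself an assumption worth checking during data profiling.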

ETL: Cleaning and Conforming
- Data cleansing
- Recording error events
- Audit dimensions
- Deduping
- Creating and maintaining conformed dimensions and facts
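A minimal sketch of cleansing, error-event recording, and deduping in one pass. The customer records, the choice of a normalized email as the matching key, and the function name `clean_and_dedupe` are all assumptions made for illustration; real deduping rules are usually far more involved.

```python
def clean_and_dedupe(rows):
    """Return (clean_rows, error_events) for a list of customer dicts."""
    clean, errors, seen = [], [], set()
    for line_no, row in enumerate(rows, start=1):
        email = (row.get("email") or "").strip().lower()
        if "@" not in email:
            # Record an error event instead of silently dropping the row.
            errors.append({"line": line_no, "reason": "invalid email", "row": row})
            continue
        if email in seen:
            continue  # duplicate of an earlier row; keep the first occurrence
        seen.add(email)
        clean.append({"name": row["name"].strip(), "email": email})
    return clean, errors

raw = [
    {"name": "Ann Lee ", "email": "Ann@Example.com"},
    {"name": "Ann Lee", "email": " ann@example.com"},
    {"name": "Bob Roy", "email": None},
]
clean, errors = clean_and_dedupe(raw)
print(clean)
print(errors)
```

Keeping rejected rows in an error-event list, rather than discarding them, is what makes later data quality tracking possible.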

ETL: Delivering
- Implementation of SCD logic
- Surrogate key generation
- Managing hierarchies in dimensions
- Managing special dimensions such as date and time, junk, mini, shrunken, small static, and user-maintained dimensions
- Mini dimensions
  - Used to track changes of dimension attributes when the type 2 technique is infeasible
  - Similar to junk dimensions
  - Typically used for large dimensions
  - Combinations can be built in advance or on the fly
  - Built from dimension table input
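As an illustration of SCD logic, here is a sketch of a type 2 change, where the current dimension row is expired and a new version is inserted under a new surrogate key. The `dim_customer` table, its `valid_from` / `valid_to` / `is_current` columns, and the function `apply_scd2` are hypothetical names chosen for this example; SQLite via Python's `sqlite3` stands in for the warehouse database.

```python
import sqlite3

def apply_scd2(conn, customer_id, city, load_date):
    """Type 2 change: expire the current row and insert a new version."""
    current = conn.execute(
        "SELECT customer_key, city FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if current and current[1] == city:
        return  # attribute unchanged, nothing to do
    if current:
        # Close out the old version rather than overwriting it.
        conn.execute(
            "UPDATE dim_customer SET is_current = 0, valid_to = ? "
            "WHERE customer_key = ?",
            (load_date, current[0]),
        )
    conn.execute(
        "INSERT INTO dim_customer "
        "(customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, city, load_date),
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer ("
    " customer_key INTEGER PRIMARY KEY AUTOINCREMENT,"  # surrogate key
    " customer_id TEXT, city TEXT,"
    " valid_from TEXT, valid_to TEXT, is_current INTEGER)"
)
apply_scd2(conn, "C1", "Austin", "2024-01-01")
apply_scd2(conn, "C1", "Austin", "2024-02-01")  # no change, no new row
apply_scd2(conn, "C1", "Dallas", "2024-03-01")  # change, new version
history = conn.execute(
    "SELECT customer_key, city, valid_from, valid_to, is_current "
    "FROM dim_customer ORDER BY customer_key"
).fetchall()
print(history)
```

A type 1 change would instead overwrite `city` in place, losing history; the sketch tracks a single attribute only to stay short.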

ETL: Delivering (Cont)
- Small static dimensions
  - Dimensions created by the ETL system without a real source
  - Lookup dimensions for translations of codes, etc.
- User-maintained dimensions
  - Master dimensions without a real source system
  - Descriptions, groupings, and hierarchies created for reporting and analysis purposes

ETL: Delivering (Cont)
- Fact table loading
- Building and maintaining bridge dimension tables
- Handling late arriving data
- Management of conformed dimensions
- Administration of fact tables
- Building aggregations
- Building OLAP cubes
- Transferring DW data to other environments for specific purposes
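One way to handle late arriving dimension data during fact table loading is to insert a placeholder dimension row when a fact references a key that has not arrived yet. This is a sketch under assumptions: the `dim_product` and `fact_sales` tables, the `'Unknown'` placeholder name, and the function `lookup_or_placeholder` are illustrative and not taken from the slides.

```python
import sqlite3

def lookup_or_placeholder(conn, product_id):
    """Return the surrogate key for product_id, adding a stub row if absent."""
    row = conn.execute(
        "SELECT product_key FROM dim_product WHERE product_id = ?",
        (product_id,),
    ).fetchone()
    if row:
        return row[0]
    # Dimension data has not arrived yet: insert a stub to be updated later.
    cur = conn.execute(
        "INSERT INTO dim_product (product_id, name) VALUES (?, 'Unknown')",
        (product_id,),
    )
    return cur.lastrowid

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_product ("
    " product_key INTEGER PRIMARY KEY, product_id TEXT UNIQUE, name TEXT)"
)
conn.execute("CREATE TABLE fact_sales (product_key INTEGER, qty INTEGER)")
conn.execute("INSERT INTO dim_product (product_id, name) VALUES ('P1', 'Widget')")

# 'P9' appears in the facts before its dimension row exists.
for product_id, qty in [("P1", 3), ("P9", 1), ("P9", 2)]:
    key = lookup_or_placeholder(conn, product_id)
    conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (key, qty))

dims = conn.execute(
    "SELECT product_key, product_id, name FROM dim_product ORDER BY product_key"
).fetchall()
facts = conn.execute(
    "SELECT product_key, qty FROM fact_sales ORDER BY rowid"
).fetchall()
print(dims)
print(facts)
```

When the real product record eventually arrives, the placeholder's descriptive attributes can be updated in place, and the facts already loaded keep pointing at the same surrogate key.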

ETL: Managing
Management of the ETL environment
- Goals
  - Reliability
  - Availability
  - Manageability
- Job scheduler
- Backup system
- Recovery and restart system
- Version control system

ETL: Managing (Cont.)
- Version migration system
- Workflow monitor
- Sorting system
- Analyzing dependencies and lineage
- Problem escalation system
- Parallelization
- Security system
- Compliance manager
- Metadata repository manager

ETL Process
- Planning
  - High-level source-to-target data flow diagram
  - Selection and implementation of the ETL tool
  - Development of default strategies for dimension management, error handling, and other processes
  - Development of data transformation diagrams by target table
  - Development of job sequencing

ETL Process
- Developing the one-time historic load
  - Build and test the historic dimension and fact table loads
- Developing the incremental load process
  - Build and test the dimension and fact table incremental load processes
  - Build and test aggregate table loads and/or OLAP processing
  - Design, build, and test the ETL system automation

ETL Tools: Build vs Buy
- Many off-the-shelf tools exist
- Benefits are not seen right away
  - Setup
  - Learning curve
- High-end tools may not justify their value for smaller warehouses

Off-the-shelf ETL Tools
Tool                                  Vendor
Oracle Warehouse Builder (OWB)        Oracle
Data Integrator (BODI)                Business Objects
IBM Information Server (Ascential)    IBM
SAS Data Integration Studio           SAS Institute
PowerCenter                           Informatica
Oracle Data Integrator (Sunopsis)     Oracle
Data Migrator                         Information Builders
Integration Services                  Microsoft
Talend Open Studio                    Talend
DataFlow                              Group 1 Software (Sagent)
Data Integrator                       Pervasive
Transformation Server                 DataMirror
Transformation Manager                ETL Solutions Ltd.
Data Manager                          Cognos
DT/Studio                             Embarcadero Technologies
ETL4ALL                               IKAN
DB2 Warehouse Edition                 IBM
Jitterbit                             Jitterbit
Pentaho Data Integration              Pentaho

ETL Specification Document
Can be as large as 100 pages per business process; in reality, the work starts after the high-level design is documented in a few pages.
- Source-to-target mappings
- Data profiling reports
- Physical design decisions
- Default strategy for extracting from each major source system
- Archival strategy
- Data quality tracking and metadata
- Default strategy for managing changes to dimension attributes

ETL Specification Document (Cont)
- System availability requirements and strategy
- Design of the data auditing mechanism
- Location of staging areas
- Historic and incremental load strategies for each table
- Detailed table design
- Historic data load parameters (# of months) and volumes (# of rows)
- Incremental data volumes

ETL Specification Document (Cont)
- Handling of late arriving data
- Load frequency
- Handling of changes in each dimension attribute (types 1, 2, 3)
- Table partitioning
- Overview of data sources; discussion of source-specific characteristics
- Extract strategy for the source data
- Change data capture logic for each source table
- Dependencies
- Transformation logic (diagram or pseudocode)

ETL Specification Document (Cont)
- Preconditions to avoid error conditions
- Recovery and restart assumptions for each major step of the ETL pipeline
- Archiving assumptions for each table
- Cleanup steps
- Estimated effort
- Overall workflow
- Job sequencing
- Logical dependencies

Loading Pointers
- One-time historic load
  - Disable RI constraints (FKs) and re-enable them after the load is complete
  - Drop indexes and re-create them after the load is complete
  - Use bulk loading techniques
- Not always the case
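The index-related pointer can be sketched as follows. This is an illustration, not a benchmark: the `fact_sales` table, the index name `ix_fact_sales_date`, and the function `historic_load` are assumptions, SQLite stands in for the warehouse database, and `executemany` inside a single transaction stands in for a real bulk loader. Disabling and re-enabling foreign keys is not shown.

```python
import sqlite3

def historic_load(conn, rows):
    """Load rows in bulk with the index dropped, then rebuild it once."""
    conn.execute("DROP INDEX IF EXISTS ix_fact_sales_date")
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT INTO fact_sales (date_key, amount) VALUES (?, ?)", rows
        )
    # Rebuild the index a single time after all rows are in place.
    conn.execute("CREATE INDEX ix_fact_sales_date ON fact_sales (date_key)")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (date_key INTEGER, amount REAL)")
conn.execute("CREATE INDEX ix_fact_sales_date ON fact_sales (date_key)")

# Synthetic history: 10,000 rows spread over 28 date keys.
rows = [(20240101 + i % 28, float(i)) for i in range(10_000)]
historic_load(conn, rows)

row_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
index_names = [r[1] for r in conn.execute("PRAGMA index_list('fact_sales')")]
print(row_count, index_names)
```

Whether dropping indexes actually pays off depends on the table size, the database engine, and how much of the table is being loaded, so it is worth measuring before adopting it as a default.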

Loading Pointers (Cont)
- Incremental load

Loading Pointers (Cont)
- Sometimes the historic and incremental load logic is the same; often it is only similar
- Updating aggregations, if necessary
- Error handling
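A minimal sketch of incremental load logic for a dimension with overwrite (type 1) semantics: update the row if the business key already exists, otherwise insert it. The `dim_product` table and the function `incremental_load` are hypothetical; SQLite via Python's `sqlite3` is used so the example is self-contained.

```python
import sqlite3

def incremental_load(conn, changed_rows):
    """Apply changed source rows: update existing keys, insert new ones."""
    inserted = updated = 0
    with conn:  # commit on success, roll back if any statement fails
        for product_id, price in changed_rows:
            cur = conn.execute(
                "UPDATE dim_product SET price = ? WHERE product_id = ?",
                (price, product_id),
            )
            if cur.rowcount == 0:
                # No existing row matched, so this is a new member.
                conn.execute(
                    "INSERT INTO dim_product (product_id, price) VALUES (?, ?)",
                    (product_id, price),
                )
                inserted += 1
            else:
                updated += 1
    return inserted, updated

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (product_id TEXT PRIMARY KEY, price REAL)")
conn.execute("INSERT INTO dim_product VALUES ('P1', 9.99)")

counts = incremental_load(conn, [("P1", 8.49), ("P2", 4.00)])
final_rows = conn.execute(
    "SELECT product_id, price FROM dim_product ORDER BY product_id"
).fetchall()
print(counts, final_rows)
```

Run against an empty target, the same function performs a full load, which is one sense in which historic and incremental logic can coincide; for large histories a bulk path is usually still preferable.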

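Sample: Generation of Surrogate Keys on SQL Server
As simple as:
    -- Next key is one past the current maximum (1 when the table is empty)
    DECLARE @i INTEGER;
    SELECT @i = ISNULL(MAX(ID), 0) + 1 FROM TableName;
But this may not work with concurrent processes, since two sessions can read the same maximum.
Or keep a seed table and increment it in a single statement:
    -- Lookup_Seed table: SeedID VARCHAR(32), SeedValue BIGINT
    CREATE PROCEDURE pGetNextID (@SeedName VARCHAR(32), @SeedValue BIGINT OUTPUT)
    AS
        -- One UPDATE increments the seed and captures the new value
        UPDATE Lookup_Seed
        SET @SeedValue = SeedValue = SeedValue + 1
        WHERE SeedID = @SeedName;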
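The seed-table idea can also be tried outside SQL Server. The sketch below is an analogue, not a translation of the procedure: it uses SQLite through Python's `sqlite3`, the lowercase names `lookup_seed`, `seed_id`, `seed_value`, and the function `get_next_id` are assumptions, and it demonstrates the increment-then-read pattern in a single process only, so it says nothing about locking behavior on SQL Server.

```python
import sqlite3

def get_next_id(conn, seed_name):
    """Increment the named seed and return the new value in one transaction."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE lookup_seed SET seed_value = seed_value + 1 "
            "WHERE seed_id = ?",
            (seed_name,),
        )
        if cur.rowcount != 1:
            raise KeyError(f"unknown seed: {seed_name}")
        # Read the value back inside the same transaction as the update.
        return conn.execute(
            "SELECT seed_value FROM lookup_seed WHERE seed_id = ?",
            (seed_name,),
        ).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lookup_seed (seed_id TEXT PRIMARY KEY, seed_value INTEGER NOT NULL)"
)
conn.execute("INSERT INTO lookup_seed VALUES ('dim_customer', 0)")

first = get_next_id(conn, "dim_customer")
second = get_next_id(conn, "dim_customer")
print(first, second)
```

One seed row per target table keeps key ranges independent, at the cost of an extra round trip per key; handing out keys in blocks is a common refinement when volumes are high.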

Questions?