Ahsan Abdullah 1 Data Warehousing Lecture-18 ETL Detail: Data Extraction & Transformation Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. &

Presentation transcript:

Ahsan Abdullah 1 Data Warehousing Lecture-18 ETL Detail: Data Extraction & Transformation Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics Research National University of Computers & Emerging Sciences, Islamabad

Ahsan Abdullah 2 ETL Detail: Data Extraction & Transformation

Ahsan Abdullah 3 Extracting Changed Data
- Incremental data extraction: extract only what has changed, say during the last 24 hours if considering nightly extraction.
- Efficient when changes can be identified: this works well when the small amount of changed data can be identified efficiently.
- Identification could be costly: unfortunately, for many source systems, identifying the recently modified data may be difficult or may affect the operation of the source system.
- Very challenging: Change Data Capture (CDC) is therefore typically the most challenging technical issue in data extraction.

Ahsan Abdullah 4 Source Systems
Two CDC sources:
- Modern systems
- Legacy systems

Ahsan Abdullah 5 CDC in Modern Systems
Time Stamps
- Works if a timestamp column is present
- If the column is not present, add the column
- It may not be possible to modify the table, so add triggers
Triggers
- Create a trigger for each source table
- Following each DML operation, the trigger performs updates
- Record DML operations in a log
Partitioning
- Table is range partitioned, say along the date key
- Easy to identify new data, say last week's data
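The timestamp and trigger techniques can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not the lecture's own implementation: the table names `customer` and `customer_change_log`, the column `last_modified`, and the sample rows are all hypothetical. Timestamps are stored as ISO-format text so that string comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    id INTEGER PRIMARY KEY,
    name TEXT,
    last_modified TEXT            -- hypothetical timestamp column used for CDC
);
CREATE TABLE customer_change_log (
    log_id INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id INTEGER,
    operation TEXT
);
-- One trigger per DML operation records the change in the log table.
CREATE TRIGGER customer_ins AFTER INSERT ON customer
BEGIN
    INSERT INTO customer_change_log (customer_id, operation)
    VALUES (NEW.id, 'INSERT');
END;
CREATE TRIGGER customer_upd AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_change_log (customer_id, operation)
    VALUES (NEW.id, 'UPDATE');
END;
CREATE TRIGGER customer_del AFTER DELETE ON customer
BEGIN
    INSERT INTO customer_change_log (customer_id, operation)
    VALUES (OLD.id, 'DELETE');
END;
""")

# Hypothetical source activity.
conn.execute("INSERT INTO customer VALUES (1, 'Ali', '2005-11-13 09:00:00')")
conn.execute("INSERT INTO customer VALUES (2, 'Sara', '2005-11-14 15:30:00')")
conn.execute(
    "UPDATE customer SET name = 'Ali Khan', "
    "last_modified = '2005-11-14 18:00:00' WHERE id = 1"
)
conn.execute("DELETE FROM customer WHERE id = 2")


def extract_changed_since(connection, last_extract_time):
    """Timestamp-based CDC: return rows modified after the last extraction."""
    return connection.execute(
        "SELECT id, name, last_modified FROM customer "
        "WHERE last_modified > ? ORDER BY id",
        (last_extract_time,),
    ).fetchall()


changed = extract_changed_since(conn, "2005-11-14 00:00:00")
change_log = conn.execute(
    "SELECT customer_id, operation FROM customer_change_log ORDER BY log_id"
).fetchall()
```

In this sketch the timestamp query returns only the surviving modified row, while the trigger log also records the deletion of row 2, which a timestamp column alone cannot reveal.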

Ahsan Abdullah 6 CDC in Legacy Systems
- Changes recorded on tapes: changes that occur in legacy transaction processing are recorded on log or journal tapes.
- Changes read and removed from tapes: the log or journal tapes are read, and the update/transaction changes are stripped off for movement into the data warehouse.
- Problems with reading a log/journal tape are many:
  - Contains a lot of extraneous data
  - Format is often arcane
  - Often contains addresses instead of data values and keys
  - Sequencing of data in the log tape often has deep and complex implications
  - Log tape varies widely from one DBMS to another

Ahsan Abdullah 7 CDC Advantages: Modern Systems
1. Immediate
2. No loss of history
3. Flat files NOT required

Ahsan Abdullah 8 CDC Advantages: Legacy Systems
1. No incremental on-line I/O required for log tape
2. The log tape captures all update processing
3. Log tape processing can be taken off-line
4. No haste to make waste

Ahsan Abdullah 9 Major Transformation Types
- Format revision
- Decoding of fields
- Calculated and derived values
- Splitting of single fields
- Merging of information
- Character set conversion
- Unit of measurement conversion
- Date/Time conversion
- Summarization
- Key restructuring
- Duplication

Ahsan Abdullah 10 Major Transformation Types
- Format revision
- Decoding of fields
- Calculated and derived values
- Splitting of single fields
Slide annotations: Covered in issues; Covered in De-Norm

Ahsan Abdullah 11 Major Transformation Types
- Merging of information: this does not really mean combining columns to create one column. Information about a product coming from different sources is merged into a single entity.
- Character set conversion: for PC architecture, converting legacy EBCDIC to ASCII.
- Unit of measurement conversion: for companies with global branches, km vs. mile or lb vs. kg.
- Date/Time conversion: November 14, 2005 is written as 11/14/2005 in the US format and 14/11/2005 in the British format. This date may be standardized to be written as 14 NOV
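The three conversions above can be sketched in a few lines of Python. Several choices here are assumptions made for illustration: ISO 8601 is used as the standardized date format (the target format on the slide is cut off), `cp037` is only one of several EBCDIC code pages, and the function names are hypothetical.

```python
from datetime import datetime

US_FORMAT = "%m/%d/%Y"       # month first, as in 11/14/2005
BRITISH_FORMAT = "%d/%m/%Y"  # day first, as in 14/11/2005


def standardize_date(text, source_format):
    """Parse a date in a known source format and return it as ISO 8601."""
    return datetime.strptime(text, source_format).date().isoformat()


us_date = standardize_date("11/14/2005", US_FORMAT)
british_date = standardize_date("14/11/2005", BRITISH_FORMAT)

# Unit of measurement conversion using the exact international definitions.
KM_PER_MILE = 1.609344
KG_PER_LB = 0.45359237


def miles_to_km(miles):
    return miles * KM_PER_MILE


def lb_to_kg(pounds):
    return pounds * KG_PER_LB


# Character set conversion: decode EBCDIC bytes, then re-encode as ASCII.
# The EBCDIC input is produced here by encoding, purely to have sample bytes.
ebcdic_bytes = "WAREHOUSE".encode("cp037")
ascii_bytes = ebcdic_bytes.decode("cp037").encode("ascii")
```

Both date strings resolve to the same standardized value, which is the point of the conversion: the source format must be known per source system, because a string such as 05/04/2005 is ambiguous on its own.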

Ahsan Abdullah 12 Major Transformation Types
- Aggregation & Summarization
  - How are they different?
  - Why are both required?
    - Grain mismatch (don't require, don't have space)
    - Data marts requiring low detail
    - Detail losing its utility
Summarization: adding like values.
Aggregation: summarization with calculation across a business dimension.
Example: monthly compensation = monthly sale + bonus
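The distinction can be illustrated with a small Python sketch. The sales figures, names, and bonus amounts are invented for illustration; only the formula monthly compensation = monthly sale + bonus comes from the slide.

```python
from collections import defaultdict

# Hypothetical daily sales records: (date, salesperson, amount).
daily_sales = [
    ("2005-11-01", "Ahmed", 1000.0),
    ("2005-11-02", "Ahmed", 500.0),
    ("2005-11-02", "Sana", 700.0),
]

# Hypothetical bonus per (salesperson, month).
monthly_bonus = {("Ahmed", "2005-11"): 150.0}


def summarize_monthly(sales):
    """Summarization: add like values (daily sales) up to a monthly grain."""
    totals = defaultdict(float)
    for day, person, amount in sales:
        month = day[:7]  # "YYYY-MM" prefix of the ISO date
        totals[(person, month)] += amount
    return dict(totals)


def monthly_compensation(monthly_sales, bonus):
    """Aggregation: combine the summarized sales with another measure."""
    return {
        key: sale + bonus.get(key, 0.0)
        for key, sale in monthly_sales.items()
    }


monthly_sales = summarize_monthly(daily_sales)
compensation = monthly_compensation(monthly_sales, monthly_bonus)
```

Summarization here only rolls daily rows up to a coarser grain, whereas the compensation figure is a new value calculated from more than one measure.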

Ahsan Abdullah 13 Major Transformation Types
- Key restructuring (inherent meaning at source)
  - Source key parts: Country_Code, City_Code, Post_Code, Product_Code
  - i.e. [source key] changed to [restructured key] (example values missing)
- Removing duplication
  - Incorrect or missing value
  - Inconsistent naming convention, ONE vs 1
  - Incomplete information
  - Physically moved, but address not changed
  - Misspelling or falsification of names
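One common way to restructure a key that carries inherent meaning is to map it to a meaningless surrogate key. The sketch below assumes that approach; the hyphen-separated key layout, the sample key values, and the class name `SurrogateKeyMap` are hypothetical and are not taken from the slide.

```python
from itertools import count


def split_source_key(source_key):
    """Split a hypothetical hyphen-separated source key into its parts."""
    country, city, post, product = source_key.split("-")
    return {
        "Country_Code": country,
        "City_Code": city,
        "Post_Code": post,
        "Product_Code": product,
    }


class SurrogateKeyMap:
    """Assign a stable, meaningless integer key to each distinct source key."""

    def __init__(self, start=1):
        self._next_key = count(start)
        self._assigned = {}

    def key_for(self, source_key):
        # Reuse the surrogate if this source key was seen before.
        if source_key not in self._assigned:
            self._assigned[source_key] = next(self._next_key)
        return self._assigned[source_key]


keys = SurrogateKeyMap()
first = keys.key_for("PK-051-44000-0017")
second = keys.key_for("PK-042-54000-0017")
repeat = keys.key_for("PK-051-44000-0017")
parts = split_source_key("PK-051-44000-0017")
```

Because the surrogate carries no meaning, a change at the source (for example, a product moving to another city) does not force the warehouse key to change; the parts can still be kept as ordinary attributes.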

Ahsan Abdullah 14 Data content defects
- Domain value redundancy
- Non-standard data formats
- Non-atomic data values
- Multipurpose data fields
- Embedded meanings
- Inconsistent data values
- Data quality contamination

Ahsan Abdullah 15 Data content defects: Examples
- Domain value redundancy
  - Unit of Measure: Dozen, Doz., Dz., 12
- Non-standard data formats
  - Phone Numbers (the two example formats are missing)
- Non-atomic data fields
  - Name & Addresses: Dr. Hameed Khan, PhD
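Two of these defects can be handled with simple lookups and parsing, sketched below in Python. The canonical value `DOZEN`, the list of known titles, and the function names are assumptions for illustration; real name parsing needs far more rules than this.

```python
# Domain value redundancy: map every known spelling to one canonical value.
UNIT_SYNONYMS = {
    "dozen": "DOZEN",
    "doz.": "DOZEN",
    "dz.": "DOZEN",
    "12": "DOZEN",
}


def standardize_unit(raw):
    """Return the canonical unit, or None so unknown values can be reviewed."""
    return UNIT_SYNONYMS.get(raw.strip().lower())


# Non-atomic data field: split a name into title, first, last, and suffix.
KNOWN_TITLES = {"Dr.", "Mr.", "Ms.", "Mrs.", "Prof."}


def split_name(raw):
    """Break a 'Title First Last, Suffix' string into atomic fields."""
    name_part, _, suffix = raw.partition(",")
    tokens = name_part.split()
    title = tokens.pop(0) if tokens and tokens[0] in KNOWN_TITLES else ""
    return {
        "title": title,
        "first": tokens[0] if tokens else "",
        "last": " ".join(tokens[1:]),
        "suffix": suffix.strip(),
    }


units = [standardize_unit(u) for u in ["Dozen", "Doz.", "Dz.", "12"]]
name = split_name("Dr. Hameed Khan, PhD")
```

Returning None for an unrecognized unit, rather than guessing, keeps unknown spellings visible for manual review.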

Ahsan Abdullah 16 Data content defects: Examples
- Embedded Meanings
  - RC, AP, RJ: received, approved, rejected
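Decoding such embedded codes is a lookup from code to meaning. The sketch below uses the three codes from the slide; the function name `decode_status` and the choice to raise an error on unknown codes are assumptions for illustration.

```python
STATUS_CODES = {
    "RC": "received",
    "AP": "approved",
    "RJ": "rejected",
}


def decode_status(code):
    """Translate a cryptic status code into its readable meaning."""
    try:
        return STATUS_CODES[code.strip().upper()]
    except KeyError:
        # Surface unknown codes instead of loading them silently.
        raise ValueError(f"unknown status code: {code!r}") from None


decoded = [decode_status(c) for c in ["RC", "AP", "RJ"]]
```

Failing loudly on an unknown code is one possible policy; a load process could equally route such rows to a reject file for later inspection.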