1
Unit 4: ETL (Extract, Transform, Load)
2
ETL The Extract-Transform-Load (ETL) system is the foundation of the data warehouse. A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions.
3
ETL The ETL system makes or breaks the data warehouse.
Although building the ETL system is a back-room activity that is not very visible to end users, it easily consumes 70 percent of the resources needed for implementation and maintenance of a typical data warehouse. The ETL system:
Removes mistakes and corrects missing data
Provides documented measures of confidence in data
Captures the flow of transactional data for safekeeping
Adjusts data from multiple sources to be used together
Structures data to be usable by end-user tools
4
Data Flow in ETL
6
Data Quality
7
Introduction We live in a world of heterogeneity: different technologies and different platforms. A large amount of data is generated every day in all sorts of organizations and enterprises, and that data comes with problems.
8
Problems Data is often duplicated, inconsistent, ambiguous, or incomplete, so there is a need to collect the data in one place and clean it up.
9
Why does data quality matter?
Good data is your most valuable asset, and bad data can seriously harm your business and credibility. What have you missed? What happens when things go wrong? Can you make confident decisions?
10
What is data quality? Data quality is a perception or an assessment of data’s fitness to serve its purpose in a given context. It is described by several dimensions, such as:
Correctness / Accuracy: the degree to which the captured data correctly describes the real-world entity.
Consistency: this is about the single version of truth. Consistency means data throughout the enterprise should be in sync with each other.
11
Contd… Completeness: the extent to which the expected attributes of the data are provided.
Timeliness: getting the right data to the right person at the right time is important for the business.
Metadata: data about data.
12
Maintenance of data quality
Data quality results from the process of going through the data and scrubbing it, standardizing it, and de-duplicating records, as well as doing some data enrichment.
1. Maintain complete data.
2. Clean up your data by standardizing it using rules.
3. Use matching algorithms to detect duplicates, e.g. “ICS” and “Informatics Computer System” (see the sketch after this list).
4. Avoid entry of duplicate leads and contacts.
5. Merge existing duplicate records.
6. Use roles for security.
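A minimal duplicate-detection sketch in SQL, assuming a hypothetical contacts table with an email column; grouping on a normalized value surfaces the records that rules-based matching should review:

```sql
-- Hypothetical contacts table: flag e-mail addresses that appear on more
-- than one record after trimming whitespace and ignoring case.
SELECT UPPER(TRIM(email)) AS normalized_email,
       COUNT(*)           AS copies
FROM   contacts
GROUP  BY UPPER(TRIM(email))
HAVING COUNT(*) > 1;
```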
13
Inconsistent data before cleaning up
Invoice 1: Bill no 101, CustomerName Mr. Aleck Stevenson, SocialSecurityNumber ADWPS10017
Invoice 2: Bill no 205, CustomerName Mr. S Aleck, SocialSecurityNumber ADWPS10017
Invoice 3: Bill no 314, CustomerName Mr. Stevenson Aleck, SocialSecurityNumber ADWPS10017
Invoice 4: Bill no 316, CustomerName Mr. Alec Stevenson, SocialSecurityNumber ADWPS10017
14
Consistent data after cleaning
Invoice 1: Bill no 101, CustomerName Mr. Aleck Stevenson, SocialSecurityNumber ADWPS10017
Invoice 2: Bill no 205, CustomerName Mr. Aleck Stevenson, SocialSecurityNumber ADWPS10017
Invoice 3: Bill no 314, CustomerName Mr. Aleck Stevenson, SocialSecurityNumber ADWPS10017
Invoice 4: Bill no 316, CustomerName Mr. Aleck Stevenson, SocialSecurityNumber ADWPS10017
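A hedged cleanup sketch for the example above, assuming a hypothetical invoices(bill_no, customer_name, ssn) table and PostgreSQL-style SQL: the most frequent spelling per social-security number is chosen as the canonical name and written back to every matching invoice.

```sql
-- Standardize customer_name per SSN by picking the most common spelling.
UPDATE invoices AS i
SET    customer_name = (
         SELECT   customer_name
         FROM     invoices
         WHERE    ssn = i.ssn
         GROUP BY customer_name
         ORDER BY COUNT(*) DESC
         LIMIT    1
       );
```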
15
Data Profiling
16
Context In the process of data warehouse design, many database professionals face situations like these: several data inconsistencies in the source, such as missing records or NULL values; or the column they chose as the primary key is not unique throughout the table; or the schema design does not match the end-user requirements; or some other concern with the data that should have been fixed right at the beginning.
17
Fixing such data quality issues would mean making changes in the ETL data-flow packages, cleaning the identified inconsistencies, and so on. This in turn leads to a lot of rework, and rework means added cost to the company, both in time and effort. So what should one do in such a case?
18
Solution Rather than fixing the problem after it appears, it is better to catch it right at the start, before it becomes a problem: use data profiling software.
19
What is data profiling? It is the process of statistically examining and analyzing the content of a data source, and hence collecting information about the data. It consists of techniques used to analyze the data we have for accuracy and completeness.
1. Data profiling helps us make a thorough assessment of data quality.
2. It assists in the discovery of anomalies in the data.
3. It helps us understand the content, structure, relationships, and other characteristics of the data in the data source we are analyzing.
20
Contd…
4. It helps us know whether the existing data can be applied to other areas or purposes.
5. It helps us understand the various issues/challenges we may face in a database project much before the actual work begins. This enables us to make early decisions and act accordingly.
6. It is also used to assess and validate metadata.
21
When and how to conduct data profiling?
Generally, data profiling is conducted in two ways:
Writing SQL queries on sample data extracts put into a database
Using data profiling tools
22
When to conduct Data Profiling?
-> At the discovery/requirements-gathering phase
-> Just before the dimensional modeling process
-> During ETL package design
23
How to conduct Data Profiling?
Data profiling involves statistical analysis of the data at the source and the data being loaded, as well as analysis of metadata. These statistics may be used for various analysis purposes:
Data quality: analyze the quality of data at the data source.
NULL values: look out for the number of NULL values in an attribute.
24
Candidate keys: analysis of the extent to which certain columns are distinct gives the developer useful information with respect to the selection of candidate keys.
Primary key selection: check that the candidate key column does not violate the basic requirements of having no NULL values and no duplicate values (a profiling query sketch follows).
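A profiling sketch for candidate-key analysis, assuming a hypothetical source table customers with a candidate key column customer_code; a usable key has distinct_values equal to total_rows and zero NULLs:

```sql
-- Compare total rows, distinct values, and NULLs for a candidate key column.
SELECT COUNT(*)                      AS total_rows,
       COUNT(DISTINCT customer_code) AS distinct_values,
       SUM(CASE WHEN customer_code IS NULL THEN 1 ELSE 0 END) AS null_values
FROM   customers;
```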
25
Empty string values: a string column may contain NULL or even empty string values that may create problems later.
String length: an analysis of the longest and shortest possible lengths, as well as the average string length, of a string-type column can help us decide which data type would be most suitable for that column (see the sketch below).
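A string-length profiling sketch, assuming the same hypothetical customers table and a customer_name column; LENGTH is used here (SQL Server calls it LEN):

```sql
-- Shortest, longest, and average name length, plus a count of empty strings.
SELECT MIN(LENGTH(customer_name)) AS shortest,
       MAX(LENGTH(customer_name)) AS longest,
       AVG(LENGTH(customer_name)) AS average_length,
       SUM(CASE WHEN customer_name = '' THEN 1 ELSE 0 END) AS empty_strings
FROM   customers;
```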
26
Identification of cardinality: the cardinality relationships are important for inner and outer join considerations with regard to several BI tools.
Data format: sometimes the format in which certain data is written in some columns is not user-friendly or consistent.
27
Common Data Profiling Software
Most data-integration and analysis software products have data profiling built into them. Alternatively, various independent data profiling tools are also available. Some popular ones are:
Trillium Enterprise Data Quality
Datiris Profiler
Talend Data Profiler
IBM InfoSphere Information Analyzer
SSIS Data Profiling Task
Oracle Warehouse Builder
28
Data Profiling The results of profiling may dictate responses ranging from:
Elimination of some input fields completely
Flagging of missing data and generation of special surrogate keys
Best-guess automatic replacement of corrupted values
Human intervention at the record level
Development of a full-blown normalized representation of the data
29
Staging The staging area is accessible only to experienced data integration professionals.
It is a back-room facility, completely off limits to end users, where the data is placed after it is extracted from the source systems, cleansed, manipulated, and prepared to be loaded to the presentation layer of the data warehouse. Any metadata generated by the ETL process that is useful to end users must come out of the back room and be offered in the presentation area of the data warehouse.
30
The Four Staging Steps of a Data Warehouse.
31
Extraction The integration of all the disparate systems across the enterprise is the real challenge in getting the data warehouse to a state where it is usable. Data is extracted from heterogeneous data sources. Each data source has its own distinct set of characteristics that need to be managed and integrated into the ETL system in order to extract data effectively. Q: Write the design steps of dimensional modeling.
32
Extraction The ETL process needs to effectively integrate systems that have different:
DBMSs
Operating systems
Hardware
Communication protocols
You need a logical data map before the physical data can be transformed. The logical data map describes the relationship between the extreme starting points and the extreme ending points of your ETL system and is usually presented in a table or spreadsheet.
33
Extract Before you begin building your extract systems, you need a logical data map that documents the relationship between original source fields and final destination fields in the tables you deliver to the front room. This document ties the very beginning of the ETL system to the very end.
34
Logical data map Building the logical data map involves the following steps:
1. Have a plan - this is the foundation of the metadata.
2. Identify data source candidates - identify the likely candidate data sources you believe will support the decisions needed by the business community.
3. Analyze source systems with a data-profiling tool - every detected data anomaly must be documented, and best efforts must be made to apply appropriate business rules to rectify the data before it is loaded into the data warehouse.
35
4. Receive walk-through of data lineage and business rules - once the target data model is understood, the data warehouse architect and business analyst must walk the ETL architect and developers through the data lineage and business rules for extracting, transforming, and loading the subject areas of the data warehouse.
36
5. Receive walk-through of the data warehouse data model - the ETL team must completely understand the physical data model of the data warehouse. This understanding includes dimensional modeling concepts. Understanding the mappings on a table-by-table basis is not good enough.
37
6. Validate calculations and formulas.
It is helpful to make sure the calculations are correct before you spend time coding the wrong algorithms in your ETL process.
38
Components of the Logical Data Map
Presented in a table or spreadsheet format, the logical data map includes the following specific components:
Target table name. The physical name of the table as it appears in the data warehouse.
Target column name. The name of the column in the data warehouse table.
Table type. Indicates whether the table is a fact, dimension, or subdimension.
SCD (slowly changing dimension) type. For dimensions, this component indicates a Type-1, -2, or -3 slowly changing dimension approach. This indicator can vary for each column in the dimension.
39
Components of the Logical Data Map
Source database. The name of the instance of the database where the source data resides, usually the connect string required to connect to the database. It can also be the name of a file as it appears in the file system.
Source table name. The name of the table where the source data originates. List all tables required to populate the relevant table in the target data warehouse.
Source column name. The column or columns necessary to populate the target. List all of the columns required to load the target column. The associations of the source columns are documented in the transformation section.
Transformation. The exact manipulation required of the source data so it corresponds to the expected format of the target. This component is usually notated in SQL or pseudo-code.
40
Using Tools for the Logical Data Map
41
(The slide shows a sample logical data map with Target columns - table name, column name, data type - Source columns - table name, column name, data type - and a Transformation column.) The content of the logical data mapping document has proven to be the critical element required to efficiently plan ETL processes. The table type gives us our cue for the ordinal position of our data-load processes: first dimensions, then facts. The primary purpose of this document is to provide the ETL developer with a clear-cut blueprint of exactly what is expected from the ETL process. This table must depict, without question, the course of action involved in the transformation process.
42
The transformation can contain anything from the absolute solution to nothing at all. Most often, the transformation can be expressed in SQL. The analysis of the source system is usually broken into two major phases: the data discovery phase and the anomaly detection phase.
43
Extraction - Data Discovery Phase
Data Discovery Phase A key criterion for the success of the data warehouse is the cleanliness and cohesiveness of the data within it. Once you understand what the target needs to look like, you need to identify and examine the data sources.
44
Data Discovery Phase It is up to the ETL team to drill down further into the data requirements to determine each and every source system, table, and attribute required to load the data warehouse. This includes collecting and documenting source systems and keeping track of them, and determining the system of record - the point of origin of the data. The definition of the system-of-record is important because in most enterprises data is stored redundantly across many different systems. Enterprises do this to make nonintegrated systems share data. It is very common for the same piece of data to be copied, moved, manipulated, transformed, altered, cleansed, or made corrupt throughout the enterprise, resulting in varying versions of the same data.
45
Data Content Analysis - Extraction
Understanding the content of the data is crucial for determining the best approach for retrieval.
- NULL values. An unhandled NULL value can destroy any ETL process. NULL values pose the biggest risk when they are in foreign key columns: joining two or more tables based on a column that contains NULL values will cause data loss! Check for NULL values in every foreign key in the source database. When NULL values are present, you must outer join the tables (see the sketch below).
- Dates in nondate fields. Dates are peculiar elements because they are the only logical elements that can come in various formats. Fortunately, most database systems support most of the various formats for display purposes but store them in a single standard format.
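A hedged sketch of the outer-join point above, assuming hypothetical source tables orders (with a nullable customer_id foreign key) and customers:

```sql
-- An inner join would silently drop orders whose customer_id is NULL;
-- a left outer join keeps every order row.
SELECT o.order_id,
       o.customer_id,
       c.customer_name
FROM   orders o
LEFT OUTER JOIN customers c
       ON c.customer_id = o.customer_id;
```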
46
During the initial load, capturing changes to data content in the source data is unimportant because you are most likely extracting the entire data source, or a portion of it, from a predetermined point in time. Later, the ability to capture data changes in the source system instantly becomes a priority. The ETL team is responsible for capturing data-content changes during the incremental load.
47
Determining Changed Data
Audit columns: maintained by the database and updated by triggers, audit columns are appended to the end of each table to store the date and time a record was added or modified. You must analyze and test each of these columns to ensure that it is a reliable source for indicating changed data. If you find any NULL values, you must find an alternative approach for detecting change, for example using outer joins (an extract sketch follows).
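A minimal incremental-extract sketch based on an audit column, assuming a hypothetical source_orders table with a last_modified_date column and a bind variable :last_extract_time holding the timestamp of the previous successful run:

```sql
-- Pull only the rows added or changed since the last extraction.
SELECT *
FROM   source_orders
WHERE  last_modified_date > :last_extract_time;
```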
48
Determining Changed Data
Process of elimination: this technique preserves exactly one copy of each previous extraction in the staging area for future use. During the next run, the process takes the entire source table(s) into the staging area and makes a comparison against the retained data from the last process. Only the differences (deltas) are sent to the data warehouse. It is not the most efficient technique, but it is the most reliable for capturing changed data.
49
Determining Changed Data
Initial and Incremental Loads Create two tables: previous load and current load. The initial process bulk loads into the current load table. Since change detection is irrelevant during the initial load, the data continues on to be transformed and loaded into the ultimate target fact table. When the process is complete, it drops the previous load table, renames the current load table to previous load, and creates an empty current load table. The next time the load process is run, the current load table is populated. Select the current load table MINUS the previous load table. Transform and load the result set into the data warehouse.
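A delta-detection sketch for the approach above, assuming staging tables named current_load and previous_load with identical layouts; MINUS is Oracle syntax, while PostgreSQL and SQL Server use EXCEPT:

```sql
-- Rows present in the current extraction but not in the previous one.
SELECT *
FROM   current_load
MINUS
SELECT *
FROM   previous_load;
```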
50
Transformation
51
Transformation is the main step where the ETL system adds value.
It actually changes the data and provides guidance on whether the data can be used for its intended purposes. It is performed in the staging area.
52
Transformation follows the data quality paradigm: correct, unambiguous, consistent, and complete.
Data quality checks are run at two places - after extraction, and after cleaning and conforming; additional checks are run at this point.
53
Transformation - Cleaning Data
Anomaly detection: data sampling, for example a count of the rows for each value of a department column.
Column property enforcement (see the screening sketch below):
NULL values in required columns
Numeric values that fall outside of expected highs and lows
Columns whose lengths are exceptionally short or long
Columns with values outside of their discrete valid value sets
Adherence to a required pattern or membership in a set of patterns
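A column-property screening sketch, assuming a hypothetical staging table stg_orders; each branch of the UNION flags one kind of violation named above (column names and valid-value sets are illustrative):

```sql
SELECT order_id, 'NULL in required column' AS violation
FROM   stg_orders WHERE customer_id IS NULL
UNION ALL
SELECT order_id, 'quantity out of expected range' AS violation
FROM   stg_orders WHERE quantity < 0 OR quantity > 10000
UNION ALL
SELECT order_id, 'value outside valid set' AS violation
FROM   stg_orders WHERE status NOT IN ('OPEN', 'SHIPPED', 'CANCELLED');
```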
54
Thoroughness – Caring Tightrope
55
Transformation - Conforming
Structure enforcement: tables have proper primary and foreign keys and obey referential integrity.
Data and rule value enforcement: simple business rules and logical data checks (see the sketch below).
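A referential-integrity check sketch, assuming a hypothetical fact staging table stg_sales and a dim_product dimension keyed by product_natural_key:

```sql
-- Staged fact rows whose product code has no matching dimension row.
SELECT s.*
FROM   stg_sales s
LEFT OUTER JOIN dim_product p
       ON p.product_natural_key = s.product_code
WHERE  p.product_natural_key IS NULL;
```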
56
(Flowchart: cleaning and conforming produce staged data; if fatal errors are found the process stops, otherwise the staged data is loaded.)
57
Loading Dimensions Loading Facts
58
Loading Dimensions The primary key is a single field containing a meaningless unique integer - a surrogate key. The data warehouse owns these keys and never allows any other entity to assign them. Dimensions are de-normalized flat tables: all attributes in a dimension must take on a single value in the presence of the dimension primary key. A dimension should also possess one or more other fields that compose the natural key of the dimension (a minimal table sketch follows).
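A minimal dimension-table sketch with the structure just described; all names are hypothetical:

```sql
CREATE TABLE dim_customer (
    customer_key   INTEGER      NOT NULL,  -- surrogate key owned by the warehouse
    customer_id    VARCHAR(20)  NOT NULL,  -- natural key from the source system
    customer_name  VARCHAR(100),           -- flat descriptive attributes
    city           VARCHAR(50),
    CONSTRAINT pk_dim_customer PRIMARY KEY (customer_key)
);
```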
60
The data loading module consists of all the steps required to administer slowly changing dimensions (SCDs) and write the dimension to disk as a physical table in the proper dimensional format, with correct primary keys, correct natural keys, and final descriptive attributes. Creating and assigning the surrogate keys occurs in this module. The table is definitely staged, since it is the object to be loaded into the presentation system of the data warehouse.
61
Loading dimensions When the data warehouse receives notification that an existing row in a dimension has changed, it gives one of three types of responses: Type 1, Type 2, or Type 3.
62
Type 1 Dimension
63
Type 2 Dimension
64
Type 3 Dimensions
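The three slides above illustrate the responses as diagrams. A minimal SQL sketch of the two most common cases, assuming a hypothetical dim_customer table with surrogate key customer_key, natural key customer_id, and Type-2 housekeeping columns row_effective_date, row_end_date, and current_flag:

```sql
-- Type 1: overwrite the changed attribute in place (no history kept).
UPDATE dim_customer
SET    city = 'Pune'
WHERE  customer_id  = 'C1001'
AND    current_flag = 'Y';

-- Type 2: expire the current row, then insert a new row with a fresh
-- surrogate key so history is preserved.
UPDATE dim_customer
SET    row_end_date = DATE '2024-06-30',
       current_flag = 'N'
WHERE  customer_id  = 'C1001'
AND    current_flag = 'Y';

INSERT INTO dim_customer
       (customer_key, customer_id, customer_name, city,
        row_effective_date, row_end_date, current_flag)
VALUES (98765, 'C1001', 'Aleck Stevenson', 'Pune',
        DATE '2024-07-01', DATE '9999-12-31', 'Y');
```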
65
Loading Facts Fact tables hold the measurements of an enterprise. The relationship between fact tables and measurements is extremely simple: if a measurement exists, it can be modeled as a fact table row, and if a fact table row exists, it is a measurement.
66
Key Building Process - Facts
When building a fact table, the final ETL step is converting the natural keys in the new input records into the correct, contemporary surrogate keys. The ETL system maintains a special surrogate key lookup table for each dimension. This table is updated whenever a new dimension entity is created and whenever a Type 2 change occurs on an existing dimension entity. All of the required lookup tables should be pinned in memory so that they can be randomly accessed as each incoming fact record presents its natural keys. This is one of the reasons for making the lookup tables separate from the original data warehouse dimension tables (a join-based sketch follows).
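A set-based sketch of the key-building step, assuming hypothetical lookup tables lkp_date, lkp_product, and lkp_customer that map natural keys to current surrogate keys, and an incoming stg_sales staging table:

```sql
-- Swap natural keys for surrogate keys while loading the fact table.
INSERT INTO fact_sales (date_key, product_key, customer_key, sales_amount)
SELECT d.date_key,
       p.product_key,
       c.customer_key,
       s.sales_amount
FROM   stg_sales s
JOIN   lkp_date     d ON d.calendar_date       = s.sale_date
JOIN   lkp_product  p ON p.product_natural_key = s.product_code
JOIN   lkp_customer c ON c.customer_id         = s.customer_id;
```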
67
Key Building Process
68
Loading Fact Tables - Managing Indexes Indexes are performance killers at load time:
Drop all indexes before the load
Segregate updates from inserts
Load the updates
Rebuild the indexes
(a sketch follows)
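An index-management sketch around a bulk fact load; index and table names are hypothetical, and the bulk insert itself is only indicated by a comment:

```sql
DROP INDEX idx_fact_sales_date;

-- ... bulk inserts and segregated updates into fact_sales run here ...

CREATE INDEX idx_fact_sales_date
    ON fact_sales (date_key);
```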
69
Managing Partitions Partitions allow a table (and its indexes) to be physically divided into mini-tables for administrative purposes and to improve query performance. The most common partitioning strategy on fact tables is to partition the table by the date key, because the date dimension is preloaded and static. You need to partition the fact table on the key that joins to the date dimension for the optimizer to recognize the constraint. The ETL team must be advised of any table partitions that need to be maintained (a partitioning sketch follows).
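An Oracle-style range-partitioning sketch on the date key; table, column, and partition names are hypothetical, and other DBMSs use different partitioning syntax:

```sql
CREATE TABLE fact_sales (
    date_key      INTEGER NOT NULL,   -- joins to the date dimension
    product_key   INTEGER NOT NULL,
    customer_key  INTEGER NOT NULL,
    sales_amount  NUMBER(12,2)
)
PARTITION BY RANGE (date_key) (
    PARTITION p2023 VALUES LESS THAN (20240101),
    PARTITION p2024 VALUES LESS THAN (20250101),
    PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);
```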
70
Maintaining the Rollback Log
The rollback log, also known as the redo log, is invaluable in transaction (OLTP) systems. But in a data warehouse environment, where all transactions are managed by the ETL process, the rollback log is a superfluous feature that must be dealt with to achieve optimal load performance. Reasons why the data warehouse does not need rollback logging:
All data is entered by a managed process - the ETL system.
Data is loaded in bulk.
Data can easily be reloaded if a load process fails.
Each database management system has different logging features and manages its rollback log differently (see the sketch below).
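An Oracle-specific sketch of minimizing logging during a bulk load (other DBMSs expose similar but differently named options); the table names are hypothetical:

```sql
-- Direct-path insert with minimal redo generation on the target table.
ALTER TABLE fact_sales NOLOGGING;

INSERT /*+ APPEND */ INTO fact_sales
SELECT date_key, product_key, customer_key, sales_amount
FROM   stg_sales_keyed;
```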
71
References “The Data Warehouse ETL Toolkit” by Ralph Kimball and Joe Caserta