Data Staging


1 Data Staging
[Diagram: legacy systems (SQL Server, Access, Excel) feed the data warehouse through the staging process.]

2 Data Staging
Extraction
Data Cleansing
Data Integration
Transformation
Transportation (Loading)
Maintenance

3 Data Staging Area
The construction site for the warehouse
Required by most scenarios
Connected to a wide variety of sources
Clean / aggregate / compute / validate data
[Diagram: operational system --Extract--> data staging area --Transform--> --Transport (Load)--> warehouse]

4 Extraction
Extract source data from legacy systems and place it in a staging area. To reduce the impact on the performance of the legacy systems, source data is extracted without any cleansing, integration, or transformation.
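A minimal sketch of this extraction step, using SQLite in-memory databases to stand in for a legacy system and the staging area. The table and column names ("orders", "oid", "amount") are illustrative, not from the deck; the point is that rows are copied verbatim, with dirty values left untouched, so the load on the source system stays light.

```python
import sqlite3

# Hypothetical legacy source with some dirty data (a NULL and a zero amount).
legacy = sqlite3.connect(":memory:")
legacy.execute("CREATE TABLE orders (oid INTEGER, amount REAL)")
legacy.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 19.99), (2, None), (3, 0.0)])

# Staging area mirrors the source schema.
staging = sqlite3.connect(":memory:")
staging.execute("CREATE TABLE stg_orders (oid INTEGER, amount REAL)")

# Extract: a straight SELECT with no cleansing, integration, or
# transformation -- NULLs and suspect zeros pass through as-is.
rows = legacy.execute("SELECT oid, amount FROM orders").fetchall()
staging.executemany("INSERT INTO stg_orders VALUES (?, ?)", rows)

staged = staging.execute("SELECT COUNT(*) FROM stg_orders").fetchone()[0]
```

The cleansing, integration, and transformation steps described on the following slides would then run against the staging copy rather than the legacy system.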

5 In data warehouse terms, a data staging area is an intermediate storage area between the sources of information and the data warehouse (DW). It is usually of a temporary nature, and its contents can be erased after the DW/DM has been loaded successfully. A staging area can be used for any of the following purposes, among others:
To gather data from different sources that becomes ready to process at different times
To quickly load information from the operational database
To find changes against current DW/DM values
To cleanse data
To pre-calculate aggregates

6 Data Mart
A subset of a data warehouse that supports the requirements of a particular department or business function. Characteristics include:
Does not normally contain detailed operational data, unlike a data warehouse
May contain certain levels of aggregation

7 Extraction
A variety of file formats exist in legacy systems:
Relational databases: DB2, Oracle, SQL Server, Access, …
Flat files: Excel files, text files
Commercial data extraction tools are very helpful in data extraction.

8 Data Preparation (Cleansing)
It’s all about data quality!!!

9 Outline
Measures for data quality
Causes of data errors
Common types of data errors
Common error checks
Correcting missing values
Timing for error checks and corrections
Steps of data preparation

10 Measures for Data Quality
Correctness/Accuracy – with respect to the real data
Consistency/Uniqueness – data values, references, measures, and interpretations
Completeness – scope of data and values
Relevancy – with respect to the requirements
Currency – data is up to date with respect to the requirements

11 Causes of Data Errors
Data entry errors:
Correct data not available at the time of data entry
Entered by different users at the same time, or by the same user over time
Inconsistent or incorrect use of "codes"
Inconsistent or incorrect interpretation of "fields"
Transaction processing errors
System and recovery errors
Data extraction/transformation errors

12 Common Data Errors
Missing (null) values
Incorrect use of default values (e.g., zero)
Data value (dependency) integrity violations
Data referential integrity violations (e.g., a customer's order record cannot exist unless the customer record already exists)

13 Common Data Errors, Cont'd
Data retention integrity violations (e.g., old inventory snapshots should not be stored)
Data derivation/transformation/aggregation integrity violations
Inconsistent data values for the same data ("M" versus "m" for male)
Inconsistent use of the same data value ("DM" for both Data Mining and Data Mart)
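A small sketch of repairing one of the errors above: inconsistent values for the same data ("M" versus "m" for male), fixed by mapping every variant to one canonical code. The record layout and field names here are illustrative assumptions, not from the deck.

```python
# Canonical mapping for a gender code; every accepted spelling points at
# exactly one canonical value, so "M" and "m" can no longer coexist.
GENDER_MAP = {"m": "M", "M": "M", "f": "F", "F": "F"}

def normalize_gender(raw):
    """Map an inconsistent spelling of the gender code to its canonical form."""
    value = GENDER_MAP.get(raw.strip())
    if value is None:
        raise ValueError(f"unrecognized gender code: {raw!r}")
    return value

# Illustrative records with inconsistent spellings and stray whitespace.
records = [{"cid": 1, "gender": "m"}, {"cid": 2, "gender": "F "}]
for rec in records:
    rec["gender"] = normalize_gender(rec["gender"])
```

Raising on unrecognized codes (rather than silently passing them through) makes the overloaded-code and evolving-data problems from the later slides visible during staging instead of after loading.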

14 Error Checks
Referential integrity validation
Identify missing-value or default-value records
Identify outliers (exceptions)
Process validation
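A sketch of two of the checks listed above, run over plain Python dicts: referential-integrity validation (every order must point at an existing customer) and flagging records that carry missing or suspicious default values. The tables and field names are assumptions for illustration.

```python
# Illustrative customer and order tables; cid 9 has no matching customer,
# and two orders carry suspect amounts (a default zero and a NULL).
customers = [{"cid": 1}, {"cid": 2}]
orders = [
    {"oid": 10, "cid": 1, "amount": 25.0},
    {"oid": 11, "cid": 9, "amount": 0.0},   # referential integrity violation
    {"oid": 12, "cid": 2, "amount": None},  # missing value
]

# Referential integrity check: each order's cid must exist in customers.
known_cids = {c["cid"] for c in customers}
ri_violations = [o["oid"] for o in orders if o["cid"] not in known_cids]

# Missing-value / default-value check on the amount field.
suspect_values = [o["oid"] for o in orders
                  if o["amount"] is None or o["amount"] == 0.0]
```

In a real staging application these checks would run as queries against the staging tables, with violating records routed to a reject or review area.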

15 Data Cleaning: Missing Values
Data cleansing, or data cleaning, is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It refers to identifying incomplete or incorrect parts of the data, and then replacing, modifying, or deleting the dirty data.
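A minimal sketch of one common way to correct a missing value: fill it with the mean of the observed values. Other strategies include a fixed default, the most frequent value, or dropping the record entirely; the field names below are illustrative.

```python
# Illustrative rows where one record is missing its age.
rows = [{"id": 1, "age": 30}, {"id": 2, "age": None}, {"id": 3, "age": 40}]

# Mean imputation: compute the mean over the observed values only,
# then substitute it wherever the value is missing.
observed = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(observed) / len(observed)
for r in rows:
    if r["age"] is None:
        r["age"] = mean_age
```

Which strategy is appropriate depends on the field: mean imputation suits numeric measures, while a code field is better filled from a trusted reference source or flagged for review.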

16 Data Quality and Cleansing
To get quality data into the warehouse, the data gathering process must be well designed. Clean data comes from two processes: entering clean data, and cleaning up problems once the data are entered.

17 Data Quality and Cleansing
Characteristics of quality data:
Accurate – the data in the warehouse matches the system of record
Complete – the data in the warehouse represents the entire set of relevant data
Consistent – the data in the warehouse is free from contradiction (uniqueness)
Timely – the data is updated on a schedule that is useful to the business users

18 Data Improvement
The actual content of the data presents a substantial challenge to the warehouse:
Inconsistent or incorrect use of codes and special characters
Overloaded codes
Evolving data
Missing, incorrect, or duplicate values

19 Approach to Improving Data
The following steps help deal with data cleansing issues:
Identify the highest-quality source system
Examine the codes to see how bad the problem is
Look for minor variations in spelling while scanning lists
Fix the problem in the source if at all possible
Fix some problems during data staging
Use data cleansing tools against the data, and use trusted sources for correct values
Work with the source system owners on regular examination and cleansing of the source system
Make the source system team responsible for a clean extract

20 Timing for Error Checking
During data staging
During data loading
Others:
Before data extraction (during data entry, transaction processing, recovery, audits, etc.)
After data loading

21 Steps of Data Preparation
Identify data sources
Extract and analyze source data
Standardize data
Correct and complete data
Match and consolidate data
Transform and enhance data into the target
Calculate derivations and summary data
Audit and control data extraction, transformation, and loading

22 Data Integration
Data from different data sources, with different formats, needs to be integrated into one data warehouse.
Example: three customer tables, from the sales department, the marketing department, and an acquired company:
Customer (cid, cname, city, …)
Customer (customerid, customername, city, …)
Customer (custid, custname, cname, …)

23 Data Integration
Same attribute with different names: cid, customerid, custid
Different attributes with the same name: cname means customer name in one table and city name in another
Same attribute with different formats

24 Data Integration
How to integrate:
Get the schemas of all data sources. (A database schema is the skeleton structure that represents the logical view of the entire database; it defines how the data is organized and how the relations among the data are associated.)
Get the schema of the data warehouse
Integrate the source schemas, with help from commercial tools and domain experts
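A sketch of integrating the three customer schemas from the example above into one warehouse schema: each source gets a column mapping into a common target. The target column names (customer_id, customer_name, city) are assumptions; note that in the acquired company's table, cname maps to city, matching the naming conflict described earlier.

```python
# Assumed warehouse schema for the integrated customer table.
TARGET = ("customer_id", "customer_name", "city")

# One source-column -> target-column mapping per source system.
MAPPINGS = {
    "sales":     {"cid": "customer_id", "cname": "customer_name", "city": "city"},
    "marketing": {"customerid": "customer_id", "customername": "customer_name", "city": "city"},
    "acquired":  {"custid": "customer_id", "custname": "customer_name", "cname": "city"},
}

def integrate(source, row):
    """Rename a source row's columns into the warehouse schema."""
    mapping = MAPPINGS[source]
    return {mapping[col]: val for col, val in row.items() if col in mapping}

merged = [
    integrate("sales", {"cid": 1, "cname": "Ada", "city": "Paris"}),
    integrate("acquired", {"custid": 2, "custname": "Bob", "cname": "Oslo"}),
]
```

This is the part that commercial tools and domain experts help with: discovering which columns correspond and recording the mapping explicitly, so "cname" is never interpreted two different ways in the warehouse.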

25 Stepwise plan for creating the data staging application
High-Level Plan
Data Staging Tools
Detailed Plan

26 Step 1: High-Level Plan
The planning phase starts with the high-level plan. Start the design process and keep it very high-level, highlighting where the data is coming from and the challenges we already know about.
Data staging applications perform three major steps:
Extract from the source
Transform the data
Load it into the warehouse

27 Step 2: Data Staging Tools
Data staging tools are system code generators; a data staging tool is used instead of hand-coding the extracts.
Transformation engines are designed to improve scalability.
Examples: Information Builders Data Migrator, Oracle Data Integrator

28 Step 3: Detailed Plan
Drill down on each of the flows and phases of the ETL process: cleansing, integration, and combining data from different sources.
Plan which table to work on, and in which order.
Organize the data staging area: it is the place where the raw data is loaded, cleaned, combined, and exported.

29 Transformation
Prepare data for loading into the data warehouse
Change the data format (from one source format to another)
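A sketch of the format change described above, using dates as the example: the same date arrives in different source formats and is rewritten into one target format before loading. The list of source formats and the warehouse's YYYYMMDD target form are assumptions for illustration.

```python
from datetime import datetime

# Assumed source date formats seen across the legacy systems.
SOURCE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%m-%d-%Y")

def to_warehouse_date(raw):
    """Try each known source format and emit the warehouse's YYYYMMDD form."""
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y%m%d")
        except ValueError:
            continue  # this format did not match; try the next one
    raise ValueError(f"unparseable date: {raw!r}")

# Two spellings of the same date converge on one warehouse representation.
converted = [to_warehouse_date(d) for d in ("25/12/2020", "2020-12-25")]
```

One caveat with a format list like this: ambiguous dates (e.g. 03-04-2020) silently match the first format that parses, so the per-source format should ideally be recorded in the schema mapping rather than guessed.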

30 Maintenance
Maintenance frequency: daily, weekly, monthly
Identify changed records and new records in the legacy systems, either by:
Creating timestamps for changed and new records in the legacy systems, or
Comparing data between the legacy systems and the DW
Load the changed and new records into the DW
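The timestamp-based approach listed above can be sketched as follows: each legacy record carries a last-modified timestamp, and each maintenance run picks up only the records changed since the previous load. The field names and the cutoff variable are illustrative assumptions.

```python
from datetime import datetime

# Timestamp of the previous successful load (would be persisted in practice).
last_load = datetime(2024, 1, 1)

# Illustrative legacy rows, each stamped with its last modification time.
legacy_rows = [
    {"id": 1, "modified": datetime(2023, 12, 30)},  # unchanged since last load
    {"id": 2, "modified": datetime(2024, 1, 5)},    # changed -> reload
    {"id": 3, "modified": datetime(2024, 2, 1)},    # new record -> load
]

# The delta to load: only records modified after the last load.
delta = [r["id"] for r in legacy_rows if r["modified"] > last_load]
```

The comparison-based alternative (diffing legacy data against current DW contents) avoids changing the legacy schema but reads far more data per run, which is why timestamping is usually preferred when the source system allows it.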


33 Data Warehouse Architecture Overview

34 Data Mining Applications
Data mining covers a wide field and diverse applications. Some application domains:
Financial data analysis (finance and banking)
Retail and telecommunication industries
Data mining in science and engineering

35 Data Mining for Financial Data Analysis
Financial data collected in banks and financial institutions is often relatively complete, reliable, and of high quality.
Data Mining for the Retail Industry
The retail industry holds huge amounts of data on sales, customer shopping history, etc. Applications of retail data mining:
Identify customer buying behaviors
Discover customer shopping patterns and trends
Improve the quality of customer service
Achieve better customer retention and satisfaction
Design more effective goods transportation and distribution policies

36 Data Mining for the Telecom Industry
A rapidly expanding and highly competitive industry, with great demand for data mining to:
Understand the business involved
Identify telecommunication patterns
Catch fraudulent activities
Make better use of resources
Improve the quality of service

37 Tools for Data Mining
RapidMiner (very popular since it is ready-made, open-source, no-coding-required software)
WEKA (a Java-based customization tool, free to use)
R (written in C and Fortran; lets data miners write scripts, like a programming language/platform)
Orange (Python-based; Python is popular for its ease of use and powerful features, and Orange is an open-source tool written in Python with useful data analytics components)

