Data Staging

Presentation transcript:

Data Staging
[Diagram: data flows from legacy systems (SQL Server, Access, Excel) through the data staging area into the data warehouse.]

Data Staging
Extraction
Data Cleansing
Data Integration
Transformation
Transportation (Loading)
Maintenance

Data Staging Area
The construction site for the warehouse
Required by most scenarios
Connected to a wide variety of sources
Clean / aggregate / compute / validate data
[Diagram: Operational system --Extract--> Data staging area --Transform, Transport (Load)--> Warehouse]

Extraction
Extract source data from legacy systems and place it in a staging area. To reduce the impact on the performance of the legacy systems, source data is extracted without any cleansing, integration, or transformation operations.
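Not part of the original slides: a minimal Python sketch of a straight extract, assuming a SQLite database as a stand-in for the legacy source and a staging/ directory for output; all table and file names are hypothetical. The point is that rows are copied as-is, with no cleansing or transformation.

```python
import csv
import sqlite3

# Hypothetical legacy source (SQLite stand-in) and table names.
SOURCE_DB = "legacy_orders.db"
TABLES = ["customers", "orders"]

def extract_table(conn, table, out_path):
    # Straight SELECT * -- no cleansing, integration, or transformation here,
    # so the time spent on the source system stays short.
    cur = conn.execute(f"SELECT * FROM {table}")
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur.fetchall())                      # raw rows, verbatim

if __name__ == "__main__":
    with sqlite3.connect(SOURCE_DB) as conn:
        for table in TABLES:
            extract_table(conn, table, f"staging/{table}.csv")  # assumes staging/ exists
```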

In data warehouse terms, a data staging area is an intermediate storage area between the sources of information and the data warehouse (DW). It is usually temporary in nature, and its contents can be erased after the DW/DM has been loaded successfully. A staging area can be used for any of the following purposes, among others:
To gather data from different sources that will be ready to process at different times
To quickly load information from the operational database
To find changes against current DW/DM values
For data cleansing
To pre-calculate aggregates
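As an illustration of the "find changes against current DW/DM values" use, here is a small sketch (not from the slides) that compares a staged extract with the current warehouse contents; the file names and the cid/cname columns are assumed.

```python
import pandas as pd

# Hypothetical inputs: a fresh staging extract and the current DW dimension.
staged = pd.read_csv("staging/customers.csv")
current = pd.read_csv("warehouse/dim_customer.csv")

# Left-join the staged rows against the warehouse on the assumed key "cid".
merged = staged.merge(current, on="cid", how="left",
                      suffixes=("", "_dw"), indicator=True)

new_rows = merged[merged["_merge"] == "left_only"]          # not in the DW yet
changed = merged[(merged["_merge"] == "both") &
                 (merged["cname"] != merged["cname_dw"])]   # attribute value changed

print(f"{len(new_rows)} new rows, {len(changed)} changed rows")
```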

Data Mart
A subset of a data warehouse that supports the requirements of a particular department or business function. Characteristics include:
Does not normally contain detailed operational data, unlike a data warehouse
May contain certain levels of aggregation

Extraction
A variety of file formats exist in legacy systems:
Relational databases: DB, Oracle, SQL Server, Access, ...
Flat files: Excel files, text files
Commercial data extraction tools are very helpful in data extraction.
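A short sketch (not from the slides) of reading the formats listed above with pandas; the connection, file names, and delimiter are assumptions, and read_excel additionally needs an Excel engine such as openpyxl installed.

```python
import sqlite3
import pandas as pd

# Relational source: SQLite stands in here for the relational databases above.
with sqlite3.connect("legacy_sales.db") as conn:
    sales_db = pd.read_sql("SELECT * FROM sales", conn)

# Flat-file sources: an Excel export and a pipe-delimited text feed.
sales_xls = pd.read_excel("exports/sales_2024.xlsx")
sales_txt = pd.read_csv("exports/sales_feed.txt", sep="|")

# Each frame still has its source-specific layout; integration and
# transformation reconcile them in later staging steps.
print(len(sales_db), len(sales_xls), len(sales_txt))
```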

Data Preparation (Cleansing)
It's all about data quality!

Outline
Measures for data quality
Causes of data errors
Common types of data errors
Common error checks
Correcting missing values
Timing for error checks and corrections
Steps of data preparation

Measures for Data Quality
Correctness/Accuracy – with respect to the real data
Consistency/Uniqueness – consistent data values, references, measures, and interpretations
Completeness – scope of data and values
Relevancy – with respect to the requirements
Currency – data is current enough to remain relevant to the requirements
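A minimal profiling sketch (not from the slides) that scores a staged table against two of these measures, completeness and uniqueness; the file and the "cid" key column are hypothetical.

```python
import pandas as pd

df = pd.read_csv("staging/customers.csv")      # assumed staging extract

completeness = df.notna().mean()               # share of non-null values per column
key_is_unique = df["cid"].is_unique            # True if the assumed key has no duplicates

print(completeness.round(3))
print("cid unique:", key_is_unique)
```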

Causes of Data Errors
Data entry errors: correct data not available at the time of entry; entries made by different users at the same time, or by the same users over time; inconsistent or incorrect use of "codes"; inconsistent or incorrect interpretation of "fields"
Transaction processing errors
System and recovery errors
Data extract/transformation errors

Common Data Errors
Missing (null) values
Incorrect use of default values (e.g., zero)
Data value (dependency) integrity violation
Data referential integrity violation (e.g., a customer's order record cannot exist unless the customer record already exists)

Common Data Errors, Cont'd
Data retention integrity violation (e.g., old inventory snapshots should not be stored)
Data derivation/transformation/aggregation integrity violation
Inconsistent data values for the same data (e.g., "M" versus "m" for male)
Inconsistent use of the same data value (e.g., "DM" used for both Data Mining and Data Mart)

Error Checks
Referential integrity validation
Identify missing-value or default-value records
Identify outliers (exceptions)
Process validation
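A sketch of these checks in Python (not from the slides); the staged files and the cid/amount columns are assumptions.

```python
import pandas as pd

orders = pd.read_csv("staging/orders.csv")
customers = pd.read_csv("staging/customers.csv")

# Referential integrity: every order must reference an existing customer.
orphans = orders[~orders["cid"].isin(customers["cid"])]

# Missing values and suspicious defaults (e.g., an amount of exactly zero).
missing_amount = orders[orders["amount"].isna() | (orders["amount"] == 0)]

# Simple outlier (exception) check: amounts beyond 3 standard deviations.
mean, std = orders["amount"].mean(), orders["amount"].std()
outliers = orders[(orders["amount"] - mean).abs() > 3 * std]

print(len(orphans), "orphan orders,",
      len(missing_amount), "missing/zero amounts,",
      len(outliers), "outliers")
```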

Data Cleaning: Missing Values
Data cleansing, or data cleaning, is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It refers to identifying incomplete or incorrect parts of the data and then replacing, modifying, or deleting the dirty data.
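A small sketch (not from the slides) of the three treatments named above (replacing, modifying, and deleting dirty data) on an assumed customers extract with cid/cname/city columns.

```python
import pandas as pd

df = pd.read_csv("staging/customers.csv")

df["city"] = df["city"].fillna("UNKNOWN")             # replace: fill a missing attribute
df["cname"] = df["cname"].str.strip().str.title()     # modify: normalize inconsistent spellings
df = df.dropna(subset=["cid"])                        # delete: rows with no usable key

df.to_csv("staging/customers_clean.csv", index=False)
```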

Data Quality and Cleansing
To get quality data into the warehouse, the data-gathering process must be well designed. Clean data comes from two processes: entering clean data in the first place, and cleaning up problems once the data are entered.

Data Quality and Cleansing
Characteristics of quality data:
Accurate – the data in the warehouse matches the system of record
Complete – the data in the warehouse represents the entire set of relevant data
Consistent – the data in the warehouse is free from contradiction (uniqueness)
Timely – the data is updated on a schedule that is useful to the business users

Data Improvement
The actual content of the data presents a substantial challenge to the warehouse:
Inconsistent or incorrect use of codes and special characters
Overloaded codes
Evolving data
Missing, incorrect, or duplicate values

Approach to Improving Data
The following steps help to deal with data cleansing issues (see the sketch after this list):
Identify the highest-quality source system
Examine the codes to see how bad the problem is
Scan lists to find minor variations in spelling
Fix the problem in the source system if at all possible
Fix some problems during data staging
Use data cleansing tools against the data, and use trusted sources for correct values
Work with the source system owners on regular examination and cleansing of the source system
Make the source system team responsible for a clean extract
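For the staging-time fixes, a brief sketch (not from the slides) of mapping code variants found while scanning lists onto one trusted value, then removing the duplicates this exposes; the gender column, its variants, and the file names are hypothetical.

```python
import pandas as pd

GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}   # assumed variants -> trusted codes

df = pd.read_csv("staging/customers_clean.csv")
df["gender"] = df["gender"].str.strip().str.lower().map(GENDER_MAP)
df = df.drop_duplicates(subset=["cid"], keep="last")            # keep the latest record per key

df.to_csv("staging/customers_standardized.csv", index=False)
```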

Timing for Error Checking
During data staging
During data loading
Others:
Before data extraction (data entry, transaction processing, recovery, audits, etc.)
After data loading

Steps of Data Preparation
Identify data sources
Extract and analyze source data
Standardize data
Correct and complete data
Match and consolidate data
Transform and enhance data into the target
Calculate derivations and summary data
Audit and control data extraction, transformation, and loading

Data Integration
Data from different data sources, with different formats, need to be integrated into one data warehouse.
Example: three customer tables from the sales department, the marketing department, and an acquired company:
Customer (cid, cname, city, ...)
Customer (customerid, customername, city, ...)
Customer (custid, custname, cname, ...)

Data Integration
Same attribute with different names: cid, customerid, custid
Different attributes with the same name: cname means customer name in one source and city name in another
Same attribute with different formats

Data Integration
How to integrate:
Get the schemas of all data sources. (A database schema is the skeleton structure that represents the logical view of the entire database; it defines how the data is organized and how the relations among the data are associated.)
Get the schema of the data warehouse
Integrate the source schemas into the warehouse schema, with help from commercial tools and domain experts
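A sketch (not from the slides) of integrating the three customer schemas from the earlier example by mapping each source schema onto an assumed warehouse schema and then combining the rows; the per-source file names are hypothetical.

```python
import pandas as pd

WAREHOUSE_COLS = ["customer_id", "customer_name", "city"]       # assumed target schema

RENAME_MAPS = {
    "sales":     {"cid": "customer_id", "cname": "customer_name", "city": "city"},
    "marketing": {"customerid": "customer_id", "customername": "customer_name", "city": "city"},
    "acquired":  {"custid": "customer_id", "custname": "customer_name", "cname": "city"},
}

frames = []
for source, mapping in RENAME_MAPS.items():
    df = pd.read_csv(f"staging/customers_{source}.csv")          # assumed per-source extracts
    frames.append(df.rename(columns=mapping)[WAREHOUSE_COLS])

integrated = pd.concat(frames, ignore_index=True).drop_duplicates(subset=["customer_id"])
integrated.to_csv("staging/customers_integrated.csv", index=False)
```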

Stepwise Plan for Creating the Data Staging Application
High-level plan
Data staging tools
Detailed plan

Step 1: High-Level Plan
The planning phase starts out with the high-level plan. Start the design process and keep it very high-level, highlighting where the data comes from and the challenges we already know about.
Data staging applications perform three major steps:
Extract from the source
Transform it
Load it into the warehouse

Step 2: Data Staging Tools
Data staging tools are system code generators; a data staging tool is used instead of hand-coding the extracts. Transformation engines are designed to improve scalability.
Examples: Information Builders Data Migrator, Oracle Data Integrator

Step 3: Detailed Plan
Drill down on each of the flows and phases of the ETL process: cleaning, integration, and combining data from different sources.
Plan which table to work on in which order.
Organize the data staging area: it is the place where the raw data is loaded, cleaned, combined, and exported.

Transformation
Prepare data for loading into the data warehouse
Change the data format (from one source format to another)
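A small format-transformation sketch (not from the slides), assuming source order dates arrive as DD/MM/YYYY strings and amounts as text while the warehouse expects ISO dates and numeric amounts.

```python
import pandas as pd

df = pd.read_csv("staging/orders.csv", dtype=str)     # read everything as text first

# Reformat dates to ISO and coerce amounts to numbers (bad values become NaN).
df["order_date"] = pd.to_datetime(df["order_date"], format="%d/%m/%Y").dt.strftime("%Y-%m-%d")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

df.to_csv("staging/orders_transformed.csv", index=False)
```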

Maintenance
Maintenance frequency: daily, weekly, or monthly
Identify changed records and new records in the legacy systems
Create timestamps for changed and new records in the legacy systems
Compare data between the legacy systems and the DW
Load changes and new records into the DW
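A sketch of an incremental refresh (not from the slides), assuming the staged extract carries an updated_at timestamp, an order_id key, and that the watermark of the previous load is known; file names and the timestamp value are hypothetical.

```python
import pandas as pd

LAST_LOAD = pd.Timestamp("2024-01-31 23:59:59")       # assumed watermark from the previous run

source = pd.read_csv("staging/orders_transformed.csv", parse_dates=["updated_at"])
delta = source[source["updated_at"] > LAST_LOAD]      # changed and new records only

warehouse = pd.read_csv("warehouse/fact_orders.csv", parse_dates=["updated_at"])
refreshed = (pd.concat([warehouse, delta], ignore_index=True)
               .sort_values("updated_at")
               .drop_duplicates(subset=["order_id"], keep="last"))   # keep the latest version per key
refreshed.to_csv("warehouse/fact_orders.csv", index=False)
```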

Data Warehouse Architecture Overview

Data Mining Applications
Data mining covers a wide field and diverse applications. Some application domains:
Financial data analysis (finance and banking)
Retail and telecommunication industries
Science and engineering

Data Mining for Financial Data Analysis
Financial data collected in banks and financial institutions is often relatively complete, reliable, and of high quality.
Data Mining for the Retail Industry
The retail industry has huge amounts of data on sales, customer shopping history, etc. Applications of retail data mining:
Identify customer buying behaviors
Discover customer shopping patterns and trends
Improve the quality of customer service
Achieve better customer retention and satisfaction
Design more effective goods transportation and distribution policies

Data Mining for the Telecommunications Industry
A rapidly expanding and highly competitive industry with great demand for data mining:
Understand the business involved
Identify telecommunication patterns
Catch fraudulent activities
Make better use of resources
Improve the quality of service

Tools for Data Mining
RapidMiner – very popular, ready-made, open-source software that requires no coding
WEKA – a Java-based tool for customization, free to use
R – written in C and Fortran; lets data miners write scripts in a full programming language/platform
Orange – an open-source, Python-based tool with useful data analytics components; Python is popular for its ease of use and powerful features