GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics.

Slides:



Advertisements
Similar presentations
CHAPTER OBJECTIVE: NORMALIZATION THE SNOWFLAKE SCHEMA.
Advertisements

Chapter 1 Business Driven Technology
Business Intelligence (BI) PerformancePoint in SharePoint 2010 Sayed Ali – SharePoint Administrator.
C6 Databases.
By: Mr Hashem Alaidaros MIS 211 Lecture 4 Title: Data Base Management System.
Accessing Organizational Information—Data Warehouse
McGraw-Hill/Irwin © 2006 The McGraw-Hill Companies, Inc. All rights reserved. 8-1 BUSINESS DRIVEN TECHNOLOGY Chapter Eight: Viewing and Protecting Organizational.
Information Integration. Modes of Information Integration Applications involved more than one database source Three different modes –Federated Databases.
Business Driven Technology Unit 2
Data Resource Management Data Concepts Database Management Types of Databases Chapter 5 McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies,
© 2003, Prentice-Hall Chapter Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas.
Agenda 02/20/2014 Complete data warehouse design exercise Finish reconciled data warehouse, bus matrix and data mart Display each group’s work Discuss.
Agenda 02/21/2013 Discuss exercise Answer questions in task #1 Put up your sample databases for tasks #2 and #3 Define ETL in more depth by the activities.
By N.Gopinath AP/CSE. Why a Data Warehouse Application – Business Perspectives  There are several reasons why organizations consider Data Warehousing.
Chapter 1 Database Systems. Good decisions require good information derived from raw facts Data is managed most efficiently when stored in a database.
MIS2502: Data Analytics Extract, Transform, Load
5.1 © 2007 by Prentice Hall 5 Chapter Foundations of Business Intelligence: Databases and Information Management.
Intro to MIS – MGS351 Databases and Data Warehouses Chapter 3.
Ahsan Abdullah 1 Data Warehousing Lecture-17 Issues of ETL Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
McGraw-Hill/Irwin Copyright © 2013 by The McGraw-Hill Companies, Inc. All rights reserved. Chapter 3 Databases and Data Warehouses: Supporting the Analytics-Driven.
Agenda 03/27/2014 Review first test. Discuss internal data project. Review characteristics of data quality. Types of data. Data quality. Data governance.
Business Intelligence Zamaneh Jahed. What is Business Intelligence? Business Intelligence (BI) is a broad category of applications and technologies for.
Case 2: Emerson and Sanofi Data stewards seek data conformity
© 2007 by Prentice Hall 1 Introduction to databases.
Marakas: Decision Support Systems, 2nd Edition © 2003, Prentice-Hall Chapter Chapter 10: The Data Warehouse Decision Support Systems in the 21 st.
Database Design Part of the design process is deciding how data will be stored in the system –Conventional files (sequential, indexed,..) –Databases (database.
Data warehousing and online analytical processing- Ref Chap 4) By Asst Prof. Muhammad Amir Alam.
BUS1MIS Management Information Systems Semester 1, 2012 Week 6 Lecture 1.
University of Nevada, Reno Organizational Data Design Architecture 1 Organizational Data Architecture (2/19 – 2/21)  Recap current status.  Discuss the.
The Data Warehouse “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of “all” an organisation’s data in support.
1 Reviewing Data Warehouse Basics. Lessons 1.Reviewing Data Warehouse Basics 2.Defining the Business and Logical Models 3.Creating the Dimensional Model.
C6 Databases. 2 Traditional file environment Data Redundancy and Inconsistency: –Data redundancy: The presence of duplicate data in multiple data files.
5-1 McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved.
MIS2502: Data Analytics The Information Architecture of an Organization.
DATABASE MANAGEMENT SYSTEMS CMAM301. Introduction to database management systems  What is Database?  What is Database Systems?  Types of Database.
Data Staging Data Loading and Cleaning Marakas pg. 25 BCIS 4660 Spring 2012.
DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What.
MIS2502: Data Analytics Dimensional Data Modeling
1 Technology in Action Chapter 11 Behind the Scenes: Databases and Information Systems Copyright © 2010 Pearson Education, Inc. Publishing as Prentice.
Chapter 5 DATA WAREHOUSING Study Sections 5.2, 5.3, 5.5, Pages: & Snowflake schema.
McGraw-Hill/Irwin ©2009 The McGraw-Hill Companies, All Rights Reserved CHAPTER 6 DATABASES AND DATA WAREHOUSES CHAPTER 6 DATABASES AND DATA WAREHOUSES.
Chapter 6.  Problems of managing Data Resources in a Traditional File Environment  Effective IS provides user with Accurate, timely and relevant information.
Advanced Database Concepts
University of Nevada, Reno Organizational Data Design Architecture 1 Agenda for Class: 02/06/2014  Recap current status. Explain structure of assignments.
MIS 451 Building Business Intelligence Systems Data Staging.
1 Copyright © Oracle Corporation, All rights reserved. Business Intelligence and Data Warehousing.
The Need for Data Analysis 2 Managers track daily transactions to evaluate how the business is performing Strategies should be developed to meet organizational.
MBA/1092/10 MBA/1093/10 MBA/1095/10 MBA/1114/10 MBA/1115/10.
The Concepts of Business Intelligence Microsoft® Business Intelligence Solutions.
BUSINESS INTELLIGENCE. The new technology for understanding the past & predicting the future … BI is broad category of technologies that allows for gathering,
11 Database The ultimate in data organization. 2 Database Management Systems (DBMS)  Application software designed to capture and analyze data  Four.
Intro to MIS – MGS351 Databases and Data Warehouses
CHAPTER SIX DATA Business Intelligence
Data Staging Data Staging Legacy System Data Warehouse SQL Server
Data warehouse and OLAP
Data storage is growing Future Prediction through historical data
MIS5101: Extract, Transform, Load (ETL)
Databases and Data Warehouses Chapter 3
MIS2502: Data Analytics Dimensional Data Modeling
Organizational Information – Data Warehouse
MIS5101: Business Intelligence Outcomes Measurement and Data Quality
MIS5101: Extract, Transform, Load (ETL)
MIS2502: Data Analytics Extract, Transform, Load
MIS5101: Extract, Transform, Load (ETL)
CHAPTER SIX OVERVIEW SECTION 6.1 – DATABASE FUNDAMENTALS
MIS2502: Data Analytics Extract, Transform, Load
MIS2502: Data Analytics Extract, Transform, Load
Data Warehousing Concepts
MIS2502: Data Analytics Data Extract, Transform, Load(ETL)
Presentation transcript:

GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics

Getting the information into the data mart The data in the operational database… …is put into a data warehouse… …which feeds the data mart… …and is analyzed as a cube. Now let’s address this part…

Extract, Transform, Load (ETL) The process of copying data from the transactional database to the analytical database Going from relational to dimensional Basically, it’s a matter of identifying where the data should come from to fill the data mart

ETL Defined from the operational data store (the relational database) Extract it into an analysis- ready format Transfor m it into the analytical database (the dimensional database) Load

The Actual Process Transactional Database 2 Transactional Database 1 Data Mart Query Data conversion Query Extract Transform Load

Main ETL Issues: Conversion Stage What if the data is in different formats? Data Consistency How do we know it’s correct? What if there is missing data? What if the data we need isn’t there? Data Quality

Data Consistency: The Problem with Legacy Systems An IT infrastructure evolves over time Systems are created and acquired by different people using different specifications This can happen through: Changes in management Mergers & Acquisitions Externally mandated standards General poor planning This can happen through: Changes in management Mergers & Acquisitions Externally mandated standards General poor planning

This leads to many issues Redundant data across the organization Customer record maintained by accounts receivable and marketing The same data element stored in different formats Social Security number ( versus ) Different naming conventions “Doritos” versus “Frito-Lay’s Doritos” verus “Regular Doritos” Different unique identifiers used Account_Number versus Customer_ID What are the problems with each of these ? What are the problems with each of these ?

What’s the big deal? This is a fundamental problem for creating data cubes We often need to combine information from several transactional databases How do we know if we’re talking about the same customer or product?

Now think about this scenario Hotel Reservation DatabaseCafé Database What are the differences between a “guest” and a “customer”?Is there any way to know if a customer of the café is staying at the hotel?

Solution: “Single view” of data The entire organization understands a unit of data in the same way It’s both a business goal and a technology goal and really more this…..than this

Closer look at the Guest/Customer Guests Guest_number Guest_firstname Guest_lastname Guest_address Guest_city Guest_zipcode Guest_ Customer Customer_number Customer_name Customer_address Customer_city Customer_zipcode Getting to a “single view” of data: How would you represent “name?” What would you use to uniquely identify a guest/customer? Would you include address? How do you figure out if you’re talking about the same person?vs.

Organizational issues Why might there be resistance to data standardization? Is it an option to just “fix” the transactional databases? If two data elements conflict, who’s standard “wins?”

Data Quality The degree to which the data reflects the actual environment Do we have the right data? Is the collection process reliable? Is the data accurate?

Finding the right data Choose data consistent with the goals of analysis Verify that the data really measures what it claims to measure Include the analysts in the design process Adapted from site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf

Ensuring accuracy Know where the data comes fromManual verification through samplingUse of knowledge expertsVerify calculations for derived measures Adapted from site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf

Reliability of the collection process Build fault tolerance into the process Check logs (if you can) Periodically run reports and verify results Keep up with (and communicate) changes Adapted from site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf