MIS2502: Data Analytics Extract, Transform, Load

Slides:



Advertisements
Similar presentations
Design Validation CSCI 5801: Software Engineering.
Advertisements

CHAPTER OBJECTIVE: NORMALIZATION THE SNOWFLAKE SCHEMA.
C6 Databases.
© Tally Solutions Pvt. Ltd. All Rights Reserved 1 Stock Audit in Shoper 9 February 2010.
Data Warehouse IMS5024 – presented by Eder Tsang.
Today’s Goals Concepts  I want you to understand the difference between  Data  Information  Knowledge  Intelligence.
Data Mining.
MIS2502: Data Analytics Relational Data Modeling
Agenda 02/20/2014 Complete data warehouse design exercise Finish reconciled data warehouse, bus matrix and data mart Display each group’s work Discuss.
Agenda 02/21/2013 Discuss exercise Answer questions in task #1 Put up your sample databases for tasks #2 and #3 Define ETL in more depth by the activities.
MIS2502: Data Analytics Extract, Transform, Load
DAY 15: ACCESS CHAPTER 2 Larry Reaves October 7,
Introduction to database systems
Agenda 03/27/2014 Review first test. Discuss internal data project. Review characteristics of data quality. Types of data. Data quality. Data governance.
© 2007 by Prentice Hall 1 Introduction to databases.
GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics.
C6 Databases. 2 Traditional file environment Data Redundancy and Inconsistency: –Data redundancy: The presence of duplicate data in multiple data files.
Building Marketing Databases. In-House or Outside Bureau? Outside Bureau: Outside agency that specializes in designing and developing customized databases.
© All Rights Reserved Module Information and the Organisation Well Designed Interfaces.
AvaTax ST The fastest, easiest, most accurate and affordable way to calculate and file sales and use taxes.
7 Strategies for Extracting, Transforming, and Loading.
MIS2502: Data Analytics Relational Data Modeling
TIMOTHY SERVINSKY PROJECT MANAGER CENTER FOR SURVEY RESEARCH Data Preparation: An Introduction to Getting Data Ready for Analysis.
CHAPTER 1 – INTRODUCTION TO ACCESS Akhila Kondai September 30, 2013.
Oracle’s EPM System and Strategy
MIS2502: Data Analytics Relational Data Modeling David Schuff
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 1 Database Systems.
Project management Topic 8 Configuration Management.
AOIT Database Design Unit 3, Lesson 9 Data Integrity Copyright © 2009–2011 National Academy Foundation. All rights reserved.
Database Principles: Fundamentals of Design, Implementation, and Management Chapter 1 The Database Approach.
Jaclyn Hansberry MIS2502: Data Analytics The Things You Can Do With Data The Information Architecture of an Organization Jaclyn.
Introduction To DBMS.
Client/Server Databases and the Oracle 10g Relational Database
MIS2502: Data Analytics Relational Data Modeling
Overview of MDM Site Hub
Data warehouse and OLAP
Processing Integrity and Availability Controls
MIS5101: Business Intelligence Relational Data Modeling
MIS5101: Extract, Transform, Load (ETL)
Lecture Outline 4 Transaction Systems and Beyond
MIS2502: Data Analytics Dimensional Data Modeling
What are the Database Applications
MIS5101: Data Analytics The Things You Can Do With Data
MIS5101: Business Intelligence Outcomes Measurement and Data Quality
MIS5101: Extract, Transform, Load (ETL)
MIS2502: Data Analytics Extract, Transform, Load
MIS5101: Extract, Transform, Load (ETL)
MIS2502: Data Analytics Relational Data Modeling
Systems analysis and design, 6th edition Dennis, wixom, and roth
MIS2502: Data Analytics The Information Architecture of an Organization David Schuff
MIS2502: Data Analytics The Information Architecture of an Organization Acknowledgement: David Schuff.
Systems analysis and design, 6th edition Dennis, wixom, and roth
Systems Analysis and Design
MIS2502: Data Analytics The Information Architecture of an Organization Aaron Zhi Cheng Acknowledgement:
MIS2502: Data Analytics The Things You Can Do With Data
MIS2502: Data Analytics Relational Data Modeling
MIS2502: Data Analytics Dimensional Data Modeling
MIS2502: Data Analytics Extract, Transform, Load
MIS2502: Data Analytics Relational Data Modeling
MIS2502: Data Analytics Dimensional Data Modeling
Data Quality: Why it Matters
Comparative Reporting & Analysis (CR&A)
MIS2502: Data Analytics Relational Data Modeling
The ultimate in data organization
Data Warehousing Concepts
Introducing a Database
MIS2502: Data Analytics Data Extract, Transform, Load(ETL)
Valuing Organizational Information
Instructor Materials Chapter 5: Ensuring Integrity
Presentation transcript:

MIS2502: Data Analytics Extract, Transform, Load Acknowledgement: David Schuff

Where we are… Now we’re here… Data entry Transactional Database Data extraction Analytical Data Store Data analysis Stores real-time transactional data Stores historical transactional and summary data

Extract, Transform, Load (ETL) Extract data from the transactional database Transform data into an analysis-ready format Load it into the analytical data store

The Actual Process Transactional Database 1 Data conversion Extract Transform Load Transactional Database 1 Data conversion Transactional Database 2 Analytical Data store Data conversion Other Sources Data conversion Relational database Dimensional database

ETL’s Not That Easy! Data Consistency Data Quality What if the data is in different formats? Data Consistency How do we know it’s correct? What if there is missing data? What if the data we need isn’t there? Data Quality

Data Consistency: The Problem with Legacy Systems An IT infrastructure evolves over time Systems are created and acquired by different people using different specifications This can happen through: Changes in management Mergers & Acquisitions Externally mandated standards Generally poor planning

Why Not Replacing Legacy Systems? Too much risk Prohibitive cost User reluctance Limited business agility Speed of delivery https://www.onbase.com/~/media/Files/hyland/whitepaper/wp_trouble-with-legacy-systems.pdf

Problems with Data Consistency The same data element stored in different formats Social Security number (123-45-6789 versus 123456789) Date (10/9/2015 versus 9/10/2015) Redundant data across the organization Customer record maintained by accounts receivable and marketing Different naming conventions “Management Information Systems” versus “MIS” versus “Man. Info. Sys.” Different unique identifiers used AccessNet account versus Temple ID What are the problems with each of these ?

What’s the big deal? This is a fundamental problem for creating the analytical data store We often need to combine information from several transactional databases How do we know if we’re talking about the same customer or product?

Now think about this scenario Hotel Reservation Database Café Database What are the differences between a “guest” and a “customer”? Is there any way to know if a customer of the café is staying at the hotel?

Solution: “Single view” of data The entire organization understands a unit of data in the same way It’s both a business goal and a technology goal but it’s really more this… ...than this

Organizational issues Why might there be resistance to data standardization? Is it an option to just “fix” the transactional databases? If two data elements conflict, who’s standard “wins?”

Closer look at the Guest/Customer Guests Guest_number Guest_firstname Guest_lastname Guest_address Guest_city Guest_zipcode Guest_email Customer Customer_number Customer_name Customer_address Customer_city Customer_zipcode vs. Getting to a “single view” of data: How would you represent “name?” What would you use to uniquely identify a guest/customer? Would you include email address? How do you figure out if you’re talking about the same person?

Data Transformation Steps Decomposes data elements Example: [name: Joe Cool ]→[FirstName: Joe, LastName: Cool) Parsing Corrects parsed data elements Example: street name does not exist and is replaced with the "closest" one Correcting Transforms data into its preferred format Example: Broad ST → Broad Street Standardizing Matches records within and across data sources Matching

Data Quality The degree to which the data reflects the actual environment Do we have the right data? Is the collection process reliable? Is the data accurate?

Finding the right data Choose data consistent with the goals of analysis Verify that the data really measures what it claims to measure Include the analysts in the design process Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/ site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf

Ensuring accuracy Know where the data comes from Manual verification through sampling Use of knowledge experts Verify calculations for derived measures Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/ site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf

Reliability of the collection process Build fault tolerance into the process Periodically run reports, check logs, and verify results Keep up with (and communicate) changes Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/ site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf

Summary What is ETL? Why is it important? Data consistency Data quality Explain the purpose of each component (Extract, Transform, Load)

Next Excel Basics Assignment: ETL process on an Excel workbook You will be: Extracting the data from source worksheets. Transforming the data using Excel formulas. Loading the data into a new worksheet that contains a single set of combined data.