MIS2502: Data Analytics Data Extract, Transform, Load(ETL)

Slides:



Advertisements
Similar presentations
Module 8 Importing and Exporting Data. Module Overview Transferring Data To/From SQL Server Importing & Exporting Table Data Inserting Data in Bulk.
Advertisements

C6 Databases.
© Tally Solutions Pvt. Ltd. All Rights Reserved 1 Stock Audit in Shoper 9 February 2010.
Data Warehouse IMS5024 – presented by Eder Tsang.
1 ACCTG 6910 Building Enterprise & Business Intelligence Systems (e.bis) Data Staging Olivia R. Liu Sheng, Ph.D. Emma Eccles Jones Presidential Chair of.
Introduction to Systems Analysis and Design
Agenda 02/20/2014 Complete data warehouse design exercise Finish reconciled data warehouse, bus matrix and data mart Display each group’s work Discuss.
Agenda 02/21/2013 Discuss exercise Answer questions in task #1 Put up your sample databases for tasks #2 and #3 Define ETL in more depth by the activities.
MIS2502: Data Analytics Extract, Transform, Load
Kenny Trytek Joe Briggie Abby Birkett Derek Woods Advisor: Simanta Mitra Client: Matt Good, Kingland Systems.
Ch.3 Data, Text, and Document Management
Agenda 03/27/2014 Review first test. Discuss internal data project. Review characteristics of data quality. Types of data. Data quality. Data governance.
© 2007 by Prentice Hall 1 Introduction to databases.
GETTING THE DATA INTO THE WAREHOUSE: EXTRACT, TRANSFORM, LOAD MIS2502 Data Analytics.
C6 Databases. 2 Traditional file environment Data Redundancy and Inconsistency: –Data redundancy: The presence of duplicate data in multiple data files.
XP Chapter 1 Succeeding in Business with Microsoft Office Access 2003: A Problem-Solving Approach 1 Preparing To Automate Data Management Chapter 1 “You.
MIS2502: Data Analytics The Information Architecture of an Organization.
5 - 1 Copyright © 2006, The McGraw-Hill Companies, Inc. All rights reserved.
Building Marketing Databases. In-House or Outside Bureau? Outside Bureau: Outside agency that specializes in designing and developing customized databases.
INFORMATION MANAGEMENT Unit 2 SO 4 Explain the advantages of using a database approach compared to using traditional file processing; Advantages including.
DATABASE MANAGEMENT SYSTEMS CMAM301. Introduction to database management systems  What is Database?  What is Database Systems?  Types of Database.
AL-MAAREFA COLLEGE FOR SCIENCE AND TECHNOLOGY INFO 232: DATABASE SYSTEMS CHAPTER 1 DATABASE SYSTEMS Instructor Ms. Arwa Binsaleh.
Chapter 9 Database Systems © 2007 Pearson Addison-Wesley. All rights reserved.
Chapter 5 DATA WAREHOUSING Study Sections 5.2, 5.3, 5.5, Pages: & Snowflake schema.
Flat Files Relational Databases
MIS 451 Building Business Intelligence Systems Data Staging.
The Need for Data Analysis 2 Managers track daily transactions to evaluate how the business is performing Strategies should be developed to meet organizational.
The Concepts of Business Intelligence Microsoft® Business Intelligence Solutions.
Abstract MarkLogic Database – Only Enterprise NoSQL DB Aashi Rastogi, Sanket V. Patel Department of Computer Science University of Bridgeport, Bridgeport,
Database Principles: Fundamentals of Design, Implementation, and Management Chapter 1 The Database Approach.
Data Mining & OLAP What is Data Mining? Data Mining is the set of activities used to find new, hidden, or unexpected patterns in data.
Jaclyn Hansberry MIS2502: Data Analytics The Things You Can Do With Data The Information Architecture of an Organization Jaclyn.
Intro to MIS – MGS351 Databases and Data Warehouses
Chapter 6: Securing the Cloud
CHAPTER SIX DATA Business Intelligence
CIS 336 AID Your Dreams Our Mission/cis336aid.com
Managing Multi-User Databases
Client/Server Databases and the Oracle 10g Relational Database
Data Staging Data Staging Legacy System Data Warehouse SQL Server
MIS2502: Data Analytics Relational Data Modeling
Overview of MDM Site Hub
Advantages of ICT over Manual Methods of Processing Data
MIS5101: Business Intelligence Relational Data Modeling
MIS5101: Extract, Transform, Load (ETL)
Data Warehouse.
De-mystifying Big Data Testing using new generation tools / technology
Databases and Data Warehouses Chapter 3
MIS5101: Business Intelligence Outcomes Measurement and Data Quality
Database Management System (DBMS)
MIS5101: Extract, Transform, Load (ETL)
Chapter 1 Database Systems
MIS2502: Data Analytics Extract, Transform, Load
MIS5101: Extract, Transform, Load (ETL)
MIS2502: Data Analytics Relational Data Modeling
Introduction to Information Systems
MIS2502: Data Analytics The Information Architecture of an Organization Acknowledgement: David Schuff.
MIS2502: Data Analytics The Information Architecture of an Organization Aaron Zhi Cheng Acknowledgement:
MIS2502: Data Analytics Extract, Transform, Load
Introduction to Databases
MIS2502: Data Analytics Extract, Transform, Load
MIS2502: Data Analytics Dimensional Data Modeling
Chapter 1 Database Systems
MIS2502: Data Analytics Relational Data Modeling
Data Warehousing Concepts
MIS2502: Data Analytics Things You Can Do With Data & Information Architecture Zhe (Joe) Deng
MIS2502: Data Analytics MySQL and MySQL Workbench
Business Intelligence
Analytics, BI & Data Integration
Instructor Materials Chapter 5: Ensuring Integrity
Presentation transcript:

MIS2502: Data Analytics Data Extract, Transform, Load(ETL) Zhe (Joe) Deng deng@temple.edu http://community.mis.temple.edu/zdeng

Where we are… Now we’re here… Data entry Transactional Database Data extraction Analytical Data Store Data analysis Series of tables stored in a relational database Stored as structured or unstructured data in a variety of formats.

The tools we use in this course R and RStudio Data entry Transactional Database Data extraction Analytical Data Store Data analysis MySQL Server & MySQL Workbench Plain-text file (csv, JSON, XML) Tableau Prep

Extract, Transform, Load (ETL) Extract data from the operational data store Transform data into an analysis-ready format Load it into the analytical data store

The Actual Process Transactional Database 1 Data conversion Extract Transform Load Transactional Database 1 Data conversion Transactional Database 2 Data conversion Analytical Data store Other Sources Data conversion Relational database Dimensional database

ETL is Not That Easy! Data Consistency Data Quality What if the data is in different formats? Data Consistency How do we know it’s correct? What if there is missing data? What if the data we need isn’t there? Data Quality

Data Consistency: The Problem with Legacy Systems Legacy System: a previous or outdated computer system still in use US Department of Defense computers used to operate nuclear weapons still running on computers from the 1970s at a cost of billions of dollars each year to taxpayers Most car rental companies are still using MS-DOS to manage rental transactions

Data Consistency: Legacy Systems Example

Data Consistency: Legacy Systems Example

Data Consistency: The Problem with Legacy Systems An IT infrastructure evolves over time Systems are created and acquired by different people using different specifications This can happen through: Changes in management Mergers & Acquisitions Externally mandated standards Generally poor planning

Why Not Replacing Legacy Systems? Too much risk Prohibitive cost User reluctance Limited business agility Speed of delivery Source: https://www.onbase.com/~/media/Files/hyland/whitepaper/wp_trouble-with-legacy-systems.pdf

This leads to many issues Redundant data across the organization Customer record maintained by accounts receivable and marketing The same data element stored in different formats Social Security number (123-45-6789 versus 123456789) Different naming conventions “Doritos” versus “Frito-Lay’s Doritos” verus “Regular Doritos” Different unique identifiers used Account_Number versus Customer_ID What are the problems with each of these ?

What’s the big deal? This is a fundamental problem for creating data cubes We often need to combine information from several transactional databases How do we know if we’re talking about the same customer or product?

Now think about this scenario Hotel Reservation Database Café Database What are the differences between a “guest” and a “customer”? Is there any way to know if a customer of the café is staying at the hotel?

Solution: “Single view” of data The entire organization understands a unit of data in the same way It’s both a business goal and a technology goal but it’s really more this… ...than this

Closer look at the Guest/Customer Guests Guest_number Guest_firstname Guest_lastname Guest_address Guest_city Guest_zipcode Guest_email Customer Customer_number Customer_name Customer_address Customer_city Customer_zipcode vs. Getting to a “single view” of data: How would you represent “name?” What would you use to uniquely identify a guest/customer? Would you include email address? How do you figure out if you’re talking about the same person?

Data Transformation Steps Decomposes data elements Example: [name: Joe Cool ] → [FirstName: Joe, LastName: Deng) Parsing Corrects parsed data elements Example: street name does not exist and is replaced with the "closest" one Correcting Transforms data into its preferred format Example: Broad ST → Broad Street Standardizing Matches records within and across data sources Matching

Data Quality The degree to which the data reflects the actual environment Do we have the right data? Is the collection process reliable? Is the data accurate? Choose data consistent with the goals of analysis Verify that the data really measures what it claims to measure Include the analysts in the design process Know where the data comes from Manual verification through sampling Use of knowledge experts Verify calculations for derived measures Build fault tolerance into the process Periodically run reports, check logs, and verify results Keep up with (and communicate) changes Adapted from http://www2.ed.gov/about/offices/list/os/technology/plan/2004/ site/docs_and_pdf/Data_Quality_Audits_from_ESP_Solutions_Group.pdf

Summary What is ETL? Why is it important? Data consistency Data quality Explain the purpose of each component (Extract, Transform, Load)

Time for our 7th ICA!