Review: Simulation (see handout) Teradata & SQL accounts LMS GIGO

Introduction to Data Quality & Data Cleaning
Data and information are not static; they flow through a data collection and usage process:
- Data gathering
- Data delivery
- Data storage
- Data integration
Then what?

Data Quality Characteristics
Data quality affects several attributes associated with data:
- Accuracy: Is it realistic or believable?
- Integrity: Is it structured and managed?
- Consistency: Is it consistently defined and maintained?
- Validity: Is the data valid, based on business or industry rules and standards?

What Causes Poor Data Quality?
These factors can contribute to poor data quality:
- Business rules do not exist, or there are no standards for data capture.
- Standards exist but are not enforced at the point of data capture.
- Data entry is inconsistent (incorrect spelling; use of nicknames, middle names, or aliases).
- Data entry mistakes happen (character transposition, misspellings, and so on).
- Data is integrated from systems with different data standards.
- Data quality issues are perceived as time-consuming and expensive to fix.

Many factors can contribute to poor data quality, including:
- A lack of business rules (or standards) at the point of data capture. This is the cheapest place to deal with data quality: if it is dealt with at the point of data capture, it does not have to be handled "downstream" during the data warehousing or decision support processes.
- Standards not being enforced at the point of data capture. This is extremely prevalent where divisions and departments all run their own systems and cannot agree on which standards should be used or enforced when capturing data.
- Inconsistencies in data entry (across different users, different departments, or even multiple operational systems).
- Common data entry mistakes.
- Pulling data together from multiple systems, each with a different standard. This is an issue when pulling in historical data that may have been entered via an old legacy system. With mergers and acquisitions, it has become more and more of an issue.
- The misperception that data quality issues are time-consuming and expensive to fix.

Primary Sources of Data Quality Problems
This chart represents the percentage of data quality problems that are caused by these sources. Why do these problems happen?
- Typos and misspellings when data is being entered by employees. Sometimes even valid phone numbers and addresses are not necessarily correct!
- Changes in source systems that cause problems when the data is propagated downstream in the warehousing process.
- Data migration and conversion of data from multiple operational sources. The data was entered with different standards and therefore does not necessarily merge cleanly.
- Differing levels of expectations and standards among users. What is good enough for one user may not be good enough for another. This is especially common when many different groups of users within an organization use the same data, each with its own business issues and concerns (for example, how do you define a customer, supplier, or product?).
Source: The Data Warehousing Institute, Data Quality and the Bottom Line, 2002

How Is Clean Data Achieved?
Clean data is the result of a combination of efforts:
- making sure that data entered into the system is clean (GIGO: garbage in, garbage out)
- cleaning up problems after the data is accepted.

Some errors are easier to fix than others…
- If we have data for Male and Female (‘M’, ‘F’), is it a reasonable assumption that ‘f’ is supposed to be ‘F’?
- How about age: is 135 valid? Can we fix it? What if we had DOB?
- What about subject codes: BI is 306622. What if we find 306662? Can we fix it?
Miskeyed digits like this are a common data entry problem. When two adjacent characters are swapped (for example, 306262), the error is called a transposition.
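The checks discussed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the exercises: the function names and the age cutoff of 120 are assumptions introduced here. Note that 306662 differs from 306622 in a single digit, so the sketch checks for that case separately from a true adjacent swap such as 306262.

```python
from datetime import date

VALID_GENDERS = {"M", "F"}
MAX_PLAUSIBLE_AGE = 120  # assumed cutoff, not taken from the slides


def clean_gender(value):
    """Return 'M' or 'F' when the value differs only by case or spacing, else None."""
    code = value.strip().upper()
    return code if code in VALID_GENDERS else None


def is_plausible_age(age):
    """Flag ages such as 135 as suspect rather than silently accepting them."""
    return 0 <= age <= MAX_PLAUSIBLE_AGE


def age_from_dob(dob, today):
    """Recompute age in whole years from a date of birth, if one is available."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


def differs_by_one_digit(code, known):
    """True when two equal-length codes differ in exactly one position."""
    return len(code) == len(known) and sum(a != b for a, b in zip(code, known)) == 1


def is_adjacent_swap(code, known):
    """True when swapping one adjacent pair of characters in `code` yields `known`."""
    if len(code) != len(known):
        return False
    diffs = [i for i, (a, b) in enumerate(zip(code, known)) if a != b]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and code[diffs[0]] == known[diffs[1]]
        and code[diffs[1]] == known[diffs[0]]
    )
```

A near-miss detected this way is only a candidate fix: 306662 could be a mistyped 306622, or it could be a different, valid code, so a lookup against the list of known codes is still needed before correcting anything.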

Analysis and Standardization Example
Who is the biggest supplier?

Anderson Construction    $ 2,333.50
Briggs,Inc               $ 8,200.10
Brigs Inc.               $12,900.79
Casper Corp.             $27,191.05
Caspar Corp              $ 6,000.00
Solomon Industries       $43,150.00
The Casper Corp          $11,500.00

The example shows that the vendor Briggs Incorporated has two incomplete expenditure summations and Casper Corporation has three incomplete expenditure summations (with one entry listed near the end of the report as The Casper Corp). This is a sample from 27,000 entries.

Standardization Scheme
Briggs, Inc -> Briggs Inc.
Brigs Inc. -> Briggs Inc.
Casper Corp. -> Casper Corp.
Caspar Corp -> Casper Corp.
The Casper Corp -> Casper Corp.
Using dfPower Studio to identify the different permutations of each company name, you can build a scheme and use it to standardize the company name so that each vendor has one unique representation within the database.
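As a rough illustration of what applying a standardization scheme does, the sketch below maps each name variant to one canonical name and then totals the sample expenditures. It is plain Python, not dfPower Studio; the dictionary, the function name `totals_by_vendor`, and the choice of `Decimal` for currency are assumptions made for this example.

```python
from collections import defaultdict
from decimal import Decimal

# Standardization scheme: each observed variant maps to one canonical name.
# Variants are spelled exactly as they appear in the sample report.
SCHEME = {
    "Briggs,Inc": "Briggs Inc.",
    "Brigs Inc.": "Briggs Inc.",
    "Casper Corp.": "Casper Corp.",
    "Caspar Corp": "Casper Corp.",
    "The Casper Corp": "Casper Corp.",
}

# Rows from the sample report: (vendor name as entered, amount spent).
RAW_SPENDING = [
    ("Anderson Construction", "2333.50"),
    ("Briggs,Inc", "8200.10"),
    ("Brigs Inc.", "12900.79"),
    ("Casper Corp.", "27191.05"),
    ("Caspar Corp", "6000.00"),
    ("Solomon Industries", "43150.00"),
    ("The Casper Corp", "11500.00"),
]


def totals_by_vendor(rows, scheme):
    """Sum spending per vendor after replacing each name with its canonical form."""
    totals = defaultdict(Decimal)
    for name, amount in rows:
        canonical = scheme.get(name, name)  # names not in the scheme pass through
        totals[canonical] += Decimal(amount)
    return dict(totals)


totals = totals_by_vendor(RAW_SPENDING, SCHEME)
biggest = max(totals, key=totals.get)
```

On the sample rows, the unstandardized report makes Solomon Industries ($43,150.00) look like the biggest supplier. After consolidation, Casper Corp. totals $44,691.05 and Briggs Inc. totals $21,100.89, so the answer to "Who is the biggest supplier?" changes.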

Supplier Spending
[Bar chart: $ spent (axis 10,000 to 50,000) by vendor: Casper Corp., Solomon Ind., Briggs Inc., Anderson Cons.]
The Cost by Vendor report will be more organized, useful, and accurate, because vendor expenditure data is now consolidated. A section of the final report appears in the chart.

Data Matching Example
[Diagram: Operational System of Records -> Data Warehouse]
01 Mark Carver, SAS, SAS Campus Drive, Cary, N.C.
02 Mark W. Craver, Mark.Craver@sas.com
03 M Craver, Systems Engineer, SAS
Consider the situation where you have records representing an employee stored across three different operational systems. The goal is to merge the records into one master record for the employee. The problem is that the employee’s name is stored slightly differently in each of the data sources. An attempt to merge the records based on employee name does not yield the desired results.

Data Quality Process
[Diagram: Operational System of Records -> DQ -> Data Warehouse]
Source records:
Mark Carver, SAS, SAS Campus Drive, Cary, N.C.
Mark W. Craver, Mark.Craver@sas.com
Mark Craver, Systems Engineer
Master record after DQ:
01 Mark Craver, Systems Engineer, SAS, SAS Campus Drive, Cary, N.C. 27513, Mark.Craver@sas.com
If, however, we add a data cleansing process that applies fuzzy logic to the records so that they can be merged, we can attain the desired results. This is the type of functionality that the algorithms in the SAS Data Quality Solution make possible.
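To make the idea of fuzzy matching concrete, here is a deliberately crude sketch. It is not the SAS Data Quality algorithm. It assumes a made-up match key (first initial plus the sorted letters of the last name), so that "Carver" and "Craver" produce the same key. The field names and the functions `match_key` and `merge_group` are hypothetical.

```python
from collections import defaultdict

# The three operational records from the data matching example.
RECORDS = [
    {"name": "Mark Carver", "company": "SAS", "address": "SAS Campus Drive, Cary, N.C."},
    {"name": "Mark W. Craver", "email": "Mark.Craver@sas.com"},
    {"name": "M Craver", "title": "Systems Engineer", "company": "SAS"},
]


def match_key(name):
    """Crude match key: first initial plus the sorted letters of the last name."""
    tokens = name.replace(".", "").split()
    first_initial = tokens[0][0].upper()
    surname_letters = "".join(sorted(tokens[-1].lower()))
    return first_initial, surname_letters


def merge_group(records):
    """Keep every name variant, and the first value seen for each other field."""
    master = {"name_variants": [r["name"] for r in records]}
    for record in records:
        for field, value in record.items():
            if field != "name":
                master.setdefault(field, value)
    return master


# Group records that share a match key, then merge each group into one master.
groups = defaultdict(list)
for record in RECORDS:
    groups[match_key(record["name"])].append(record)

masters = [merge_group(group) for group in groups.values()]
```

All three records share one key, so they collapse into a single master record carrying the company, address, email, and title. A key this loose would also merge unrelated people whose surnames happen to be anagrams, which is why real matching tools combine several fields and use more careful comparisons. Choosing the correct spelling of the name still needs a rule or a human decision.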

ETL
What is ETL? E T L
How important is it in BI?
Exercise one: data cleaning with SAS and MS Access. Run SAS 9.1 and follow the demo; answer the questions as you go.
Exercise two: OLAP
When you are finished, you can try eTrainer.
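ETL stands for extract, transform, load. The sketch below shows the three steps on a tiny scale using only the Python standard library. It is not the SAS or MS Access workflow used in the exercise: the sample rows, the `person` table, and the age cutoff of 120 are all hypothetical and exist only to illustrate the pattern.

```python
import csv
import io
import sqlite3

# A hypothetical source file, inlined so the sketch is self-contained.
SOURCE_CSV = """name,gender,age
Ann,f,34
Bob,M,135
Cy,m,
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: standardize gender codes; blank out missing or implausible ages."""
    cleaned = []
    for row in rows:
        gender = row["gender"].strip().upper()
        age = int(row["age"]) if row["age"].strip().isdigit() else None
        if age is not None and age > 120:  # assumed plausibility cutoff
            age = None
        cleaned.append((row["name"].strip(), gender, age))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE person (name TEXT, gender TEXT, age INTEGER)")
    conn.executemany("INSERT INTO person VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
result = conn.execute("SELECT name, gender, age FROM person ORDER BY name").fetchall()
```

Running the steps as a script rather than by hand means the same cleaning rules are applied every time the data is refreshed, which is the point behind the script/batch versus manual question in the review.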

Exercise two: OLAP
Answer the following on the back of the handout.
Query 1:
- How many Sales Districts are there?
- Which district is consistently the top performer?
- Which quarter had the best revenue (use totals)?
- What’s the total revenue for 2004?
- Which region would you be concerned about and why?
- What’s the total revenue for 2003?
Query 2:
- How many Sales Regions are there?
- How many Sales Districts in the USA?
- How many Sales Reps in NE USA?
- How many Sales Reps in the USA?
- Best and Worst Sales Reps?
Query 3:
- Graph the results like this. What’s missing? How can you fix this?
- What’s happened in Q1 2005?

Query 1

Query 4
Create this report (Sales Reps by Revenue). Look for the SQL that is created and run for you; copy and save it in a text file.