STAT 490DS1 Data Quality.

Presentation transcript:

STAT 490DS1 Data Quality

Data Quality
- First Step – Data Validation: identify the errors in the data
- Second Step – Data Cleansing: steps taken to deal with errors

Data Validation
- Types of Errors: Missing Data, Inconsistent Values, Duplicate Data, Inaccurate Data
- Other Issues: Outliers

Missing Values
- Common error
- Generally easily identified
- Data was not collected: Gender, Smoking Class
- Refused to provide: Weight, Height

Missing Data – Resolution
- Eliminate the Data Record
  - May be acceptable for a few records
  - Lose data from eliminated records
  - Not acceptable if many records are missing data
  - Probably not reasonable in a mortality or lapse study
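
The trade-off above can be made visible by reporting how much data elimination would cost before committing to it. The following is a minimal sketch; the record layout, the field names, and the `drop_incomplete` helper are hypothetical and not part of the original slides.

```python
def drop_incomplete(rows, required):
    """Keep only records with every required field present.

    Returns the kept records and the share of records that were dropped,
    so the analyst can judge whether elimination is acceptable.
    """
    kept = [r for r in rows if all(r.get(f) is not None for f in required)]
    dropped_share = 1 - len(kept) / len(rows) if rows else 0.0
    return kept, dropped_share

# Hypothetical sample: one of four records is missing the smoking class.
policies = [
    {"id": 1, "gender": "F", "smoking_class": "NS"},
    {"id": 2, "gender": "M", "smoking_class": None},
    {"id": 3, "gender": "F", "smoking_class": "SM"},
    {"id": 4, "gender": "M", "smoking_class": "NS"},
]

kept, dropped_share = drop_incomplete(policies, ["gender", "smoking_class"])
```

A small dropped share may be tolerable; a large one suggests estimating or ignoring the missing attribute instead.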

Missing Data – Resolution
- Estimate the Missing Value
  - Averages
  - Use other data in the record to approximate
  - Use external data
  - Use value from previous record
- Ignore the Missing Value
  - Do not include the missing attribute in analysis
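
Two of the estimation options listed above, averages and the value from the previous record, might be sketched as follows. The records, the `weight` field, and both helper functions are illustrative assumptions; `None` stands in for a missing value.

```python
from statistics import mean

# Hypothetical records; None marks a missing weight.
records = [
    {"id": 1, "weight": 150.0},
    {"id": 2, "weight": None},
    {"id": 3, "weight": 180.0},
    {"id": 4, "weight": None},
]

def fill_with_mean(rows, field):
    """Replace missing values with the average of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    avg = mean(observed)
    return [{**r, field: avg if r[field] is None else r[field]} for r in rows]

def fill_with_previous(rows, field):
    """Replace missing values with the value carried from the previous record."""
    filled, last = [], None
    for r in rows:
        value = r[field] if r[field] is not None else last
        filled.append({**r, field: value})
        last = value
    return filled

mean_filled = fill_with_mean(records, "weight")
prev_filled = fill_with_previous(records, "weight")
```

Carrying the previous value forward only makes sense when neighbouring records are related (for example, sorted by person and date); if the first record is missing, this sketch leaves it as `None`.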

Inconsistent Data
- Common error
- Example – Birthdate and age are inconsistent
  - Jeff was born 11/29/1955 and was recorded as age 62 on September 5, 2019 (by that date he had already turned 63)
- Generally the result of an entry error
- Identify from redundant data or outside data
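
The birthdate and age example can be checked mechanically by recomputing the age from the redundant birthdate. This sketch assumes "age" means age at last birthday; the function names are hypothetical.

```python
from datetime import date

def age_on(birthdate, as_of):
    """Age in completed years (age last birthday) on the as_of date."""
    had_birthday = (as_of.month, as_of.day) >= (birthdate.month, birthdate.day)
    return as_of.year - birthdate.year - (0 if had_birthday else 1)

def age_is_consistent(birthdate, recorded_age, as_of):
    """True when the recorded age matches the age implied by the birthdate."""
    return age_on(birthdate, as_of) == recorded_age

# The example from the slide: born 11/29/1955, recorded age 62 on 9/5/2019.
computed_age = age_on(date(1955, 11, 29), date(2019, 9, 5))
consistent = age_is_consistent(date(1955, 11, 29), 62, date(2019, 9, 5))
```

A study that uses age nearest birthday would need a different `age_on`, but the recorded age of 62 would still be flagged.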

Inconsistent Data – Resolution
- Use redundant data to correct
- Utilize outside data set
  - Example – Birthdate from Social Security

Duplicate Record
- Common error
- May have slightly different data
- Search on key attributes to identify duplicates
- Be careful not to eliminate data that should be included
  - A person appears in death data multiple times
  - Could have multiple policies

Duplicate Record – Resolution
- Eliminate duplicate records while maximizing data retained
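
One way to eliminate duplicates while retaining as much data as possible is to merge records that share the same key attributes, filling gaps in the first copy from later copies. This is a sketch under assumptions: the field names, the sample data, and the choice of name, birthdate, and policy number as the key are all hypothetical. Including the policy number in the key keeps a person with multiple policies from being collapsed into one record.

```python
def merge_duplicates(rows, key_fields):
    """Merge records sharing the same key, keeping the first non-missing
    value seen for each field."""
    merged = {}
    for r in rows:
        key = tuple(r[f] for f in key_fields)
        if key not in merged:
            merged[key] = dict(r)
            continue
        for field, value in r.items():
            # Fill a gap in the retained record from the duplicate.
            if merged[key].get(field) is None and value is not None:
                merged[key][field] = value
    return list(merged.values())

# Hypothetical death records: the first two are duplicates of one claim,
# the third is a legitimate second policy for the same person.
deaths = [
    {"name": "A. Smith", "birthdate": "1950-01-02", "policy": "P1", "cause": None},
    {"name": "A. Smith", "birthdate": "1950-01-02", "policy": "P1", "cause": "cardiac"},
    {"name": "A. Smith", "birthdate": "1950-01-02", "policy": "P2", "cause": "cardiac"},
]

deduped = merge_duplicates(deaths, ["name", "birthdate", "policy"])
```

Exact key matching will miss duplicates whose key fields themselves differ slightly; those need a looser comparison and manual review.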

Inaccurate Data
- Generally the most difficult to identify
- Unreasonable values
  - Negative ages or death benefit
  - Ages over 90 or 100: may not be wrong but should be flagged for review
  - A birthdate with a month of 13
- Often entry error
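
The reasonableness checks listed above lend themselves to simple validation rules that flag records for review rather than reject them outright. The sketch below is illustrative only: the field names, the messages, and the default review threshold of 90 are assumptions.

```python
def flag_record(rec, review_age=90):
    """Return a list of reasons the record looks unreasonable (empty if none)."""
    flags = []
    if rec["age"] < 0:
        flags.append("negative age")
    elif rec["age"] > review_age:
        # May be correct, but worth a manual look.
        flags.append("high age, review")
    if rec["death_benefit"] < 0:
        flags.append("negative death benefit")
    if not 1 <= rec["birth_month"] <= 12:
        flags.append("invalid birth month")
    return flags

# Hypothetical records covering each check.
records = [
    {"id": 1, "age": 45, "death_benefit": 100_000, "birth_month": 6},
    {"id": 2, "age": -3, "death_benefit": 50_000, "birth_month": 13},
    {"id": 3, "age": 97, "death_benefit": -1, "birth_month": 2},
]

flagged = {r["id"]: flag_record(r) for r in records if flag_record(r)}
```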

Inaccurate Data – Resolution
- Attempt to correct data
- Many of the same techniques as Missing Data
- Utilize any redundant data
- Utilize sources used to identify error

Outliers
- Outliers are not errors
- Still cause problems, as they may skew the analysis
- Example
  - Company has a retention limit of 1,000,000
  - Has a death claim of 10,000,000
  - Really only cost the company 1,000,000
  - May be more appropriate to include as 1,000,000
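
The retention-limit example amounts to capping each claim at the amount the company actually retains. A minimal sketch, with a hypothetical function name:

```python
def retained_claim(claim, retention_limit):
    """Amount of a claim the company bears itself: the claim capped at
    the retention limit."""
    return min(claim, retention_limit)

# The example from the slide: a 10,000,000 claim with a 1,000,000 limit.
cost_to_company = retained_claim(10_000_000, 1_000_000)
small_claim_cost = retained_claim(250_000, 1_000_000)
```

Claims below the limit pass through unchanged, so only the outlier is adjusted.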