1
STAT 490DS1 Data Quality
2
Data Quality
First Step – Data Validation
  Identify the errors in the data
Second Step – Data Cleansing
  Steps taken to deal with errors
3
Data Validation
Types of Errors
  Missing Data
  Inconsistent Values
  Duplicate Data
  Inaccurate Data
Other Issues
  Outliers
4
Missing Values
Common error
  Data was not collected
    Gender
    Smoking Class
  Refused to provide
    Weight
    Height
Generally easily identified
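Because missing values are usually easy to spot, a simple scan of each record is often enough to find them. The sketch below assumes records are stored as dictionaries and that a missing value is represented by None or an empty string; the field names and sample values are hypothetical.

```python
# Hypothetical records; None marks a value that was not collected
# or that the person refused to provide.
records = [
    {"id": 1, "gender": "F", "smoking_class": "NS", "weight": 140, "height": 65},
    {"id": 2, "gender": None, "smoking_class": "S", "weight": None, "height": 70},
    {"id": 3, "gender": "M", "smoking_class": None, "weight": 180, "height": None},
]

def missing_fields(record):
    """Return the names of fields whose value is None or an empty string."""
    return [name for name, value in record.items() if value in (None, "")]

# Map each incomplete record's id to the fields it is missing.
report = {r["id"]: missing_fields(r) for r in records if missing_fields(r)}
print(report)
```

A report like this also shows how widespread the problem is, which matters when choosing a resolution.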
5
Missing Data – Resolution
Eliminate the Data Record
  May be acceptable for a few records
  Lose data from eliminated records
  Not acceptable if many records are missing data
  Probably not reasonable in mortality or lapse study
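One way to apply the elimination approach is to drop incomplete records and then check how much data was lost before accepting the result. This is a minimal sketch; the sample records and the 5% threshold are assumptions chosen for illustration, not values from the slides.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 45, "weight": 150},
    {"id": 2, "age": None, "weight": 172},
    {"id": 3, "age": 61, "weight": 190},
    {"id": 4, "age": 38, "weight": None},
]

# Assumed tolerance for the share of records that may be dropped.
MAX_DROP_RATE = 0.05

def is_complete(record):
    """True when no field in the record is missing."""
    return all(value is not None for value in record.values())

complete = [r for r in records if is_complete(r)]
drop_rate = (len(records) - len(complete)) / len(records)
acceptable = drop_rate <= MAX_DROP_RATE

print(f"kept {len(complete)} of {len(records)} records, drop rate {drop_rate:.0%}")
print("elimination acceptable:", acceptable)
```

Here half the records would be lost, so elimination fails the assumed tolerance and another resolution is needed.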
6
Missing Data – Resolution
Estimate the Missing Value
  Averages
  Use other data in the record to approximate
  Use external data
  Use value from previous record
Ignore the Missing Value
  Do not include the missing attribute in analysis
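Two of the estimation options above, averages and the value from the previous record, can be sketched as follows. The weights are hypothetical, and None is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical weights in record order; None marks a missing value.
weights = [150, None, 190, None, 170]

def fill_with_mean(values):
    """Replace each missing value with the average of the known values."""
    known = [v for v in values if v is not None]
    average = mean(known)
    return [average if v is None else v for v in values]

def fill_with_previous(values):
    """Replace each missing value with the most recent known value.

    A missing value with no earlier known value stays None.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

print(fill_with_mean(weights))
print(fill_with_previous(weights))
```

Using the previous record only makes sense when neighboring records are related, for example successive observations of the same person.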
7
Inconsistent Data
Common error
  Example – Birthdate and age are inconsistent
    Jeff was born 11/29/1955 and was recorded as age 62 on September 5, 2019 (the birthdate implies age 63)
  Generally result of entry error
  Identify from redundant data or outside data
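The birthdate and age example can be checked by recomputing the age from the birthdate and comparing it with the recorded value. This sketch assumes age means completed years (age last birthday); a study that uses age nearest birthday would need a different calculation.

```python
from datetime import date

def age_on(birthdate, as_of):
    """Age in completed years (age last birthday) on the as_of date."""
    had_birthday = (as_of.month, as_of.day) >= (birthdate.month, birthdate.day)
    return as_of.year - birthdate.year - (0 if had_birthday else 1)

# Values from the example on the slide.
birthdate = date(1955, 11, 29)
as_of = date(2019, 9, 5)
recorded_age = 62

computed_age = age_on(birthdate, as_of)
is_consistent = computed_age == recorded_age

print(computed_age, is_consistent)
```

The check only shows that the two fields disagree; it does not say which one holds the entry error.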
8
Inconsistent Data – Resolution
Use redundant data to correct
Utilize outside data set
  Example – Birthdate from Social Security
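A correction step might look like the sketch below, which treats a birthdate from an outside source as authoritative and recomputes the redundant age field from it. The record layout and the `resolve_age` helper are hypothetical, and age is assumed to mean completed years.

```python
from datetime import date

def age_on(birthdate, as_of):
    """Age in completed years on the as_of date."""
    had_birthday = (as_of.month, as_of.day) >= (birthdate.month, birthdate.day)
    return as_of.year - birthdate.year - (0 if had_birthday else 1)

def resolve_age(record, reference_birthdate, as_of):
    """Return a corrected copy of record, trusting the reference birthdate.

    The 'changed' entry lists which fields were overwritten, so every
    correction can be reviewed later.
    """
    corrected = dict(record, birthdate=reference_birthdate,
                     age=age_on(reference_birthdate, as_of))
    corrected["changed"] = [k for k in ("birthdate", "age")
                            if record[k] != corrected[k]]
    return corrected

record = {"name": "Jeff", "birthdate": date(1955, 11, 29), "age": 62}
# Assumed: the outside source confirms the birthdate already on file.
fixed = resolve_age(record, date(1955, 11, 29), date(2019, 9, 5))
print(fixed["age"], fixed["changed"])
```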
9
Duplicate Record
Common error
  May have slightly different data
Search on key attributes to identify duplicates
Careful not to eliminate data that should be included
  A person appears in death data multiple times
    Could have multiple policies
10
Duplicate Record – Resolution
Eliminate duplicate records while maximizing data retained
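One possible way to eliminate duplicates while retaining as much data as possible is to group records on key attributes and fill gaps in the kept record from its duplicates. In this sketch the key includes the policy number so that a person with multiple policies is not collapsed into one record; the field names and sample data are hypothetical.

```python
# Hypothetical death records; None marks a missing value.
records = [
    {"name": "A. Smith", "birthdate": "1950-02-01", "policy": "P-100", "cause": None},
    {"name": "A. Smith", "birthdate": "1950-02-01", "policy": "P-100", "cause": "cardiac"},
    {"name": "A. Smith", "birthdate": "1950-02-01", "policy": "P-200", "cause": None},
]

# Assumed key attributes that identify a unique record.
KEY_FIELDS = ("name", "birthdate", "policy")

def merge_duplicates(rows, key_fields):
    """Keep one record per key, filling its missing values from duplicates."""
    merged = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        kept = merged.setdefault(key, dict(row))
        for field, value in row.items():
            if kept.get(field) is None and value is not None:
                kept[field] = value
    return list(merged.values())

deduplicated = merge_duplicates(records, KEY_FIELDS)
for row in deduplicated:
    print(row)
```

Duplicates that disagree on a non-missing value are not reconciled here; the first value seen wins, so such conflicts would still need review.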
11
Inaccurate Data
Generally the most difficult to identify
Unreasonable values
  Negative ages or death benefit
  Ages over 90 or 100
    May not be wrong but should be flagged for review
  A birthdate with a month of 13
Often entry error
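The unreasonable-value checks above can be expressed as simple range rules. This sketch separates values that must be wrong ("error") from values that are only suspicious ("review"); the field names are hypothetical and the review threshold of 90 is one of the two cutoffs mentioned above.

```python
# Assumed cutoff above which an age is flagged for review rather than rejected.
REVIEW_AGE = 90

def validate(record):
    """Return a list of (field, severity, message) issues for one record."""
    issues = []
    if record["age"] < 0:
        issues.append(("age", "error", "negative age"))
    elif record["age"] > REVIEW_AGE:
        issues.append(("age", "review", f"age over {REVIEW_AGE}"))
    if record["death_benefit"] < 0:
        issues.append(("death_benefit", "error", "negative death benefit"))
    if not 1 <= record["birth_month"] <= 12:
        issues.append(("birth_month", "error", "month outside 1-12"))
    return issues

sample = {"age": 95, "death_benefit": 50000, "birth_month": 13}
for issue in validate(sample):
    print(issue)
```

Range rules only catch values that are impossible or unusual; a plausible but wrong value, such as age 45 entered for someone who is 54, passes these checks, which is why inaccurate data is the hardest type to identify.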
12
Inaccurate Data – Resolution
Attempt to correct data
  Many of same techniques as Missing Data
  Utilize any redundant data
  Utilize sources used to identify error
13
Outliers
Outliers are not errors
  Still cause problems as they may skew analysis
Example
  Company has retention limit of 1,000,000
  Has death claim of 10,000,000
  Really only cost company 1,000,000
  May be more appropriate to include as 1,000,000
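The retention-limit example amounts to capping each claim at the amount the company actually retains. A minimal sketch, using the 1,000,000 limit and the 10,000,000 claim from the example; the other two claim amounts are hypothetical.

```python
RETENTION_LIMIT = 1_000_000

def retained_claim(gross_claim, retention_limit=RETENTION_LIMIT):
    """Amount the company bears: the claim, capped at the retention limit."""
    return min(gross_claim, retention_limit)

# The 10,000,000 claim is from the example; the others are made up.
claims = [250_000, 10_000_000, 750_000]
retained = [retained_claim(c) for c in claims]

print(retained)
print(sum(claims) / len(claims), sum(retained) / len(retained))
```

Comparing the two averages shows how much a single uncapped claim can skew the analysis.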