STAT 490DS1 Data Quality
Data Quality
- First Step – Data Validation
  - Identify the errors in the data
- Second Step – Data Cleansing
  - Steps taken to deal with errors
Data Validation
- Types of Errors
  - Missing Data
  - Inconsistent Values
  - Duplicate Data
  - Inaccurate Data
- Other Issues
  - Outliers
Missing Values
- Common error
- Generally easily identified
- Data was not collected
  - Gender
  - Smoking Class
- Refused to provide
  - Weight
  - Height
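Because missing values are generally easy to identify, the check can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the record layout, the field names, and the convention that `None` or a blank string marks a missing value are all assumptions made for the example.

```python
# Hypothetical records; None or a blank string marks a value that was not captured.
records = [
    {"id": 1, "gender": "F", "smoking_class": "NS", "weight": 150, "height": 65},
    {"id": 2, "gender": None, "smoking_class": "SM", "weight": None, "height": 70},
    {"id": 3, "gender": "M", "smoking_class": "", "weight": 180, "height": None},
]


def missing_fields(record):
    """Return the names of the fields whose value is None or a blank string."""
    return [
        name
        for name, value in record.items()
        if value is None or (isinstance(value, str) and value.strip() == "")
    ]


# Map each record id to its missing fields, keeping only records that have any.
missing_by_id = {}
for rec in records:
    gaps = missing_fields(rec)
    if gaps:
        missing_by_id[rec["id"]] = gaps

print(missing_by_id)
```

A report like `missing_by_id` also shows how many records are affected, which matters when deciding whether dropping them is acceptable.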
Missing Data – Resolution
- Eliminate the Data Record
  - May be acceptable for a few records
  - Lose data from eliminated records
  - Not acceptable if many records are missing data
  - Probably not reasonable in mortality or lapse study
Missing Data – Resolution
- Estimate the Missing Value
  - Averages
  - Use other data in the record to approximate
  - Use external data
  - Use value from previous record
- Ignore the Missing Value
  - Do not include the missing attribute in analysis
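Three of the resolutions above (eliminating the record, estimating with an average, and using the value from the previous record) can be compared side by side. The sketch below assumes a hypothetical list of weights in which `None` marks a missing value; the function names are illustrative only.

```python
from statistics import mean

# Hypothetical weights; None marks a missing value.
weights = [150, None, 180, None, 165]


def drop_missing(values):
    """Eliminate records with a missing value (loses those records entirely)."""
    return [v for v in values if v is not None]


def fill_with_mean(values):
    """Estimate each missing value with the average of the observed values."""
    average = mean(v for v in values if v is not None)
    return [average if v is None else v for v in values]


def fill_from_previous(values):
    """Estimate each missing value with the value from the previous record.

    A missing value at the very start has no previous record and stays None.
    """
    filled = []
    last_seen = None
    for v in values:
        if v is None:
            v = last_seen
        filled.append(v)
        last_seen = v
    return filled


print(drop_missing(weights))
print(fill_with_mean(weights))
print(fill_from_previous(weights))
```

Dropping keeps only three of the five records here, which illustrates why elimination is not acceptable when many records are missing data.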
Inconsistent Data
- Common error
- Example – Birthdate and age are inconsistent
  - Jeff was born 11/29/1955 and was recorded as age 62 on September 5, 2019
  - The birthdate implies age 63 on that date
- Generally result of entry error
- Identify from redundant data or outside data
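The birthdate and age example can be checked mechanically by recomputing the age from the redundant birthdate. The sketch below assumes age means completed years (age last birthday); a study using age nearest birthday would need a different rule. The function names are hypothetical.

```python
from datetime import date


def age_on(birthdate, as_of):
    """Age in completed years (age last birthday) on the as_of date."""
    had_birthday = (as_of.month, as_of.day) >= (birthdate.month, birthdate.day)
    return as_of.year - birthdate.year - (0 if had_birthday else 1)


def age_is_consistent(birthdate, recorded_age, as_of):
    """True when the recorded age matches the age implied by the birthdate."""
    return age_on(birthdate, as_of) == recorded_age


birthdate = date(1955, 11, 29)
as_of = date(2019, 9, 5)

print(age_on(birthdate, as_of))                 # 63
print(age_is_consistent(birthdate, 62, as_of))  # False
```

The check only shows that the two fields disagree. Deciding which one is wrong still requires redundant or outside data.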
Inconsistent Data – Resolution
- Use redundant data to correct
- Utilize outside data set
  - Example – Birthdate from Social Security
Duplicate Record
- Common error
- May have slightly different data
- Search on key attributes to identify duplicates
- Careful not to eliminate data that should be included
  - A person appears in death data multiple times
  - Could have multiple policies
Duplicate Record – Resolution
- Eliminate duplicate records while maximizing data retained
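One way to eliminate duplicates while keeping as much data as possible is to merge records that share the same key attributes, filling gaps in the first copy from later copies. The sketch below assumes hypothetical death records keyed on `insured_id` plus `policy_no`, so a person with several policies is not collapsed into one record. The merge rule (keep the first non-missing value per field) is an illustrative choice, not the only reasonable one.

```python
def dedupe(records, key_fields):
    """Merge records sharing the same key, keeping the first non-missing value per field."""
    merged = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in merged:
            # First time this key is seen: keep a copy of the record.
            merged[key] = dict(rec)
        else:
            # Duplicate: use it only to fill fields still missing in the kept copy.
            kept = merged[key]
            for field, value in rec.items():
                if kept.get(field) is None:
                    kept[field] = value
    return list(merged.values())


# Hypothetical death data: the first two rows duplicate each other,
# the third is a genuine second policy on the same person.
deaths = [
    {"insured_id": "A1", "policy_no": "P100", "death_date": "2019-03-01", "cause": None},
    {"insured_id": "A1", "policy_no": "P100", "death_date": "2019-03-01", "cause": "accident"},
    {"insured_id": "A1", "policy_no": "P200", "death_date": "2019-03-01", "cause": "accident"},
]

unique_deaths = dedupe(deaths, key_fields=("insured_id", "policy_no"))
print(len(unique_deaths))
```

Keying on `insured_id` alone would have removed the second policy, which is exactly the kind of over-elimination the slide warns about.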
Inaccurate Data
- Generally the most difficult to identify
- Unreasonable values
  - Negative ages or death benefit
  - Ages over 90 or 100
    - May not be wrong but should be flagged for review
  - A birthdate with a month of 13
    - Often entry error
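The unreasonable-value checks above can be written as simple range rules that separate definite errors from values that only need review. This is a sketch under assumptions: the field names are hypothetical, and the review threshold of 100 is one of the two cutoffs mentioned on the slide, exposed as a parameter.

```python
def check_record(rec, review_age=100):
    """Return (errors, review_flags) for one record.

    errors: values that cannot be right (negative age, month 13, ...).
    review_flags: values that may be right but should be looked at.
    """
    errors = []
    review_flags = []

    if rec["age"] < 0:
        errors.append("negative age")
    elif rec["age"] > review_age:
        review_flags.append("age over %d" % review_age)

    if rec["death_benefit"] < 0:
        errors.append("negative death benefit")

    if not 1 <= rec["birth_month"] <= 12:
        errors.append("invalid birth month")

    return errors, review_flags


# Hypothetical records for illustration.
print(check_record({"age": 45, "death_benefit": 250_000, "birth_month": 6}))
print(check_record({"age": 104, "death_benefit": 50_000, "birth_month": 13}))
print(check_record({"age": -3, "death_benefit": -100, "birth_month": 2}))
```

Keeping the two lists separate means a plausible 104-year-old is reviewed rather than silently deleted or corrected.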
Inaccurate Data – Resolution
- Attempt to correct data
- Many of same techniques as Missing Data
- Utilize any redundant data
- Utilize sources used to identify error
Outliers
- Outliers are not errors
- Still cause problems, as they may skew analysis
- Example
  - Company has retention limit of 1,000,000
  - Has death claim of 10,000,000
  - Really only cost company 1,000,000
  - May be more appropriate to include as 1,000,000
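The retention example amounts to capping each claim at the retention limit before analysis. The sketch below uses the figures from the slide. The function name is hypothetical, and a single flat retention limit is an assumption; real reinsurance arrangements can be more involved.

```python
RETENTION_LIMIT = 1_000_000  # retention limit from the example


def retained_claim(gross_claim, retention_limit=RETENTION_LIMIT):
    """Portion of a claim the company actually bears, capped at the retention limit."""
    return min(gross_claim, retention_limit)


claims = [250_000, 10_000_000, 400_000]
retained = [retained_claim(c) for c in claims]

print(retained)       # [250000, 1000000, 400000]
print(sum(claims))    # 10650000 gross
print(sum(retained))  # 1650000 retained
```

The 10,000,000 claim accounts for most of the gross total but only 1,000,000 of the retained total, which is why including it at the capped amount may be more appropriate.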