DATA VALIDATION AND ANALYSIS OF PERFORMANCE OF IMPUTATION
United Nations Statistics Division
Generic Statistical Business Process - Censuses
[Diagram] Phases: Planning, Questionnaire, Mapping, Testing, Enumeration, Data processing, Analyzing, Dissemination, Evaluation
[Diagram labels] Pre-enumeration operations; Preliminary evaluation of data quality
Data validation during data processing
The steps of data processing depend on the technology used. In general, the process covers the following steps:
- Preparation
- Scanning/Data capture
- Coding
- Editing/Imputation
- Validation
- Processing control
- Master file
Review and validate data against predefined rules:
- Identify potential problems such as missing data, inconsistency and inappropriate editing/imputation BEFORE PRODUCING CENSUS OUTPUTS
Data analysis
Steps for data analysis:
- Prepare outputs
- Validation of outputs
- Interpret and explain outputs
- Apply disclosure control
- Finalize outputs

- Checking data quality with appropriate methods
- Comparing the statistics with previous censuses and other relevant data sources (both internal and external)
- Investigating inconsistencies in the statistics
Data validation
1. Checking population distribution by geographic areas
2. Checking the quality of editing/imputation
3. Checking internal consistency and missing data
Data validation - 1
Checking population distribution by geographic areas
Ensuring the enumerated population is fully processed:
- Enumerated persons/households may not be fully captured (undercoverage) or may be captured twice (overcoverage)
- Check captured records (people/housing units) against census documents such as:
  - Control forms, prepared by enumerators/supervisors
  - Reports, prepared by Local/Regional Census Committees
  - Number of questionnaires received from the field, prepared by headquarters
  - Number of scanned questionnaires, if applicable
Data validation - 2
Checking the quality of editing/imputation
- Editing rules may be insufficient to identify all types of errors
- Imputation may introduce new errors in the data because of incorrect application
- Some unexpected patterns may not be identified by editing/consistency rules
Basic definitions
- Editing: applying a list of rules to detect invalid and inconsistent data
- Imputation: the process of resolving problems concerning invalid or inconsistent data, and missing values, identified during editing
All records must respect a set of editing rules formulated to correct errors, so that reliable data can be disseminated
Some examples of invalid data
Age
- Age equal to 99, when the instruction is: if age is greater than or equal to 98, write 98
- Age written in one digit (slide illustration: "1", "5")
How to correct?
Some examples of inconsistent data
- Children ever born alive, living children and dead children: the number of children ever born is not equal to the sum of the number of living children and the number of dead children
- Last live birth and household deaths: there is an infant birth who is not alive, but no infant death is registered in the household deaths
- Age of father/mother and children: the age of the father/mother is lower than, or only a few years higher than, the age of a child
What will the decision be?
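The first two consistency rules above can be written as small checks. This is a minimal sketch, not a prescribed implementation: the record layout, the field names and the rule labels are hypothetical, and the decision on how to resolve a failure is left open, as on the slide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FertilityRecord:
    """Illustrative record layout; the field names are hypothetical."""
    ceb: Optional[int]  # children ever born alive
    cs: Optional[int]   # children still living
    cd: Optional[int]   # children who have died


def check_children(rec: FertilityRecord) -> List[str]:
    """Return the names of the edit rules that this record fails."""
    if rec.ceb is None or rec.cs is None or rec.cd is None:
        return ["missing_value"]
    if rec.ceb != rec.cs + rec.cd:
        return ["ceb_not_equal_cs_plus_cd"]
    return []


def check_infant_death(last_birth_alive: bool, infant_deaths_listed: int) -> List[str]:
    """Flag a household whose last-born infant died but which lists no infant death."""
    if not last_birth_alive and infant_deaths_listed == 0:
        return ["infant_death_not_listed"]
    return []


print(check_children(FertilityRecord(ceb=4, cs=3, cd=1)))  # consistent record
print(check_children(FertilityRecord(ceb=4, cs=3, cd=2)))  # fails the CEB rule
```

Returning the names of the failed rules, rather than a single pass/fail flag, keeps a trace of which rule triggered any later imputation.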
Dealing with missing data
What are the decisions for dealing with missing data?
- Will missing data (item non-response) be imputed?
- For which variables will missing data be imputed?
- What methods will be used for imputation?
Assessing the performance of imputation
Objectives:
- To compare the distribution of the observed values with the distribution of the imputed values
- To compare the distribution of the observed values with the complete distribution including the imputed values
- To analyze the effect of imputation on the original data set
- To ensure that the distribution of imputed values is reasonable or meets the expected pattern
Assessing the performance of imputation
Method for assessing the performance: after editing/imputation has been implemented, data should be classified as follows:
- Observed (consistent) data: values that meet all editing rules
- Non-response or unknown: no value
- Inconsistent data: values that failed at least one editing rule
- Imputed data: values imputed for inconsistency and for non-response
For this analysis, all procedures performed in the database should be identifiable
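One way to keep every procedure identifiable is to store the original status and an imputation flag next to each final value. The sketch below assumes a single integer variable; the `Status` names, the rule list and the `impute` callback are hypothetical stand-ins for whatever the processing system actually uses.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Status(Enum):
    OBSERVED = "observed"          # meets all editing rules
    NON_RESPONSE = "non-response"  # no value
    INCONSISTENT = "inconsistent"  # failed at least one editing rule


def classify(value: Optional[int], rules: List[Callable[[int], bool]]) -> Status:
    """Classify a raw value before any imputation is applied."""
    if value is None:
        return Status.NON_RESPONSE
    if all(rule(value) for rule in rules):
        return Status.OBSERVED
    return Status.INCONSISTENT


def edit_and_flag(
    value: Optional[int],
    rules: List[Callable[[int], bool]],
    impute: Callable[[], int],
) -> Tuple[int, Status, bool]:
    """Return (final value, original status, imputed flag) so changes stay traceable."""
    status = classify(value, rules)
    if status is Status.OBSERVED:
        return value, status, False
    return impute(), status, True


# Hypothetical rule: age must lie between 0 and 98.
age_rules = [lambda age: 0 <= age <= 98]
# Hypothetical imputation: always return a fixed donor value of 35.
donor = lambda: 35

for raw in (40, None, 99):
    print(raw, edit_and_flag(raw, age_rules, donor))
```

With the status and flag stored, the observed, imputed and total distributions can later be tabulated separately.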
Assessing the performance of imputation
Compare the distribution of the observed values with the distribution of the imputed values:
- If non-response and inconsistent data are distributed randomly, no difference is expected between the distribution of the observed values and that of the imputed values
- If there are differences between the people who responded and those who did not respond or did not give accurate data, the imputed data should not follow the same distribution as the observed data
Assessing the performance of imputation
Compare the distribution of the observed values with the distribution of all values including the imputed values:
- In general, imputed values should have a minimal effect on the distribution of the complete data, unless the non-response rate, or the bias for certain characteristics, is particularly high
Understanding data editing and potential errors
Data on deaths in the household: cases where the age of the deceased was hot-decked show a different age pattern of mortality than cases that were not subject to imputation.
This indicates that the rules followed by the hot-deck are introducing a bias and are not reliable.
Source: Estimation of mortality using the 2001 South Africa census data, Rob Dorrington, Tom Moultrie and Ian Timaeus, Centre for Actuarial Research, University of Cape Town
Understanding data editing and potential errors
[Figure not reproduced; annotations: Boundary of school age, Boundary of working age]
Source: England and Wales, Office for National Statistics, 2011 Census: Item Edit and Imputation: Evaluation Report, June 2012
Table 2: Distribution of bedrooms

Number of    Observed          Imputed           Difference     Total incl. imputed    Change
bedrooms     N         %       N        %        (5)=(4)-(2)    N             %        (8)=(7)-(2)
             (1)       (2)     (3)      (4)                     (6)=(1)+(3)   (7)
0                62     0.3        5     0.8        0.5             67                   0.014
1             2,378    10.7      124    19.2        8.5          2,502        10.9       0.240
2             6,097    27.4      192    29.8        2.3          6,289        27.5       0.066
3             9,375    42.2      228    35.3       -6.8          9,603        42.0      -0.192
4             3,279    14.7       70               -3.9          3,349        14.6      -0.110
5               809     3.6       19     2.9       -0.7            828                  -0.020
6               166     0.7        5                0.0            171                   0.001
7                39     0.2        1                                40                  -0.001
8 or more        27     0.1        1                                28
Total        22,232   100        645                            22,877                   0.000

Difference = imputed % minus observed %; Change = total % minus observed %.
Blank cells are missing in the source. The imputed counts for 6, 7 and 8 or more bedrooms follow from (3) = (6) - (1).
Max change: 0.240 (1 bedroom)
Source: England and Wales, Office for National Statistics, 2011 Census: Item Edit and Imputation: Evaluation Report, June 2012
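The columns of Table 2 can be recomputed from the observed and total counts alone. The sketch below is only an illustration of the column definitions given in the table header. It assumes that the two unlabeled rows are 0 and 5 bedrooms (inferred from the sequence) and takes the imputed counts as total minus observed.

```python
from typing import Dict

# Counts from Table 2; the row labels "0" and "5" are inferred, not printed in the source.
observed: Dict[str, int] = {"0": 62, "1": 2378, "2": 6097, "3": 9375, "4": 3279,
                            "5": 809, "6": 166, "7": 39, "8+": 27}
total: Dict[str, int] = {"0": 67, "1": 2502, "2": 6289, "3": 9603, "4": 3349,
                         "5": 828, "6": 171, "7": 40, "8+": 28}


def percent(counts: Dict[str, int]) -> Dict[str, float]:
    """Percentage distribution of a set of category counts."""
    n = sum(counts.values())
    return {k: 100.0 * v / n for k, v in counts.items()}


imputed = {k: total[k] - observed[k] for k in observed}       # column (3) = (6) - (1)
obs_pct, imp_pct, tot_pct = percent(observed), percent(imputed), percent(total)
difference = {k: imp_pct[k] - obs_pct[k] for k in observed}   # column (5) = (4) - (2)
change = {k: tot_pct[k] - obs_pct[k] for k in observed}       # column (8) = (7) - (2)

# Category with the largest absolute change in percentage points.
max_cat = max(change, key=lambda k: abs(change[k]))

for k in observed:
    print(f"{k:>3}  diff={difference[k]:6.1f}  change={change[k]:7.3f}")
print("max change:", max_cat, round(change[max_cat], 3))
```

The large differences in column (5) next to small changes in column (8) show why both comparisons are made: imputed households look different from observed ones, yet they are too few to move the overall distribution much.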
Assessing the performance of imputation
[Figure not reproduced; callout: Maximum change]
Assessing the performance of imputation
[Figure not reproduced]
Source: England and Wales, Office for National Statistics, 2011 Census: Item Edit and Imputation: Evaluation Report, June 2012
Assessing the performance of imputation
Summary indexes at the variable level:
- Maximum absolute percent change: the maximum absolute percent change across all categories for each variable
- Dissimilarity index: the degree of change between two distributions (observed, and total including imputed values) at the variable level
- Imputation rate: the share of imputed records in the total records
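Two of these indexes, the maximum absolute percent change and the imputation rate, can be sketched in a few lines. The example assumes that "percent change" means the change in percentage points between the observed and the total distributions. The three-category variable and its counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from typing import Dict


def percent(counts: Dict[str, int]) -> Dict[str, float]:
    """Percentage distribution of a set of category counts."""
    n = sum(counts.values())
    return {k: 100.0 * v / n for k, v in counts.items()}


def max_abs_percent_change(observed: Dict[str, int], imputed: Dict[str, int]) -> float:
    """Largest absolute change, in percentage points, between observed and total."""
    total = {k: observed[k] + imputed.get(k, 0) for k in observed}
    obs_pct, tot_pct = percent(observed), percent(total)
    return max(abs(tot_pct[k] - obs_pct[k]) for k in observed)


def imputation_rate(observed: Dict[str, int], imputed: Dict[str, int]) -> float:
    """Share of imputed records in the total records, as a percentage."""
    n_imputed = sum(imputed.values())
    return 100.0 * n_imputed / (sum(observed.values()) + n_imputed)


# Hypothetical counts for an illustrative three-category variable.
observed = {"owner": 600, "renter": 300, "other": 100}
imputed = {"owner": 20, "renter": 25, "other": 5}

print(round(max_abs_percent_change(observed, imputed), 3))
print(round(imputation_rate(observed, imputed), 3))
```

Reading the two numbers together helps interpretation: a high imputation rate with a small maximum change suggests that imputed values resemble observed ones, while a large change at a modest rate points to imputed records that differ markedly.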
Assessing the performance of imputation
[Figure not reproduced: maximum absolute percent change between the observed and final (imputed) distributions across all categories within each of the questions]
Source: England and Wales, Office for National Statistics, 2011 Census: Item Edit and Imputation: Evaluation Report, June 2012
Index of dissimilarity
Used to assess the degree of change induced by imputation on the initial distribution of variables:
$$ ID = \frac{1}{2} \sum_{k} \left| f_k - f^*_k \right| $$
where
k : categories of the variable
f_k : percentage distribution of the variable before imputation
f^*_k : percentage distribution of the variable after imputation
Index of dissimilarity
0 ≤ ID ≤ 100
- It takes the value 0 when the two distributions, before and after imputation, are equal
- It is greater than 0 when they differ
- It reaches its maximum value of 100 when the dissimilarity between the two distributions is greatest, that is, when each is concentrated in a single category and the two categories are different
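The bounds above can be checked numerically. The sketch assumes the usual form of the index, half the sum of the absolute differences between the two percentage distributions, which is the form consistent with the stated range of 0 to 100. The example distributions are hypothetical.

```python
from typing import Dict


def dissimilarity_index(before: Dict[str, float], after: Dict[str, float]) -> float:
    """Half the sum of absolute differences between two percentage distributions.

    Assumes both inputs are percentages that each sum to 100; a category absent
    from one distribution is treated as 0 there.
    """
    categories = set(before) | set(after)
    return 0.5 * sum(abs(before.get(k, 0.0) - after.get(k, 0.0)) for k in categories)


same = {"a": 60.0, "b": 30.0, "c": 10.0}
shifted = {"a": 59.0, "b": 31.0, "c": 10.0}

print(dissimilarity_index(same, same))                      # identical distributions
print(dissimilarity_index(same, shifted))                   # one point moved from a to b
print(dissimilarity_index({"a": 100.0}, {"b": 100.0}))      # concentrated in different categories
```

The factor of one half is what keeps the maximum at 100: when the two distributions share no category, the absolute differences sum to 200.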
Index of dissimilarity
[Figure not reproduced; labeled value: ID = 1.9]
Source: England and Wales, Office for National Statistics, 2011 Census: Item Edit and Imputation: Evaluation Report, June 2012
Assessing the performance of imputation
[Figure not reproduced]
Source: Albania, Quality Dimensions of 2011 Population and Housing Census, May 2014
Data validation - 3
Checking internal consistency
Objectives:
- Ensuring all records meet the editing rules
- Ensuring there are no unusual/unexpected values
How to validate
Prepare tables for preliminary analysis of census results:
- The list of tables should be prepared based on the editing rules and the relations between variables
- Tables should present all possible conditions in the data, without eliminating any category, so that the results can be verified. For example: marital status by all age groups, completed level of education by all age groups
- Tables should present missing data
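A validation table of this kind can be sketched with a plain counter. The point of the example is that no age group is filtered out and missing values get their own category, so an implausible cell (such as a married child) stays visible. The records, the five-year grouping and the "Missing" label are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def age_group(age: Optional[int]) -> str:
    """Five-year age group label, with a separate category for missing age."""
    if age is None:
        return "Missing"
    lower = (age // 5) * 5
    return f"{lower}-{lower + 4}"


def cross_tab(records: Iterable[Tuple[Optional[int], Optional[str]]]) -> Counter:
    """Count records by (age group, marital status), keeping every category."""
    table: Counter = Counter()
    for age, status in records:
        table[(age_group(age), status if status is not None else "Missing")] += 1
    return table


# Hypothetical records: (age, marital status).
records = [(8, "Married"), (8, "Never married"), (34, None), (None, "Widowed"), (36, "Married")]

table = cross_tab(records)
for (group, status), count in sorted(table.items()):
    print(f"{group:>8}  {status:<14} {count}")
```

If the table were restricted to ages 15 and over, as a publication table might be, the married 8-year-old would never be seen by the reviewer.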
Some examples of tables
Tables for analyzing the age difference between members of households:
- Age interval between father/mother and children: at least 12-14 years, and at most 65 years for males and 50 years for females
- Age interval between grandparents and grandchildren: at least 30 years
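The age intervals above translate into a simple tabulation of parent-child pairs. In this sketch the lower bound defaults to 12, the low end of the 12-14 range on the slide, and is exposed as a parameter. The function names, the sex codes and the sample pairs are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def parent_child_gap(parent_age: int, child_age: int, parent_sex: str, min_gap: int = 12) -> str:
    """Classify the age gap between a parent and a child.

    The upper bound is 65 years for fathers and 50 years for mothers, as on the
    slide; min_gap is a choice within the stated 12-14 range.
    """
    max_gap = 65 if parent_sex == "male" else 50
    gap = parent_age - child_age
    if gap < min_gap:
        return "too_small"
    if gap > max_gap:
        return "too_large"
    return "ok"


def grandparent_gap_ok(grandparent_age: int, grandchild_age: int, min_gap: int = 30) -> bool:
    """True when the grandparent is at least min_gap years older than the grandchild."""
    return grandparent_age - grandchild_age >= min_gap


def tabulate(pairs: Iterable[Tuple[int, int, str]]) -> Counter:
    """Count parent-child pairs by gap class, ready for a validation table."""
    return Counter(parent_child_gap(p, c, sex) for p, c, sex in pairs)


# Hypothetical (parent age, child age, parent sex) pairs.
pairs = [(40, 10, "female"), (20, 12, "male"), (70, 15, "female"), (70, 10, "male")]
print(tabulate(pairs))
```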
Some examples of tables
Distribution of household size:
- Check the accuracy of household size against the number of persons enumerated on one page, such as 5, 10, ...
- There might be errors in combining the census forms belonging to the same household
Some examples of tables
CEB, CS and CD:
- Relation between the number of children ever born (CEB), the number of living children (CS) and the number of dead children (CD): CEB = CS + CD
- Relation between age and the number of children ever born
CEB - quality assessment
[Figure not reproduced: Fertility, CEB quality assessment; callout: Parities wrong?]
CEB - quality assessment
Mongolia, 1989 Census (Source: IPUMS)

Parity     15-19     20-24     25-29     30-34     35-39     40-44     45-49
0        105,548    43,676     9,824     2,711       987       865       726
1          4,827    30,834    15,350     5,432     2,185     1,302     1,488
2            896    17,309    23,960    10,659     4,479     2,217     2,053
3            834     5,382    19,279    11,159     4,923     2,663     1,950
4            199     1,828    11,831    11,922     6,974     3,525     2,658
5             68       477     5,730    11,189     7,426     4,933     3,379
8             15        23       263     2,355     3,879     3,986     3,706

Rows with cells missing in the source (surviving values in their original order; column positions are not recoverable):
6: 53, 2,161, 7,568, 6,348, 4,442, 3,619
7: 25, 707, 3,737, 4,551, 3,638, 2,977
9: 61, 119, 746, 2,190, 2,747, 3,059
10: 419, 1,300, 2,433, 3,253
11: 147, 743, 1,183, 1,667
12: 22, 38, 262, 845, 1,299
13: 19, 161, 403, 898
14: 20, 82, 242, 392
15+: 72, 235, 629
Unknown: 218, 65, 58, 35

Parities wrong?
- Implausible parities: following the IUSSP manual, here we will recode them as unknown (which will then be redistributed based on the El Badry method, if appropriate)
- If imputation or other forms of editing the data were used, the analyst should be aware of this
Quality assessment
[Figure not reproduced: Age at death of children (in months) declared by the mother, Nepal 1975]
Some examples of tables
Education
- Educational attainment: highest level completed
- Consistency with school attendance
- Relation with age: minimum age for completing school
Usually the minimum age is calculated from the minimum age for entering school plus the number of years required to complete the level.
Example: the minimum age for entering primary education is 6. If primary education requires 8 years, a pupil who enters at exactly age 6 finishes the eighth school year while still aged 13 in completed years, so the minimum age for completing primary school would be 13.
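The minimum-age rule can be turned into a consistency check between age and educational attainment. The slide's example (entry at age 6, 8 years of primary school, minimum completion age 13) corresponds to entry age plus duration minus one, and the sketch below assumes that reading. The offset, the level names and the lookup table are assumptions to be adapted to the national education system.

```python
from typing import Dict, Optional


def min_completion_age(entry_age: int, years_required: int) -> int:
    """Youngest age, in completed years, at which a level can have been completed.

    Assumption: the level is completed during the final school year, so the
    result is entry_age + years_required - 1 (6 + 8 - 1 = 13 in the slide's example).
    """
    return entry_age + years_required - 1


# Hypothetical lookup table built from the slide's primary-school example.
MIN_AGE_BY_LEVEL: Dict[str, int] = {"primary": min_completion_age(6, 8)}


def attainment_plausible(age: int, level: Optional[str]) -> bool:
    """True unless the person is younger than the minimum age for the stated level."""
    if level is None or level not in MIN_AGE_BY_LEVEL:
        return True  # no rule defined for this level, nothing to check
    return age >= MIN_AGE_BY_LEVEL[level]


print(min_completion_age(6, 8))
print(attainment_plausible(10, "primary"))
print(attainment_plausible(13, "primary"))
```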
School attendance - quality assessment
[Figures not reproduced; callout: Expected pattern?]