Data Imputation United Nations Statistics Division (UNSD) 16 March 2011 Santiago, Chile
2 Imputation
Imputation resolves the problems of missing, invalid or incomplete responses identified during editing.
3 Imputation Options
Interactive/Manual
Subjective imputation
Donor-based imputation
Regression (model) based imputation
Imputations can be done manually or automatically.
4 Interactive/Manual Treatment
Manual review of the record
Obvious and easily corrected records can be treated interactively at the data capture stage
– Ex: in a table-formatted input, responses may be accidentally shifted by a row
Often a subject matter expert reviews the hard copy/original questionnaire
– Errors can be found in the questionnaire that are otherwise undiscoverable
– Manual imputation procedures, e.g. with historic data
Re-contact the respondent to correct data
5 Imputation Cells
Usually, data are split into imputation cells, similar to strata
– Example criteria include industry type, geography, employment size, etc.
Imputation cells are intended to be relatively homogeneous
This ensures that imputations are made among similar respondents
6 Subjective Imputation
Generally rule or logic based
Can be used when there is only one (reasonably) possible response to the question
– Ex: balance edit – a single missing variable in a balance edit
– Ex: rule based – if a respondent reports zero months worked, then income can be imputed as zero
Can be used when missing/erroneous values can be determined unambiguously from edits
– Ex: rule based – if the ratio of the current value to the anticipated value (e.g. historic value) is greater than 300, assume a thousands error. Value = 135,000; previous value = 130
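The two rule-based examples on this slide can be sketched in a few lines of Python. This is only an illustration: the field names (`months_worked`, `income`, `value`) and the function name are hypothetical, and the ratio is taken as current value over previous value, which is the direction that matches the numbers in the example (135,000 / 130 is roughly 1,038).

```python
def impute_by_rules(record, previous_value=None, ratio_limit=300):
    """Apply two deterministic rules to a record (a plain dict).

    Field names are hypothetical; a real survey would use its own.
    """
    fixed = dict(record)  # work on a copy, keep the original untouched

    # Rule 1: zero months worked leaves only one plausible income, zero.
    if fixed.get("months_worked") == 0 and fixed.get("income") is None:
        fixed["income"] = 0

    # Rule 2: a value far above the anticipated (previous) value is
    # assumed to have been reported in units instead of thousands.
    value = fixed.get("value")
    if value is not None and previous_value:
        if value / previous_value > ratio_limit:
            fixed["value"] = value / 1000

    return fixed


print(impute_by_rules({"months_worked": 0, "income": None}))
print(impute_by_rules({"value": 135000}, previous_value=130))
```

In the second call the reported value is divided by 1,000, giving 135, which is close to the previous value of 130.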
7 Donor Imputation
Donor based – replacement with values from non-erroneous donors
– Hot deck – replace with values from the current survey
– Cold deck – replace with values from another source (e.g. previous surveys)
8 Donor Imputation – Substitution
Historic value
– Simple historic value is a cold deck imputation
Historic value with trend
– Trend can be based on growth in another variable within the record, variables in other records, etc.
This is a very common imputation technique
Suggestions
Useful when variables or growth rates are stable over time
Less useful when changes in variables are of primary interest
– Ex: monthly employment in monthly employment surveys
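A minimal sketch of historic value with trend, assuming the trend is the growth of one auxiliary variable in the same record (for example turnover). The function and parameter names are hypothetical.

```python
def historic_with_trend(previous_value, aux_current, aux_previous):
    """Carry a previous-period value forward, scaled by auxiliary growth.

    If the auxiliary variable gives no usable trend (previous value of
    zero), fall back to the simple historic value.
    """
    if aux_previous == 0:
        return previous_value
    return previous_value * (aux_current / aux_previous)


# Previous wages were 200; the auxiliary variable grew from 100 to 150.
print(historic_with_trend(200, 150, 100))  # 300.0
```

Setting `aux_current` equal to `aux_previous` reduces this to the simple historic value, which is the cold deck case.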
9 Donor Imputation – Mean/Modal
Missing value is replaced by the mean or mode of respondents' values for a variable (within a subset or imputation cell of similar respondents)
– E.g. if wages are missing for one respondent, the average wage within the imputation cell can be used
Suggestions
Useful when variance is small within an imputation cell
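Mean imputation within imputation cells could look like the sketch below. Records are plain dicts, missing values are represented by `None`, and the names `cell` and `wages` are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean


def impute_cell_mean(records, cell_key, variable):
    """Replace missing values with the mean of respondents in the same cell."""
    # Collect the reported values for each imputation cell.
    values_by_cell = defaultdict(list)
    for r in records:
        if r[variable] is not None:
            values_by_cell[r[cell_key]].append(r[variable])

    imputed = []
    for r in records:
        r = dict(r)  # copy so the input is not modified
        donors = values_by_cell[r[cell_key]]
        # A cell without any respondent values is left unimputed.
        if r[variable] is None and donors:
            r[variable] = mean(donors)
        imputed.append(r)
    return imputed


data = [
    {"cell": "retail", "wages": 100},
    {"cell": "retail", "wages": 200},
    {"cell": "retail", "wages": None},
    {"cell": "mining", "wages": None},
]
for row in impute_cell_mean(data, "cell", "wages"):
    print(row)
```

The missing retail value becomes 150, while the mining record stays missing because its cell has no respondents to average over.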
10 Donor Imputation – Nearest Neighbor
For each missing value, find a donor value from the record that is closest to the record with the missing value, based on the distance between a set of variables
– E.g. Employees (E), Additions (A), Dismissals (D)
– Record to be imputed (t): E = 100, A = ?, D = ?
– Donor records (s):
  1. E = 80, A = 10, D = 5: Distance = 20
  2. E = 90, A = 12, D = 4: Distance = 10
– Imputed record: E = 100, A = 12, D = 4
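The example above can be reproduced with a short sketch. The distance used here is the sum of absolute differences over the matching variables, which is an assumption; it agrees with the distances of 20 and 10 shown for E, but other distance functions are possible.

```python
def nearest_neighbor_impute(recipient, donors, match_vars, impute_vars):
    """Copy impute_vars from the donor closest to the recipient."""

    def distance(donor):
        # Assumed distance: sum of absolute differences on match_vars.
        return sum(abs(recipient[v] - donor[v]) for v in match_vars)

    best = min(donors, key=distance)
    result = dict(recipient)
    for v in impute_vars:
        result[v] = best[v]
    return result


recipient = {"E": 100, "A": None, "D": None}
donors = [
    {"E": 80, "A": 10, "D": 5},  # distance 20
    {"E": 90, "A": 12, "D": 4},  # distance 10
]
print(nearest_neighbor_impute(recipient, donors, ["E"], ["A", "D"]))
```

The second donor is closer, so the imputed record takes A = 12 and D = 4.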
11 Donor Imputation – Nearest Neighbor (2)
12 Donor Imputation – Ratio
Missing values are replaced using ratios taken from donor record values
– E.g.: T = P + C
– Record to be imputed: T = 400, P = ?, C = ?
– Donor record: T = 100, P = 25, C = 75
– Imputed record: T = 400, P = 100, C = 300
The donor can be:
– Chosen using a distance function
– The mean value within the imputation cell
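As a sketch of the ratio example, the donor's shares of the total are applied to the recipient's reported total. The function name and default variable names are hypothetical.

```python
def ratio_impute(total, donor, total_var="T", part_vars=("P", "C")):
    """Split a reported total using the donor's component shares."""
    donor_total = donor[total_var]
    imputed = {total_var: total}
    for v in part_vars:
        # Each component keeps the same share of the total as in the donor.
        imputed[v] = total * donor[v] / donor_total
    return imputed


print(ratio_impute(400, {"T": 100, "P": 25, "C": 75}))
# {'T': 400, 'P': 100.0, 'C': 300.0}
```

Because the donor satisfies T = P + C, the imputed components also add up to the reported total, so the balance edit still holds after imputation.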
13 Donor Imputation – Ratio (2)
14 Regression (model-based) Imputation
Regression/model
– An imputation model predicts a missing or erroneous value using a function of some auxiliary variables
– Auxiliary variables can come from the current survey or from other sources, e.g. the sampling frame (size class, branch of economic activity) or historical information (previous period value)
– Regression coefficients can be determined from historic survey data
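A minimal sketch of regression imputation with a single auxiliary variable, fitted by ordinary least squares on the complete records. The model form (a straight line) and the names `employees` and `wages` are assumptions for illustration; a real survey may use several auxiliary variables or coefficients estimated from historic data.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def regression_impute(records, target, aux):
    """Predict missing target values from the auxiliary variable."""
    complete = [r for r in records if r[target] is not None]
    intercept, slope = fit_line(
        [r[aux] for r in complete], [r[target] for r in complete]
    )
    result = []
    for r in records:
        r = dict(r)
        if r[target] is None:
            r[target] = intercept + slope * r[aux]
        result.append(r)
    return result


data = [
    {"employees": 1, "wages": 3},
    {"employees": 2, "wages": 5},
    {"employees": 3, "wages": 7},
    {"employees": 4, "wages": None},
]
print(regression_impute(data, "wages", "employees")[-1])
```

In this toy data wages follow 2 × employees + 1 exactly, so the missing value is predicted as 9.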
15 Model-based Imputation (2)
16 Imputation Process: Fellegi-Holt
An isolated imputation may not satisfy all editing rules
Key principle: the data of a record should be made to satisfy all edits by changing the fewest possible fields
Solves edit rules simultaneously through linear programming
Advantages
– Preserves as much original data as possible
– Leads to consistent data satisfying all edits
Disadvantages
– All edits specified for a certain record are considered fatal
– Powerful edits are required
– Not easy to implement
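The key principle can be illustrated with a brute-force search over small categorical domains. This is not the linear programming formulation mentioned on the slide, only a toy sketch of the "fewest fields changed" idea; the record, domains and edits are hypothetical.

```python
from itertools import combinations, product


def localize_errors(record, domains, edits):
    """Find the smallest set of fields whose change satisfies all edits.

    Tries 0 fields, then 1, then 2, ... and returns the first
    (changed_fields, repaired_record) found, or None if no repair exists.
    """
    fields = list(record)
    for k in range(len(fields) + 1):
        for subset in combinations(fields, k):
            for values in product(*(domains[f] for f in subset)):
                candidate = dict(record)
                candidate.update(zip(subset, values))
                if all(edit(candidate) for edit in edits):
                    return subset, candidate
    return None


record = {"age": 6, "marital": "married", "employed": "yes"}
domains = {
    "age": range(0, 100),
    "marital": ["single", "married"],
    "employed": ["yes", "no"],
}
edits = [
    # Persons under 15 must be single.
    lambda r: r["age"] >= 15 or r["marital"] == "single",
    # Persons under 15 must not be employed.
    lambda r: r["age"] >= 15 or r["employed"] == "no",
]
changed, repaired = localize_errors(record, domains, edits)
print(changed, repaired)
```

Both edits fail for the original record. Changing `age` alone repairs both, whereas fixing `marital` and `employed` would require two changes, so the search selects `age`. The exhaustive search grows quickly with the number of fields, which is one reason practical systems rely on optimization methods instead.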
17 How/Why to choose one method over the other?
Depends on the specifics of the survey and the available time, cost, expertise, etc.
– Ex: in a short-term survey estimating changes in employment in the manufacturing sector, using historic data for employment would bias the estimate of change downwards, because imputed units show no change
When designing imputation processes, simulations with a variety of imputation techniques should be run
Fine-tuning the imputation process to the particulars of the survey is necessary
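One way to run such a simulation is to delete known values at random, impute them with each candidate method, and compare the imputed values with the truth. The sketch below assumes values go missing completely at random and uses mean absolute error as the score; both choices, and all names, are illustrative.

```python
import random
from statistics import mean, median


def mean_impute(values):
    """Replace None with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def median_impute(values):
    """Replace None with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def evaluate(method, truth, mask_rate=0.3, seed=0):
    """Mask known values at random and return the mean absolute error."""
    rng = random.Random(seed)
    masked = [None if rng.random() < mask_rate else v for v in truth]
    imputed = method(masked)
    errors = [abs(i - t) for i, t, m in zip(imputed, truth, masked) if m is None]
    return sum(errors) / len(errors) if errors else 0.0


# Skewed toy data: a few large units among many small ones.
truth = [10.0, 12.0, 11.0, 9.0, 13.0, 10.0, 200.0, 11.0, 12.0, 10.0]
for method in (mean_impute, median_impute):
    print(method.__name__, evaluate(method, truth))
```

Repeating the comparison over many seeds and with the survey's own data gives a more reliable picture than a single run.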