Imputation in UNECE Statistical Databases: Principles and Practices
Steven Vale and Heinrich Brüngger, UNECE Statistical Division
Contents
- The ECOSOC view of statistical imputation
- Current practices
- Basic principles
- Step-by-step implementation
- Conclusions and open questions
14 November 2018
ECOSOC views
- Resolution 2006/6 on strengthening statistical capacity
  - Sets limits for the use of imputation
  - ... but also implicitly endorses it as a statistical technique
- Statistical agencies need to review their practices to ensure compliance
Defining imputation
- “A procedure for entering a value for a specific data item where the response is missing or unusable”
- Boundary issues:
  - Imputing and editing
  - Imputing and forecasting
Current practice in UNECE
- Very limited ad-hoc imputation
- Four cases:
  - Account identities
  - Regional aggregates
  - Poor quality national data with little impact on region totals
  - Re-classification
- Using imputations from others
  - Sufficient transparency in source metadata?
Basic principles (1)
- Imputed national data are not published
  - Avoids the need for consultation
- Only official sources used for imputation
  - Preference for data from same country
- Clear distinction between “real” and imputed data
- Transparency – imputed data clearly flagged, and methods documented
Basic principles (2)
- Aggregates must contain > 90% “real” data, covering > 50% of countries
- Imputed data are re-calculated periodically to adjust for revisions
- Method used defined at the level of the variable and stored as an attribute
- Decisions on the use of imputation to be taken with regard to the quality framework
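The two aggregate thresholds above can be checked mechanically. The sketch below is illustrative only: the function name and the input layout are hypothetical, and it assumes the 90% share is measured by value for an additive, non-negative variable, which the slides do not specify.

```python
from typing import Mapping, Tuple


def aggregate_meets_thresholds(
    values: Mapping[str, Tuple[float, bool]],
    min_real_share: float = 0.90,
    min_country_share: float = 0.50,
) -> bool:
    """Check the two publication thresholds for a regional aggregate.

    `values` maps a country code to (value, is_imputed).
    Assumption: the "real" data share is the share of the aggregate's
    total value that comes from non-imputed country figures.
    """
    if not values:
        return False
    total = sum(value for value, _ in values.values())
    if total <= 0:
        return False  # value share is undefined for an empty or zero total

    real_total = sum(value for value, imputed in values.values() if not imputed)
    real_countries = sum(1 for _, imputed in values.values() if not imputed)

    # Both thresholds are strict inequalities, as on the slide (> 90%, > 50%).
    return (
        real_total / total > min_real_share
        and real_countries / len(values) > min_country_share
    )
```

A region in which one large country reports real data and several small ones are imputed can pass the value test yet fail the country-coverage test, which is why the two conditions are checked separately.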
Step-by-step application
- Automatic imputation routines to extend imputation towards the boundaries set by the ECOSOC Resolution
- One step at a time, with pause and review to consider quality and cost / benefit
- “Dashboard” to allow statisticians to choose the most appropriate method
- Implemented in the context of re-engineering of statistical database system
First step
- Use a linear trend to impute missing values
- Requirements:
  - Sufficient time series observations (at least 3 out of previous 5 periods)
  - Closeness of fit of linear trend (R² close to 1)
- Constraints:
  - Validity of R² for few observations
  - Forward imputation only
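The first-step rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the UNECE routine: the function name is hypothetical, the fit is ordinary least squares, and the cut-off `min_r2 = 0.95` is an assumed stand-in for "R² close to 1", which the slides do not quantify.

```python
from typing import Optional, Sequence


def impute_linear_trend(
    history: Sequence[Optional[float]],
    min_obs: int = 3,
    window: int = 5,
    min_r2: float = 0.95,  # assumed threshold; the slides say only "close to 1"
) -> Optional[float]:
    """Forward-impute the period after `history`, or return None.

    `history` holds one value per period, oldest first, with None for
    missing periods. Only the last `window` periods are used.
    """
    recent = list(history)[-window:]
    offset = len(history) - len(recent)
    points = [(offset + i, y) for i, y in enumerate(recent) if y is not None]
    if len(points) < min_obs:
        return None  # not enough observations in the window

    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    if sxx == 0:
        return None  # a single distinct period cannot define a trend

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R² of a simple linear fit; a constant series is fitted exactly.
    r2 = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    if r2 < min_r2:
        return None  # trend does not fit closely enough

    # Forward imputation only: extrapolate one period past the last one.
    return intercept + slope * len(history)
```

For example, `impute_linear_trend([10, None, 14, 16, None])` fits the three available points and extrapolates to the next period, whereas a series with only two observations in the last five periods returns `None`. With three points a high R² can arise by chance, which is the validity concern listed under the constraints.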
[Figure: data availability by year, 2000 to 2007 (Y = yes, N = no), and whether each pattern leads to imputation (yes / no)]
Next steps
- More flexibility:
  - Longer time series
  - Imputing values at start and in middle of time series
  - Non-linear trends?
  - Cross-country imputation in strictly limited cases?
Conclusions
- Strong links between imputation and quality
- Trade-off between accessibility and accuracy
- Step-by-step, pause and review approach seems appropriate
- Transparency is essential
- Standardization of practices between international organizations would help
Open questions
- Are other organizations interested in defining a common policy on the use of imputation, in response to the ECOSOC Resolution?
- Could we go further and consider harmonization of methods and tools?
- How should this be done? Is a specific forum needed, or can this be dealt with in combination with work on data quality?
- Have other organizations modified their policies on imputation in the light of the ECOSOC Resolution, and if so, how?