Eurostat Statistical Data Editing and Imputation.

Presented by Sander Scholtus, Statistics Netherlands

Introduction
Data arrive at a statistical institute...
[Example table of raw business records. Columns: ID, size class, number of employees, turnover (x €1000), labour costs (x €1000), other costs (x €1000), total costs (x €1000). The cell values did not survive extraction.]

Introduction
Data arrive at a statistical institute...
– ...containing errors and implausible values
– ...containing missing values
To produce statistical output of sufficient quality, these data problems have to be treated
– Statistical data editing deals with errors
– Imputation deals with missing values

Statistical data editing
Overview
– Goals
– Edit rules
– Different editing methods and how to combine them
– Modules in the handbook

Statistical data editing – goals
Traditional goal of editing:
– Detect and correct all errors in the collected data
Problems:
– Very labour-intensive
– Very time-consuming
– Highly inefficient: measurement error is not the only source of error in statistical output

Statistical data editing – goals
Modern goals of editing:
1. To identify possible sources of errors so that the statistical process may be improved in the future.
2. To provide information about the quality of the data collected and published.
3. To detect and correct influential errors in the collected data.
4. If necessary, to provide complete and consistent micro-data.
Sources: Granquist (1997), EDIMBUS (2007)

Statistical data editing – edit rules
Examples of edit rules:
– Turnover ≥ 0 (non-negativity edit, hard)
– Profit = Turnover – Total costs (balance edit, hard)
– IF (Size class = “Small”) THEN (0 ≤ Number of employees < 10) (conditional edit, soft)
– IF (Economic activity = “Construction”) THEN (a < Turnover / Number of employees < b) (ratio edit, soft)
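The edit rules above can be sketched as predicates that a record either passes or fails. This is a minimal illustration only: the record, its field names and its values are hypothetical, and the split used here (a hard edit must hold for every correct record, a soft edit only flags an implausible one) is the usual reading of the two labels rather than something spelled out on the slide.

```python
# Hypothetical record; field names and values are illustrative only.
record = {
    "size_class": "Small",
    "employees": 12,
    "turnover": 500.0,
    "total_costs": 420.0,
    "profit": 80.0,
}

# Each edit rule: (name, kind, predicate that is True when the record passes).
EDITS = [
    ("turnover_nonneg", "hard", lambda r: r["turnover"] >= 0),
    ("profit_balance", "hard",
     lambda r: abs(r["profit"] - (r["turnover"] - r["total_costs"])) < 1e-9),
    ("small_size_class", "soft",
     lambda r: r["size_class"] != "Small" or 0 <= r["employees"] < 10),
]


def failed_edits(r):
    """Return the (name, kind) of every edit rule the record violates."""
    return [(name, kind) for name, kind, passes in EDITS if not passes(r)]


violations = failed_edits(record)
print(violations)
```

Here only the soft conditional edit fails, so the record would be flagged as suspicious rather than as certainly wrong.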

Statistical data editing – methods
[Flow chart: raw microdata → deductive editing → selective editing → manual editing (selected records) or automatic editing (not selected records) → macro-editing → statistical microdata]

Statistical data editing – methods
Deductive editing
– Directed at systematic errors
– Deterministic detection and amendment
  • if-then rules
  • algorithms
– Examples:
  • unit of measurement errors (e.g. “4,000,000” instead of “4,000”)
  • sign errors (e.g. “–10” instead of “10”)
  • simple typing errors (e.g. “192” instead of “129”)
  • subject-matter specific errors
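As a rough sketch of such if-then rules, the two functions below detect and amend a unit of measurement error and a sign error. The reference value (for instance last period's figure), the ratio band of 300 to 3000 and the function names are all assumptions made for this illustration, not rules taken from the handbook.

```python
def fix_thousand_error(value, reference, low=300.0, high=3000.0):
    """Divide by 1000 when the value is roughly 1000 times a reference value.

    The band (low, high) around the factor 1000 is an assumed tolerance.
    Returns the (possibly amended) value and a flag saying whether it changed.
    """
    if reference > 0 and low < value / reference < high:
        return value / 1000.0, True
    return value, False


def fix_sign_error(value):
    """Flip the sign of a negative value.

    Only sensible for variables that are known to be non-negative.
    """
    if value < 0:
        return -value, True
    return value, False


print(fix_thousand_error(4_000_000, 4_200))
print(fix_sign_error(-10))
```

Returning a change flag alongside the value keeps a trace of which items were amended deterministically.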

Statistical data editing – methods
Manual editing
– Requires:
  • Human editors (subject-matter specialists)
  • Dedicated software (interactive editing)
  • Edit rules (hard and soft)
  • Editing instructions
– Re-contacts with businesses are sometimes used
– Important as a source for improvements in future rounds of a repeated survey

Statistical data editing – methods
Automatic editing
– Obtain consistent micro-data for non-influential records
– Paradigm of Fellegi and Holt (1976): Data should be made consistent with the edit rules by changing the fewest possible (weighted) number of items.
  • Leads to error localisation as a mathematical optimisation problem
  • Imputation of new values as a separate step
– Requires:
  • (Hard) edit rules
  • Dedicated software (e.g. Banff by Statistics Canada; SLICE by Statistics Netherlands; R package editrules)
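The Fellegi-Holt paradigm can be illustrated with a brute-force search on a toy categorical record. Everything here is hypothetical (the variables, their domains, the two edits and the unit weights), and exhaustive enumeration is used only because the example is tiny; the packages named above rely on far more efficient optimisation algorithms.

```python
from itertools import combinations, product

# Hypothetical categorical variables with their allowed values.
DOMAINS = {
    "age": ["child", "adult"],
    "marital": ["single", "married"],
    "employed": ["no", "yes"],
}

# Hard edits as predicates that are True when the record is consistent.
EDITS = [
    lambda r: not (r["age"] == "child" and r["marital"] == "married"),
    lambda r: not (r["age"] == "child" and r["employed"] == "yes"),
]

# Reliability weights: a higher weight makes a variable less likely to be changed.
WEIGHTS = {"age": 1.0, "marital": 1.0, "employed": 1.0}


def consistent(r):
    return all(edit(r) for edit in EDITS)


def can_repair(record, free_vars):
    """True if some choice of values for free_vars satisfies all edits."""
    for values in product(*(DOMAINS[v] for v in free_vars)):
        candidate = dict(record, **dict(zip(free_vars, values)))
        if consistent(candidate):
            return True
    return False


def localise_errors(record):
    """Return a minimum-weight set of variables to change, with its weight.

    Ties are broken by enumeration order; a real system might report all optima.
    """
    best, best_weight = None, float("inf")
    names = list(DOMAINS)
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            weight = sum(WEIGHTS[v] for v in subset)
            if weight < best_weight and can_repair(record, subset):
                best, best_weight = subset, weight
    return best, best_weight


record = {"age": "child", "marital": "married", "employed": "yes"}
print(localise_errors(record))
```

For this record, changing age alone resolves both failed edits, whereas keeping age would require changing two variables. The search only identifies which items to change; choosing the new values is left to the imputation step, as the slide states.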

Statistical data editing – methods
Macro-editing
– Also known as output editing
– Same purpose as selective editing
– Uses data from all available records at once
– Aggregate method:
  • Compute high-level aggregates
  • Check their plausibility
  • Drill down to suspicious lower-level aggregates
  • Eventually: Drill down to suspicious individual records
  • Feedback to manual editing
– Graphical aids (scatter plots, etc.) to find outliers
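The aggregate method might be sketched as follows. The records, the previous-period totals and the plausibility band (a ratio of at most 1.5 in either direction) are invented for the example, and comparing with the previous period is just one possible plausibility check.

```python
from collections import defaultdict

# Hypothetical current-period records and previous-period totals.
records = [
    {"id": "A1", "industry": "construction", "turnover": 120.0},
    {"id": "A2", "industry": "construction", "turnover": 95.0},
    {"id": "B1", "industry": "retail", "turnover": 80.0},
    {"id": "B2", "industry": "retail", "turnover": 7500.0},
]
previous_totals = {"construction": 210.0, "retail": 160.0}


def aggregate(recs):
    """Compute total turnover per industry."""
    totals = defaultdict(float)
    for r in recs:
        totals[r["industry"]] += r["turnover"]
    return dict(totals)


def suspicious_aggregates(totals, previous, max_ratio=1.5):
    """Flag groups whose total moved by more than max_ratio in either direction."""
    flagged = []
    for group, total in totals.items():
        ratio = total / previous[group]
        if ratio > max_ratio or ratio < 1 / max_ratio:
            flagged.append(group)
    return flagged


def drill_down(recs, group, top=1):
    """Return the records that contribute most to a flagged aggregate."""
    members = [r for r in recs if r["industry"] == group]
    return sorted(members, key=lambda r: r["turnover"], reverse=True)[:top]


totals = aggregate(records)
flagged = suspicious_aggregates(totals, previous_totals)
candidates = {g: [r["id"] for r in drill_down(records, g)] for g in flagged}
print(candidates)
```

The identifiers in `candidates` are the records that would be passed back to manual editing.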

Statistical data editing – modules
Modules in the handbook:
1. Main theme module
2. Deductive editing
3. Selective editing
4. Automatic editing
5. Manual editing
6. Macro-editing
7. Editing administrative data
8. Editing for longitudinal data

Imputation
Overview
– Missing data
– Imputation methods
– Special topics
– Modules in the handbook

Imputation – missing data
Missing data may occur because of
– Logical reasons
  • A particular question does not apply to a particular unit
– Unit non-response
  • No data observed at all for a particular unit
– Item non-response
  • Unit is not able to answer a particular question
  • Unit is not willing to answer a particular question
– Editing
  • Originally observed value discarded during automatic editing

Imputation – missing data
Imputation: filling in new (estimated) values for data items that are missing
Commonly used for missing data due to item non-response and editing
Obtain a completed micro-data file prior to estimation
– Simplifies the estimation step
– Prevents inconsistencies in the output

Imputation – methods
– Deductive imputation
– Model-based imputation
– Donor imputation
Assumption: All observed values are correct
– Imputation applied after error localisation

Imputation – methods
Deductive imputation
– Derive (rather than estimate) missing values from observed values based on
  • logical relations (edit rules)
  • substantive imputation rules
– Can be very useful as a first imputation step
[Example tables with columns: ID, turnover (sales), turnover (services), turnover (other), turnover (total). The cell values did not survive extraction.]
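A minimal sketch of deductive imputation from a balance edit, assuming the rule sales + services + other = total for the turnover variables in the slide's table. The record values and the function name are hypothetical, and missing items are represented by `None`.

```python
COMPONENTS = ("sales", "services", "other")


def impute_from_balance(record):
    """Derive a single missing item from sales + services + other = total.

    If zero or more than one item is missing, the balance edit does not
    determine a unique value and the record is returned unchanged.
    """
    fields = COMPONENTS + ("total",)
    missing = [f for f in fields if record[f] is None]
    completed = dict(record)
    if len(missing) != 1:
        return completed
    if missing[0] == "total":
        completed["total"] = sum(record[c] for c in COMPONENTS)
    else:
        known = sum(record[c] for c in COMPONENTS if c != missing[0])
        completed[missing[0]] = record["total"] - known
    return completed


record = {"sales": 100.0, "services": None, "other": 20.0, "total": 150.0}
print(impute_from_balance(record))
```

Because the value follows from the edit rule itself, no statistical model is involved, which is why this kind of step is useful before any estimation-based imputation.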

Imputation – methods
Model-based imputation
– Imputations based on a predictive model
– Model fitted on the observed data, then used to impute the missing data

Imputation – methods
Model-based imputation
– Choice of model depends on intended use of data
  • Estimating means and totals: mean or ratio imputation may be sufficient
  • General purpose micro-data: important to model relationships
– Multivariate model-based imputation
  • Multivariate regression imputation (joint model for all variables)
  • Sequential regression / chained equations (separate model for each variable, conditional on the other variables)
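Ratio imputation, the simplest of the models mentioned, can be sketched as below: the ratio of the target variable to one auxiliary variable is estimated from the item respondents and applied to the units with a missing value. The data and the choice of number of employees as auxiliary variable are assumptions for the example.

```python
def ratio_impute(y, x):
    """Impute missing y values (None) as R * x, with R = sum(y_obs) / sum(x_obs).

    Returns the completed values, imputation flags and the estimated ratio.
    """
    observed = [(yi, xi) for yi, xi in zip(y, x) if yi is not None]
    ratio = sum(yi for yi, _ in observed) / sum(xi for _, xi in observed)
    completed = [yi if yi is not None else ratio * xi for yi, xi in zip(y, x)]
    flags = [yi is None for yi in y]
    return completed, flags, ratio


# Hypothetical data: turnover is the target, employees the auxiliary variable.
turnover = [200.0, None, 90.0, 410.0]
employees = [10, 25, 5, 20]

completed, flags, ratio = ratio_impute(turnover, employees)
print(completed, flags, ratio)
```

The function also returns flags marking which values were imputed, so that imputations can be told apart from observed values later on.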

Imputation – methods
Donor imputation
– Missing values imputed by ‘borrowing’ observed values from other (similar) units
  • Unit with observed value: donor
  • Unit with missing value: recipient
– Hot deck: donor and recipient in the same data file

Imputation – methods
Donor imputation
– Special cases:
  • Random hot deck imputation
    - Donor selected at random (within classes)
    - Use auxiliary variables to define imputation classes
  • Nearest-neighbour imputation
    - Donor selected with minimal distance to recipient
    - Use auxiliary variables to define distance
  • Predictive mean matching
    - Special case of nearest-neighbour imputation
    - Distance based on predicted values from a regression model
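Nearest-neighbour hot deck imputation could look roughly like this, using the absolute difference on a single auxiliary variable as the distance. The units, the variable names and the distance function are illustrative assumptions; in practice several auxiliary variables and a scaled distance would normally be used.

```python
def nearest_neighbour_impute(units, target, aux):
    """Fill missing target values with the value of the closest donor on aux.

    Donors are the units in the same file with an observed target value
    (hot deck). Ties go to the first donor in file order.
    """
    donors = [u for u in units if u[target] is not None]
    completed = []
    for unit in units:
        unit = dict(unit)
        if unit[target] is None:
            donor = min(donors, key=lambda d: abs(d[aux] - unit[aux]))
            unit[target] = donor[target]
            unit["donor_id"] = donor["id"]  # keep a trace of the imputation
        completed.append(unit)
    return completed


# Hypothetical units; None marks a missing turnover.
units = [
    {"id": 1, "employees": 8, "turnover": 150.0},
    {"id": 2, "employees": 40, "turnover": 900.0},
    {"id": 3, "employees": 35, "turnover": None},
    {"id": 4, "employees": 9, "turnover": None},
]

result = nearest_neighbour_impute(units, "turnover", "employees")
print(result)
```

Replacing the `min` call with a random choice among donors in the same class would turn this into a random hot deck.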

Imputation – special topics
Choice of method/model/auxiliary variables
– General problem in multivariate analysis
– Auxiliary variables should explain
  • the target variable(s)
  • the missing data mechanism
– Compare model fit among item respondents
  • Can be misleading (“imputation bias”)
– Simulation experiments with historical data

Imputation – special topics
Imputation for longitudinal data
– Repeated cross-sectional surveys
– Panel studies
Special imputation methods for longitudinal data
– Last observation carried forward
– Interpolation
– Extrapolation
– Little and Su method
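Two of the longitudinal methods listed, last observation carried forward and (linear) interpolation, can be sketched for a single unit's time series. The series is made up, `None` marks a missing period, and linear interpolation is only one possible form of interpolation.

```python
def locf(series):
    """Last observation carried forward; leading gaps stay missing."""
    completed, last = [], None
    for value in series:
        if value is not None:
            last = value
        completed.append(last)
    return completed


def interpolate(series):
    """Linear interpolation between observed periods; edge gaps stay missing."""
    completed = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            completed[i] = series[left] + step * (i - left)
    return completed


# Hypothetical turnover series for one unit over five periods.
series = [100.0, None, None, 130.0, None]
print(locf(series))
print(interpolate(series))
```

The final period illustrates the difference: carrying forward fills it, while interpolation leaves it missing because there is no later observation, which is where extrapolation would come in.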

Imputation – special topics
Imputations are estimates
– Imputed values should be flagged
Variance estimation with imputed data
– Variance likely to be underestimated when…
  • …imputations are treated as observed values
  • …model predictions are imputed without a disturbance term
  • …single imputation is used
– Alternative approach: Multiple imputation
  • Not often used in official statistics (yet)
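A small numerical illustration of the underestimation, using made-up values and mean imputation as the simplest case of imputing a model prediction without a disturbance term: treating the imputed values as observed shrinks the sample variance.

```python
from statistics import mean, variance

# Hypothetical observed values; two further units have a missing value.
observed = [10.0, 20.0, 30.0, 40.0]
n_missing = 2

# Mean imputation: every missing value gets the respondent mean.
completed = observed + [mean(observed)] * n_missing

var_observed = variance(observed)    # sample variance among respondents
var_completed = variance(completed)  # naive variance on the completed file
print(var_observed, var_completed)
```

The imputed values sit exactly at the mean and add no spread while increasing the apparent sample size, so the naive variance on the completed file is smaller than the variance among respondents.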

Imputation – special topics
Imputed values may be invalid/inconsistent
– Examples:
  • Turnover = –100 (invalid)
  • Labour costs = 0, Number of employees = 15 (inconsistent)
– Need not be a problem for estimating aggregates
– Can be a problem if micro-data are distributed further
Imputation under edit constraints
– One-step method: constrained imputation model
– Two-step method: imputation followed by data reconciliation
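The second step of the two-step method might be sketched as a simple proportional adjustment (prorating) of the imputed items so that a balance edit holds again. Prorating is only one possible reconciliation technique, chosen here for brevity; the record, the field names and the balance edit sales + services + other = total are assumptions.

```python
def prorate(record, components, total_field, imputed):
    """Scale the imputed components so that all components sum to the total.

    Observed components and the total are left untouched. If the imputed
    values sum to zero or the observed part already exceeds the total,
    the record is returned unchanged for other treatment.
    """
    fixed = sum(record[c] for c in components if c not in imputed)
    target = record[total_field] - fixed
    current = sum(record[c] for c in imputed)
    adjusted = dict(record)
    if current == 0 or target < 0:
        return adjusted
    factor = target / current
    for c in imputed:
        adjusted[c] = record[c] * factor
    return adjusted


# Hypothetical record after step one: services and other were imputed,
# and the components now sum to 160 instead of the observed total 150.
record = {"sales": 100.0, "services": 40.0, "other": 20.0, "total": 150.0}
components = ("sales", "services", "other")

adjusted = prorate(record, components, "total", {"services", "other"})
print(adjusted)
```

Only the imputed items are changed, in line with the assumption stated earlier that all observed values are correct at the imputation stage.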

Imputation – modules
Modules in the handbook:
1. Main theme module
2. Deductive imputation
3. Model-based imputation
4. Donor imputation
5. Imputation for longitudinal data
6. Little and Su method
7. Imputation under edit constraints

Thank you for your attention!

References
EDIMBUS (2007), Recommended Practices for Editing and Imputation in Cross-Sectional Business Surveys.
Fellegi, I.P. and D. Holt (1976), A Systematic Approach to Automatic Edit and Imputation. Journal of the American Statistical Association 71, pp. 17–35.
Granquist, L. (1997), The New View on Editing. International Statistical Review 65, pp. 381–387.