Integrated Data Editing and Imputation Ton de Waal, Department of Methodology, Statistics Netherlands, Voorburg ICES III conference, Montréal, June 19, 2007

What is statistical data editing and imputation? Observed data generally contain errors and missing values Statistical Data Editing (SDE): process of checking observed data, and, when necessary, correcting them Imputation: process of estimating missing values and filling them into the data set

What is integrated SDE and imputation? Integration of error localization and imputation Integration of several edit and imputation techniques to optimize edit and imputation process Integration of statistical data editing into rest of statistical process

SDE and the survey process We will focus on identifying and correcting errors Other goals of SDE are to identify error sources in order to provide feedback on the entire survey process and to provide information about the quality of incoming and outgoing data The role of SDE is slowly shifting towards these goals: feedback on other survey phases can be used to improve those phases and reduce the amount of errors arising in them

Edits Edit rules, or edits for short, are often used to determine whether a record is consistent or not Inconsistent records are considered to contain errors Consistent records that are also not otherwise suspicious, e.g. are not outlying with respect to the bulk of the data, are considered error-free Example of edits (T turnover, P profit, C costs): T = P + C (balance edit) and T ≥ 0 (non-negativity edit)
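
A minimal sketch of checking a record against these two example edits; the record layout is illustrative:

```python
# A minimal sketch (record layout is illustrative) of checking a record
# against the two example edits.
def violated_edits(record, tol=1e-9):
    """Return the edits that the record violates."""
    T, P, C = record["T"], record["P"], record["C"]
    violations = []
    if abs(T - (P + C)) > tol:        # balance edit
        violations.append("T = P + C")
    if T < 0:                         # non-negativity edit
        violations.append("T >= 0")
    return violations

print(violated_edits({"T": 100, "P": 30, "C": 75}))   # ['T = P + C']
```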

SDE and imputation Three related problems: Error localization: determine which values are erroneous Imputation: impute missing and erroneous data in the best possible way Consistency: adjust imputed values such that all edits become satisfied Correction is often done by means of imputation Most SDE techniques focus on error localization

SDE in the old days Use of computers in SDE started many years ago In the early years the role of computers was restricted to checking which edits were violated Subject-matter specialists retrieved paper questionnaires that did not pass all edits and corrected them After correction, data were entered into the computer again and checked once more against all edits Major problem: during the manual correction process records were not checked for consistency

Modern SDE techniques Interactive editing Selective editing Automatic editing Macro-editing

Interactive editing During interactive editing a modern survey processing system (e.g. BLAISE) is used Such a system allows one to check and – if necessary – correct records in a single step Advantages: number of variables, edits and records may be high quality of interactively edited data is generally high Disadvantages: all records have to be edited: costly in terms of budget and time not transparent

Selective editing Umbrella term for several methods to identify the influential errors Aim is to split data into two streams: critical stream: records that are the most likely ones to contain influential errors non-critical stream: records that are unlikely to contain influential errors Records in critical stream are edited interactively Records in non-critical stream are either not edited or are edited automatically

Selective editing Many selective editing methods are based on common sense The most often applied basic idea is to use a score function with two important components: influence: measures the relative influence of the record on a publication figure risk: measures the deviation of observed values from anticipated values (e.g. medians or values from previous years)

Selective editing Local score for a single variable within a record is usually defined as the distance between the observed and anticipated values, taking the influence of the record into account Example: W × |Y – Y*|, with W the raising weight, Y the observed value and Y* the anticipated value influence component: W × Y* risk component: |Y – Y*| / Y* Local scores are combined into a global score for the entire record by the sum of the local scores or the maximum of the local scores Records with a global score above a certain cut-off value are edited interactively
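
A minimal sketch of this score function in Python; the weights, values and cut-off below are illustrative assumptions, not figures from the talk:

```python
# A minimal sketch of the local/global score function described above.
def local_score(w, y_obs, y_ant):
    """Local score = influence x risk = (w * y_ant) * (|y_obs - y_ant| / y_ant)."""
    influence = w * y_ant
    risk = abs(y_obs - y_ant) / y_ant
    return influence * risk              # simplifies to w * |y_obs - y_ant|

def global_score(local_scores, method="sum"):
    """Combine local scores into a global score by their sum or maximum."""
    return sum(local_scores) if method == "sum" else max(local_scores)

scores = [local_score(1.5, 1200, 1000), local_score(1.5, 45, 50)]
needs_interactive_editing = global_score(scores) > 100   # illustrative cut-off
```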

Selective editing: (dis)advantages Advantage: selective editing improves efficiency in terms of budget and time Disadvantage: no good techniques are available for combining local scores into a global score if there are many variables Selective editing has gradually become a popular method for editing business data

Automatic editing Two kinds of errors: systematic ones and random ones Systematic error: error reported consistently among (some) responding units, e.g. gross values reported instead of net values values reported in units instead of requested thousands of units (so-called thousand-errors) Random error: error caused by accident, e.g. an observed value where the respondent by mistake typed in one digit too many

Automatic editing of systematic errors Can often be detected by comparing respondents' current values with those from previous years comparing responses to questionnaire variables with values of register variables using subject-matter knowledge Once detected, a systematic error is often simple to correct
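
A minimal sketch of detecting a thousand-error by comparison with last year's value; the ratio band (300, 3000) is an illustrative assumption, not a figure from the talk:

```python
# A minimal sketch of thousand-error detection via a year-on-year ratio.
def is_thousand_error(current, previous, low=300.0, high=3000.0):
    """Flag values roughly 1000 times last year's value for the same unit."""
    if previous <= 0:
        return False
    return low < current / previous < high

value = 1_250_000
if is_thousand_error(value, previous=1_300):
    value /= 1000                     # once detected, correction is a rescaling
```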

Automatic editing of random errors Three classes of methods: methods based on statistical models (e.g. outlier detection techniques and neural networks) methods based on deterministic checking rules methods based on solving a mathematical optimization problem

Deterministic checking rules State which values are considered erroneous when a record violates edits Example: if component variables do not sum up to the total, the total variable is considered to be erroneous Advantages: drastically improves efficiency in terms of budget and time transparency and simplicity Disadvantages: many rules have to be specified, maintained and checked for validity bias may be introduced as one aims to detect random errors in a systematic manner
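
A minimal sketch of such a rule for the balance edit from before: if the components do not sum to the total, the total is declared erroneous; overwriting it with the component sum is one possible correction, assumed here for illustration:

```python
# A minimal sketch of a deterministic checking rule for the edit T = P + C.
def apply_total_rule(record, tol=1e-9):
    component_sum = record["P"] + record["C"]
    if abs(record["T"] - component_sum) > tol:
        record["T"] = component_sum   # total declared erroneous and replaced
    return record

print(apply_total_rule({"T": 100, "P": 30, "C": 75}))  # {'T': 105, 'P': 30, 'C': 75}
```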

Error localization as mathematical optimization problem Guiding principle is needed Freund and Hartley (1967): minimize sum of the distance between observed and corrected data and a measure for violation of edits Casado Valera et al. (90s): minimize quadratic function measuring distance between observed and corrected data such that corrected data satisfy all edits Bankier (90s): impute missing data and potentially erroneous values by means of donor imputation, and select imputed record that satisfies all edits and that is closest to original record

Fellegi-Holt paradigm (1976) Data should be made to satisfy all edits by changing the values of as few variables as possible Generalization: data should be made to satisfy all edits by changing the values of the variables with the smallest possible sum of reliability weights reliability weight expresses how reliable one considers the values of this variable to be a high reliability weight corresponds to a variable whose values are considered trustworthy
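
A brute-force sketch of the generalized paradigm for the two example edits T = P + C and T ≥ 0; the enumeration is exponential, and production systems instead use set-covering or branch-and-bound formulations:

```python
# Brute-force Fellegi-Holt sketch: find the set of variables with the smallest
# total reliability weight whose values can be changed to satisfy all edits.
from itertools import combinations

def feasible(record, free):
    """Can the edits be satisfied by changing only the variables in `free`?"""
    T, P, C = record["T"], record["P"], record["C"]
    if "T" not in free:
        if T < 0:
            return False                        # fixed T violates T >= 0
        if "P" not in free and "C" not in free:
            return abs(T - (P + C)) < 1e-9      # nothing may change
        return True                             # solve balance edit for a free component
    if "P" in free or "C" in free:
        return True                             # pick a component so that P + C >= 0
    return P + C >= 0                           # T := P + C must be non-negative

def fellegi_holt(record, weights):
    names = list(record)
    best, best_w = None, float("inf")
    for r in range(len(names) + 1):
        for subset in map(set, combinations(names, r)):
            w = sum(weights[v] for v in subset)
            if w < best_w and feasible(record, subset):
                best, best_w = subset, w
    return best                                 # variables flagged as erroneous

print(fellegi_holt({"T": 100, "P": 60, "C": 50},
                   {"T": 2.0, "P": 1.0, "C": 1.0}))   # -> {'P'}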

Fellegi-Holt paradigm: (dis)advantages Advantages: drastically improves efficiency in terms of budget and time in comparison to deterministic checking rules fewer, and less detailed, rules have to be specified Disadvantages: class of errors that can safely be treated is limited to random errors class of edits that can be handled is restricted to so-called hard (or logical) edits, which hold true for all correctly observed records risky to treat influential errors by means of automatic editing

Macro-editing Macro-editing techniques often examine potential impact on survey estimates to identify suspicious data in individual records Two forms of macro-editing aggregation method distribution method

Macro-editing: aggregation method Verification whether figures to be published seem plausible Compare quantities in publication tables with same quantities in previous publications quantities based on register data related quantities from other sources

Macro-editing: distribution method Available data used to characterize distribution of variables Individual values compared with this distribution Records containing values that are considered uncommon given the distribution are candidates for further inspection and possibly for editing
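
A minimal sketch of the distribution method: flag values far from the bulk of the data with a robust z-score based on the median and the MAD (the 3.5 threshold and 1.4826 scaling are common conventions, assumed here rather than taken from the talk):

```python
# A minimal sketch of flagging distributionally uncommon values.
import statistics

def suspicious_values(values, threshold=3.5):
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    return [v for v in values if abs(v - med) / (1.4826 * mad) > threshold]

print(suspicious_values([10, 12, 11, 9, 10, 11, 250]))  # [250]
```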

Macro-editing: graphical techniques Exploratory Data Analysis techniques can be applied: box plots scatter plots (outlier robust) fitting Other often-used techniques in software applications: anomaly plots: graphical overviews of important estimates, where unusual estimates are highlighted time series analysis outlier detection methods Once suspicious data have been detected on a macro-level one can drill down to sub-populations and individual units

Macro-editing: (dis)advantages Advantages: directly related to publication figures or distribution efficient in terms of budget and time Disadvantages: records that are considered non-suspicious may still contain influential errors publication of unexpected (but true) changes in trend may be prevented for data sets with many important variables graphical macro-editing is not the most suitable SDE method: most people cannot interpret 10 scatter plots at the same time

Integrating SDE techniques We advocate an SDE approach that consists of the following phases: correction of evident systematic errors application of selective editing to split records into a critical stream and a non-critical stream editing of data: records in critical stream edited interactively records in non-critical stream edited automatically validation of the publication figures by means of (graphical) macro-editing

Imputation Expert guess Deductive imputation Multivariate regression imputation Nearest neighbor hot-deck imputation Ratio hot-deck imputation

Deductive imputation Sometimes missing values can be determined unambiguously from the edits Examples: a single missing value involved in a balance edit can be derived from the other values in that edit for non-negative variables: if a total variable has zero value, all missing subtotal (component) variables are zero

Regression imputation Regression model per variable to be imputed: Y = A + B X + e Imputations for missing data can be obtained from Y = A_est + B_est X or from Y = A_est + B_est X + e*, where e* is drawn from an appropriate distribution
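
A minimal sketch (numpy, illustrative data) of deterministic and stochastic regression imputation: fit Y = A + B X on the complete cases, then impute with or without a random residual e*:

```python
# A minimal sketch of (stochastic) regression imputation.
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, np.nan, 8.2, np.nan])      # two missing values

obs = ~np.isnan(y)
B_est, A_est = np.polyfit(x[obs], y[obs], deg=1)   # least-squares estimates
resid_sd = np.std(y[obs] - (A_est + B_est * x[obs]))

y_det = A_est + B_est * x                          # deterministic imputations
y_stoch = y_det + rng.normal(0.0, resid_sd, x.size)  # add residual e*
y_imputed = np.where(obs, y, y_stoch)              # keep observed, fill missing
```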

Regression imputation Imputation can also be based on a multivariate regression model that relates each missing value to all observed values: Y_mis = Mean_mis + B (Y_obs – Mean_obs) + e Estimates of the model parameters can be obtained by using the EM algorithm Imputations for missing data can be obtained from Y_mis = Mean_est,mis + B_est (Y_obs – Mean_est,obs) or from Y_mis = Mean_est,mis + B_est (Y_obs – Mean_est,obs) + e*, where e* is drawn from an appropriate distribution
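
Not the EM-based model from the slide, but a readily available relative: scikit-learn's IterativeImputer regresses each incomplete variable on all the others and iterates (chained regression) until the imputations settle; shown here as an analogue, not the talk's method:

```python
# A chained-regression analogue of multivariate regression imputation.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[100.0, 40.0, 60.0],
              [200.0, np.nan, 120.0],
              [150.0, 55.0, np.nan]])
X_imputed = IterativeImputer(random_state=0).fit_transform(X)
```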

Nearest neighbor hot deck imputation For each receptor record with missing values on some (target) variables a donor record is selected that has no missing values on auxiliary and target variables smallest distance to receptor Replace missing values by values from donor Often used distance measure is the minimax distance: with Z_si the value of scaled auxiliary variable i in record s, the distance between records s and t is D(s,t) = max_i |Z_si – Z_ti|
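
A minimal sketch of donor selection with the minimax distance; the record layout and variable names are illustrative, and the auxiliary variables are assumed to be scaled already:

```python
# A minimal sketch of nearest neighbor hot deck with the minimax distance.
def minimax_distance(r, s, aux_vars):
    """D(r, s) = max_i |Z_ri - Z_si| over the auxiliary variables."""
    return max(abs(r[v] - s[v]) for v in aux_vars)

def nn_hot_deck(receptor, donors, aux_vars, target_vars):
    """Copy missing target values from the closest complete donor record."""
    donor = min(donors, key=lambda d: minimax_distance(receptor, d, aux_vars))
    imputed = dict(receptor)
    for v in target_vars:
        if imputed[v] is None:          # missing value: take the donor's value
            imputed[v] = donor[v]
    return imputed
```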

Ratio hot deck imputation Modified version of nearest neighbor hot deck for variables that are part of a balance edit Calculate the difference between the total variable and the sum of the observed components this difference equals the sum of the missing components The sum of the missing components is distributed over the missing components using the ratios (of missing components to sum of missing components) from the donor record level of imputed components is determined by the total variable but their ratios are determined by the donor imputed and observed components add up to the total

Example of ratio hot deck P + C = T Record to be imputed given by T = 400, P = ?, C = ? Donor record T = 100, P = 25, C = 75 Imputed record T = 400, P = 100, C = 300
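
A minimal sketch (function and variable names are illustrative) of ratio hot deck for the balance edit P + C = T, reproducing the example above:

```python
# A minimal sketch of ratio hot deck: missing components share out the
# remainder of the total using the donor's ratios.
def ratio_hot_deck(receptor, donor, total="T", components=("P", "C")):
    missing = [c for c in components if receptor[c] is None]
    observed_sum = sum(receptor[c] for c in components if receptor[c] is not None)
    remainder = receptor[total] - observed_sum        # sum of missing components
    donor_sum = sum(donor[c] for c in missing)
    imputed = dict(receptor)
    for c in missing:
        imputed[c] = remainder * donor[c] / donor_sum  # donor ratio x remainder
    return imputed

rec = {"T": 400, "P": None, "C": None}
don = {"T": 100, "P": 25, "C": 75}
print(ratio_hot_deck(rec, don))   # {'T': 400, 'P': 100.0, 'C': 300.0}
```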

Consistency If imputed values violate edits, adjust them slightly Observed values are not adjusted Minimize Σ_i w_i |Y_i,final – Y_i,imp| subject to the restriction that the Y_i,final in combination with the observed values satisfy all edits Y_i,imp: imputed values (possibly failing edits) Y_i,final: final values w_i: user-specified weights As numerical edits are generally linear (in)equalities, the resulting problem is a linear programming problem
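
A minimal sketch (scipy, illustrative numbers) of the adjustment as a linear program; each adjustment Y_i,final – Y_i,imp is written as a difference u – v of two non-negative variables so that |Y_i,final – Y_i,imp| becomes u + v in the objective:

```python
# Consistency adjustment as a linear program. Observed T = 400 is kept fixed;
# imputed P = 90, C = 320 violate the balance edit T = P + C.
from scipy.optimize import linprog

w_P, w_C = 1.0, 1.0                     # user-specified weights
c = [w_P, w_P, w_C, w_C]                # minimize w_P*(u_P + v_P) + w_C*(u_C + v_C)
A_eq = [[1, -1, 1, -1]]                 # (90 + u_P - v_P) + (320 + u_C - v_C) = 400
b_eq = [400 - 90 - 320]                 # right-hand side: -10
res = linprog(c, A_eq=A_eq, b_eq=b_eq)  # bounds default to u, v >= 0

u_P, v_P, u_C, v_C = res.x
P_final, C_final = 90 + u_P - v_P, 320 + u_C - v_C   # now P_final + C_final == 400
```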

Consistency Prerequisite: it should be possible to find values Y_i,final such that all edits become satisfied this is the case if the Fellegi-Holt paradigm has been applied to identify errors Instead of first imputing and then adjusting values, a better (but more complicated) approach is to impute under the restriction that the edits become satisfied see the doctorate thesis by Caren Tempelman (Statistics Netherlands)

Conclusion All editing and imputation methods have their own (dis)advantages Integrated use of editing techniques (selective editing, interactive editing, automatic editing and macro-editing) as well as various imputation techniques can improve the efficiency of the SDE and imputation process while at the same time maintaining or even enhancing the statistical quality of the produced data