Evaluating the Quality of Editing and Imputation: the Simulation Approach M. Di Zio, U. Guarnera, O. Luzi, A. Manzari ISTAT – Italian Statistical Institute UN/ECE Work Session on Statistical Data Editing Ottawa, May 2005

Outline
- Introduction
- The simulation approach
- Performance indicators
- An example: the Istat software ESSE

Quality of E&I = Accuracy
- Accuracy at micro level: capability of editing to correctly identify errors, and of imputation to correctly recover the true data
- Accuracy at macro level: capability of editing/imputation to preserve the data distributions and the target estimates
- The quality of E&I in terms of accuracy can be measured only when it is possible to compare the edited and imputed data with the corresponding true values

Why evaluate the quality of E&I
- Analyse the performance of an editing/imputation method
  - for a specific type of data/error
  - under different data/error scenarios
- Improve the performance of an editing/imputation method for a specific type of data/error
- Choose among alternative editing/imputation methods for a specific type of data/error

The evaluation framework
"E&I represent additional sources of non-sampling errors in the statistical production process."
[Flow diagram: true values (from a super-population or finite population) are corrupted by error/missing mechanisms into observed values; the editing model localizes the errors and the imputation model produces the final values.]

Evaluating the quality of E&I
- The evaluation of the quality of editing and/or imputation has to be performed taking into account the other mechanisms involved in the statistical production process
- This corresponds to measuring the effects on the data induced by the editing and/or imputation mechanisms, conditionally on the other mechanisms influencing the survey results

The simulation approach
Artificial generation of some of the key elements of the evaluation framework, based on predefined mechanisms/models
- Controlled experiments:
  - data distributions and data relations
  - error and missing data mechanisms
  - error and missing data incidence
- Variability due to each stochastic mechanism (repeated simulations)
- Low cost

The simulation approach
- High modelling effort:
  - true data
  - raw data
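To make the controlled-experiment idea concrete, here is a minimal Python sketch (not part of the ESSE tool, which is written in SAS): it repeatedly generates toy "true" data, corrupts them with a simple error mechanism, applies a naive edit-and-impute step, and accumulates an accuracy measure across replications. All distributions, rates and rules are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

def run_experiment(n=1000, n_rep=200, error_rate=0.05):
    """Toy controlled experiment: generate true data, corrupt them, apply a
    naive E&I step, and measure recovery of the true mean over replications."""
    rel_bias = []
    for _ in range(n_rep):
        true = rng.lognormal(mean=3.0, sigma=0.5, size=n)   # simulated "true" data
        raw = true.copy()
        hit = rng.random(n) < error_rate                    # errors completely at random
        raw[hit] *= 1000.0                                  # e.g. unit-of-measure error
        flagged = raw > np.quantile(raw, 0.95)              # naive editing rule
        final = raw.copy()
        final[flagged] = np.median(raw[~flagged])           # naive imputation
        rel_bias.append((final.mean() - true.mean()) / true.mean())
    return float(np.mean(rel_bias)), float(np.std(rel_bias))

print(run_experiment())   # average relative bias and its simulation variability
```

Repeating the whole cycle many times is what allows the variability due to each stochastic mechanism to be assessed, as noted on the previous slide.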

Simulation of true data
Let (X_1, …, X_p) be a random vector with probability distribution F(x_1, …, x_p; θ), where F and θ are unknown.
- Parametric approaches (specify a data model; estimate its parameters; re-sampling techniques)
- Non-parametric approaches (no distributional assumptions; re-sampling techniques)
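One possible reading of the two approaches in code (illustrative only; the lognormal model and the reference sample are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# A reference sample standing in for clean survey data (purely illustrative).
sample = rng.lognormal(mean=3.0, sigma=0.7, size=500)

# Parametric approach: specify a data model (here lognormal), estimate its
# parameters from the sample, then draw simulated "true" data from the fit.
mu_hat, sigma_hat = np.log(sample).mean(), np.log(sample).std(ddof=1)
true_parametric = rng.lognormal(mean=mu_hat, sigma=sigma_hat, size=1000)

# Non-parametric approach: no distributional assumptions, re-sample the
# observed records with replacement (bootstrap).
true_nonparametric = rng.choice(sample, size=1000, replace=True)
```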

Simulation of true data: additional problems
- Modelling multivariate distributions (reproducing joint relations/dependencies between variables)
- Modelling asymmetric multivariate distributions
- Modelling under edit constraints
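One common way to handle the last point is rejection sampling: draw from a multivariate model and keep only records that satisfy the edit rules. The sketch below assumes a bivariate lognormal for (turnover, employment) and a hypothetical ratio edit; both are invented for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_under_edits(n, accept):
    """Rejection sampler: draw skewed, correlated (turnover, employment) pairs
    and keep only the records that satisfy the edit rules."""
    kept = []
    while len(kept) < n:
        z = rng.multivariate_normal([4.0, 1.5], [[0.6, 0.4], [0.4, 0.5]], size=n)
        for rec in np.exp(z):                     # lognormal: positive and asymmetric
            if accept(rec):
                kept.append(rec)
    return np.array(kept[:n])

# Hypothetical ratio edit: turnover per employee between 5 and 500 (arbitrary units).
ratio_edit = lambda rec: 5.0 <= rec[0] / rec[1] <= 500.0
true_data = simulate_under_edits(1000, ratio_edit)
```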

Simulation of raw data
Parametric/non-parametric approaches for:
- Generating missing data
- Generating errors (deviations from the true data)

Simulation of missing data
- Assumptions on the non-response mechanism (MCAR, MAR, NMAR)
- Assumptions on the incidence of non-response (non-response rates)
- In multivariate contexts, modelling patterns of non-response:
  - assumptions on multivariate non-response mechanisms (e.g. independence)
  - assumptions on rates of non-response patterns
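For instance, MCAR and MAR non-response can be generated as in the sketch below; the rates, the size-class covariate and the outcome model are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
size_class = rng.integers(1, 4, size=n)                            # always-observed covariate
y = rng.lognormal(mean=3.0 + 0.5 * size_class, sigma=0.4, size=n)  # target variable

# MCAR: every unit has the same probability of not responding.
mcar = rng.random(n) < 0.10

# MAR: the non-response probability depends only on the observed covariate
# (here, large units respond less often), not on the value of y itself.
p_mar = np.where(size_class == 3, 0.20, 0.05)
mar = rng.random(n) < p_mar

y_mcar, y_mar = y.copy(), y.copy()
y_mcar[mcar] = np.nan
y_mar[mar] = np.nan
```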

Simulation of errors
- Assumptions on the error mechanism (EAR, ECAR, ENAR)
- Assumptions on the incidence of errors (error rates)
- Assumptions on the intensity of errors (error magnitude; intermittent nature of errors)
- In a multivariate context, modelling error patterns:
  - assumptions on multivariate error mechanisms (e.g. independence)
  - assumptions on rates of error patterns
- Overlapping mechanisms (e.g. stochastic + systematic)
- Simulation of errors under constraints
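A sketch of an ECAR mechanism with an explicit intensity model, overlapping a systematic unit-of-measure error with a random multiplicative deviation; all rates and magnitudes are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
true = rng.lognormal(mean=3.0, sigma=0.5, size=1000)
raw = true.copy()

# Incidence: Errors Completely At Random with a fixed error rate.
hit = rng.random(true.size) < 0.05

# Intensity: overlap a systematic "thousand factor" error with a random
# multiplicative deviation (stochastic + systematic mechanisms).
systematic = rng.random(hit.sum()) < 0.5
factor = np.where(systematic, 1000.0,
                  np.exp(rng.normal(0.0, 0.8, size=hit.sum())))
raw[hit] = true[hit] * factor
```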

How to measure: evaluation indicators under the simulation approach
- Evaluation objectives:
  - accuracy at micro level
  - accuracy w.r.t. distributions and target estimates
- Indicators:
  - level (micro/macro; local/global)
  - identification
  - priority

An Istat tool for evaluating E&I under the simulation approach
ESSE (Editing Systems Standard Evaluation) system (SAS language + SAS/AF environment)
- Module for raw data simulation
- Module for evaluation

Module for raw data simulation
- Approach: non-parametric
- Missing data mechanisms: MCAR, MAR and independent non-responses
- Error mechanisms: Errors Completely At Random (ECAR) and independent errors (e.g. misplacement errors, interchange of values, interchange errors, loss or addition of zeroes, …)
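The error labels above can be read as simple value transformations; the functions below are my own interpretation of "loss or addition of zeroes" and "interchange errors" for illustration, not code taken from ESSE.

```python
import numpy as np

rng = np.random.default_rng(4)

def lose_or_add_zero(x):
    """Divide or multiply by 10: a crude stand-in for 'loss or addition of zeroes'."""
    return x / 10.0 if rng.random() < 0.5 else x * 10.0

def interchange_digits(x):
    """Swap two adjacent digits of the integer part ('interchange errors')."""
    digits = list(str(int(round(x))))
    if len(digits) >= 2:
        i = int(rng.integers(0, len(digits) - 1))
        digits[i], digits[i + 1] = digits[i + 1], digits[i]
    return float("".join(digits))

values = rng.lognormal(mean=4.0, sigma=0.5, size=5)
corrupted = [interchange_digits(v) if rng.random() < 0.5 else lose_or_add_zero(v)
             for v in values]
```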

Module for evaluation: assumptions
- Editing is a classification procedure that assigns each raw value to one of two states: (1) acceptable, (2) not acceptable
- Imputation affects only values previously classified by the editing process as not acceptable
- Imputation is successful if the newly assigned value equals the original (true) one

Module for evaluation
- Evaluation objective: assessing the accuracy of E&I at micro level (capability to detect as many errors as possible; capability to restore the true values)
- Evaluation approach: single application of E&I (no variability assessment)
- Evaluation level: micro
- Indicators: local indicators (hit rates) based on the numbers of detected, undetected, introduced and corrected errors
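Under those assumptions, the local indicators can be computed from a cross-classification of the editing flags against the true error status. The definitions below are an illustrative reading of "detected, undetected, introduced and corrected errors"; the exact ESSE formulas are not given on the slide.

```python
import numpy as np

def micro_indicators(true, raw, flagged, final):
    """Hit-rate style indicators for a single E&I run (illustrative definitions)."""
    in_error = raw != true
    detected = flagged & in_error            # true errors flagged by editing
    undetected = ~flagged & in_error         # true errors missed by editing
    introduced = flagged & ~in_error         # correct values wrongly flagged
    corrected = flagged & (final == true)    # flagged values restored to the truth
    return {
        "detection hit rate": detected.sum() / max(in_error.sum(), 1),
        "false hit rate": introduced.sum() / max(flagged.sum(), 1),
        "imputation hit rate": corrected.sum() / max(flagged.sum(), 1),
        "undetected errors": int(undetected.sum()),
    }

true_v  = np.array([10, 20, 30, 40])
raw_v   = np.array([10, 200, 30, 40])              # one error, in position 1
flag_v  = np.array([False, True, False, True])     # editing flags positions 1 and 3
final_v = np.array([10, 20, 30, 35])               # position 1 recovered, 3 over-imputed
print(micro_indicators(true_v, raw_v, flag_v, final_v))
```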

Future work at ISTAT
- Identify standard measures to assess the accuracy of E&I at macro level
- Simulate multivariate patterns of errors/missing values (dependent errors/non-responses)
- Evaluate the impact of E&I on variability at micro/macro level