An R package for selective editing based on a latent class model

An R package for selective editing based on a latent class model
M.T. Buglielli, M. Di Zio, U. Guarnera and F. R. Pogelli
Istat – Italy
UNECE Conference, Work Session on Statistical Data Editing
Ljubljana, Slovenia, 9-11 May 2011

Introduction
Selective editing looks for units affected by important errors, so that accurate reviewing can be limited to those units.
Error quantification - Observations are prioritised according to the values of a score function that expresses the impact of their potential error on the estimates of interest.
Accuracy level - Units with a score above a given threshold are selected, since they are the observations most likely to be affected by important errors.
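As a rough illustration of the prioritisation idea, the sketch below computes a generic score for each unit as the weighted absolute difference between the observed and the anticipated value, relative to the anticipated total, and selects the units above a threshold. The function names, the form of the score and the toy numbers are assumptions made for illustration; they are not taken from the slides or from SeleMix.

```python
def local_scores(observed, anticipated, weights):
    """Generic score: weighted absolute difference between observed and
    anticipated values, relative to the anticipated weighted total.
    (Illustrative form only; not the SeleMix score.)"""
    total = sum(w * a for w, a in zip(weights, anticipated))
    return [w * abs(y - a) / total
            for y, a, w in zip(observed, anticipated, weights)]


def above_threshold(scores, threshold):
    """Indices of the units whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Toy data (hypothetical): the second unit is far from its anticipated value.
observed = [100, 250, 90]
anticipated = [100, 50, 100]
weights = [1, 1, 1]

scores = local_scores(observed, anticipated, weights)
selected = above_threshold(scores, 0.05)  # -> [1]
```

A score of this kind mixes genuine errors with natural variability, so it gives only a ranking of the units and not a direct measure of the accuracy of the estimates.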

Problems
The score function is generally based on the difference between observed and "anticipated" values. The problem is that these differences are due both to errors and to the natural variability of the phenomenon.
Score values cannot be interpreted as a direct evaluation of the accuracy of estimates.
Without historical (true and contaminated) information, it is not possible to select the most influential units in such a way that a prefixed level of accuracy for the target estimates is attained.

A latent model approach – The contamination model
Using a latent model for true data and errors makes it possible to distinguish the error component from the variability component of the residuals. The score value of an observation is then directly interpreted as the expected error of the unit.
The method estimates the probability of being in error and the error impact, which, suitably combined, determine the conditional expected error.
In this framework, units can be selected by estimating the expected error left in the data once the selected units are restored (even without historical information).
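To make the combination of error probability and error impact concrete, here is a minimal univariate sketch. It assumes a two-component Gaussian mixture in which true values follow N(mu, sigma2) and erroneous observations follow N(mu, lam * sigma2) with lam > 1, and a prior error probability w. The symbols mu, sigma2, lam and w and the function names are assumptions introduced for this example; in practice the parameters would be estimated from the data.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def expected_error(y, mu, sigma2, w, lam):
    """Posterior error probability and conditional expected error of one
    observation under the assumed contamination mixture.

    Assumed model: y = y_true + I * e, with y_true ~ N(mu, sigma2),
    e ~ N(0, (lam - 1) * sigma2) and P(I = 1) = w.
    """
    clean = (1.0 - w) * normal_pdf(y, mu, sigma2)
    contaminated = w * normal_pdf(y, mu, lam * sigma2)
    tau = contaminated / (clean + contaminated)  # P(in error | y)
    # Given an error, E[y_true | y] = mu + (y - mu) / lam, so the
    # expected error y - E[y_true | y] is (1 - 1/lam) * (y - mu).
    impact = (1.0 - 1.0 / lam) * (y - mu)
    return tau, tau * impact


# Hypothetical parameter values, for illustration only.
tau_near, err_near = expected_error(0.5, mu=0.0, sigma2=1.0, w=0.05, lam=4.0)
tau_far, err_far = expected_error(10.0, mu=0.0, sigma2=1.0, w=0.05, lam=4.0)
```

An observation close to the mean receives a small error probability and hence a small expected error, while a distant one is almost certainly classified as erroneous and its expected error approaches the full error impact.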

SeleMix, a software for selective editing
SeleMix is an R package for the selection of influential errors according to the contamination model. It:
- implements the ECM algorithm developed to estimate the model parameters
- computes local and global scores
- returns the set of observations affected by influential errors with respect to a prefixed level of accuracy of the target estimates.
Moreover, it provides anticipated values (predictions) for each unit, for both observed and non-observed variables. The imputation can be considered "robust" in that the model used to compute the anticipated values takes into account the presence of errors in the data.

SeleMix functions
The package is composed of three functions: ml.est, pred.y and sel.edit.
ml.est - Estimates the parameters of the model using a purpose-built ECM algorithm. The output is a list of model parameters, anticipated values, BIC and AIC scores, outlier flags and posterior probabilities.
pred.y - Predicts the true values of the Y variables through their expected value conditional on all the available information. It returns, for each unit, a "prediction" for both observed and missing items of each Y variable, the outlier flag and the posterior probability.

SeleMix
sel.edit - Prioritises observations according to the score function values and flags the units to be edited so that the expected residual error is below a prefixed level of accuracy. The output of sel.edit is a matrix containing the flag of influential units, the observed and anticipated values ordered by the global score, and the local scores.
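The selection rule can be sketched as follows. This is not the sel.edit implementation: it assumes that each unit already has a score equal to its expected relative error on the target total, and it measures the residual error conservatively as the sum of the absolute scores of the units left unedited. The function name and the toy scores are hypothetical.

```python
def select_units(scores, accuracy):
    """Flag units for editing, largest |score| first, until the expected
    residual error of the unedited units falls below `accuracy`.

    Returns the selected indices (in priority order) and the residual.
    The residual is taken here as the sum of absolute scores of the
    unedited units, which is a conservative assumption of this sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    residual = sum(abs(s) for s in scores)
    selected = []
    for i in order:
        if residual < accuracy:
            break  # remaining units are left unedited
        selected.append(i)
        residual -= abs(scores[i])
    return selected, residual


# Hypothetical scores for four units and a 1% accuracy level.
scores = [0.002, 0.30, 0.004, 0.05]
selected, residual = select_units(scores, accuracy=0.01)
# selected -> [1, 3]; the two remaining units contribute about 0.006
```

Because the scores are interpreted as expected errors, the stopping point follows from the accuracy level itself and does not need to be calibrated on historical data.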

Warning
1) Model assumptions
True data are log-normal/normal.
Error is Gaussian and it inflates the covariance matrix.
However:
The Gaussian or log-normal assumption is frequently adopted.
Some experiments show that the model can be usefully applied in cases where the data depart from the assumptions.
2) The accuracy level is for estimates of totals (means).

Warning - edits
1) It is generally difficult to incorporate fatal edits in the model.
2) On the other hand, soft edits (which flag values that are anomalous but plausible) are implicitly considered, since units are classified as erroneous with a certain probability, and this probability is explicitly used in the computation of the score.

If you are interested…
The software can be freely downloaded from www.osor.eu, the Open Source Observatory and Repository for European public administrations (OSOR). In the future it will be made available on CRAN (R website).