Marcello D’Orazio UNECE - Work Session on Statistical Data Editing Ljubljana, Slovenia, 9-11 May 2011 Statistical.

Presentation transcript:

Statistical Matching and Imputation of Survey Data with the Package “StatMatch” for the R Environment
Marcello D’Orazio
UNECE Work Session on Statistical Data Editing, Ljubljana, Slovenia, 9-11 May 2011

What is Statistical Matching?

Statistical Matching (SM; also called data fusion or synthetic matching) is a set of methods for integrating two or more data sources that refer to the same target population.

Basic SM framework: source A contains the variables (Y, X); source B contains the variables (X, Z).
1. The X variables are common to A and B.
2. Y and Z are NOT jointly observed.
3. The chance of observing the same unit in both A and B is close to zero.

Objectives of Statistical Matching

- micro: derive a “synthetic” data set containing X, Y and Z (for example, A filled in with Z, giving records with Y, X and Z);
- macro: estimate parameters, such as correlation coefficients or frequencies.

The SM approaches (parametric, nonparametric, mixed) are cross-classified against these two objectives (macro, micro).

The package StatMatch for the R environment

“StatMatch” provides R functions to apply some Statistical Matching methods. It generalizes and optimizes the code provided with the monograph on SM by D’Orazio et al. (2006).

The first version of StatMatch (version 0.4) was released on CRAN (Comprehensive R Archive Network). At the beginning of 2011 a new version was released; it significantly improves on the functionality of the previous version (0.8, released in 2009).

The package is available for MS Windows (32 and 64 bit), Linux and Mac.

Functions in StatMatch

Five main groups of functions:
- functions to perform nonparametric SM at micro level by means of hot deck imputation (NND.hotdeck, RANDwNND.hotdeck, rankNND.hotdeck);
- a function to perform mixed SM at macro or micro level for continuous variables (mixed.mtc);
- functions to integrate data from complex sample surveys through calibration of weights, as proposed by Renssen (1998) (harmonize.x and comb.samples);
- functions to explore uncertainty on the contingency table Y x Z (Frechet.bounds.cat and Fbwidths.by.x);
- other functions to compute distances (gower.dist and maximum.dist), to create the synthetic data set (create.fused), etc.
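Gower's distance, one of the distances named above, handles numeric and categorical matching variables together. The following is a minimal sketch in Python, not the code of gower.dist: the function name, the dict-based records and the `ranges` argument are assumptions made for illustration, and missing values are not handled.

```python
def gower_distance(a, b, ranges):
    """Gower's distance between two records with mixed variable types.

    a, b   : dicts mapping variable name -> value (same keys in both).
    ranges : dict mapping each numeric variable -> its range (max - min);
             variables not listed here are treated as categorical.
    """
    total = 0.0
    for name in a:
        if name in ranges:
            # Numeric variable: absolute difference scaled by the range.
            rng = ranges[name]
            total += abs(a[name] - b[name]) / rng if rng > 0 else 0.0
        else:
            # Categorical variable: 0 if the categories agree, 1 otherwise.
            total += 0.0 if a[name] == b[name] else 1.0
    # Average of the per-variable contributions, so the result is in [0, 1].
    return total / len(a)


rec = {"age": 30, "sex": "F", "region": "N"}
don = {"age": 50, "sex": "F", "region": "S"}
d = gower_distance(rec, don, ranges={"age": 80})
```

Here the contributions are 20/80 for age, 0 for sex and 1 for region, so `d` is their average, 1.25/3.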

SM via hot deck imputation

NND.hotdeck(): nearest neighbour distance hot deck
- many distance functions
- imputation classes
- constrained or unconstrained matching

RANDwNND.hotdeck(): random hot deck and some variants
- random hot deck in classes
- random hot deck in “moving” classes
- weights can be used

rankNND.hotdeck(): nearest neighbour with the distance computed on the percentage points of the empirical cumulative distribution function of X
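The idea behind unconstrained nearest neighbour distance hot deck can be sketched in a few lines. This is an illustrative Python sketch, not the StatMatch implementation: the function name `nnd_hotdeck`, the list-of-dicts data layout and the choice of Manhattan distance are assumptions, and ties are resolved by taking the first donor found.

```python
def nnd_hotdeck(recipients, donors, match_vars, z_var, class_var=None):
    """Fill each recipient with z_var taken from its nearest donor.

    Unconstrained matching: a donor may be used more than once.
    Distances are Manhattan distances on the numeric match_vars.
    If class_var is given, donors are searched only within the
    recipient's own imputation class.
    """
    fused = []
    for rec in recipients:
        # Restrict the donor pool to the recipient's imputation class.
        pool = [d for d in donors
                if class_var is None or d[class_var] == rec[class_var]]
        if not pool:
            raise ValueError("no donors available for this recipient")
        # Nearest donor on the matching variables.
        nearest = min(
            pool,
            key=lambda d: sum(abs(rec[v] - d[v]) for v in match_vars),
        )
        out = dict(rec)
        out[z_var] = nearest[z_var]  # donated (imputed) value of Z
        fused.append(out)
    return fused


# Source A observes (X, Y); source B observes (X, Z).
A = [{"sex": "F", "age": 31, "y": 10},
     {"sex": "M", "age": 58, "y": 20}]
B = [{"sex": "F", "age": 29, "z": "a"},
     {"sex": "F", "age": 60, "z": "b"},
     {"sex": "M", "age": 40, "z": "c"},
     {"sex": "M", "age": 55, "z": "d"}]

fused = nnd_hotdeck(A, B, match_vars=["age"], z_var="z", class_var="sex")
```

With these toy data the first recipient receives "a" (donor aged 29) and the second receives "d" (donor aged 55). A constrained variant would additionally limit how often each donor can be used, which turns the search into an assignment problem.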

Mixed SM methods

mixed.mtc(): mixed SM methods for continuous variables, consisting of two steps:
1) fit the regression models Y vs. X and Z vs. X;
2) fill in A with units chosen by means of constrained distance hot deck, with distances computed on the intermediate and live (observed) values of Y and Z.

- two methods to estimate the regression parameters: maximum likelihood (ML) and the method of Moriarity & Scheuren (2001)
- auxiliary information on the correlation coefficient between Y and Z can be introduced
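Because Y and Z are never observed together, the data only restrict their correlation to an interval. For a single X variable, requiring the correlation matrix of (X, Y, Z) to be positive semidefinite gives the interval computed below, whose midpoint is the value implied by conditional independence of Y and Z given X. The sketch is in Python; the function name is hypothetical and the single-X setting is an assumption made to keep the formula short.

```python
import math


def rho_yz_bounds(rho_xy, rho_xz):
    """Admissible range of corr(Y, Z) given corr(X, Y) and corr(X, Z).

    Returns (lower, cia, upper), where `cia` is the value implied by
    the conditional independence of Y and Z given a single X.
    """
    cia = rho_xy * rho_xz
    half_width = math.sqrt((1.0 - rho_xy ** 2) * (1.0 - rho_xz ** 2))
    return cia - half_width, cia, cia + half_width


lower, cia, upper = rho_yz_bounds(0.6, 0.8)
```

For correlations 0.6 and 0.8 the interval runs from about 0 to about 0.96, with 0.48 at its centre. Auxiliary information on the correlation between Y and Z is only usable if it falls inside this interval.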

SM of data from complex sample surveys

Renssen’s (1998) approach is based on a series of calibration steps applied to the survey weights of A and B and, if available, of a third sample C (C may contain Y and Z, or X, Y and Z).

harmonize.x(): harmonizes the joint/marginal distribution of the X variables in A and B

comb.samples(): estimates the contingency table Y vs. Z, using the auxiliary information in C when available:
- Conditional Independence Assumption
- incomplete two-way stratification
- synthetic two-way stratification
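A much simplified picture of the harmonization step and of the estimate under the Conditional Independence Assumption is sketched below in Python. It is not Renssen's full procedure nor the StatMatch code: post-stratification on one categorical X is used as the simplest special case of calibration, and the function names, the field names ("x", "y", "z", "w") and the known totals are assumptions made for illustration.

```python
from collections import defaultdict


def poststratify(units, totals, x_var="x", w_var="w"):
    """Rescale weights so that weighted counts by X match `totals`."""
    sums = defaultdict(float)
    for u in units:
        sums[u[x_var]] += u[w_var]
    return [dict(u, **{w_var: u[w_var] * totals[u[x_var]] / sums[u[x_var]]})
            for u in units]


def cia_table(a_units, b_units, totals):
    """Estimate the Y x Z table assuming Y and Z independent given X.

    N(y, z) = sum over x of N_A(x, y) * N_B(x, z) / N(x),
    with both samples already harmonized to the same totals N(x).
    """
    n_xy = defaultdict(float)
    for u in a_units:
        n_xy[(u["x"], u["y"])] += u["w"]
    n_xz = defaultdict(float)
    for u in b_units:
        n_xz[(u["x"], u["z"])] += u["w"]
    table = defaultdict(float)
    for (x, y), w_y in n_xy.items():
        for (x2, z), w_z in n_xz.items():
            if x == x2:
                table[(y, z)] += w_y * w_z / totals[x]
    return dict(table)


totals = {"u": 60.0, "r": 40.0}  # assumed known population counts by X
A = [{"x": "u", "y": "lo", "w": 10.0},
     {"x": "u", "y": "hi", "w": 20.0},
     {"x": "r", "y": "lo", "w": 30.0}]
B = [{"x": "u", "z": "p", "w": 25.0},
     {"x": "u", "z": "q", "w": 25.0},
     {"x": "r", "z": "p", "w": 50.0}]

A_h = poststratify(A, totals)
B_h = poststratify(B, totals)
yz = cia_table(A_h, B_h, totals)
```

After harmonization both samples reproduce the same X totals, so the estimated Y x Z cells add up to the population size (100 in this toy example). When a sample C with Y and Z jointly observed is available, the independence assumption can be relaxed, which is what the two stratification options listed above address.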

Exploring uncertainty due to the SM basic framework

Frechet.bounds.cat(): derives the uncertainty bounds for the frequencies in the contingency table Y vs. Z, starting from the marginal tables X vs. Y and X vs. Z

Fbwidths.by.x(): explores how the various possible subsets of the X variables contribute to reducing the uncertainty on the cells of Y vs. Z
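The uncertainty bounds rest on the Fréchet inequalities applied within each category of X: for every x, max(0, P(y|x) + P(z|x) - 1) <= P(y, z|x) <= min(P(y|x), P(z|x)), and the bounds for P(y, z) follow by averaging over the distribution of X. A minimal Python sketch is given below; the function name and the nested-dict input layout are assumptions, not the interface of Frechet.bounds.cat.

```python
from collections import defaultdict


def frechet_bounds(p_x, p_y_given_x, p_z_given_x):
    """Bounds for the cells of the Y x Z table, conditioning on X.

    p_x         : dict x -> P(X = x)
    p_y_given_x : dict x -> dict y -> P(Y = y | X = x)
    p_z_given_x : dict x -> dict z -> P(Z = z | X = x)
    Returns two dicts keyed by (y, z): lower and upper bounds.
    """
    low = defaultdict(float)
    up = defaultdict(float)
    for x, px in p_x.items():
        for y, py in p_y_given_x[x].items():
            for z, pz in p_z_given_x[x].items():
                # Frechet inequalities within the category x.
                low[(y, z)] += px * max(0.0, py + pz - 1.0)
                up[(y, z)] += px * min(py, pz)
    return dict(low), dict(up)


p_x = {"a": 0.5, "b": 0.5}
p_y_given_x = {"a": {"y1": 0.8, "y2": 0.2}, "b": {"y1": 0.3, "y2": 0.7}}
p_z_given_x = {"a": {"z1": 0.9, "z2": 0.1}, "b": {"z1": 0.4, "z2": 0.6}}

low, up = frechet_bounds(p_x, p_y_given_x, p_z_given_x)
```

In this toy example the cell (y1, z1) is bounded between about 0.35 and 0.55. The more closely X predicts Y and Z, the narrower these intervals become, which is the effect that a comparison across subsets of the X variables is meant to reveal.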

Computational Efficiency

Hot deck methods compared (table columns: hot deck method; StatMatch function; # match vars; # imp. classes; process time in secs; notes):
- Unconstrained NND: NND.hotdeck, dist.fun="Gower"
- Constrained NND: NND.hotdeck, dist.fun="Gower", constr.alg="relax"
- Random hot deck: RANDwNND.hotdeck, dist.fun="Gower", cut.don="exact", k=10

Artificial data: A contains 14,000 obs.; about 54,000 obs. in B.
PC with CPU Pentium IV 3GHz, 3GB RAM, MS Windows XP Prof. (SP 3; 32bit).

All the functions in StatMatch are based on R code and there are no calls to other external code (compiled C or Fortran):

“Interpreted languages (Matlab, R, Python, Lisp) are fun... but slow. Compiled languages (machine code, assembly, FORTRAN, C, Java) are fast… but are work (= no fun)” Mizera (2006)

Warning!

“Although abusing R was not proved to be addictive, it should be noted that it often leads to harder stuff” Mizera (2006)

Thank You for Your attention!

Some References

D’Orazio, M. (2009) StatMatch: Statistical Matching. R package.
D’Orazio, M., Di Zio, M., and Scanu, M. (2006) Statistical Matching: Theory and Practice. Wiley and Sons, Chichester.
Mizera, I. (2006) “Graphical Exploratory Analysis Using Halfspace Depth”. Presentation at “useR!, The R User Conference 2006”, Wien, June 2006.
Moriarity, C., and Scheuren, F. (2001) “Statistical matching: a paradigm for assessing the uncertainty in the procedure”. Journal of Official Statistics, 17, 407–422.
Renssen, R.H. (1998) “Use of statistical matching techniques in calibration estimation”. Survey Methodology, 24.