Statistical Matching and Imputation of Survey Data with the Package “StatMatch” for the R Environment
Marcello D’Orazio (madorazi@istat.it)
UNECE - Work Session on Statistical Data Editing, Ljubljana, Slovenia, 9-11 May 2011

What is Statistical Matching?

Statistical Matching (SM; also called data fusion or synthetic matching) is a set of methods for integrating two or more data sources that refer to the same target population.

Basic SM framework: source A observes (Y, X); source B observes (X, Z).
1. The X variables are common to both sources.
2. Y and Z are NOT jointly observed.
3. The chance of observing the same unit in both A and B is close to zero.

Objectives of Statistical Matching

- micro: derive a “synthetic” data set containing X, Y and Z, e.g. A filled in with Z, giving (Y, X, Z)
- macro: estimate parameters such as correlation coefficients or frequencies

SM approaches are classified by objective (macro, micro) and by method (parametric, nonparametric, mixed).

The package StatMatch for the R environment

“StatMatch” provides R functions that implement some Statistical Matching methods.
- It generalizes and optimizes the code provided with the monograph on SM by D’Orazio et al. (2006).
- The first version of StatMatch (version 0.4) was released on CRAN (Comprehensive R Archive Network) in 2008.
- Version 1.0.1 was released at the beginning of 2011; it significantly improves on the functionality of the previous version (0.8, released in 2009).
- http://cran.at.r-project.org/web/packages/StatMatch/index.html
- Package available for MS Windows (32 and 64 bit), Linux and Mac.

Functions in StatMatch

Five main groups of functions:
- functions to perform nonparametric SM at micro level by means of hot deck imputation (NND.hotdeck, RANDwNND.hotdeck, rankNND.hotdeck);
- a function to perform mixed SM at macro or micro level for continuous variables (mixed.mtc);
- functions to integrate data from complex sample surveys through calibration of weights, as proposed by Renssen (1998) (harmonize.x and comb.samples);
- functions to explore uncertainty on the contingency table Y x Z (Frechet.bounds.cat and Fbwidths.by.x);
- other functions to compute distances (gower.dist and maximum.dist), to create the synthetic data set (create.fused), etc.
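To make the distance functions more concrete, here is a minimal Python sketch of Gower's distance between two records with mixed variable types. It is not the code of gower.dist: the function name `gower_distance`, the `ranges` argument and the equal weighting of variables are assumptions made for illustration, and missing values are not handled.

```python
def gower_distance(rec_a, rec_b, ranges):
    """Gower's distance between two records (illustrative sketch).

    ranges[k] is the observed range of variable k if it is numeric,
    or None if variable k is categorical (hypothetical convention).
    """
    total = 0.0
    for a, b, rng in zip(rec_a, rec_b, ranges):
        if rng is None:
            # categorical variable: 0 if the categories agree, 1 otherwise
            total += 0.0 if a == b else 1.0
        else:
            # numeric variable: absolute difference scaled by the range
            total += abs(a - b) / rng if rng > 0 else 0.0
    # equal weights: average of the per-variable contributions
    return total / len(ranges)


# age differs by 10 over a range of 50, sex differs: (0.2 + 1) / 2
d = gower_distance((30, "F"), (40, "M"), (50, None))
```

Each variable contributes a value between 0 and 1, so the distance itself stays in [0, 1] whatever the mix of numeric and categorical matching variables.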

SM via hot deck imputation

NND.hotdeck(): nearest neighbour distance hot deck
- many distance functions
- imputation classes
- constrained or unconstrained

RANDwNND.hotdeck(): random hot deck and some variants
- random hot deck in classes
- random hot deck in “moving” classes
- it is possible to use weights

rankNND.hotdeck(): nearest neighbour with distance computed on the percentage points of the empirical cumulative distribution function of X
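The idea behind unconstrained nearest neighbour distance hot deck within imputation classes can be sketched in a few lines of Python. This is only an illustration of the technique, not the NND.hotdeck implementation: the function `nnd_hotdeck`, its arguments, the Manhattan distance and the toy records are all assumptions.

```python
def nnd_hotdeck(recipients, donors, match_vars, donation_class=None):
    """Return, for each recipient, the index of its closest donor.

    Unconstrained: the same donor may be chosen more than once.
    Distance is Manhattan on the numeric matching variables (an
    assumption; other distance functions could be plugged in).
    """
    chosen = []
    for rec in recipients:
        best_idx, best_dist = None, float("inf")
        for idx, don in enumerate(donors):
            # restrict the search to donors in the same imputation class
            if donation_class is not None and rec[donation_class] != don[donation_class]:
                continue
            dist = sum(abs(rec[v] - don[v]) for v in match_vars)
            if dist < best_dist:
                best_idx, best_dist = idx, dist
        chosen.append(best_idx)  # stays None if the class has no donors
    return chosen


# Toy data: A observes (X, Y), B observes (X, Z); X = (sex, age)
A = [{"sex": "F", "age": 31, "y": 10}, {"sex": "M", "age": 58, "y": 20}]
B = [{"sex": "F", "age": 60, "z": 1.5},
     {"sex": "M", "age": 55, "z": 2.5},
     {"sex": "F", "age": 35, "z": 3.5}]

donor_idx = nnd_hotdeck(A, B, ["age"], donation_class="sex")
# Synthetic data set: A filled in with the Z values of the chosen donors
fused = [dict(rec, z=B[i]["z"]) for rec, i in zip(A, donor_idx)]
```

In this toy case the first recipient takes its Z value from the 35-year-old female donor and the second from the male donor. A constrained variant would additionally limit how often each donor can be used, which turns the search into an assignment problem.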

Mixed SM methods

mixed.mtc(): mixed SM methods for continuous variables, consisting of two steps:
1) fit regression models of Y vs. X and Z vs. X
2) fill A with units chosen by means of constrained distance hot deck, with the distance computed on intermediate and live (observed) values of Y and Z
- two methods to estimate the regression parameters: maximum likelihood (ML) and the method of Moriarity and Scheuren (2001)
- possibility of introducing auxiliary information about the correlation coefficient between Y and Z
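The two steps can be sketched as follows in Python, under strong simplifying assumptions: a single X variable, ordinary least squares, plain regression predictions used as intermediate values (no random residual term), Manhattan distance, files of equal size, and a brute-force search over permutations for the constrained step. None of the names below come from StatMatch, and the sketch is not the mixed.mtc algorithm.

```python
from itertools import permutations


def fit_line(x, y):
    """Least-squares intercept and slope of y on a single x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def mixed_match(xa, ya, xb, zb):
    """Impute Z in A; each record of B is used exactly once."""
    # Step 1: regressions Y vs. X (on A) and Z vs. X (on B)
    ay, by = fit_line(xa, ya)
    az, bz = fit_line(xb, zb)
    z_int_a = [az + bz * x for x in xa]  # intermediate Z values for A
    y_int_b = [ay + by * x for x in xb]  # intermediate Y values for B

    # Step 2: constrained matching on (Y, Z) pairs, comparing live and
    # intermediate values; brute force is only feasible for tiny files
    def cost(perm):
        return sum(abs(ya[i] - y_int_b[j]) + abs(z_int_a[i] - zb[j])
                   for i, j in enumerate(perm))

    best = min(permutations(range(len(xa))), key=cost)
    return [zb[j] for j in best]  # live Z values donated to A


# Toy data with exact linear relations: y = 2x in A, z = 10x in B
z_imputed = mixed_match([1, 2, 3], [2, 4, 6], [3, 1, 2], [30, 10, 20])
# → [10, 20, 30]
```

The imputed values are always live Z values observed in B; the regression step only steers which donor each recipient receives.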

SM of data from complex sample surveys

Renssen’s (1998) approach is based on a series of calibration steps applied to the survey weights of A and B and, if available, of a third source C (C may contain Y and Z, or X, Y and Z).

harmonize.x(): harmonizes the joint/marginal distribution of the X variables in A and B

comb.samples(): estimates the contingency table Y vs. Z, using the auxiliary information in C when available:
- Conditional Independence Assumption
- incomplete two-way stratification
- synthetic two-way stratification
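As a rough illustration of what harmonizing the distribution of X means, the Python sketch below post-stratifies the weights of two samples so that both reproduce the same totals for one categorical X variable. Post-stratification is only the simplest special case of calibration; the function `poststratify`, the assumption that the target totals are known, and the toy weights are illustrative and do not reproduce harmonize.x.

```python
def poststratify(weights, x_cats, totals):
    """Rescale weights so that weighted counts of X match `totals`."""
    estimated = {}
    for w, c in zip(weights, x_cats):
        estimated[c] = estimated.get(c, 0.0) + w
    # each weight is multiplied by target total / estimated total
    return [w * totals[c] / estimated[c] for w, c in zip(weights, x_cats)]


totals = {"urban": 30, "rural": 30}  # assumed known population totals

# Sample A and sample B initially disagree on the distribution of X
w_a = poststratify([10, 10, 20], ["urban", "urban", "rural"], totals)
w_b = poststratify([25, 5, 10], ["urban", "rural", "rural"], totals)
# w_a → [15.0, 15.0, 30.0]; w_b → [30.0, 10.0, 20.0]
```

After this step the two samples give identical weighted totals for X, which is the precondition for combining estimates that involve Y (from A) and Z (from B).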

Exploring uncertainty due to SM basic framework

Frechet.bounds.cat(): derives the uncertainty bounds for the frequencies in the contingency table Y vs. Z, starting from the marginal tables X vs. Y and X vs. Z

Fbwidths.by.x(): explores how the various possible subsets of the X variables contribute to reducing the uncertainty on the cells of Y vs. Z
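For each category x, the Fréchet inequalities give max{0, P(y|x) + P(z|x) - 1} <= P(y,z|x) <= min{P(y|x), P(z|x)}; averaging both sides over the distribution of X yields bounds for each cell of Y vs. Z. The Python sketch below computes them from conditional distributions. The function name `frechet_bounds`, the dictionary-based inputs and the toy probabilities are assumptions for illustration, not the interface of Frechet.bounds.cat.

```python
def frechet_bounds(p_x, p_y_given_x, p_z_given_x):
    """Lower and upper bounds for P(Y=y, Z=z), conditioning on X."""
    bounds = {}
    for x, px in p_x.items():
        for y, py in p_y_given_x[x].items():
            for z, pz in p_z_given_x[x].items():
                low, up = bounds.get((y, z), (0.0, 0.0))
                bounds[(y, z)] = (
                    low + px * max(0.0, py + pz - 1.0),  # Fréchet lower bound
                    up + px * min(py, pz),               # Fréchet upper bound
                )
    return bounds


p_x = {"a": 0.5, "b": 0.5}
p_y_given_x = {"a": {"y1": 0.75, "y2": 0.25}, "b": {"y1": 0.25, "y2": 0.75}}
p_z_given_x = {"a": {"z1": 0.75, "z2": 0.25}, "b": {"z1": 0.5, "z2": 0.5}}

bounds = frechet_bounds(p_x, p_y_given_x, p_z_given_x)
low, up = bounds[("y1", "z1")]
# → low = 0.25, up = 0.5
```

In this toy example the marginals are P(y1) = 0.5 and P(z1) = 0.625, so the bounds that ignore X would be [0.125, 0.5]; conditioning on X raises the lower bound to 0.25. Comparing such interval widths across subsets of the X variables is the kind of exploration the slide attributes to Fbwidths.by.x.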

Computational Efficiency

Hot deck methods:
- Unconstrained NND: NND.hotdeck; 4 match variables; 36 imputation classes; process time 1282 secs; dist.fun="Gower"
- Constrained NND: NND.hotdeck; 4 match variables; 36 imputation classes; process time 1446 secs; dist.fun="Gower", constr.alg="relax"
- Random hot deck: RANDwNND.hotdeck; 4 match variables; 36 imputation classes; process time 1936 secs; dist.fun="Gower", cut.don="exact", k=10

Artificial data: A contains 14,000 obs.; about 54,000 obs. in B.
PC with CPU Pentium IV 3GHz, 3GB RAM, MS Windows XP Prof. (SP 3; 32bit)

All the functions in StatMatch are based on R code and there are no calls to other external code (compiled C or Fortran):
“Interpreted languages (Matlab, R, Python, Lisp) are fun... but slow. Compiled languages (machine code, assembly, FORTRAN, C, Java) are fast… but are work (= no fun)” Mizera (2006)

Warning!
“Although abusing R was not proved to be addictive, it should be noted that it often leads to harder stuff” Mizera (2006)

Thank you for your attention!

Some References

D’Orazio, M. (2009) StatMatch: Statistical Matching. R package version 1.0.1. http://CRAN.R-project.org/package=StatMatch
D’Orazio, M., Di Zio, M. and Scanu, M. (2006) Statistical Matching: Theory and Practice. Wiley and Sons, Chichester.
Mizera, I. (2006) “Graphical Exploratory Analysis Using Halfspace Depth”. Presentation at “useR!, The R User Conference 2006”, Wien, 15-17 June 2006.
Moriarity, C. and Scheuren, F. (2001) “Statistical matching: a paradigm for assessing the uncertainty in the procedure”. Journal of Official Statistics, 17, pp. 407-422.
Renssen, R.H. (1998) “Use of statistical matching techniques in calibration estimation”. Survey Methodology, 24, pp. 171-183.
