Eurostat Statistical Matching using auxiliary information Training Course «Statistical Matching» Rome, 6-8 November 2013 Marco Di Zio Dept. Integration,


Eurostat Statistical Matching using auxiliary information Training Course «Statistical Matching» Rome, 6-8 November 2013 Marco Di Zio Dept. Integration, Quality, Research and Production Networks Development Department, Istat dizio [at] istat.it

Outline
• The problem
• Auxiliary information
• Auxiliary information in parametric models
• Auxiliary information in nonparametric models
• References

The problem
• Let $A \cup B$ be a sample of $n_A + n_B$ i.i.d. observations from $f(x, y, z)$, with Z missing on the records of A and Y missing on the records of B.
• Two alternative models are identifiable for $A \cup B$: the CIA (conditional independence assumption) and the PIA.
• The reason is that those models involve only the distributions of X, Y|X and Z|X.
• When the CIA (or PIA) is not adequate for our problem, auxiliary information is needed (if we want a point estimate).

Example. The normal case
• $(X, Y, Z) \sim N(\mu, \Sigma)$
• The inestimable parameter is $\sigma_{yz}$ (or equivalently $\rho_{yz}$)
• Under the CIA this is $\sigma_{yz} = \sigma_{xy}\sigma_{xz} / \sigma_x^2$
• In general it holds $\sigma_{yz} = \sigma_{xy}\sigma_{xz} / \sigma_x^2 + \sigma_{yz|x}$
• We need information to fill the gap: $\sigma_{yz|x} = ?$ (or $\rho_{yz|x}$)
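The same decomposition can be written on the correlation scale, which shows that the partial correlation is the only missing ingredient. The form below is the standard trivariate normal identity, assuming X, Y and Z are all univariate:

```latex
\begin{aligned}
\sigma_{yz|x} &= \sigma_{yz} - \frac{\sigma_{xy}\,\sigma_{xz}}{\sigma_x^{2}}, \\
\rho_{yz} &= \rho_{xy}\,\rho_{xz}
  + \rho_{yz|x}\,\sqrt{\bigl(1-\rho_{xy}^{2}\bigr)\bigl(1-\rho_{xz}^{2}\bigr)} .
\end{aligned}
```

Setting $\rho_{yz|x} = 0$ gives the CIA value $\rho_{yz} = \rho_{xy}\rho_{xz}$. Letting $\rho_{yz|x}$ range over $[-1, 1]$ gives the interval of values of $\rho_{yz}$ that are compatible with the estimable correlations.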

Eurostat Regression where

Auxiliary information
• In general there are two different kinds of auxiliary information:
1) a third file C where either (X, Y, Z) or (Y, Z) are jointly observed
2) a plausible value of the inestimable parameters of either (Y, Z|X) or (Y, Z)

Sources
Sources may not be perfect:
• an outdated statistical investigation;
• an administrative register;
• a supplemental (even small) ad hoc survey;
• proxy variables (Y°, Z°)

Auxiliary information on parameters
Previous surveys, assumptions made by the researcher, or proxy variables may suggest a value $\theta^*$ for the non-estimable parameters. Two kinds of information:
• information about $\theta_{yz|x}$
• information about $\theta_{yz}$

Auxiliary information on parameters
• Consequences of information on parameters:
• It restricts the parameter space $\Theta$ to a subspace $\Theta^* \subset \Theta$
• $\Theta^*$ contains all the parameters in $\Theta$ compatible with the auxiliary information

Auxiliary information and likelihood
• Combining estimates and auxiliary information is easier when the information is about $\theta_{yz|x}$
• In general, the pdf $f(x, y, z; \theta)$ may be written as
$f(x, y, z; \theta) = f_X(x; \theta_X)\, f_{YZ|X}(y, z \mid x; \theta_{YZ|X})$
• where $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $z \in \mathcal{Z}$, and the parameter space $\Theta = \{\theta\}$ is reparametrised in two sets, $\Theta_X = \{\theta_X\}$ and $\Theta_{YZ|X} = \{\theta_{YZ|X}\}$.

Auxiliary information about $\theta_{yz|x}$
This information is precious but rarely available.
• An interesting case is when $(X, Y, Z) \sim N(\mu, \Sigma)$.
• In this case the only information required is on $\rho_{YZ|X}$.
Algorithm for the MLE
• estimate $\theta_X$ on $A \cup B$
• estimate $\theta_{Y|X}$ on A and $\theta_{Z|X}$ on B
• with the previous estimates and $\rho_{YZ|X} = \rho^*_{YZ|X}$ we can compute the residual covariance $\sigma_{YZ|X}$ and hence $\sigma_{YZ}$
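The three steps of the algorithm can be sketched in Python for the simplest setting. This is a minimal sketch, not the course's implementation: it assumes X, Y and Z are univariate, that file A holds (x, y) and file B holds (x, z), and the function name `mle_normal_with_partial_corr` is hypothetical.

```python
import numpy as np

def mle_normal_with_partial_corr(xa, ya, xb, zb, rho_yz_x):
    """Sketch (assumption: univariate X, Y, Z): ML estimate of cov(Y, Z)
    from A = (x, y), B = (x, z) and an externally supplied rho_{YZ|X}."""
    xa, ya, xb, zb = (np.asarray(v, dtype=float) for v in (xa, ya, xb, zb))

    # Step 1: parameters of X from the concatenated file A U B.
    x_all = np.concatenate([xa, xb])
    var_x = x_all.var()  # ML variance (divides by n)

    # Step 2: regression of Y on X in A, and of Z on X in B.
    beta_yx = np.cov(xa, ya, bias=True)[0, 1] / xa.var()
    beta_zx = np.cov(xb, zb, bias=True)[0, 1] / xb.var()
    var_y_x = np.var(ya - beta_yx * xa)  # residual variance of Y given X
    var_z_x = np.var(zb - beta_zx * xb)  # residual variance of Z given X

    # Step 3: plug in the auxiliary partial correlation.
    cov_yz_x = rho_yz_x * np.sqrt(var_y_x * var_z_x)
    cov_yz = cov_yz_x + beta_yx * beta_zx * var_x
    var_y = var_y_x + beta_yx ** 2 * var_x
    var_z = var_z_x + beta_zx ** 2 * var_x
    return {
        "cov_yz_x": cov_yz_x,
        "cov_yz": cov_yz,
        "rho_yz": cov_yz / np.sqrt(var_y * var_z),
    }
```

Because the auxiliary value constrains only the conditional part of the likelihood, the estimates of steps 1 and 2 are left untouched, which is why this case is the easy one.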

Auxiliary information about $\theta_{yz}$
• This information is more problematic
• This info does not guarantee a unique MLE (see e.g. the multinomial distribution).
• It is not an easy task to combine this info with estimates obtained from $A \cup B$.
• It requires constrained maximum likelihood approaches

Auxiliary information about $\theta_{yz}$
• This info does not guarantee a unique MLE
• We cannot estimate a log-linear model like
However we can estimate

Auxiliary information about $\theta_{yz}$
Normal distribution
• This info guarantees a unique MLE
The only parameter involving Y, Z is $\rho_{yz}$. Information on it is sufficient to fill the lack of knowledge.

Auxiliary information about $\theta_{yz}$
Let us estimate $\rho_{yx}$ and $\rho_{zx}$ with $\hat\rho_{yx}$ and $\hat\rho_{zx}$, and let $\rho_{yz} = \rho^*_{yz}$. There are two possibilities:
1) Auxiliary info is compatible with estimates

Auxiliary information about $\theta_{yz}$
2) Auxiliary info is NOT compatible with estimates

Example: Auxiliary info on $\theta_{yz} = \rho_{yz}$
Let us suppose that
Value $\rho^*_{yz} = 0.7$ is compatible ($\det(\Sigma) > 0$), while $\rho^*_{yz} = 0.9$ is not compatible ($\det(\Sigma) = -0.008$).
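The compatibility check in this example amounts to asking whether the completed correlation matrix is positive definite. The sketch below illustrates it in Python. The estimated correlations used in the usage note ($\rho_{xy} = 0.6$, $\rho_{xz} = 0.9$) are assumptions: the slide's own values are not preserved in the transcript, and these were chosen only because they reproduce the determinant of -0.008 quoted for $\rho^*_{yz} = 0.9$. The function names are hypothetical.

```python
import numpy as np

def corr_matrix(rho_xy, rho_xz, rho_yz):
    """Correlation matrix of (X, Y, Z)."""
    return np.array([
        [1.0, rho_xy, rho_xz],
        [rho_xy, 1.0, rho_yz],
        [rho_xz, rho_yz, 1.0],
    ])

def is_compatible(rho_xy, rho_xz, rho_yz):
    """True if the completed correlation matrix is positive definite."""
    eigenvalues = np.linalg.eigvalsh(corr_matrix(rho_xy, rho_xz, rho_yz))
    return bool(eigenvalues.min() > 0)

def compatible_interval(rho_xy, rho_xz):
    """Range of rho_yz for which the matrix is positive semidefinite."""
    half_width = np.sqrt((1 - rho_xy ** 2) * (1 - rho_xz ** 2))
    centre = rho_xy * rho_xz
    return centre - half_width, centre + half_width
```

With the assumed values, `is_compatible(0.6, 0.9, 0.7)` holds and `is_compatible(0.6, 0.9, 0.9)` does not; `compatible_interval(0.6, 0.9)` gives the admissible range directly, which avoids testing candidate values one by one.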

Micro approach
As in the micro approach under the CIA:
• Conditional mean
• Random draw

Conditional mean – Normal distribution
Imputation of Z in A

Random draw
Imputation of Z in A
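Both imputation variants for Z in A can be sketched from the standard conditional distribution of a multivariate normal. The slide's own formulas are not preserved, so this is only an illustration under stated assumptions: X, Y and Z are univariate, `mu` and `sigma` are the estimated mean vector and covariance matrix ordered as (X, Y, Z) with the auxiliary information already plugged in, and the function name `impute_z` is hypothetical.

```python
import numpy as np

def impute_z(xa, ya, mu, sigma, random_draw=False, rng=None):
    """Impute Z for the records of A from the estimated normal model.

    mu, sigma: mean vector and covariance matrix ordered as (X, Y, Z).
    random_draw=False -> conditional mean E[Z | x, y]
    random_draw=True  -> conditional mean plus a normal residual
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    s_xy = sigma[:2, :2]    # covariance of (X, Y)
    s_z_xy = sigma[2, :2]   # covariances of Z with (X, Y)

    # Regression coefficients of Z on (X, Y).
    coef = np.linalg.solve(s_xy, s_z_xy)
    cond_mean = mu[2] + (np.column_stack([xa, ya]) - mu[:2]) @ coef
    if not random_draw:
        return cond_mean

    # Residual variance of Z given (X, Y).
    resid_var = sigma[2, 2] - s_z_xy @ coef
    rng = np.random.default_rng() if rng is None else rng
    return cond_mean + rng.normal(0.0, np.sqrt(resid_var), size=cond_mean.shape)
```

The conditional mean gives every record with the same (x, y) the same imputed value and so understates the variability of Z; the random draw restores the residual variance at the cost of noisier individual values.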

Non-parametric methods
Auxiliary information may be an additional file C
Micro hot-deck (A recipient and B donor)
• any record in A is imputed with a record from C (if a distance is used, it is computed on (X, Y) when C is (X, Y, Z), or on Y when C is (Y, Z)). The imputed record is $(x_a, y_a, \tilde z_a^{(1)} = z_{c^*})$
• Z in a is imputed with a live value $\tilde z_a^{(2)} = z_{b^*}$ from B through hot-deck. If a distance is used, $b^* \in B$ minimizes $d((x_a, z_{c^*}), (x_b, z_b))$
• the final data set is composed of $(x_a, y_a, \tilde z_a^{(2)})$
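The two-step hot-deck above can be sketched as a pair of nearest-neighbour searches. This is a minimal sketch under several assumptions not stated on the slide: file C contains (X, Y, Z), all variables are univariate and numeric, the distance is Euclidean on unscaled values, and donors may be reused without constraint. The function names are hypothetical.

```python
import numpy as np

def nearest(targets, donors):
    """Index of the closest donor row (Euclidean) for each target row."""
    diffs = targets[:, None, :] - donors[None, :, :]
    return np.linalg.norm(diffs, axis=2).argmin(axis=1)

def two_step_hot_deck(a_xy, b_xz, c_xyz):
    """A = (x, y) recipients, B = (x, z) donors, C = (x, y, z) auxiliary file."""
    # Step 1: intermediate value z_{c*} from C, distance computed on (X, Y).
    c_star = nearest(a_xy, c_xyz[:, :2])
    z_step1 = c_xyz[c_star, 2]

    # Step 2: live value z_{b*} from B, distance between
    # (x_a, z_{c*}) and (x_b, z_b).
    b_star = nearest(np.column_stack([a_xy[:, 0], z_step1]), b_xz)
    z_step2 = b_xz[b_star, 1]

    # Final data set: (x_a, y_a, z_a^(2)).
    return np.column_stack([a_xy, z_step2])
```

The value taken from C only steers the search; the value that ends up in the final file always comes from B, so an imperfect auxiliary file never contributes imputed values directly.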

Auxiliary information
Auxiliary information can be
1. information on the inestimable parameters (e.g. $\rho_{YZ}$), as already introduced
2. information on some parameters not directly identifying the model; for instance, (X, Y, Z) are continuous but the contingency table of a categorization of them is known
This kind of auxiliary information can be dealt with by using mixed methods as well as non-parametric methods.

Mixed methods
They combine parametric and non-parametric approaches, mainly in two steps:
1. Estimate the parametric model
2. Use a hot-deck procedure for the imputation of the missing data

Mixed methods: Auxiliary file C
• Regression step 1
• Regression step 2
• Matching step
For each obs. a, the value $z_{b^*}$ of the nearest neighbor $b^*$ in B is imputed,

Mixed methods: Auxiliary file C
• Regression step
• Matching step
For each obs. a, the value $z_{b^*}$ of the nearest neighbor $b^*$ in B is imputed,

Mixed methods: Auxiliary file C
Categorical variables
1. Estimation step
Estimate $\theta_{ijk}$ through maximum likelihood applied to file C
2. Matching step
For each obs. a, a value $z_{b^*}$ is found through a hot-deck procedure. This value is used for the imputation if the corresponding estimated frequency of the cell (X=i, Y=j, Z=k) is not exceeded

Mixed methods: ‘Coarse’ information
We do not know the parameters of (X, Y, Z), but we know the contingency table for a categorization (X°, Y°, Z°) of (X, Y, Z)
1. Hot-deck step
For each obs. a in A, determine a ‘live’ value $z_{c^*}$ from a record $c^*$ in C with respect to a distance $d((x_a, y_a), (x_c, y_c))$. It is imputed only if the frequency of (X°, Y°, Z°) in A is not exceeded; otherwise the search continues.
2. Matching step
For each obs. a in A, impute the live value $z_{b^*}$ corresponding to the nearest neighbor $b^*$ in B with respect to the distance $d((x_a, \tilde z_a), (x_b, z_b))$.

Selected references
• Rässler S. (2002) Statistical Matching: a frequentist theory, practical applications and alternative Bayesian approaches, Springer
• Moriarity C., Scheuren F. (2001) “Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure”, Journal of Official Statistics, 17, 407–422
• Moriarity C., Scheuren F. (2003) “A Note on Rubin’s Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputation”, Journal of Business and Economic Statistics, 21, 65–73
• Moriarity C., Scheuren F. (2004) “Regression–based statistical matching: recent developments”, Proceedings of the Section on Survey Research Methods, American Statistical Association
• D’Orazio M., Di Zio M., Scanu M. (2006) “Statistical Matching for Categorical Data: displaying uncertainty and using logical constraints”, Journal of Official Statistics, 22, 1–22