Statistical matching under the conditional independence assumption Training Course «Statistical Matching» Rome, 6-8 November 2013 Mauro Scanu Dept. Integration, Quality, Research and Production Networks Development, Istat scanu [at] istat.it

Outline
The conditional independence model (CIA)
Parametric macro methods: the normal case; maximum likelihood
Parametric micro methods: conditional mean matching; random draw
Nonparametric macro methods
Nonparametric micro methods: random hot deck; conditional random hot deck; rank hot deck; distance hot deck; constrained hot deck
References

A first identifiable model. Let us consider the class of models F for (X, Y, Z) restricted to the following set:
f(x, y, z) = f_{Y|X}(y | x) f_{Z|X}(z | x) f_X(x),
where f_{Y|X} is the conditional density of Y given X, f_{Z|X} is the conditional density of Z given X, and f_X is the marginal density of X.
Consequence 1: this class of distributions for (X, Y, Z) encodes the conditional independence of Y and Z given X (the CIA).
Consequence 2: this model is identifiable for A ∪ B.
Note: this is not the only identifiable model! Moreover, this model can be useful in many different cases (use of proxy variables and assessment of uncertainty)!

The different matching contexts. The approaches (parametric, nonparametric) cross with the outputs (macro, micro); we start with the parametric approach and a macro output. Let us tackle this problem in a familiar context for inferential statistics: data are drawn according to a probability law that follows a parametric model, and the objective is macro (parameter estimation). In the following we will mainly consider two distributions: the normal and the multinomial.

Parametric macro methods. In a parametric model, each probability law that can generate our sample data can be described by a finite number of parameters. Under the CIA, given the sample A ∪ B, the likelihood function factorizes as follows.
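The likelihood expression itself did not survive the transcript. A reconstruction under the CIA, consistent with the factorization of the identifiable model above (notation mine), is:

```latex
L(\theta \mid A \cup B) =
  \prod_{i \in A \cup B} f_X(x_i; \theta_X)
  \prod_{i \in A} f_{Y|X}(y_i \mid x_i; \theta_{Y|X})
  \prod_{i \in B} f_{Z|X}(z_i \mid x_i; \theta_{Z|X})
```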

Parametric macro methods. Parameter estimation becomes straightforward:
Use the pooled sample A ∪ B for estimating θ_X
Use A for estimating θ_{Y|X}
Use B for estimating θ_{Z|X}

Parametric macro methods: the normal case. Let (X, Y, Z) be a trivariate normal r.v. with mean vector μ and covariance matrix Σ. Under the CIA, the parameter σ_YZ is no longer a free parameter. For the statistical matching problem, it is convenient to consider the equivalent distribution defined by the parameterization X, Y|X, Z|X.
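The parameter matrices were lost in extraction; a standard reconstruction (notation mine) of the normal parameterization, and of the constraint the CIA imposes, is:

```latex
\mu = (\mu_X, \mu_Y, \mu_Z)', \qquad
\Sigma = \begin{pmatrix}
  \sigma_{XX} & \sigma_{XY} & \sigma_{XZ} \\
  \sigma_{XY} & \sigma_{YY} & \sigma_{YZ} \\
  \sigma_{XZ} & \sigma_{YZ} & \sigma_{ZZ}
\end{pmatrix}, \qquad
\text{CIA: } \sigma_{YZ} = \frac{\sigma_{XY}\,\sigma_{XZ}}{\sigma_{XX}}
\iff \rho_{YZ} = \rho_{XY}\,\rho_{XZ}
```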

Parametric macro methods: the normal case. Estimates for the re-parameterization:

Parametric macro methods: the normal case. For the estimates of the parameters of the marginal distribution of X, the whole sample A ∪ B can be used.

Parametric macro methods: the normal case. For the estimates of the parameters of the distribution of Y given X (say Y | X = x ~ N(α_Y + β_Y x, σ_{Y|X})), only sample A can be used. Hence, the marginal parameters for Y are: μ_Y = α_Y + β_Y μ_X and σ_YY = σ_{Y|X} + β_Y² σ_XX.

Parametric macro methods: the normal case. For the estimates of the parameters of the distribution of Z given X (say Z | X = x ~ N(α_Z + β_Z x, σ_{Z|X})), only sample B can be used. Hence, the marginal parameters for Z are: μ_Z = α_Z + β_Z μ_X and σ_ZZ = σ_{Z|X} + β_Z² σ_XX.
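Pulling the three estimation steps together, a minimal Python sketch of the normal-case macro method (illustrative only; function and variable names are mine, not from the course material):

```python
import numpy as np

def cia_normal_macro(xA, yA, xB, zB):
    """ML estimates of trivariate-normal (X, Y, Z) parameters under the CIA.

    xA, yA: X and Y observed in sample A; xB, zB: X and Z observed in sample B.
    Returns the estimated mean vector and covariance matrix of (X, Y, Z).
    """
    x_all = np.concatenate([xA, xB])
    mu_X, s_XX = x_all.mean(), x_all.var()        # pooled X moments (ML, /n)

    # Regression of Y on X, estimated on A only
    bY = np.cov(xA, yA, bias=True)[0, 1] / xA.var()
    aY = yA.mean() - bY * xA.mean()
    s_Y_res = yA.var() - bY**2 * xA.var()         # residual variance of Y | X

    # Regression of Z on X, estimated on B only
    bZ = np.cov(xB, zB, bias=True)[0, 1] / xB.var()
    aZ = zB.mean() - bZ * xB.mean()
    s_Z_res = zB.var() - bZ**2 * xB.var()         # residual variance of Z | X

    # Marginal parameters induced by the re-parameterization
    mu_Y, mu_Z = aY + bY * mu_X, aZ + bZ * mu_X
    s_XY, s_XZ = bY * s_XX, bZ * s_XX
    s_YY = s_Y_res + bY**2 * s_XX
    s_ZZ = s_Z_res + bZ**2 * s_XX
    s_YZ = bY * bZ * s_XX                         # CIA: no free Y-Z parameter

    mu = np.array([mu_X, mu_Y, mu_Z])
    Sigma = np.array([[s_XX, s_XY, s_XZ],
                      [s_XY, s_YY, s_YZ],
                      [s_XZ, s_YZ, s_ZZ]])
    return mu, Sigma
```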

Comment: why maximum likelihood estimation? What happens if, instead of the previous maximum likelihood parameter estimation, we consider a direct estimate from the data set where the corresponding variable(s) are observed? For instance, consider a direct estimate of the mean of Y on A (i.e. the sample average of Y in A) instead of the maximum likelihood estimate μ̂_Y (a kind of regression estimate in a double-sampling setting), where ρ_XY is the correlation coefficient between X and Y. The maximum likelihood estimator is much more efficient when the size of B increases and X and Y are highly correlated.
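The formula referred to on the slide is missing from the transcript. The comparison is the classical double-sampling result, which in my notation reads (large-sample approximation):

```latex
\hat{\mu}_Y = \bar{y}_A + \hat{\beta}_{YX}\,(\bar{x}_{A \cup B} - \bar{x}_A),
\qquad
\operatorname{Var}(\hat{\mu}_Y) \approx
\sigma_{YY}\left(\frac{1-\rho_{XY}^2}{n_A} + \frac{\rho_{XY}^2}{n_A+n_B}\right)
\le \frac{\sigma_{YY}}{n_A} = \operatorname{Var}(\bar{y}_A)
```

The gain over the plain sample average grows with ρ²_XY and with the size of B, as stated above.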

Comment: why maximum likelihood estimation? When parameters are estimated separately on the part of A ∪ B that is complete for the corresponding r.v., it might happen that the estimates are not coherent. For instance, the estimated variance-covariance matrix may fail to be positive semi-definite! This does not happen when all the parameters are estimated simultaneously by maximum likelihood.

Example

Example. Under the CIA, the maximum likelihood estimates of the parameters are computed as above; from these estimates we obtain the parameters of the joint distribution.

Parametric macro methods: the multinomial case. Let (X, Y, Z) be a multinomial r.v. with parameter vector θ = {θ_ijk}, where θ_ijk = P(X = i, Y = j, Z = k), θ_ijk ≥ 0 and Σ_ijk θ_ijk = 1.

Parametric macro methods: the multinomial case. Adopting the same re-parameterization of the joint distribution, under the CIA the parameters of interest are θ_i.. = P(X = i), θ_j|i = P(Y = j | X = i) and θ_k|i = P(Z = k | X = i). In this context, the parameters of the joint distribution are computed as θ_ijk = θ_i.. θ_j|i θ_k|i. When the interest is only in the pairwise distribution of (Y, Z): θ_.jk = Σ_i θ_i.. θ_j|i θ_k|i.

Parametric macro methods: the multinomial case. Given the sample A ∪ B, the maximum likelihood estimators are: θ̂_i.. = (n^A_i.. + n^B_i..)/(n_A + n_B), θ̂_j|i = n^A_ij./n^A_i.., θ̂_k|i = n^B_i.k/n^B_i.., where n^A and n^B denote the observed counts in A and B.
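A minimal Python sketch of the multinomial macro estimates (illustrative; names are mine):

```python
import numpy as np

def cia_multinomial_macro(nA, nB):
    """ML estimates under the CIA for categorical (X, Y, Z).

    nA: I x J array of (X, Y) counts from sample A.
    nB: I x K array of (X, Z) counts from sample B.
    Returns theta[i, j, k] = P(X=i) P(Y=j | X=i) P(Z=k | X=i).
    """
    nA_i, nB_i = nA.sum(axis=1), nB.sum(axis=1)        # X margins in A and B
    theta_X = (nA_i + nB_i) / (nA.sum() + nB.sum())    # pooled X distribution
    theta_Y_given_X = nA / nA_i[:, None]               # P(Y=j | X=i), from A
    theta_Z_given_X = nB / nB_i[:, None]               # P(Z=k | X=i), from B
    # Joint distribution induced by the CIA
    return (theta_X[:, None, None]
            * theta_Y_given_X[:, :, None]
            * theta_Z_given_X[:, None, :])

# Pairwise (Y, Z) distribution: sum the joint over X
# theta_YZ = cia_multinomial_macro(nA, nB).sum(axis=0)
```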

Example. Let us consider the following two samples A and B, where I = 2, J = 2, K = 3.

Example. The maximum likelihood estimates of the parameters are:

Example. The maximum likelihood estimates of the parameters of the joint distribution are:

Parametric macro methods: conclusions.
The CIA model is identifiable (i.e. it admits a unique estimate) for the data set A ∪ B.
The application of the maximum likelihood estimator is very easy: even though the problem is characterized by missing data, it can be split into three «complete data» subproblems, one for each parameter block of the re-parameterization.
Other estimation methods can be statistically consistent, but mutually incoherent.

Selected references
Anderson T W (1957) "Maximum likelihood estimates for a multivariate normal distribution when some observations are missing", JASA, 52, 200–203
Anderson T W (1984) An Introduction to Multivariate Statistical Analysis, Wiley
Rubin D B (1974) "Characterizing the Estimation of Parameters in Incomplete-Data Problems", JASA, 69, 467–474
D'Orazio M, Di Zio M, Scanu M (2006) "Statistical matching for categorical data: displaying uncertainty and using logical constraints", JOS, 22, 137–157
Moriarity C, Scheuren F (2001) "Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure", JOS, 17, 407–422

The different matching contexts. We are still in the familiar context of parametric data models, but now the output is micro, i.e. we are interested in a complete data set where (X, Y, Z) are jointly available. This is the context where imputation methods are usually applied!

Parametric micro methods. Objective: to create a complete data set for (X, Y, Z). Context: a partially observed data set.

Parametric micro methods. Method: imputation of missing values. In a parametric context: (1) estimate the distribution parameters; (2) take a (not necessarily random) value from the estimated distribution.

Parametric micro methods: conditional mean matching. A first method consists in filling in the missing values with the corresponding conditional expected values. The unknown parameters can be substituted by the estimates already discussed under the parametric macro methods.

Parametric micro methods: conditional mean matching. Example: consider the normal case. Imputations will be performed using the estimated regression functions. Comments:
Each imputation is the value with the shortest expected distance from all the possible values under the estimated distribution (good if the purpose is to study unit characteristics, not population characteristics).
The imputed values might not be «live» values.
These imputed values shrink the variability of the imputed variable.
In other words: is the complete data set an optimal one?
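A minimal sketch of conditional mean matching for the normal case (assuming simple linear regressions; names are mine):

```python
import numpy as np

def conditional_mean_matching(xA, yA, xB, zB):
    """Impute Z in A and Y in B with the estimated regression functions."""
    bZ = np.cov(xB, zB, bias=True)[0, 1] / xB.var()   # Z-on-X slope, from B
    aZ = zB.mean() - bZ * xB.mean()
    bY = np.cov(xA, yA, bias=True)[0, 1] / xA.var()   # Y-on-X slope, from A
    aY = yA.mean() - bY * xA.mean()
    z_in_A = aZ + bZ * xA       # estimated E(Z | X = x_a) for each a in A
    y_in_B = aY + bY * xB       # estimated E(Y | X = x_b) for each b in B
    return z_in_A, y_in_B
```

Note that the imputed values all sit on the regression lines, which is exactly why they shrink the variability of the imputed variables.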

Example

Example. The imputed Z variance is just a bit smaller than the one observed in B.

Example. The imputed Y variance is much smaller than the one observed in A (30.21 instead of 179.41). Why? The imputed values lie exactly on the regression line, so their variance is only the part of the Y variance explained by X, which is small when X and Y are weakly correlated.

Parametric micro methods: conditional random draw. In order to preserve the observed distributions as much as possible, these are the steps to follow:
Estimate the parameters according to a parametric macro method.
For each a = 1, …, n_A, generate a random draw z̃_a from the estimated distribution of Z given X = x_a.
For each b = 1, …, n_B, generate a random draw ỹ_b from the estimated distribution of Y given X = x_b.

Parametric micro methods: conditional random draw. Example: normal case. The imputed value is the regression prediction plus a value e generated from a normal distribution with zero mean and variance equal to the estimated residual variance of Y given X or of Z given X, respectively. Example: multinomial case. Impute values in A through a random draw from the estimated distribution of Z given X; impute values in B through a random draw from the estimated distribution of Y given X.
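A sketch of the normal-case conditional random draw: the same regression predictions as above, plus a residual drawn from the estimated residual distribution (illustrative; names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional_random_draw(xA, yA, xB, zB):
    """Regression prediction plus a zero-mean normal residual."""
    bZ = np.cov(xB, zB, bias=True)[0, 1] / xB.var()
    aZ = zB.mean() - bZ * xB.mean()
    sdZ = np.sqrt(zB.var() - bZ**2 * xB.var())    # residual sd of Z | X
    bY = np.cov(xA, yA, bias=True)[0, 1] / xA.var()
    aY = yA.mean() - bY * xA.mean()
    sdY = np.sqrt(yA.var() - bY**2 * xA.var())    # residual sd of Y | X
    z_in_A = aZ + bZ * xA + rng.normal(0.0, sdZ, size=len(xA))
    y_in_B = aY + bY * xB + rng.normal(0.0, sdY, size=len(xB))
    return z_in_A, y_in_B
```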

Parametric micro methods: conclusions. Parametric micro methods are based on the estimates obtained under the macro methods. Caution is needed on the variability of the imputed variables in the completed sample. Results on this second point have been obtained under the name of matching noise.

Selected references
Little R J A, Rubin D B (2002) Statistical Analysis with Missing Data, 2nd edition, Wiley
Rubin D B (1986) "Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations", Journal of Business and Economic Statistics, 4, 87–94
Kadane J B (1978) "Some Statistical Problems in Merging Data Files", in Compendium of Tax Research, Department of the Treasury, U.S. Government Printing Office, Washington D.C., 159–179. Republished in Journal of Official Statistics, 17, 423–433
Marella D, Scanu M, Conti P L (2008) "On the matching noise of some nonparametric imputation procedures", Statistics and Probability Letters, 78, 1593–1600
Conti P L, Marella D, Scanu M (2008) "Evaluation of matching noise for imputation techniques based on the local linear regression estimator", Computational Statistics and Data Analysis, 53, 354–365

Nonparametric macro methods. The family of distributions to which the distribution of (X, Y, Z) belongs cannot be represented by a finite number of parameters. Although the statistical literature on nonparametrics is huge, this is by far the most neglected situation in statistical matching! We nevertheless examine it in order to link macro and micro methods, as in the parametric case.

Nonparametric macro methods. We will not consider the case of Y and Z categorical, because it would mainly reduce to the already described parametric methods. We will mainly refer to two easy nonparametric macro estimates, corresponding to the cases where X is categorical and numerical, respectively:
X categorical, Y and Z ordered or numerical: estimation of the empirical cumulative distribution function.
X, Y and Z numerical: estimation of the nonparametric regression function.
As a matter of fact, the first approach will be helpful for the random generation of imputations in the corresponding micro methods, the second for conditional mean matching and, whenever possible, for adding a random residual.

Empirical cumulative distribution function. Under the CIA, the cumulative distribution of Y and Z given X can be written as F_{YZ|X}(y, z | x) = F_{Y|X}(y | x) F_{Z|X}(z | x). Each factor can be estimated by the corresponding empirical cumulative distribution function, from A and from B respectively.

Nonparametric regression function. Let us assume that Z = r(X) + ε, where ε is such that E(ε | X) = 0. If the function r(.) is linear, then we get the usual linear regression function already studied in the parametric macro methods under the assumption of normality. Here, however, we do not restrict r(.) to belong to a specific parametric family.

Nonparametric regression function: kNN. Let us just consider the estimation of the regression function of Z on X (the results for Y are similar). As already seen in the parametric case, this regression function can be estimated restricting attention to sample B only. Estimation can be performed by means of the k nearest neighbour (kNN) estimator r̂(x) = Σ_b W_kb z_b, where W_kb, b = 1, …, n_B, is a sequence of weights assigned to the units in B after ordering them by |x − x_b|, from the smallest to the largest value. The weights are nonzero only on J_x, which contains the first k unit labels of the ordering!
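A minimal sketch of the kNN estimator for a scalar X (illustrative; names are mine):

```python
import numpy as np

def knn_regression(x, xB, zB, k):
    """k-NN estimate of E(Z | X = x): order the donors in B by |x - x_b|
    and average the Z values of the k nearest ones (weights 1/k on J_x)."""
    order = np.argsort(np.abs(xB - x))
    return zB[order[:k]].mean()
```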

Nonparametric regression function: kNN. In practice, the expected value of Z given X = x is obtained by averaging the Z values of the k observations nearest to X = x.
Conclusions on nonparametric macro methods: although not used much in practice, this part is useful in order to understand what happens in the widely used nonparametric micro methods. Mainly two problems have been shown: estimation of the empirical cumulative distribution function and of the nonparametric regression function.

Selected references
Wand M, Jones C (1995) Kernel Smoothing, Chapman & Hall
Härdle W (1992) Applied Nonparametric Regression, Cambridge University Press
Paass G (1985) "Statistical record linkage methodology, state of the art and future prospects", in Bulletin of the International Statistical Institute, Proceedings of the 45th Session, volume LI, Book 2
Marella D, Scanu M, Conti P L (2008) "On the matching noise of some nonparametric imputation procedures", Statistics and Probability Letters, 78, 1593–1600
Conti P L, Marella D, Scanu M (2008) "Evaluation of matching noise for imputation techniques based on the local linear regression estimator", Computational Statistics and Data Analysis, 53, 354–365

Nonparametric micro methods. Those who applied these methods seldom assumed anything about the distribution of (X, Y, Z). Yet each micro method has a macro counterpart, i.e. an implicit representation of the distribution of (X, Y, Z). The question is whether or not to be aware of it.

Nonparametric micro methods. The nonparametric micro matching methods consist essentially of three imputation procedures:
Random hot deck
Rank hot deck
Distance hot deck
As already seen in the parametric case, each of these methods corresponds to a specific nonparametric macro estimate of the distribution f(x, y, z) or of a characteristic value of it. In general, these methods do not organize the two data sets A and B as a unique sample A ∪ B.

Nonparametric micro methods. The idea is to consider one file as the recipient and the other as the donor: A is the recipient file, and its records are the data to impute; B is the donor file, and its records are the data to use for imputation.

Example. In order to define the different hot deck methods, let us consider an example. Let A and B be the following samples:
A: n_A = 6, observed variables: Gender, Age, Income.
B: n_B = 10, observed variables: Gender, Age, Expenditures.
A = recipient, B = donor. Common variables: X = (X1 = Gender, X2 = Age); Y = Income; Z = Expenditures.

Example

Random hot deck: the method. Draw one random value from B and assign it to the first record to impute in A; follow the same procedure for all a ∈ A. In general there are n_B^{n_A} = 10^6 possible different ways to impute A.

Conditional random hot deck: the method. Fix a conditioning variable, e.g. X1. For the first record a = 1, draw one random value from the subset of units in B such that X1_b equals the recipient's class (e.g. X1_b = F); follow the same procedure for all a ∈ A. With m_A = 4 recipients and m_B = 6 donors in the F class, the number of different completed data sets we can get is m_B^{m_A} · (n_B − m_B)^{n_A − m_A} = 6^4 · 4^2 = 20736.
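A minimal sketch of both procedures (illustrative; names are mine, and no tie-breaking or empty-class handling is shown):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_hot_deck(zB, n_A):
    """For each record of A, draw one donor value from B at random."""
    return rng.choice(zB, size=n_A, replace=True)

def conditional_random_hot_deck(x1A, x1B, zB):
    """Draw each donor from the subset of B sharing the recipient's X1 class."""
    imputed = np.empty(len(x1A), dtype=zB.dtype)
    for a, cls in enumerate(x1A):
        donors = zB[x1B == cls]        # donation class, e.g. same Gender
        imputed[a] = rng.choice(donors)
    return imputed
```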

Comments. Random hot deck corresponds to a random generation of the values to impute from the empirical cumulative distribution function of Z. Conditional random hot deck corresponds to a random generation of the values to impute from the empirical cumulative distribution function of Z | X. It is possible to eliminate an already drawn value from the set of possible donors (constrained procedure); however, the preservation of the observed distribution of Z or Z | X in B is then jeopardized.

Rank hot deck. Assume that n_B = k n_A, with k an integer. Compute the empirical cumulative distribution functions F̂_A and F̂_B of X in the two samples. To each a ∈ A assign the b* ∈ B such that F̂_B(x_{b*}) is the closest to F̂_A(x_a). In other words, this method imputes values whose quantiles of X are similar in A and B respectively.

Rank hot deck. Rank the two samples A and B according to X1.

Rank hot deck. These are the values of the empirical cumulative distribution function of X1 in A and B respectively.

Rank hot deck. This is the result. In this example, there is only one way to impute a value.
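A minimal sketch of rank hot deck (illustrative; names are mine, ties ignored):

```python
import numpy as np

def rank_hot_deck(xA, xB, zB):
    """Match each a in A to the b* in B with the closest empirical cdf value."""
    FA = (np.argsort(np.argsort(xA)) + 1) / len(xA)   # F_A(x_a) for a in A
    FB = (np.argsort(np.argsort(xB)) + 1) / len(xB)   # F_B(x_b) for b in B
    b_star = np.abs(FB[None, :] - FA[:, None]).argmin(axis=1)
    return zB[b_star]
```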

Distance hot deck. To each a ∈ A assign the b* ∈ B that is the nearest according to the common variables. This method depends on the distance function, and different choices are available:
If X is numeric, it is possible to choose the Manhattan distance; other choices include the Euclidean distance, …
If X is multivariate, the available distances include the Mahalanobis, Canberra, …
If X is categorical and unordered, it is possible to consider classes of imputation (i.e. the distance is «equality»).

Distance hot deck: example. Let us consider X2 as the matching variable (i.e. we choose as donors those records in B whose Age is most similar to the recipient's in A). Choose one donor at random if more than one donor is at the same distance. The overall distance between donors and recipients is the sum of the individual donor-recipient distances.
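A minimal sketch of (unconstrained) distance hot deck on a single numeric matching variable, with the Manhattan distance (illustrative; names are mine, ties broken by the first donor found):

```python
import numpy as np

def distance_hot_deck(xA, xB, zB):
    """Nearest-neighbour imputation of Z into A; also returns total distance."""
    d = np.abs(xA[:, None] - xB[None, :])   # |x_a - x_b| for every (a, b) pair
    b_star = d.argmin(axis=1)               # nearest donor for each recipient
    return zB[b_star], d[np.arange(len(xA)), b_star].sum()
```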

Constrained distance hot deck. In the former procedure, a donor can be chosen more than once if it is the nearest to more than one record in A. In order to avoid using the same donor more than once, the following constrained procedure has been defined: minimize the overall distance Σ_a Σ_b d(x_a, x_b) w_ab under the constraints Σ_b w_ab = 1 for each a ∈ A, Σ_a w_ab ≤ 1 for each b ∈ B, w_ab ∈ {0, 1} (this requires n_B ≥ n_A).
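This is an assignment problem, so a sketch can delegate the optimization to scipy's Hungarian-algorithm solver (illustrative; assumes n_B ≥ n_A and a scalar matching variable):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def constrained_distance_hot_deck(xA, xB, zB):
    """Each donor in B is used at most once; minimises the total distance."""
    d = np.abs(xA[:, None] - xB[None, :])   # pairwise Manhattan distances
    rows, cols = linear_sum_assignment(d)   # optimal one-to-one matching
    return zB[cols], d[rows, cols].sum()
```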

Constrained distance hot deck: advantages and disadvantages. Constrained matching preserves the marginal distribution of the variable to impute (Z). On the other hand, constrained distance hot deck is characterized by a larger overall distance between donors and recipients than unconstrained distance hot deck.

Constrained distance hot deck: example. The overall donor-recipient distance is:

Constrained distance hot deck: comments. Distance hot deck is equivalent to the estimation of a nonparametric regression function via the kNN method with k = 1. These methods always produce live data as imputations. Sometimes, parametric and nonparametric procedures are applied together: mixed methods. Example: regression step, impute intermediate values in A and B; matching step, use a distance hot deck selecting the b* with the shortest distance.

Selected references
Kadane J B (1978) "Some Statistical Problems in Merging Data Files", in Compendium of Tax Research, Department of the Treasury, U.S. Government Printing Office, Washington D.C., 159–179. Republished in Journal of Official Statistics, 17, 423–433
Little R J A, Rubin D B (2002) Statistical Analysis with Missing Data, 2nd edition, Wiley
Okner B A (1972) "Constructing a new data base from existing microdata sets: the 1966 merge file", Annals of Economic and Social Measurement, 1, 325–342
Rodgers W L (1984) "An Evaluation of Statistical Matching", Journal of Business and Economic Statistics, 2, 91–102
Rubin D B (1986) "Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations", Journal of Business and Economic Statistics, 4, 87–94
Sims C A (1972) "Comments on Okner", Annals of Economic and Social Measurement, 1, 343–345
Singh A C, Mantel H, Kinack M, Rowe G (1993) "Statistical Matching: Use of Auxiliary Information as an Alternative to the Conditional Independence Assumption", Survey Methodology, 19, 59–79
Marella D, Scanu M, Conti P L (2008) "On the matching noise of some nonparametric imputation procedures", Statistics and Probability Letters, 78, 1593–1600
Conti P L, Marella D, Scanu M (2008) "Evaluation of matching noise for imputation techniques based on the local linear regression estimator", Computational Statistics and Data Analysis, 53, 354–365