Outline
- The conditional independence model (CIA)
- Parametric macro methods
  - The normal case
  - Maximum likelihood
- Parametric micro methods
  - Conditional mean matching
  - Random draw
- Nonparametric macro methods
- Nonparametric micro methods
  - Random hot deck
  - Conditional random hot deck
  - Rank hot deck
  - Distance hot deck
  - Constrained hot deck
- References
A first identifiable model
Let us restrict the class of models F for (X, Y, Z) to the following set:
$$f(x, y, z) = f_{Y|X}(y \mid x)\, f_{Z|X}(z \mid x)\, f_X(x),$$
where $f_{Y|X}$ is the conditional density of Y given X, $f_{Z|X}$ is the conditional density of Z given X, and $f_X$ is the marginal density of X.
- Consequence 1: every distribution in this class satisfies the conditional independence of Y and Z given X (the conditional independence assumption, CIA).
- Consequence 2: this model is identifiable for A ∪ B.
- Note: this is not the only identifiable model!
- Help: this model can be useful in many different cases (use of proxy variables and uncertainty)!
The different matching contexts
Output: macro or micro. Approach: parametric or nonparametric.
Let's tackle this problem in a context familiar from inferential statistics: the data are drawn according to a probability law that follows a parametric model, and the objective is macro. In the following we will mainly consider two distributions: the normal and the multinomial.
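Parametric macro methods
In a parametric model, each probability law that can generate our sample data is described by a finite number of parameters. Under the CIA, given the sample A ∪ B, the likelihood function becomes:
$$L(\theta \mid A \cup B) = \prod_{a=1}^{n_A} f_{Y|X}(y_a \mid x_a; \theta_{Y|X}) \prod_{b=1}^{n_B} f_{Z|X}(z_b \mid x_b; \theta_{Z|X}) \prod_{a=1}^{n_A} f_X(x_a; \theta_X) \prod_{b=1}^{n_B} f_X(x_b; \theta_X)$$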
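Parametric macro methods
Parameter estimation becomes straightforward:
- Use sample A ∪ B for estimating $\theta_X$, the parameters of the marginal distribution of X
- Use A for estimating $\theta_{Y|X}$, the parameters of the distribution of Y given X
- Use B for estimating $\theta_{Z|X}$, the parameters of the distribution of Z given X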
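Parametric macro methods: the normal case
Let (X, Y, Z) be a three-variate normal r.v. with parameters:
$$\mu = (\mu_X, \mu_Y, \mu_Z)^\top, \qquad \Sigma = \begin{pmatrix} \sigma_X^2 & \sigma_{XY} & \sigma_{XZ} \\ \sigma_{XY} & \sigma_Y^2 & \sigma_{YZ} \\ \sigma_{XZ} & \sigma_{YZ} & \sigma_Z^2 \end{pmatrix}$$
Under the CIA, the parameter $\sigma_{YZ}$ is superfluous: it is determined by the other parameters through $\sigma_{YZ} = \sigma_{XY}\sigma_{XZ}/\sigma_X^2$.
For the statistical matching problem, it is convenient to use the equivalent parameterization based on the distributions of X, Y|X and Z|X.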
Parametric macro methods: the normal case
Estimates for the re-parameterization
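Parametric macro methods: the normal case
The parameters of the marginal distribution of X can be estimated on the whole sample A ∪ B:
$$\hat\mu_X = \frac{1}{n_A + n_B}\Big(\sum_{a=1}^{n_A} x_a + \sum_{b=1}^{n_B} x_b\Big), \qquad \hat\sigma_X^2 = \frac{1}{n_A + n_B}\Big(\sum_{a=1}^{n_A} (x_a - \hat\mu_X)^2 + \sum_{b=1}^{n_B} (x_b - \hat\mu_X)^2\Big)$$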
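Parametric macro methods: the normal case
The parameters of the distribution of Y given X can be estimated on sample A only:
$$\hat\beta_{YX} = \frac{s_{XY;A}}{s^2_{X;A}}, \qquad \hat\alpha_Y = \bar y_A - \hat\beta_{YX}\,\bar x_A, \qquad \hat\sigma^2_{Y|X} = s^2_{Y;A} - \hat\beta_{YX}^2\, s^2_{X;A},$$
where $\bar x_A$, $\bar y_A$, $s^2_{X;A}$, $s^2_{Y;A}$ and $s_{XY;A}$ are the sample means, variances and covariance computed on A (with divisor $n_A$).
Hence, the marginal parameters for Y are:
$$\hat\mu_Y = \hat\alpha_Y + \hat\beta_{YX}\,\hat\mu_X, \qquad \hat\sigma_Y^2 = \hat\sigma^2_{Y|X} + \hat\beta_{YX}^2\,\hat\sigma_X^2, \qquad \hat\sigma_{XY} = \hat\beta_{YX}\,\hat\sigma_X^2,$$
where $\hat\mu_X$ and $\hat\sigma_X^2$ are the estimates of the mean and variance of X obtained from A ∪ B.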
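Parametric macro methods: the normal case
The parameters of the distribution of Z given X can be estimated on sample B only:
$$\hat\beta_{ZX} = \frac{s_{XZ;B}}{s^2_{X;B}}, \qquad \hat\alpha_Z = \bar z_B - \hat\beta_{ZX}\,\bar x_B, \qquad \hat\sigma^2_{Z|X} = s^2_{Z;B} - \hat\beta_{ZX}^2\, s^2_{X;B},$$
where $\bar x_B$, $\bar z_B$, $s^2_{X;B}$, $s^2_{Z;B}$ and $s_{XZ;B}$ are the sample means, variances and covariance computed on B (with divisor $n_B$).
Hence, the marginal parameters for Z are:
$$\hat\mu_Z = \hat\alpha_Z + \hat\beta_{ZX}\,\hat\mu_X, \qquad \hat\sigma_Z^2 = \hat\sigma^2_{Z|X} + \hat\beta_{ZX}^2\,\hat\sigma_X^2, \qquad \hat\sigma_{XZ} = \hat\beta_{ZX}\,\hat\sigma_X^2,$$
where $\hat\mu_X$ and $\hat\sigma_X^2$ are the estimates of the mean and variance of X obtained from A ∪ B.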
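The three estimation steps for the normal case can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`_regress`, `ml_normal_cia`) and the tiny data set are hypothetical, and all moments use the maximum likelihood divisor n.

```python
from statistics import fmean

def _regress(x, t):
    """ML simple regression of t on x within one file (divisor n)."""
    mx, mt = fmean(x), fmean(t)
    sxx = fmean([(xi - mx) ** 2 for xi in x])
    stt = fmean([(ti - mt) ** 2 for ti in t])
    sxt = fmean([(xi - mx) * (ti - mt) for xi, ti in zip(x, t)])
    beta = sxt / sxx
    # intercept, slope, residual variance
    return mt - beta * mx, beta, stt - beta ** 2 * sxx

def ml_normal_cia(xa, ya, xb, zb):
    """Estimates of the trivariate normal parameters under the CIA."""
    x_all = list(xa) + list(xb)              # X is observed in A and in B
    mu_x = fmean(x_all)
    var_x = fmean([(x - mu_x) ** 2 for x in x_all])
    a_y, b_yx, s2_y_x = _regress(xa, ya)     # Y | X: sample A only
    a_z, b_zx, s2_z_x = _regress(xb, zb)     # Z | X: sample B only
    cov_xy, cov_xz = b_yx * var_x, b_zx * var_x
    return {
        "mu_x": mu_x, "var_x": var_x,
        "mu_y": a_y + b_yx * mu_x, "var_y": s2_y_x + b_yx ** 2 * var_x,
        "mu_z": a_z + b_zx * mu_x, "var_z": s2_z_x + b_zx ** 2 * var_x,
        "cov_xy": cov_xy, "cov_xz": cov_xz,
        # Under the CIA the Y-Z covariance follows from the other parameters.
        "cov_yz": cov_xy * cov_xz / var_x,
    }

# Hypothetical toy data, chosen so that the regressions are exact.
est = ml_normal_cia([1, 2, 3], [2, 4, 6], [4, 5], [1, 3])
```

Note that the covariance between Y and Z is never estimated from data: it is implied by the CIA.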
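Comment: why maximum likelihood estimation?
What happens if, instead of the previous maximum likelihood estimates, we estimate each parameter directly on the data set where the corresponding variable(s) are observed?
For instance, consider estimating the mean of Y directly on A (i.e. with the sample average $\bar y_A$) instead of using the maximum likelihood estimate $\hat\mu_Y$, which is a kind of regression estimate in double sampling:
$$\hat\mu_Y = \bar y_A + \hat\beta_{YX}\,(\hat\mu_X - \bar x_A)$$
For large samples, its variance is approximately
$$\mathrm{Var}(\hat\mu_Y) \approx \frac{\sigma_Y^2}{n_A}\Big(1 - \frac{n_B}{n_A + n_B}\,\rho_{XY}^2\Big),$$
where $\rho_{XY}$ is the correlation coefficient between X and Y, whereas $\mathrm{Var}(\bar y_A) = \sigma_Y^2 / n_A$. The efficiency gain of the maximum likelihood estimator grows as the size of sample B increases and as X and Y become more highly correlated.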
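The efficiency comparison can be checked with a small Monte Carlo sketch. Everything here is an assumption made for illustration: X and Y are standard normal with correlation `rho`, and the sample sizes, number of replications and seed are arbitrary.

```python
import random
from statistics import fmean, pvariance

def simulate(n_a=30, n_b=300, rho=0.9, reps=1000, seed=1):
    """Compare the sample mean of Y in A with the regression-type ML estimate."""
    rng = random.Random(seed)
    direct, ml = [], []
    for _ in range(reps):
        xa = [rng.gauss(0, 1) for _ in range(n_a)]
        ya = [rho * x + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for x in xa]
        xb = [rng.gauss(0, 1) for _ in range(n_b)]   # B contributes X only
        mx_a, my_a = fmean(xa), fmean(ya)
        beta = (sum((x - mx_a) * (y - my_a) for x, y in zip(xa, ya))
                / sum((x - mx_a) ** 2 for x in xa))
        mu_x = fmean(xa + xb)                        # pooled mean of X
        direct.append(my_a)
        ml.append(my_a + beta * (mu_x - mx_a))
    # Monte Carlo variances of the two estimators of the mean of Y
    return pvariance(direct), pvariance(ml)

var_direct, var_ml = simulate()
```

With a strong correlation and a sample B much larger than A, `var_ml` should come out clearly smaller than `var_direct`.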
Comment: why maximum likelihood estimation?
When each parameter is estimated separately on the part of A ∪ B that is complete for the corresponding r.v., the resulting estimates might not be coherent. For instance, the estimated variance-covariance matrix of (X, Y) might fail to be positive semidefinite! This cannot happen when all the parameters are estimated simultaneously by maximum likelihood.
Example
Under the CIA, the maximum likelihood estimates of the parameters are:
From the previous estimates we get:
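Parametric macro methods: the multinomial case
Let (X, Y, Z) be a multinomial r.v. with parameters:
$$\theta_{ijk} = P(X = i, Y = j, Z = k), \qquad i = 1, \dots, I,\; j = 1, \dots, J,\; k = 1, \dots, K,$$
where $\theta = \{\theta_{ijk}\}$ is a vector of parameters with the following characteristics:
$$\theta_{ijk} \ge 0, \qquad \sum_{i,j,k} \theta_{ijk} = 1$$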
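Parametric macro methods: the multinomial case
Adopting the same re-parameterization of the joint distribution, under the CIA the parameters of interest are:
$$\theta_{i..} = P(X = i), \qquad \theta_{j|i} = P(Y = j \mid X = i), \qquad \theta_{k|i} = P(Z = k \mid X = i)$$
In this context, the parameters of the joint distribution are computed as
$$\theta_{ijk} = \theta_{i..}\,\theta_{j|i}\,\theta_{k|i}$$
When the interest is only in the pairwise distribution of (Y, Z):
$$\theta_{.jk} = \sum_{i=1}^{I} \theta_{i..}\,\theta_{j|i}\,\theta_{k|i}$$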
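Parametric macro methods: the multinomial case
Given the sample A ∪ B, the maximum likelihood estimators are
$$\hat\theta_{i..} = \frac{n^A_{i..} + n^B_{i..}}{n_A + n_B}, \qquad \hat\theta_{j|i} = \frac{n^A_{ij.}}{n^A_{i..}}, \qquad \hat\theta_{k|i} = \frac{n^B_{i.k}}{n^B_{i..}},$$
where $n^A_{ij.}$ is the number of units in A with X = i and Y = j, $n^B_{i.k}$ is the number of units in B with X = i and Z = k, and $n^A_{i..}$, $n^B_{i..}$ are the numbers of units with X = i in A and B respectively.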
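A minimal Python sketch of these estimators follows. The function name and the two contingency tables are hypothetical (they are not the data of the example in these slides); the tables only respect I=2, J=2, K=3.

```python
def ml_multinomial_cia(n_a_xy, n_b_xz):
    """ML estimates under the CIA from the X-by-Y table of A and the X-by-Z table of B."""
    n_a = sum(map(sum, n_a_xy))
    n_b = sum(map(sum, n_b_xz))
    n_cat_x = len(n_a_xy)
    # Marginal of X: pooled counts from A and B.
    theta_x = [(sum(n_a_xy[i]) + sum(n_b_xz[i])) / (n_a + n_b)
               for i in range(n_cat_x)]
    # Y | X from A only, Z | X from B only (row-wise relative frequencies).
    theta_y_x = [[c / sum(row) for c in row] for row in n_a_xy]
    theta_z_x = [[c / sum(row) for c in row] for row in n_b_xz]
    # Joint distribution implied by the CIA: theta_ijk = theta_i * theta_j|i * theta_k|i.
    theta = [[[theta_x[i] * pj * pk for pk in theta_z_x[i]]
              for pj in theta_y_x[i]] for i in range(n_cat_x)]
    return theta_x, theta_y_x, theta_z_x, theta

# Hypothetical counts: rows are categories of X.
counts_a = [[10, 20], [30, 40]]          # columns: categories of Y
counts_b = [[5, 10, 15], [20, 25, 25]]   # columns: categories of Z
theta_x, theta_y_x, theta_z_x, theta = ml_multinomial_cia(counts_a, counts_b)

# Pairwise (Y, Z) distribution: sum the joint over the categories of X.
theta_yz = [[sum(theta[i][j][k] for i in range(2)) for k in range(3)]
            for j in range(2)]
```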
Example
Let's consider the following two samples A and B, where I=2, J=2, K=3.
The maximum likelihood estimates of the parameters are:
The maximum likelihood estimates of the parameters of the joint distribution are:
Parametric macro methods: conclusions
- The CIA model is identifiable (i.e. it admits a unique estimate) for the data set A ∪ B
- Applying the maximum likelihood estimator is very easy
- Even though the problem is characterized by missing data, it can be split into three «complete data» subproblems, one for each parameter of the re-parameterization
- Other estimation methods can be statistically consistent, but may give incoherent estimates
The different matching contexts
Output: macro or micro. Approach: parametric or nonparametric.
We are still in the familiar context of parametric data models, but now the objective is micro, i.e. we are interested in a complete data set where (X, Y, Z) are jointly available. This is the context where imputation methods are usually used!
Parametric micro methods
Objective: to create a complete data set for (X, Y, Z)
Context: partially observed data set
Method: imputation of missing values. In a parametric context:
1. Estimate the distribution parameters
2. Take a (not necessarily random) value from the estimated distribution
Parametric micro methods: conditional mean matching
A first method consists in replacing the missing values with the corresponding conditional expected values:
$$\hat z_a = E(Z \mid X = x_a),\; a = 1, \dots, n_A; \qquad \hat y_b = E(Y \mid X = x_b),\; b = 1, \dots, n_B$$
The unknown parameters can be replaced by the estimates already discussed for the parametric macro methods.
Parametric micro methods: conditional mean matching
Example: consider the normal case. Imputations are performed using the estimated regression functions
$$\hat z_a = \hat\alpha_Z + \hat\beta_{ZX}\, x_a, \qquad \hat y_b = \hat\alpha_Y + \hat\beta_{YX}\, x_b$$
Comments:
- Each imputed value is the one at the shortest expected distance from all the possible values under the estimated distribution (good if the purpose is to study unit characteristics, not population characteristics)
- The imputed values might not be «live» values, i.e. values actually observed
- These imputed values shrink the variability of the imputed variable
In other words: is the completed data set an optimal one?
Example
The variance of the imputed Z is just a bit smaller than the one observed in B.
The variance of the imputed Y is much smaller than the one observed in A (30.21 instead of 179.41). Why?
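One way to see why: under conditional mean matching every imputed Y lies exactly on the estimated regression line, so the residual variability around the line is lost. The sketch below illustrates this on hypothetical data (not the data of the example above); `fit_line` is an illustrative helper name.

```python
from statistics import fmean, pvariance

def fit_line(x, t):
    """Least squares (ML under normality) intercept and slope of t on x."""
    mx, mt = fmean(x), fmean(t)
    beta = (sum((xi - mx) * (ti - mt) for xi, ti in zip(x, t))
            / sum((xi - mx) ** 2 for xi in x))
    return mt - beta * mx, beta

# Hypothetical data: (X, Y) observed in A, only X observed in B.
xa = [20, 30, 40, 50, 60]
ya = [15, 28, 22, 41, 35]
xb = [25, 35, 45, 55]

alpha, beta = fit_line(xa, ya)
y_imputed = [alpha + beta * x for x in xb]   # conditional means, no residual

var_observed = pvariance(ya)        # includes the scatter around the line
var_imputed = pvariance(y_imputed)  # equals beta**2 times the variance of xb
```

The imputed values carry only the part of the variance of Y explained by X, which is why the shrinkage is stronger when the regression fits poorly.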
Parametric micro methods: conditional random draw
In order to preserve the observed distributions as much as possible, these are the steps to follow:
1. Estimate the parameters according to a parametric macro method
2. For each a, a = 1, …, n_A, generate a random draw $\tilde z_a$ from $f_{Z|X}(z \mid x_a; \hat\theta_{Z|X})$
3. For each b, b = 1, …, n_B, generate a random draw $\tilde y_b$ from $f_{Y|X}(y \mid x_b; \hat\theta_{Y|X})$
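Parametric micro methods: conditional random draw
Example: normal case
$$\tilde z_a = \hat\alpha_Z + \hat\beta_{ZX}\, x_a + e, \qquad \tilde y_b = \hat\alpha_Y + \hat\beta_{YX}\, x_b + e,$$
where e is a value generated from a normal distribution with zero mean and variance $\hat\sigma^2_{Z|X}$ and $\hat\sigma^2_{Y|X}$ respectively.
Example: multinomial case
- Impute values in A through a random draw from the distribution $\hat\theta_{k|i}$, k = 1, …, K, where i is the observed category of X
- Impute values in B through a random draw from the distribution $\hat\theta_{j|i}$, j = 1, …, J, where i is the observed category of X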
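Both random draws can be sketched with the standard library. The function names and parameter values below are hypothetical; in practice the parameters would be the estimates obtained with a parametric macro method.

```python
import random

def draw_normal(x_recipients, alpha, beta, resid_var, rng):
    """Normal case: regression prediction plus a random normal residual."""
    sd = resid_var ** 0.5
    return [alpha + beta * x + rng.gauss(0, sd) for x in x_recipients]

def draw_multinomial(x_recipients, cond_probs, rng):
    """Multinomial case: draw a category index from the row of the observed X."""
    return [rng.choices(range(len(cond_probs[i])), weights=cond_probs[i])[0]
            for i in x_recipients]

rng = random.Random(42)  # fixed seed only to make the sketch reproducible
z_tilde = draw_normal([1.0, 2.0, 3.0], alpha=0.5, beta=2.0, resid_var=1.5, rng=rng)
# Hypothetical conditional probabilities of Z given X (2 categories of X, 3 of Z).
k_tilde = draw_multinomial([0, 1, 1], [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]], rng)
```

Setting `resid_var=0` reduces the normal draw to conditional mean matching, which makes the role of the residual explicit.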
Parametric micro methods: conclusions
- Parametric micro methods are based on the estimates obtained with the macro methods
- Caution is needed about the variability of the imputed variables in the completed sample
- Results on the second point are known in the literature under the name of matching noise
Nonparametric macro methods
Output: macro or micro. Approach: parametric or nonparametric.
The family of distributions to which the distribution of (X, Y, Z) belongs cannot be represented by a finite number of parameters. Although the statistical literature on nonparametrics is huge, this is by far the most neglected situation in statistical matching! We review it nevertheless, in order to link macro and micro methods as we did in the parametric case.
Nonparametric macro methods
We will not consider the case of categorical Y and Z, because it mainly reduces to the parametric methods already described. We will mainly refer to two simple nonparametric macro estimates, corresponding to the cases where X is categorical and numerical, respectively:
- X categorical, Y and Z ordered or numerical: estimation of the empirical cumulative distribution function
- X, Y and Z numerical: estimation of the nonparametric regression function
As a matter of fact, the first approach will be helpful for the random generation of imputations in the corresponding micro methods, the second for conditional mean matching and, whenever possible, for adding a random residual.
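Empirical cumulative distribution function
Under the CIA, the cumulative distribution of Y and Z given X can be written as:
$$F_{YZ|X}(y, z \mid x) = F_{Y|X}(y \mid x)\, F_{Z|X}(z \mid x)$$
Each factor can be estimated from A and B respectively (X categorical):
$$\hat F_{Y|X}(y \mid x) = \frac{\sum_{a=1}^{n_A} I(y_a \le y,\, x_a = x)}{\sum_{a=1}^{n_A} I(x_a = x)}, \qquad \hat F_{Z|X}(z \mid x) = \frac{\sum_{b=1}^{n_B} I(z_b \le z,\, x_b = x)}{\sum_{b=1}^{n_B} I(x_b = x)}$$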
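A short Python sketch of this estimate for a categorical X. The function names and the toy data are hypothetical.

```python
def conditional_ecdf(values, x_values, x, t):
    """Share of units with X = x whose value is less than or equal to t."""
    cell = [v for v, xv in zip(values, x_values) if xv == x]
    if not cell:
        raise ValueError("no units with this value of X")
    return sum(v <= t for v in cell) / len(cell)

def joint_cdf_cia(y, z, x, xa, ya, xb, zb):
    """Estimate of F(y, z | x) under the CIA: one factor from A, one from B."""
    return conditional_ecdf(ya, xa, x, y) * conditional_ecdf(zb, xb, x, z)

# Hypothetical data: X is gender, Y observed in A, Z observed in B.
xa, ya = ["F", "F", "M"], [10, 20, 30]
xb, zb = ["F", "M", "F"], [1, 2, 3]
f_hat = joint_cdf_cia(15, 2, "F", xa, ya, xb, zb)
```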
Nonparametric regression function
Let us assume that
$$Z = r(X) + \varepsilon,$$
where $\varepsilon$ is such that $E(\varepsilon \mid X) = 0$.
If the function r(·) is linear, we get the usual linear regression function already studied in the parametric macro methods under the assumption of normality. Here, instead, we do not restrict r(·) to belong to a specific parametric family.
Nonparametric regression function - kNN
Let us just consider the estimation of the regression function of Z on X (the results for Y are similar). As already seen in the parametric case, this regression function can be estimated restricting attention to sample B only. Estimation can be performed by means of the k Nearest Neighbour (k-NN) estimator
$$\hat r_k(x) = \sum_{b=1}^{n_B} W_{kb}(x)\, z_b, \qquad W_{kb}(x) = \begin{cases} 1/k & \text{if } b \in J_x \\ 0 & \text{otherwise,} \end{cases}$$
where $W_{kb}(x)$, $b = 1, \dots, n_B$, is a sequence of weights assigned to the units in B after ordering them by $|x - x_b|$ from the smallest to the largest value, and $J_x$ contains the labels of the first k units in this ordering.
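The k-NN estimator takes only a few lines of Python. This is a minimal sketch with equal weights 1/k on the k nearest donors; the function name and the data are hypothetical, and ties in distance are broken by the order of the units in B.

```python
def knn_regression(x, xb, zb, k):
    """Average of the Z values of the k units in B nearest to x."""
    # Order the labels of B by increasing |x - x_b| and keep the first k (J_x).
    order = sorted(range(len(xb)), key=lambda b: abs(x - xb[b]))
    j_x = order[:k]
    return sum(zb[b] for b in j_x) / k

# Hypothetical sample B.
xb = [1, 2, 3, 10]
zb = [10, 20, 30, 100]
r_hat = knn_regression(2.2, xb, zb, k=2)   # neighbours: x_b = 2 and x_b = 3
```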
Nonparametric regression function - kNN
In practice, the expected value of Z given X=x is obtained by averaging the Z values of the k observations nearest to X=x.
Conclusions: nonparametric macro methods
- Although not used in practice, this part is useful in order to understand what happens in the widely used nonparametric micro methods
- Two main problems have been shown: estimation of the empirical cumulative distribution function and of the nonparametric regression function
Nonparametric micro methods
Output: macro or micro. Approach: parametric or nonparametric.
Those who applied these methods seldom assumed anything about the distribution of (X, Y, Z). Yet each micro method has a macro counterpart, i.e. an implicit representation of the distribution of (X, Y, Z). The question is: are we aware of it or not?
Nonparametric micro methods
The nonparametric micro matching methods consist essentially of three imputation procedures:
- Random hot deck
- Rank hot deck
- Distance hot deck
As already seen in the parametric case, each of these methods corresponds to a specific nonparametric macro estimate of the distribution f(x, y, z) or of one of its characteristic values. In general, these methods do not organize the two data sets A and B as a unique sample A ∪ B.
Nonparametric micro methods
The idea is to consider one file as the recipient and the other as the donor:
- A is the recipient file: these are the data to impute
- B is the donor file: these are the data to use for imputation
Example
In order to define the different hot deck methods, let's consider an example. Let A and B be the following:
- A: n_A = 6, observed variables: Gender, Age, Income
- B: n_B = 10, observed variables: Gender, Age, Expenditures
- A = recipient, B = donor
- Common variables: X = (X_1 = Gender, X_2 = Age)
- Y = (Income), Z = (Expenditures)
Random hot deck: the method
Draw one value at random from B and assign it to the first record to impute in A. Follow the same procedure for all a ∈ A.
Example: in general we have $n_B^{n_A} = 10^6$ possible different ways to impute A.
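A minimal sketch of the random hot deck, drawing with replacement from the donors. The function name and the expenditure values are hypothetical; only the sizes (6 recipients, 10 donors) follow the example.

```python
import random

def random_hot_deck(n_recipients, donor_values, rng):
    """Impute each recipient with a value drawn at random from the donor file."""
    return [rng.choice(donor_values) for _ in range(n_recipients)]

# Hypothetical expenditures of the 10 donors in B.
z_donors = [12, 15, 9, 20, 18, 11, 14, 16, 10, 13]
z_imputed = random_hot_deck(6, z_donors, random.Random(0))

# Each of the 6 recipients can receive any of the 10 donors.
n_ways = len(z_donors) ** 6
```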
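Conditional random hot deck: the method
Let's fix a conditioning variable, e.g. X_1. For the first record, a = 1, draw one value at random from the subset of units in B that share its value of X_1 (here, the units b with X_{1b} = F). Follow the same procedure for all a ∈ A.
Example: let m_A and m_B be the numbers of units with X_1 = F in A and B respectively. Since the choices are made independently for each recipient, the number of different ways to complete the data set is $m_B^{m_A} \times (n_B - m_B)^{n_A - m_A} = 6^4 \times 4^2 = 20736$.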
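The conditional version only changes the pool each recipient draws from. The sketch below uses hypothetical names and data, and assumes that every class observed in A has at least one donor in B.

```python
import random

def conditional_random_hot_deck(recipient_classes, donor_classes, donor_values, rng):
    """Impute each recipient from the donors sharing its class (e.g. gender)."""
    pools = {}
    for c, v in zip(donor_classes, donor_values):
        pools.setdefault(c, []).append(v)
    # A KeyError here means that a recipient class has no donor in B.
    return [rng.choice(pools[c]) for c in recipient_classes]

# Hypothetical data: gender of recipients, gender and expenditures of donors.
gender_a = ["F", "M", "F"]
gender_b = ["F", "F", "M"]
z_b = [100, 200, 300]
z_imputed = conditional_random_hot_deck(gender_a, gender_b, z_b, random.Random(0))
```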
Comments
- Random hot deck corresponds to a random generation of the values to impute from the empirical cumulative distribution function of Z
- Conditional random hot deck corresponds to a random generation of the values to impute from the empirical cumulative distribution function of Z|X
- It is possible to remove an already drawn value from the set of possible donors (constrained procedure); however, the preservation of the observed distribution of Z or Z|X in B is then jeopardized
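Rank hot deck
Let's assume that n_B = k n_A, with k integer. Compute the empirical cumulative distribution functions of X in A and in B:
$$\hat F_X^A(x) = \frac{1}{n_A} \sum_{a=1}^{n_A} I(x_a \le x), \qquad \hat F_X^B(x) = \frac{1}{n_B} \sum_{b=1}^{n_B} I(x_b \le x)$$
To each a ∈ A assign b* ∈ B chosen so that
$$\big|\hat F_X^A(x_a) - \hat F_X^B(x_{b^*})\big| = \min_{b \in B} \big|\hat F_X^A(x_a) - \hat F_X^B(x_b)\big|$$
In other words, this method matches units whose values of X correspond to similar quantiles in A and B respectively.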
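The matching rule can be sketched as follows. Names and data are hypothetical, and ties are resolved by taking the first donor in file order, which is an arbitrary choice made for this sketch.

```python
def ecdf(sample, x):
    """Empirical cumulative distribution function of the sample, evaluated at x."""
    return sum(v <= x for v in sample) / len(sample)

def rank_hot_deck(xa, xb, zb):
    """Impute Z in A from the donor whose ECDF value of X is closest."""
    f_b = [ecdf(xb, x) for x in xb]
    imputed = []
    for x in xa:
        f_a = ecdf(xa, x)
        b_star = min(range(len(xb)), key=lambda b: abs(f_a - f_b[b]))
        imputed.append(zb[b_star])
    return imputed

# Hypothetical data with n_B = 2 * n_A.
xa = [30, 50]
xb = [20, 40, 60, 80]
zb = [1, 2, 3, 4]
z_imputed = rank_hot_deck(xa, xb, zb)
```

Here the recipient with x = 30 sits at the 0.5 quantile of A, so it receives the value of the donor at the 0.5 quantile of B (x = 40), even though the raw values of X differ.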
Rank hot deck
Rank the two samples A and B according to X1.
These are the values of the empirical cumulative distribution function of X1 in A and B respectively:
This is the result:
In this example, there is only one way to impute a value.
Distance hot deck
To each a ∈ A assign the b* ∈ B that is nearest according to the common variables.
This method depends on the distance function. Different choices are available:
- If X is numeric, it is possible to choose the Manhattan distance, $d_{ab} = |x_a - x_b|$; other options include the Euclidean distance,…
- If X is multivariate, the available distances include the Mahalanobis, Canberra,…
- If X is categorical and unordered, it is possible to use imputation classes (i.e. the distance is the «equality»)
Distance hot deck - example
Let's consider X2 as the variable used to compute distances (i.e. we choose as donors those records in B whose age is most similar to the one in A). If two or more donors are at the same minimum distance, choose one of them at random.
The overall distance between donors and recipients is
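A minimal sketch of the distance hot deck on one numeric matching variable, with random tie-breaking. The ages and expenditures are hypothetical, not those of the example above.

```python
import random

def distance_hot_deck(xa, xb, zb, rng):
    """Impute each recipient from its nearest donor (absolute distance on X)."""
    imputed, total_distance = [], 0.0
    for x in xa:
        dists = [abs(x - v) for v in xb]
        d_min = min(dists)
        # Several donors may be equally near: pick one of them at random.
        ties = [b for b, d in enumerate(dists) if d == d_min]
        imputed.append(zb[rng.choice(ties)])
        total_distance += d_min
    return imputed, total_distance

# Hypothetical ages in A and B, and expenditures in B.
ages_a = [22, 41]
ages_b = [20, 40, 60]
expend_b = [5, 6, 7]
z_imputed, overall_distance = distance_hot_deck(ages_a, ages_b, expend_b,
                                                random.Random(0))
```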
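Constrained distance hot deck
In the former procedure, a donor can be chosen more than once if it is the nearest to more than one record in A. In order to avoid reusing the same information more than once, the following constrained procedure has been defined:
Minimize
$$\sum_{a=1}^{n_A} \sum_{b=1}^{n_B} d_{ab}\, w_{ab}$$
under the constraints
$$\sum_{b=1}^{n_B} w_{ab} = 1,\; a = 1, \dots, n_A; \qquad \sum_{a=1}^{n_A} w_{ab} \le 1,\; b = 1, \dots, n_B; \qquad w_{ab} \in \{0, 1\},$$
where $w_{ab} = 1$ if donor b is assigned to recipient a and 0 otherwise.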
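For very small files the constrained problem can be solved by brute force, which makes the constraints easy to see. The sketch below is only illustrative (names and data are hypothetical, and it requires at least as many donors as recipients); real applications would use a dedicated assignment or linear programming solver.

```python
from itertools import permutations

def constrained_distance_hot_deck(xa, xb):
    """Assign a distinct donor to each recipient, minimising the total distance."""
    def cost(assignment):
        return sum(abs(x - xb[b]) for x, b in zip(xa, assignment))
    # Each candidate is an ordered choice of len(xa) distinct donors,
    # so no donor is used more than once.
    best = min(permutations(range(len(xb)), len(xa)), key=cost)
    return list(best), cost(best)

# Hypothetical ages: both recipients are nearest to the first donor.
ages_a = [21, 22]
ages_b = [20, 40]
donors, constrained_total = constrained_distance_hot_deck(ages_a, ages_b)

# Unconstrained distance hot deck on the same data, for comparison.
unconstrained_total = sum(min(abs(x - v) for v in ages_b) for x in ages_a)
```

On these data the constraint forces the second recipient onto a far donor, so the total distance rises from 3 to 19.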
Constrained distance hot deck: advantages and disadvantages
- Constrained matching makes it possible to preserve the marginal distribution of the variable to impute (Z)
- Constrained distance hot deck is characterized by a larger overall distance between donors and recipients than the unconstrained distance hot deck
Constrained distance hot deck: example
The overall donor-recipient distance is
Constrained distance hot deck: comments
- Distance hot deck is equivalent to estimating a nonparametric regression function via the kNN method with k=1
- These methods always produce live data as imputations
- Sometimes parametric and nonparametric procedures are applied together: mixed methods
Example:
1. Regression step: impute intermediate values in A and B
2. Matching step: use a distance hot deck, selecting the donor b* at the shortest distance
