Statistical matching under the conditional independence assumption
Training Course «Statistical Matching», Rome, 6-8 November 2013
Mauro Scanu, Dept. Integration, Quality, Research and Production Networks Development, Istat. scanu [at] istat.it
Outline
- The conditional independence model (CIA)
- Parametric macro methods: the normal case; maximum likelihood
- Parametric micro methods: conditional mean matching; random draw
- Nonparametric macro methods
- Nonparametric micro methods: random hot deck; conditional random hot deck; rank hot deck; distance hot deck; constrained hot deck
- References
A first identifiable model
Let us consider the class of models F for (X,Y,Z) restricted to the following set:
f(x,y,z) = fY|X(y|x) fZ|X(z|x) fX(x)
where fY|X is the conditional density of Y given X, fZ|X is the conditional density of Z given X, and fX is the marginal density of X.
Consequence 1: this class of distributions for (X,Y,Z) imposes the conditional independence of Y and Z given X (CIA).
Consequence 2: this model is identifiable from the concatenated sample A ∪ B.
Note: this is not the only identifiable model!
Help: this model can be useful in many different cases (use of proxy variables and uncertainty)!
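In the normal case the CIA has a checkable implication: the correlation between Y and Z is forced to equal the product of the correlations with X. A minimal stdlib-Python sketch with simulated data (all names and parameter values are invented for illustration):

```python
import random
import statistics

def corr(u, v):
    """Pearson correlation between two equal-length lists."""
    mu, mv = statistics.fmean(u), statistics.fmean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den

random.seed(42)
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
# Y and Z each depend on X, but their error terms are independent: the CIA holds
y = [0.8 * xi + random.gauss(0, 0.6) for xi in x]
z = [0.5 * xi + random.gauss(0, 0.9) for xi in x]

# under the CIA, rho_YZ is determined by the associations with X:
# rho_YZ = rho_XY * rho_XZ
implied = corr(x, y) * corr(x, z)
observed = corr(y, z)
```

On such data `observed` and `implied` agree up to simulation noise, which is exactly what statistical matching under the CIA silently assumes about the unobserved (Y,Z) association.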
The different matching contexts
Approach: parametric. Output: macro.
Let us tackle this problem in a familiar context for inferential statistics: data are drawn according to a probability law that follows a parametric model, and the objective is macro. In the following we will mainly consider two distributions: the normal and the multinomial.
Parametric macro methods
In a parametric model, each probability law that can generate our sample data can be described by a finite number of parameters. Under the CIA, given the sample A ∪ B, the likelihood function factorizes into three pieces: the likelihood of the X parameters over all the units of A ∪ B, the likelihood of the Y|X parameters over the units of A, and the likelihood of the Z|X parameters over the units of B.
Parametric macro methods
Parameter estimation becomes straightforward:
- Use the whole sample A ∪ B for estimating the parameters of the marginal distribution of X
- Use A for estimating the parameters of Y given X
- Use B for estimating the parameters of Z given X
Parametric macro methods: the normal case
Let (X,Y,Z) be a trivariate normal r.v. with mean vector (μX, μY, μZ) and covariance matrix with elements σXX, σXY, σXZ, σYY, σYZ, σZZ. Under the CIA, the parameter σYZ is superfluous: it is determined by σYZ = σXY σXZ / σXX. For the statistical matching problem, it is convenient to consider the equivalent distribution defined by this parameterization: the parameters of the marginal distribution of X, of Y|X, and of Z|X.
Parametric macro methods: the normal case
Estimates for the re-parameterization: the regression parameters of Y on X and of Z on X, together with the marginal parameters of X (details in the next slides).
Parametric macro methods: the normal case
For the estimates of the parameters of the marginal distribution of X (its mean μX and variance σXX), the whole sample A ∪ B can be used: the sample mean and variance of X over the nA + nB pooled observations.
Parametric macro methods: the normal case
For the estimates of the parameters of the distribution of Y given X (the regression intercept, the slope βYX and the residual variance σYY|X), only sample A can be used. Hence, the marginal parameters for Y are derived: μY is estimated by evaluating the estimated regression line at the pooled estimate of μX, and σYY by the estimated residual variance plus βYX squared times the estimate of σXX.
Parametric macro methods: the normal case
For the estimates of the parameters of the distribution of Z given X (intercept, slope βZX and residual variance σZZ|X), only sample B can be used. Hence, the marginal parameters for Z are derived in the same way: μZ from the estimated regression line of Z on X evaluated at the pooled estimate of μX, and σZZ from the residual variance plus βZX squared times the estimate of σXX.
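The three estimation steps above (X from A ∪ B, Y|X from A, Z|X from B) can be sketched in stdlib Python. The helper `cia_normal_estimates` and the simulated files are illustrative assumptions, not course material:

```python
import random
import statistics

def linreg(x, y):
    """OLS intercept and slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    beta = sxy / sxx
    return my - beta * mx, beta

def cia_normal_estimates(xA, yA, xB, zB):
    """ML-style estimates under the CIA:
    marginal X from the pooled sample, Y|X from A only, Z|X from B only."""
    mu_x = statistics.fmean(xA + xB)
    a_y, b_y = linreg(xA, yA)          # regression of Y on X, file A
    a_z, b_z = linreg(xB, zB)          # regression of Z on X, file B
    # marginal means of Y and Z: regression lines evaluated at the pooled X mean
    return {"mu_x": mu_x,
            "mu_y": a_y + b_y * mu_x,
            "mu_z": a_z + b_z * mu_x,
            "beta_yx": b_y, "beta_zx": b_z}

random.seed(1)
xA = [random.gauss(10, 2) for _ in range(200)]
yA = [3 + 2 * x + random.gauss(0, 1) for x in xA]   # true mu_Y = 23
xB = [random.gauss(10, 2) for _ in range(300)]
zB = [1 - x + random.gauss(0, 1) for x in xB]       # true mu_Z = -9
est = cia_normal_estimates(xA, yA, xB, zB)
```

On these simulated files the estimates recover the generating parameters up to sampling error.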
Comment: why maximum likelihood estimation?
What happens if, instead of the previous maximum likelihood parameter estimation, we consider a direct estimation from the data set where the corresponding variable(s) are observed? For instance, consider a direct estimate of the mean of Y, i.e. the sample average of Y in A, instead of the ML estimate of μY, which evaluates the estimated regression of Y on X at the pooled mean of X (a kind of regression estimate in a double sampling scheme). The ratio between the variances of the ML estimator and of the sample average is approximately 1 − ρXY² nB/(nA + nB), where ρXY is the correlation coefficient between X and Y. The maximum likelihood estimator is much more efficient when the size of B increases and X and Y are highly correlated.
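The efficiency claim can be checked by simulation: with a high correlation and nB much larger than nA, the regression-type ML estimator of the mean of Y should be far less variable than the plain sample average in A. A hedged sketch (all parameter values are arbitrary):

```python
import random
import statistics

random.seed(7)
rho, n_A, n_B, reps = 0.9, 50, 500, 2000
naive, regr = [], []
for _ in range(reps):
    # (X, Y) bivariate normal, zero means, unit variances, correlation rho
    xA = [random.gauss(0, 1) for _ in range(n_A)]
    yA = [rho * x + random.gauss(0, (1 - rho ** 2) ** 0.5) for x in xA]
    xB = [random.gauss(0, 1) for _ in range(n_B)]   # extra X values from file B
    mxA, myA = statistics.fmean(xA), statistics.fmean(yA)
    sxx = sum((x - mxA) ** 2 for x in xA)
    sxy = sum((x - mxA) * (y - myA) for x, y in zip(xA, yA))
    beta = sxy / sxx
    mx_all = statistics.fmean(xA + xB)              # pooled X mean from A u B
    naive.append(myA)                               # plain sample mean of Y in A
    regr.append(myA + beta * (mx_all - mxA))        # regression-type ML estimate
# the second estimator exploits the X values observed in B through the regression
```

With these settings the empirical variance of `regr` is several times smaller than that of `naive`, in line with the 1 − ρ² nB/(nA + nB) factor.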
Comment: why maximum likelihood estimation?
When parameters are estimated separately on the part of A ∪ B that is complete for the corresponding r.v., it might happen that the estimates are not coherent. For instance, the estimated covariance matrix for (X, Y, Z), assembled from the separate estimates, may fail to be positive semidefinite! This does not happen when all the parameters are estimated simultaneously by maximum likelihood.
Example (numerical data for the normal case shown as a slide figure)
Example
Under the CIA, the maximum likelihood estimates of the parameters are obtained with the formulas of the previous slides; from those estimates the remaining parameters of the joint distribution follow (numerical values shown on the slide).
Parametric macro methods: the multinomial case
Let (X,Y,Z) be a multinomial r.v. with parameters θijk = P(X = i, Y = j, Z = k), i = 1,…,I, j = 1,…,J, k = 1,…,K, where θ is the vector of these parameters, with the following characteristics: θijk ≥ 0 and the θijk sum to 1.
Parametric macro methods: the multinomial case
Adopting the same re-parameterization of the joint distribution, under the CIA the parameters of interest are θi = P(X = i), θj|i = P(Y = j | X = i) and θk|i = P(Z = k | X = i). In this context, the parameters of the joint distribution are computed according to θijk = θi · θj|i · θk|i. When the interest is only in the pairwise distribution of (Y, Z): θjk = Σi θi · θj|i · θk|i.
Parametric macro methods: the multinomial case
Given the sample A ∪ B, the maximum likelihood estimators are the relative frequencies: θi estimated from the X categories over all of A ∪ B, θj|i from the (X, Y) table in A, and θk|i from the (X, Z) table in B.
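A sketch of the multinomial estimator: pooled X frequencies for θi, conditional frequencies from A for θj|i and from B for θk|i, multiplied to give the joint table. The helper `cia_multinomial` and the toy data are illustrative assumptions:

```python
from collections import Counter

def cia_multinomial(A, B):
    """A: list of (i, j) pairs observed for (X, Y); B: list of (i, k) pairs for (X, Z).
    Returns theta[(i, j, k)] = theta_i * theta_{j|i} * theta_{k|i} under the CIA."""
    nA, nB = len(A), len(B)
    x_counts = Counter([i for i, _ in A] + [i for i, _ in B])  # pooled X frequencies
    xy, xz = Counter(A), Counter(B)
    xA = Counter(i for i, _ in A)
    xB = Counter(i for i, _ in B)
    theta = {}
    for i in x_counts:
        th_i = x_counts[i] / (nA + nB)            # theta_i from A u B
        for (ia, j), c in xy.items():
            if ia != i:
                continue
            th_j_i = c / xA[i]                    # theta_{j|i} from A only
            for (ib, k), d in xz.items():
                if ib != i:
                    continue
                th_k_i = d / xB[i]                # theta_{k|i} from B only
                theta[(i, j, k)] = th_i * th_j_i * th_k_i
    return theta

# toy data (invented): X in {1, 2}, Y in {1, 2}, Z in {1, 2, 3}
A = [(1, 1), (1, 1), (1, 2), (2, 1), (2, 2), (2, 2)]
B = [(1, 1), (1, 2), (1, 3), (1, 3), (2, 1), (2, 1), (2, 2), (2, 3), (2, 3), (2, 3)]
theta = cia_multinomial(A, B)
```

The estimated θijk automatically sum to 1 because each conditional block sums to 1.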
Example
Let us consider two samples A and B, where I = 2, J = 2, K = 3 (data shown on the slide).
Example
The maximum likelihood estimates of the parameters, and hence of the parameters of the joint distribution, are computed with the formulas above (numerical values shown on the slides).
Parametric macro methods: conclusions
- The CIA model is identifiable (i.e. with a unique estimate) for the data set A ∪ B
- The application of the maximum likelihood estimator is very easy
- Even if the problem is characterized by missing data, it can be split into three «complete data» subproblems, one for each block of the re-parameterization
- Other estimation methods can be statistically consistent but mutually incoherent
Selected references
- Anderson T W (1957) "Maximum likelihood estimates for a multivariate normal distribution when some observations are missing", JASA, 52, 200–203
- Anderson T W (1984) An Introduction to Multivariate Statistical Analysis, Wiley
- Rubin D B (1974) "Characterizing the Estimation of Parameters in Incomplete-Data Problems", JASA, 69, 467–474
- D'Orazio M, Di Zio M, Scanu M (2006) "Statistical matching for categorical data: displaying uncertainty and using logical constraints", JOS, 22
- Moriarity C, Scheuren F (2001) "Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure", JOS, 17
The different matching contexts
Approach: parametric. Output: micro.
We are still in the familiar context of parametric data models, but now the objective is micro, i.e. we are interested in a complete data set where (X, Y, Z) are jointly available. This is the context where imputation methods are usually used!
Parametric micro methods
Objective: to create a complete data set for (X, Y, Z). Context: a partially observed data set.
Parametric micro methods
Method: imputation of missing values. In a parametric context:
- Estimate the distribution parameters
- Take a (not necessarily random) value from the estimated distribution
Parametric micro methods: conditional mean matching
A first method consists in replacing the missing values with the corresponding conditional expected values. The unknown parameters can be substituted by the estimates already discussed for the parametric macro methods.
Parametric micro methods: conditional mean matching
Example: consider the normal case. Imputations are performed using the estimated regression functions of Z on X (for file A) and of Y on X (for file B).
Comments:
- Each imputed value is the one with the shortest expected distance from all the possible values under the estimated distribution (good if the purpose is to study unit characteristics, not population characteristics)
- The imputed values might not be «live» values
- These imputed values shrink the variability of the imputed variable
In other words: is the complete data set an optimal one?
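The variance shrinkage mentioned above is easy to show in a sketch: imputed values sit exactly on the regression line, so the residual variance is lost. All names and data below are invented for illustration:

```python
import random
import statistics

def fit_ols(x, y):
    """OLS intercept and slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    beta = (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))
    return my - beta * mx, beta

random.seed(3)
# donor file B observes (X, Z); recipient file A observes X (besides Y)
xB = [random.gauss(40, 10) for _ in range(300)]
zB = [100 + 2 * x + random.gauss(0, 30) for x in xB]
xA = [random.gauss(40, 10) for _ in range(100)]

alpha, beta = fit_ols(xB, zB)
z_imputed = [alpha + beta * x for x in xA]   # conditional mean matching

# the imputed values lie on the regression line: their variance is roughly
# beta^2 * Var(X), missing the residual variance observed in the donor file
```

Here the variance of `z_imputed` is well below the variance of the observed `zB`, which is exactly the deficiency the slides point out.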
Example (imputed data shown as a slide figure)
Example
The imputed Z variance is just a bit smaller than the observed one in B.
Example
The imputed Y variance is much smaller than the observed one in A (30.21 instead of 179.41). Why? The imputed values lie exactly on the estimated regression line, so the residual variance of Y given X is entirely lost.
Parametric micro methods: conditional random draw
In order to preserve as much as possible the observed distributions, these are the steps to follow:
- Estimate the parameters according to a parametric macro method
- For each a, a = 1,…,nA, generate a random draw of Z from the estimated distribution of Z given X = xa
- For each b, b = 1,…,nB, generate a random draw of Y from the estimated distribution of Y given X = xb
Parametric micro methods: conditional random draw
Example: normal case. The imputed value is the estimated regression prediction plus a residual e, where e is generated from a normal distribution with zero mean and variance equal to the estimated residual variance of Z|X (for imputations in A) and of Y|X (for imputations in B), respectively.
Example: multinomial case.
- Impute Z values in A through a random draw from the estimated distribution θk|i, with i the X category of the record
- Impute Y values in B through a random draw from the estimated distribution θj|i, with i the X category of the record
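Adding an estimated-residual draw to the regression prediction restores the variability lost by conditional mean matching. An illustrative sketch (invented data, normal case):

```python
import random
import statistics

def fit_ols_with_resid(x, y):
    """OLS intercept, slope and residual variance of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    beta = (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))
    alpha = my - beta * mx
    resid_var = statistics.pvariance(
        [yi - (alpha + beta * xi) for xi, yi in zip(x, y)])
    return alpha, beta, resid_var

random.seed(3)
xB = [random.gauss(40, 10) for _ in range(300)]          # donor file B: (X, Z)
zB = [100 + 2 * x + random.gauss(0, 30) for x in xB]
xA = [random.gauss(40, 10) for _ in range(100)]          # recipient file A: X

alpha, beta, s2 = fit_ols_with_resid(xB, zB)
# conditional random draw: regression prediction plus a fresh normal residual
z_imputed = [alpha + beta * x + random.gauss(0, s2 ** 0.5) for x in xA]
```

Unlike conditional mean matching, the variance of `z_imputed` is now comparable to the variance of Z observed in B.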
Parametric micro methods: conclusions
- Parametric micro methods are based on the estimates obtained under the macro methods
- Caution is needed about the variability of the imputed variables in the completed sample
- Results on this point have been found under the name of matching noise
Selected references
- Little R J A, Rubin D B (2002) Statistical Analysis with Missing Data, 2nd edition, Wiley
- Rubin D B (1986) "Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations", Journal of Business and Economic Statistics, 4, 87–94
- Kadane J B (1978) "Some Statistical Problems in Merging Data Files", in Compendium of Tax Research, Department of Treasury, U.S. Government Printing Office, Washington D.C., 159–179. Republished in Journal of Official Statistics, 17, 423–433
- Marella D, Scanu M, Conti P L (2008) "On the matching noise of some nonparametric imputation procedures", Statistics and Probability Letters, 78, 1593–1600
- Conti P L, Marella D, Scanu M (2008) "Evaluation of matching noise for imputation techniques based on the local linear regression estimator", Computational Statistics and Data Analysis, 53, 354–365
Nonparametric macro methods
Approach: nonparametric. Output: macro.
The family of distributions to which the distribution of (X, Y, Z) belongs cannot be represented by a finite number of parameters. Although the statistical literature on nonparametrics is huge, this is by far the most neglected situation in statistical matching! Anyway we examine it in order to link macro and micro methods, as in the parametric case.
Nonparametric macro methods
We will not consider the case of Y and Z categorical, because it would mainly reduce to the already described parametric methods. We will mainly refer to two easy nonparametric macro estimates, corresponding to the cases where X is categorical and numerical, respectively:
- X categorical, Y and Z ordered or numerical: estimation of the empirical cumulative distribution function
- X, Y and Z numerical: estimation of the nonparametric regression function
As a matter of fact, the first approach will be helpful for the random generation of imputations in the corresponding micro methods, the second for conditional mean matching and, whenever possible, adding a random residual.
Empirical cumulative distribution function
Under the CIA, the joint cumulative distribution of Y and Z given X factorizes as FY,Z|X(y, z | x) = FY|X(y | x) · FZ|X(z | x). Each factor can be estimated respectively from A and B, by the empirical cumulative distribution functions of Y given X = x in A and of Z given X = x in B.
Nonparametric regression function
Let us assume that Z = r(X) + ε, where ε is such that E(ε | X) = 0. If the function r(·) is linear, then we get the usual linear regression function already studied in the parametric macro methods under the assumption of normality. Here, however, we do not restrict r(·) to belong to a specific parametric family.
Nonparametric regression function: kNN
Let us just consider the estimation of the regression function of Z on X (the results for Y are similar). As already seen in the parametric case, this regression function can be estimated restricting attention to sample B only. Estimation can be performed by means of the k nearest neighbour (kNN) estimator: the estimate of r(x) is a weighted sum of the zb with weights Wkb, b = 1,…,nB, assigned to the units in B after ordering them by |x − xb| from the smallest to the largest value; Jx contains the first k unit labels of the ordering, and Wkb = 1/k for b in Jx and 0 otherwise.
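A minimal kNN regression estimator as described above, averaging the Z values of the k donors nearest in X; the function name and the toy data are illustrative:

```python
def knn_regression(x0, xB, zB, k=3):
    """kNN estimate of E(Z | X = x0) from donor pairs (xB, zB):
    average the Z values of the k donors closest to x0."""
    nearest = sorted(range(len(xB)), key=lambda b: abs(x0 - xB[b]))[:k]
    return sum(zB[b] for b in nearest) / k

# toy donor file B (invented values)
xB = [1.0, 2.0, 3.0, 4.0, 10.0]
zB = [10.0, 20.0, 30.0, 40.0, 100.0]
est = knn_regression(2.1, xB, zB, k=3)   # neighbours of 2.1: x = 2, 3, 1
```

With k = 1 this reduces to taking the single nearest donor, which is exactly distance hot deck (as noted later in the deck).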
Nonparametric regression function: kNN
In practice the expected value of Z given X = x is obtained by averaging the Z values of the k observations nearest to X = x.
Conclusions on nonparametric macro methods:
- Although not much used in practice, this part is useful in order to understand what happens in the widely used nonparametric micro methods
- Mainly two problems have been shown: estimation of the empirical cumulative distribution function and of the nonparametric regression function
Selected references
- Wand M, Jones C (1995) Kernel Smoothing, Chapman & Hall
- Härdle W (1992) Applied Nonparametric Regression, Cambridge University Press
- Paass G (1985) "Statistical record linkage methodology, state of the art and future prospects", Bulletin of the International Statistical Institute, Proceedings of the 45th Session, volume LI, Book 2
- Marella D, Scanu M, Conti P L (2008) "On the matching noise of some nonparametric imputation procedures", Statistics and Probability Letters, 78, 1593–1600
- Conti P L, Marella D, Scanu M (2008) "Evaluation of matching noise for imputation techniques based on the local linear regression estimator", Computational Statistics and Data Analysis, 53, 354–365
Nonparametric micro methods
Approach: nonparametric. Output: micro.
Those who applied these methods seldom assumed anything about the distribution of (X, Y, Z). Nevertheless, each micro method has a macro counterpart, i.e. an implicit representation of the distribution of (X, Y, Z). The problem is: to be aware of it or not?
Nonparametric micro methods
The nonparametric micro matching methods consist essentially of three imputation procedures:
- Random hot deck
- Rank hot deck
- Distance hot deck
As already seen in the parametric case, each one of these methods corresponds to a specific nonparametric macro estimate of the distribution f(x, y, z) or of one of its characteristic values. In general, these methods do not organize the two data sets A and B as a unique sample A ∪ B.
Nonparametric micro methods
The idea is to consider one file as the recipient and the other as the donor:
- A is the recipient file: its records contain the data to impute
- B is the donor file: its records contain the data to use for imputation
Example
In order to define the different hot deck methods, let us consider an example. Let A and B be the following:
- A: nA = 6, observed variables Gender, Age, Income
- B: nB = 10, observed variables Gender, Age, Expenditures
A = recipient, B = donor. Common variables: X = (X1 = Gender, X2 = Age); Y = Income; Z = Expenditures.
Example (the two data tables are shown as a slide figure)
Random hot deck: the method
Let us draw one record at random from B and use its Z value to impute the first record of A; follow the same procedure for all a ∈ A. In general there are nB^nA possible different ways to impute A (10^6 = 1,000,000 in the example).
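The method above can be sketched in a few lines; the donor values are invented for illustration:

```python
import random

def random_hot_deck(recipient_n, donor_z, rng):
    """Impute Z for each recipient by drawing a donor at random (with replacement)."""
    return [rng.choice(donor_z) for _ in range(recipient_n)]

rng = random.Random(0)
donor_z = [120, 80, 95, 130, 60, 110, 70, 100, 90, 85]   # expenditures in B (toy)
imputed = random_hot_deck(6, donor_z, rng)
# every imputed value is a "live" value actually observed in B
```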
Conditional random hot deck: the method
Let us fix a conditioning variable, e.g. X1. For the first record a = 1, draw one record at random from the subset of units in B such that X1b equals the recipient's value (here X1b = F); follow the same procedure for all a ∈ A. With mA and mB units having X1 = F in A and B respectively, the number of different completed data sets we can get is mB^mA · (nB − mB)^(nA − mA) (the slide reports 1312 for the example).
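The conditional variant restricts the draw to donors sharing the recipient's X (here gender, as in the example); a hedged sketch with invented values:

```python
import random
from collections import defaultdict

def conditional_random_hot_deck(recip_x, donor_x, donor_z, rng):
    """Impute Z by drawing a donor at random among those sharing the recipient's X."""
    pool = defaultdict(list)
    for x, z in zip(donor_x, donor_z):
        pool[x].append(z)
    return [rng.choice(pool[x]) for x in recip_x]

rng = random.Random(0)
donor_x = ["F", "F", "M", "M", "M", "F", "M", "F", "M", "M"]   # gender in B (toy)
donor_z = [120, 80, 95, 130, 60, 110, 70, 100, 90, 85]
recip_x = ["F", "M", "F", "M", "M", "F"]                        # gender in A (toy)
imputed = conditional_random_hot_deck(recip_x, donor_x, donor_z, rng)
```

Each imputed value comes from the empirical distribution of Z given X in B, which is exactly the macro counterpart stated on the next slide.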
Comments
- Random hot deck corresponds to a random generation of the values to impute from the empirical cumulative distribution function of Z
- Conditional random hot deck corresponds to a random generation of the values to impute from the empirical cumulative distribution function of Z|X
- It is possible to eliminate each drawn value from the set of possible donors (constrained procedure); however, the preservation of the observed distribution of Z or Z|X in B is then jeopardized
Rank hot deck
Let us assume that nB = k · nA, with k integer. Compute the empirical cumulative distribution functions of X in A and in B. To each a ∈ A assign the b* ∈ B whose empirical cumulative distribution value matches that of a. In other words, this method imputes the values whose quantiles of X are similar in A and B respectively.
Rank hot deck
Rank the two samples A and B according to X1, and compute the values of the empirical cumulative distribution function of X1 in A and B respectively (tables shown on the slides).
Rank hot deck
This is the result (shown on the slide). In this example, there is only one way to impute a value.
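The rank-matching rule can be sketched under the assumption nB = k · nA: recipients and donors are paired through the positions of their empirical cdf values. The toy ages and expenditures are invented:

```python
def rank_hot_deck(recip_x, donor_x, donor_z):
    """Assumes nB = k * nA. Match each recipient to the donor occupying the
    corresponding position of the empirical cdf of X in the donor file."""
    nA, nB = len(recip_x), len(donor_x)
    assert nB % nA == 0
    k = nB // nA
    recip_order = sorted(range(nA), key=lambda a: recip_x[a])
    donor_order = sorted(range(nB), key=lambda b: donor_x[b])
    imputed = [None] * nA
    for rank, a in enumerate(recip_order, start=1):
        # the recipient with cdf value rank/nA gets the donor with the same cdf value
        b = donor_order[rank * k - 1]
        imputed[a] = donor_z[b]
    return imputed

recip_age = [25, 61, 33]                      # file A (toy)
donor_age = [22, 28, 35, 41, 50, 66]          # file B (toy), k = 2
donor_exp = [60, 70, 85, 90, 100, 130]
imputed = rank_hot_deck(recip_age, donor_age, donor_exp)
```

As the slides note, once the ranks are computed there is a single possible completed data set.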
Distance hot deck
To each a ∈ A assign the b* ∈ B that is nearest according to the common variables. This method depends on the distance function, and different choices are available:
- If X is univariate and numeric, it is possible to choose the Manhattan distance |xa − xb|; other distances, such as the Euclidean, can also be used
- If X is multivariate, available distances include the Mahalanobis, the Canberra, …
- If X is categorical and unordered, it is possible to consider classes of imputation (i.e. the distance is «equality»)
Distance hot deck: example
Let us use X2 (Age) to compute distances, i.e. we choose as donor the record in B whose age is most similar to the recipient's age in A; choose one donor at random if several donors are at the same distance. The overall distance between donors and recipients is shown on the slide.
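A sketch of distance hot deck with Manhattan distance on age and random tie-breaking, also returning the overall recipient-donor distance (toy data, invented):

```python
import random

def distance_hot_deck(recip_x, donor_x, donor_z, rng):
    """For each recipient, take the Z value of the nearest donor
    (Manhattan distance on a single numeric X); ties are broken at random."""
    imputed, total = [], 0
    for xa in recip_x:
        dmin = min(abs(xa - xb) for xb in donor_x)
        ties = [b for b, xb in enumerate(donor_x) if abs(xa - xb) == dmin]
        imputed.append(donor_z[rng.choice(ties)])
        total += dmin
    return imputed, total

rng = random.Random(0)
recip_age = [25, 61, 33]                      # file A (toy)
donor_age = [22, 28, 35, 41, 50, 66]          # file B (toy)
donor_exp = [60, 70, 85, 90, 100, 130]
imputed, total_distance = distance_hot_deck(recip_age, donor_age, donor_exp, rng)
```

Note that the same donor may serve several recipients, which motivates the constrained version on the next slide.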
Constrained distance hot deck
In the former procedure, a donor can be chosen more than once if it is the nearest to more than one record in A. In order not to reuse the same donor more than once, the following constrained procedure has been defined: minimize the total distance Σa Σb dab wab under the constraints Σb wab = 1 for every a ∈ A, Σa wab ≤ 1 for every b ∈ B, and wab ∈ {0, 1} (this requires nB ≥ nA).
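A sketch of the constrained version on the same toy data: each donor is usable at most once and the total distance is minimised. Brute force over donor permutations is used only because the example is tiny; real implementations solve an assignment or transportation problem:

```python
from itertools import permutations

def constrained_distance_hot_deck(recip_x, donor_x, donor_z):
    """Each donor used at most once; minimise the total recipient-donor distance.
    Brute-force search, feasible only for very small files."""
    best, best_total = None, float("inf")
    for assign in permutations(range(len(donor_x)), len(recip_x)):
        total = sum(abs(xa - donor_x[b]) for xa, b in zip(recip_x, assign))
        if total < best_total:
            best, best_total = assign, total
    return [donor_z[b] for b in best], best_total

recip_age = [25, 61, 33]                      # file A (toy)
donor_age = [22, 28, 35, 41, 50, 66]          # file B (toy)
donor_exp = [60, 70, 85, 90, 100, 130]
imputed, total_distance = constrained_distance_hot_deck(recip_age, donor_age, donor_exp)
```

On this toy example the unconstrained optimum already uses distinct donors, so the two totals coincide; in general the constrained total distance can only be larger or equal, as the next slide states.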
Constrained distance hot deck: advantages and disadvantages
- Constrained matching preserves the marginal distribution of the variable to impute (Z)
- Constrained distance hot deck is characterized by a larger overall distance between donors and recipients than distance hot deck
Constrained distance hot deck: example
The overall donor-recipient distance is larger than in the unconstrained case (value shown on the slide).
Constrained distance hot deck: comments
- Distance hot deck is equivalent to the estimation of a nonparametric regression function via the kNN method with k = 1
- These methods always produce live values as imputations
- Sometimes, parametric and nonparametric procedures are applied together: mixed methods. Example: regression step, impute intermediate values in A and B via the estimated regression functions; matching step, use a distance hot deck selecting the b* with the shortest distance from the intermediate value
Selected references
- Kadane J B (1978) "Some Statistical Problems in Merging Data Files", in Compendium of Tax Research, Department of Treasury, U.S. Government Printing Office, Washington D.C., 159–179. Republished in Journal of Official Statistics, 17, 423–433
- Little R J A, Rubin D B (2002) Statistical Analysis with Missing Data, 2nd edition, Wiley
- Okner B A (1972) "Constructing a new data base from existing microdata sets: the 1966 merge file", Annals of Economic and Social Measurement, 1, 325–342
- Rodgers W L (1984) "An Evaluation of Statistical Matching", Journal of Business and Economic Statistics, 2, 91–102
- Rubin D B (1986) "Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations", Journal of Business and Economic Statistics, 4, 87–94
- Sims C A (1972) "Comments on Okner", Annals of Economic and Social Measurement, 1, 343–345
- Singh A C, Mantel H, Kinack M, Rowe G (1993) "Statistical Matching: Use of Auxiliary Information as an Alternative to the Conditional Independence Assumption", Survey Methodology, 19, 59–79
- Marella D, Scanu M, Conti P L (2008) "On the matching noise of some nonparametric imputation procedures", Statistics and Probability Letters, 78, 1593–1600
- Conti P L, Marella D, Scanu M (2008) "Evaluation of matching noise for imputation techniques based on the local linear regression estimator", Computational Statistics and Data Analysis, 53, 354–365