Published by Silvester McLaughlin. Modified over 9 years ago.
1
Eurostat
Statistical Matching using auxiliary information
Training Course «Statistical Matching», Rome, 6-8 November 2013
Marco Di Zio, Dept. Integration, Quality, Research and Production Networks Development Department, Istat
dizio [at] istat.it
2
Outline
- The problem
- Auxiliary information
- Auxiliary information in parametric models
- Auxiliary information in nonparametric models
- References
3
The problem
Let $A \cup B$ be a sample of $n_A + n_B$ i.i.d. observations from f(x, y, z), with Z missing on the records of A and Y missing on the records of B.
Two alternative models are identifiable for $A \cup B$: the CIA (conditional independence assumption) and the PIA. The reason is that these models involve only the distributions of X, Y|X and Z|X.
When the CIA (or PIA) is not appropriate for our problem, auxiliary information is needed (if we want a point estimate).
4
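Example. The normal case
$(X, Y, Z) \sim N(\mu, \Sigma)$
The inestimable parameter is $\sigma_{yz}$ (or, equivalently, $\rho_{yz}$).
Under the CIA this is $\sigma_{yz} = \sigma_{xy}\sigma_{xz} / \sigma^2_x$.
In general it holds that $\sigma_{yz} = \sigma_{xy}\sigma_{xz} / \sigma^2_x + \sigma_{yz|x}$.
We need information to fill the gap: $\sigma_{yz|x} = ?$ (or $\rho_{yz|x}$).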
5
Regression
[equations missing in the source]
6
Auxiliary information
In general there are two different kinds of auxiliary information:
1) a third file C where either (X, Y, Z) or (Y, Z) are jointly observed
2) a plausible value of the inestimable parameters of either (Y, Z|X) or (Y, Z)
7
Sources
Sources may not be perfect:
- an outdated statistical investigation;
- an administrative register;
- a supplemental (even small) ad hoc survey;
- proxy variables (Y°, Z°)
8
Auxiliary information on parameters
Previous surveys, assumptions made by the researcher, or proxy variables may suggest a value $\theta^*$ for the inestimable parameters.
Two kinds of information:
- information about $\theta_{YZ|X}$
- information about $\theta_{YZ}$
9
Auxiliary information on parameters
Consequences of information on parameters:
- it restricts the parameter space $\Theta$ to a subspace $\Theta^* \subset \Theta$
- $\Theta^*$ contains all the parameters in $\Theta$ compatible with the auxiliary information
10
Auxiliary information and likelihood
Combining estimates and auxiliary information is easier when the information is about $\theta_{YZ|X}$.
In general, the pdf $f(x, y, z; \theta)$ may be written as:
$f(x, y, z; \theta) = f_X(x; \theta_X) \, f_{YZ|X}(y, z \mid x; \theta_{YZ|X})$
where $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $z \in \mathcal{Z}$, and the parameter space $\Theta = \{\theta\}$ is reparametrised in two sets, $\Theta_X = \{\theta_X\}$ and $\Theta_{YZ|X} = \{\theta_{YZ|X}\}$.
11
Auxiliary information about $\theta_{YZ|X}$
This information is precious but rarely available.
An interesting case is when $(X, Y, Z) \sim N(\mu, \Sigma)$. In this case the only information required is on $\rho_{YZ|X}$.
Algorithm for the MLE:
- estimate $\theta_X$ on $A \cup B$
- estimate $\theta_{Y|X}$ on A and $\theta_{Z|X}$ on B
- with the previous estimates and $\rho_{YZ|X} = \rho^*_{YZ|X}$ we can compute the remaining parameters of the joint distribution
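The last step of the algorithm can be sketched for a scalar X as follows. This is only an illustration under stated assumptions: the function and argument names are hypothetical, the variances and the covariances with X are taken as already estimated, and the computation relies on the standard normal-theory identity that the covariance of Y and Z equals the part explained by X plus the partial covariance given X.

```python
import math

def sigma_yz_from_partial_corr(s_xx, s_yy, s_zz, s_xy, s_xz, rho_yz_x):
    """Covariance of Y and Z implied by a postulated partial correlation."""
    # Residual variances of Y and Z after regressing each on X.
    s_yy_x = s_yy - s_xy ** 2 / s_xx
    s_zz_x = s_zz - s_xz ** 2 / s_xx
    # Partial covariance implied by the auxiliary value rho*_{YZ|X}.
    s_yz_x = rho_yz_x * math.sqrt(s_yy_x * s_zz_x)
    # Part explained by X plus the partial covariance.
    return s_xy * s_xz / s_xx + s_yz_x

# Illustrative numbers only (unit variances, so covariances are correlations).
print(sigma_yz_from_partial_corr(1.0, 1.0, 1.0, 0.9, 0.6, 0.0))  # CIA: rho* = 0
print(sigma_yz_from_partial_corr(1.0, 1.0, 1.0, 0.9, 0.6, 0.5))
```

Setting the partial correlation to zero reproduces the CIA value, so the CIA appears here as one particular choice of the auxiliary value.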
12
Auxiliary information about $\theta_{YZ}$
This information is more problematic:
- it does not guarantee a unique MLE (see, e.g., the multinomial distribution);
- it is not easy to combine this information with the estimates obtained from $A \cup B$; it requires constrained maximisation approaches.
13
Auxiliary information about $\theta_{YZ}$
This information does not guarantee a unique MLE.
We cannot estimate a log-linear model like [formula missing in the source].
However, we can estimate [formula missing in the source].
14
Auxiliary information about $\theta_{YZ}$
Normal distribution: this information guarantees a unique MLE.
The only parameter involving Y and Z is $\sigma_{yz}$. Information on it is sufficient to fill the lack of knowledge.
15
Auxiliary information about $\theta_{YZ}$
Let us estimate $\rho_{yx}$ and $\rho_{zx}$ with [estimators missing in the source] and let $\rho_{yz} = \rho^*_{yz}$. There are two possibilities:
1) the auxiliary information is compatible with the estimates
16
Auxiliary information about $\theta_{YZ}$
2) the auxiliary information is NOT compatible with the estimates
17
Example: auxiliary information on $\sigma_{yz} = \rho_{yz}$
Let us suppose that [estimates missing in the source].
The value $\rho^*_{yz} = 0.7$ is compatible: $\det(\Sigma) = 0.096$,
while $\rho^*_{yz} = 0.9$ is not compatible: $\det(\Sigma) = -0.008$.
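The compatibility check in this example amounts to asking whether the completed correlation matrix is positive definite. The sketch below illustrates it. The estimated correlations of X with Y and Z are not preserved in the text, so the values 0.9 and 0.6 used here are an assumption, chosen because they reproduce both determinants quoted above; the function names are hypothetical.

```python
import math

def corr_det(r_xy, r_xz, r_yz):
    """Determinant of the 3x3 correlation matrix of (X, Y, Z)."""
    return 1 + 2 * r_xy * r_xz * r_yz - r_xy ** 2 - r_xz ** 2 - r_yz ** 2

def admissible_interval(r_xy, r_xz):
    """Range of r_yz for which the correlation matrix is positive semidefinite."""
    half_width = math.sqrt((1 - r_xy ** 2) * (1 - r_xz ** 2))
    return r_xy * r_xz - half_width, r_xy * r_xz + half_width

R_XY, R_XZ = 0.9, 0.6  # assumed values, not taken from the slide text

for r_star in (0.7, 0.9):
    det = corr_det(R_XY, R_XZ, r_star)
    # With |r_xy| < 1, a positive determinant means the matrix is positive definite.
    print(r_star, round(det, 3), "compatible" if det > 0 else "not compatible")

print(admissible_interval(R_XY, R_XZ))
```

The interval returned by `admissible_interval` gives every value of the Y-Z correlation that is compatible with the two estimated correlations, which is a quick way to screen a candidate auxiliary value before using it.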
18
Micro approach
As in the micro approach under the CIA:
- conditional mean
- random draw
19
Conditional mean: normal distribution
Imputation of Z in A
20
Random draw
Imputation of Z in A
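The imputation formulas of the two slides above are not preserved in the text. As a hedged illustration, the sketch below uses the standard normal-theory regression of Z on (X, Y) for scalar variables, assuming a full mean vector and covariance matrix are available (for example after fixing the Y-Z covariance with auxiliary information). All function names are hypothetical.

```python
import random

def z_given_xy(mean, cov):
    """Regression of Z on (X, Y) for a trivariate normal.

    mean = (m_x, m_y, m_z); cov is the 3x3 covariance matrix as nested
    lists, ordered (X, Y, Z). Returns (intercept, b_x, b_y, residual_variance).
    """
    s_xx, s_xy, s_xz = cov[0]
    s_yy, s_yz = cov[1][1], cov[1][2]
    s_zz = cov[2][2]
    d = s_xx * s_yy - s_xy ** 2  # determinant of the (X, Y) block
    b_x = (s_xz * s_yy - s_yz * s_xy) / d
    b_y = (s_yz * s_xx - s_xz * s_xy) / d
    intercept = mean[2] - b_x * mean[0] - b_y * mean[1]
    resid_var = s_zz - b_x * s_xz - b_y * s_yz
    return intercept, b_x, b_y, resid_var

def impute_conditional_mean(records_a, mean, cov):
    """Impute Z in A with its conditional expectation given (x, y)."""
    a0, b_x, b_y, _ = z_given_xy(mean, cov)
    return [a0 + b_x * x + b_y * y for x, y in records_a]

def impute_random_draw(records_a, mean, cov, rng):
    """Impute Z in A with a draw from its conditional normal distribution."""
    a0, b_x, b_y, resid_var = z_given_xy(mean, cov)
    sd = resid_var ** 0.5
    return [a0 + b_x * x + b_y * y + rng.gauss(0.0, sd) for x, y in records_a]

# Illustrative parameters only: zero means, unit variances.
mean = (0.0, 0.0, 0.0)
cov = [[1.0, 0.9, 0.6], [0.9, 1.0, 0.7], [0.6, 0.7, 1.0]]
file_a = [(1.0, 1.0), (-0.5, 0.2)]
print(impute_conditional_mean(file_a, mean, cov))
print(impute_random_draw(file_a, mean, cov, random.Random(0)))
```

The conditional mean gives the same imputed value to every record with the same (x, y), so it understates the variability of Z; the random draw adds the residual noise back.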
21
Non-parametric methods
Auxiliary information may be an additional file C.
Micro: hot-deck (A recipient and B donor)
- any record in A is imputed with a record from C (if a distance is used, it is computed on (X, Y) if C is (X, Y, Z), or on Y if C is (Y, Z)). The imputed record is $(x_a, y_a, \tilde z_a^{(1)} = z_{c^*})$
- Z in a is imputed with a live value $\tilde z_a^{(2)} = z_{b^*}$ from B through hot-deck. If a distance is used, $b^* \in B$ minimizes $d((x_a, z_{c^*}), (x_b, z_b))$
- the final data set is composed of $(x_a, y_a, \tilde z_a^{(2)})$
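The two-step hot-deck described above can be sketched as follows. The sketch assumes scalar X, Y and Z, a file C that observes (X, Y, Z), an unscaled squared Euclidean distance, and donors that may be reused; none of these choices is prescribed by the slides, and the function names are hypothetical.

```python
def nearest(target, candidates, key):
    """Return the candidate whose key(...) is closest to target."""
    def dist(c):
        # Squared Euclidean distance between target and the candidate's key.
        return sum((t - k) ** 2 for t, k in zip(target, key(c)))
    return min(candidates, key=dist)

def two_step_hot_deck(file_a, file_b, file_c):
    """file_a: (x, y) records; file_b: (x, z) records; file_c: (x, y, z) records."""
    completed = []
    for x_a, y_a in file_a:
        # Step 1: intermediate value from the nearest record of C on (X, Y).
        c_star = nearest((x_a, y_a), file_c, key=lambda c: (c[0], c[1]))
        z_tilde_1 = c_star[2]
        # Step 2: live value from the nearest record of B on (X, Z).
        b_star = nearest((x_a, z_tilde_1), file_b, key=lambda b: (b[0], b[1]))
        completed.append((x_a, y_a, b_star[1]))
    return completed

# Toy data, for illustration only.
file_a = [(1.0, 2.0), (5.0, 1.0)]
file_b = [(1.0, 11.0), (5.0, 29.0), (1.0, 31.0)]
file_c = [(1.1, 2.1, 10.0), (4.8, 0.9, 30.0)]
print(two_step_hot_deck(file_a, file_b, file_c))
```

In practice the variables would usually be standardised before computing distances, since X and Z may be on very different scales.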
22
Auxiliary information
Auxiliary information can be:
1. information on the inestimable parameters (e.g. $\rho_{YZ}$), as already introduced
2. information on some parameters not directly identifying the model; for instance, (X, Y, Z) are continuous but the contingency table of a categorization of them is known
This kind of auxiliary information can be dealt with by using mixed methods as well as non-parametric methods.
23
Mixed methods
They combine parametric and non-parametric approaches, mainly in two steps:
1. estimate the parametric model
2. use a hot-deck procedure for the imputation of the missing data
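A minimal sketch of the two-step idea, in its simplest form: a linear regression of Z on a scalar X fitted on B, followed by a nearest-neighbour hot-deck that replaces each prediction with an observed value. This particular variant uses no auxiliary information and is only one possible instance, not necessarily the one the slides have in mind; the function names are hypothetical.

```python
def fit_line(pairs):
    """Ordinary least squares for z = a + b * x on (x, z) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_z = sum(z for _, z in pairs) / n
    s_xx = sum((x - mean_x) ** 2 for x, _ in pairs)
    s_xz = sum((x - mean_x) * (z - mean_z) for x, z in pairs)
    slope = s_xz / s_xx
    return mean_z - slope * mean_x, slope

def mixed_impute(xs_a, file_b):
    """xs_a: x values of the records in A; file_b: (x, z) records of B."""
    intercept, slope = fit_line(file_b)
    imputed = []
    for x_a in xs_a:
        z_pred = intercept + slope * x_a  # step 1: parametric prediction
        # Step 2: replace the prediction with the closest observed z in B.
        _, z_live = min(file_b, key=lambda b: abs(b[1] - z_pred))
        imputed.append(z_live)
    return imputed

# Toy data, for illustration only.
file_b = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
print(fit_line(file_b))
print(mixed_impute([0.9, 1.8], file_b))
```

The hot-deck step guarantees that every imputed value is one actually observed in B, which protects against implausible values when the parametric model is only approximately right.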
24
Mixed methods: auxiliary file C
Regression step 1
Regression step 2
Matching step: each obs. a is imputed with the value $z_{b^*}$ corresponding to its nearest neighbor $b^*$ in B.
25
Mixed methods: auxiliary file C
Regression step
Matching step: each obs. a is imputed with the value $z_{b^*}$ corresponding to its nearest neighbor $b^*$ in B.
26
Mixed methods: auxiliary file C
Categorical variables
1. Estimation step: estimate $\theta_{ijk}$ through maximum likelihood applied to file C
2. Matching step: for each obs. a, a value $z_{b^*}$ is found through a hot-deck procedure. This value is used for the imputation if the corresponding estimated frequency of the cell (X=i, Y=j, Z=k) is not exceeded
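The matching step with a frequency constraint can be sketched as below. Several details are assumptions made for illustration and are not given in the slides: the cell probabilities estimated on C are turned into quotas by rounding their product with the size of A, donors are restricted to records of B with the same x, they are scanned in file order, and a record is left unimputed when no admissible donor exists. The function name is hypothetical.

```python
from collections import Counter

def constrained_hot_deck(file_a, file_b, cell_probs):
    """file_a: (x, y) categorical records; file_b: (x, z) records;
    cell_probs: dict mapping (x, y, z) to probabilities estimated on file C."""
    n_a = len(file_a)
    # Largest number of records of A allowed in each cell (rounded target).
    quota = {cell: round(p * n_a) for cell, p in cell_probs.items()}
    used = Counter()
    completed = []
    for x_a, y_a in file_a:
        z_imp = None
        # Donors are scanned in file order among B records with the same x.
        for x_b, z_b in file_b:
            cell = (x_a, y_a, z_b)
            if x_b == x_a and used[cell] < quota.get(cell, 0):
                used[cell] += 1
                z_imp = z_b
                break
        completed.append((x_a, y_a, z_imp))  # None if no admissible donor
    return completed

# Toy data, for illustration only.
file_a = [("a", "u"), ("a", "u"), ("a", "v"), ("b", "u")]
file_b = [("a", "k1"), ("a", "k2"), ("b", "k2")]
cell_probs = {("a", "u", "k1"): 0.25, ("a", "u", "k2"): 0.25,
              ("a", "v", "k1"): 0.25, ("b", "u", "k2"): 0.25}
print(constrained_hot_deck(file_a, file_b, cell_probs))
```

The quota is what carries the auxiliary information: the completed file A reproduces the joint distribution estimated on C, while the imputed values themselves still come from B.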
27
Mixed methods: ‘coarse’ information
We do not know the parameters of (X, Y, Z), but we know the contingency table for a categorization (X°, Y°, Z°) of (X, Y, Z).
1. Hot-deck step: for each obs. a in A determine a ‘live’ value $z_{c^*}$ from the record $c^*$ in C that is nearest with respect to a distance $d((x_a, y_a), (x_c, y_c))$. It is imputed only if the frequency of (X°, Y°, Z°) in A is not exceeded. Otherwise continue.
2. Matching step: for each obs. a in A impute the live value $z_{b^*}$ corresponding to the nearest neighbor $b^*$ in B, i.e. the record minimizing the distance $d((x_a, \tilde z_a), (x_b, z_b))$.
28
Selected references
- Rässler S. (2002) Statistical Matching: a frequentist theory, practical applications and alternative Bayesian approaches, Springer
- Moriarity C., Scheuren F. (2001) “Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure”, Jour. of Official Statistics, 17, 407–422
- Moriarity C., Scheuren F. (2003) “A Note on Rubin’s Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputation”, Jour. of Business and Economic Statistics, 21, 65–73
- Moriarity C., Scheuren F. (2004) “Regression-based statistical matching: recent developments”, Proceedings of the Section on Survey Research Methods, American Statistical Association
- D’Orazio M., Di Zio M., Scanu M. (2006) “Statistical Matching for Categorical Data: displaying uncertainty and using logical constraints”, Jour. of Official Statistics, 22, 1–22