Eurostat Statistical Matching using auxiliary information Training Course «Statistical Matching» Rome, 6-8 November 2013 Marco Di Zio Dept. Integration,

1 Eurostat Statistical Matching using auxiliary information Training Course «Statistical Matching» Rome, 6-8 November 2013 Marco Di Zio Dept. Integration, Quality, Research and Production Networks Development Department, Istat dizio [at] istat.it

2 Outline
• The problem
• Auxiliary information
• Auxiliary information in parametric models
• Auxiliary information in nonparametric models
• References

3 The problem
• Let $A \cup B$ be a sample of $n_A + n_B$ i.i.d. observations from $f(x, y, z)$, with $Z$ missing on the records of $A$ and $Y$ missing on the records of $B$.
• Two alternative models are identifiable for $A \cup B$: the CIA (conditional independence assumption) and the PIA.
• The reason is that these models involve only the distributions of $X$, $Y|X$ and $Z|X$.
• When the CIA (or PIA) is not appropriate for the problem, auxiliary information is needed (if we want a point estimate).

4 Example. The normal case
• $(X, Y, Z) \sim N(\mu, \Sigma)$
• The inestimable parameter is $\sigma_{yz}$ (or equivalently $\rho_{yz}$)
• Under the CIA this is $\sigma_{yz} = \sigma_{xy}\sigma_{xz} / \sigma^2_x$
• In general it holds that $\sigma_{yz} = \sigma_{xy}\sigma_{xz} / \sigma^2_x + \sigma_{yz|x}$
• We need information to fill the gap: $\sigma_{yz|x} = ?$ (or $\rho_{yz|x}$)
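As a quick numeric check of the decomposition above, the following Python sketch computes $\sigma_{yz}$ with and without the CIA. The function name and the input values are illustrative assumptions, not taken from the slides, and X is assumed to be univariate.

```python
def sigma_yz(sigma_xy, sigma_xz, var_x, sigma_yz_given_x=0.0):
    """Covariance of Y and Z in the trivariate normal model.

    sigma_yz_given_x is the partial covariance of Y and Z given X;
    leaving it at 0 corresponds to the CIA.
    """
    return sigma_xy * sigma_xz / var_x + sigma_yz_given_x


# Hypothetical inputs: unit variance for X, covariances 0.9 and 0.6.
cia_value = sigma_yz(0.9, 0.6, 1.0)            # CIA: partial covariance is 0
general_value = sigma_yz(0.9, 0.6, 1.0, 0.16)  # assumed partial covariance 0.16
print(cia_value, general_value)
```

With these inputs the CIA value is about 0.54, and an assumed partial covariance of 0.16 raises it to about 0.70, which shows how much the unobserved term can move the result.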

5 Regression where

6 Auxiliary information
• In general there are two different kinds of auxiliary information:
1) a third file C where either (X, Y, Z) or (Y, Z) are jointly observed
2) a plausible value of the inestimable parameters of either (Y, Z|X) or (Y, Z)

7 Sources
Sources may not be perfect:
• an outdated statistical investigation;
• an administrative register;
• a supplemental (even small) ad hoc survey;
• proxy variables (Y°, Z°)

8 Auxiliary information on parameters
Previous surveys, assumptions made by the researcher, or proxy variables may suggest a value $\theta^*$ for the inestimable parameters. Two kinds of information:
• information about $\theta_{YZ|X}$
• information about $\theta_{YZ}$

9 Auxiliary information on parameters
• Consequences of information on parameters.
• It restricts the parameter space $\Theta$ to a subspace $\Theta^* \subset \Theta$
• $\Theta^*$ contains all the parameters in $\Theta$ compatible with the auxiliary information

10 Auxiliary information and likelihood
• Combining estimates and auxiliary information is easier when the information is about $\theta_{YZ|X}$
• In general, the pdf $f(x, y, z; \theta)$ may be written as: $f(x, y, z; \theta) = f_X(x; \theta_X)\, f_{YZ|X}(y, z|x; \theta_{YZ|X})$
• where $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $z \in \mathcal{Z}$ and the parameter space is $\Theta = \{\theta\}$
• Reparametrised in two sets: $\Theta_X = \{\theta_X\}$, $\Theta_{YZ|X} = \{\theta_{YZ|X}\}$.

11 Auxiliary information about $\theta_{YZ|X}$
This information is precious but rarely available.
• An interesting case is when $(X, Y, Z) \sim N(\mu, \Sigma)$.
• In this case the only information required is on $\rho_{YZ|X}$.
Algorithm for the MLE:
• estimate $\theta_X$ on $A \cup B$
• estimate $\theta_{Y|X}$ on $A$ and $\theta_{Z|X}$ on $B$
• with the previous estimates and $\rho_{YZ|X} = \rho^*_{YZ|X}$ we can compute $\sigma_{YZ|X}$ and, from it, $\sigma_{YZ}$
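The three steps of the algorithm can be sketched in Python for univariate X, Y and Z. This is a minimal illustration under stated assumptions: the function names and the data layout (file A as (x, y) pairs, file B as (x, z) pairs) are hypothetical, and the formulas are the standard regression-based expressions for the normal model rather than anything recoverable from the slide.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def regress(xs, ys):
    """ML simple regression of ys on xs: (intercept, slope, residual variance)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid_var = mean([(y - intercept - slope * x) ** 2 for x, y in zip(xs, ys)])
    return intercept, slope, resid_var


def ml_with_partial_correlation(file_a, file_b, rho_star):
    """Estimate the trivariate normal parameters given rho_{YZ|X} = rho_star."""
    xs_a, ys_a = zip(*file_a)
    xs_b, zs_b = zip(*file_b)

    # Step 1: parameters of X on the concatenated file A U B.
    x_all = xs_a + xs_b
    mu_x = mean(x_all)
    var_x = mean([(x - mu_x) ** 2 for x in x_all])

    # Step 2: parameters of Y|X on A and of Z|X on B.
    a_y, b_yx, var_y_x = regress(xs_a, ys_a)
    a_z, b_zx, var_z_x = regress(xs_b, zs_b)

    # Step 3: plug in the auxiliary partial correlation.
    cov_yz_x = rho_star * sqrt(var_y_x * var_z_x)
    return {
        "mu": (mu_x, a_y + b_yx * mu_x, a_z + b_zx * mu_x),
        "var_x": var_x,
        "var_y": b_yx ** 2 * var_x + var_y_x,
        "var_z": b_zx ** 2 * var_x + var_z_x,
        "cov_xy": b_yx * var_x,
        "cov_xz": b_zx * var_x,
        "cov_yz": b_yx * b_zx * var_x + cov_yz_x,
    }
```

Setting `rho_star` to 0 gives back the CIA estimate of the covariance between Y and Z, so the auxiliary value only changes the part of the covariance that A and B cannot identify.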

12 Auxiliary information about $\theta_{YZ}$
• This information is more problematic.
• It does not guarantee a unique MLE (see, e.g., the multinomial distribution).
• It is not easy to combine this information with the estimates obtained from $A \cup B$.
• It requires constrained maximisation approaches.

13 Auxiliary information about $\theta_{YZ}$
• This information does not guarantee a unique MLE
• We cannot estimate a log-linear model like
However we can estimate

14 Auxiliary information about $\theta_{YZ}$
Normal distribution
• This information guarantees a unique MLE
The only parameter involving Y, Z is $\sigma_{yz}$. Information on it is sufficient to fill the lack of knowledge.

15 Auxiliary information about $\theta_{YZ}$
Let us estimate $\rho_{yx}$ and $\rho_{zx}$, and let $\rho_{yz} = \rho^*_{yz}$. There are two possibilities:
1) Auxiliary info is compatible with estimates

16 Auxiliary information about $\theta_{YZ}$
2) Auxiliary info is NOT compatible with estimates

17 Example: Auxiliary info on $\sigma_{yz} = \rho_{yz}$
Let us suppose that
The value $\rho^*_{yz} = 0.7$ is compatible, $\det(\Sigma) = 0.096$, while $\rho^*_{yz} = 0.9$ is not compatible, $\det(\Sigma) = -0.008$.
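The matrix that followed "Let us suppose that" is missing from this transcript. As an assumption, correlations of 0.9 and 0.6 between X and the other two variables (in either order) reproduce both determinants quoted on the slide, so the sketch below uses them to illustrate the compatibility check. The function names are hypothetical, and "compatible" is taken here to mean that the completed correlation matrix is positive definite.

```python
from math import sqrt


def correlation_determinant(rho_xy, rho_xz, rho_yz):
    """Determinant of the 3x3 correlation matrix of (X, Y, Z)."""
    return (1.0 + 2.0 * rho_xy * rho_xz * rho_yz
            - rho_xy ** 2 - rho_xz ** 2 - rho_yz ** 2)


def is_compatible(rho_xy, rho_xz, rho_yz):
    """True when the completed correlation matrix is positive definite."""
    return (abs(rho_xy) < 1.0
            and correlation_determinant(rho_xy, rho_xz, rho_yz) > 0.0)


def admissible_interval(rho_xy, rho_xz):
    """Open interval of rho_yz values that keep the matrix positive definite."""
    centre = rho_xy * rho_xz
    half_width = sqrt((1.0 - rho_xy ** 2) * (1.0 - rho_xz ** 2))
    return centre - half_width, centre + half_width


# Assumed correlations (not in the transcript): 0.9 and 0.6.
for rho_star in (0.7, 0.9):
    det = correlation_determinant(0.9, 0.6, rho_star)
    print(rho_star, round(det, 3), is_compatible(0.9, 0.6, rho_star))
```

Under these assumed values the admissible interval for $\rho_{yz}$ is roughly (0.19, 0.89), which is why 0.7 is accepted and 0.9 is rejected.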

18 Micro approach
As in the micro approach under the CIA:
• Conditional mean
• Random draw

19 Conditional mean – Normal distribution
Imputation of Z in A

20 Random draw
Imputation of Z in A
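The imputation formulas of these two slides are missing from the transcript. The sketch below shows one way both imputations of Z in A could look for univariate X, Y and Z, using the standard conditional distribution of a normal vector. The function names, and the assumption that a full mean vector and covariance matrix (ordered X, Y, Z) have already been estimated with the auxiliary information, are illustrative.

```python
import random
from math import sqrt


def regression_of_z_on_xy(cov):
    """Coefficients and residual variance of Z regressed on (X, Y).

    cov is a 3x3 covariance matrix with variables ordered (X, Y, Z).
    """
    sxx, sxy, sxz = cov[0]
    syy, syz = cov[1][1], cov[1][2]
    szz = cov[2][2]
    det_xy = sxx * syy - sxy ** 2
    beta_x = (syy * sxz - sxy * syz) / det_xy
    beta_y = (sxx * syz - sxy * sxz) / det_xy
    resid_var = szz - beta_x * sxz - beta_y * syz
    return beta_x, beta_y, resid_var


def conditional_mean(x_a, y_a, mu, cov):
    """Impute Z for a record (x_a, y_a) of A with E[Z | X, Y]."""
    beta_x, beta_y, _ = regression_of_z_on_xy(cov)
    return mu[2] + beta_x * (x_a - mu[0]) + beta_y * (y_a - mu[1])


def random_draw(x_a, y_a, mu, cov, rng):
    """Impute Z with a draw from the conditional normal of Z given (X, Y)."""
    _, _, resid_var = regression_of_z_on_xy(cov)
    noise = rng.gauss(0.0, sqrt(resid_var))
    return conditional_mean(x_a, y_a, mu, cov) + noise


# Hypothetical parameter values for illustration only.
mu = (0.0, 0.0, 0.0)
cov = [[1.0, 0.9, 0.6], [0.9, 1.0, 0.7], [0.6, 0.7, 1.0]]
print(conditional_mean(1.0, 1.0, mu, cov))
print(random_draw(1.0, 1.0, mu, cov, random.Random(2013)))
```

The conditional mean gives every record with the same (x, y) the same imputed value, while the random draw adds residual noise so that the imputed Z keeps its conditional variability.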

21 Non-parametric methods
Auxiliary information may be an additional file C.
Micro hot-deck (A recipient and B donor):
• each record in A is imputed with a record from C (if a distance is used, it is computed on (X, Y) when C contains (X, Y, Z), or on Y when C contains (Y, Z)). The imputed record is $(x_a, y_a, \tilde{z}_a^{(1)} = z_{c^*})$
• Z in a is imputed with a live value $\tilde{z}_a^{(2)} = z_{b^*}$ from B through hot-deck. If a distance is used, $b^* \in B$ minimizes $d((x_a, z_{c^*}), (x_b, z_b))$
• the final data set is composed of $(x_a, y_a, \tilde{z}_a^{(2)})$
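The two-step hot-deck can be sketched as follows for the case where C contains (X, Y, Z). This is a minimal illustration: the function names, the tuple layout of the files and the use of an unscaled Euclidean distance are assumptions, since the slide does not fix a distance function.

```python
from math import dist


def nearest(target, candidates, key):
    """Return the candidate whose key(...) is closest to target."""
    return min(candidates, key=lambda rec: dist(target, key(rec)))


def two_step_hot_deck(file_a, file_b, file_c):
    """file_a: (x, y) records, file_b: (x, z) records, file_c: (x, y, z) records."""
    completed = []
    for x_a, y_a in file_a:
        # Step 1: intermediate value from the nearest record of C on (X, Y).
        c_star = nearest((x_a, y_a), file_c, key=lambda c: (c[0], c[1]))
        z_step1 = c_star[2]
        # Step 2: live value from the nearest record of B on (X, Z).
        b_star = nearest((x_a, z_step1), file_b, key=lambda b: (b[0], b[1]))
        completed.append((x_a, y_a, b_star[1]))
    return completed


# Hypothetical toy files.
file_a = [(1.0, 2.0), (5.0, 6.0)]
file_b = [(1.0, 9.0), (5.0, 21.0), (3.0, 15.0)]
file_c = [(1.1, 2.1, 10.0), (5.2, 5.9, 20.0)]
print(two_step_hot_deck(file_a, file_b, file_c))
```

The value taken from C is only used to steer the search in B, so every imputed Z in the final file is a value actually observed in the donor file B.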

22 Auxiliary information
Auxiliary information can be
1. information on the inestimable parameters (e.g. $\rho_{YZ}$), as already introduced
2. information on some parameters not directly identifying the model; for instance, (X, Y, Z) are continuous but the contingency table of a categorization of them is known
This kind of auxiliary information can be dealt with by using mixed methods as well as non-parametric methods.

23 Mixed methods
They combine parametric and non-parametric approaches, mainly in two steps:
1. estimate the parametric model
2. use a hot-deck procedure for the imputation of the missing data

24 Mixed methods: Auxiliary file C
• Regression step 1
• Regression step 2
• Matching step: each observation a is imputed with $z_{b^*}$, the value of its nearest neighbor $b^*$ in B

25 Mixed methods: Auxiliary file C
• Regression step
• Matching step: each observation a is imputed with $z_{b^*}$, the value of its nearest neighbor $b^*$ in B

26 Mixed methods: Auxiliary file C
Categorical variables
1. Estimation step: estimate $\theta_{ijk}$ through maximum likelihood applied to file C
2. Matching step: for each observation a, a value $z_{b^*}$ is found through a hot-deck procedure. This value is used for the imputation if the corresponding estimated frequency of the cell (X=i, Y=j, Z=k) is not exceeded

27 Mixed methods: ‘Coarse’ information
We do not know the parameters of (X, Y, Z), but we know the contingency table for a categorization (X°, Y°, Z°) of (X, Y, Z).
1. Hot-deck step: for each observation a in A, determine a ‘live’ value $z_{c^*}$ from the record $c^*$ in C that is nearest with respect to a distance $d((x_a, y_a), (x_c, y_c))$. It is imputed only if the frequency of (X°, Y°, Z°) in A is not exceeded. Otherwise continue.
2. Matching step: for each observation a in A, impute the live value $z_{b^*}$ of the nearest neighbor $b^*$ in B, i.e. the record minimizing the distance $d((x_a, \tilde{z}_a), (x_b, z_b))$.

28 Selected references
• Rässler S. (2002) Statistical Matching: a frequentist theory, practical applications and alternative Bayesian approaches, Springer
• Moriarity C., Scheuren F. (2001) “Statistical Matching: a Paradigm for Assessing the Uncertainty in the Procedure”, Jour. of Official Statistics, 17, 407–422
• Moriarity C., Scheuren F. (2003) “A Note on Rubin’s Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputation”, Jour. of Business and Economic Statistics, 21, 65–73
• Moriarity C., Scheuren F. (2004) “Regression–based statistical matching: recent developments”, Proceedings of the Section on Survey Research Methods, American Statistical Association
• D’Orazio M., Di Zio M., Scanu M. (2006) “Statistical Matching for Categorical Data: displaying uncertainty and using logical constraints”, Jour. of Official Statistics, 22, 1–22

