Eurostat
Accuracy of Results of Statistical Matching
Training Course «Statistical Matching», Rome, 6-8 November 2013
Marcello D’Orazio, Dept. National Accounts and Economic Statistics, Istat
madorazi [at] istat.it

Outline

- Problems in evaluation
- Evaluation in macro statistical matching (SM) applications
- Evaluation in micro SM applications

Steps in an SM application

1) Choice of the target variables, i.e. the variables observed separately in the two sample surveys.
2) Identification of all the common variables in the two sources. Not all of them can be used, because of lack of harmonization, different definitions, etc.
3) Choice of the matching variables: only those common variables that are able to predict the target variables.
4) Application of the chosen SM technique.
5) Evaluation of the results of the SM.

For more details see Scanu (2008).

Problems in Evaluation

(i) The general objective of SM is to study the relationship between phenomena that are not jointly observed, unless an additional auxiliary data source is available.
(ii) SM can provide different outputs:
- a synthetic data set in the micro case;
- one or more estimates (e.g. a correlation coefficient, a regression coefficient, probabilities in a contingency table, etc.) in the macro case.
(iii) There are two or more data sources, which may have different quality “levels” (sampling design, sample size, data processing steps, etc.).

Problems in Evaluation: (i) Phenomena Not Jointly Observed

The fact that the target phenomena are not jointly observed is the major source of uncertainty concerning the matching results. This lack of information has to be filled in by:
- making some assumptions (e.g. the conditional independence of the target variables given the matching variables);
- using additional auxiliary information (an external estimate of the parameters of interest, an additional data source, etc.).

The alternative approach is to make no such assumption and to evaluate just the uncertainty due to this situation.
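
As an illustration of what evaluating just the uncertainty can mean, consider the simplest setting of one matching variable X and two target variables Y and Z. Requiring the 3x3 correlation matrix to be positive semi-definite restricts the unobserved correlation between Y and Z to an interval centred on the value implied by conditional independence. The sketch below is only a minimal illustration under that assumption; the function name and the input values are hypothetical.

```python
import math

def rho_yz_bounds(rho_xy, rho_xz):
    """Interval of rho_YZ values compatible with the observed rho_XY and rho_XZ."""
    # Centre of the interval: the value implied by conditional independence.
    centre = rho_xy * rho_xz
    # Half-width follows from the positive semi-definiteness of the correlation matrix.
    half_width = math.sqrt((1 - rho_xy ** 2) * (1 - rho_xz ** 2))
    return centre - half_width, centre + half_width

# Hypothetical correlations: rho_XY estimated from survey A, rho_XZ from survey B.
low, high = rho_yz_bounds(0.9, 0.5)
print(f"rho_YZ lies between {low:.4f} and {high:.4f}")
```

The narrower the interval, the less the matching results depend on the assumption used to fill in the missing joint information.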

Problems in Evaluation: (i) Phenomena Not Jointly Observed (cont.)

The results of the SM will necessarily reflect the underlying assumptions or information being used:
- The results of a matching application based on the conditional independence (CI) assumption will reflect it; they will be unreliable if CI does not hold.
- If auxiliary information is used (so that the CI assumption is avoided), the results of the SM are expected to reflect that input. If the input information is not reliable, the results of the SM will be unreliable.

In this setting a researcher “knows” what to expect, but still has to check whether the chosen matching method has been applied correctly, without introducing additional noise or bias.

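Evaluation: Checks in the Macro Case

The outputs are estimates of parameters, so it may be easy to check whether some additional noise has been introduced. Under the CI assumption it is possible, in some cases, to derive analytic estimation formulas for the parameters of interest. Examples, with X denoting the matching variable and Y, Z the target variables:
- correlation coefficient: $\rho_{YZ} = \rho_{XY}\,\rho_{XZ}$
- cell probabilities: $P(Y=j, Z=k) = \sum_i P(Y=j \mid X=i)\,P(Z=k \mid X=i)\,P(X=i)$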
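
A check of this kind can be sketched for the contingency-table case: under conditional independence, P(Y=j, Z=k) is the sum over the categories i of X of P(Y=j | X=i) P(Z=k | X=i) P(X=i), so the table produced by the matching procedure can be compared with this analytic benchmark. The function name and all the probabilities below are hypothetical illustration values, not taken from any real survey.

```python
def ci_joint(p_x, p_y_given_x, p_z_given_x):
    """Joint distribution of (Y, Z) under conditional independence given X."""
    n_y = len(p_y_given_x[0])
    n_z = len(p_z_given_x[0])
    joint = [[0.0] * n_z for _ in range(n_y)]
    for i, px in enumerate(p_x):
        for j in range(n_y):
            for k in range(n_z):
                # P(Y=j | X=i) * P(Z=k | X=i) * P(X=i), accumulated over i
                joint[j][k] += p_y_given_x[i][j] * p_z_given_x[i][k] * px
    return joint

p_x = [0.4, 0.6]                          # distribution of the matching variable X
p_y_given_x = [[0.7, 0.3], [0.2, 0.8]]    # row i: P(Y | X=i), estimable from survey A
p_z_given_x = [[0.5, 0.5], [0.1, 0.9]]    # row i: P(Z | X=i), estimable from survey B

joint = ci_joint(p_x, p_y_given_x, p_z_given_x)
for row in joint:
    print([round(cell, 3) for cell in row])
```

If the estimates coming out of the matching application differ noticeably from this benchmark, the difference points to noise added by the procedure rather than to the CI assumption itself.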

Evaluation: Checks in the Macro Case (cont.)

When auxiliary information has been used:
- If it consists of an external estimate of the target parameter, it is possible to compare it with the final estimate obtained at the end of the SM.
- If it consists of an estimate of a parameter that is not the target one, it is necessary to understand the relationship between the two parameters.
- If it consists of an additional data source, it is necessary to understand how it has been used in the whole SM estimation process.

Evaluation: Checks in the Macro Case (cont.)

Some suggestions for complex situations:
a) Test the SM in a small pilot study in which it is easy to control the whole process.
b) Carry out a sensitivity analysis (check how the output changes when one or more of the input parameters are changed).
c) Carry out a series of simulations: replicate the matching application a large number of times for a given set of inputs (sometimes only a small, controlled amount of randomness is introduced).
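
A sensitivity analysis as in point b) can be sketched for the correlation case. With one matching variable X, the correlation between Y and Z equals rho_XY * rho_XZ plus the partial correlation of Y and Z given X times sqrt((1 - rho_XY^2)(1 - rho_XZ^2)). Treating the partial correlation as the input parameter and varying it over a grid shows how much the output moves. The grid and the correlation values are hypothetical, chosen only for illustration.

```python
import math

def rho_yz(rho_xy, rho_xz, partial_yz_given_x):
    """rho_YZ implied by rho_XY, rho_XZ and an assumed partial correlation of Y and Z given X."""
    spread = math.sqrt((1 - rho_xy ** 2) * (1 - rho_xz ** 2))
    return rho_xy * rho_xz + partial_yz_given_x * spread

RHO_XY, RHO_XZ = 0.9, 0.5                # hypothetical estimates from the two surveys
grid = [-0.5, -0.25, 0.0, 0.25, 0.5]     # assumed values of the input parameter

results = {p: rho_yz(RHO_XY, RHO_XZ, p) for p in grid}
for p, value in results.items():
    print(f"partial correlation = {p:+.2f} -> rho_YZ = {value:.3f}")
```

A partial correlation of 0 reproduces the conditional independence result, so the spread of the other values around it gives a rough idea of how strongly the output depends on that input.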

Evaluation: Checks in the Micro Case

The output of SM is a synthetic file with all the needed variables. It should be checked whether it can be considered a representative sample (in a wide sense, considering the relationships between variables too). This check can be done only partially, because of the lack of joint information concerning Y and Z, unless some auxiliary information is available.

Evaluation: Checks in the Micro Case (cont.)

Rässler (2002) suggests looking at the “validity” of the SM procedure by analyzing how the synthetic data set:
a) preserves the marginal distribution of the imputed variable (the reference is the one in the donor data set);
b) preserves the joint distribution of the imputed variable with the matching variables (the reference is the one in the donor data set).

To compare the marginal or joint distributions of the variables in the synthetic data set with those in the donor data set, it is possible to use statistical tests and descriptive measures.

Evaluation: Micro Case, Comparing Distributions

Statistical tests (e.g. Chi-Square, Kolmogorov-Smirnov, etc.) can be used to compare the distribution of the Z variable imputed in the synthetic data set with the one in the donor data set (the reference).
- Ad hoc modified tests are available to deal with data from complex sample surveys (for modified Chi-Square tests cf. Särndal et al., 1992, pp. 500-513).
- The modified tests require additional information (estimates of the sampling variance or of the design effect), which in some cases may not be available.
- Relatively few modified tests are available (e.g. there is no Kolmogorov-Smirnov test that accounts for a complex sampling design).
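
For a categorical Z, the unmodified Pearson Chi-Square statistic can be sketched as follows. This is a deliberately simplified illustration: it assumes simple random sampling (no design-effect correction) and treats the donor distribution as a fixed reference. The counts, the proportions and the function name are hypothetical.

```python
def chi_square_stat(observed_counts, reference_props):
    """Pearson Chi-Square statistic of observed counts against reference proportions."""
    n = sum(observed_counts)
    return sum(
        (obs - n * prop) ** 2 / (n * prop)
        for obs, prop in zip(observed_counts, reference_props)
    )

imputed_counts = [48, 30, 22]     # hypothetical counts of imputed Z in the synthetic file
donor_props = [0.5, 0.3, 0.2]     # hypothetical distribution of Z in the donor file

stat = chi_square_stat(imputed_counts, donor_props)
CRITICAL_5PCT_DF2 = 5.991         # 95th percentile of the Chi-Square distribution, 2 d.f.
print(f"statistic = {stat:.2f}, reject at 5% level: {stat > CRITICAL_5PCT_DF2}")
```

With data from a complex survey the statistic would have to be corrected, for instance by an estimated design effect, which is exactly the additional information that may be missing.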

Evaluation: Micro Case, Similarity/diss. Between Distributions

When the statistical tests are not applicable, it is possible to adopt an “empirical approach”, which consists in comparing the marginal distributions estimated from the two surveys by means of similarity/dissimilarity measures.

The dissimilarity index, or total variation distance, between two distributions is:
$\Delta_{AB} = \frac{1}{2} \sum_j \left| p_{A,j} - p_{B,j} \right|$, with $0 \le \Delta_{AB} \le 1$,
where $p_{A,j}$ and $p_{B,j}$ are the relative frequencies of category $j$ in A and B; 0 means that the distributions are equal.

It can be interpreted as the smallest fraction of units in A that would need to be re-classified in order to make the distribution equal to that in B. According to Agresti (2002, pp. 329-330), a value below about 0.02 or 0.03 denotes that the data in A follow the distribution in B quite closely, even though not perfectly.
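
The dissimilarity index is half the sum of the absolute differences between the relative frequencies in A and in B, so it is straightforward to compute. The sketch below uses hypothetical relative frequencies for a three-category variable.

```python
def dissimilarity(p_a, p_b):
    """Dissimilarity index (total variation distance) between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p_a, p_b))

p_a = [0.50, 0.30, 0.20]   # hypothetical relative frequencies in A (e.g. synthetic file)
p_b = [0.48, 0.31, 0.21]   # hypothetical relative frequencies in B (reference)

delta = dissimilarity(p_a, p_b)
print(f"dissimilarity index = {delta:.3f}")
```

Here about 2% of the units in A would have to change category to reproduce the distribution in B.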

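Evaluation: Micro Case, Similarity/diss. Between Distributions (cont.)

The overlap between two distributions is:
$O_{AB} = \sum_j \min\left( p_{A,j},\, p_{B,j} \right)$, with $0 \le O_{AB} \le 1$,
where $p_{A,j}$ and $p_{B,j}$ are the relative frequencies of category $j$ in A and B; 1 means that the distributions are equal.

It is strictly related to the dissimilarity index: $O_{AB} = 1 - \Delta_{AB}$.

Using Agresti’s rule of thumb (2002, pp. 329-330), an overlap above about 0.97 (the counterpart of a dissimilarity index below about 0.03) denotes that the data in A follow the distribution in B quite closely, even though not perfectly.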
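
The overlap is the sum, over the categories, of the smaller of the two relative frequencies, and it is the complement to 1 of the dissimilarity index (half the sum of the absolute differences). Both facts can be checked numerically; the relative frequencies below are hypothetical.

```python
def overlap(p_a, p_b):
    """Overlap between two discrete distributions: sum of category-wise minima."""
    return sum(min(a, b) for a, b in zip(p_a, p_b))

def dissimilarity(p_a, p_b):
    """Dissimilarity index: half the sum of absolute differences."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p_a, p_b))

p_a = [0.50, 0.30, 0.20]   # hypothetical relative frequencies in A
p_b = [0.48, 0.31, 0.21]   # hypothetical relative frequencies in B

ov = overlap(p_a, p_b)
delta = dissimilarity(p_a, p_b)
print(f"overlap = {ov:.3f}, 1 - dissimilarity = {1 - delta:.3f}")
```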

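Evaluation: Micro Case, Similarity/diss. Between Distributions (cont.)

A distance between the two distributions can be computed by means of the Hellinger distance:
$d_H = \sqrt{1 - \sum_j \sqrt{p_{A,j}\, p_{B,j}}} = \sqrt{1 - B}$,
where $p_{A,j}$ and $p_{B,j}$ are the relative frequencies of category $j$ in A and B, and $B = \sum_j \sqrt{p_{A,j}\, p_{B,j}}$ is the Bhattacharyya coefficient ($0 \le B \le 1$).

- It satisfies the properties of a distance measure: symmetry, triangle inequality, and $0 \le d_H \le 1$ (0 means that the distributions are equal).
- Rule of thumb: the distributions are considered close when $d_H \le 0.05$ (few references in the literature).
- It is related to the dissimilarity index $\Delta_{AB} = \frac{1}{2} \sum_j | p_{A,j} - p_{B,j} |$: $d_H^2 \le \Delta_{AB} \le d_H \sqrt{2}$.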
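
The Hellinger distance, taken here as the square root of 1 minus the Bhattacharyya coefficient, and its relationship with the dissimilarity index (squared Hellinger distance no larger than the index, index no larger than sqrt(2) times the Hellinger distance) can be checked numerically. The sketch below uses hypothetical relative frequencies; the function names are illustrative.

```python
import math

def bhattacharyya(p_a, p_b):
    """Bhattacharyya coefficient: sum of sqrt(p_A,j * p_B,j) over the categories."""
    return sum(math.sqrt(a * b) for a, b in zip(p_a, p_b))

def hellinger(p_a, p_b):
    """Hellinger distance, sqrt(1 - B), in the version ranging from 0 to 1."""
    # max() guards against a tiny negative value caused by floating-point rounding.
    return math.sqrt(max(0.0, 1.0 - bhattacharyya(p_a, p_b)))

def dissimilarity(p_a, p_b):
    """Dissimilarity index: half the sum of absolute differences."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p_a, p_b))

p_a = [0.50, 0.30, 0.20]   # hypothetical relative frequencies in A
p_b = [0.40, 0.30, 0.30]   # hypothetical relative frequencies in B

d_h = hellinger(p_a, p_b)
delta = dissimilarity(p_a, p_b)
bounds_hold = d_h ** 2 <= delta <= math.sqrt(2) * d_h
print(f"Hellinger = {d_h:.4f}, dissimilarity = {delta:.4f}, bounds hold: {bounds_hold}")
```

In this hypothetical case the Hellinger distance is above 0.05, so the two distributions would not be judged close under that rule of thumb.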

Evaluation: Micro Case, Similarity/diss. Between Distributions (cont.)

Similarity/dissimilarity measures can be used with categorical variables, either nominal or ordered. They cannot be applied to continuous variables unless these are first categorized, either into equal-width bins or according to the percentiles of the reference variable (see, for instance, the rules used to determine the number of classes when drawing a histogram).
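
The percentile-based categorization can be sketched as follows: the cut points are the quartiles of the reference (donor) variable, both variables are classified with the same cut points, and the dissimilarity index (half the sum of absolute differences between relative frequencies) is computed on the resulting classes. The data are hypothetical and the choice of four classes is only an assumption made for the example.

```python
import bisect
import statistics

def binned_props(values, cuts):
    """Relative frequencies of the classes defined by the sorted cut points."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[bisect.bisect_right(cuts, v)] += 1
    return [c / len(values) for c in counts]

def dissimilarity(p_a, p_b):
    """Dissimilarity index: half the sum of absolute differences."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p_a, p_b))

reference = list(range(1, 21))                  # hypothetical values in the donor file
imputed = [2, 3, 4, 6, 7, 8, 9, 11, 12, 18]     # hypothetical imputed values

cuts = statistics.quantiles(reference, n=4)     # quartiles of the reference variable
p_ref = binned_props(reference, cuts)
p_imp = binned_props(imputed, cuts)
delta = dissimilarity(p_imp, p_ref)
print(cuts, p_ref, p_imp, round(delta, 3))
```

Taking the cut points from the reference variable gives classes of roughly equal size in the reference, so departures in the synthetic file show up directly as unequal class frequencies.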

Selected references

Agresti, A. (2002) Categorical Data Analysis, 2nd Edition. Wiley, Chichester.
D’Orazio, M. (2011b) “Statistical Matching and Imputation of Survey Data with the Package StatMatch for the R Environment”. R package vignette. http://www.cros-portal.eu/sites/default/files//Statistical_Matching_with_StatMatch.pdf
D’Orazio, M., Di Zio, M. and Scanu, M. (2006) Statistical Matching: Theory and Practice. Wiley, Chichester.
Särndal, C.E., Swensson, B. and Wretman, J. (1992) Model Assisted Survey Sampling. Springer-Verlag, New York.
Scanu, M. (2008) “The practical aspects to be considered for statistical matching”. In Eurostat, Report of WP2: Recommendations on the use of methodologies for the integration of surveys and administrative data, ESSnet Statistical Methodology Project on Integration of Survey and Administrative Data, pp. 34-35. http://cenex-isad.istat.it
