UNECE Workshop on Confidentiality Manchester, 17.-19. December 2007 Comparing Fully and Partially Synthetic Data Sets for Statistical Disclosure Control.

UNECE Workshop on Confidentiality Manchester, 17.-19. December 2007 Comparing Fully and Partially Synthetic Data Sets for Statistical Disclosure Control in the German IAB Establishment Panel Jörg Drechsler, Stefan Bender (Institute for Employment Research, Germany) & Susanne Rässler (University of Bamberg)

2 Overview  Multiple Imputation for Statistical Disclosure Control  The IAB Establishment Panel  Application of The Two Approaches  Comparison of The Results  Conclusion

3 Y synthetisch Fully synthetic data sets (Rubin 1993)  advantages: - data are fully synthetic - re-identification of single units almost impossible - all variables are still fully available  disadvantages: - strong dependence on the imputation model - setting up a model might be difficult/impossible Y observed X Y not observed Y synthetic

4 Partially synthetic data sets (Little 1993)  only potentially identifying or sensitive variables are replaced

5 Partially synthetic data sets (Little 1993)  only potentially identifying or sensitive variables are replaced

6 Partially synthetic data sets (Little 1993)  only potentially identifying or sensitive variables are replaced  advantages: - model dependence decreases - models are easier to set up  disadvantages: - true values remain in the data set - disclosure might still be possible

7 Overview  Multiple Imputation for Statistical Disclosure Control  The IAB Establishment Panel  Application of The Two Approaches  Comparison of the Results  Conclusions

8 The IAB Establishment Panel  Annually conducted Establishment Survey  Since 1993 in Western Germany, since 1996 in Eastern Germany  Population: All establishments with at least one employee covered by social security  Source: Official Employment Statistics  Response rate of repeatedly interviewed establishments more than 80%  Sample of more than 16.000 establishments in the last wave  Contents: employment structure, changes in employment, business policies, investment, training, remuneration, working hours, collective wage agreements, works councils

9 Overview  Multiple Imputation for Statistical Disclosure Control  The IAB Establishment Panel  Application of the Two Approaches  Comparison of the Results  Conclusions

10 Generating fully synthetic data sets for the IAB Establishment Panel  Create a synthetic data set for selected variables from the wave 1997 from the Establishment Panel  Draw 10 new sample from the Official Employment Statistics using the same sampling design as for the Establishment Panel (Stratification by industry, size, and region)  The number of observations in each sample equals the number of observations in the panel n s =n p =7332  Every sample is imputed ten times using sequential regression  Number of variables from the establishment panel: 48  Imputations are generated using IVEware by Raghunathan, Solenberger and Hoewyk (2001)

11 Imputation procedure for partially synthetic data  Only two variables are synthesized: - number of employees - industry (16 categories)  Same variables for the imputation models  Imputation by sequential regression  Imputation model: - multinomial logit for the industry - linear model for the cubic root of the nb of employees - 4 independent linear models defined by quartiles for the establishment size  Imputations based on own coding in R.

12 Overview  Multiple Imputation for Statistical Disclosure Control  The IAB Establishment Panel  Application of The Two Approaches  Comparison of the Results  Conclusion

13 Analytical validity  Compare regression results from the original data with results from the synthetic data  First regression:  Zwick (2005) analyses the productivity effects of different continuing vocational training forms in Germany  Probit regression to explain, why firms offer vocational training  13 Explanatory variables including: Share of qualified employees, establishment size, industry, collective wage agreement, high qualification needs expected…  Second regression:  Log(number of employees) on 15 industry dummies  Two data utility measures: - Comparison of the beta coefficients from the original data set and the synthetic data sets - confidence interval overlap

14 Confidence interval overlap  Suggested by Karr et al. (2006)  Measure the overlap of CIs from the original data and CIs from the synthetic data  The higher the overlap, the higher the data utility  Compute the average relative CI overlap for any CI for the synthetic data CI for the original data

15 Significant at the 0,1 % levelSignificant at the 1 % levelSignificant at the 5 % level Results from the first regression (Zwick 2005)

16 Average overlap 0,8080,926 Average confidence interval (CI) overlap for the estimates from the first regression

17 = Significant at the 0,1 % level= Significant at the 1 % level= Significant at the 5 % level Results from the second regression (log(nb. of employees) on industry) = insignificant

18 Average overlap 0,6990,839 Average confidence interval (CI) overlap for the estimates from the second regression

19 Disclosure risk  Difficult to compare between partially and fully synthetic data sets  Disclosure risk is low for fully synthetic data sets, although not zero  DR is higher for partially synthetic data sets, because: True values remain in the data set Only survey respondents are included  For partially synthetic data sets a careful disclosure risk evaluation is necessary

20 Overview  Multiple Imputation for Statistical Disclosure Control  The IAB Establishment Panel  Application of The Two Approaches  Comparison of the Results  Conclusions

21 Conclusions  Generating synthetic data sets can be a useful method for SDC  Advantages for partially synthetic data sets: Higher data validity Imputation models easier to set up Lower risk of biased imputations  Disadvantages for partially synthetic data sets: Higher risk of disclosure Careful disclosure risk evaluation necessary  Agencies will have to decide depending on the complexity of the survey and the expected risk of disclosure

22 Thank you for your attention

UNECE Workshop on Confidentiality Manchester, 17.-19. December 2007 Comparing Fully and Partially Synthetic Data Sets for Statistical Disclosure Control.

Similar presentations

Presentation on theme: "UNECE Workshop on Confidentiality Manchester, 17.-19. December 2007 Comparing Fully and Partially Synthetic Data Sets for Statistical Disclosure Control."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

UNECE Workshop on Confidentiality Manchester, 17.-19. December 2007 Comparing Fully and Partially Synthetic Data Sets for Statistical Disclosure Control.

Similar presentations

Presentation on theme: "UNECE Workshop on Confidentiality Manchester, 17.-19. December 2007 Comparing Fully and Partially Synthetic Data Sets for Statistical Disclosure Control."— Presentation transcript:

Similar presentations

About project

Feedback