UNECE Workshop on Confidentiality, Manchester, 17-19 December 2007: Comparing Fully and Partially Synthetic Data Sets for Statistical Disclosure Control

Presentation transcript:

Comparing Fully and Partially Synthetic Data Sets for Statistical Disclosure Control in the German IAB Establishment Panel
Jörg Drechsler, Stefan Bender (Institute for Employment Research, Germany) & Susanne Rässler (University of Bamberg)
UNECE Workshop on Confidentiality, Manchester, December 2007

2 Overview
• Multiple Imputation for Statistical Disclosure Control
• The IAB Establishment Panel
• Application of the Two Approaches
• Comparison of the Results
• Conclusions

3 Fully synthetic data sets (Rubin 1993)
• Advantages:
  - data are fully synthetic
  - re-identification of single units is almost impossible
  - all variables are still fully available
• Disadvantages:
  - strong dependence on the imputation model
  - setting up a model might be difficult or impossible
[Figure: data schematic with blocks X, Y observed, Y not observed, and Y synthetic]

6 Partially synthetic data sets (Little 1993)
• Only potentially identifying or sensitive variables are replaced
• Advantages:
  - model dependence decreases
  - models are easier to set up
• Disadvantages:
  - true values remain in the data set
  - disclosure might still be possible

8 The IAB Establishment Panel
• Annually conducted establishment survey
• Since 1993 in Western Germany, since 1996 in Eastern Germany
• Population: all establishments with at least one employee covered by social security
• Source: Official Employment Statistics
• Response rate of repeatedly interviewed establishments: more than 80%
• Sample of more than establishments in the last wave
• Contents: employment structure, changes in employment, business policies, investment, training, remuneration, working hours, collective wage agreements, works councils

10 Generating fully synthetic data sets for the IAB Establishment Panel
• Create a synthetic data set for selected variables from the 1997 wave of the Establishment Panel
• Draw 10 new samples from the Official Employment Statistics using the same sampling design as for the Establishment Panel (stratification by industry, size, and region)
• The number of observations in each sample equals the number of observations in the panel: n_s = n_p = 7332
• Every sample is imputed ten times using sequential regression
• Number of variables from the Establishment Panel: 48
• Imputations are generated using IVEware by Raghunathan, Solenberger and Hoewyk (2001)
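The sampling step on this slide can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function name `draw_stratified_samples`, its arguments, and the use of simple random sampling without replacement inside each stratum are assumptions made for illustration. The slide only states that the samples are stratified by industry, size, and region.

```python
import numpy as np


def draw_stratified_samples(frame_strata, n_per_stratum, n_samples=10, seed=0):
    """Draw repeated stratified samples from a sampling frame.

    frame_strata  : 1-D array with one stratum label per frame unit
                    (hypothetical stand-in for industry x size x region cells).
    n_per_stratum : dict mapping stratum label -> number of units to draw.
    Returns a list of index arrays, one per sample.
    """
    rng = np.random.default_rng(seed)
    frame_strata = np.asarray(frame_strata)
    samples = []
    for _ in range(n_samples):
        picked = []
        for label, n_h in n_per_stratum.items():
            members = np.flatnonzero(frame_strata == label)
            # Assumption: simple random sampling without replacement per stratum.
            picked.append(rng.choice(members, size=n_h, replace=False))
        samples.append(np.concatenate(picked))
    return samples
```

Each returned index array selects the frame units for one new sample; in the setting of the slide, the survey variables for those units would then be imputed several times.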

11 Imputation procedure for partially synthetic data
• Only two variables are synthesized:
  - number of employees
  - industry (16 categories)
• Same variables for the imputation models
• Imputation by sequential regression
• Imputation model:
  - multinomial logit for the industry
  - linear model for the cube root of the number of employees
  - 4 independent linear models defined by quartiles for the establishment size
• Imputations based on own code in R
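A minimal sketch of the linear-model step for the number of employees, written in Python rather than the authors' R code. It fits a regression on the cube root of the employee count and draws synthetic values from an approximate posterior predictive distribution. The function name `synthesize_employees`, the flat-prior posterior draws, and the rounding and floor at 1 are assumptions; the multinomial logit for industry and the split into four size quartiles are omitted.

```python
import numpy as np


def synthesize_employees(X, employees, m=5, seed=0):
    """Return m synthetic copies of the employee counts.

    X         : predictor matrix (n rows), hypothetical covariates.
    employees : observed number of employees per establishment.
    """
    rng = np.random.default_rng(seed)
    X1 = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    y = np.cbrt(employees)                      # cube-root transformation
    n, k = X1.shape

    # Ordinary least squares fit on the transformed scale.
    beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta_hat
    s2 = resid @ resid / (n - k)
    xtx_inv = np.linalg.inv(X1.T @ X1)

    copies = []
    for _ in range(m):
        # Assumption: draw parameters from the standard flat-prior posterior.
        sigma2 = s2 * (n - k) / rng.chisquare(n - k)
        beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)
        y_syn = X1 @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)
        # Back-transform; rounding and the floor at 1 are illustrative choices.
        copies.append(np.maximum(np.rint(y_syn ** 3), 1.0))
    return copies
```

Drawing new parameters for every copy, instead of reusing the point estimates, is what lets the variability between the m synthetic data sets reflect model uncertainty.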

13 Analytical validity
• Compare regression results from the original data with results from the synthetic data
• First regression:
  - Zwick (2005) analyses the productivity effects of different continuing vocational training forms in Germany
  - Probit regression to explain why firms offer vocational training
  - 13 explanatory variables, including: share of qualified employees, establishment size, industry, collective wage agreement, high qualification needs expected…
• Second regression:
  - Log(number of employees) on 15 industry dummies
• Two data utility measures:
  - comparison of the beta coefficients from the original data set and the synthetic data sets
  - confidence interval overlap
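Before synthetic results can be compared with the original ones, the estimates from the m synthetic data sets have to be combined into one point estimate and one variance. The slides do not state which combining rules were used, so the sketch below is an assumption: it applies the rules usually cited for fully synthetic data (Raghunathan, Reiter and Rubin 2003) and for partially synthetic data (Reiter 2003). The function name `combine_synthetic` is hypothetical.

```python
import numpy as np


def combine_synthetic(estimates, variances, fully_synthetic):
    """Combine one scalar estimate across m synthetic data sets.

    estimates : point estimates q_i, one per synthetic data set.
    variances : estimated variances u_i of those estimates.
    Returns (q_bar, total_variance).
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()       # combined point estimate
    b_m = q.var(ddof=1)    # between-data-set variance
    u_bar = u.mean()       # average within-data-set variance
    if fully_synthetic:
        # Fully synthetic rule; this estimate can turn out negative.
        total_var = (1.0 + 1.0 / m) * b_m - u_bar
    else:
        # Partially synthetic rule.
        total_var = u_bar + b_m / m
    return q_bar, total_var
```

The two variance formulas differ, so confidence intervals for fully and partially synthetic data are not built in the same way even when the point estimates are combined identically.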

14 Confidence interval overlap
• Suggested by Karr et al. (2006)
• Measure the overlap of CIs from the original data and CIs from the synthetic data
• The higher the overlap, the higher the data utility
• Compute the average relative CI overlap for any estimate
[Figure: CI for the synthetic data shown against the CI for the original data]
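The overlap measure can be written out as a short Python sketch. It follows the usual reading of Karr et al. (2006): the length of the intersection of the two intervals is divided by the length of each interval, and the two ratios are averaged. The function names and the choice to return 0 for disjoint intervals are assumptions, since the formula itself did not survive in the transcript.

```python
def ci_overlap(orig_ci, syn_ci):
    """Average relative overlap of two confidence intervals.

    Each argument is a (lower, upper) tuple. Identical intervals give 1.0.
    Assumption: disjoint intervals are scored as 0.0.
    """
    lo_orig, up_orig = orig_ci
    lo_syn, up_syn = syn_ci
    lo_over = max(lo_orig, lo_syn)  # lower bound of the intersection
    up_over = min(up_orig, up_syn)  # upper bound of the intersection
    if up_over <= lo_over:
        return 0.0
    width = up_over - lo_over
    return 0.5 * (width / (up_orig - lo_orig) + width / (up_syn - lo_syn))


def average_overlap(orig_cis, syn_cis):
    """Mean overlap across all coefficients of a regression."""
    overlaps = [ci_overlap(o, s) for o, s in zip(orig_cis, syn_cis)]
    return sum(overlaps) / len(overlaps)
```

For example, `ci_overlap((0, 2), (1, 3))` evaluates to 0.5, because the shared segment of length 1 covers half of each interval.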

15 Results from the first regression (Zwick 2005)
[Table legend: significant at the 0.1% level; significant at the 1% level; significant at the 5% level]

16 Average confidence interval (CI) overlap for the estimates from the first regression
• Average overlap: 0.808, 0.926

17 Results from the second regression (log(number of employees) on industry)
[Table legend: significant at the 0.1% level; significant at the 1% level; significant at the 5% level; insignificant]

18 Average confidence interval (CI) overlap for the estimates from the second regression
• Average overlap: 0.699, 0.839

19 Disclosure risk
• Difficult to compare between partially and fully synthetic data sets
• Disclosure risk is low for fully synthetic data sets, although not zero
• Disclosure risk is higher for partially synthetic data sets, because:
  - true values remain in the data set
  - only survey respondents are included
• For partially synthetic data sets, a careful disclosure risk evaluation is necessary

21 Conclusions
• Generating synthetic data sets can be a useful method for SDC
• Advantages of partially synthetic data sets:
  - higher data validity
  - imputation models easier to set up
  - lower risk of biased imputations
• Disadvantages of partially synthetic data sets:
  - higher risk of disclosure
  - careful disclosure risk evaluation necessary
• Agencies will have to decide depending on the complexity of the survey and the expected risk of disclosure

22 Thank you for your attention