1
Some aspects concerning analytical validity and disclosure risk of CART-generated synthetic data
Hans-Peter Hafner (Research Data Centre of the Statistical Offices of the Länder) and Rainer Lenz (Saarland State University of Applied Sciences)
UNECE Work Session on Statistical Data Confidentiality, Tarragona, 27 October 2011
2
© Statistische Ämter der Länder, Forschungsdatenzentrum
Overview
- Project background
- CART for synthetic data: Methodology and model modifications
- Case Study: Sample of Monthly Report Manufacturing Sector
- Analytical potential of the synthetic data
- Testing confidentiality: Matching simulations
- Prospects
3
Project Background
- German project InfinitE
- Controlled remote data execution that is easier to handle for both scientists and RDC staff
- Need for semantically valid data structure files
- Automation of output checking
4
CART for Synthetic Data: Methodology
- Step 1: Generation of the tree using the original data
- Step 2: Drawing of synthetic values in each leaf of the tree (Bayesian bootstrap)
- For previously synthesized variables: original values are replaced by synthetic ones
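Step 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the tree from Step 1 has already been grown and that each record carries the identifier of the leaf it falls into. The function names `bayesian_bootstrap` and `synthesize_by_leaf` are hypothetical.

```python
import random
from collections import defaultdict


def bayesian_bootstrap(values, n_draws, rng):
    """Draw n_draws values from `values` using Bayesian bootstrap weights.

    The weights are Dirichlet(1, ..., 1) distributed, obtained here by
    normalising independent Exp(1) draws.
    """
    gammas = [rng.expovariate(1.0) for _ in values]
    total = sum(gammas)
    weights = [g / total for g in gammas]
    return rng.choices(values, weights=weights, k=n_draws)


def synthesize_by_leaf(leaf_ids, y, seed=0):
    """Replace every y[i] by a value drawn from the donors in the same leaf.

    leaf_ids[i] is the (assumed, precomputed) leaf of record i.
    """
    rng = random.Random(seed)
    members = defaultdict(list)
    for i, leaf in enumerate(leaf_ids):
        members[leaf].append(i)

    synthetic = [None] * len(y)
    for leaf, idx in members.items():
        donors = [y[i] for i in idx]
        # One set of bootstrap weights per leaf, one draw per record.
        draws = bayesian_bootstrap(donors, len(idx), rng)
        for i, value in zip(idx, draws):
            synthetic[i] = value
    return synthetic
```

For example, `synthesize_by_leaf(["a", "a", "b"], [1.0, 2.0, 10.0])` returns three values, the first two drawn from `{1.0, 2.0}` and the third equal to `10.0`. Because every synthetic value is an observed value from the same leaf, small leaves keep synthetic records close to the originals, which is where the stopping rule becomes relevant for disclosure risk.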
5
CART for Synthetic Data: Model Modifications
- Synthesis order of the variables
- Model specification: all variables as predictors, versus only variables that were previously synthesized or are not synthesized at all
- Stopping rules: complexity parameter cp of the R package rpart
- A node is split further only if the split decreases the overall lack of fit by a factor of cp
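The effect of cp can be illustrated with a simplified sketch. It assumes a regression tree with the sum of squared errors as the lack-of-fit measure and a single predictor, and it checks only the improvement criterion; rpart itself applies further controls that are not modelled here. All function names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (lack-of-fit measure used here)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, sse_after) of the best binary split on x, or None."""
    pairs = sorted(zip(x, y))
    best = None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold separates equal predictor values
        left = [p[1] for p in pairs[:k]]
        right = [p[1] for p in pairs[k:]]
        after = sse(left) + sse(right)
        if best is None or after < best[1]:
            threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best = (threshold, after)
    return best


def split_allowed(x, y, root_sse, cp):
    """Accept the best split of a node only if it reduces the lack of fit
    by at least cp, measured relative to the lack of fit at the root."""
    candidate = best_split(x, y)
    if candidate is None:
        return False
    improvement = sse(y) - candidate[1]
    return improvement / root_sse >= cp
```

With a root lack of fit of 16.01, a node holding `y = [1.0, 1.1, 1.0, 1.1]` is not split for cp = 0.01 but is split for cp = 0.00001. A smaller cp therefore yields deeper trees with smaller leaves.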
6
Case Study: Data
- Monthly report on local units of the manufacturing sector (15% sample of the longitudinal section 1999–2002: 6483 units)
- Full survey of local units with focus on economic activity in the manufacturing sector and at least 20 employees
- Attributes: location, turnover, wages and salaries, working hours
7
Case Study: Anonymisation – Prerequisites
- Transformation of continuous variables: cube root
- Coarsening of the NACE code to 2-digit level and of the regional key to federal state level
- Different CART trees for five subsets of the data (turnover size classes)
- Two variants:
  (i) Absolute numerical values for all variables and all years
  (ii) Absolute values for 1999 and rates of change for 2000–2002
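The two preprocessing ideas, the cube-root transformation and variant (ii), can be sketched as below. The slides do not say whether the rates of change refer to the previous year or to 1999; the sketch assumes year-on-year rates and strictly positive values, and the function names are hypothetical.

```python
import math


def cube_root(v):
    """Signed cube root, so that negative values are handled as well."""
    return math.copysign(abs(v) ** (1.0 / 3.0), v)


def to_variant_ii(series):
    """Split a yearly series into its first-year level and later rates of change.

    Assumption: each rate is relative to the previous year, and no value is zero.
    """
    base = series[0]
    rates = [(cur - prev) / prev for prev, cur in zip(series, series[1:])]
    return base, rates


def from_variant_ii(base, rates):
    """Rebuild yearly levels from the first-year level and the rates of change."""
    levels = [base]
    for r in rates:
        levels.append(levels[-1] * (1.0 + r))
    return levels
```

For a unit with turnover `[100.0, 110.0, 99.0, 118.8]` in 1999 to 2002, `to_variant_ii` gives the 1999 level 100.0 and rates of roughly 0.1, -0.1 and 0.2; `from_variant_ii` reverses the step after synthesis.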
8
Case Study: Analytical Validity
- NEC = rate of job creation – rate of job destruction (net employment change)
- JT = rate of job creation + rate of job destruction (job turnover)

Measure | Original | (i) cp = 0.01 | (i) cp = 0.00001 | (ii) cp = 0.01 | (ii) cp = 0.00001
Turnover index 2002 (1999 = 100) | 112.0 | 119.7 | 112.7 | 112.6 | 110.8
NEC pooled | -0.53 | 1.28 | 0.91 | -0.38 | -0.45
JT pooled | 7.29 | 31.83 | 17.58 | 7.94 | 7.63
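The two job-flow measures can be computed as in the following sketch. The slides define NEC and JT only in terms of the creation and destruction rates; the sketch assumes the common convention of relating gross job gains and losses to average employment over the two periods, expressed in percent. The function name is hypothetical.

```python
def job_flow_rates(emp_start, emp_end):
    """Return (job creation, job destruction, NEC, JT), all in percent.

    emp_start[i] and emp_end[i] are the employment of unit i in two periods.
    Assumption: rates are relative to average total employment.
    """
    created = sum(max(e1 - e0, 0) for e0, e1 in zip(emp_start, emp_end))
    destroyed = sum(max(e0 - e1, 0) for e0, e1 in zip(emp_start, emp_end))
    base = (sum(emp_start) + sum(emp_end)) / 2
    jc = 100.0 * created / base
    jd = 100.0 * destroyed / base
    return jc, jd, jc - jd, jc + jd
```

With three units moving from `[100, 50, 50]` to `[110, 40, 50]` employees, ten jobs are created and ten destroyed on a base of 200, so both rates are 5%, NEC is 0 and JT is 10. This shows why JT is the more sensitive measure: unit-level noise in synthetic employment cancels out in NEC but accumulates in JT.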
9
Case Study: Analytical Validity
- Results using rates of change are much better than those using absolute values
- Results for smaller values of the parameter cp tend to be better
- Problem: Variation between different synthetic data sets is very large
- Aim: Only one synthetic data structure file for the researcher
10
Case Study: Matching Experiments
- External data: original data
- Blocking variables: size classes of turnover and number of employees (mean 1999–2002)
- Key variables: number of employees, turnover
- First results for cp = 0.00001: 27% hits for one block, no more than 15% for the other blocks
- Database cross match
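A matching experiment of this kind can be sketched as follows. The slides do not specify the distance measure or the matching algorithm, so the sketch assumes a plain nearest-neighbour match with unscaled Euclidean distance on the two key variables. It also assumes each synthetic record keeps the identifier of the original unit it replaces, for evaluation purposes only. The function name and the record layout are hypothetical.

```python
import math
from collections import defaultdict


def hit_rate_by_block(original, synthetic):
    """Share of original records per block whose nearest synthetic record
    (within the same block) belongs to the same unit.

    Records are dicts with the keys "id", "block", "employees", "turnover".
    """
    syn_blocks = defaultdict(list)
    for rec in synthetic:
        syn_blocks[rec["block"]].append(rec)

    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in original:
        totals[rec["block"]] += 1
        candidates = syn_blocks.get(rec["block"], [])
        if not candidates:
            continue  # nothing to match against in this block
        nearest = min(
            candidates,
            key=lambda s: math.hypot(
                s["employees"] - rec["employees"],
                s["turnover"] - rec["turnover"],
            ),
        )
        if nearest["id"] == rec["id"]:
            hits[rec["block"]] += 1
    return {block: hits[block] / totals[block] for block in totals}
```

In practice the key variables would likely be standardised before computing distances, since turnover otherwise dominates the number of employees.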
11
Prospects
- Examination of synthesis order and model specification
- Further analyses regarding the optimal value of cp
- Better adaptation of the matching procedure to synthetic data
12
Contact
Hans-Peter Hafner, Research Data Centre of the Statistical Offices of the Länder, hhafner@statistik-hessen.de
Rainer Lenz, Saarland State University of Applied Sciences, rainer.lenz@htw-saarland.de
www.forschungsdatenzentrum.de
Thank you for your attention!