Published by Ambrose Horn. Modified over 9 years ago.
1
Workshop on Synthetic Data, 1st December 2014, ONS, Titchfield
2
What is synthpop?
A tool for producing synthetic versions of microdata containing confidential information, so that they are safe to be released to users for exploratory analysis and preparing code.
Administrative Data Research Centre - Scotland | Beata Nowok | 1 December 2014
3
Real (input)

Sex     Age  Education             Marital status  Income  Life satisfaction
FEMALE  57   VOCATIONAL/GRAMMAR    MARRIED         800     PLEASED
MALE    41   SECONDARY             UNMARRIED       1500    MIXED
FEMALE  18   VOCATIONAL/GRAMMAR    UNMARRIED       NA      PLEASED
FEMALE  78   PRIMARY/NO EDUCATION  WIDOWED         900     MIXED
FEMALE  54   VOCATIONAL/GRAMMAR    MARRIED         1500    MOSTLY SATISFIED
MALE    20   SECONDARY             UNMARRIED       -8      PLEASED
FEMALE  39   SECONDARY             MARRIED         2000    MOSTLY SATISFIED
MALE    39   SECONDARY             MARRIED         1197    MIXED
FEMALE  38   VOCATIONAL/GRAMMAR    MARRIED         NA      MOSTLY DISSATISFIED
FEMALE  73   VOCATIONAL/GRAMMAR    WIDOWED         1700    PLEASED
FEMALE  54   SECONDARY             WIDOWED         2000    MOSTLY SATISFIED
MALE    30   VOCATIONAL/GRAMMAR    UNMARRIED       900     MOSTLY SATISFIED
MALE    68   SECONDARY             MARRIED         -8      DELIGHTED
MALE    61   PRIMARY/NO EDUCATION  MARRIED         -8      MIXED

Synthetic (output)

Sex     Age  Education             Marital status  Income  Life satisfaction
MALE    81   PRIMARY/NO EDUCATION  MARRIED         2100    PLEASED
MALE    54   VOCATIONAL/GRAMMAR    MARRIED         1700    PLEASED
FEMALE  32   VOCATIONAL/GRAMMAR    DIVORCED        870     MIXED
FEMALE  98   PRIMARY/NO EDUCATION  MARRIED         800     MOSTLY DISSATISFIED
FEMALE  50   PRIMARY/NO EDUCATION  MARRIED         NA      MOSTLY SATISFIED
FEMALE  37   VOCATIONAL/GRAMMAR    MARRIED         158     PLEASED
MALE    28   VOCATIONAL/GRAMMAR    NA              1500    MOSTLY SATISFIED
FEMALE  62   PRIMARY/NO EDUCATION  MARRIED         830     MOSTLY SATISFIED
MALE    78   PRIMARY/NO EDUCATION  MARRIED         NA      PLEASED
FEMALE  29   SECONDARY             MARRIED         580     MOSTLY SATISFIED
MALE    59   PRIMARY/NO EDUCATION  MARRIED         1300    MOSTLY SATISFIED
MALE    41   SECONDARY             UNMARRIED       1500    MIXED
MALE    18   SECONDARY             UNMARRIED       -8      PLEASED
FEMALE  73   PRIMARY/NO EDUCATION  WIDOWED         1350    MOSTLY SATISFIED

Data that look (structurally) like original data but contain artificial units only
4
Data that behave (statistically) like original data
5
synthpop package: Generating synthetic versions of sensitive microdata for statistical disclosure control
http://cran.r-project.org/package=synthpop
8
Generating synthetic data
Sequentially replacing original data values with synthetic values generated from conditional probability distributions:
$Y_j \sim (Y_0, Y_1, \ldots, Y_{j-1})$
fit (real) -> draw (synthetic)
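The fit-then-draw idea on this slide can be sketched in base R for two variables. This is only an illustration under simplifying assumptions, not synthpop's implementation: the toy data, the names `real` and `synth`, and the choice of a linear model are all hypothetical, and the model parameters are treated as fixed rather than drawn.

```r
set.seed(1)

# Toy "real" data (hypothetical values, not taken from the slides).
n <- 200
real <- data.frame(age = round(runif(n, 18, 90)))
real$income <- 500 + 15 * real$age + rnorm(n, sd = 200)

# Y_0: the first variable has no predictors, so resample it
# from its observed (marginal) distribution.
synth <- data.frame(age = sample(real$age, n, replace = TRUE))

# Y_1: FIT the conditional model Y_1 ~ Y_0 on the real data ...
fit <- lm(income ~ age, data = real)

# ... then DRAW synthetic values, conditioning on the synthetic
# predictor generated in the previous step and adding residual noise.
synth$income <- predict(fit, newdata = synth) +
  rnorm(n, sd = summary(fit)$sigma)

head(synth)
```

With more variables the same two steps repeat: each new column is modelled on the columns before it in the real data and drawn conditional on the synthetic columns already generated.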
9
Generating synthetic data
real -> syn() -> synthetic
10
Overview of synthpop functions
- Data flow: read.real() -> real -> syn() -> synthetic -> sdc() -> write.syn()
- Descriptive: summary.synds(), compare.synds()
- Models: glm.synds(), summary.fit.synds(), compare.fit.synds()
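A minimal sketch of how these functions might be chained, assuming a current CRAN release of synthpop is installed. Function and argument names in the installed version may differ from the 2014 names on the slide, so this is an illustration rather than the presenter's script. The bundled `SD2011` survey data and the column selection are assumptions chosen to mirror the example table.

```r
library(synthpop)

# SD2011 ships with synthpop; these six columns mirror the slide's table.
vars <- c("sex", "age", "edu", "marital", "income", "ls")
real <- SD2011[, vars]

# Synthesise with default settings; a fixed seed makes the run repeatable.
s <- syn(real, seed = 2014)

# Descriptive checks: summarise the synthetic data and compare
# its variable distributions with the real data.
summary(s)
compare(s, real)

# Model-based check: fit a model on the synthetic data, then
# compare its coefficients with the same model fitted to the real data.
fit <- glm.synds(sex ~ age + edu, family = "binomial", data = s)
summary(fit)
compare(fit, real)
```

The synthetic data frame itself is available as `s$syn` and can be inspected like any other data frame.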
12
syn() & common data problems
- Missing-data codes: contNA
  - categorical variables: additional factor level(s)
  - continuous variables: specified by contNA and modelled separately
- Semi-continuous variables: semicont
- Restricted values (interrelationships between variables): rules & rvalues
- Linear constraints: denom
- Non-negativity / non-normality: method set to 'lognorm', 'sqrtnorm' or 'cubertnorm'
- Deterministic relations: method set to "~I(…)"
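The "modelled separately" idea for missing-data codes in a continuous variable can be sketched in base R. This shows the general two-part approach only, not synthpop's internal code: the toy `income` vector and all object names are hypothetical, and both parts are resampled marginally here, whereas a real synthesis would condition on other variables.

```r
set.seed(1)

# Toy income column with two kinds of missingness: NA and the code -8.
income <- c(800, 1500, NA, 900, 1500, -8, 2000, 1197, NA, 1700,
            2000, 900, -8, -8)
n <- length(income)

# Part 1: a categorical "status" variable recording which
# missing-data code applies, or that a value was observed.
status <- ifelse(is.na(income), "NA",
                 ifelse(income == -8, "-8", "observed"))
syn_status <- sample(status, n, replace = TRUE)

# Part 2: the genuinely continuous values, synthesised only
# for rows whose synthetic status is "observed".
observed <- income[status == "observed"]
is_obs <- syn_status == "observed"

syn_income <- rep(NA_real_, n)          # rows with status "NA" stay NA
syn_income[syn_status == "-8"] <- -8    # missing-data code carried over as is
syn_income[is_obs] <- sample(observed, sum(is_obs), replace = TRUE)

syn_income
```

Keeping the codes out of the continuous model prevents values such as -8 from distorting the fitted income distribution.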
13
sdc() & statistical disclosure control
- Data labelling: label
- Removing replicated uniques: rm.replicated.uniques
- Bottom- and top-coding: recode.vars, bottom.top.coding, recode.exclude
- syn(): smoothing, minbucket

    sdc(syn.obj, real,
        label = "false data",
        rm.replicated.uniques = TRUE,
        recode.vars = c("age", "income"),
        bottom.top.coding = list(c(NA, 85), c(NA, 1500)))
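What the three measures in this call do to the data can be sketched in base R. It is an illustration of the operations, not the code inside sdc(). The rows are a small subset of the example tables reduced to two columns, and the helper `row_key` and the column name `flag` are hypothetical.

```r
# A few rows from the example tables, reduced to age and income.
real  <- data.frame(age = c(41, 57, 78), income = c(1500, 800, 900))
synth <- data.frame(age = c(81, 98, 41, 32),
                    income = c(2100, 800, 1500, 870))

# Helper: collapse each row into a single string key for comparison.
row_key <- function(d) do.call(paste, c(d, sep = "|"))
real_key <- row_key(real)
syn_key  <- row_key(synth)

# 1. Remove replicated uniques: synthetic rows that are unique in the
#    synthetic data and also match a unique row in the real data.
uniq_real  <- real_key[!(real_key %in% real_key[duplicated(real_key)])]
uniq_syn   <- !(syn_key %in% syn_key[duplicated(syn_key)])
replicated <- uniq_syn & (syn_key %in% uniq_real)
out <- synth[!replicated, , drop = FALSE]

# 2. Top-coding: cap age at 85 and income at 1500 (no bottom limit).
out$age    <- pmin(out$age, 85)
out$income <- pmin(out$income, 1500)

# 3. Labelling: add a column marking every record as synthetic.
out <- cbind(flag = "false data", out)
out
```

The row (41, 1500) is dropped because it reproduces a unique real record, age 98 becomes 85 and income 2100 becomes 1500.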
14
Real (input) -> Synthetic (output) -> sdc()
The Real (input) and Synthetic (output) tables are the same as on slide 3.

Synthetic (output) after sdc()

(label)     Sex     Age  Education             Marital status  Income  Life satisfaction
false data  MALE    81   PRIMARY/NO EDUCATION  MARRIED         1500    PLEASED
false data  MALE    54   VOCATIONAL/GRAMMAR    MARRIED         1500    PLEASED
false data  FEMALE  32   VOCATIONAL/GRAMMAR    DIVORCED        870     MIXED
false data  FEMALE  85   PRIMARY/NO EDUCATION  MARRIED         800     MOSTLY DISSATISFIED
false data  FEMALE  50   PRIMARY/NO EDUCATION  MARRIED         NA      MOSTLY SATISFIED
false data  FEMALE  37   VOCATIONAL/GRAMMAR    MARRIED         158     PLEASED
false data  MALE    28   VOCATIONAL/GRAMMAR    NA              1500    MOSTLY SATISFIED
false data  FEMALE  62   PRIMARY/NO EDUCATION  MARRIED         830     MOSTLY SATISFIED
false data  MALE    78   PRIMARY/NO EDUCATION  MARRIED         NA      PLEASED
false data  FEMALE  29   SECONDARY             MARRIED         580     MOSTLY SATISFIED
false data  MALE    59   PRIMARY/NO EDUCATION  MARRIED         1300    MOSTLY SATISFIED
false data  MALE    18   SECONDARY             UNMARRIED       -8      PLEASED
false data  FEMALE  73   PRIMARY/NO EDUCATION  WIDOWED         1350    MOSTLY SATISFIED
16
synthpop: future developments
http://cran.r-project.org/package=synthpop

Disclosure control: providing sufficient disclosure protection
- Disclosure control measures
- Watermarking
- Partially synthetic data

Data synthesis: handling various data types, data structures and real data problems
- Stratified synthesis
- Value bounds
- Multiple event data
- Household and other hierarchical data
- Complex survey design
- Small geographic areas

Package usability: making synthpop flexible and accessible to a wider range of users
- A graphical user interface (GUI)
- Dealing with computational limitations
- Support for LSs projects
- Training workshops

Quality of synthetic data: measuring and improving analytical validity
- Tests of synthesising approaches (parametric vs CART models)
- CART extensions
- Case studies for ADRC-S projects
- Guidelines for best practice