Published by Marylou Whitehead. Modified over 9 years ago.
1
Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, 5-7 October 2015, Helsinki
Beata Nowok, Administrative Data Research Centre – Scotland
synthpop: an R package for generating synthetic microdata
2
What is synthpop?
A software tool for producing synthetic versions of sensitive microdata
Administrative Data Research Centre - Scotland | Beata Nowok | 5-7 October 2015
3
Observed (input)
Sex    | Age | Education            | Marital status | Income | Life satisfaction
FEMALE | 57  | VOCATIONAL/GRAMMAR   | MARRIED        | 800    | PLEASED
MALE   | 41  | SECONDARY            | UNMARRIED      | 1500   | MIXED
FEMALE | 18  | VOCATIONAL/GRAMMAR   | UNMARRIED      | NA     | PLEASED
FEMALE | 78  | PRIMARY/NO EDUCATION | WIDOWED        | 900    | MIXED
FEMALE | 54  | VOCATIONAL/GRAMMAR   | MARRIED        | 1500   | MOSTLY SATISFIED
MALE   | 20  | SECONDARY            | UNMARRIED      | -8     | PLEASED
FEMALE | 39  | SECONDARY            | MARRIED        | 2000   | MOSTLY SATISFIED
MALE   | 39  | SECONDARY            | MARRIED        | 1197   | MIXED
FEMALE | 38  | VOCATIONAL/GRAMMAR   | MARRIED        | NA     | MOSTLY DISSATISFIED
FEMALE | 73  | VOCATIONAL/GRAMMAR   | WIDOWED        | 1700   | PLEASED
FEMALE | 54  | SECONDARY            | WIDOWED        | 2000   | MOSTLY SATISFIED
MALE   | 30  | VOCATIONAL/GRAMMAR   | UNMARRIED      | 900    | MOSTLY SATISFIED
MALE   | 68  | SECONDARY            | MARRIED        | -8     | DELIGHTED
MALE   | 61  | PRIMARY/NO EDUCATION | MARRIED        | -8     | MIXED

Synthetic (output)
Sex    | Age | Education            | Marital status | Income | Life satisfaction
MALE   | 81  | PRIMARY/NO EDUCATION | MARRIED        | 2100   | PLEASED
MALE   | 54  | VOCATIONAL/GRAMMAR   | MARRIED        | 1700   | PLEASED
FEMALE | 32  | VOCATIONAL/GRAMMAR   | DIVORCED       | 870    | MIXED
FEMALE | 98  | PRIMARY/NO EDUCATION | MARRIED        | 800    | MOSTLY DISSATISFIED
FEMALE | 50  | PRIMARY/NO EDUCATION | MARRIED        | NA     | MOSTLY SATISFIED
FEMALE | 37  | VOCATIONAL/GRAMMAR   | MARRIED        | 158    | PLEASED
MALE   | 28  | VOCATIONAL/GRAMMAR   | NA             | 1500   | MOSTLY SATISFIED
FEMALE | 62  | PRIMARY/NO EDUCATION | MARRIED        | 830    | MOSTLY SATISFIED
MALE   | 78  | PRIMARY/NO EDUCATION | MARRIED        | NA     | PLEASED
FEMALE | 29  | SECONDARY            | MARRIED        | 580    | MOSTLY SATISFIED
MALE   | 59  | PRIMARY/NO EDUCATION | MARRIED        | 1300   | MOSTLY SATISFIED
MALE   | 41  | SECONDARY            | UNMARRIED      | 1500   | MIXED
MALE   | 18  | SECONDARY            | UNMARRIED      | -8     | PLEASED
FEMALE | 73  | PRIMARY/NO EDUCATION | WIDOWED        | 1350   | MOSTLY SATISFIED

Data that look (structurally) like the original data but contain artificial units only
4
Generating synthetic data: method
Sequentially replacing original data values with synthetic values generated from conditional probability distributions
Y_j ~ (Y_0, Y_1, ..., Y_{j-1})
[Diagram: models are fitted to the observed data; synthetic values are drawn from the fitted models]
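The fit-and-draw idea above can be sketched in a few lines of base R. This is only a toy illustration under stated assumptions, not synthpop's implementation: the data are simulated, the variable names (age, income) are hypothetical, and a linear model stands in for whatever conditional model is actually used.

```r
# Toy sketch of sequential synthesis in base R (not synthpop's own code).
set.seed(1)

# Hypothetical "observed" data with two numeric variables.
n <- 200
age <- round(runif(n, 18, 80))
income <- 500 + 20 * age + rnorm(n, sd = 300)
observed <- data.frame(age = age, income = income)

# Step 1: the first variable has no predictors, so it is resampled
# from the observed values with replacement.
syn_age <- sample(observed$age, n, replace = TRUE)

# Step 2 (fit): model income given age using the observed data.
fit <- lm(income ~ age, data = observed)

# Step 2 (draw): generate synthetic income from the fitted conditional
# distribution, using the already synthesised ages as predictors.
mu <- predict(fit, newdata = data.frame(age = syn_age))
syn_income <- rnorm(n, mean = mu, sd = summary(fit)$sigma)

synthetic <- data.frame(age = syn_age, income = syn_income)
head(synthetic)
```

With more variables, step 2 would simply be repeated for each subsequent variable, conditioning on all variables synthesised before it.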
5
http://cran.r-project.org/package=synthpop
Generating synthetic versions of sensitive microdata for statistical disclosure control
8
Generating synthetic data: synthpop
[Diagram: observed → syn() → synthetic]
9
Generating synthetic data: synthpop
Synthesis can be run with default parameters (CART – Classification and Regression Trees):
syn(data)
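A minimal sketch of a default run is given below. It assumes the synthpop package is installed and uses SD2011, the example survey data set shipped with the package. The choice of variables, the restriction to the first 1000 rows and the seed value are arbitrary choices made for this illustration.

```r
library(synthpop)

# Subset of the SD2011 example data; row and column selection are
# illustrative assumptions, chosen to keep the run fast.
vars <- c("sex", "age", "edu", "marital", "income", "ls")
ods <- SD2011[1:1000, vars]

# Default synthesis; the seed is arbitrary and only fixes the random draws.
sds <- syn(ods, seed = 2015)

head(sds$syn)   # the synthetic data frame
sds$method      # synthesising method used for each variable
```

The returned object has class synds; the synthetic data themselves are stored in its syn component.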
12
syn() & common data problems
- Missing-data codes: cont.na
  - categorical variables: additional factor level(s)
  - continuous variables: specified by cont.na and modelled separately
- Semi-continuous variables: semicont
- Restricted values (interrelationships between variables): rules & rvalues
- Linear constraints: denom
- Non-negativity / non-normality: method set to "lognorm", "sqrtnorm" or "cubertnorm"
- Deterministic relations: method set to "~I(…)"
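Two of the options listed above, cont.na and rules/rvalues, might be used roughly as follows. This is a hedged sketch: it assumes the SD2011 example data, that -8 is a missing-data code for income (as in the table shown earlier), and that "SINGLE" is a level of marital in the installed version of the data. The rule itself (anyone under 18 is single) is chosen purely for illustration.

```r
library(synthpop)

vars <- c("sex", "age", "edu", "marital", "income", "ls")
ods <- SD2011[1:1000, vars]

sds <- syn(
  ods,
  # Treat NA and the code -8 in income as missing values that are
  # modelled separately from the continuous part.
  cont.na = list(income = c(NA, -8)),
  # Restricted values: when the rule holds, marital is set to the given
  # value instead of being drawn from a model. "SINGLE" is assumed to be
  # an existing factor level.
  rules   = list(marital = "age < 18"),
  rvalues = list(marital = "SINGLE"),
  seed    = 2015   # arbitrary
)

# Inspect the synthetic records affected by the rule.
subset(sds$syn, age < 18)
```

Note that age precedes marital in the variable order, so the synthetic ages are already available when the rule is evaluated.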
13
syn()
14
Overview of synthpop functions
[Diagram: observed → syn() → synthetic]
Functions shown: read.obs(), write.syn(), sdc(), compare.synds(), summary.synds(), compare.fit.synds(), glm.synds(), summary.fit.synds(), utility.synds()
Diagram labels: data structure, descriptive, models
15
compare()
18
utility.synds()
19
sdc() & statistical disclosure control
- Data labelling: label
- Removing replicated uniques: rm.replicated.uniques
- Bottom- and top-coding: recode.vars, bottom.top.coding, recode.exclude
- At synthesis stage: smoothing, minbucket
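As a rough sketch of the first two options, the call below labels the synthetic records and removes synthetic records that replicate unique records in the observed data. It assumes the synthpop package and its SD2011 example data; the label text, the variable selection and the seed are illustrative choices, and argument details may differ between package versions.

```r
library(synthpop)

ods <- SD2011[1:1000, c("sex", "age", "edu", "marital", "income")]
sds <- syn(ods, seed = 2015)   # arbitrary seed

# Apply disclosure control to the synthesised object:
#  - label: adds a marker to the synthetic data identifying them as fake
#  - rm.replicated.uniques: drops synthetic records that replicate
#    unique records of the observed data
sds_sdc <- sdc(sds, ods,
               label = "FAKE_DATA",
               rm.replicated.uniques = TRUE)

nrow(sds$syn)       # rows before disclosure control
nrow(sds_sdc$syn)   # rows after removing replicated uniques
```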
20
sdc()
21
Conclusions
The synthpop package for R: facilitating generation, evaluation and analysis of synthetic data