Workshop on Synthetic Data, 1st December 2014, ONS, Titchfield.


1 Workshop on Synthetic Data, 1st December 2014, ONS, Titchfield

2 What is synthpop? A tool for producing synthetic versions of microdata that contain confidential information, so that the synthetic versions are safe to release to users for exploratory analysis and for preparing code. Administrative Data Research Centre - Scotland | Beata Nowok | 1 December 2014

3 Data that look (structurally) like original data but contain artificial units only

Real (input)

Sex | Age | Education | Marital status | Income | Life satisfaction
FEMALE | 57 | VOCATIONAL/GRAMMAR | MARRIED | 800 | PLEASED
MALE | 41 | SECONDARY | UNMARRIED | 1500 | MIXED
FEMALE | 18 | VOCATIONAL/GRAMMAR | UNMARRIED | NA | PLEASED
FEMALE | 78 | PRIMARY/NO EDUCATION | WIDOWED | 900 | MIXED
FEMALE | 54 | VOCATIONAL/GRAMMAR | MARRIED | 1500 | MOSTLY SATISFIED
MALE | 20 | SECONDARY | UNMARRIED | -8 | PLEASED
FEMALE | 39 | SECONDARY | MARRIED | 2000 | MOSTLY SATISFIED
MALE | 39 | SECONDARY | MARRIED | 1197 | MIXED
FEMALE | 38 | VOCATIONAL/GRAMMAR | MARRIED | NA | MOSTLY DISSATISFIED
FEMALE | 73 | VOCATIONAL/GRAMMAR | WIDOWED | 1700 | PLEASED
FEMALE | 54 | SECONDARY | WIDOWED | 2000 | MOSTLY SATISFIED
MALE | 30 | VOCATIONAL/GRAMMAR | UNMARRIED | 900 | MOSTLY SATISFIED
MALE | 68 | SECONDARY | MARRIED | -8 | DELIGHTED
MALE | 61 | PRIMARY/NO EDUCATION | MARRIED | -8 | MIXED

Synthetic (output)

Sex | Age | Education | Marital status | Income | Life satisfaction
MALE | 81 | PRIMARY/NO EDUCATION | MARRIED | 2100 | PLEASED
MALE | 54 | VOCATIONAL/GRAMMAR | MARRIED | 1700 | PLEASED
FEMALE | 32 | VOCATIONAL/GRAMMAR | DIVORCED | 870 | MIXED
FEMALE | 98 | PRIMARY/NO EDUCATION | MARRIED | 800 | MOSTLY DISSATISFIED
FEMALE | 50 | PRIMARY/NO EDUCATION | MARRIED | NA | MOSTLY SATISFIED
FEMALE | 37 | VOCATIONAL/GRAMMAR | MARRIED | 158 | PLEASED
MALE | 28 | VOCATIONAL/GRAMMAR | NA | 1500 | MOSTLY SATISFIED
FEMALE | 62 | PRIMARY/NO EDUCATION | MARRIED | 830 | MOSTLY SATISFIED
MALE | 78 | PRIMARY/NO EDUCATION | MARRIED | NA | PLEASED
FEMALE | 29 | SECONDARY | MARRIED | 580 | MOSTLY SATISFIED
MALE | 59 | PRIMARY/NO EDUCATION | MARRIED | 1300 | MOSTLY SATISFIED
MALE | 41 | SECONDARY | UNMARRIED | 1500 | MIXED
MALE | 18 | SECONDARY | UNMARRIED | -8 | PLEASED
FEMALE | 73 | PRIMARY/NO EDUCATION | WIDOWED | 1350 | MOSTLY SATISFIED

4 Data that behave (statistically) like original data

5 synthpop package: Generating synthetic versions of sensitive microdata for statistical disclosure control. http://cran.r-project.org/package=synthpop

8 Generating synthetic data: Sequentially replacing original data values with synthetic values generated from conditional probability distributions, Y_j ~ (Y_0, Y_1, ..., Y_{j-1}). For each variable in turn, a model is fitted to the real data (fit) and synthetic values are drawn from it (draw).
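The fit/draw sequence on this slide can be illustrated with a small, language-neutral sketch. It is not synthpop's implementation: the function names (`fit_normal`, `fit_line`, `synthesise`) and the choice of simple normal and linear models are assumptions made only for illustration. The input rows are the sex, age and income values from the Real (input) table on slide 3, with the rows containing NA or -8 income left out to keep the sketch short.

```python
import random
import statistics


def fit_normal(values):
    """Fit a normal model: return (mean, standard deviation)."""
    return statistics.mean(values), statistics.stdev(values)


def fit_line(x, y):
    """Least-squares fit of y = a + b*x; return (a, b, residual sd)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sd = (sum(r * r for r in residuals) / (len(x) - 2)) ** 0.5
    return a, b, sd


def synthesise(real, n, seed=1):
    """Draw n synthetic (sex, age, income) records, one variable at a time."""
    rng = random.Random(seed)
    sexes = [row[0] for row in real]

    # "fit": one model per variable, each conditional on the earlier variables
    age_model, income_model = {}, {}
    for sex in set(sexes):
        rows = [row for row in real if row[0] == sex]
        ages = [row[1] for row in rows]
        incomes = [row[2] for row in rows]
        age_model[sex] = fit_normal(ages)            # Y_1 | Y_0
        income_model[sex] = fit_line(ages, incomes)  # Y_2 | Y_0, Y_1

    # "draw": generate each value given the synthetic values already drawn
    synthetic = []
    for _ in range(n):
        sex = rng.choice(sexes)                      # Y_0 from its marginal
        mu, sd = age_model[sex]
        age = rng.gauss(mu, sd)
        a, b, resid_sd = income_model[sex]
        income = rng.gauss(a + b * age, resid_sd)
        synthetic.append((sex, round(age), round(income)))
    return synthetic


# Complete-income rows from the Real (input) table on slide 3
real = [
    ("FEMALE", 57, 800), ("FEMALE", 78, 900), ("FEMALE", 54, 1500),
    ("FEMALE", 39, 2000), ("FEMALE", 73, 1700), ("FEMALE", 54, 2000),
    ("MALE", 41, 1500), ("MALE", 39, 1197), ("MALE", 30, 900),
]

synthetic = synthesise(real, n=5, seed=1)
for record in synthetic:
    print(record)
```

Every synthetic record is generated from fitted models rather than copied from a real record, which is the sense in which the output contains artificial units only.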

9 Generating synthetic data: real data -> syn() -> synthetic data

10 Overview of synthpop functions
- Real data: read.real()
- Synthesis: syn()
- Synthetic data: write.syn(), sdc()
- Descriptive comparison: compare.synds(), summary.synds()
- Models: glm.synds(), summary.fit.synds(), compare.fit.synds()

12 syn() & common data problems
- Missing-data codes: contNA
  - categorical variables: additional factor level(s)
  - continuous variables: specified by contNA and modelled separately
- Semi-continuous variables: semicont
- Restricted values (interrelationships between variables): rules & rvalues
- Linear constraints: denom
- Non-negativity / non-normality: method set to 'lognorm', 'sqrtnorm' or 'cubertnorm'
- Deterministic relations: method set to "~I(…)"
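Two of the items above, missing-data codes in a continuous variable and non-negativity, can be sketched together. This is a conceptual illustration only, not synthpop code: `MISSING_CODES` and `synthesise_income` are hypothetical names, and the two-step scheme (first draw "missing code or observed", then draw a value from a normal model fitted on the log scale) is an assumed simplification in the spirit of contNA and 'lognorm'. The input is the Income column of the Real (input) table on slide 3, with NA written as `None`.

```python
import math
import random
import statistics

# Values treated as missing-data codes rather than as incomes (assumption)
MISSING_CODES = {None, -8}


def synthesise_income(values, n, seed=1):
    """Synthesise a continuous column that contains missing-data codes."""
    rng = random.Random(seed)

    # Model the observed part separately, on the log scale,
    # so that synthetic incomes can never be negative.
    observed = [v for v in values if v not in MISSING_CODES]
    logs = [math.log(v) for v in observed]
    mu, sd = statistics.mean(logs), statistics.stdev(logs)

    synthetic = []
    for _ in range(n):
        # Step 1: resample a real entry to decide code vs observed value
        status = rng.choice(values)
        if status in MISSING_CODES:
            synthetic.append(status)  # keep the code itself
        else:
            # Step 2: draw a new value from the log-scale model
            synthetic.append(math.exp(rng.gauss(mu, sd)))
    return synthetic


# Income column of the Real (input) table on slide 3 (NA shown as None)
income = [800, 1500, None, 900, 1500, -8, 2000, 1197, None,
          1700, 2000, 900, -8, -8]

synthetic_income = synthesise_income(income, n=len(income), seed=1)
print(synthetic_income)
```

Keeping the codes out of the fitted model matters here: treating -8 as a genuine income would pull the fitted mean down and would make the log transform impossible.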

13 sdc() & statistical disclosure control
- Data labelling: label
- Removing replicated uniques: rm.replicated.uniques
- Bottom- and top-coding: recode.vars, bottom.top.coding, recode.exclude
- syn(): smoothing, minbucket

Example call:

    sdc(syn.obj, real, label = "false data",
        rm.replicated.uniques = TRUE,
        recode.vars = c("age", "income"),
        bottom.top.coding = list(c(NA, 85), c(NA, 1500)))
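The three sdc() measures named on this slide can be mimicked in a few lines to show what each one does to a record. This is a sketch under stated assumptions, not the package's code: the helper names (`remove_replicated_uniques`, `clamp`, `sdc_sketch`) are hypothetical, a replicated unique is taken to be a synthetic record that is unique in the synthetic data and matches a unique real record, and the order of the steps (removal first, then recoding, then labelling) is assumed. The rows are a subset of the slide 3 tables, and the bounds mirror the example call (age top-coded at 85, income at 1500).

```python
from collections import Counter


def remove_replicated_uniques(synthetic, real):
    """Drop synthetic records that are unique and replicate a unique real record."""
    real_counts = Counter(real)
    syn_counts = Counter(synthetic)
    return [row for row in synthetic
            if not (syn_counts[row] == 1 and real_counts[row] == 1)]


def clamp(value, low, high):
    """Bottom- and top-code one value; None means 'no bound' or 'missing'."""
    if value is None:
        return value
    if low is not None and value < low:
        return low
    if high is not None and value > high:
        return high
    return value


def sdc_sketch(synthetic, real, label, bounds):
    """Apply removal, recoding and labelling; bounds maps column index to (low, high)."""
    protected = []
    for row in remove_replicated_uniques(synthetic, real):
        coded = [clamp(v, *bounds[i]) if i in bounds else v
                 for i, v in enumerate(row)]
        protected.append((label, *coded))
    return protected


# Subset of the Real (input) and Synthetic (output) tables on slide 3
real = [
    ("FEMALE", 57, "VOCATIONAL/GRAMMAR", "MARRIED", 800, "PLEASED"),
    ("MALE", 41, "SECONDARY", "UNMARRIED", 1500, "MIXED"),
    ("FEMALE", 78, "PRIMARY/NO EDUCATION", "WIDOWED", 900, "MIXED"),
]
synthetic = [
    ("MALE", 81, "PRIMARY/NO EDUCATION", "MARRIED", 2100, "PLEASED"),
    ("FEMALE", 98, "PRIMARY/NO EDUCATION", "MARRIED", 800, "MOSTLY DISSATISFIED"),
    ("MALE", 41, "SECONDARY", "UNMARRIED", 1500, "MIXED"),
    ("FEMALE", 29, "SECONDARY", "MARRIED", 580, "MOSTLY SATISFIED"),
]

# Column 1 is age (top-coded at 85), column 4 is income (top-coded at 1500)
protected = sdc_sketch(synthetic, real, "false data",
                       {1: (None, 85), 4: (None, 1500)})
for row in protected:
    print(row)
```

On this subset the record that duplicates a real person (MALE, 41) is dropped, age 98 becomes 85 and income 2100 becomes 1500, which agrees with the sdc() output shown on slide 14.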

14 Real (input) -> Synthetic (output) -> sdc(). The Real (input) and Synthetic (output) tables are the same as on slide 3.

Output of sdc()

label | Sex | Age | Education | Marital status | Income | Life satisfaction
false data | MALE | 81 | PRIMARY/NO EDUCATION | MARRIED | 1500 | PLEASED
false data | MALE | 54 | VOCATIONAL/GRAMMAR | MARRIED | 1500 | PLEASED
false data | FEMALE | 32 | VOCATIONAL/GRAMMAR | DIVORCED | 870 | MIXED
false data | FEMALE | 85 | PRIMARY/NO EDUCATION | MARRIED | 800 | MOSTLY DISSATISFIED
false data | FEMALE | 50 | PRIMARY/NO EDUCATION | MARRIED | NA | MOSTLY SATISFIED
false data | FEMALE | 37 | VOCATIONAL/GRAMMAR | MARRIED | 158 | PLEASED
false data | MALE | 28 | VOCATIONAL/GRAMMAR | NA | 1500 | MOSTLY SATISFIED
false data | FEMALE | 62 | PRIMARY/NO EDUCATION | MARRIED | 830 | MOSTLY SATISFIED
false data | MALE | 78 | PRIMARY/NO EDUCATION | MARRIED | NA | PLEASED
false data | FEMALE | 29 | SECONDARY | MARRIED | 580 | MOSTLY SATISFIED
false data | MALE | 59 | PRIMARY/NO EDUCATION | MARRIED | 1300 | MOSTLY SATISFIED
false data | MALE | 18 | SECONDARY | UNMARRIED | -8 | PLEASED
false data | FEMALE | 73 | PRIMARY/NO EDUCATION | WIDOWED | 1350 | MOSTLY SATISFIED

15 Real (input) and Synthetic (output) tables, repeated from slide 3.

16 synthpop: future developments
- Disclosure control: Providing sufficient disclosure protection
  - Disclosure control measures
  - Watermarking
  - Partially synthetic data
- Data synthesis: Handling various data types, data structures and real data problems
  - Stratified synthesis
  - Value bounds
  - Multiple event data
  - Household and other hierarchical data
  - Complex survey design
  - Small geographic areas
- Package usability: Making synthpop flexible and accessible to a wider range of users
  - A graphical user interface (GUI)
  - Dealing with computational limitations
  - Support for LSs projects
  - Training workshops
- Quality of synthetic data: Measuring and improving analytical validity
  - Tests of synthesising approaches (parametric vs CART models)
  - CART extensions
  - Case studies for ADRC-S projects
  - Guidelines for best practice

http://cran.r-project.org/package=synthpop

