Beata Nowok Chris Dibben & Gillian Raab Administrative Data

Recognising real people in synthetic microdata: risk mitigation and impact on utility
Beata Nowok, Chris Dibben & Gillian Raab
Administrative Data Research Centre – Scotland
Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Skopje, September 2017

Completely synthetic microdata
A statistical disclosure control (SDC) method.
All values of all variables are generated from statistical models, so the microdata set contains artificial units only.
Original data are used to inform the models (to reproduce their essential features).

Completely synthetic microdata
Your data will only be used to create synthetic data for public use, and none of your data values will ever be released (Rubin 1993)

‘Real’ people in synthetic microdata
Records, created by chance, that are similar or identical to the records in the real data

‘Real’ people in synthetic microdata
Information on unique or rare individuals might be inherently disclosive.
People may ‘recognise’ themselves and believe that the data are real, which risks a loss of reputation for the data collection agency.

Risk mitigation
Removal of problematic ‘real’ units from synthetic data: unique real individuals that are replicated as unique in the synthetic data (uniqueness judged by all or a subset of variables).
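The removal rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: records are plain dictionaries, the function names (`key_of`, `replicated_uniques`, `remove_replicated_uniques`) and the toy records are hypothetical, and "unique" is taken to mean that a combination of key-variable values occurs exactly once in a dataset.

```python
from collections import Counter


def key_of(record, key_vars):
    """Project a record (a dict) onto the chosen key variables."""
    return tuple(record[v] for v in key_vars)


def replicated_uniques(observed, synthetic, key_vars):
    """Key combinations that occur exactly once in BOTH datasets."""
    obs_counts = Counter(key_of(r, key_vars) for r in observed)
    syn_counts = Counter(key_of(r, key_vars) for r in synthetic)
    return {k for k, n in obs_counts.items() if n == 1 and syn_counts[k] == 1}


def remove_replicated_uniques(observed, synthetic, key_vars):
    """Drop synthetic records that replicate a unique observed record."""
    flagged = replicated_uniques(observed, synthetic, key_vars)
    return [r for r in synthetic if key_of(r, key_vars) not in flagged]


# Hypothetical toy data, for illustration only.
observed = [
    {"sex": "FEMALE", "age": 57, "marital": "MARRIED"},
    {"sex": "MALE", "age": 41, "marital": "UNMARRIED"},
    {"sex": "MALE", "age": 41, "marital": "UNMARRIED"},
    {"sex": "FEMALE", "age": 78, "marital": "WIDOWED"},
]
synthetic = [
    {"sex": "FEMALE", "age": 57, "marital": "MARRIED"},   # unique in both: removed
    {"sex": "MALE", "age": 41, "marital": "UNMARRIED"},   # not unique in observed: kept
    {"sex": "FEMALE", "age": 78, "marital": "WIDOWED"},   # not unique in synthetic: kept
    {"sex": "FEMALE", "age": 78, "marital": "WIDOWED"},
    {"sex": "MALE", "age": 81, "marital": "MARRIED"},     # absent from observed: kept
]

key_vars = ["sex", "age", "marital"]
cleaned = remove_replicated_uniques(observed, synthetic, key_vars)
```

Requiring uniqueness in both datasets is what keeps the deletion narrow: a combination that is rare but not unique in the observed data is never flagged.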

[Table slides: example observed and synthetic records shown side by side, with columns Sex, Age, Education, Marital status, Income, Life satisfaction.]

Removal of problematic records
The scale of the deletion process
The damage to the synthetic data quality

Data to be synthesised

Variable name  Description                          Data type  Unique values
sex            Sex                                  factor     2
smoke          Smoking cigarettes                   factor     3
edu            Highest educational qualification    factor     5
marital        Marital status                       factor
placesize      Category of the place of residence   factor     6
ls             Perception of life as a whole        factor     8
socprof        Socio-economic status                factor     10
age            Age                                  numeric    79
income         Personal monthly net income          numeric    406
bmi            Body mass index                      numeric    1,395

N = 5,000

Synthesis
R package synthpop.
Series of conditional models: classification and regression trees (CART).
Synthesising order: sex, age, edu, placesize, socprof, marital, income, ls, smoke and bmi.

Synthetic data
3 x 100 synthetic versions: no smoothing; smoothing during synthesis; smoothing after synthesis.
Smoothing of all numeric variables: age, income, and bmi.
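The slides do not say how the smoothing is carried out, so the following Python sketch is only one plausible reading of "smoothing after synthesis". It assumes a Gaussian-kernel jitter whose bandwidth comes from the common rule of thumb 0.9 · min(sd, IQR/1.34) · n^(-1/5). The function names and the toy income values are hypothetical, and this is not presented as what synthpop does.

```python
import random
import statistics


def rule_of_thumb_bandwidth(values):
    """Kernel bandwidth: 0.9 * min(sd, IQR / 1.34) * n ** (-1/5) (assumed rule)."""
    n = len(values)
    sd = statistics.stdev(values)
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr_scaled = (q3 - q1) / 1.34
    spread = min(sd, iqr_scaled) if iqr_scaled > 0 else sd
    return 0.9 * spread * n ** (-1 / 5)


def smooth_after_synthesis(values, rng):
    """Add Gaussian noise with the rule-of-thumb bandwidth to each value."""
    bw = rule_of_thumb_bandwidth(values)
    return [v + rng.gauss(0.0, bw) for v in values]


# Hypothetical synthetic incomes, for illustration only.
synthetic_income = [800, 1500, 900, 2000, 1700, 800, 1500, 1350, 870, 580]
smoothed_income = smooth_after_synthesis(synthetic_income, random.Random(2017))
```

Jittered values no longer coincide exactly with observed ones, which is why smoothing can be expected to reduce exact replications of numeric variables. A real application would also need to leave missing-value codes untouched, which this sketch does not handle.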

Replication analysis
The number of unique individuals in the observed data and the number of their unique replications in the synthetic data.
Uniqueness was examined for all 1,023 possible combinations of the 10 variables, covering every set size from 1 to 10.
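The figure of 1,023 is the number of non-empty subsets of 10 variables (2^10 - 1). A minimal Python sketch of the enumeration and the per-key-set counts follows. The variable names come from the slides. The helper names and the toy records are hypothetical. The sketch returns raw counts only, because the slides do not state the denominators used for the reported proportions.

```python
from collections import Counter
from itertools import combinations

VARIABLES = ["sex", "age", "edu", "placesize", "socprof",
             "marital", "income", "ls", "smoke", "bmi"]


def all_key_sets(variables):
    """Yield every non-empty combination of variables, by increasing size."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)


def count_uniques_and_replications(observed, synthetic, key_vars):
    """Count observed uniques and how many are also unique in the synthetic data."""
    obs = Counter(tuple(r[v] for v in key_vars) for r in observed)
    syn = Counter(tuple(r[v] for v in key_vars) for r in synthetic)
    uniques = [k for k, n in obs.items() if n == 1]
    replications = [k for k in uniques if syn[k] == 1]
    return len(uniques), len(replications)


key_sets = list(all_key_sets(VARIABLES))
print(len(key_sets))  # 1023, i.e. 2**10 - 1

# Hypothetical toy records restricted to two variables, for illustration only.
toy_observed = [
    {"sex": "FEMALE", "age": 57},
    {"sex": "MALE", "age": 41},
    {"sex": "MALE", "age": 41},
]
toy_synthetic = [
    {"sex": "FEMALE", "age": 57},
    {"sex": "MALE", "age": 81},
    {"sex": "FEMALE", "age": 30},
]

toy_results = {
    keys: count_uniques_and_replications(toy_observed, toy_synthetic, keys)
    for keys in all_key_sets(["sex", "age"])
}
```

In the toy data the single observed unique is replicated when judged by age or by sex and age together, but not by sex alone, which shows how the result depends on the chosen key set.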

[Figure: proportion of unique individuals in the original data and proportion of unique replications (average over 100 synthetic datasets, no smoothing), plotted against the sets of key variables used for identifying uniques and replications. Annotated values: 99.8% and 17%.]

Impact of sample size

Impact of smoothing

Impact on synthetic data quality

Concluding remarks
No serious data damage caused by the removal of replicated uniques was identified.
The condition that synthetic uniques have to be replications of real unique individuals prevents the extensive removal of less frequent value combinations.

Concluding remarks
The results are specific to the example considered (data characteristics and synthesising method).
A greater impact may be observed for a single synthetic dataset.
Careful investigation of the removal impact and possible alternatives is recommended.
