Recognising real people in synthetic microdata: risk mitigation and impact on utility
Beata Nowok, Chris Dibben & Gillian Raab
Administrative Data Research Centre – Scotland
Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Skopje, 20-22 September 2017
Completely synthetic microdata
A statistical disclosure control (SDC) method.
All values of all variables are generated from statistical models, so the microdata set contains artificial units only.
The original data are used to inform the models, so that the synthetic data reproduce their essential features.
Completely synthetic microdata
Your data will only be used to create synthetic data for public use, and none of your data values will ever be released (Rubin 1993)
‘Real’ people in synthetic microdata
Records, created by chance, that are similar or identical to records in the real data.
‘Real’ people in synthetic microdata
Information on unique or rare individuals might be inherently disclosive.
People may ‘recognise’ themselves and believe that the data are real, which risks a loss of reputation for the data collection agency.
Risk mitigation
Removal of problematic ‘real’ units from the synthetic data.
Problematic units: unique real individuals that are replicated as unique in the synthetic data (uniqueness judged on all or a subset of the variables).
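The removal rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation (the synthesis in the talk uses the R package synthpop); the function names `unique_keys` and `remove_replicated_uniques` and the toy records are hypothetical.

```python
from collections import Counter


def unique_keys(records, keys):
    """Return the key-value combinations that occur exactly once in records."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {combo for combo, n in counts.items() if n == 1}


def remove_replicated_uniques(observed, synthetic, keys):
    """Drop synthetic records that are unique on `keys` and match a record
    that is also unique on `keys` in the observed data."""
    replicated = unique_keys(observed, keys) & unique_keys(synthetic, keys)
    kept = [r for r in synthetic if tuple(r[k] for k in keys) not in replicated]
    return kept, len(synthetic) - len(kept)


# Toy data (hypothetical, not taken from the study).
observed = [
    {"sex": "FEMALE", "age": 57, "marital": "MARRIED"},
    {"sex": "MALE", "age": 41, "marital": "UNMARRIED"},
    {"sex": "MALE", "age": 41, "marital": "UNMARRIED"},
]
synthetic = [
    {"sex": "FEMALE", "age": 57, "marital": "MARRIED"},  # replicates a unique real record
    {"sex": "MALE", "age": 41, "marital": "UNMARRIED"},  # not unique in the observed data
    {"sex": "MALE", "age": 81, "marital": "MARRIED"},    # no real counterpart
]

kept, n_removed = remove_replicated_uniques(
    observed, synthetic, ["sex", "age", "marital"]
)
```

Only the first synthetic record is removed. The other two stay because a synthetic unique is deleted only when it replicates a record that is itself unique in the observed data.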
Illustration: side-by-side ‘Observed’ and ‘Synthetic’ tables of example records, with the columns Sex, Age, Education, Marital status, Income and Life satisfaction.
Removal of problematic records
The scale of the deletion process
The damage to the synthetic data quality
Data to be synthesised (N = 5,000)
Variable name   Description                           Data type   Unique values
sex             Sex                                   factor      2
smoke           Smoking cigarettes                    factor      3
edu             Highest educational qualification     factor      5
marital         Marital status                        factor
placesize       Category of the place of residence    factor      6
ls              Perception of life as a whole         factor      8
socprof         Socio-economic status                 factor      10
age             Age                                   numeric     79
income          Personal monthly net income           numeric     406
bmi             Body mass index                       numeric     1,395
Synthesis
R package synthpop
Series of conditional models: classification and regression trees (CART)
Synthesising order: sex, age, edu, placesize, socprof, marital, income, ls, smoke and bmi
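A series of conditional models corresponds to factorising the joint distribution of the ten variables along the synthesising order. As a sketch, assuming each variable is conditioned on all variables that precede it in the sequence (the slides do not state the predictor sets), and writing x_1 = sex, x_2 = age, ..., x_10 = bmi:

```latex
\[
f(x_1, \dots, x_{10}) \;=\; f(x_1) \prod_{j=2}^{10} f\bigl(x_j \mid x_1, \dots, x_{j-1}\bigr)
\]
```

Each conditional factor is fitted to the observed data (here with CART), and synthetic values are then drawn variable by variable in the same order, using the already synthesised values of the earlier variables as predictors.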
Synthetic data
3 x 100 synthetic versions: no smoothing, smoothing during synthesis, and smoothing after synthesis
Smoothing applied to all numeric variables: age, income and bmi
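The slides do not say which smoother was used, so the following is only an assumed illustration of what smoothing a numeric variable after synthesis could look like: each synthetic value is replaced by a draw from a Gaussian kernel centred on it, with a rule-of-thumb (Silverman) bandwidth. The helper names and the toy income values are hypothetical, and this is not synthpop's implementation.

```python
import random
import statistics


def silverman_bandwidth(values):
    """Rule-of-thumb bandwidth: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(values)
    sd = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    # Fall back to the standard deviation if the interquartile range is zero.
    spread = min(sd, (q3 - q1) / 1.34) if q3 > q1 else sd
    return 0.9 * spread * n ** (-1 / 5)


def smooth(values, rng):
    """Replace each value by a Gaussian draw centred on it (assumed smoother)."""
    h = silverman_bandwidth(values)
    return [rng.gauss(v, h) for v in values]


rng = random.Random(2017)  # fixed seed so the sketch is reproducible
synthetic_income = [800, 800, 900, 1500, 1500, 1700, 2000, 2100]  # toy values
smoothed_income = smooth(synthetic_income, rng)
```

Smoothing of this kind makes it unlikely that a rare numeric value from the observed data is reproduced exactly, which is why it bears on the number of replicated uniques.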
Replication analysis
The number of unique individuals in the observed data and the number of their unique replications in the synthetic data.
Uniqueness examined for all 1,023 possible combinations of the 10 variables, covering every set size from 1 to 10.
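The figure of 1,023 is the number of non-empty subsets of 10 variables, 2^10 - 1. A minimal Python sketch of the enumeration and of the two counts follows; the function names and the toy records are hypothetical, and the variable names are those listed on the data slide.

```python
from collections import Counter
from itertools import combinations

VARIABLES = ["sex", "age", "edu", "placesize", "socprof",
             "marital", "income", "ls", "smoke", "bmi"]


def key_sets(variables):
    """Yield every non-empty subset of variables, from size 1 up to all of them."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)


def uniques(records, keys):
    """Key-value combinations that occur exactly once in records."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return {combo for combo, n in counts.items() if n == 1}


def replication_counts(observed, synthetic, keys):
    """Return (number of observed uniques, number of those replicated as
    unique in the synthetic data) for one set of key variables."""
    observed_uniques = uniques(observed, keys)
    replicated = observed_uniques & uniques(synthetic, keys)
    return len(observed_uniques), len(replicated)


n_key_sets = sum(1 for _ in key_sets(VARIABLES))  # 2**10 - 1

# Toy example on two variables (hypothetical records).
observed = [{"sex": "F", "age": 57}, {"sex": "M", "age": 41}, {"sex": "M", "age": 41}]
synthetic = [{"sex": "F", "age": 57}, {"sex": "M", "age": 81}, {"sex": "M", "age": 81}]
toy_counts = {keys: replication_counts(observed, synthetic, keys)
              for keys in key_sets(["sex", "age"])}
```

In the toy example the key set ("sex", "age") gives one observed unique and one unique replication, and the same holds for "sex" and "age" taken alone.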
Figure: proportion of unique individuals in the original data and proportion of unique replications (an average over 100 synthetic datasets, no smoothing), plotted against the sets of key variables used for identifying uniques and replications. Values annotated on the plot: 99.8% and 17%.
Impact of sample size
Impact of smoothing
Impact on synthetic data quality
Concluding remarks
No serious data damage caused by the removal of replicated uniques was identified.
The condition that synthetic uniques have to be replications of real unique individuals prevents the extensive removal of less frequent value combinations.
Concluding remarks
The results are specific to the example considered (data characteristics and synthesising method).
A greater impact may be observed for a single synthetic dataset.
Careful investigation of the impact of removal and of possible alternatives is recommended.