Beata Nowok Chris Dibben & Gillian Raab Administrative Data

Presentation transcript:

Recognising real people in synthetic microdata: risk mitigation and impact on utility
Beata Nowok, Chris Dibben & Gillian Raab
Administrative Data Research Centre – Scotland
Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Skopje, 20-22 September 2017

Completely synthetic microdata
Statistical disclosure control (SDC) method
All values of all variables are generated from statistical models, giving a microdata set of artificial units only
Original data are used to inform the models (to reproduce their essential features)

Completely synthetic microdata
Your data will only be used to create synthetic data for public use, and none of your data values will ever be released (Rubin 1993)

‘Real’ people in synthetic microdata
Records, created by chance, that are similar or identical to records in the real data

‘Real’ people in synthetic microdata
Information on unique or rare individuals might be inherently disclosive
People may ‘recognise’ themselves and believe that the data are real, leading to a loss of reputation for the data collection agency

Risk mitigation
Removal of problematic ‘real’ units from the synthetic data
Unique real individuals that are replicated as unique in the synthetic data (uniqueness judged by all or a subset of variables)
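The removal rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`key_of`, `replicated_uniques`, `remove_replicated_uniques`) and the toy records are hypothetical, and records are assumed to be plain dictionaries. A synthetic record is dropped only when its combination of key values occurs exactly once in the observed data and exactly once in the synthetic data.

```python
from collections import Counter


def key_of(record, keys):
    """Project a record (a dict) onto the chosen key variables."""
    return tuple(record[k] for k in keys)


def replicated_uniques(observed, synthetic, keys):
    """Key combinations that are unique in the observed AND in the synthetic data."""
    obs_counts = Counter(key_of(r, keys) for r in observed)
    syn_counts = Counter(key_of(r, keys) for r in synthetic)
    return {k for k, n in syn_counts.items() if n == 1 and obs_counts.get(k) == 1}


def remove_replicated_uniques(observed, synthetic, keys):
    """Drop synthetic records that replicate a unique observed individual."""
    flagged = replicated_uniques(observed, synthetic, keys)
    return [r for r in synthetic if key_of(r, keys) not in flagged]


# Hypothetical toy data, for illustration only.
observed = [
    {"sex": "FEMALE", "age": 57, "marital": "MARRIED"},
    {"sex": "MALE", "age": 41, "marital": "UNMARRIED"},
    {"sex": "MALE", "age": 41, "marital": "UNMARRIED"},
    {"sex": "FEMALE", "age": 78, "marital": "WIDOWED"},
]
synthetic = [
    {"sex": "FEMALE", "age": 57, "marital": "MARRIED"},    # unique in both: removed
    {"sex": "MALE", "age": 41, "marital": "UNMARRIED"},    # not unique in observed: kept
    {"sex": "MALE", "age": 81, "marital": "MARRIED"},      # absent from observed: kept
    {"sex": "FEMALE", "age": 78, "marital": "WIDOWED"},    # duplicated in synthetic: kept
    {"sex": "FEMALE", "age": 78, "marital": "WIDOWED"},
]

keys = ["sex", "age", "marital"]
cleaned = remove_replicated_uniques(observed, synthetic, keys)
print(len(synthetic), "->", len(cleaned))  # 5 -> 4
```

Passing a shorter `keys` list corresponds to judging uniqueness by a subset of the variables rather than by all of them.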

[Example tables: observed and synthetic records with the variables Sex, Age, Education, Marital status, Income and Life satisfaction; the individual rows are not recoverable from the transcript]

Removal of problematic records
The scale of the deletion process
The damage to the synthetic data quality

Data to be synthesised (N = 5,000)

Variable name  Description                         Data type  Unique values
sex            Sex                                 factor     2
smoke          Smoking cigarettes                  factor     3
edu            Highest educational qualification   factor     5
marital        Marital status                      factor
placesize      Category of the place of residence  factor     6
ls             Perception of life as a whole       factor     8
socprof        Socio-economic status               factor     10
age            Age                                 numeric    79
income         Personal monthly net income         numeric    406
bmi            Body mass index                     numeric    1,395

Synthesis
R package synthpop
Series of conditional models: classification and regression trees (CART)
Synthesising order: sex, age, edu, placesize, socprof, marital, income, ls, smoke and bmi
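The idea of synthesising variables one at a time, each conditional on what has already been generated, can be illustrated with a deliberately simplified Python sketch. It is not synthpop and does not fit CART models. As a stand-in, each variable after the first is drawn from the observed values that share the same value of the immediately preceding variable, which plays the role of a very crude tree leaf. The function name `synthesise` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def synthesise(observed, order, n, seed=0):
    """Sequential synthesis sketch (NOT CART): condition each variable on the previous one."""
    rng = random.Random(seed)

    # First variable in the order: sample with replacement from the observed values.
    first = order[0]
    synthetic = [{first: rng.choice(observed)[first]} for _ in range(n)]

    # Each later variable: draw from observed donors with the same preceding value.
    for prev, var in zip(order, order[1:]):
        donors = defaultdict(list)
        for r in observed:
            donors[r[prev]].append(r[var])
        for s in synthetic:
            s[var] = rng.choice(donors[s[prev]])
    return synthetic


# Hypothetical toy data, for illustration only.
observed = [
    {"sex": "FEMALE", "age": 57, "edu": "VOCATIONAL/GRAMMAR"},
    {"sex": "MALE", "age": 41, "edu": "SECONDARY"},
    {"sex": "FEMALE", "age": 78, "edu": "PRIMARY/NO EDUCATION"},
    {"sex": "MALE", "age": 54, "edu": "VOCATIONAL/GRAMMAR"},
]

synthetic = synthesise(observed, order=["sex", "age", "edu"], n=5)
for record in synthetic:
    print(record)
```

A real CART-based synthesis conditions on all previously synthesised variables and lets the tree decide which splits matter. The sketch only shows how the synthesising order determines what each variable is conditioned on.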

Synthetic data
3 x 100 synthetic versions: no smoothing, smoothing during synthesis, smoothing after synthesis
Smoothing of all numeric variables: age, income and bmi

Replication analysis
The number of unique individuals in the observed data and the number of their unique replications in the synthetic data
Uniqueness examined for all 1,023 possible combinations of the variables, for all possible set sizes (between 1 and 10)
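The figure of 1,023 is the number of non-empty subsets of the ten variables, 2^10 - 1. A short Python sketch, with hypothetical function names and toy records, enumerates those key sets and counts, for a given key set, the unique observed individuals and how many of them reappear as unique in the synthetic data.

```python
from collections import Counter
from itertools import combinations

VARIABLES = ["sex", "age", "edu", "placesize", "socprof",
             "marital", "income", "ls", "smoke", "bmi"]


def all_key_sets(variables):
    """Yield every non-empty subset of the variables, from size 1 up to all of them."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)


def count_uniques_and_replications(observed, synthetic, keys):
    """Return (number of observed uniques, number of them replicated as unique)."""
    def project(record):
        return tuple(record[k] for k in keys)

    obs = Counter(project(r) for r in observed)
    syn = Counter(project(r) for r in synthetic)
    obs_unique = {k for k, n in obs.items() if n == 1}
    replicated = {k for k in obs_unique if syn.get(k) == 1}
    return len(obs_unique), len(replicated)


print(len(list(all_key_sets(VARIABLES))))  # 1023

# Hypothetical two-variable toy data, for illustration only.
observed = [{"sex": "F", "age": 57}, {"sex": "M", "age": 41}, {"sex": "M", "age": 30}]
synthetic = [{"sex": "F", "age": 57}, {"sex": "M", "age": 57},
             {"sex": "M", "age": 30}, {"sex": "M", "age": 30}]

results = {keys: count_uniques_and_replications(observed, synthetic, keys)
           for keys in all_key_sets(["sex", "age"])}
for keys, counts in results.items():
    print(keys, counts)
# ('sex',) (1, 1)
# ('age',) (3, 0)
# ('sex', 'age') (3, 1)
```

Repeating the count over many synthetic versions and averaging would give summary figures per key set. How the counts are turned into proportions is left open here, because the transcript does not state the denominator.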

[Figure: proportion of unique individuals in the original data and proportion of unique replications (an average over 100 synthetic datasets, no smoothing), by sets of key variables used for identifying uniques and replications. Annotated values: 99.8% and 17%]

Impact of sample size

Impact of smoothing

Impact on synthetic data quality

Impact on synthetic data quality

Concluding remarks
No serious damage to the data caused by the removal of replicated uniques was identified
The condition that synthetic uniques have to be replications of real unique individuals prevents the extensive removal of less frequent value combinations

Concluding remarks
Results are specific to the example considered (data characteristics and synthesising method)
A greater impact may be observed for a single synthetic data set
Careful investigation of the impact of removal and of possible alternatives is recommended