Creation of synthetic microdata in 2021 Census Transformation Programme (proof of concept) Robert Rendell.

Presentation transcript:

Creation of synthetic microdata in 2021 Census Transformation Programme (proof of concept) Robert Rendell

Outline
- Briefly introduce microdata and current microdata products
- Synthetic data generation methodology
- Utility analysis
- Next steps

Census Microdata Products
- User requirements
- Data access
- Detail in the data
- Data protection

2011 Census Microdata Products for England and Wales
5 products, 3 access levels
- Open access: Teaching File (1%)
- Safeguarded products: two individual-level 5% samples
- Secure products: 10% individual sample, 10% household sample

Synthetic Microdata
Proposed product: safeguarded 5% partially synthetic 2011 Census household microdata sample for England and Wales
- Enables wider access to meaningful census microdata
- Access for international users
- Access for commercial users

Synthetic Microdata: Initial Research
- Initial test on one area (500,000 records)
- Select a random 5% sample of the test area to represent the microdata sample
- Remove values from some variables
- Impute missing values using CANCEIS (Canadian Census Edit and Imputation System)
- All records will undergo some level of imputation
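
The sampling and value-removal steps listed above can be sketched as follows. This is only an illustration: the function name `sample_and_blank`, the variable names, and the rule that each sampled record loses a random non-empty subset of the candidate variables are assumptions, not details given in the presentation.

```python
import random

def sample_and_blank(records, candidates, fraction=0.05, seed=0):
    """Draw a random sample and blank at least one candidate variable per record.

    Hypothetical helper: the blanking rule is an assumption chosen so that
    every sampled record needs some imputation.
    """
    rng = random.Random(seed)
    size = max(1, round(len(records) * fraction))
    # Copy the sampled records so the source data is left untouched.
    sample = [dict(r) for r in rng.sample(records, size)]
    for rec in sample:
        n_blank = rng.randint(1, len(candidates))
        for var in rng.sample(candidates, n_blank):
            rec[var] = None  # None marks a value to be imputed later
    return sample

# Illustrative data only; not census records.
records = [{"id": i, "age": 30 + i % 40, "sex": "F"} for i in range(1000)]
sample = sample_and_blank(records, ["age", "sex"])
```

The blanked sample would then be passed to the imputation system, which fills in every `None`.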

Why CANCEIS?
- The imputation system used for the 2011 England and Wales Census
- Imputed values are kept plausible by user-defined edit rules
- Parameters can be manipulated as required
- Saves time, as it removes the need to develop a bespoke imputation model

How it works
[Diagram labels: Household Characteristics, Missing Values, Donor Pool]

Output
[Diagram labels: Household, Imputed Data]
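
The donor-pool idea in the diagram can be illustrated with a minimal nearest-neighbour donor imputation. This is not CANCEIS itself, and it ignores edit rules and household structure; the function `impute_from_donor`, the mismatch-count distance, and the field names are assumptions made for the sketch.

```python
def impute_from_donor(record, donor_pool):
    """Fill missing (None) values from the most similar complete donor record."""
    observed = [k for k, v in record.items() if v is not None]

    def distance(donor):
        # Count the observed fields on which donor and recipient disagree.
        return sum(donor[k] != record[k] for k in observed)

    donor = min(donor_pool, key=distance)
    # Keep observed values; copy only the missing ones from the donor.
    return {k: (donor[k] if v is None else v) for k, v in record.items()}

# Illustrative records only.
pool = [
    {"age_band": "30-44", "sex": "F", "marital_status": "Married"},
    {"age_band": "65+", "sex": "M", "marital_status": "Widowed"},
]
recipient = {"age_band": "30-44", "sex": "F", "marital_status": None}
imputed = impute_from_donor(recipient, pool)
```

A production system would also check that the completed record passes the edit rules before accepting a donor.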

Utility Analysis
- Narrow measures: marginal distributions, level of agreement
- Broad measures: propensity score measure

Initial results: Marginal Distributions
- Gives a macro-level indication of how well the synthetic data matches the original data
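
A marginal-distribution comparison of this kind can be sketched as below. The helper names and the choice of the largest absolute difference in category shares as a summary are assumptions; the presentation does not say how the distributions were compared.

```python
from collections import Counter

def marginal_shares(values):
    """Return the share of records falling in each category."""
    counts = Counter(values)
    n = len(values)
    return {category: count / n for category, count in counts.items()}

def max_share_difference(original, synthetic):
    """Largest absolute gap in category share between the two datasets."""
    p = marginal_shares(original)
    q = marginal_shares(synthetic)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

# Illustrative values only.
original = ["Single"] * 5 + ["Married"] * 5
synthetic = ["Single"] * 6 + ["Married"] * 4
gap = max_share_difference(original, synthetic)
```

A gap near 0 means the variable's marginal distribution is well preserved in the synthetic data.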

Assessing Utility: Agreement
- Kappa statistic: measures agreement between original and synthetic variables, taking into account the probability of agreement by chance
- Calculated using a transition matrix
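
As a sketch of how kappa follows from a transition matrix, the function below computes Cohen's kappa, K = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the diagonal share) and p_e is the agreement expected by chance from the row and column totals. The assumption here is that the slide's kappa is the standard Cohen's kappa; the example matrix is made up.

```python
def cohens_kappa(matrix):
    """Cohen's kappa from a square matrix of counts or percentages.

    matrix[i][j] holds the records in category i before imputation
    and category j after imputation.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(k)) / total
    row_share = [sum(row) / total for row in matrix]
    col_share = [sum(matrix[i][j] for i in range(k)) / total for j in range(k)]
    expected = sum(r * c for r, c in zip(row_share, col_share))
    return (observed - expected) / (1 - expected)

# Illustrative 2x2 matrix: 90% of records keep their category.
example = [[45, 5],
           [5, 45]]
kappa = cohens_kappa(example)
```

Here the observed agreement is 0.9 and the chance agreement is 0.5, so K is about 0.8.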

Transition Matrix: Marital Status, before imputation against after imputation
K = 0.83
Categories: Single; Married; Separated (still married); Divorced; Widowed; Civil Partnership; Separated (still civil partnership); Dissolved; Surviving
[Matrix cells not recoverable from the slide. Surviving values: 43.5, 1.2, 0.1, 0.5, 0.0, 0.8, 37.8, 0.4, 0.2, 1.6, 0.6, 6.4, 4.6]

K             Strength of Agreement
< 0.00        Poor
0.00 - 0.20   Slight
0.21 - 0.40   Fair
0.41 - 0.60   Moderate
0.61 - 0.80   Substantial
0.81 - 1.00   Almost Perfect

Kappa Scores

Variable           Kappa Score (K)
Health             0.77
Marital Status     0.83
Sex                0.69
Age                0.68
Carer              0.79
Student            0.86
Year last Worked   0.92

K             Strength of Agreement
< 0.00        Poor
0.00 - 0.20   Slight
0.21 - 0.40   Fair
0.41 - 0.60   Moderate
0.61 - 0.80   Substantial
0.81 - 1.00   Almost Perfect

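Assessing Utility: Broad measure
Propensity score analysis
- Involves stacking the original and synthetic datasets and creating a dummy variable that flags which dataset each record came from
- The dummy variable is used as the dependent variable in a logistic regression model
- Predicted probabilities close to 0.5 indicate that the model cannot distinguish synthetic records from original ones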

Propensity Score Analysis
- Calculate the mean squared error to summarise the overall level of utility
- When the original and synthetic data have similar distributions, U_p is near 0
- U_p = 0.0020979
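
The propensity score measure can be sketched as below, assuming U_p is the usual propensity-score mean squared error: U_p = (1/N) * sum over all N stacked records of (p_i - c)^2, where p_i is the predicted probability that record i is synthetic and c is the share of synthetic records in the stacked data. The hand-rolled logistic regression, the function names, and the toy data are assumptions made to keep the example self-contained; a statistics package would normally do the fitting.

```python
import math

def fit_logistic(rows, labels, steps=2000, lr=0.1):
    """Fit logistic regression by batch gradient ascent (intercept is w[0])."""
    w = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = y - predict(w, x)
            grad[0] += err
            for j, xj in enumerate(x, start=1):
                grad[j] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    """Predicted probability that a record is synthetic."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def propensity_mse(original, synthetic):
    """Stack both datasets, model the synthetic flag, and return U_p."""
    rows = original + synthetic
    labels = [0] * len(original) + [1] * len(synthetic)  # dummy variable
    w = fit_logistic(rows, labels)
    c = len(synthetic) / len(rows)
    return sum((predict(w, x) - c) ** 2 for x in rows) / len(rows)

# Toy data: one binary indicator per record.
same = propensity_mse([[0], [0], [1], [1]], [[0], [0], [1], [1]])
different = propensity_mse([[0]] * 8 + [[1]] * 2, [[0]] * 2 + [[1]] * 8)
```

Identical datasets give a value of 0, while datasets with different distributions give a larger value.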

Next Step: Disclosure Risk Analysis
The purpose of synthesising data is to minimise the possibility of sensitive information being disclosed.
- Traditional disclosure control methods
- Intruder testing
- Make a decision

Conclusion
- Utility analysis suggests that the distributions of the original and synthetic data are similar, which points to a high level of analytical validity
- Disclosure risk analysis is needed to assess whether the imputation provides sufficient protection

Robert.Rendell@ons.gov.uk