Creation of synthetic microdata in 2021 Census Transformation Programme (proof of concept) Robert Rendell.

Outline: a brief introduction to microdata and current microdata products; synthetic data generation methodology; utility analysis; next steps.

Census Microdata Products: user requirements; data access; detail in the data; data protection.

2011 Census Microdata Products for England and Wales: 5 products at 3 access levels. Open access: Teaching File (1%). Safeguarded products: two individual-level 5% samples. Secure products: 10% individual sample and 10% household sample.

Synthetic Microdata. Proposed product: a safeguarded 5% partially synthetic 2011 Census household microdata sample for England and Wales. It enables wider access to meaningful census microdata, including access for international users and for commercial users.

Synthetic Microdata: Initial Research. An initial test was run on one area (500,000 records): select a random 5% sample of the test area to represent the microdata sample; remove values from some variables; impute the missing values using CANCEIS (Canadian Census Edit and Imputation System). All records will undergo some level of imputation.

Why CANCEIS? It is the imputation system used for the 2011 Census of England and Wales; imputed values are kept plausible by user-defined edit rules; parameters can be manipulated as required; it saves time by removing the need to develop a bespoke imputation model.

How it works (diagram labels): Household Characteristics; Missing Values; Donor Pool; Output Household; Imputed Data.
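The test procedure described above (sample, blank some values, fill them from donors) can be sketched as follows. This is a minimal illustration and not CANCEIS itself: the variable names, the 50% blanking rate, the use of non-sampled records as the donor pool and the simple mismatch-count distance are all assumptions made for the example. CANCEIS also applies edit rules and its own donor selection, which are not modelled here, and unlike the slide's design this sketch may leave some records unblanked.

```python
import random

def blank_values(records, variables, rate, rng):
    """Copy each record and set some values of the given variables to None."""
    blanked = []
    for rec in records:
        new = dict(rec)
        for var in variables:
            if rng.random() < rate:
                new[var] = None
        blanked.append(new)
    return blanked

def distance(rec, donor, match_vars):
    """Count the matching variables on which record and donor disagree."""
    return sum(1 for v in match_vars if rec[v] != donor[v])

def impute_from_donors(blanked, donors, match_vars):
    """Fill every None from the closest donor (ties go to the first donor found)."""
    result = []
    for rec in blanked:
        new = dict(rec)
        missing = [v for v, value in rec.items() if value is None]
        if missing:
            donor = min(donors, key=lambda d: distance(rec, d, match_vars))
            for v in missing:
                new[v] = donor[v]
        result.append(new)
    return result

# Hypothetical test area: 200 records with made-up variables and categories.
rng = random.Random(2021)
population = [
    {
        "age_band": rng.choice(["0-15", "16-64", "65+"]),
        "sex": rng.choice(["F", "M"]),
        "marital_status": rng.choice(["Single", "Married", "Widowed"]),
        "health": rng.choice(["Good", "Fair", "Bad"]),
    }
    for _ in range(200)
]

# Random 5% sample stands in for the microdata sample; the rest act as donors.
sample_ids = set(rng.sample(range(len(population)), 10))
sample = [population[i] for i in sorted(sample_ids)]
donors = [population[i] for i in range(len(population)) if i not in sample_ids]

# Blank some values, then impute them from donors matched on age band and sex.
blanked = blank_values(sample, ["marital_status", "health"], 0.5, rng)
synthetic = impute_from_donors(blanked, donors, ["age_band", "sex"])
```

Only the blanked values are replaced, so the result is partially synthetic: every value that was not removed is carried through unchanged.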

Utility Analysis. Narrow measures: marginal distributions and level of agreement. Broad measures: propensity score measure.

Initial results: marginal distributions. These give a macro-level indication of how well the synthetic data match the original data.
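A comparison of marginal distributions can be sketched as below: for one variable, compute the share of records in each category in the original and in the synthetic data, and look at the difference. The category labels and counts are invented for illustration and are not results from the presentation.

```python
from collections import Counter

def marginal(values):
    """Return the proportion of records falling in each category."""
    counts = Counter(values)
    n = len(values)
    return {category: count / n for category, count in counts.items()}

def compare_marginals(original, synthetic):
    """Map each category to (original share, synthetic share, difference)."""
    p = marginal(original)
    q = marginal(synthetic)
    categories = sorted(set(p) | set(q))
    return {
        c: (p.get(c, 0.0), q.get(c, 0.0), q.get(c, 0.0) - p.get(c, 0.0))
        for c in categories
    }

# Hypothetical values of one variable before and after synthesis.
original = ["Single"] * 5 + ["Married"] * 4 + ["Divorced"] * 1
synthetic = ["Single"] * 4 + ["Married"] * 5 + ["Divorced"] * 1

comparison = compare_marginals(original, synthetic)
for category, (orig_share, syn_share, diff) in comparison.items():
    print(f"{category:10s} {orig_share:.2f} {syn_share:.2f} {diff:+.2f}")
```

Matching marginals are only a macro-level check: two datasets can share identical marginals while disagreeing record by record, which is why an agreement measure is used alongside it.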

Assessing Utility: Agreement. The kappa statistic measures agreement between the original and synthetic versions of a variable, taking into account the probability of agreement by chance. It is calculated from a transition matrix.

Transition Matrix: Marital Status before imputation against Marital Status after imputation, K = 0.83. Categories: Single; Married; Separated (still married); Divorced; Widowed; Civil Partnership; Separated (still civil partnership); Dissolved; Surviving. [Matrix layout not recoverable from the slide; surviving cell values: 43.5 1.2 0.1 0.5 0.0 0.8 37.8 0.4 0.2 1.6 0.6 6.4 4.6]

K             Strength of Agreement
< 0.00        Poor
0.00 - 0.20   Slight
0.21 - 0.40   Fair
0.41 - 0.60   Moderate
0.61 - 0.80   Substantial
0.81 - 1.00   Almost Perfect

Kappa Scores

Variable           Kappa Score (K)
Health             0.77
Marital Status     0.83
Sex                0.69
Age                0.68
Carer              0.79
Student            0.86
Year Last Worked   0.92
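The kappa calculation from a transition matrix can be sketched as follows, using the standard Cohen's kappa formula K = (p_o - p_e) / (1 - p_e), where p_o is the observed share of records on the diagonal and p_e is the agreement expected by chance from the row and column totals. The 3x3 matrix is a made-up example, not the marital status matrix from the slides.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square matrix of counts or percentages.

    Rows are the categories before imputation, columns the categories after.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of records that kept the same category.
    observed = sum(matrix[i][i] for i in range(k)) / total
    # Chance agreement: product of row and column shares, summed over categories.
    row_share = [sum(matrix[i]) / total for i in range(k)]
    col_share = [sum(matrix[i][j] for i in range(k)) / total for j in range(k)]
    expected = sum(r * c for r, c in zip(row_share, col_share))
    return (observed - expected) / (1 - expected)

# Hypothetical transition matrix (counts) for a three-category variable.
transition = [
    [40, 5, 5],
    [4, 30, 6],
    [1, 5, 4],
]

kappa = cohen_kappa(transition)
print(round(kappa, 2))  # 0.57, "Moderate" on the scale used in the slides
```

Because the formula only uses shares, the same function works whether the matrix holds raw counts or percentages of all records.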

Assessing Utility: Broad Measure. Propensity score analysis involves stacking the original and synthetic datasets and creating a dummy variable that flags which dataset each record came from. The dummy variable is used as the dependent variable in a logistic regression model. Predicted probabilities close to 0.5 indicate that the model cannot tell synthetic records from original ones.

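Propensity Score Analysis. Calculate the mean squared error of the predicted probabilities (their squared distance from 0.5, averaged over all records) to summarise the overall level of utility, Up. When the original and synthetic data have similar distributions, Up is near 0. Result: Up = 0.0020979.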
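The propensity score measure can be sketched in a few lines. This is an illustration under stated assumptions: the two numeric features are simulated, the logistic regression is fitted with a plain gradient descent written for the example (the presentation does not say which software or model specification was used), and the summary is the usual form Up = (1/N) Σ (p_i - c)², where c is the share of synthetic records in the stacked data (0.5 when both datasets have the same size).

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Predicted probability that record x is synthetic."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, steps=300, lr=0.5):
    """Fit intercept and coefficients by plain gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

def propensity_mse(original, synthetic):
    """Stack both datasets, model the synthetic flag, and return Up."""
    X = original + synthetic
    y = [0] * len(original) + [1] * len(synthetic)  # dummy: 1 = synthetic
    w = fit_logistic(X, y)
    c = len(synthetic) / len(X)  # expected score if the data are indistinguishable
    return sum((predict(w, x) - c) ** 2 for x in X) / len(X)

# Simulated data: 'similar' follows the original distribution, 'shifted' does not.
rng = random.Random(0)
original = [[rng.random(), rng.random()] for _ in range(100)]
similar = [[rng.random(), rng.random()] for _ in range(100)]
shifted = [[a + 0.5, b + 0.5] for a, b in original]

up_identical = propensity_mse(original, original)
up_similar = propensity_mse(original, similar)
up_shifted = propensity_mse(original, shifted)
print(up_identical, up_similar, up_shifted)
```

With equal-sized datasets Up ranges from 0 (the model cannot separate the two sources) to 0.25 (perfect separation), which gives a scale against which a value such as 0.0020979 can be read.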

Next Step: Disclosure Risk Analysis. The purpose of synthesising data is to minimise the possibility of sensitive information being disclosed. Steps: traditional disclosure control methods; intruder testing; make a decision.

Conclusion. Utility analysis suggests that the distributions of the original and synthetic data are similar, which points to a high level of analytical validity. Disclosure risk analysis is needed to assess whether the imputation provides sufficient protection.

Robert.Rendell@ons.gov.uk