Disclosure Limitation in Microdata with Multiple Imputation Jerry Reiter Institute of Statistics and Decision Sciences Duke University.

General setting The agency seeks to release data on individuals. There is a risk of re-identification from matching to external databases. Statistical disclosure limitation is applied to the data before release.

Standard approaches to disclosure limitation: suppress data, add random noise, recode variables, swap data.

Bottom-coding/Top-coding Affects the quality of released microdata. Many data analysts' questions concern people in the tails of distributions, and questions about whole populations are also affected by the loss of detail in the tails.
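To make the loss of tail information concrete, here is a minimal sketch of top-coding, assuming a single numeric income variable; the simulated data, the 99th-percentile threshold, and the statistics printed are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
income = rng.lognormal(mean=10.5, sigma=0.9, size=10_000)  # illustrative skewed incomes

# Top-code: collapse everything above the 99th percentile to the threshold itself.
threshold = np.quantile(income, 0.99)
topcoded = np.minimum(income, threshold)

# Quantities driven by the upper tail change noticeably after top-coding.
print("true mean:       ", round(income.mean()))
print("top-coded mean:  ", round(topcoded.mean()))
print("true share held by top 1%:     ", income[income > threshold].sum() / income.sum())
print("top-coded share held by top 1%:", topcoded[income > threshold].sum() / topcoded.sum())
```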

Data Swapping Affects the quality of released microdata: associations are weakened, weighted means are not accurate, swaps can be implausible (especially for group-quarters (GQ) data), and analysts cannot determine the effects on their inferences. It also may not guarantee confidentiality, since unswapped records remain at risk.
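A minimal sketch of why swapping weakens associations, using simulated data; shuffling the sensitive variable among a randomly chosen 30% of records stands in for swapping between matched pairs, and the swap rate is an assumption for illustration, not a rate any agency has published.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)      # sensitive variable, strongly associated with x

# Shuffle y among a randomly chosen 30% of records (a stand-in for pairwise swapping).
swap_rate = 0.30
idx = rng.choice(n, size=int(n * swap_rate), replace=False)
y_swapped = y.copy()
y_swapped[idx] = y[rng.permutation(idx)]

print("correlation before swapping:", np.corrcoef(x, y)[0, 1])
print("correlation after swapping: ", np.corrcoef(x, y_swapped)[0, 1])
```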

Another approach: Partially synthetic data Release multiple, partially synthetic datasets so that: the released data comprise a mix of observed and synthetic values; the released data look like actual data; and statistical procedures valid for the original data are valid for the released data. Little (1993, JOS), Reiter (2003; 2004, Surv. Meth.)
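A minimal sketch of one way to construct partially synthetic values, assuming a normal linear synthesis model fit by least squares; the variables, the rule for flagging risky records, and the number of released copies are illustrative, and a full implementation would draw the model parameters from their posterior rather than plugging in point estimates.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
age = rng.uniform(20, 65, size=n)
educ = rng.integers(8, 21, size=n)
income = 500 * age + 2_000 * educ + rng.normal(0, 10_000, size=n)  # sensitive variable

# Records judged risky (e.g., population uniques); flagged at random here for illustration.
risky = rng.random(n) < 0.10

# Simple synthesis model: income given non-sensitive predictors, fit by least squares.
X = np.column_stack([np.ones(n), age, educ])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)
sigma = (income - X @ beta).std(ddof=X.shape[1])

# Release several partially synthetic copies: observed values kept,
# risky incomes replaced by draws from the fitted model.
releases = []
for _ in range(5):
    synthetic = income.copy()
    synthetic[risky] = X[risky] @ beta + rng.normal(0, sigma, size=risky.sum())
    releases.append(synthetic)

print("share of values replaced:", risky.mean())
print("observed vs. synthetic mean (copy 1):", income.mean(), releases[0].mean())
```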

Existing applications Replace sensitive values for selected units: Kennickell (1997, Record Linkage Techniques). Replace values of identifiers for selected units: Liu and Little (2002, JSM Proceedings); current research with Sam and Rolando. Replace all values of sensitive variables: Abowd and Woodcock (2001, Confidentiality, Disclosure, and Data Access), the Survey of Income and Program Participation, the Longitudinal Business Database.

Advantages of partially synthetic data Confidentiality is protected, since the risky identifiers or sensitive values are not genuine. Replacements come from realistic models, so associations are preserved (as long as the model is good). Varying amounts of synthesis can be done, depending on the risk/utility tradeoff. The released data provide information in the tails of distributions and allow finer geographic detail to be released. The agency can describe the imputation model, so that analysts have a sense of how their results are affected.

Why multiple copies? Benefits for the usefulness of the data: analysts can estimate the additional uncertainty due to the replacements, and item nonresponse can be handled simultaneously using multiple imputation. Additional risks from releasing multiple copies: an intruder can average the replacement values, and the release highlights which values have been synthesized.

Handling missing and synthetic data simultaneously Reiter (2004, Survey Methodology): Create m completed datasets using multiple imputation for missing data. For each completed dataset, create r replacement datasets using multiple imputation for partially synthetic data. Release M = mr datasets to the public.
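A minimal sketch of that two-stage structure, assuming a single variable with item nonresponse and a flag for risky records; impute_missing and synthesize are hypothetical placeholders (normal draws fitted to the data) standing in for real imputation and synthesis models, which would also draw their parameters from a posterior distribution.

```python
import numpy as np

def impute_missing(data, rng):
    """Stage 1 (placeholder): return one completed copy, filling missing values
    with draws from a normal distribution fitted to the observed values."""
    completed = data.copy()
    miss = np.isnan(completed)
    completed[miss] = rng.normal(np.nanmean(data), np.nanstd(data), size=miss.sum())
    return completed

def synthesize(completed, risky, rng):
    """Stage 2 (placeholder): return one partially synthetic copy, replacing
    risky values with draws from a normal distribution fitted to the completed data."""
    synthetic = completed.copy()
    synthetic[risky] = rng.normal(completed.mean(), completed.std(), size=risky.sum())
    return synthetic

rng = np.random.default_rng(4)
y = rng.normal(50, 10, size=1_000)
y[rng.random(y.size) < 0.05] = np.nan          # item nonresponse
risky = rng.random(y.size) < 0.10              # records needing synthesis

m, r = 3, 2                                    # nests and replicates
releases = []                                  # M = m * r datasets released
for i in range(m):
    completed = impute_missing(y, rng)
    for j in range(r):
        releases.append(synthesize(completed, risky, rng))
print("datasets released:", len(releases))
```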

Inference with missing and partially synthetic data Reiter (2004, Survey Methodology). Estimand: Q = Q(X, Y). In each released dataset D^(i,j), for nest i = 1, ..., m and replicate j = 1, ..., r, compute the point estimate q^(i,j) of Q and its estimated variance u^(i,j).

Quantities needed for inference: the overall average q̄_M = (1/mr) Σ_i Σ_j q^(i,j); the nest averages q̄_i = (1/r) Σ_j q^(i,j); the between-nest variance b_M = Σ_i (q̄_i − q̄_M)² / (m − 1); the average within-nest variance w̄_M = Σ_i Σ_j (q^(i,j) − q̄_i)² / [m(r − 1)]; and the average of the estimated variances ū_M = (1/mr) Σ_i Σ_j u^(i,j).

Inference with missing and partially synthetic data The estimate of Q is q̄_M. The estimate of its variance is T_M = (1 + 1/m) b_M − w̄_M / r + ū_M. For large n, m, and r, use normal-based inference for Q: q̄_M ± z_{1−α/2} √T_M.
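To show how these combining rules operate, here is a sketch in Python, assuming the point estimates and variance estimates from the M = mr released datasets are arranged in (m, r) arrays; the function name combine_mi_synthetic and the simulated inputs are illustrative, not part of any released software.

```python
import numpy as np

def combine_mi_synthetic(q, u):
    """Combine estimates from M = m*r released datasets using the combining
    rules on the preceding slides. q[i, j] is the point estimate and u[i, j]
    its estimated variance in nest i, replicate j."""
    m, r = q.shape
    q_nest = q.mean(axis=1)                                   # nest averages
    q_bar = q.mean()                                          # overall point estimate of Q
    b = q_nest.var(ddof=1)                                    # between-nest variance
    w = ((q - q_nest[:, None]) ** 2).sum() / (m * (r - 1))    # avg within-nest variance
    u_bar = u.mean()                                          # average estimated variance
    T = (1 + 1 / m) * b - w / r + u_bar                       # total variance estimate
    return q_bar, T

# Example with simulated estimates of a population mean.
rng = np.random.default_rng(5)
m, r = 5, 3
q = 50 + rng.normal(0, 0.5, size=(m, r))      # illustrative point estimates
u = np.full((m, r), 0.25)                     # illustrative variance estimates
q_bar, T = combine_mi_synthetic(q, u)
lo, hi = q_bar - 1.96 * np.sqrt(T), q_bar + 1.96 * np.sqrt(T)  # large-sample 95% CI
print(f"estimate {q_bar:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```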

Ongoing research Semi-parametric and non-parametric data generation methods. The risk/usefulness profile on genuine data in a production setting. Packaged synthesizers.