Jerome Reiter Department of Statistical Science Duke University

Presentation transcript:

Multiple Imputation for Privacy Protection: Where Are We and Where Are We Going?
Jerome Reiter, Department of Statistical Science, Duke University

General Setting
- Organization seeks to share confidential, record-level data with others
- Legal and ethical obligations to protect confidentiality
- “De-identification” often insufficient
  - Intruders can match to other data files using common variables
- Traditional disclosure protection: alter/perturb data before release
  - Low intensity perturbations not protective in digital data era
  - High intensity perturbations seriously degrade quality in ways that are hard to account for in statistical inferences

Multiple Imputation for Disclosure Limitation
- Fully synthetic data proposed by Rubin (1993)
  - Fit statistical models to the data, and simulate new records for public release
  - Release multiple copies to enable researchers to estimate uncertainty
- Low risk, since matching to simulated records is not sensible
- Can preserve associations, keep tails, enable estimation at smaller geographical levels
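As a rough illustration of the fit-then-simulate idea, the Python sketch below fits a multivariate normal model to toy data and produces m simulated copies. The normal model, the bootstrap step (a crude stand-in for drawing parameters from a posterior), the function name `make_synthetic_copies`, and the toy data are all assumptions made for illustration; they are not the synthesizers used in any actual release.

```python
import numpy as np

def make_synthetic_copies(data, m=5, n_synth=None, seed=0):
    """Draw m synthetic datasets from a multivariate normal fitted to `data`.

    Illustrative only: real synthesizers use much richer models.
    """
    rng = np.random.default_rng(seed)
    n, _ = data.shape
    n_synth = n if n_synth is None else n_synth
    copies = []
    for _ in range(m):
        # Bootstrap resample: a rough stand-in for a posterior draw of the parameters.
        boot = data[rng.integers(0, n, size=n)]
        mean = boot.mean(axis=0)
        cov = np.cov(boot, rowvar=False)
        # Simulate entirely new records from the fitted model.
        copies.append(rng.multivariate_normal(mean, cov, size=n_synth))
    return copies

# Toy "confidential" data, simulated here purely for the example.
rng = np.random.default_rng(1)
confidential = rng.multivariate_normal(
    [50.0, 10.0], [[25.0, 6.0], [6.0, 4.0]], size=500
)
synthetic_copies = make_synthetic_copies(confidential, m=5)
```

Releasing several copies rather than one is what lets an analyst see how much estimates vary across draws from the synthesis model.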

What Have We Learned So Far?
- The original approach has been modified in practically relevant ways.
- The multiple imputation combining rules of Rubin (1987) are not appropriate for synthetic data variants.
- It is possible to create useful synthetic data products.

Variants of the Approach
- Partially synthetic data (Little 1993, Reiter 2003)
- Simultaneous multiple imputation for missing data and disclosure control (Reiter 2004)
- Synthetic samples from census data (Drechsler and Reiter 2010)
- New synthesis methods and models
  - Location data, both areal and point level
  - Data nested within households
  - Simultaneous synthesis and editing of erroneous values

Variances for the Variants
- Fully synthetic data require new inferential methods (Raghunathan, Reiter, Rubin 2003)
- Also the case for partial synthesis (Reiter 2003)
- And so on for the other variants…
- Key insights
  - One needs to think carefully about what to condition on
  - How the data were generated determines the posterior distribution and hence the variance estimator
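To make the difference between the variance estimators concrete, here is a small Python sketch of the three combining rules as they are commonly stated. Let q_l and u_l be the point estimate and its estimated variance from synthetic dataset l = 1, ..., m, with q_bar the mean of the q_l, b their sample variance, and u_bar the mean of the u_l. The function names and the toy numbers are assumptions for illustration.

```python
import numpy as np

def _summaries(q, u):
    q = np.asarray(q, dtype=float)
    u = np.asarray(u, dtype=float)
    m = len(q)
    # q_bar: average point estimate; b: between-dataset variance; u_bar: average within variance
    return m, q.mean(), q.var(ddof=1), u.mean()

def combine_missing_data(q, u):
    """Rubin (1987) rule for multiply imputed missing data: T = u_bar + (1 + 1/m) b."""
    m, q_bar, b, u_bar = _summaries(q, u)
    return q_bar, u_bar + (1 + 1 / m) * b

def combine_fully_synthetic(q, u):
    """Fully synthetic rule: T_f = (1 + 1/m) b - u_bar (can be negative)."""
    m, q_bar, b, u_bar = _summaries(q, u)
    return q_bar, (1 + 1 / m) * b - u_bar

def combine_partially_synthetic(q, u):
    """Partially synthetic rule: T_p = u_bar + b / m."""
    m, q_bar, b, u_bar = _summaries(q, u)
    return q_bar, u_bar + b / m

# Toy estimates from m = 4 synthetic datasets (made-up numbers).
q = [1.0, 2.0, 3.0, 4.0]
u = [0.2, 0.2, 0.2, 0.2]
est_rubin = combine_missing_data(q, u)
est_full = combine_fully_synthetic(q, u)
est_partial = combine_partially_synthetic(q, u)
```

The point estimate is the same under all three rules; only the variance changes, which is why applying the missing-data rule to synthetic data gives misleading intervals. Because the fully synthetic estimator subtracts u_bar, it can come out negative, in which case an adjusted estimator is needed.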

Synthetic Data Products Are in the Wild!
- Implementations by the Census Bureau
  - Synthetic Longitudinal Business Database
  - Synthetic Survey of Income and Program Participation
  - American Community Survey group quarters data
  - OnTheMap
- Other implementations by National Cancer Institute, Internal Revenue Service, and national statistics agencies abroad (UK, Germany, Canada, New Zealand)

Longitudinal Business Database (LBD)
- Business dynamics, job flows, market volatility, industrial organization…
- Economic census covering all private non-farm business establishments with paid employees
- Starts with 1976, updated annually
- >30 million establishments
- Commingled confidential data protected by US law (Title 13 and Title 26)

General Approach to Synthesizing LBD
- Generate predictive distribution of Y | X
  f(y1, y2, y3, … | X) = f(y1 | X) f(y2 | y1, X) f(y3 | y1, y2, X) ···
- Use industry (NAICS) as “by” group
- Models include multinomials, classification trees, regression trees, ...
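The sequential factorization can be sketched in Python as follows: within each "by" group, a categorical y1 is drawn from its observed shares (a multinomial model), and then a continuous y2 is drawn from a linear regression on x and the synthetic y1. The variable roles, the simple models, the use of plug-in point estimates rather than posterior draws, and the toy data are assumptions for illustration; the slide does not give the actual LBD model specifications.

```python
import numpy as np

def synthesize_group(y1, y2, x, rng):
    """Synthesize (y1, y2) for one 'by' group, keeping x as is."""
    n = len(x)
    # Step 1, f(y1 | X): multinomial draw using the group's observed category shares.
    cats, counts = np.unique(y1, return_counts=True)
    y1_syn = rng.choice(cats, size=n, p=counts / counts.sum())

    def design(y1_vals):
        # Intercept, x, and dummies for all but the first category.
        dummies = (y1_vals[:, None] == cats[None, 1:]).astype(float)
        return np.column_stack([np.ones(n), x, dummies])

    # Step 2, f(y2 | y1, X): regression fitted on the real data ...
    A = design(y1)
    beta, *_ = np.linalg.lstsq(A, y2, rcond=None)
    resid = y2 - A @ beta
    sigma = np.sqrt(resid @ resid / max(n - A.shape[1], 1))
    # ... then simulated conditional on the *synthetic* y1.
    y2_syn = design(y1_syn) @ beta + rng.normal(0.0, sigma, size=n)
    return y1_syn, y2_syn

def synthesize(industry, x, y1, y2, seed=0):
    rng = np.random.default_rng(seed)
    y1_syn = np.empty_like(y1)
    y2_syn = np.empty_like(y2, dtype=float)
    for g in np.unique(industry):  # separate models per "by" group
        idx = industry == g
        y1_syn[idx], y2_syn[idx] = synthesize_group(y1[idx], y2[idx], x[idx], rng)
    return y1_syn, y2_syn

# Toy data, simulated purely for the example.
rng = np.random.default_rng(42)
n = 600
industry = rng.integers(0, 3, size=n)
x = rng.normal(size=n)
y1 = rng.integers(0, 3, size=n)
y2 = 2.0 + 1.5 * x + 0.5 * y1 + rng.normal(scale=0.3, size=n)
y1_syn, y2_syn = synthesize(industry, x, y1, y2)
```

Each later variable is generated conditional on the synthetic values of the earlier ones, which is what lets the chain of conditionals reproduce the joint distribution implied by the models.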

Variables in Synthetic LBD (phase 2)

What Do We Not Yet Know How to Do So Well?
- Quantify disclosure risks
- Generate data with high analytic validity for challenging settings
  - High dimensional data (as in 100s of variables) with modest sample sizes
  - Data with massive outliers
  - Repeated releases of longitudinal data
- Enable users to determine how their particular analyses are affected by the synthesis process

Disclosure Risk in Synthetic Data
- Tend to have low risks of identification disclosure, since not meaningful to match synthetic records to actual individuals
- Inferential disclosure risks of more concern
  - Synthesizer may perfectly predict some x for a certain type of individual, so synthetic x for individuals of this type always match actual x
  - Related, synthesizer may be too accurate in predicting some values
- Assessing these risks conceptually feasible, but computationally hard

Utility of Synthetic Data
- Synthetic data inherit only the features baked into synthesis models
- Quality of results based on synthetic data dependent on quality of the synthesis models
- Synthetic data cannot preserve every analysis (otherwise we have the original data!)
- One approach: provide users feedback via a verification server (Barrientos, et al. 2018)

Where Are We Going?
- Increased use of synthetic data for public use products
  - Too much information out there to feel comfortable with low intensity methods for public use products with unrestricted access…
  - Unless we change our attitudes and laws about privacy and confidentiality
- Big push to satisfy formal guarantees of privacy
  - Differential privacy (DP)
  - Synthetic data have been proposed as a way to satisfy DP

Generating DP Synthetic Data
- To date, most DP synthesizers based on adding noise to sufficient statistics, which are then used to generate synthetic data
  - Ex: create DP counts for disjoint subgroups, and make individual records from those noisy counts
- Often DP counts are post-processed to improve data quality
  - Ensure non-negative integer counts
  - Additivity across geographic hierarchies
- This creates complicated generative distributions
- Not obvious how to estimate moments or interval estimates in statistically principled ways
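The noisy-counts example can be sketched in Python as below. The sketch assumes a publicly known list of subgroups (`domain`), neighboring datasets that differ by adding or removing one record (so the vector of disjoint counts has sensitivity 1), and the Laplace mechanism with scale 1/epsilon. The function name and toy data are hypothetical, and a production implementation would need a vetted DP library rather than floating-point noise drawn this way.

```python
import numpy as np

def dp_synthetic_records(values, domain, epsilon, seed=None):
    """Make synthetic records from Laplace-noised counts of disjoint subgroups."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    # The domain must be fixed in advance, not learned from the confidential data.
    domain = np.asarray(domain)
    counts = np.array([(values == c).sum() for c in domain])
    # Laplace mechanism: sensitivity 1 under add/remove-one-record neighbors (assumed).
    noisy = counts + rng.laplace(0.0, 1.0 / epsilon, size=len(domain))
    # Post-processing: round to integers and clip at zero.
    cleaned = np.clip(np.rint(noisy), 0, None).astype(int)
    # One synthetic record per unit of cleaned count.
    synthetic = np.repeat(domain, cleaned)
    return synthetic, noisy, cleaned

# Toy confidential attribute; "D" is in the domain but has no records.
values = np.array(["A"] * 40 + ["B"] * 25 + ["C"] * 3)
synthetic, noisy, cleaned = dp_synthetic_records(
    values, ["A", "B", "C", "D"], epsilon=1.0, seed=0
)
```

Rounding and clipping do not weaken the privacy guarantee, since they are post-processing, but they make the distribution of the released counts harder to describe, which is the inferential difficulty the slide raises.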

DP Synthetic Data and Multiple Imputation
- Can multiple imputation help here?
  - Make multiple DP implicates from the generative algorithm
  - Normal approximations for inference (Reiter 2003 rules, generally)
- But… each implicate leaks information about the confidential data
  - Adherents to DP might conclude multiple implicates leak too much
- Research questions
  - Better ways to measure privacy leakage from multiple DP implicates?
  - Or from releasing one DP implicate plus MI variance estimates?
  - New MI methods?