Multiple Imputation for Privacy Protection: Where Are We and Where Are We Going?
Jerome Reiter, Department of Statistical Science, Duke University
General Setting
- Organization seeks to share confidential, record-level data with others
- Legal and ethical obligations to protect confidentiality
- "De-identification" often insufficient: intruders can match to other data files using common variables
- Traditional disclosure protection: alter/perturb data before release
  - Low-intensity perturbations are not protective in the digital data era
  - High-intensity perturbations seriously degrade quality in ways that are hard to account for in statistical inferences
Multiple Imputation for Disclosure Limitation
- Fully synthetic data proposed by Rubin (1993)
- Fit statistical models to the data, and simulate new records for public release
- Release multiple copies to enable researchers to estimate uncertainty
- Low risk, since matching to simulated records is not sensible
- Can preserve associations, keep tails, enable estimation at smaller geographical levels
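The recipe above (fit a model, then simulate releases) can be sketched in a few lines. This is a minimal illustration assuming a multivariate normal synthesis model; real synthesizers use far richer models, and the `synthesize` helper is hypothetical, not from any cited implementation.

```python
import numpy as np

def synthesize(confidential, m=5, rng=None):
    """Fit a multivariate normal to the confidential data and draw m
    fully synthetic copies of the same size for public release."""
    rng = np.random.default_rng(rng)
    mu = confidential.mean(axis=0)
    cov = np.cov(confidential, rowvar=False)
    n = confidential.shape[0]
    return [rng.multivariate_normal(mu, cov, size=n) for _ in range(m)]

# Example: two correlated variables; the synthetic copies preserve the
# association without containing any actual record
rng = np.random.default_rng(0)
real = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=1000)
copies = synthesize(real, m=5, rng=1)
```

Releasing all `m` copies, rather than one, is what lets analysts estimate the extra uncertainty introduced by synthesis.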
What Have We Learned So Far?
- The original approach has been modified in practically relevant ways
- The multiple imputation combining rules of Rubin (1987) are not appropriate for synthetic data variants
- It is possible to create useful synthetic data products
Variants of the Approach
- Partially synthetic data (Little 1993; Reiter 2003)
- Simultaneous multiple imputation for missing data and disclosure control (Reiter 2004)
- Synthetic samples from census data (Drechsler and Reiter 2010)
- New synthesis methods and models:
  - Location data, both areal and point level
  - Data nested within households
  - Simultaneous synthesis and editing of erroneous values
Variances for the Variants
- Fully synthetic data require new inferential methods (Raghunathan, Reiter, and Rubin 2003)
- The same holds for partial synthesis (Reiter 2003), and so on for the other variants
- Key insights:
  - One needs to think carefully about what to condition on
  - How the data were generated determines the posterior distribution, and hence the variance estimator
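To make the contrast concrete, the three combining rules mentioned above can be written side by side. This sketch takes the m point estimates and their variance estimates computed on the implicates; the `combine` helper and its argument names are illustrative.

```python
import numpy as np

def combine(q, u, kind="missing"):
    """Combine m point estimates q and their variance estimates u."""
    q, u = np.asarray(q, float), np.asarray(u, float)
    m = len(q)
    qbar = q.mean()            # overall point estimate
    ubar = u.mean()            # average within-implicate variance
    b = q.var(ddof=1)          # between-implicate variance
    if kind == "missing":      # Rubin (1987): multiple imputation for missing data
        T = ubar + (1 + 1/m) * b
    elif kind == "full":       # Raghunathan, Reiter, Rubin (2003): fully synthetic
        T = (1 + 1/m) * b - ubar   # note: can be negative in small samples
    elif kind == "partial":    # Reiter (2003): partially synthetic
        T = ubar + b / m
    else:
        raise ValueError(kind)
    return qbar, T
```

The point estimate is the same in all three cases; only the variance estimator changes, reflecting what was conditioned on when the released data were generated.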
Synthetic Data Products Are in the Wild!
- Implementations by the Census Bureau:
  - Synthetic Longitudinal Business Database
  - Synthetic Survey of Income and Program Participation
  - American Community Survey group quarters data
  - OnTheMap
- Other implementations by the National Cancer Institute, the Internal Revenue Service, and national statistics agencies abroad (UK, Germany, Canada, New Zealand)
Longitudinal Business Database (LBD)
- Business dynamics, job flows, market volatility, industrial organization…
- Economic census covering all private non-farm business establishments with paid employees
- Starts in 1976, updated annually; >30 million establishments
- Commingled confidential data protected by US law (Title 13 and Title 26)
General Approach to Synthesizing LBD
- Generate the predictive distribution of Y | X:
  f(y1, y2, y3, … | X) = f(y1 | X) f(y2 | y1, X) f(y3 | y1, y2, X) ···
- Use industry (NAICS) as the "by" group
- Models include multinomials, classification trees, regression trees, …
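The factorization above can be sketched as sequential synthesis: fit each conditional on the confidential data, then simulate it given the previously synthesized variables. This sketch uses normal linear regressions as simple stand-ins for the multinomial and tree models named on the slide, ignores parameter uncertainty (a full MI synthesizer would also draw the coefficients from their posterior), and leaves X unsynthesized; all names are illustrative.

```python
import numpy as np

def sequential_synthesize(x, ys, rng=None):
    """Draw y1 from f(y1|X), then y2 from f(y2|y1,X), and so on,
    conditioning each step on the previously *synthesized* values."""
    rng = np.random.default_rng(rng)
    real_preds = np.column_stack([np.ones(len(x)), x])  # design matrix, real data
    syn_preds = real_preds.copy()                       # X itself is not synthesized
    out = []
    for y in ys:
        # Fit the conditional model on the confidential data
        beta, *_ = np.linalg.lstsq(real_preds, y, rcond=None)
        sigma = (y - real_preds @ beta).std(ddof=real_preds.shape[1])
        # Simulate from the fitted conditional, given synthesized predecessors
        y_syn = syn_preds @ beta + rng.normal(0, sigma, size=len(y))
        out.append(y_syn)
        real_preds = np.column_stack([real_preds, y])
        syn_preds = np.column_stack([syn_preds, y_syn])
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y1 = 2 * x + 0.5 * rng.normal(size=500)
y2 = y1 + x + 0.5 * rng.normal(size=500)
synthetic = sequential_synthesize(x, [y1, y2], rng=1)
```

In the LBD application the same loop would run separately within each NAICS "by" group, with tree-based models in place of the linear regressions.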
Variables in Synthetic LBD (phase 2)
What Do We Not Yet Know How to Do So Well?
- Quantify disclosure risks
- Generate data with high analytic validity for challenging settings:
  - High-dimensional data (as in 100s of variables) with modest sample sizes
  - Data with massive outliers
  - Repeated releases of longitudinal data
- Enable users to determine how their particular analyses are affected by the synthesis process
Disclosure Risk in Synthetic Data
- Tend to have low risks of identification disclosure, since it is not meaningful to match synthetic records to actual individuals
- Inferential disclosure risks are of more concern:
  - The synthesizer may perfectly predict some x for a certain type of individual, so the synthetic x for individuals of this type always matches the actual x
  - Relatedly, the synthesizer may be too accurate in predicting some values
- Assessing these risks is conceptually feasible, but computationally hard
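A crude version of the "too accurate" check above can be automated: flag subgroups where synthetic values land very close to the actual confidential values too often. The `risky_groups` helper and its thresholds are illustrative only, not an accepted risk measure from the literature.

```python
import numpy as np

def risky_groups(groups, actual, synthetic, tol=0.5, rate=0.9):
    """Flag groups in which at least `rate` of synthetic values fall
    within `tol` of the actual confidential values."""
    flagged = []
    for g in np.unique(groups):
        idx = groups == g
        close = np.abs(actual[idx] - synthetic[idx]) <= tol
        if close.mean() >= rate:
            flagged.append(g)
    return flagged

# Toy example: group "a" is synthesized exactly (risky), group "b" is not
groups = np.array(["a"] * 10 + ["b"] * 10)
actual = np.arange(20, dtype=float)
synthetic = actual.copy()
synthetic[10:] += 10.0
```

The hard part in practice is the model-based version of this check, which requires integrating over what an intruder could infer from the synthesis model itself.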
Utility of Synthetic Data
- Synthetic data inherit only the features baked into the synthesis models
- The quality of results based on synthetic data depends on the quality of the synthesis models
- Synthetic data cannot preserve every analysis (otherwise we would have the original data!)
- One approach: provide users feedback via a verification server (Barrientos et al. 2018)
Where Are We Going?
- Increased use of synthetic data for public use products
- Too much information out there to feel comfortable with low-intensity methods for public use products with unrestricted access…
  - Unless we change our attitudes and laws about privacy and confidentiality
- Big push to satisfy formal guarantees of privacy: differential privacy (DP)
  - Synthetic data have been proposed as a way to satisfy DP
Generating DP Synthetic Data
- To date, most DP synthesizers are based on adding noise to sufficient statistics, which are then used to generate synthetic data
  - Example: create DP counts for disjoint subgroups, and make individual records from those noisy counts
- Often DP counts are post-processed to improve data quality:
  - Ensure non-negative integer counts
  - Additivity across geographic hierarchies
- This creates complicated generative distributions
  - Not obvious how to estimate moments or interval estimates in statistically principled ways
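The noisy-counts example above can be sketched directly: add Laplace noise to disjoint subgroup counts (sensitivity 1 for counting queries), post-process to non-negative integers, and expand into synthetic records. The helper names are illustrative; real systems such as OnTheMap involve much more elaborate post-processing (e.g., enforcing additivity across geographies).

```python
import numpy as np

def dp_synthetic_counts(counts, epsilon, rng=None):
    """Laplace mechanism on disjoint counts, then post-processing to
    non-negative integers (post-processing does not weaken the DP guarantee)."""
    rng = np.random.default_rng(rng)
    noisy = counts + rng.laplace(0, 1 / epsilon, size=len(counts))
    return np.maximum(0, np.round(noisy)).astype(int)

def records_from_counts(dp_counts, labels):
    # Expand each noisy count into that many individual synthetic records
    return [lab for lab, c in zip(labels, dp_counts) for _ in range(c)]

true_counts = np.array([40, 10, 0, 5])
dp_counts = dp_synthetic_counts(true_counts, epsilon=1.0, rng=0)
records = records_from_counts(dp_counts, ["A", "B", "C", "D"])
```

The rounding and clamping steps are exactly the kind of post-processing that makes the implied generative distribution complicated, and hence principled moment and interval estimation difficult.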
DP Synthetic Data and Multiple Imputation
- Can multiple imputation help here?
  - Make multiple DP implicates from the generative algorithm
  - Use normal approximations for inference (Reiter 2003 rules, generally)
- But each implicate leaks information about the confidential data
  - Adherents to DP might conclude that multiple implicates leak too much
- Research questions:
  - Better ways to measure privacy leakage from multiple DP implicates? Or from releasing one DP implicate plus MI variance estimates?
  - New MI methods?
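The multiple-implicate idea above can be sketched with the simplest DP release, a noisy bounded mean: generate m DP implicates, estimate the same quantity on each, and combine with the Reiter (2003) partially synthetic rule. This is a toy illustration under stated assumptions: basic composition (total budget m × ε per implicate), and a within-implicate variance computed non-privately here purely for exposition (a real release would have to protect that quantity with DP as well).

```python
import numpy as np

def dp_mean_implicate(data, epsilon, lo, hi, rng):
    # Laplace mechanism for a bounded mean: sensitivity is (hi - lo) / n
    clipped = np.clip(data, lo, hi)
    return clipped.mean() + rng.laplace(0, (hi - lo) / (len(data) * epsilon))

rng = np.random.default_rng(0)
confidential = rng.normal(50, 10, size=2000)

m, eps_each = 5, 0.2   # basic composition: total epsilon = m * eps_each = 1.0
qs = np.array([dp_mean_implicate(confidential, eps_each, 0, 100, rng)
               for _ in range(m)])
# Within-implicate sampling variance (illustration only; not DP-protected here)
u = confidential.var(ddof=1) / len(confidential)

qbar = qs.mean()                 # combined point estimate
b = qs.var(ddof=1)               # between-implicate variance
T = u + b / m                    # Reiter (2003) partially synthetic rule
```

The privacy cost is visible in the budget arithmetic: each additional implicate either consumes more total ε or forces more noise per implicate, which is exactly the leakage trade-off posed in the research questions above.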