Published by Esmond Bond. Modified over 9 years ago.
1
Assessing Disclosure for a Longitudinal Linked File
Sam Hawala – US Census Bureau
Sam.hawala@census.gov
November 9th, 2005
2
Outline of the talk
Background of the project
Confidentiality protection
Disclosure analysis
Conclusions
3
Linked SIPP-SSA-IRS Data
The Longitudinal Employer-Household Dynamics (LEHD) Program created a confidential data set that integrates five SIPP panels (1990, 1991, 1992, 1993, 1996) with earnings records and SSA benefits data
The data are very useful to the disability and retirement research communities
LEHD will provide a public-use version (PUF) of the integrated microdata using the synthetic data approach
4
Synthetic Data
Fully-synthetic micro data
  – Uses the population or record linkage structure of the gold standard micro data
  – Generates synthetic entities and data elements from appropriate probability models
Partially-synthetic micro data
  – Preserves the record structure or sampling frame of the gold standard micro data
  – Replaces the data elements with synthetic values sampled from an appropriate probability model
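To make the partially-synthetic idea concrete, here is a minimal sketch in Python. It is not the model used for the SIPP-SSA-IRS file. The function name `partially_synthesize`, the field names `sex` and `earnings`, and the within-group normal model are all assumptions chosen for illustration. The record structure and the unsynthesized variable are preserved; only the sensitive variable is replaced by draws from the fitted model.

```python
import random
import statistics

def partially_synthesize(records, keep_key, synth_key, seed=0):
    """Return copies of records with synth_key replaced by model draws.

    Assumed (hypothetical) model: within each group defined by keep_key,
    synth_key is normal with that group's sample mean and standard deviation.
    """
    rng = random.Random(seed)

    # Collect the confidential values by group of the unsynthesized variable.
    groups = {}
    for rec in records:
        groups.setdefault(rec[keep_key], []).append(rec[synth_key])

    # Fit the per-group model parameters.
    params = {}
    for group, values in groups.items():
        mu = statistics.mean(values)
        sigma = statistics.stdev(values) if len(values) > 1 else 0.0
        params[group] = (mu, sigma)

    # Keep one synthetic record per original record; replace only synth_key.
    synthetic = []
    for rec in records:
        mu, sigma = params[rec[keep_key]]
        new_rec = dict(rec)
        new_rec[synth_key] = rng.gauss(mu, sigma)
        synthetic.append(new_rec)
    return synthetic

# Made-up gold standard records, for illustration only.
gold = [
    {"sex": "F", "earnings": 30000.0},
    {"sex": "F", "earnings": 42000.0},
    {"sex": "M", "earnings": 35000.0},
    {"sex": "M", "earnings": 51000.0},
]
implicate = partially_synthesize(gold, "sex", "earnings", seed=1)
```

Calling the function again with a different `seed` yields another synthetic implicate of the same gold standard file.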
5
Data Confidentiality
The public product should prevent individuals from being re-identified in the current public-use SIPP products
Limit the number of SIPP variables included
Protect survey data, administrative data, and the links between the files
6
Confidentiality Protection
Protection is based on the inability of PUF users to re-identify the SIPP record upon which the PUF record is based
This prevents wholesale addition of SIPP data to the IRS and SSA data in the PUF
Goal: re-identification of SIPP records from the PUF should result in true matches and false matches with equal probability
7
Disclosure Analysis
Uses probabilistic record linking
Each synthetic implicate is matched back to the original file
All unsynthesized variables are used as blocking variables
8
Matching the Files
Two files, A (original confidential data file) and B (synthetic data file), each with over 200,000 records
Blocking criterion (unsynthesized variables)
Matching set of variables
Agreement criterion (M and U probabilities)
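The matching setup can be sketched with the standard Fellegi-Sunter scoring rule. For each matching variable, M is the probability of agreement given a true match and U is the probability of agreement given a non-match. The sketch below is illustrative only. The field names, the M and U values, the threshold, and the use of exact equality on banded values as the agreement test are all assumptions, not the settings used in the talk.

```python
import math
from collections import defaultdict

def agreement_weight(m, u, agrees):
    """Fellegi-Sunter log2 weight for one matching variable."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def link(file_a, file_b, block_keys, match_probs, threshold):
    """Return (index_a, index_b, score) for pairs scoring at or above threshold.

    Only pairs that agree exactly on every blocking variable are compared.
    match_probs maps each matching variable to its (M, U) probabilities.
    """
    # Index file B by the values of the blocking variables.
    blocks = defaultdict(list)
    for j, rec_b in enumerate(file_b):
        blocks[tuple(rec_b[k] for k in block_keys)].append(j)

    links = []
    for i, rec_a in enumerate(file_a):
        key = tuple(rec_a[k] for k in block_keys)
        for j in blocks.get(key, []):
            rec_b = file_b[j]
            score = sum(
                agreement_weight(m, u, rec_a[var] == rec_b[var])
                for var, (m, u) in match_probs.items()
            )
            if score >= threshold:
                links.append((i, j, score))
    return links

# Made-up records: "sex" stands in for an unsynthesized (blocking) variable,
# the two banded variables stand in for synthesized (matching) variables.
file_a = [
    {"sex": "F", "earn_band": 3, "benefit_band": 1},
    {"sex": "F", "earn_band": 5, "benefit_band": 2},
    {"sex": "M", "earn_band": 3, "benefit_band": 1},
]
file_b = [
    {"sex": "F", "earn_band": 3, "benefit_band": 1},
    {"sex": "F", "earn_band": 5, "benefit_band": 1},
    {"sex": "M", "earn_band": 4, "benefit_band": 2},
]
match_probs = {"earn_band": (0.9, 0.1), "benefit_band": (0.8, 0.2)}  # hypothetical M, U
links = link(file_a, file_b, ["sex"], match_probs, threshold=1.0)
```

Blocking keeps the comparison space manageable: with over 200,000 records in each file, comparing every pair would mean tens of billions of comparisons, whereas blocking compares only records that share the same unsynthesized values.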
9
Basic Results
11
Refinements Suggested by the Disclosure Review Board
The ratios of true matches to false matches should be close to 1
The overall count of matches should be reduced
Investigate a method to optimally choose the probabilities for the conditional matching and non-matching agreements
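The first two criteria can be checked mechanically once the linkage has been run, because the agency knows which original record each synthetic record was derived from. The following is a minimal sketch. The function name `match_ratio`, the `source_of` mapping, and the example links are hypothetical and do not reproduce any result from the talk.

```python
def match_ratio(links, source_of):
    """Count true and false matches among declared links.

    links     -- iterable of (index_a, index_b) pairs declared as matches
    source_of -- maps each synthetic record index to the index of the
                 original record it was derived from
    """
    links = list(links)
    true_matches = sum(1 for a, b in links if source_of[b] == a)
    false_matches = len(links) - true_matches
    ratio = true_matches / false_matches if false_matches else float("inf")
    return true_matches, false_matches, ratio

# Made-up example: synthetic record j was derived from original record j.
source_of = {0: 0, 1: 1, 4: 4, 5: 5}
declared = [(0, 0), (1, 1), (2, 5), (3, 4)]
n_true, n_false, ratio = match_ratio(declared, source_of)
```

A ratio near 1 means a declared match is about as likely to be wrong as right, which is the stated protection goal. The total `n_true + n_false` is the overall count of matches that the second criterion asks to reduce.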
12
Conclusion
Confidentiality is an increasing problem for agencies releasing public use data
Linked longitudinal worker-employer data is difficult to protect through usual methods
Probabilistic record linkage technology can be a powerful way to assess when data may be at risk