Assessing Disclosure for a Longitudinal Linked File Sam Hawala – US Census Bureau November 9th, 2005

Assessing Disclosure for a Longitudinal Linked File
Sam Hawala – US Census Bureau
Sam.hawala@census.gov
November 9th, 2005

Outline of the talk
Background of the project
Confidentiality protection
Disclosure analysis
Conclusions

Linked SIPP-SSA-IRS Data
The Longitudinal Employer-Household Dynamics (LEHD) Program created a confidential data set that integrates five SIPP panels (1990, 1991, 1992, 1993, 1996) with earnings records and SSA benefits data
The data are very useful to the disability and retirement research communities
LEHD will provide a public-use file (PUF) version of the integrated microdata using the synthetic data approach

Synthetic Data
Fully-synthetic micro data
  – Uses the population or record linkage structure of the gold standard micro data
  – Generates synthetic entities and data elements from appropriate probability models
Partially-synthetic micro data
  – Preserves the record structure or sampling frame of the gold standard micro data
  – Replaces the data elements with synthetic values sampled from an appropriate probability model
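The partially-synthetic idea can be illustrated with a minimal sketch. The slides do not say which probability model was used, so the normal model fitted within each cell of the unsynthesized variables is purely an assumption for illustration, as are the function and field names.

```python
import random
import statistics


def synthesize_partially(records, keep_fields, synth_field, seed=0):
    """Return copies of `records` in which `synth_field` is replaced by a draw
    from a normal model fitted within each cell of `keep_fields`.

    The record structure and the kept fields are preserved; only the chosen
    data element is replaced. The normal model is an illustrative assumption.
    """
    rng = random.Random(seed)

    # Collect the observed values for each combination of kept fields.
    cells = {}
    for rec in records:
        key = tuple(rec[f] for f in keep_fields)
        cells.setdefault(key, []).append(rec[synth_field])

    synthetic = []
    for rec in records:
        values = cells[tuple(rec[f] for f in keep_fields)]
        mu = statistics.mean(values)
        # A cell with one record has no spread to estimate; a real application
        # would need a better model there, since the draw equals the original.
        sigma = statistics.stdev(values) if len(values) > 1 else 0.0
        new_rec = dict(rec)  # copy, so the gold standard is not modified
        new_rec[synth_field] = rng.gauss(mu, sigma)
        synthetic.append(new_rec)
    return synthetic
```

Calling the function several times with different seeds would give several synthetic copies of the same file, each keeping one row per original record.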

Data Confidentiality
Public product should prevent individuals from being re-identified in the current public use SIPP products
Limit number of SIPP variables included
Protect survey data, administrative data, and the links between the files

Confidentiality Protection
Protection is based on the inability of PUF users to re-identify the SIPP record upon which the PUF record is based
This prevents wholesale addition of SIPP data to the IRS and SSA data in the PUF
Goal: re-identification of SIPP records from the PUF should result in true matches and false matches with equal probability
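One way to read the stated goal is that, among the links an intruder would declare, true and false matches should be about equally frequent, so the ratio of true to false matches is near 1. The sketch below computes that ratio; the function name, the pair format, and the `true_id_of` lookup are hypothetical and only serve the illustration.

```python
def match_summary(links, true_id_of):
    """Count true and false matches among declared links.

    links      : list of (original_id, synthetic_id) pairs declared as matches
    true_id_of : dict mapping each synthetic_id to the original_id it was
                 actually derived from (known only to the data producer)
    """
    true_matches = sum(1 for orig_id, syn_id in links if true_id_of[syn_id] == orig_id)
    false_matches = len(links) - true_matches
    # With no false matches the ratio is unbounded, the worst case for protection.
    ratio = true_matches / false_matches if false_matches else float("inf")
    return {"true": true_matches, "false": false_matches, "ratio": ratio}
```

A ratio well above 1 would mean that a declared link is more likely right than wrong, which is the situation the goal is meant to avoid.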

Disclosure Analysis
Uses probabilistic record linking
Each synthetic implicate is matched back to the original file
All unsynthesized variables are used as blocking variables
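Blocking on the unsynthesized variables means that a synthetic record is only compared with original records that agree exactly on those variables. A minimal sketch of that step, with hypothetical function and field names, might look like this:

```python
from collections import defaultdict


def candidate_pairs(original, synthetic, blocking_fields):
    """Yield (i, j) index pairs, i into `original` and j into `synthetic`,
    for records that agree exactly on every blocking field."""
    # Index the original file by its blocking key.
    blocks = defaultdict(list)
    for i, rec in enumerate(original):
        blocks[tuple(rec[f] for f in blocking_fields)].append(i)

    # Each synthetic record is paired only with originals in the same block.
    for j, rec in enumerate(synthetic):
        key = tuple(rec[f] for f in blocking_fields)
        for i in blocks.get(key, []):
            yield i, j
```

Blocking keeps the number of comparisons far below the full cross product of the two files, which matters when each file has a very large number of records.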

Matching the Files
Two files: A (original confidential data file) and B (synthetic data file), each with over 200,000 records
Blocking criterion (unsynthesized variables)
Matching set of variables
Agreement criterion (M and U probabilities)
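In the standard Fellegi-Sunter formulation of probabilistic record linkage, the M probability of a variable is the chance that it agrees for a true match and the U probability is the chance that it agrees for a non-match. The sketch below scores one candidate pair under that formulation. Exact equality as the agreement rule and the example field names are assumptions; the slides do not state the agreement rules actually used.

```python
import math


def match_weight(rec_a, rec_b, m_probs, u_probs):
    """Sum the log2 agreement/disagreement weights over the matching variables.

    m_probs[field] : P(field agrees | pair is a true match)
    u_probs[field] : P(field agrees | pair is not a match)
    Agreement here is exact equality; synthesized numeric fields would more
    plausibly need a tolerance, which this sketch leaves out.
    """
    total = 0.0
    for field, m in m_probs.items():
        u = u_probs[field]
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total
```

Pairs whose total weight exceeds a chosen cutoff would be declared links; the cutoff itself is a separate decision not covered here.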

Basic Results

Refinements Suggested by the Disclosure Review Board
The ratios of true matches to false matches should be close to 1
The overall count of matches should be reduced
Investigate a method to optimally choose the probabilities for the conditional matching and non-matching agreements
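The slides do not say which method was investigated for choosing the M and U probabilities. One widely used option in the record linkage literature is the EM algorithm under a conditional independence assumption; the sketch below is offered only as an example of that option, not as the approach adopted in this project. All names and starting values are assumptions.

```python
def estimate_m_u(patterns, n_iter=50, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate match proportion p and per-field m and u probabilities by EM.

    patterns : list of tuples of 0/1 agreement indicators, one tuple per
               candidate pair and one entry per matching variable
    """
    n_fields = len(patterns[0])
    m = [m_init] * n_fields
    u = [u_init] * n_fields
    eps = 1e-6  # keeps probabilities and denominators away from 0 and 1

    def clamp(x):
        return min(max(x, eps), 1 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a true match.
        g = []
        for gamma in patterns:
            like_m, like_u = p, 1.0 - p
            for k, agree in enumerate(gamma):
                like_m *= m[k] if agree else 1.0 - m[k]
                like_u *= u[k] if agree else 1.0 - u[k]
            g.append(like_m / (like_m + like_u))

        # M-step: re-estimate parameters from the posterior weights.
        total_g = max(sum(g), eps)
        total_ng = max(len(g) - sum(g), eps)
        p = clamp(sum(g) / len(g))
        for k in range(n_fields):
            agree_m = sum(gi * gamma[k] for gi, gamma in zip(g, patterns))
            agree_u = sum((1 - gi) * gamma[k] for gi, gamma in zip(g, patterns))
            m[k] = clamp(agree_m / total_g)
            u[k] = clamp(agree_u / total_ng)
    return p, m, u
```

The agreement patterns would come from comparing the candidate pairs left after blocking, so the estimates depend on the blocking criterion as well as on the matching variables.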

Conclusion
Confidentiality is an increasing problem for agencies releasing public use data
Linked longitudinal worker-employer data is difficult to protect through usual methods
Probabilistic record linkage technology can be a powerful way to assess when data may be at risk
