INFO 7470/ILRLE 7400 Survey of Income and Program Participation (SIPP) Synthetic Beta File John M. Abowd and Lars Vilhuber April 26, 2011.

Presentation transcript:

INFO 7470/ILRLE 7400 Survey of Income and Program Participation (SIPP) Synthetic Beta File John M. Abowd and Lars Vilhuber April 26, 2011

Elements
Survey data. Issue: (item) non-response
Administrative data. Issue: inexactly linked records
Match-merged and completed complex integrated data. Issue: too much detail leads to disclosure avoidance issues
Public-use data with novel detail
New analysis using public-use data with novel detail. Issue: are the results right?

Survey of Income and Program Participation (SIPP)
Goal of SIPP: accurate information about the income and program participation of individuals and households, and their principal determinants
Information:
– cash and noncash income on a sub-annual basis
– taxes, assets, liabilities
– participation in government transfer programs

Background
In 2001, a new regulation authorized the Census Bureau and SSA to link SIPP and CPS data to SSA and IRS administrative data for research purposes
Idea for a public use file was motivated by a desire to allow outside access to long administrative record histories of earnings and benefits linked to household demographic data
These data allow detailed statistical and simulation study of retirement and disability programs
Census Bureau, Social Security Administration, Internal Revenue Service, and Congressional Budget Office all participated in development

Genesis of SSB
A portion of the SIPP user community was primarily interested in national retirement and disability programs
SIPP augmented with
– earnings histories from the IRS data maintained at SSA (W-2)
– benefit data from SSA's master beneficiary records
Feasibility assessment (confidentiality!) of adding SIPP variables to earnings/benefit data in a public-use file (PUF)
– the set of variables that could be added without compromising the confidentiality protection of the existing SIPP public use files was VERY limited, so alternative methods were explored

SSB basic methodology
Experiment using "synthetic data"
In fact: partially synthetic data with multiple imputation of missing items
"Partially synthetic data":
– some variables (at least one!) are actual responses
– other variables are replaced by values sampled from the posterior predictive distribution for that record, conditional on all of the confidential data

History of SSB
Creation, but not release, of three versions of the "SIPP/SSA/IRS-PUF" (SSB)
2006: Release to limited public access of SSB V4.2
– Access for the general public only at the Cornell-hosted Virtual RDC (SSB server: restricted-access setup), with the promise of evaluation of Virtual RDC-run programs on the internal Gold Standard
– Ongoing SSA evaluation
– Ongoing evaluation at Census (in RDC)
2010: Release of SSB V5 at Census and on the Virtual RDC (see the SSB V5 codebook)

Basic structure of SSB V4
SIPP
– Core set of 125 SIPP variables in a standardized extract of the 1990–1993 and 1996 SIPP panels
– All missing data items (except for structurally missing ones) are marked for imputation
IRS
– maintained at SSA, but derived from IRS records
– Master summary earnings records (SER)
– Master detailed earnings records (DER)

Basic structure of SSB V4 (2)
SSA
– Master Beneficiary Record (MBR)
Census
– Numident: administrative birth and death dates
All files combined using (verified) SSNs => "Gold Standard"

Basic Structure of SSB V5
Panels: 1990, 1991, 1992, 1993, 1996, 2001, and 2004 (the panel variable is now in the SSB)
Couple-level linkage: the first person to whom the SIPP respondent was married during the time period covered by the SIPP panel
SIPP variables only appear in years appropriate for the panel indicated by the PANEL variable (biggest change from V4.2)

Missing values in Gold Standard
Values may be missing due to
– [survey] Non-response
– [survey] Question not being asked in a particular panel
– [admin] Failure to link to administrative record (non-validated SSN)
– [both] Structural missingness (e.g., income of spouse if not married)

Scope of Synthesis
Never missing and not synthesized:
– gender
– marital status
– spouse's gender
– initial type of Social Security benefits
– type of Social Security benefits in 2000
– spouse's benefit type variables
All other variables in the public use file were synthesized.

Common Structure to Multiple Imputation and Synthesis
Hierarchical tree of variable relationships (parent-child relationships, accounting for structure)
At each node, independent SRMI is used (see the sketch after this slide)
– a statistical model (Bayesian bootstrap, logistic regression, or linear regression) is estimated for each of the variables at the same level
– statistical models are estimated separately for groups of individuals
– then, a proper posterior predictive distribution is estimated
– given a PPD, each variable is imputed/synthesized, conditional on all values of all other variables for that record
The next node is then processed
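A minimal sketch of the node-by-node control flow just described. The two-node tree, the stratification variables, and the stand-in normal model are all hypothetical; the production SSB routines fit Bayesian bootstrap, logistic, and linear regression models at each node.

```python
# Hypothetical sketch of processing a small parent-child variable tree,
# group by group, with a trivial normal model standing in for the real
# SRMI models.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7470)

# Fake extract: two stratification variables and two variables to complete,
# with a parent-child relationship (has_earnings -> earnings).
n = 200
data = pd.DataFrame({
    "male":  rng.integers(0, 2, n),
    "black": rng.integers(0, 2, n),
    "has_earnings": rng.integers(0, 2, n).astype(float),
    "earnings": rng.lognormal(10, 1, n),
})
data.loc[data["has_earnings"] == 0, "earnings"] = 0.0       # structural zero
nonresp = (rng.random(n) < 0.1) & (data["has_earnings"] == 1)
data.loc[nonresp, "earnings"] = np.nan                      # item nonresponse

TREE = [["has_earnings"], ["earnings"]]      # parents are processed before children
GROUPS = {                                   # stratification per node variable
    "has_earnings": ["male", "black"],
    "earnings":     ["male", "black", "has_earnings"],  # child conditions on parent
}

def draw_from_ppd(y_obs, size):
    """Stand-in PPD draw: normal model with parameter uncertainty."""
    mu, sd = y_obs.mean(), y_obs.std(ddof=1) + 1e-9
    mu_draw = rng.normal(mu, sd / np.sqrt(len(y_obs)))   # draw the parameter first
    return rng.normal(mu_draw, sd, size)                 # then predictive draws

for node in TREE:                            # process the tree top-down
    for var in node:
        for _, grp in data.groupby(GROUPS[var]):
            miss = grp.index[grp[var].isna()]
            obs = grp[var].dropna().to_numpy()
            if len(miss) and len(obs) > 1:
                data.loc[miss, var] = draw_from_ppd(obs, len(miss))

print(data["earnings"].isna().sum())         # item nonresponse now filled in
```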

MI and Synth
Initial iterations are for missing data imputation, keeping all observed values where available
Final iteration is for data synthesis (replacing all observed values, with the exceptions noted above)

Latest Release of SSB
2010: Release to limited public access of SSB V5

Framework for SIPP/SSA/IRS Synthetic Beta
Link SIPP panels with lifetime earnings and benefit histories from IRS and SSA
Keep a few key variables unchanged
Choose and synthesize other SIPP, IRS, and SSA variables subject to the following requirements:
– List of variables must be long enough to be useful to some group of researchers
– Multivariate relationships across synthesized variables must be analytically valid: shorter lists, less distortion
– Disclosure avoidance challenge: users cannot re-identify source records in the existing SIPP public use file: shorter lists, more distortion

Important Design Decisions
Four variables remain unsynthesized: gender, marital status, benefit status (initial and 2000); the link to the spouse is also not perturbed
Hundreds of variables would be synthesized
Variables chosen for target users: disability and retirement research communities
Use SRMI and Bayesian Bootstrap methods for data synthesis
Create "gold standard" data and compare to synthetic data to assess analytic validity
Try to match back to the public use SIPP to test for disclosure avoidance problems

Create Gold Standard Data
Create a data extract from the SIPP panels conducted in the 1990s and 2000s
– Seven panels: 1990, 1991, 1992, 1993, 1996, 2001, 2004
– Data from core and topical module survey questions
Standardize variables across panels
Link to Summary/Detailed Earnings Records and SSA benefits data
These data are the "truth." Any synthetic data should preserve the characteristics of and relationships among the variables on this file.

SIPP Variables Codebook

Synthetic Data Creation
Purpose of synthetic data is to create microdata that can be used by researchers in the same manner as the original data while preserving the confidentiality of respondents' identities
Fundamental trade-off: usefulness and analytic validity of the data versus protection from disclosure
Our goal: no one should be re-identifiable in the already released SIPP public use files, while regression results are still preserved

Multiple Imputation Confidentiality Protection
Denote confidential data by Y and non-confidential data by X. Y contains missing data, so that Y = (Y_obs, Y_mis), and X has no missing data.
Use the posterior predictive distribution (PPD) p(Y_mis | Y_obs, X) to complete missing data and p(Y | Y^m, X) to create synthetic data
Data synthesis is the same procedure as missing data imputation, just done for all observations
Major emphasis is on finding a good estimate of the PPD
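In symbols, and only as a reconstruction of what the slide states in words:

```latex
% Y confidential (with missing items), X non-confidential and fully observed
Y = (Y_{\mathrm{obs}}, Y_{\mathrm{mis}})

% Completing the missing items (repeated to create m completed implicates)
Y_{\mathrm{mis}} \sim p(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}, X)

% Synthesis: with Y^{m} a completed implicate, draw replacement values
\tilde{Y} \sim p(Y \mid Y^{m}, X)
```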

Testing Analytical Validity
Run regressions on each synthetic implicate
– Average the coefficients
– Combine standard errors using formulae that take account of the average variance of the estimates (within-implicate variance) and the variation in the estimates across implicates (between-implicate variance)
Run regressions on the gold standard data
Compare the average synthetic coefficient and standard error to the gold standard coefficient and standard error
Data are analytically valid if the coefficient is unbiased and the same inferences are drawn

Formulae: Completed Data Only
Notation
– ℓ is the index for missing data implicates
– m is the total number of missing data implicates
Estimate from one completed implicate
Average of the statistic across implicates

Formulae: Total Variance / Between Variance
Between variance: variation due to differences between implicates
Total variance of the average statistic
Variance of the statistic across implicates: between variance

Formulae: Within Variance
Variation due to differences within each implicate
Variance of the statistic from each completed implicate
Average variance of the statistic: within variance
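The equations on these three slides were displayed as images and did not survive the transcript. A reconstruction consistent with the captions, using the standard multiple-imputation combining rules for completed data, is:

```latex
% q^{(l)} : the statistic estimated on completed implicate l = 1, ..., m
% u^{(l)} : its estimated sampling variance on that implicate
\bar{q}_m = \frac{1}{m}\sum_{\ell=1}^{m} q^{(\ell)}
\quad\text{(average of the statistic across implicates)}

b_m = \frac{1}{m-1}\sum_{\ell=1}^{m}\bigl(q^{(\ell)} - \bar{q}_m\bigr)^{2}
\quad\text{(between variance)}

\bar{u}_m = \frac{1}{m}\sum_{\ell=1}^{m} u^{(\ell)}
\quad\text{(within variance)}

T_m = \bar{u}_m + \Bigl(1 + \tfrac{1}{m}\Bigr) b_m
\quad\text{(total variance of the average statistic)}
```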

Formulae: Synthetic and Completed Implicates
Notation
– ℓ is the index for missing data implicates
– k is the index for synthetic data implicates
– m is the total number of missing data implicates
– r is the total number of synthetic implicates per missing data implicate
Estimate from one synthetic implicate
Average of the statistic across synthetic implicates

Formulae: Grand Mean and Total Variance
Average of the statistic across all implicates
Total variance of the average statistic

Formulae: Between Variance
Variation due to differences between implicates
Variance of the statistic across missing data implicates: between-m implicate variance
Variance of the statistic across synthetic data implicates: between-r implicate variance

Formulae: Within Variance
Variation due to differences within each implicate
Variance of the statistic on each implicate
Average variance of the statistic: within variance
Source: Reiter, Survey Methodology (2004)
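As above, the displayed equations are missing from the transcript. A reconstruction of the combining rules for nested (completed-then-synthetic) implicates, following Reiter (Survey Methodology, 2004) as cited on the slide; the exact expressions and the degrees of freedom for interval estimation should be checked against that paper:

```latex
% q^{(l,k)} : the statistic from synthetic implicate k = 1, ..., r nested in
%             completed implicate l = 1, ..., m;  u^{(l,k)} : its variance
\bar{q}^{(\ell)} = \frac{1}{r}\sum_{k=1}^{r} q^{(\ell,k)},
\qquad
\bar{q}_M = \frac{1}{m}\sum_{\ell=1}^{m} \bar{q}^{(\ell)}
\quad\text{(grand mean)}

b_m = \frac{1}{m-1}\sum_{\ell=1}^{m}\bigl(\bar{q}^{(\ell)} - \bar{q}_M\bigr)^{2}
\quad\text{(between-$m$ implicate variance)}

\bar{w} = \frac{1}{m(r-1)}\sum_{\ell=1}^{m}\sum_{k=1}^{r}
          \bigl(q^{(\ell,k)} - \bar{q}^{(\ell)}\bigr)^{2}
\quad\text{(between-$r$, within-$\ell$ variance)}

\bar{u} = \frac{1}{mr}\sum_{\ell=1}^{m}\sum_{k=1}^{r} u^{(\ell,k)}
\quad\text{(within variance)}

T = \Bigl(1 + \tfrac{1}{m}\Bigr) b_m - \frac{\bar{w}}{r} + \bar{u}
\quad\text{(total variance; can be negative in small samples, see Reiter 2004)}
```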

Example: Average AIME/AMW
Estimate the average on each of the synthetic implicates
– AvgAIME(1,1), AvgAIME(1,2), AvgAIME(1,3), AvgAIME(1,4)
– AvgAIME(2,1), AvgAIME(2,2), AvgAIME(2,3), AvgAIME(2,4)
– AvgAIME(3,1), AvgAIME(3,2), AvgAIME(3,3), AvgAIME(3,4)
– AvgAIME(4,1), AvgAIME(4,2), AvgAIME(4,3), AvgAIME(4,4)
Estimate the mean for each set of synthetic implicates that corresponds to one completed implicate
– AvgAIMEAVG(1), AvgAIMEAVG(2), AvgAIMEAVG(3), AvgAIMEAVG(4)
Estimate the grand mean of all implicates
– AvgAIMEGRANDAVG

Example (cont.)
Between-m implicate variance
Between-r implicate variance

Example (cont.)
Variance of the mean from each implicate
– VAR[AvgAIME(1,1)], VAR[AvgAIME(1,2)], VAR[AvgAIME(1,3)], VAR[AvgAIME(1,4)]
– VAR[AvgAIME(2,1)], VAR[AvgAIME(2,2)], VAR[AvgAIME(2,3)], VAR[AvgAIME(2,4)]
– VAR[AvgAIME(3,1)], VAR[AvgAIME(3,2)], VAR[AvgAIME(3,3)], VAR[AvgAIME(3,4)]
– VAR[AvgAIME(4,1)], VAR[AvgAIME(4,2)], VAR[AvgAIME(4,3)], VAR[AvgAIME(4,4)]
Within variance

Example (cont.)
Total variance
Use AvgAIMEGRANDAVG and the total variance to calculate confidence intervals and compare to the estimate from the completed data
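A compact sketch of this calculation, with made-up numbers standing in for the confidential AIME estimates. The 4 x 4 layout (four completed implicates, four synthetic implicates each) and the combining rules follow the slides and the Reiter (2004) reconstruction above; a normal reference distribution is used for the interval purely for brevity.

```python
# Hypothetical AvgAIME(l,k) estimates and their variances for m = 4 completed
# implicates, each with r = 4 synthetic implicates.
import numpy as np

m, r = 4, 4
rng = np.random.default_rng(0)
avg_aime = 3000 + rng.normal(0, 25, size=(m, r))   # point estimate per implicate
var_aime = np.full((m, r), 40.0**2 / 5000)         # estimated variance of each estimate

q_bar_l = avg_aime.mean(axis=1)          # AvgAIMEAVG(l): mean within each completed implicate
q_grand = avg_aime.mean()                # AvgAIMEGRANDAVG

b_m = q_bar_l.var(ddof=1)                                         # between-m implicate variance
w_bar = ((avg_aime - q_bar_l[:, None])**2).sum() / (m * (r - 1))  # between-r, within-l variance
u_bar = var_aime.mean()                                           # within variance

T = (1 + 1/m) * b_m - w_bar / r + u_bar  # total variance (can be negative; see Reiter 2004)
se = np.sqrt(max(T, 0))

# 95% confidence interval with a normal reference distribution for simplicity.
lo, hi = q_grand - 1.96 * se, q_grand + 1.96 * se
print(f"grand mean = {q_grand:.1f}, total variance = {T:.2f}, 95% CI = ({lo:.1f}, {hi:.1f})")
```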

SAS Programs
Sample programs to calculate the total variance and confidence intervals

Results: Average AIME

Public Use of the SIPP Synthetic Beta
Full version (16 implicates) released to the Cornell Virtual RDC
Any researcher may use these data
During the testing phase, all analyses must be performed on the Virtual RDC
Census Bureau research team will run the same analysis on the completed confidential data
Results of the comparison will be released to the researcher, Census Bureau, SSA, and IRS (after traditional disclosure avoidance analysis of the runs on the confidential data)

Methods for Estimating the PPD
Sequential Regression Multivariate Imputation (SRMI) is a parametric method that defines the PPD through a sequence of conditional regression models (details on the next slides)
The Bayesian Bootstrap (BB) is a non-parametric method of taking draws from the posterior predictive distribution of a group of variables that allows for uncertainty in the sample CDF
We use BB for a few groups of variables with particularly complex relationships and use SRMI for all other variables

SRMI Method Details
Assume a joint density p(Y, X, θ) that defines parametric relationships between all observed variables
Approximate the joint density by a sequence of conditional densities defined by generalized linear models
Same process for completing and synthesizing data
Synthetic values of each variable are draws from its conditional density given the completed data Y^m, X^m, where each density p_k is defined by an appropriate generalized linear model and prior (see the reconstruction below)
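The displayed formula is missing from the transcript; a plausible reconstruction of the sequential-conditional form the slide describes is:

```latex
% Joint density approximated by a sequence of conditional (GLM) densities
p(Y \mid X, \theta) \approx \prod_{k} p_k\!\left(Y_k \mid Y_1,\dots,Y_{k-1}, X, \theta_k\right)

% Synthetic values of Y_k are draws from the conditional PPD given the
% completed data (Y^m, X^m)
\tilde{Y}_k \sim p_k\!\left(Y_k \mid Y^{m}_{-k}, X^{m}, \theta_k\right)
```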

SRMI Details: KDE Transforms
The SRMI models for continuous variables assume that they are conditionally normal
This assumption is relaxed by performing a KDE-based transform of groups of related variables
All variables in the group are transformed to normality, then the PPD is estimated
The sampled values from the PPD are inverse-transformed back to the original distribution using the inverse cumulative distribution
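A minimal one-variable sketch of such a KDE-based normal-scores transform and its inverse, using `scipy.stats.gaussian_kde`. The production procedure transforms groups of related variables jointly, which this toy version does not attempt; the data and the stand-in "PPD draw" are made up.

```python
# KDE-based transform to normality and back, for one skewed variable.
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(1)
y = rng.lognormal(mean=10, sigma=0.8, size=5_000)   # skewed "earnings-like" variable

# Smoothed CDF from a kernel density estimate, evaluated on a grid.
kde = gaussian_kde(y)
grid = np.linspace(y.min() - 3 * y.std(), y.max() + 3 * y.std(), 2_000)
cdf = np.cumsum(kde(grid))
cdf = np.clip(cdf / cdf[-1], 1e-6, 1 - 1e-6)        # normalize, keep probit finite

def to_normal(x):
    """Transform data to (approximately) standard normal scores."""
    return norm.ppf(np.interp(x, grid, cdf))

def from_normal(z):
    """Inverse transform: normal scores back to the original scale."""
    return np.interp(norm.cdf(z), cdf, grid)

z = to_normal(y)                            # model and draw on this scale
z_syn = z + rng.normal(0, 0.1, z.shape)     # stand-in for a PPD draw on the normal scale
y_syn = from_normal(z_syn)                  # inverse-transform synthetic values back

print(round(z.mean(), 2), round(z.std(), 2))   # roughly 0 and 1
```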

SRMI Example: Synthesizing Date of Birth
Divide individuals into homogeneous groups using stratification variables
– example: male, black, age categories, education categories, marital status
– example: decile of lifetime earnings distribution, decile of lifetime years worked distribution, worked previous year, worked current year
For each group, estimate an independent linear regression of date of birth on other variables (not used for stratification) that are strongly related

SRMI Example: Synthesizing Date of Birth (cont.)
Synthetic date of birth is a random variable
Before analysis, it is transformed to normal using the KDE-based procedure
Distribution has two sources of variation:
– variation in the error term of the regression model
– variation in the estimated parameters: the β's and σ²
Synthetic values are draws from this distribution
Synthetic values are inverse-transformed back to the original distribution using the inverse cumulative distribution.
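A sketch of the within-group draw for one hypothetical stratification cell, assuming date of birth has already been transformed to approximate normality. The predictors, sample size, and the flat-prior normal linear model are illustrative, not the production specification; the point is that both parameter draws (β, σ²) and the regression error term enter the synthetic values.

```python
# Posterior draws of (beta, sigma^2) plus regression noise for one group.
import numpy as np

rng = np.random.default_rng(2)

# Fake data for one homogeneous group (e.g., one gender x race x education cell).
n = 500
X = np.column_stack([np.ones(n),
                     rng.normal(size=n),     # e.g., scaled lifetime-earnings decile
                     rng.normal(size=n)])    # e.g., scaled years-worked decile
beta_true = np.array([0.0, 0.4, -0.2])
y = X @ beta_true + rng.normal(0, 1.0, n)    # transformed date of birth

# OLS fit within the group.
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - k)

# Parameter draws under the standard noninformative prior:
# sigma^2 from a scaled inverse chi-square, then beta | sigma^2 normal.
sigma2_draw = (n - k) * s2 / rng.chisquare(n - k)
beta_draw = rng.multivariate_normal(beta_hat, sigma2_draw * XtX_inv)

# Synthetic values: parameter uncertainty + regression error term.
y_syn = X @ beta_draw + rng.normal(0, np.sqrt(sigma2_draw), n)
# (These values would then be inverse-transformed back to dates.)
print(y_syn[:5])
```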

Bayesian Bootstrap Method Details
Divide data into homogeneous groups using similar stratification variables as in SRMI
Within groups, do a Bayesian bootstrap of all variables to be synthesized at the same time
– with n observations in a group, draw n-1 random variables from the uniform (0,1) distribution
– let 0 = u_0 < u_1 < ... < u_(n-1) < u_n = 1 be the ordered draws (with endpoints added)
– u_i - u_(i-1) is the probability of sampling observation i from the group to replace missing data or synthesize data in observation j
– in a conventional bootstrap, the probability of sampling each observation is 1/n
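A minimal sketch of the within-group Bayesian bootstrap just described; the donor matrix and group are made up, and `bayesian_bootstrap_draws` is a hypothetical helper, not SSB production code.

```python
# Gaps between ordered uniform draws give the probability of selecting each
# donor observation (versus 1/n in the conventional bootstrap).
import numpy as np

rng = np.random.default_rng(3)

def bayesian_bootstrap_draws(donors, n_draws, rng):
    """Draw n_draws donor rows with Bayesian-bootstrap probabilities."""
    n = len(donors)
    u = np.sort(rng.uniform(size=n - 1))        # n-1 ordered uniforms
    u = np.concatenate(([0.0], u, [1.0]))       # add endpoints u_0 = 0, u_n = 1
    probs = np.diff(u)                          # u_i - u_(i-1): one probability per donor
    idx = rng.choice(n, size=n_draws, p=probs)  # sample donors with those probabilities
    return donors[idx]

# Example: synthesize a block of 3 variables at once for 10 records in a
# group of 50 donors (all columns of a donor record are taken together).
donors = rng.normal(size=(50, 3))
synthetic = bayesian_bootstrap_draws(donors, 10, rng)
print(synthetic.shape)                          # (10, 3)
```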

Creating Synthetic Data
Begin with a base data set that contains only non-missing values
Use BB to complete missing administrative data
– i.e., find a donor SSN based on non-missing SIPP variables
Use SRMI to complete missing SIPP data
Iterate multiple times
– input for iteration 2 is the completed data set from iteration 1
On the last iteration, run 4 separate processes to create 4 separate completed data sets or implicates

Creating Synthetic Data (cont.)
Synthesis is like one more iteration of data completion, except all observations are treated as missing
Each completed implicate serves as a separate input file
Run 16 separate processes to create 16 different synthetic data sets or implicates
The separate processes that create the implicates have different stratification variables
Need enough implicates to produce enough variation to ensure that averages across the implicates will be close to the "truth"

Features of Our Synthesizing Routines
Parent-child relationships
– foreign-born and decade of arrival in the US
– welfare participation and welfare amount
– presence of earnings and amount of earnings
Restrictions on draws from the PPD
– Some draws must be within a pre-specified range of the original value: for example, the MBA is +/- $50 of the original value
– Impose maximum and minimum values on some variables