An Introduction to the SIPP Synthetic Beta Holly Monti and Lori Reeder University of Michigan August 8, 2016 1

Outline  Background  Multiple Imputation Approach  How to use the SIPP Synthetic Beta (SSB)  SSB Research  Future Plans  SSB data sources (if time permits) 2

SSB Background  February 2001: Federal Regulation published authorizing sharing of data items from W-2 tax forms for the purpose of improving survey products  More demand to access survey data combined with administrative records  Meyer, Mok, & Sullivan 2015, Household Surveys in Crisis  Card, Chetty, Feldstein, & Saez 2012, Expanding Access to Administrative Data for Research in the United States  Can we make such a product easier to access? 3

SSB Background  Long histories of earnings and benefits data from IRS and SSA for all SIPP respondents with approval to link  Several SIPP panels: currently 1984, 1990, 1991, 1992, 1993, 1996, 2001, 2004, and 2008  Stack SIPP panels and link to administrative records. Named this the SIPP Gold Standard File (GSF) 4

Data Sources: Census Bureau  SIPP panels  1984, 1990, 1991, 1992, 1993, 1996, 2001, 2004, 2008  Crosswalks  Links SIPP identifiers to PIK (privatized SSN) by panel  Not all individuals link  Match/PIK rate by SIPP panel:  1984 = 75%  1990s = 78%-84%  2001 = 47%  2004 = 72%  2008 = 82%  2014 = 89% 5

Data Sources: Administrative Records  Summary Earnings Record (SER) 1951-2011  Annual total FICA covered earnings  Annual total covered quarters (i.e. credits needed to qualify for benefits)  Detailed Earnings Record Extract (DER) 1978-2011  Uncapped earnings from the W-2, Form 1040 schedule C and SE  SSB variables: sum total earnings across all jobs  Annual total FICA wages from the DER  Annual total non-FICA wages from the DER  Annual total deferred FICA wages from the DER  Annual total deferred non-FICA wages from the DER 6

Data Sources: Administrative Records  Master Beneficiary Record (MBR) 1962-2012  Old Age, Survivor, Disability Insurance (OASDI) program  MBR variables on the SSB  Retirement, widow/spouse, and aged spouse benefits  Whether a benefit was received  Start date of benefit; Total monthly amount of benefit  Payment History Update System (PHUS) 1984-2012  Records actual payments made to beneficiaries  SSB variables:  Dummy variable for receipt  Benefit start date  Total amount of benefit 7

Data Sources: Administrative Records  Social Security Disability Insurance (SSDI)  From MBR file  Date of disability onset  Disability adjudication date  Date of entitlement to disability  Date of disability benefits cessation  Total benefit amount  Disability diagnosis code  Up to 4 disability applications: always kept first ever, last ever, first in SIPP panel, last in SIPP panel; if more applications than these, delete older rejections, then older accepts. 8

Data Sources: Administrative Records  Supplemental Security Record (SSR): 1974-2012  Supplemental Security Income (SSI) program  Slightly less complicated file with only 6500+ variables  SSB variables:  Dummies for applied, received, or ceased receiving benefits  Application date  Amount received  Type of benefit  First/last payments  Diagnosis type 9

Data Sources Visualization 10

How to Make This Product Accessible?  Limited number of internal users, and can be difficult for external users to get RDC access especially for projects with IRS data.  Decision to pursue a new SIPP public use file that contains the administrative data linked to the survey data  Challenge: Confidentiality  Solution: Data Synthesis 11

Data Synthesis  Underlying data presents two problems  Missing Values  Disclosure Risk  We attempt to solve both problems with multiple imputation  Estimate joint distribution of all variables  Take multiple draws from estimated distribution to fill in missing values resulting in multiple, completed data sets (completed implicates)  Take multiple draws from estimated distribution to replace sensitive values resulting in multiple, synthetic data sets (synthetic implicates) 12

Why Multiple Imputation?  Easy for analyst to use  On any given implicate, analyst can use any estimation technique that one would use on a dataset with no missing data  Data user only needs to know how to generate the estimand of interest and its measure of uncertainty on a single dataset, and then plugs the multiple estimates into a simple formula (which we provide)  Variance estimation in multiple imputation setting works with almost any conceivable analysis (is not tailored to a specific estimand)  Provides a very natural way of uncovering true variance 13
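
The "simple formula" itself is not reproduced in this transcript. As a minimal sketch, the block below applies the standard combining rules for completed (missing-data-only) implicates, often called Rubin's rules: average the point estimates, then add the average within-implicate variance to an inflated between-implicate variance. The function name and the numbers are made up for illustration, and the rule for synthetic implicates differs, so the SSB technical documentation remains the reference for the formula actually provided.

```python
import numpy as np

def combine_completed(estimates, variances):
    """Combine per-implicate results for completed (not synthetic) data.

    estimates: one point estimate per implicate
    variances: the matching squared standard errors
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()        # combined point estimate
    u_bar = u.mean()        # average within-implicate variance
    b = q.var(ddof=1)       # between-implicate variance
    total = u_bar + (1 + 1 / m) * b
    return q_bar, total

# Made-up numbers, for illustration only.
point, total_var = combine_completed([1.0, 1.2, 0.8, 1.0],
                                     [0.04, 0.05, 0.03, 0.04])
print(point, total_var ** 0.5)
```

The analyst's own estimation code runs unchanged on each implicate; only this last combining step is specific to multiple imputation.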

Multiple Imputation Visual  Start with the GSF  Could handle missing data by simply dropping records with missing values: list-wise deletion  Potentially skewing sample and throwing away a lot of non-missing data 14

Multiple Imputation Visual  Single imputation of missing data  If user treats imputations the same as non-missing data, too confident in results: variance biased down  No obvious way to correct variance 15

Multiple Imputation Visual  Multiple imputations  Only difference in estimates between implicates comes from different imputations, thus revealing the uncertainty of the imputation  This variation allows user to correctly adjust variance and still use all non-missing data 16

Multiple Imputation Visual  Synthesize each completed implicate once  Two sources of uncertainty in variation between synthetic implicates: missing data imputation and synthesis  Cannot distinguish between the two, and estimates will overstate the variance (Reiter) 17

Multiple Imputation Visual  Multiple synthetic implicates for each completed implicate  Can measure uncertainty from synthesis AND uncertainty from missing data imputation  Adjust variance without overstating uncertainty from missing data imputation 18
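
The two-level design described above can be combined along the following lines. This is a sketch of the nested combining rule as given in Reiter (2004) for m completed implicates with r synthetic implicates each, written from memory and not taken from this presentation; the function name and the example numbers are hypothetical, and the formula should be checked against the SSB technical documentation before use.

```python
import numpy as np

def combine_nested(q, u):
    """Combine results from m completed x r synthetic implicates.

    q, u: arrays of shape (m, r) holding the point estimate and its
    squared standard error from each synthetic implicate.
    """
    q = np.asarray(q, dtype=float)
    u = np.asarray(u, dtype=float)
    m, r = q.shape
    q_i = q.mean(axis=1)               # mean within each completed implicate
    q_bar = q_i.mean()                 # overall point estimate
    b = q_i.var(ddof=1)                # between completed implicates (missing data)
    w = q.var(axis=1, ddof=1).mean()   # within completed implicates (synthesis)
    u_bar = u.mean()                   # average sampling variance
    # Assumed rule (Reiter 2004): T = (1 + 1/m) * b - w / r + u_bar
    total = (1 + 1 / m) * b - w / r + u_bar
    return q_bar, total

# Made-up 2 x 2 example, for illustration only.
est, tot = combine_nested([[1.0, 1.2], [1.4, 1.6]],
                          [[0.05, 0.05], [0.05, 0.05]])
print(est, tot)
```

Because the synthesis term enters with a minus sign, the total can come out negative in small examples; that is a known property of the rule rather than a coding error.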

Release History  Version 4: Spring 2007; 1990, 1991, 1992, 1993, and 1996 panels; people age 15+; gender, spouse link, OASDI benefit not synthesized  Version 5: Fall 2010; add 2001, 2004 panels  Version 5.1: May 2013; corrections for error in underlying earnings data; add SIPP arrays, state; synthesize OASDI benefit  Version 6.0: March 2015; added 1984, 2008 panels; all individuals regardless of age included; added disability application history variables and SSI history; base weight added 19

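Estimating the joint distribution  [Figure: grid with columns Var1 to Var4 in which each successive row has one more observed value (X), forming a staircase pattern] 20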

Missing Data Pattern Usually Not Monotone  [Figure: grid with columns Var1 to Var4 in which observed values (X) are scattered irregularly across rows] 21

Sequential Regression Multivariate Imputation (SRMI) 22

SRMI Visual for a Single Record  [Figure: animated grid of iterations 1 to 3 by Var1 to Var4; each iteration row starts with two values (X), and the remaining cells are filled in one at a time, first across iteration 1, then iteration 2, then iteration 3, until every cell holds a value] 23
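
A rough sketch of the cycling idea behind SRMI follows. It is a simplification under stated assumptions, not the procedure used to build the SSB: every variable is treated as continuous, each conditional model is an ordinary least squares regression with normal noise, and regression coefficients are not redrawn from a posterior. The function name `srmi` and the simulated data are hypothetical.

```python
import numpy as np

def srmi(data, n_iter=5, seed=0):
    """Fill NaNs by cycling through the columns, regressing each on the rest."""
    rng = np.random.default_rng(seed)
    x = np.array(data, dtype=float)
    missing = np.isnan(x)
    # Start from column means so every regression has complete predictors.
    col_means = np.nanmean(x, axis=0)
    x[missing] = np.take(col_means, np.where(missing)[1])
    n, k = x.shape
    for _ in range(n_iter):
        for j in range(k):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            # Predictors are all other columns at their current filled-in values.
            design = np.column_stack([np.ones(n), np.delete(x, j, axis=1)])
            beta, *_ = np.linalg.lstsq(design[~miss_j], x[~miss_j, j], rcond=None)
            resid = x[~miss_j, j] - design[~miss_j] @ beta
            sigma = resid.std(ddof=design.shape[1])
            # Replace only the originally missing cells with a fresh draw.
            x[miss_j, j] = design[miss_j] @ beta + rng.normal(0.0, sigma, miss_j.sum())
    return x

# Simulated data with a non-monotone missing pattern (not SSB data).
rng = np.random.default_rng(1)
full = rng.normal(size=(200, 4))
full[:, 1] += 0.8 * full[:, 0]
holes = full.copy()
holes[rng.random(full.shape) < 0.2] = np.nan
completed = srmi(holes)
```

Calling `srmi` again with a different `seed` would give another completed implicate from the same incomplete data.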

Synthesize Data 24

Start with a completed implicate  [Figure: five copies of a fully populated grid with columns Var1 to Var4, one per step]  Estimate p(x1) & impute  Estimate p(x2|x1) & impute  Estimate p(x3|x1,x2) & impute  Estimate p(x4|x1,…,x3) & impute 25
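
The chain p(x1), p(x2|x1), p(x3|x1,x2), ... can be sketched as below. The models are assumptions made for illustration: a normal distribution for x1 and linear regressions with normal noise for the rest, with each synthetic draw conditioning on the columns already synthesized. The slides do not say which models or conditioning values the SSB uses, and `synthesize` is a hypothetical name.

```python
import numpy as np

def synthesize(completed, seed=0):
    """Replace every column of a completed data set with model-based draws."""
    rng = np.random.default_rng(seed)
    x = np.asarray(completed, dtype=float)
    n, k = x.shape
    synth = np.empty_like(x)
    # Estimate p(x1) as a normal distribution and draw from it.
    synth[:, 0] = rng.normal(x[:, 0].mean(), x[:, 0].std(ddof=1), n)
    for j in range(1, k):
        # Estimate p(xj | x1, ..., x(j-1)) on the completed data.
        design = np.column_stack([np.ones(n), x[:, :j]])
        beta, *_ = np.linalg.lstsq(design, x[:, j], rcond=None)
        sigma = (x[:, j] - design @ beta).std(ddof=j + 1)
        # Draw xj given the columns that have already been synthesized.
        synth_design = np.column_stack([np.ones(n), synth[:, :j]])
        synth[:, j] = synth_design @ beta + rng.normal(0.0, sigma, n)
    return synth

# Simulated completed implicate (not SSB data).
rng = np.random.default_rng(2)
data = rng.normal(size=(2000, 4))
data[:, 1] += 0.8 * data[:, 0]
synthetic = synthesize(data)
```

No record in `synthetic` is a copy of a record in `data`, yet relationships such as the correlation between the first two columns are approximately carried over.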

Using the SSB: Getting Started  Go to: http://www.census.gov/programs-surveys/sipp/guidance/sipp-synthetic-beta-data-product.html  Codebook  Technical Documentation  Application  Submit application  Synthetic Data Server  We support code written in SAS and Stata  Other software is available, and may be used with the understanding that less support is offered 26

Using the SSB: Requesting Validation on Internal Data  Census will run your programs on the confidential data and release these results to you  Summary of steps  Write and test code on synthetic data  Prepare memo describing programs and output  Email memo to Census SSB staff and request validation and disclosure review  Can take as little as 2-3 weeks  Only works as well as your code 27

Topics that can be studied  Wage Inequality  Retirement patterns  Disability applications  Immigrant outcomes  Education outcomes  Fertility history questions  Spouse behavior 28

SSB Users  125 SSB users since 2007; 40 new users in the last year  Several dissertations  Michael Carr and Emily Wiemers received a Russell Sage Foundation grant for work with SSB on earnings instability and variability  2016 LERA/ASSA Session Title: Data Gold! Exploiting the rich research potential of lifetime earnings and Social Security benefits data from the U.S. Census Bureau’s SIPP Synthetic Beta and Gold Standard files 29

SSB 2016 LERA/ASSA Session  Rutledge, Wu, & Vitagliano  Studies whether catch-up provision tax incentives effectively increased earnings deferrals.  Shore-Sheppard  Studies patterns of earnings prior to birth of first child and compares these patterns across levels of parental educational attainment.  Carr & Wiemers  Study trends in earnings volatility by educational attainment and gender over the last thirty-five years.  Wicks-Lim  Studies prevalence and duration of low-wage careers across a changing labor market. 30

Bertrand, Kamenica, & Pan (QJE 2015) 31

Future Approach  Issues  Hard to add new variables to SSB  SSB Users have no choice over missing data approach  Solution: Synthesize incomplete data directly  Missing data pattern is synthesized  User can choose how to handle missing data  Faster to add new variables to synthetic file  Fewer implicates in total (r synthetic implicates, instead of r synthetic implicates for each of m completed implicates) 32

SSB Workshop Project  Mincer Earnings Regression  Select Sample of working age males from 1996 panel for whom we are likely to have observed full education  Table of descriptive statistics on sample  Simple Mincer Earnings Regression  Add year fixed effects 33
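
For reference, a common textbook form of the Mincer earnings regression with year fixed effects is written out below. The slides do not give the workshop's exact specification, so the symbols are assumptions: w is annual earnings for person i in year t, S is years of schooling, X is potential labor market experience, and the gamma terms are the year effects.

```latex
\ln w_{it} = \beta_0 + \beta_1 S_i + \beta_2 X_{it} + \beta_3 X_{it}^2 + \gamma_t + \varepsilon_{it}
```

In this form the coefficient on schooling is usually read as the approximate proportional return to one more year of education.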

SSB Workshop Project  Start with one implicate  Once working properly, loop over all 16 implicates storing results for each  Combine results to get proper point and variance estimates for estimands of interest  Test combination code for completed data on subset of 4 synthetic implicates 34
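
The loop-and-store step might look like the sketch below. Everything in it is a stand-in: `fake_implicate` simulates data in place of reading a real implicate file, the 4 completed by 4 synthetic layout is an assumption about how the 16 implicates are organized, and the regression omits the year fixed effects to stay short. The workshop itself would use SAS or Stata on the Synthetic Data Server.

```python
import numpy as np

def ols(y, design):
    """Return OLS coefficients and their estimated variances."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = design.shape[0] - design.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(design.T @ design)
    return beta, np.diag(cov)

def fake_implicate(m_idx, r_idx, n=500):
    """Stand-in for loading one implicate; simulated data, not SSB."""
    rng = np.random.default_rng(100 * m_idx + r_idx)
    educ = rng.integers(8, 21, n).astype(float)
    exper = rng.uniform(0, 40, n)
    log_earn = (1.0 + 0.08 * educ + 0.04 * exper - 0.0007 * exper**2
                + rng.normal(0, 0.5, n))
    return log_earn, np.column_stack([np.ones(n), educ, exper, exper**2])

M, R = 4, 4                     # assumed: 4 completed x 4 synthetic implicates
coef = np.empty((M, R))         # schooling coefficient from each implicate
var = np.empty((M, R))          # its squared standard error
for i in range(M):
    for j in range(R):
        y, design = fake_implicate(i, j)
        beta, beta_var = ols(y, design)
        coef[i, j] = beta[1]
        var[i, j] = beta_var[1]

print(coef.mean())              # combined point estimate across all implicates
```

Keeping the results in arrays indexed by completed and synthetic implicate leaves them ready for whichever variance combining formula is applied afterwards.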
