

INFO 7470/ECON 7400
Synthetic Data Creation and Use
John M. Abowd and Lars Vilhuber, with a big assist from Abigail Cooke, Javier Miranda, Martha Stinson, and Kelly Trageser
April 29, 2013

Outline
– SIPP Synthetic Data
– LBD Synthetic Data
4/29/2013 © John M. Abowd and Lars Vilhuber 2013, all rights reserved

SURVEY OF INCOME AND PROGRAM PARTICIPATION (SIPP) SYNTHETIC DATA

Survey of Income and Program Participation (SIPP)
– Goal of SIPP: accurate information about the income and program participation of individuals and households, and its principal determinants
– Information:
    – Cash and noncash income on a sub-annual basis
    – Taxes, assets, liabilities
    – Participation in government transfer programs

Background
– In 2001, a new regulation authorized the Census Bureau and SSA to link SIPP and CPS data to SSA and IRS administrative data for research purposes
– The idea for a public use file was motivated by a desire to allow outside access to long administrative record histories of earnings and benefits linked to household demographic data
– These data allow detailed statistical and simulation study of retirement and disability programs
– Census Bureau, Social Security Administration, Internal Revenue Service, and Congressional Budget Office all participated in development

Genesis of the SSB
– A portion of the SIPP user community was primarily interested in national retirement and disability programs
– SIPP augmented with
    – earnings histories from the IRS data maintained at SSA (W-2)
    – benefit data from SSA’s master beneficiary records
– Feasibility assessment (confidentiality!) of adding SIPP variables to earnings/benefit data in a public-use file (PUF)
    – the set of variables that could be added without compromising the confidentiality protection of the existing SIPP public use files was VERY limited
– Alternative methods explored

SSB Basic Methodology
– Experiment using “synthetic data”
– In fact: partially synthetic data with multiple imputation of missing items
– Partially synthetic data:
    – Some (at least one) variables are actual responses
    – Other variables are replaced by values sampled from the posterior predictive distribution for that record, conditional on all of the confidential data

History of the SSB
– Creation, but not release, of three versions of the “SIPP/SSA/IRS-PUF” (SSB)
– 2006: Release to limited public access of SSB V4.2
    – Access for the general public only at the Cornell-hosted Virtual RDC (SSB server: restricted-access setup), with the promise of evaluation of Virtual RDC-run programs on the internal Gold Standard
    – Ongoing SSA evaluation
    – Ongoing evaluation at Census (in RDC)
– 2010: Release of SSB V5 at Census and on the Virtual RDC (codebook: SSB V5)
    – Restructured to vastly improve analytical validity of SIPP variables
– 2013: Release of SSB V5.1 at Census and on the Virtual RDC (documentation in preparation)
    – First user-initiated variables

Basic Structure of the SSB V4
– SIPP
    – Core set of 125 SIPP variables in a standardized extract of SIPP panels and 1996
    – All missing data items (except for structurally missing) are marked for imputation
– IRS
    – Maintained at SSA, but derived from IRS records
    – Master summary earnings records (SER)
    – Master detailed earnings records (DER)

Basic Structure of the SSB V4 (II)
– SSA
    – Master Beneficiary Record (MBR)
– Census
    – Numident: administrative birth and death dates
– All files combined using verified SSNs => “Gold Standard”

Basic Structure of SSB V5
– Panels: 1990, 1991, 1992, 1993, 1996, 2001, and 2004 (this variable is now in the SSB)
– Couple-level linkage: the first person to whom the SIPP respondent was married during the time period covered by the SIPP panel
– SIPP variables only appear in years appropriate for the panel indicated by the PANEL variable (biggest change from V4.2)
– Version 5.1: user-requested variables

Missing Values in the Gold Standard
– Values may be missing due to
    – [survey] Non-response
    – [survey] Question not being asked in a particular panel
    – [admin] Failure to link to administrative record (non-validated SSN)
    – [both] Structural missing (e.g., income of spouse if not married)
– All missing values except structural are part of the missing data imputation phase of SSB

Scope of the Synthesis
– Never missing and not synthesized
    – gender
    – marital status
    – spouse’s gender
    – initial type of Social Security benefits
    – type of Social Security benefits in 2000
    – spouse’s benefits type variables
– All other variables in the public use file were synthesized

Common Structure to Multiple Imputation and Synthesis
– Hierarchical tree of variable relationships (parent-child relationship, accounting for structure)
– At each node, independent SRMI is used
    – A statistical model is estimated for each of the variables at the same level (one of):
        – Bayesian bootstrap
        – Logistic regression (with automatic Bayesian variable selection)
        – Linear regression (with automatic Bayesian variable selection)
    – Statistical models are estimated separately for groups of individuals
    – Then, a proper posterior predictive distribution (PPD) is estimated
    – Given a PPD, each variable is imputed/synthesized, conditional on all values of all other variables for that record
– The next node is processed

MI and Synthesis
– Initial iterations are for missing data imputation, keeping all observed values where available
– Final iteration is for data synthesis (replacing all observed values, see exceptions)

Latest Release of SSB
– 2010: Release of limited public access SSB V5
– 2013: Release of limited public access SSB V5.1
– Both versions accessed via the VirtualRDC Synthetic Data Server

SIPP Variables Codebook

Synthetic Data Creation
– Purpose of synthetic data is to create micro-data that can be used by researchers in the same manner as the original data while preserving the confidentiality of respondents’ identities
– Fundamental trade-off: usefulness and analytical validity of data versus protection from disclosure
– Goal: no one in the already released SIPP public use files can be re-identified, while regression results are still preserved

Multiple Imputation for Confidentiality Protection

Testing Analytical Validity
– Run regressions on each synthetic implicate
    – Average coefficients
    – Combine standard errors using formulae that take account of average variance of estimates (within implicate variance) and differences in variance across estimates (between implicate variance)
– Run regressions on gold standard data
– Compare average synthetic coefficient and standard error to gold standard coefficient and standard error
– Data are analytically valid if the coefficient is unbiased and the same inferences are drawn

Formulae: Completed Data Only

Formulae: Total Variance and Between Variance

Formula: Within Variance

Formulae: Synthetic and Completed

Formulae: Grand Mean and Overall Variance

Formulae: Between Variances

Formulae: Within Variances

Example: Average AIME/AMW
– Estimate average on each of the synthetic implicates
    – AvgAIME(1,1), AvgAIME(1,2), AvgAIME(1,3), AvgAIME(1,4)
    – AvgAIME(2,1), AvgAIME(2,2), AvgAIME(2,3), AvgAIME(2,4)
    – AvgAIME(3,1), AvgAIME(3,2), AvgAIME(3,3), AvgAIME(3,4)
    – AvgAIME(4,1), AvgAIME(4,2), AvgAIME(4,3), AvgAIME(4,4)
– Estimate mean for each set of synthetic implicates that correspond to one completed implicate
    – AvgAIMEAVG(1), AvgAIMEAVG(2), AvgAIMEAVG(3), AvgAIMEAVG(4)
– Estimate grand mean of all implicates
    – AvgAIMEGRANDAVG

Example (cont.)
– Between m implicate variance
– Between r implicate variance

Example (cont.)
– Variance of mean from each implicate
    – VAR[AvgAIME(1,1)], VAR[AvgAIME(1,2)], VAR[AvgAIME(1,3)], VAR[AvgAIME(1,4)]
    – VAR[AvgAIME(2,1)], VAR[AvgAIME(2,2)], VAR[AvgAIME(2,3)], VAR[AvgAIME(2,4)]
    – VAR[AvgAIME(3,1)], VAR[AvgAIME(3,2)], VAR[AvgAIME(3,3)], VAR[AvgAIME(3,4)]
    – VAR[AvgAIME(4,1)], VAR[AvgAIME(4,2)], VAR[AvgAIME(4,3)], VAR[AvgAIME(4,4)]
– Within variance

Example (cont.)
– Total Variance
– Use AvgAIMEGRANDAVG and Total Variance to calculate confidence intervals and compare to estimate from completed data
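The formula slides did not survive in this transcript, so the worked example above can only be sketched under an assumption: that the combining rules are the two-stage rules of Reiter (2004) for m completed implicates with r synthetic implicates each. The function and variable names below are hypothetical, and the normal-approximation interval is a simplification (a t reference distribution is the usual choice).

```python
import math
from statistics import mean

def combine_nested(q, u):
    """Combine estimates from m completed x r synthetic implicates.

    q[l][i]: point estimate (e.g. AvgAIME(l,i)) from synthetic implicate i
             of completed implicate l.
    u[l][i]: estimated sampling variance of that estimate.
    Assumes Reiter (2004)-style rules; not taken from the slides.
    """
    m, r = len(q), len(q[0])
    q_bar_l = [mean(row) for row in q]        # AvgAIMEAVG(l)
    q_bar = mean(q_bar_l)                     # AvgAIMEGRANDAVG
    # Between-m variance: spread of the per-completed-implicate means.
    b_m = sum((ql - q_bar) ** 2 for ql in q_bar_l) / (m - 1)
    # Between-r variance: spread of synthetic implicates around their own mean.
    b_r = sum((q[l][i] - q_bar_l[l]) ** 2
              for l in range(m) for i in range(r)) / (m * (r - 1))
    u_bar = mean(v for row in u for v in row)  # within variance
    total = (1 + 1 / m) * b_m - b_r / r + u_bar
    return q_bar, b_m, b_r, u_bar, total

def normal_interval(q_bar, total, z=1.96):
    # total can be negative in rare cases; guard before taking the root.
    half = z * math.sqrt(max(total, 0.0))
    return q_bar - half, q_bar + half
```

With the 4 x 4 layout of the example, `q` would hold the sixteen AvgAIME(l,i) values and `u` the sixteen VAR[AvgAIME(l,i)] values; the resulting interval is what gets compared with the completed-data estimate.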

SAS Programs
– Sample programs to calculate total variance and confidence intervals

Results: Average AIME

Public Use of the SIPP Synthetic Beta
– Full version (16 implicates) released to the Cornell VirtualRDC Synthetic Data Server (SDS)
– Any researcher may use these data
– During the testing phase, all analyses must be performed on the Virtual RDC
– Census Bureau research team will run the same analysis on the completed confidential data
– Results of the comparison will be released to the researcher, Census Bureau, SSA, and IRS (after traditional disclosure avoidance analysis of the runs on the confidential data)

Methods for Estimating the PPD
– Sequential Regression Multivariate Imputation (SRMI) is a parametric method of defining the posterior predictive distribution (PPD)
– The Bayesian bootstrap (BB) is a non-parametric method of taking draws from the posterior predictive distribution of a group of variables that allows for uncertainty in the sample CDF
– We use BB for a few groups of variables with particularly complex relationships and use SRMI for all other variables

SRMI Method Details

SRMI Details: KDE Transforms
– The SRMI models for continuous variables assume that they are conditionally normal
– This assumption is relaxed by performing a KDE-based transform of groups of related variables
– All variables in the group are transformed to normality, then the PPD is estimated
– The sampled values from the PPD are inverse transformed back to the original distribution using the inverse cumulative distribution
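A minimal single-variable sketch of the transform-and-invert idea, not the production SSB routine: the CDF is estimated with a Gaussian kernel, mapped through the standard normal quantile function, and inverted by bisection. The rule-of-thumb bandwidth, the bisection inverse, and all function names are assumptions made for illustration.

```python
from statistics import NormalDist, stdev

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def silverman_bandwidth(sample):
    # Rule-of-thumb bandwidth; an assumption, the slides do not name one.
    return 1.06 * stdev(sample) * len(sample) ** (-0.2)

def kde_cdf(x, sample, h):
    """Gaussian-kernel estimate of the CDF of `sample`, evaluated at x."""
    return sum(STD_NORMAL.cdf((x - s) / h) for s in sample) / len(sample)

def to_normal(x, sample, h):
    """Map x to the normal scale: z = Phi^{-1}(F_kde(x))."""
    p = kde_cdf(x, sample, h)
    p = min(max(p, 1e-12), 1 - 1e-12)  # keep inv_cdf away from 0 and 1
    return STD_NORMAL.inv_cdf(p)

def from_normal(z, sample, h, tol=1e-9):
    """Map z back to the original scale by bisection on the KDE CDF."""
    p = STD_NORMAL.cdf(z)
    lo = min(sample) - 10 * h
    hi = max(sample) + 10 * h
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kde_cdf(mid, sample, h) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

In this sketch the regression models would be fit to `to_normal(...)` values, and draws from the PPD would be passed through `from_normal(...)` so that synthetic values follow the skewed original distribution rather than a normal one.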

SRMI Example: Synthesizing Date of Birth
– Divide individuals into homogeneous groups using stratification variables
    – example: male, black, age categories, education categories, marital status
    – example: decile of lifetime earnings distribution, decile of lifetime years worked distribution, worked previous year, worked current year
– For each group, estimate an independent linear regression of date of birth on other variables (not used for stratification) that are strongly related

SRMI Example: Synthesizing Date of Birth
– Synthetic date of birth is a random variable
– Before analysis, it is transformed to normal using the KDE-based procedure
– Distribution has two sources of variation:
    – variation in error term in regression model
    – variation in estimated parameters: β’s and σ²
– Synthetic values are draws from this distribution
– Synthetic values are inverse transformed back to the original distribution using the inverse cumulative distribution
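The two sources of variation can be made concrete with a small sketch. It assumes a single covariate, a noninformative prior, and a response already on the normal scale; these are simplifications for illustration (the production models use many covariates and Bayesian variable selection), and all names are hypothetical.

```python
import math
import random

def fit_simple_ols(x, y):
    """Least squares of y on an intercept and one covariate x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, sse, mx, sxx

def draw_synthetic(x, y, rng):
    """One draw of synthetic y values from the posterior predictive distribution."""
    n = len(x)
    b0, b1, sse, mx, sxx = fit_simple_ols(x, y)
    # Parameter uncertainty, part 1: sigma^2 ~ SSE / chi-square(n - 2).
    # A chi-square with k degrees of freedom is Gamma(shape=k/2, scale=2).
    sigma2 = sse / rng.gammavariate((n - 2) / 2, 2.0)
    # Parameter uncertainty, part 2: in the centered parameterization the
    # slope and the fitted value at mean(x) are independent normals.
    slope = rng.gauss(b1, math.sqrt(sigma2 / sxx))
    center = rng.gauss(b0 + b1 * mx, math.sqrt(sigma2 / n))
    # Error-term variation: fresh noise for every record.
    sd = math.sqrt(sigma2)
    return [center + slope * (xi - mx) + rng.gauss(0.0, sd) for xi in x]
```

Calling `draw_synthetic(x, y, random.Random(seed))` with different seeds gives different implicates; skipping the parameter draws and reusing the point estimates would understate the variability across implicates.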

Bayesian Bootstrap Method Details
– Divide data into homogeneous groups using stratification variables similar to those in SRMI
– Within groups, do a Bayesian bootstrap of all variables to be synthesized at the same time
    – with n observations in a group, draw n−1 random variables from the uniform (0,1) distribution
    – let u_0 … u_i … u_n (with u_0 = 0 and u_n = 1) define the ordering of the observations in the group
    – u_i − u_{i−1} is the probability of sampling observation i from the group to replace missing data or synthesize data in observation j
    – in the conventional bootstrap, the probability of sampling is 1/n
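The sampling probabilities described above can be sketched in a few lines. The function names and the use of whole records as donors are assumptions for illustration only.

```python
import random

def bayesian_bootstrap_probs(n, rng):
    """Selection probabilities for n observations.

    Draw n-1 uniforms, sort them, and take the gaps between consecutive
    order statistics (with 0 and 1 added as end points).
    """
    cuts = sorted(rng.random() for _ in range(n - 1))
    edges = [0.0] + cuts + [1.0]
    return [edges[i] - edges[i - 1] for i in range(1, n + 1)]

def draw_donors(group, k, rng):
    """Sample k donor records from one stratification group."""
    probs = bayesian_bootstrap_probs(len(group), rng)
    return rng.choices(group, weights=probs, k=k)
```

Replacing `probs` with equal weights of 1/n recovers the conventional bootstrap; the random gaps are what carry the uncertainty about the sample CDF into the draws.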

Creating Synthetic Data
– Begin with base data set that contains only non-missing values
– Use BB to complete missing administrative data
    – i.e. find donor SSN based on non-missing SIPP variables
– Use SRMI to complete missing SIPP data
– Iterate multiple times
    – input for iteration 2 is completed data set from iteration 1
– On last iteration, run 4 separate processes to create 4 separate data sets or implicates

Creating Synthetic Data, Cont.
– Synthesis is like one more iteration of data completion, except all observations are treated as missing
– Each completed implicate serves as a separate input file
– Run 16 separate processes to create 16 different synthetic data sets or implicates
– The separate processes to create implicates have different stratification variables
– Need enough implicates to produce enough variation to ensure that averages across the implicates will be close to truth

Features of Synthesizing Routines
– Parent-child relationships
    – foreign-born and decade of arrival in the US
    – welfare participation and welfare amount
    – presence of earnings, amount of earnings
– Restrictions on draws from PPD
    – Some draws must be within a pre-specified range from the original value: for example, MBA is within +/- $50 of the original value
    – Maximum and minimum values are imposed on some variables
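One way to honor such restrictions is to redraw until the value falls in the allowed band. The sketch below is an illustration under assumptions: the slides do not say how restricted draws are produced, and the fallback to clamping after `max_tries` is a hypothetical safeguard.

```python
def clamp(value, lower, upper):
    """Impose minimum and maximum values on a draw."""
    return min(max(value, lower), upper)

def draw_within_band(draw, original, band, max_tries=1000):
    """Redraw from `draw()` until the value is within +/- band of `original`.

    `draw` is any zero-argument function returning one PPD draw.
    If no draw qualifies after max_tries, the last draw is clamped
    to the band (an assumed fallback, not described in the slides).
    """
    value = original
    for _ in range(max_tries):
        value = draw()
        if abs(value - original) <= band:
            return value
    return clamp(value, original - band, original + band)
```

For the MBA example, `band` would be 50 and `original` the confidential benefit amount; a parent-child relationship would be handled one level up, by only drawing the child (for example, welfare amount) when the synthesized parent (welfare participation) says it exists.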

External Researcher Validation
– Version 4.0
    – 12 projects
    – 1 was submitted for validation
– Version 5.0
    – 31 projects
    – 6 were submitted for validation

Validation Details
– Henriques, Alice (2012) “How does Social Security claiming respond to incentives? Considering husbands’ and wives’ benefits separately”
– Armour, Philip (2012) “The role of information in disability insurance take-up: An analysis of the Social Security statement phase-in”
– Bertrand, Marianne, Emir Kamenica and Jessica Pan, “Gender identity and relative income within households”

From Bertrand et al.
– Timeline: SDS application November 2012, gold standard results January 2013

SYNTHETIC LONGITUDINAL BUSINESS DATABASE

The Synthetic Longitudinal Business Database
– Based on presentations by Kinney/Reiter/Jarmin/Miranda/Reznek 2 /Abowd on July 31, 2009 at the Census-NSF-IRS Synthetic Data Workshop
– Kinney/Reiter/Jarmin/Miranda/Reznek/Abowd (2011) “Towards Unrestricted Public Use Microdata: The Synthetic Longitudinal Business Database”, CES-WP-11-04
– Work on the Synthetic LBD was supported by NSF Grant ITR, and ongoing work is supported by the Census Bureau. A portion of this work was conducted by Special Sworn Status researchers of the U.S. Census Bureau at the Triangle Census Research Data Center. Research results and conclusions expressed are those of the authors and do not necessarily reflect the views of the Census Bureau. Results have been screened to ensure that no confidential data are revealed.

Overview
– LBD background
– Synthetic data generation
– Analytic validity
– Confidentiality protection
– Future plans

Elements
– Economic surveys and censuses
    – Issue: (item) non-response
    – Solution: LBD
– Business Register
    – Issue: inexact link records
    – Solution: LBD
– Match-merged and completed complex integrated data
    – Issue: too much detail leads to disclosure issue
    – Solution: Synthetic LBD
– Public-use data with novel detail
– Novel analysis using public-use data with novel detail
    – Issue: are the results right?
    – Solution: early release/SDS

The Real LBD
– Economic census covering nearly all private non-farm business establishments with paid employees
    – Contains: annual payroll and Mar 12 employment ( ), SIC/NAICS, geography (down to county), entry year, exit year, firm structure
– Used for looking at business dynamics, job flows, market volatility, international comparisons…

Longitudinal Business Database (LBD)
– Detailed description in Jarmin and Miranda
– Developed as a research dataset by the U.S. Census Bureau Center for Economic Studies
– Constructed by linking annual snapshots of the Census Bureau’s Business Register (see Lecture 4)

Longitudinal Business Database II CES constructed Longitudinal linkages (using probabilistic record linking, see Lecture 10) Re-timed multi-unit births and Edits and imputations for missing data

Access to the LBD
– Different levels of access
– Public use tabulations
    – Business Dynamics Statistics
– “Gold Standard” confidential micro-data available through the Census Research Data Center (RDC) Network
    – Most used dataset in the RDCs

Bridge between the Two
– Synthetic data set
    – Available outside the Census RDC
    – Providing as much analytical validity as possible
    – Reduce the number of requests for special tabulations
    – Aid users requiring RDC access
– Experiment in public use business micro-data

Why Synthetic Data?
– Concerns about confidentiality protection for census of establishments
    – LBD is a test case for business data
– Criteria given for public release:
    – No actual confidential values could be released
    – Should provide valid inferences while protecting confidentiality

Generic Structure
– Gold standard: given by internal LBD (already completed)
– Partially synthetic:
    – Unsynthesized:
        – County (but not released!) [x1]
        – SIC [x2]
    – Synthesized:
        – Birth [y1] and death [y2] year
        – Multi-unit status [y3]
        – Employment (March 12) [y4]
        – Payroll [y5]

Synthesis: General Approach
– Y = [y1|y2|y3|y4|y5], X = [x1|x2]
– Generate joint distribution of Y|X by sampling from conditionals
    – f(y1,y2,y3|X) = f(y1|X)·f(y2|y1,X)·f(y3|y1,y2,X)
– Use SIC as “by group”

General Approach to Synthesis
– Drawing from f(yk|X,y1,...,yk-1)
    – Fit model using observed data
    – Draw new values of parameters from posterior distributions
    – Use new parameters to predict yk from X and synthetic values of y1,...,yk-1

The Sequential Regression Multivariate Imputation (SRMI) Approach
– Calendar:
    – Step 1: Impute y1 | X
    – Step 2: Impute y2 | [y1 | f(X)], where f(X) uses state [x1’] instead of county [x1]
– Type of firm:
    – Step 3: Impute y3 | [y1|y2|X]
– Characteristics:
    – Step 4: Impute y4(t) | [y1|y2|y3|y4(t-1)|x2]
    – Step 5: Impute y5(t) | [y1|y2|y3|y4(t)|y5(t-1)|x2]

First Year
– Impute y1 (First Year) | SIC, County using a variant of Dirichlet-Multinomial
    – Prior information is obtained by collapsing categories
    – Synthetic values obtained by sampling from the multinomial distribution

Last Year
– Impute y2 (Last Year) | First Year, State, SIC
– Simple multinomial approach
    – Dirichlet-multinomial with flat prior
    – Sample from multinomial probabilities obtained from matching categories in observed data
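A minimal sketch of the flat-prior Dirichlet-multinomial step, assuming one conditioning cell (one First Year, State, SIC combination) at a time. The function name and the Dirichlet(1, ..., 1) reading of "flat prior" are assumptions; the First Year variant, with prior information from collapsed categories, would replace the constant `prior` with category-specific pseudo-counts.

```python
import random
from collections import Counter

def synthesize_category(observed, categories, n_draws, rng, prior=1.0):
    """Draw synthetic categorical values for one conditioning cell.

    observed:   confidential values (e.g. last years) in the cell
    categories: every feasible value, including ones not observed
    prior:      Dirichlet pseudo-count per category (1.0 = flat prior)
    """
    counts = Counter(observed)
    # A Dirichlet draw is a set of independent gamma draws, normalized.
    gammas = [rng.gammavariate(counts[c] + prior, 1.0) for c in categories]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    # Sample synthetic values from the multinomial with those probabilities.
    return rng.choices(categories, weights=probs, k=n_draws)
```

Because every category has a positive posterior weight, any feasible year can be produced, including years that never occur in the cell's confidential data.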

Multi-unit Status
– Impute in two stages:
    – Categorical response: always MU, sometimes MU, never MU
    – Imputed using simple multinomial approach
– Given that a change in status occurs, impute when the change occurred (future)

Employment and Payroll
– Highly skewed longitudinal continuous variables
– Imputed using a set of normal linear models with KDE transformation of response (Abowd and Woodcock, 2004)
– Impute year by year, employment and then payroll, based on groups
    – (3-digit SIC)
    – by (multiunit status)
    – by (continuer status)
    – by (top 5% status)
– If model too sparse, use 2-digit SIC as prior

Analytical Validity Tests
– Compare observed data and synthetic data for whole LBD
– Job creation and destruction
– Employment volatility
– Gross employment levels
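The slides do not define the job flow measures, so the sketch below assumes the standard Davis-Haltiwanger-Schuh rates: gross gains (or losses) in establishment employment divided by the average of total employment in the two years. The dictionary layout and names are hypothetical.

```python
def job_flows(prev, curr):
    """Job creation and destruction rates between two years.

    prev, curr: {establishment_id: March 12 employment}; an establishment
    missing from a year is treated as having zero employment (birth/death).
    """
    creation = 0.0
    destruction = 0.0
    for est in set(prev) | set(curr):
        change = curr.get(est, 0) - prev.get(est, 0)
        if change > 0:
            creation += change
        else:
            destruction -= change
    denom = 0.5 * (sum(prev.values()) + sum(curr.values()))
    return creation / denom, destruction / denom
```

Running the same function on the confidential LBD and on the synthetic LBD, year by year, gives the paired series that a validity comparison of this kind would plot against each other.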


Confidentiality Protection
– Unavailable in SynLBD V2 (current on SDS)
    – Firm structure
    – Firm linkages (across time, across implicates)
    – Geography
– Basic protection
    – Replacing sensitive values with draws from probability distributions

Disclosure Avoidance Review
– High probability that an individual establishment’s synthetic birth/death year is different from its actual birth/death year
– Synthetic maxima not necessarily near actual
– High between-imputation variability at establishment level

Synthesizing Firstyear (Birth) and Lastyear (Death)
– Positive probability exists of producing any feasible birth year, and substantial probability exists that the synthesized firstyear is not the actual firstyear
– Table on next slide shows this: prob(actual birth year = synthetic birth year | synthetic birth year) is low
– Similar results hold for deaths
– Conclusion: establishment lifetimes are random, so users cannot accurately attach establishment identifications to them
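The conditional probability in the second bullet can be estimated from paired records inside the secure environment. This is an illustrative sketch with hypothetical names, not the code behind the table on the next slide.

```python
from collections import defaultdict

def match_rate_by_synthetic_year(actual, synthetic):
    """Estimate prob(actual year = synthetic year | synthetic year).

    actual[k] and synthetic[k] are the confidential and synthesized
    first years of the same establishment.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for a, s in zip(actual, synthetic):
        totals[s] += 1
        hits[s] += (a == s)
    return {s: hits[s] / totals[s] for s in totals}
```

Low values for every synthetic year mean that seeing a given birth year in the released file says little about the establishment's true birth year.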


Example: Year of birth

Confidentiality Protection: Breaking Firm Links
– Firm characteristics not synthesized
– Firm characteristics more skewed than establishment characteristics
– Cannot link multi-unit establishments to their firms

Confidentiality Protection: Breaking Links Across Implicates
– Synthetic observations with the same LBDnum across implicates are not generated from the same LBD establishment
– Cannot group (across implicates within year) observations generated from the same establishment

Confidentiality Protection: Synthesizing Employment and Payroll
– Synthesis models are essentially regressions with transformed variables
– Synthesis captures low-dimensional relationships and sacrifices higher-dimensional ones
– Synthesized employment and payroll vary substantially around regression lines
– Synthesized employment and payroll vary significantly from observed values

Example: Correlations Among Actual and Synthetic Data (table columns: SIC, year)


Conclusions
– Analytical validity supported for broad analyses
    – Issues with some details
    – Obtain user feedback to inform future refinements
– Sufficient confidentiality protection
    – Basic metrics show strong protection
    – Differential privacy protection not yet verified

Ongoing Work at Census
– Include NAICS, geography, changes in multiunit status, firm age and size
– Multiple imputations for release
– Address bias in job creation/destruction
– Extend time series

External Validation Exercises
– 41 approved projects (includes provisional approvals)
– 3 have submitted results for validation (one of these did two rounds of validation)
– Moscarini timeline: application approved March 2011, validation results released September