Developing job linkages for the Health and Retirement Study John Abowd, Margaret Levenstein, Kristin McCue, Dhiren Patki, Ann Rodgers, Matthew Shapiro,

Slides:



Advertisements
Similar presentations
Evaluating the Effects of Business Register Updates on Monthly Survey Estimates Daniel Lewis.
Advertisements

Estimating the Level of Underreporting of Expenditures among Expenditure Reporters: A Further Micro-Level Latent Class Analysis Clyde Tucker Bureau of.
The Nature of the Bias When Studying Only Linkable Person Records: Evidence from the American Community Survey Adela Luque (U.S. Census Bureau) Brittany.
The Microdata Analysis System (MAS): A Tool for Data Dissemination Disclaimer: The views expressed are those of the authors and not necessarily those of.
Nonresponse Bias Correction in Telephone Surveys Using Census Geocoding: An Evaluation of Error Properties Paul Biemer RTI International and University.
Mark D. Reckase Michigan State University The Evaluation of Teachers and Schools Using the Educator Response Function (ERF)
1 The Synthetic Longitudinal Business Database Based on presentations by Kinney/Reiter/Jarmin/Miranda/Reznek 2 /Abowd on July 31, 2009 at the Census-NSF-IRS.
© Federal Statistical Office Germany, IV A2 Federal Statistical Office Germany Application of Regular Expressions in the German Business Register Session.
Research developments at the Census Bureau Roderick J. Little Associate Director for Research & Methodology and Chief Scientist Bureau of the Census.
Sampling Strategy for Establishment Surveys International Workshop on Industrial Statistics Beijing, China, 8-10 July 2013.
What are Wage Records? Wage records are an administrative database used to calculate Unemployment Insurance benefits for employees who have been laid-off.
© 2007 John M. Abowd, Lars Vilhuber, all rights reserved Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2007.
March 2013 ESSnet DWH - Workshop IV DATA LINKING ASPECTS OF COMBINING DATA INCLUDING OPTIONS FOR VARIOUS HIERARCHIES (S-DWH CONTEXT)
Villalonga (2004) Lang and Stulz (1994), Berger and Ofek (1995), and Servaes (1996) find that diversified firms trade at an average discount relative to.
Chapter 7 – K-Nearest-Neighbor
© John M. Abowd 2005, all rights reserved Household Samples John M. Abowd March 2005.
© John M. Abowd 2005, all rights reserved Analyzing Frames and Samples with Missing Data John M. Abowd March 2005.
Bridging the Gaps: Dealing with Major Survey Changes in Data Set Harmonization Joint Statistical Meetings Minneapolis, MN August 9, 2005 Presented by:
© John M. Abowd 2005, all rights reserved Statistical Tools for Data Integration John M. Abowd April 2005.
© John M. Abowd 2005, all rights reserved Recent Advances In Confidentiality Protection John M. Abowd April 2005.
© John M. Abowd 2005, all rights reserved Sampling Frame Maintenance John M. Abowd February 2005.
© John M. Abowd and Lars Vilhuber 2005, all rights reserved Introduction to Probabilistic Record Linking John M. Abowd and Lars Vilhuber March 2005.
Recent Advances In Confidentiality Protection – Synthetic Data John M. Abowd April 2007.
© John M. Abowd 2005, all rights reserved Using the Economic Census and Business Register John M. Abowd February 2005.
FINAL REPORT: OUTLINE & OVERVIEW OF SURVEY ERRORS
The Impact of Hours Flexibility on Career Employment, Bridge Jobs, and the Timing of Retirement Kevin E. Cahill Sloan Center on Aging & Work at Boston.
Summer Working Group for Employer List Linking (SWELL) May 22, 2014: NCRN workshop1 Graton Gathright, Mark Kutzbach, Kristin Mccue, Erika McEntarfer, Holly.
Survey Experiments. Defined Uses a survey question as its measurement device Manipulates the content, order, format, or other characteristics of the survey.
Use of administrative data in statistics - challenges and opportunities ICES III End Panel Discussion Montreal, June 2007 Heli Jeskanen-Sundström Statistics.
Estimating social inequalities in Healthy Life Years in Belgium Estimating social inequalities in HLE: Challenges and opportunities 10 February, 2012 Rana.
New Census Bureau Data for Entrepreneurship Research Ron S Jarmin US Census Bureau OECD November 19, 2007 This report is released to inform interested.
Household Surveys ACS – CPS - AHS INFO 7470 / ECON 8500 Warren A. Brown University of Georgia February 22,
INFO 7470/ILRLE 7400 Statistical Tools: Missing Data Methods John M. Abowd and Lars Vilhuber March 15, 2011.
11 Comparison of Perturbation Approaches for Spatial Outliers in Microdata Natalie Shlomo* and Jordi Marés** * Social Statistics, University of Manchester,
Effect of Cost-Sharing on Screening Mammography in Medicare Managed Care Plans Amal Trivedi, MD, MPH William Rakowski, PhD John Ayanian, MD, MPP 2007 AcademyHealth.
Introduction to Record Linking John M. Abowd and Lars Vilhuber April 2011 © 2011 John M. Abowd, Lars Vilhuber, all rights reserved.
12th Meeting of the Group of Experts on Business Registers
1 Supplementing ACS: The LEHD Program Jeremy S. Wu Marc Roemer U.S. Census Bureau May 12, 2005 Jeremy S. Wu Marc Roemer U.S. Census Bureau May 12, 2005.
DATA MINING FINAL REPORT Vipin Saini M 許博淞 M 陳昀志 M
© John M. Abowd 2007, all rights reserved Analyzing Frames and Samples with Missing Data John M. Abowd March 2007.
Assessing Disclosure for a Longitudinal Linked File Sam Hawala – US Census Bureau November 9 th, 2005.
The relationship between error rates and parameter estimation in the probabilistic record linkage context Tiziana Tuoto, Nicoletta Cibella, Marco Fortini.
1 An Evaluation of Changes to the Universe Extraction for Current Business Surveys at the U.S. Census Bureau Author: Carol S. King Presenter: Ruth E. Detlefsen.
Panel Study of Entrepreneurial Dynamics Richard Curtin University of Michigan.
The Conditional Independence Assumption in Probabilistic Record Linkage Methods Stephen Sharp National Records of Scotland Ladywell Road Edinburgh EH12.
MCRDC Michigan Census Research Data Center The MCRDC is a joint project of the U.S. Bureau of the Census and the University of Michigan to enable qualified.
Sampling Neuman and Robson Ch. 7 Qualitative and Quantitative Sampling.
Editing of linked micro files for statistics and research.
© John M. Abowd 2005, all rights reserved Multiple Imputation, II John M. Abowd March 2005.
ICCS 2009 IDB Workshop, 18 th February 2010, Madrid 1 Training Workshop on the ICCS 2009 database Weighting and Variance Estimation picture.
The challenge of a mixed-mode design survey and new IT tools application: the case of the Italian Structure Earning Surveys Fabiana Rocci Stefania Cardinleschi.
Chap 8-1 Fundamentals of Hypothesis Testing: One-Sample Tests.
Chapter 6 – Three Simple Classification Methods © Galit Shmueli and Peter Bruce 2008 Data Mining for Business Intelligence Shmueli, Patel & Bruce.
1 1/5/2016 The Link between Individual Expectations and Savings: Do nursing home expectations matter? Kristin J. Kleinjans, University of Aarhus & RAND.
1 Statistics & R, TiP, 2011/12 Neural Networks  Technique for discrimination & regression problems  More mathematical theoretical foundation  Works.
ICCS 2009 IDB Seminar – Nov 24-26, 2010 – IEA DPC, Hamburg, Germany Training Workshop on the ICCS 2009 database Weights and Variance Estimation picture.
INFO 4470/ILRLE 4470 Visualization Tools and Data Quality John M. Abowd and Lars Vilhuber March 16, 2011.
An Overview of Editing and Imputation Methods for the next Italian Censuses Gianpiero Bianchi, Antonia Manzari, Alessandra Reale UNECE-Eurostat Meeting.
Kevin A Henry, Ph.D New Jersey Cancer Registry Cancer Epidemiology Services Frank Boscoe, Ph.D New York State Cancer Registry Estimating the accuracy of.
INFO 7470 Statistical Tools: Edit and Imputation Examples of Multiple Imputation John M. Abowd and Lars Vilhuber April 18, 2016.
Expanding the Role of Synthetic Data at the U.S. Census Bureau 59 th ISI World Statistics Congress August 28 th, 2013 By Ron S. Jarmin U.S. Census Bureau.
Looking for statistical twins
Challenges in data linkage: error and bias
Introduction to Probabilistic Record Linking
Creation of synthetic microdata in 2021 Census Transformation Programme (proof of concept) Robert Rendell.
Discrete Event Simulation - 4
Martha Stinson. T. Kirk White. James Lawrence
An Introduction to Automated Record Linkage
Presentation transcript:

Developing job linkages for the Health and Retirement Study John Abowd, Margaret Levenstein, Kristin McCue, Dhiren Patki, Ann Rodgers, Matthew Shapiro, Nada Wasi NCRN Spring 2016 Meeting, May 10, 2016

This research is supported by the Alfred P. Sloan Foundation through the Census-HRS project at the University of Michigan with additional support from the Michigan Node of the NSF-Census Research Network (NCRN) under NSF SES This research uses data from the Census Bureau's Longitudinal Employer- Household Dynamics Program, which was partially supported by the following National Science Foundation Grants SES , SES and ITR ; National Institute on Aging Grant AG018854; and grants from the Alfred P. Sloan Foundation. Any opinions and conclusions expressed herein are those of the authors and do not necessarily represent the views of the U.S. Census Bureau. All results have been reviewed to ensure that no confidential information is disclosed. Acknowledgements and disclaimers 2

Outline  Background on HRS, CenHRS  Approach to linkage  Work using a small set of HRS jobs  Some preliminary results  Challenges

37,000 + Americans over the age of 50  Surveyed every two years since 1992  New cohorts added in 1993, 1998, 2004, 2010, 2016  Includes both spouses  Oversamples minorities  Follows respondents through death 4

5

UMichigan/Cornell/Census collaboration Goal: New info on HRS respondents in employer and co-worker context Develop new data infrastructure:  HRS-BR Crosswalk  New measures of employer characteristics  Enhance HRS public-use datasets Census-Enhanced HRS 6

Linkage Process Flow 7 HRS Business Register Blocked Pairs File of Candidate Matches Standardize Names/addresses, Calculate Comparators Create training set using human review Train Matching Model Predict Match Scores Analysis File with Multiply imputed links

First steps:  Use a subset of 1992 HRS private-sector jobs, 1992 BR to work out methods  Block on:  10-digit phone number, where possible  3-digit zip code, otherwise  Standardize address and name fields, using rules developed specifically for business names  Compute Jaro-Winkler string comparator scores for names and addresses

Construct set of pairs  1, HRS jobs from 7 states  Exclude if missing employer name or state, or missing both zip3 and phone # (10%)  <10% of phone numbers successfully blocked  Almost always at least 1 BR entry in zip3 block

Initial set of blocked pairs  All possible within-block pairs = 18.3M  JW scores comparing name, address  Stratify using 4x4 cross-classification of JW scores  Mean pairs per sampled HRS job=3,100, but varies from 1 to 20,000 across bins.  Lowest JW scored bin accounts for:  98% of pairs blocked on 3-digit zip  42% of those blocked on 10-digit phone number

Creating training set  Sample 100 pairs from each stratum  Each sampled pair reviewed by >=2 reviewers  Reviewers see 1 pair at a time  Assign separate scores for firm, establishment  Score as follows: 1=Yes, match 2=Probably match 3 =Maybe-maybe not 4=Probably not match 5=Not match 6=Not enough information

Results of review  3,400 reviews, 7 reviewers  Disagreement across reviewers:  5% for yes/no reviews  63% for maybe/not enough info  Use only yes/no reviews in estimating model (3,100) Match?EstablishmentFirm Yes10%18% Maybe13%11% No76%71% Not enough info<1%

Match rates by blocking factor Share of reviews scored as match Blocked onEstablishmentFirm 10-digit phone number 94%100% 3-digit zipcode11%19% Note: Reviews scored Probably match, maybe/maybe not, probably not match, or not enough information are excluded from denominator.

Model propensity for record from HRS to match record from the BR  Estimate model parameters using training set  Calculate agreement probability for all possible pairs within block Multiply impute links using agreement probabilities Modeling approach 14

Training our matching model  Using logistic model: dep var = 1 if pair is scored as a match, 0 otherwise  Regressors: splines of continuous variables, indicators, and a full set of interactions  To limit overfitting and to minimize out of sample error, we use elastic net shrinkage (Zou and Hastie, 2005)  Elastic net shrinkage reduces the dimensionality of the covariate vector  Idea: the optimal set of covariates is chosen to minimize cross-validated test error

Available model covariates  JW scores for agreement of name, address fields  Employment for establishment/employer for categories: 0/missing, 1-4, 5-14, 15-24,25-99,  Agreement on 3-digit, 5-digit zip code  Agreement on industry—2 digit SIC  Whether BR record is for single- or multi-unit  Whether HRS employer offers health insurance/pension  Business density—number of establishments in tract or per square mile

False positive rate True positive rate

Distribution of maximum predicted probability using only JW scores

Challenges  What to do when block does not include any high probability matches?  Possible reasons  Blocking strategy excluded correct match  Blocking didn’t fail:  Model failure  HRS information too garbled to support matching