INFO 7470 Statistical Tools: Edit and Imputation


INFO 7470 Statistical Tools: Edit and Imputation John M. Abowd and Lars Vilhuber April 11, 2016 (updated November 3, 2017)

© John M. Abowd and Lars Vilhuber 2016, all rights reserved

Outline
- Why learn about edit and imputation procedures
- Formal models of edits and imputations
- Missing data overview
- Missing records
  - Frame or census
  - Survey
- Missing items
- Overview of different products
- Overview of methods
- Formal multiple imputation methods
- Examples

Why?
- Users of public-use data can normally identify the existence and consequences of edits and imputations, but don’t have access to data that would improve them
- Users of restricted-access data normally encounter raw files that require sophisticated edit and imputation procedures before they can be used effectively
- Users of integrated (linked) data from multiple sources face these problems in their most extreme form

Formal Edit and Imputation Models
- Original work on this subject can be found in Fellegi and Holt (1976)
  - One of Fellegi’s many seminal contributions
- Recent work by Winkler (2008)
- Formal models distinguish between edits (based on expert judgments) and imputations (based on modeling)
- State of the art: Manrique-Vallier and Reiter (2014) and Murray and Reiter (JASA, forthcoming)

Definition of “Edit”
- Checking each field (variable) of a data record to ensure that it contains a valid entry
  - Examples: NAICS code in range; 0 ≤ age < 120
- Checking the entries of specified fields (variables) to ensure that they are consistent with each other
  - Examples: job creations – job destructions = accessions – separations; age = data_reference_date – birth_date
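The two kinds of edit check defined above can be sketched in a few lines of Python. This is a minimal illustration, not any agency's production edit system: the record layout, field names, and function names are assumptions made for the example, and the age comparison uses completed years on the reference date.

```python
from datetime import date

def range_edit(record):
    """Field-level edit: return the names of fields holding invalid entries."""
    failures = []
    if not (0 <= record["age"] < 120):  # range check from the slide: 0 <= age < 120
        failures.append("age")
    return failures

def age_consistency_edit(record, reference_date):
    """Cross-field edit: does reported age equal reference date minus birth date?"""
    b = record["birth_date"]
    # Completed years between the birth date and the reference date.
    before_birthday = (reference_date.month, reference_date.day) < (b.month, b.day)
    implied_age = reference_date.year - b.year - int(before_birthday)
    return implied_age == record["age"]

def job_flow_edit(creations, destructions, accessions, separations):
    """Cross-field edit: job creations - job destructions = accessions - separations."""
    return creations - destructions == accessions - separations

# Hypothetical records, for illustration only.
ref = date(2016, 4, 11)
good = {"age": 40, "birth_date": date(1976, 1, 15)}
bad = {"age": 130, "birth_date": date(1976, 1, 15)}
```

A record that fails either check is an edit failure and would be routed to one of the resolution options rather than silently corrected here.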

Options When There Is an Edit Failure
- Check original files (written questionnaires, census forms, original administrative records)
- Contact reference entity (household, person, business, establishment)
- Clerically “correct” data (using expert judgment, not model-based)
- Automatic edit (using a specified algorithm)
- Delete record from analysis/tabulation

Direct Edits Are Expensive
- Even when feasible, it is expensive to cross-check every edit failure with original source material
- It is extremely expensive to re-contact sources, although some re-contacts are built into data collection budgets
- Computer-assisted survey/census methods can do these edits at the time of the original data collection
- Re-contacting or revisiting original source material is usually either infeasible or prohibited for administrative records in statistical use

Example: Age on the SIPP
- The SIPP collects both age and birth date (as do the decennial census and ACS)
- These are often inconsistent; an edit is applied to make them consistent
- When linked to SSA administrative record data, the birth date on the SSN application becomes available
- It is often inconsistent with the respondent report
- And both birth dates are often inconsistent with the observed retirement benefit status of a respondent (also in administrative data)
- Why?

The SSA Birth Date Is Respondent Provided until Benefits Are Claimed
- Prior to 1986, most Americans received SSNs when they entered the labor force, not before their first birthday as is now the case
- The birth date on SSN application records is “respondent” provided (and not edited by SSA)
- Further updates to the SSA “Numident” file, which is the master database of SSNs, are also “respondent provided”
- Only when a person claims age-dependent benefits (Social Security at 62 or Medicare at 65) does SSA get a birth certificate and apply a true edit to the birth date

Lesson
- Editing, even informal editing, always involves building models, even if they are difficult to lay out explicitly
- Formal edit models involve specifying all the logical relations among the variables (including impossibilities), then imposing them on a probability model like the ones we will consider in this lecture
- When an edit is made with probability 1 in such a system, the designers of the edit have declared that the expert’s prior judgment cannot be overturned by data
- In resolving the SIPP/SSA birth date example, the experts declared that, since benefit recipiency was based on audited (via birth certificates) birth dates, it should be taken as “true”; all other values were edited to agree
  - Note: there still isn’t enough information to apply this edit on receipt of SIPP data
  - It doesn’t help for those too young to claim age-eligible benefits

Edits and Missing Data Overview
- Missing data are a constant feature of both sampling frames (derived from censuses) and surveys
- Two important types are distinguished
  - Missing record (frame) or interview (survey)
  - Missing item (in either context)
- Methods differ depending upon type

Missing Records: Frame or Census
- The problem of missing records in a census or sampling frame is detection
- By definition, in these contexts the problem requires external information to solve

Census of Population and Housing
- End-to-end census test (usually two years before)
- Pre-census housing list review
- Census processing of housing units found on a block not present on the initial list
- Post-census evaluation survey
- Post-census coverage studies

Economic Censuses and the Business Register
- Discussed in lecture 4
- Start with tax records
- Unduplication in the Business Register
- Weekly updates
- Multi-units updated with Report of Organization Survey
- Multi-units discovered during the inter-censal surveys are added to the BR

Missing Records: Survey
- Non-response in a survey is normally handled within the sample design
- Follow-up (up to a limit) to obtain interview/data
- Assessment of non-response within sample strata
- Adjustment of design weights to reflect non-responses
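The last step above, adjusting design weights for non-response within strata, can be sketched as follows. This is one common form of the adjustment (a weighting-class adjustment that spreads non-respondents' weight over respondents in the same stratum), shown under assumed field names; it is not taken from any specific survey's documentation.

```python
from collections import defaultdict

def adjust_weights(units):
    """Weighting-class non-response adjustment within strata.

    units: list of dicts with hypothetical keys 'stratum', 'weight', 'responded'.
    Returns adjusted weights in input order; non-respondents get None.
    """
    total = defaultdict(float)      # sum of design weights, all sampled units
    responding = defaultdict(float) # sum of design weights, respondents only
    for u in units:
        total[u["stratum"]] += u["weight"]
        if u["responded"]:
            responding[u["stratum"]] += u["weight"]

    adjusted = []
    for u in units:
        if u["responded"]:
            # Inflate each respondent's weight by total / responding in its stratum.
            factor = total[u["stratum"]] / responding[u["stratum"]]
            adjusted.append(u["weight"] * factor)
        else:
            adjusted.append(None)
    return adjusted

# Hypothetical sample: stratum A loses one of three units to non-response.
sample = [
    {"stratum": "A", "weight": 10.0, "responded": True},
    {"stratum": "A", "weight": 10.0, "responded": True},
    {"stratum": "A", "weight": 10.0, "responded": False},
    {"stratum": "B", "weight": 20.0, "responded": True},
    {"stratum": "B", "weight": 20.0, "responded": True},
]
adjusted = adjust_weights(sample)
```

The adjusted weights of respondents in each stratum sum to the original design-weight total for that stratum, which is the point of the adjustment.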

Missing and/or Inconsistent Items
- Edits during the interview (CAPI/CATI/Web form) and during post-processing of the survey
- Edit or imputation based on the other data in the interview/case (relational edit or imputation)
- Imputation based on related information on the same respondent, household, or business unit owner
  - Longitudinal: earlier interviews
  - Relational: other members of household, other establishments in the business unit
- Imputation based on statistical modeling
  - Hot deck
  - Cold deck
  - Model-based: multiple imputation, and other model-based

Census 2000 PUMS Missing Data
- Pre-edit: When the original entry was rejected because it fell outside the range of acceptable values
- Consistency: Edited or imputed missing characteristics based on other information recorded for the person or housing unit
- Hot Deck: Supplied the missing information from the record of another person or housing unit
- Cold Deck: Supplied missing information from a predetermined distribution
- See allocation flags for details

American Community Survey (Current Procedures)
- ACS methodology chapter 10
- Nonresponse to the mail-out solicitation goes to CATI/CAPI
- Mail responses given initial automated clerical edit; failures sent to failed-edit follow-up (FEFU), which can result in re-contact (TQA, telephone questionnaire assistance) or full CATI/CAPI interview
- Whether mailed-in, FEFU, TQA, CATI, or CAPI, all data enter the data capture file at this point

American Community Survey (Current Procedures) II
- Specialized automated coding procedures for edits to race, Hispanic origin, and ancestry (can involve clerical follow-up)
- Computer-assisted clerical coding for industry and occupation
- Automated geocoding for place of birth, migration, place of work (can result in clerical follow-up); place of work particularly troublesome (47% failure rate for automated coding)
- Quality of underlying data is assessed with an “acceptability index”; too low rejects the person record

American Community Survey (Current Procedures) III
- Missing data not corrected via edit are imputed
- Two primary imputation methods:
  - Assignment: use of other reported data from respondent or household (e.g., sex, race, citizenship)
  - Hot-deck allocation: discussed later in this lecture

CPS Missing Data
- Relational edit or imputation: use other information in the record to infer value (based on expert judgment)
- Longitudinal edits: use values from the previous month if present in sample (based on rules, usually)
- Hot deck: use values from actual respondents whose data are complete for the relatively few conditioning variables

County Business Patterns
- The County and Zip code Business Patterns data are published from the Employer Business Register
- This is important because variables used in these publications are edited to publication standards
- The primary imputation method is a longitudinal edit
- http://www.census.gov/econ/cbp/methodology.htm

Economic Censuses
- Like demographic products, there are usually both edited and unedited versions of the publication variables in these files
- Publication variables (e.g., payroll, employment, sales, geography, ownership) have been edited
- Most recent files include allocation flags to indicate that a publication variable has been edited or imputed
- Many historical files include variables that have been edited or imputed but do not include the flags

QWI Missing Data Procedures (LEHD Infrastructure Files)
- Individual data
  - Multiple imputation
- Employer data
  - Relational edit
  - Bi-directional longitudinal edit
  - Single-value imputation
- Job data
  - Use multiple imputation of individual data
  - Multiple imputation of place of work
  - Use data for each place of work

BLS National Longitudinal Surveys
- Non-responses to the first wave never enter the data
- Non-responses to subsequent waves are coded as “interview missing”
- Respondents are not dropped for missing an interview
- Special procedures are used to fill critical items from missed interviews when the respondent is interviewed again
- Item non-response is coded as such

Federal Reserve Survey of Consumer Finances (SCF)
- General information on the Survey of Consumer Finances: http://www.federalreserve.gov/pubs/oss/oss2/scfindex.html
- Missing data and confidentiality protection are handled with the same multiple imputation procedure

SCF Details
- Survey collects detailed wealth information from an over-sample of wealthy households
- Item refusals and item non-response are rampant (see Kennickell article 2011)
- When there is item refusal, the interview instrument attempts to get an interval
- The reported interval is used in the missing data imputation
- When the response is deemed sensitive enough for confidentiality protection, the response is treated as a missing item (using the same interval model as above)
- First major survey released with multiple imputation

Relational Edit or Imputation
- Uses information from the same respondent
  - Example: respondent provided age but not birth date. Use age to impute birth date.
  - Example: some members of household have missing race/ethnicity data. Use other members of same household to impute race/ethnicity.

Longitudinal Edit or Imputation
- Look at the respondent’s history in the data to get the value
  - Example: respondent’s employment information missing this month. Impute employment information from previous month.
  - Example: establishment industry code missing this quarter. Impute industry code from most recently reported code.
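The second example above (fill a missing industry code with the most recently reported code) amounts to a carry-forward rule. The sketch below is a minimal version under assumed inputs: a list of quarterly values with None marking a missing report. The function name and the example codes are hypothetical.

```python
def carry_forward(history):
    """Longitudinal imputation: fill each missing period with the last reported value.

    history: values ordered by period, with None for a missing report.
    Returns (values, flags), where flags[i] is True if values[i] was imputed.
    A missing value with no earlier report stays missing.
    """
    values, flags = [], []
    last_reported = None
    for v in history:
        if v is None and last_reported is not None:
            values.append(last_reported)
            flags.append(True)
        else:
            values.append(v)
            flags.append(False)
            if v is not None:
                last_reported = v
    return values, flags

# Hypothetical quarterly industry codes for one establishment.
codes = ["5411", None, None, "5412", None]
filled, imputed = carry_forward(codes)
```

Returning the flags alongside the values mirrors the allocation flags that published files use to mark edited or imputed items.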

Cross Walks and Other Imputations
- In business data, converting an activity code (e.g., SIC) to a different activity code (e.g., NAICS) is a form of missing data imputation
  - This was the original motivation for Rubin’s work (1996 review article) using occupation codes
- In general, the two activity codes are not done simultaneously for the same entity
- Often these imputations are treated as 1-1 when they are, in fact, many-to-many

Probabilistic Methods for Cross Walks
- Inputs:
  - original codes
  - new codes
  - information for computing Pr[new code | original code, other data]
- Processing:
  - Randomly assign a new code from the appropriate conditional distribution
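The processing step above can be sketched as a random draw from a conditional distribution. In this illustration the code labels and probabilities are invented placeholders, and the distribution conditions only on the original code; a fuller model would also condition on other data, as the slide indicates.

```python
import random

# Hypothetical Pr[new code | original code]; labels and probabilities are
# placeholders, not an actual SIC-to-NAICS concordance.
CROSSWALK = {
    "SIC_A": {"NAICS_X": 0.7, "NAICS_Y": 0.3},  # one-to-many mapping
    "SIC_B": {"NAICS_Z": 1.0},                  # one-to-one mapping
}

def impute_new_code(original_code, rng):
    """Randomly assign a new code from the conditional distribution."""
    dist = CROSSWALK[original_code]
    codes = list(dist)
    weights = [dist[c] for c in codes]
    return rng.choices(codes, weights=weights, k=1)[0]

rng = random.Random(7470)  # fixed seed so the sketch is reproducible
draws = [impute_new_code("SIC_A", rng) for _ in range(1000)]
share_x = draws.count("NAICS_X") / len(draws)
```

Repeating the draw several times per record, rather than once, is what turns this into a multiple imputation of the new code.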

The Theory of Model-based Missing Data Methods
- General principles
- Missing at random
- Weighting procedures
- Imputation procedures
- Hot decks
- Introduction to model-based procedures

General Principles
- Most of this lecture is taken from Statistical Analysis with Missing Data, 2nd edition, Roderick J. A. Little and Donald B. Rubin (New York: John Wiley & Sons, 2002)
- The basic insight is that missing data should be modeled using the same probability and statistical tools that are the basis of all data analysis
- Missing data are not an anomaly to be swept under the carpet
- They are an integral part of every analysis

Missing Data Patterns
- Univariate non-response
- Multivariate non-response
- Monotone
- General
- File matching
- Latent factors, Bayesian parameters

Missing Data Mechanisms
- The complete data are defined as the matrix Y (n × K).
- The pattern of missing data is summarized by a matrix of indicator variables M (n × K).
- The data generating mechanism is summarized by the joint distribution of Y and M.

Missing Completely at Random
- Missing completely at random means that the missing data mechanism does not depend upon the data Y.
- This case is called MCAR.
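The displayed equation for this slide did not survive in the transcript. In the standard notation of Little and Rubin (2002), the MCAR condition can be written as below, where the symbol \phi for the unknown parameters of the missing data mechanism is an assumption of this sketch rather than something recovered from the slide.

```latex
f(M \mid Y, \phi) = f(M \mid \phi) \quad \text{for all } Y, \phi
```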

Missing at Random
- Partition Y into observed and unobserved parts.
- Missing at random means that the distribution of M depends only on the observed parts of Y.
- Called MAR.
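The slide's equations are missing from the transcript. Following the standard statement in Little and Rubin (2002), the partition and the MAR condition can be written as below; the subscripts obs and mis and the parameter symbol \phi are notation assumed for this sketch.

```latex
Y = (Y_{\mathrm{obs}}, Y_{\mathrm{mis}}), \qquad
f(M \mid Y, \phi) = f(M \mid Y_{\mathrm{obs}}, \phi)
\quad \text{for all } Y_{\mathrm{mis}}, \phi
```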

Not Missing at Random
- If the condition for MAR fails, then we say that the data are not missing at random, NMAR.
- Censoring and more elaborate behavioral models often fall into this category.

The Rubin and Little Taxonomy
- Analysis of the complete records only
- Weighting procedures
- Imputation-based procedures
- Model-based procedures

Analysis of Complete Records Only
- Assumes that the data are MCAR
- Only appropriate for small amounts of missing data
- Used to be common in economics, less so in sociology
- Now very rare

Weighting Procedures
- Modify the design weights to correct for missing records
- Provide an item weight (e.g., earnings and income weights in the CPS) that corrects for missing data on that variable
  - See Bollinger and Hirsch discussion later in lecture
- See complete case and weighted complete case discussion in Rubin and Little

Imputation-based Procedures
- Missing values are filled in and the resulting “completed” data are analyzed
  - Hot deck
  - Mean imputation
  - Regression imputation
- Some imputation procedures (e.g., Rubin’s multiple imputation) are really model-based procedures

Imputation Based on Statistical Modeling
- Hot deck: use the data from related cases in the same survey to impute missing items (usually as a group)
- Cold deck: use a fixed probability model to impute the missing items
- Multiple imputation: use the predictive distribution of the missing item, given all the other items, to impute the missing data

Current Population Survey
- Census Bureau imputation procedures:
  - Relational imputation
  - Longitudinal edit
  - Hot deck allocation procedure
- Winkler full edit/imputation system

Notes: In the technical documentation of methodology for the Current Population Survey, the Census Bureau describes three imputation methods used with CPS data: (1) relational imputation, (2) longitudinal edit, and (3) hot deck allocation. The first two are more edit procedures than statistical imputation procedures. The example given for relational imputation is that when race is missing, it is taken from another member of the household and imputed to the person. A longitudinal edit goes back to a previous interview for the person and takes the answer from there. The hot deck allocation procedure assigns a missing value from a record with similar characteristics in the same geographic region. Hot decks are always defined by age, race, and sex. Other characteristics used in hot decks vary depending on the nature of the question being referenced.

“Hot Deck” Allocation
- Labor Force Status
  - Employed
  - Unemployed
  - Not in the Labor Force
(Thanks to Warren Brown)

Notes: Allison describes this procedure as non-parametric, a term for statistical methods that do not assume a particular distributional form such as the normal distribution; these are also called distribution-free statistics. The most common examples of non-parametric statistics are the median (instead of the mean) as a measure of central tendency, and the semi-interquartile range (instead of the variance and standard deviation) as a description of variability. Example from the CPS: records are processed and answers to Labor Force Status are checked. This is a categorical variable, and the valid responses are Employed, Unemployed, and Not in the Labor Force.

“Hot Deck” Allocation

                 Black      Non-Black
Male   16-24
       25+                  ID #0062
Female 16-24
       25+

Notes: Hypothetically, the "hot deck" holds records based on Age × Race × Sex. Age: 16-24, 25+. Race: Black, Non-Black. Sex: Male, Female. There are 8 cells. A White male, aged 65, is processed and has a valid answer for labor force status: he is employed. His record then bumps the record in that cell and occupies that cell in the hot deck.

“Hot Deck” Allocation

                 Black      Non-Black
Male   16-24     ID #3502   ID #1241
       25+       ID #8177   ID #0062
Female 16-24     ID #9923   ID #5923
       25+       ID #4396   ID #2271

Notes: The other cells are filled in a similar manner. Every time a person with complete answers to the questions on Sex, Age, Race, and Labor Force Status is processed, that record bumps whoever occupied the “Hot Deck” cell. When a record with missing Labor Force Status comes along, it is assigned the Labor Force Status of the person whose record is in the appropriate cell. The advantages of the "hot deck" allocation procedure are that the assigned answers are random, are valid actual responses from a similar person (based on other characteristics), and are sensitive to differences between geographic regions.
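The bump-and-assign procedure described in these notes can be sketched as a sequential hot deck. This is an illustration of the hypothetical Age × Race × Sex deck from the slides, not the Census Bureau's production allocation code; the field names, the `hot_deck` function, and the records with IDs 7001 and 7002 are assumptions made for the example.

```python
def cell_key(rec):
    """Map a record to its hot deck cell: Sex x Age group x Race group (8 cells)."""
    age_group = "16-24" if rec["age"] <= 24 else "25+"  # universe assumed to be 16+
    race_group = "Black" if rec["race"] == "Black" else "Non-Black"
    return (rec["sex"], age_group, race_group)

def hot_deck(records):
    """Process records in file order, allocating missing labor force status (lfs)."""
    deck = {}  # cell -> most recently processed complete record (the donor)
    out = []
    for rec in records:
        rec = dict(rec)  # work on a copy; leave the input untouched
        key = cell_key(rec)
        if rec["lfs"] is not None:
            deck[key] = rec             # complete record bumps the cell's occupant
            rec["allocated"] = False
        elif key in deck:
            donor = deck[key]
            rec["lfs"] = donor["lfs"]   # take the donor's reported value
            rec["donor_id"] = donor["id"]
            rec["allocated"] = True     # allocation flag
        else:
            rec["allocated"] = False    # no donor yet; the item stays missing
        out.append(rec)
    return out

records = [
    {"id": "0062", "sex": "Male", "age": 65, "race": "White", "lfs": "Employed"},
    {"id": "7001", "sex": "Male", "age": 40, "race": "White", "lfs": None},
    {"id": "7002", "sex": "Female", "age": 20, "race": "Black", "lfs": None},
]
result = hot_deck(records)
```

In this sketch a record whose cell is still empty stays missing; a production system would typically start each cell from a cold deck value so that every record can be allocated.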

CPS Example: Effects of hot-deck imputation of labor force status

Public Use Statistics

Allocated v. Unallocated

Bollinger and Hirsch CPS Missing Data
- Studies the effects of the particular assumptions in the CPS hot deck imputer on wage regressions
- Census Bureau uses too few variables in its hot deck model
- Inclusion of additional variables improves the accuracy of the missing data models
- See Bollinger and Hirsch (2006)

Model-based Procedures
- A probability model based on p(Y, M) forms the basis for the analysis
- This probability model is used as the basis for estimation of parameters or effects of interest
- Some general-purpose model-based procedures are designed to be combined with likelihood functions that are not specified in advance

Little and Rubin’s Principles
- Imputations should be
  - Conditioned on observed variables
  - Multivariate
  - Draws from a predictive distribution
- Single imputation methods do not provide a means to correct standard errors for estimation error


Applications to Complicated Data
- Computational formulas for MI data
- Examples of building multiply-imputed data files

Computational Formulas
- Assume that you want to estimate something as a function of the data, Q(Y)
- The formulas account for the contribution of missing data to the variance
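The slide text does not reproduce the formulas themselves. As a minimal sketch, assuming the standard combining rules for multiply-imputed data (Rubin's rules) are the ones meant, the point estimate is the average over the M implicates and the total variance adds the between-implicate variance, inflated by (1 + 1/M), to the average within-implicate variance. The function name and the input numbers are hypothetical.

```python
from statistics import mean, variance


def combine_implicates(estimates, variances):
    """Combine M completed-data results with Rubin's rules (assumed here).

    estimates: the estimate of Q(Y) computed on each implicate
    variances: the sampling variance of that estimate on each implicate
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two implicates and matching lengths")
    q_bar = mean(estimates)          # combined point estimate
    u_bar = mean(variances)          # average within-implicate variance
    b = variance(estimates)          # between-implicate variance, M-1 denominator
    total = u_bar + (1 + 1 / m) * b  # total variance
    return q_bar, u_bar, b, total


# Hypothetical inputs: four implicates of one estimate.
estimates = [10.0, 12.0, 11.0, 13.0]
variances = [4.0, 4.0, 4.0, 4.0]
q_bar, u_bar, b, total = combine_implicates(estimates, variances)
print(q_bar, u_bar, b, total)
```

The between-implicate term is the part of the variance that exists only because data were missing; with no missing data every implicate would give the same estimate and that term would be zero.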

Examples
- Survey of Consumer Finances
- Quarterly Workforce Indicators

Survey of Consumer Finances
- Codebook description of missing data procedures
- Sensitive survey because of the large oversample of very wealthy households (based on an IRS list of the wealthiest households in the U.S.)
- The missing data and confidentiality protection procedures are both based on modeling the complex set of choices given to respondents about how much wealth information to reveal
- When a respondent does not want to provide an answer, bracketed interval choices are presented
- Missing data are multiply imputed by modeling the conditional distribution of the bracketed intervals, given covariates, under an ignorability assumption; see Kennickell (2001)
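To illustrate what imputing inside a reported bracket means, here is a minimal sketch. It is not the SCF procedure described in Kennickell (2001). It assumes, purely for illustration, that the conditional distribution given covariates is normal with a known mean and standard deviation, and it draws each implicate from that distribution truncated to the respondent's bracket by inverse-CDF sampling. The function name and all numbers are hypothetical.

```python
import random
from statistics import NormalDist


def draw_within_bracket(mu, sigma, lower, upper, rng):
    """Draw once from Normal(mu, sigma) truncated to [lower, upper].

    Assumes the bracket has non-negligible probability under the model,
    so that the CDF values are strictly between 0 and 1.
    """
    if lower >= upper:
        raise ValueError("bracket must satisfy lower < upper")
    dist = NormalDist(mu, sigma)
    p_lo, p_hi = dist.cdf(lower), dist.cdf(upper)
    u = rng.uniform(p_lo, p_hi)               # uniform on the bracket's CDF range
    value = dist.inv_cdf(u)                   # map back to the outcome scale
    return min(max(value, lower), upper)      # guard against rounding at the edges


# Hypothetical case: model predicts 120 (thousand dollars) with sd 60,
# and the respondent reported the bracket 100 to 250.
rng = random.Random(7470)
implicates = [draw_within_bracket(120.0, 60.0, 100.0, 250.0, rng) for _ in range(5)]
print(implicates)
```

Every implicate respects the bracket the respondent chose, while the spread across implicates reflects what remains unknown inside it.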

How are the QWIs Built?
- Raw input files:
  - UI wage records
  - QCEW/ES-202 report
  - Decennial census and ACS files
  - SSA-supplied administrative records
  - Census-derived administrative record household address files
  - LEHD geo-coding system
- Processed data files:
  - Individual characteristics
  - Employer characteristics
  - Employment history with earnings

Processing the Input Files
- Each quarter the complete history of every individual, every establishment, and every job is processed through the production system
- Missing data on the individuals are multiply imputed at the national level; the posterior predictive distribution is stored
- Missing data on the employment history record are multiply imputed each quarter from a fresh posterior predictive distribution
- Missing data on the employer characteristics are singly imputed (explanation to follow)

Examples of Missing Data Problems
- Missing demographic data on the national individual file (birth date, sex, race, ethnicity, place of residence, and education)
- Multiple imputations using information from the individual, establishment, and employment history files
- Model estimation component updated irregularly
- Imputations performed once for those in the estimation universe, then once when a new PIK is encountered in the production system
- This process was used on the current QWI and for the S2011 snapshot

A Very Difficult Missing Data Problem
- The employment history records only code employer to the UI account level
- Establishment characteristics (industry, geo-codes) are missing for multi-unit establishments
- The establishment (within UI account) is multiply imputed using a dynamic multi-stage probability model
- Estimation of the posterior predictive distribution depends on the existence of a state with establishments coded on the UI wage record (MN)

How Is It Done?
- Every quarter the QWI processes over 6 billion employment histories (unique person-employer pairs) covering 1990 to 2015
- Approximately 30-40% of these histories require multiple employer imputations
- So, the system does more than 25 billion full information imputations every quarter
- The information used for the imputations is current: it includes all of the historical information for the person and every establishment associated with that person's UI account
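The three figures on this slide can be tied together with a rough back-of-the-envelope calculation. The sketch below treats "over 6 billion" and "more than 25 billion" as if they were exact, which they are not, so the result is only an order of magnitude for how many imputations each affected history receives. The slide does not state the number of implicates, and the function name is hypothetical.

```python
def implied_draws_per_history(histories, share_affected, imputations):
    """Imputations per affected history implied by the slide's round numbers."""
    affected = histories * share_affected
    return imputations / affected


histories = 6e9      # "over 6 billion" employment histories, taken at face value
imputations = 25e9   # "more than 25 billion" imputations, taken at face value

for share in (0.30, 0.40):
    affected = histories * share
    draws = implied_draws_per_history(histories, share, imputations)
    print(f"{share:.0%} affected: {affected / 1e9:.1f} billion histories, "
          f"about {draws:.1f} imputations each")
```

Under these assumptions, roughly 1.8 to 2.4 billion histories need an employer imputation, and each receives on the order of ten or more draws.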

Does It Work?
- Full assessment of total jobs, beginning-of-quarter employment, full-quarter employment, monthly earnings of full-quarter employed, total payroll
- Older assessment using the state that codes both (MN)
- Summary slide follows

Cumulative Effect of All QWI Edits and Imputations

Columns: b = Beginning Employment, f = Full-quarter Employment, z_w3 = Earnings

| Entity Size* | Average Employment | Avg. Z-score (b) | Avg. Z-score (f) | Avg. Z-score (z_w3) | Missingness Rate (b) | Missingness Rate (f) | Missingness Rate (z_w3) | Sample Size |
|---|---|---|---|---|---|---|---|---|
| all | 437 | 8.79 | 8.09 | 10.15 | 27.1% | | 31.2% | 237,741 |
| 1-9 | 4 | 1.60 | 1.46 | 2.99 | 33.1% | 33.3% | 43.6% | 95,520 |
| 10-99 | 35 | 4.84 | 4.40 | 6.69 | 24.4% | | 25.7% | 84,621 |
| 100-249 | 160 | 11.08 | 10.14 | 13.52 | 21.8% | 21.6% | 20.5% | 21,187 |
| 250-499 | 354 | 16.66 | 15.29 | 19.37 | 20.9% | | 19.1% | 11,972 |
| 500-999 | 707 | 23.59 | 21.68 | 26.03 | 20.7% | 20.6% | 17.9% | 8,787 |
| 1000+ | 5538 | 56.67 | 52.61 | 52.11 | 20.2% | 20.1% | 16.1% | 15,654 |

*Entity is county x NAICS sector x race x ethnicity for 2008:q3.
Z-score is the ratio of the QWI estimate to the square root of its total variation (within and between implicate components).
Missingness rate is the ratio of the between variance to the total variance.
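The two footnote definitions can be written out as a short sketch. It assumes the total variation is the sum of a within-implicate and a between-implicate component, as the footnote says. Whether the between component already carries a finite-implicate correction is not stated on the slide, so the sketch takes it as given. Function names and inputs are hypothetical and are not taken from the table.

```python
from math import sqrt


def z_score(estimate, within, between):
    """Estimate divided by the square root of its total variation."""
    total = within + between
    return estimate / sqrt(total)


def missingness_rate(within, between):
    """Share of the total variance that comes from the between component."""
    total = within + between
    return between / total


# Hypothetical cell: estimate 8.0, within variance 3.0, between variance 1.0.
z = z_score(8.0, within=3.0, between=1.0)
rate = missingness_rate(within=3.0, between=1.0)
print(z, rate)
```

Read this way, a missingness rate of about 27% means that roughly a quarter of the estimate's total variance is attributable to the edits and imputations rather than to the within-implicate component.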