Multiple Imputation Approaches for Right-Censored Wages in the German IAB Employment Register European Conference on Quality in Official Statistics 2008,

Presentation transcript:

Multiple Imputation Approaches for Right-Censored Wages in the German IAB Employment Register
European Conference on Quality in Official Statistics 2008, 10 July 2008
Thomas Büttner, Institute for Employment Research (IAB)
Susanne Rässler, University of Bamberg

2 Overview
1. Motivation
2. Imputing Censored Wages
3. A Multiple Imputation Approach Considering Heteroscedasticity
4. Simulation Study
5. Results

3 Motivation
- Wage data are of interest for a large number of research questions:
  - Analyzing the gender wage gap
  - Measuring overeducation
  - …
- To address questions of this kind, two types of data are often used:
  - Survey data
  - Administrative data from the social security system
- Advantages of administrative data:
  - Large number of observations
  - No response burden
  - No interviewer bias

4 The German IAB Employment Sample (1)
- Sample drawn from the IAB register data (employment history), supplemented by information on benefit recipients
- Administrative data
- Represents 80 percent of the employees in Germany
- 2 percent random sample of all employees covered by social security
- 1.3 million persons
- Problem: Wages are recorded only up to the contribution limit of the social security system, so the wage information is censored at this limit

5 The German IAB Employment Sample (2)
Figure: Daily wages in logs in Western Germany (2000). Source: IAB Employment Sample

6 Censored Wages
- There are several possibilities to deal with censored wages
- Advantage of multiple imputation: the imputed data set can be used for a multiplicity of questions and analyses, e.g. average wages of certain groups, regional wage dispersion, effects of a modification of the contribution limit, …
- The conventional approaches assume homoscedasticity of the residuals
- Since the dispersion of income is generally smaller in lower wage categories than in higher ones, the assumption of homoscedasticity is highly questionable for wage data

7 Our Project
- Step 1: Developing approaches considering heteroscedasticity
- Step 2: Simulation study to confirm the necessity and validity of the new approaches
- Step 3: Using uncensored wage information from an income survey (German Structure of Earnings Survey, GSES) to validate the approaches
- Step 4: Using external wage information for the imputation model

8 Imputation Models
- Single imputation based on a homoscedastic tobit model
- Single imputation using a heteroscedastic model
- Multiple imputation based on a homoscedastic tobit model
- Multiple imputation considering heteroscedasticity

9 Single Imputation
- Single imputation based on a homoscedastic tobit model: the wage is observed only if it does not exceed a, where a is the contribution limit; otherwise a itself is recorded
- Imputation by random draws according to the parameters estimated with a tobit model
- As the true values are above the contribution limit, the draws are taken from a truncated normal distribution
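
The draw from a truncated normal distribution described on this slide can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the example numbers, and the use of inverse-CDF sampling are assumptions, and the fitted means and sigma stand in for the output of a tobit estimation that is not shown here.

```python
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for cdf and inverse cdf


def draw_truncated_normal(mu, sigma, a, rng):
    """Draw one value from N(mu, sigma^2) restricted to values above a."""
    p_low = _STD.cdf((a - mu) / sigma)  # probability mass at or below the limit
    # Uniform draw on [p_low, 1), clamped so inv_cdf never sees exactly 0 or 1.
    u = p_low + (1.0 - p_low) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    # max(...) guards against rounding when the limit lies far in the tail.
    return max(mu + sigma * _STD.inv_cdf(u), a)


def impute_censored(wages, fitted_means, sigma, a, rng):
    """Replace wages recorded at the limit a by draws above a; keep the rest."""
    return [
        draw_truncated_normal(m, sigma, a, rng) if w >= a else w
        for w, m in zip(wages, fitted_means)
    ]


rng = random.Random(0)
log_wages = [4.1, 4.6, 5.0, 5.0]  # hypothetical log wages, censored at a = 5.0
fitted = [4.2, 4.5, 4.9, 5.1]     # hypothetical tobit predictions
imputed = impute_censored(log_wages, fitted, sigma=0.3, a=5.0, rng=rng)
```

Uncensored wages pass through unchanged, and every imputed value lies at or above the limit, which is the property the slide relies on.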

10 Single Imputation Considering Heteroscedasticity
- Development of an imputation approach considering heteroscedasticity (single imputation), based on a GLS model for truncated variables
- Imputation by draws from a truncated normal distribution using individual variances
- Single imputation may lead to biased variance estimates (Little/Rubin 1987)

11 Multiple Imputation (1)
1. Impute the data set m times
2. Analyze each data set
3. Combine the results

12 Multiple Imputation (2)
1. To start the MCMC-based imputation, we first adopt starting values for the parameters from an ML tobit estimation
2. In the imputation step, we randomly draw values for the missing wages from a truncated distribution
3. Based on the imputed data set, we compute an OLS regression
4. After this, we produce random draws for the parameters according to their complete-data posterior distribution
5. We repeat the imputation step and the posterior step 5,000 times and use draws from this chain to obtain 5 complete data sets
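
The five steps above can be sketched as a small data augmentation loop. This is an illustration under stated assumptions, not the authors' code: it uses a single centered covariate so that the OLS quantities have closed forms, a noninformative prior proportional to 1/sigma^2 for the posterior step, hand-picked starting values in place of an ML tobit fit, and it keeps every (n_iter // m)-th completed data set, a spacing the transcript does not state. All names and numbers are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean

_STD = NormalDist()


def draw_above(mu, sigma, a, rng):
    """Inverse-CDF draw from N(mu, sigma^2) restricted to values above a."""
    p_low = _STD.cdf((a - mu) / sigma)
    u = p_low + (1.0 - p_low) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return max(mu + sigma * _STD.inv_cdf(u), a)


def ols(xc, y):
    """OLS of y on an intercept and a centered covariate xc."""
    alpha = fmean(y)
    sxx = sum(v * v for v in xc)
    beta = sum(xv * yv for xv, yv in zip(xc, y)) / sxx
    rss = sum((yv - alpha - beta * xv) ** 2 for xv, yv in zip(xc, y))
    return alpha, beta, rss, sxx


def multiply_impute(x, y, censored, a, start, n_iter=5000, m=5, seed=0):
    """Alternate imputation and posterior steps; keep m completed data sets."""
    rng = random.Random(seed)
    n = len(y)
    xbar = fmean(x)
    xc = [v - xbar for v in x]        # centering makes X'X diagonal
    alpha, beta, sigma = start        # step 1: starting values (assumed given)
    y_cur, kept = list(y), []
    for it in range(1, n_iter + 1):
        # Step 2: redraw the censored wages above the limit.
        for i in range(n):
            if censored[i]:
                y_cur[i] = draw_above(alpha + beta * xc[i], sigma, a, rng)
        # Step 3: OLS regression on the completed data.
        a_hat, b_hat, rss, sxx = ols(xc, y_cur)
        # Step 4: posterior draws, sigma^2 = RSS / chi2(n - 2), then alpha, beta.
        chi2 = rng.gammavariate((n - 2) / 2.0, 2.0)
        sigma = math.sqrt(rss / chi2)
        alpha = rng.gauss(a_hat, sigma / math.sqrt(n))
        beta = rng.gauss(b_hat, sigma / math.sqrt(sxx))
        # Step 5: keep every (n_iter // m)-th completed data set (assumption).
        if it % (n_iter // m) == 0:
            kept.append(list(y_cur))
    return kept


data_rng = random.Random(1)
x = [data_rng.uniform(20, 60) for _ in range(200)]   # hypothetical covariate
y_true = [3.0 + 0.03 * v + data_rng.gauss(0, 0.3) for v in x]
limit = 4.6                                          # hypothetical censoring limit
censored = [v > limit for v in y_true]
y_obs = [min(v, limit) for v in y_true]
data_sets = multiply_impute(x, y_obs, censored, limit,
                            start=(4.2, 0.03, 0.3), n_iter=500, m=5)
```

Spacing the retained data sets along the chain is one common way to reduce dependence between imputations; the intercept here refers to the centered covariate.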

13 Imputation Model Considering Heteroscedasticity (1)
Based on the multiple imputation approach, with additional draws for the parameters describing the functional form of the heteroscedasticity
1. We now start the imputation by adopting starting values from a GLS estimation
2. We can then draw values for the missing wages from a truncated distribution using individual variances
3. A GLS regression is then computed based on the imputed data set

14 Imputation Model Considering Heteroscedasticity (2)
4. Afterwards we perform random draws for the parameters describing the heteroscedasticity
5. Now the regression parameters can be drawn randomly according to their complete-data posterior distribution
6. Steps 2 to 5 are repeated 5,000 times and draws from this chain are used to obtain 5 complete data sets

15 Simulation Study
- IAB Employment Sample 2000 (30 June 2000)
- Only male persons from Western Germany
- Only full-time workers covered by social security
- About 210,000 persons, of whom about 23,000 (11 percent) have an income above the contribution limit

16 Creating Complete Data Sets
- As the IAB Employment Sample is censored, we first have to create complete data sets
- We create two different data sets:
  - one using an approach presuming homoscedasticity
  - another using an approach considering heteroscedasticity of the residuals

17 Simulation Study
1. IABS with censored wages
2. Creating complete data sets (with and without heteroscedasticity), calculating β
3. Defining a new limit
4. Drawing a random sample of 10 percent
5. Imputing the wage using the different approaches, computing a regression
6. Calculating the fraction of confidence intervals containing the true parameter β for the different approaches

18 Results of the Homoscedastic Data Set
Table (HOM): coverage by approach. Columns: complete data, SI, SI-Het, MI, MI-Het. Rows: educ (six rows), level (four rows), age, sqage, nation, cons.

19 Results of the Heteroscedastic Data Set
Table (HET): coverage by approach. Columns: complete data, SI, SI-Het, MI, MI-Het. Rows: educ (six rows), level (four rows), age, sqage, nation, cons.

20 Simulation study using external wage information (1)
- Scientific Use File of the German Structure of Earnings Survey (GSES) 2001
- Linked employer-employee data set
- Information on about [...] establishments and about [...] employees
- Information on:
  - individuals (e.g. sex, age, education)
  - jobs (e.g. occupation, job level, working times)
  - income (e.g. gross wage, net wage, income taxes)
  - establishments

21 Simulation study using external wage information (2)
- Selection of a sample comparable to the first simulation study
- Complete data set containing [...] persons
- Censoring at the 85 percent quantile

22 Simulation study using external wage information (3)

23 Outlook
- Using uncensored information from survey data for the imputation model
- Inference under uncongeniality
- Validation of our approach by reproducing different studies using imputed data

24 References
Bender, S., Haas, A. and Klose, C. (2000). IAB Employment Subsample Opportunities for Analysis Provided by Anonymised Subsample. IZA Discussion Paper 117, IZA Bonn.
Buchinsky, M. (1994). Changes in the U.S. wage structure 1963–1987: Application of quantile regression. Econometrica 62(2), 405–458.
Gartner, H. (2005). The imputation of wages above the contribution limit with the German IAB employment sample. FDZ Methodenreport 2/2005.
Gartner, H. and Rässler, S. (2005). Analyzing the changing gender wage gap based on multiply imputed right censored wages. IAB Discussion Paper 05/2005.
Jensen, U., Gartner, H. and Rässler, S. (2006). Measuring overeducation with earnings frontiers and multiply imputed censored income data. IAB Discussion Paper 11/2006.
Khan, S. and Powell, J.L. (2001). Two-step estimation of semiparametric censored regression models. Journal of Econometrics 103, 73–110.
Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing Data. John Wiley, New York, 1st edn.
Meng, X.L. (1994). Multiple Imputation Inferences with Uncongenial Sources of Input. Statistical Science 9.
Powell, J.L. (1986). Symmetrically Trimmed Least Squares Estimation for Tobit Models. Econometrica 54(6).
Rässler, S. (2006). Der Einsatz von Missing Data Techniken in der Arbeitsmarktforschung des IAB. Allgemeines Statistisches Archiv.
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. J. Wiley & Sons, New York.
Schafer, J.L. and Yucel, R.M. (2002). Computational Strategies for Multivariate Linear Mixed-Effects Models With Missing Values. Journal of Computational and Graphical Statistics.
Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. Chapman & Hall, New York.

25 Combining Rules
- The multiple imputation point estimate is the average of the m complete-data point estimates
- The associated variance estimate has two components:
  - The within-imputation variance is the average of the complete-data variance estimates
  - The between-imputation variance is the variance of the complete-data point estimates
- The total variance is the within-imputation variance plus the between-imputation variance multiplied by (1 + 1/m)
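
The combining rules on this slide (Rubin 1987) can be written out in a few lines. The function name and the example numbers below are hypothetical; the between-imputation variance uses the usual divisor m - 1.

```python
from statistics import fmean, variance


def combine(estimates, variances):
    """Combine m complete-data estimates and their variance estimates."""
    m = len(estimates)
    point = fmean(estimates)           # MI point estimate
    within = fmean(variances)          # within-imputation variance
    between = variance(estimates)      # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return point, within, between, total


# Hypothetical results from m = 5 imputed data sets for one coefficient.
est = [0.10, 0.12, 0.11, 0.09, 0.13]
var = [0.0004, 0.0005, 0.0004, 0.0005, 0.0004]
point, within, between, total = combine(est, var)
```

The factor (1 + 1/m) accounts for the finite number of imputations; with m = 5 the between-imputation variance is inflated by 20 percent.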

26 First Results
- The simulation study using these three approaches shows the necessity of a new method that multiply imputes the missing wages and does not presume homoscedasticity
- Second step: Development of a new multiple imputation approach considering heteroscedasticity
- Finally, we perform a new simulation study to compare the four approaches under different situations in order to confirm the necessity as well as the validity of the new approach

27 Multiple Imputation
We use a two-step procedure for each of the m draws:
1. We perform random draws of the parameters according to their observed-data posterior distribution
2. We make random draws of Y_mis according to their conditional predictive distribution
The first step is problematic, as the observed-data posteriors are often not standard distributions. Therefore we draw alternately from the conditional predictive distribution of Y_mis and the complete-data posterior of the parameters, so that the desired distributions are obtained as the stationary distributions of Markov chains.

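28 Principle of Multiple Imputation (2)
- Advantage: considers the additional uncertainty
- Principle: based on independent random draws from the posterior predictive distribution of the missing data given the observed data
- Problem: it may be difficult to draw from this distribution directly
- But: it can be expressed as the conditional predictive distribution of the missing data given the parameters, averaged over the observed-data posterior of the parameters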
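
The decomposition behind this slide is the standard one for multiple imputation. In the notation assumed here (the symbols θ for the parameters and Y_obs for the observed data are not taken from the transcript), the posterior predictive distribution of the missing data is:

```latex
f(Y_{mis} \mid Y_{obs})
  = \int f(Y_{mis} \mid Y_{obs}, \theta)\, f(\theta \mid Y_{obs})\, d\theta
```

Drawing θ from f(θ | Y_obs) and then Y_mis from f(Y_mis | Y_obs, θ) therefore yields a draw from f(Y_mis | Y_obs), which is the two-step procedure used for each of the m imputations.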

29 Simulation Study (2)
- The following simulation procedure is repeated 1,000 times:
  1. drawing a random sample
  2. deleting the wages above the limit
  3. imputing the data using the different approaches
  4. computing a regression
  5. calculating the confidence intervals
- Coverage: the fraction of confidence intervals containing the true parameter β for the different approaches
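
The coverage measure defined above can be computed as in the following sketch. It assumes normal-approximation 95 percent intervals (estimate plus or minus 1.96 standard errors), which is a simplification: intervals from multiply imputed data are usually based on a t distribution. The function name and the numbers are hypothetical.

```python
def coverage(estimates, std_errors, true_beta, z=1.96):
    """Fraction of intervals [b - z*se, b + z*se] that contain true_beta."""
    hits = sum(
        1
        for b, se in zip(estimates, std_errors)
        if b - z * se <= true_beta <= b + z * se
    )
    return hits / len(estimates)


# Hypothetical estimates from three simulation runs; the true value is 1.1.
runs_est = [1.0, 1.5, 0.2]
runs_se = [0.1, 0.1, 0.1]
share = coverage(runs_est, runs_se, true_beta=1.1)
```

In the study, one such fraction would be computed per coefficient and per imputation approach over the 1,000 repetitions; a well-calibrated approach gives a coverage close to the nominal level.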

30 Summary of Results
- With a homoscedastic structure of the residuals, the two multiple imputation approaches can be expected to give imputation results of the same quality
- With heteroscedasticity, the simulation study confirms the necessity of our new approach
- Since the structure of the wages in the IAB employment register is heteroscedastic, the results of the simulation study call for the new approach when imputing the missing wage information in this register