Missing Data and Multiple Imputation By Jon Atwood Collaborator LISA.

Missing Data and Multiple Imputation By Jon Atwood Collaborator LISA

In this course, we will… Examine missing data in a general sense; what it is, where it comes from, what types exist, etc. Explain the problems of certain common methods for dealing with missing data, such as complete case analysis and single imputation methods Study multiple imputation (MI), learning generally how it works Apply MI to real data sets using SAS and R

So what is missing data? Missing data is information that we want to know, but don’t It can come in many forms, from people not answering questions on surveys, to inaccurate recordings of the height of plants that need to be discarded, to canceled runs in a driving experiment due to rain We could also consider something we never even thought of to be missing data

The key question is, why is the data missing? What mechanism is it that contributes to, or is associated with, the probability of a data point being absent? Can it be explained by our observed data or not? The answers drastically affect what we can ultimately do to compensate for the missingness

Perhaps the most common method of handling missing data is “Complete Case Analysis” Simply delete all cases that have any missing values at all, so you are left only with observations with all variables observed Computer software often does this by default when performing analysis (regression, for example) This is the simplest way to handle missing data. In some cases, will work fine However, loss of sample will lead to variance larger than reflected by the size of your data. May bias your sample

And now a closer look… We use as an example a data set of body fat percentage in men, and the circumference of various body parts (Penrose et al., 1985) Does the circumference of certain body parts predict body fat percentage? Here are some significant figures from a regression model with body fat percentage as the response PredictorEstimateS.E.P-Value Age0.06260.03130.0463 Neck-0.47280.22940.0403 Forearm0.453150.19790.0229 Wrist-1.61810.53230.0026

In this case, the data is complete, with sample size 252 But suppose about 5 percent of the participants had missing values? 10 percent? 20 percent? What if we performed complete case analysis and removed those who had missing values? First let’s examine the effect if we do this if when the data is MCAR I randomly removed cases from the data set, reran the analysis and stored the p-values. I did this 1,000 times, and plotted the 1,000 p-values in boxplots

For about 5 percent (n=13) deleted AgeNeck Forearm Wrist

For about 20 percent (n=50) deleted Age Neck Forearm Wrist

We seem to change our conclusions somewhat With age and neck, it seems we fail to reject more often than not The other two, we still reject most of the time This is assuming the missing subjects do mot differ from the non-missing. This would cause bias.

Types Of Missingness Missing Completely at Random (MCAR) Missing at Random (MAR) Missing Not at Random (MNAR) or Not Missing at Random (NMAR)

What Distinguishes Each Type? Suppose you’re loitering outside an elementary school one day… You then find out that students just received their report cards for the first quarter For some reason, you start asking passing students their English grades. Of course, you don’t force them to tell you or anything. You also write down their gender and hair color

A data set from this activity might look like this… Hair ColorGenderGrade RedMA BrownFA BlackFB MA BrownM M F BlackMB MB BrownFA BlackF BrownFC RedM FA BrownMA BlackMA 7 students received As, 3 received Bs, and 1 a C No failing!! But 5 students did not reveal their grade

To determine the type of missingness, look at what influences the probability of a missing point Hair ColorGenderGrade 000 000 000 000 00 1 00 1 00 1 000 000 000 00 1 000 00 1 000 000 000 Here is the same data set, but the values are replaced with a “0” if the data point is observed and “1” if it is not We’ll call this the “Missing Matrix.” Obviously there are many more possible missing matricies The relevant question is, for any one of these data points, what is the probability that the point is equal to “1” ?

Upcoming Quiz! What type of missingness do the grades exhibit?

Missing Completely at Random (MCAR) If this probability is not dependent on any of the data, observed or unobserved, then the data is Missing Completely at Random (MCAR) To be more precise, suppose that X is the observed data and Y is the unobserved data. Suppose we label our “Missing Matrix” as R. Then, if the data are MCAR, P(R|X,Y)=P(R)

Example… Suppose you are running an experiment on plants grown in pots, when suddenly you have a nervous breakdown and smash some of the pots In your insanity, you will probably not likely choose the plants to smash in a well-defined pattern, such as height age, etc. Hence, the missing values generated from your act of madness will likely fall into the MCAR category

Another way to think of MCAR Supposed we had to quickly go to the bathroom and do number 2 In our desperation, we use the data as our toilet paper Presumably, some of our data would be smeared with…you know what The data smeared can be said to be a random subset of our data

In practice, MCAR is usually not realistic A completely random mechanism for generating missingness in your data set just isn’t very realistic Usually, missing data is missing for a reason. Maybe older people are less likely to answer web-delivered questions on surveys, or in longitudinal studies people may die before they have completed the entire study, etc., companies may be reluctant to reveal financial information, etc.

Missing at Random (MAR) If the probability of your missing data is dependent on the observed data but not the unobserved data, your missing observations are said to be Missing at Random (MAR) Symbolically, P(R|X,Y)=P(R|X), so that the unobserved data does not contribute to the probability of observing our “Missing Matrix.” Random is somewhat of a misnomer. MAR means that there is a mechanism that is associated with whether the data is missing, and it has to do with our observed data

Example… Usually, missing data is missing for a reason. Maybe older people are less likely to answer web-delivered questions on surveys, or in longitudinal studies people may die before they have completed the entire study, etc., companies may be reluctant to reveal financial information, etc.

The key point to MAR is… We can still model the missing mechanism and compensate for it The multiple imputation methods we will be talking about today assume MAR For example, if age is known, you can model missingness as a function of age Whether or not missing data is MAR or the next type, Missing Not at Random (MNAR) is not testable. Requires you to understand your data (after)

Missing Not at Random (MNAR) The missingness has something to do with the missing value itself It has been said that smokers are not as likely to answer the question, “Do you smoke?” Said to be nonignorable Although there are some proposed ways to handle MNAR data, these are more complicated and are beyond the scope of this class

So, returning to our school example… Do you think this missing data is likely MCAR, MAR or MNAR? Hair ColorGenderGrade RedMA BrownFA BlackFB MA BrownM M F BlackMB MB BrownFA BlackF BrownFC RedM FA BrownMA BlackMA

Add overall GPA Now the data looks like this Does this change anything? Hair ColorGPAGenderGrade Red 3.4 MA Brown 3.6 FA Black 3.7 FB Black 3.9 MA Brown 2.5 M Brown 3.2 M Brown 3.0 F Black 2.9 MB Black 3.3 MB Brown 4.0 FA Black 3.65 F Brown 3.4 FC Red 2.2 M Red 3.8 FA Brown 3.8 MA Black 3.67 MA

So what do we do about missing data?

Single Imputation Methods Impute Once Mean Imputation: imputing the average from observed cases for all missing values of a variable Hot Deck Imputation: imputing a value from another subject, or “donor,” that is most like the subject in terms of observed variables Some others All fundamentally impose too much precision. We have uncertainty in what the unobserved values actually are

Multiple Imputation Using a single imputation approach does not account for an obvious source of uncertainty By imputing only once, we are treating the imputed value as if we observed it when we did not Therefore, we have uncertainty in what the observed value would have been Multiple Imputation (MI) takes this into account by generating several random values for each missing data point

The General Process 1. A value is randomly drawn for the unobserved data points based on a predetermined model from the observed data 2. Repeat step 1 some number of times, say N, resulting in N imputed data sets 3. Each imputed data set is analyzed separately 4. The separate analyses are pooled together for a unifying analysis that takes into account all the imputed data sets

To illustrate… 322 43 ? 566 25 ? 845 Here’s some data XY

Oh no, we have two missing values! Whatever shall we do?!

Let’s Impute Some Data! 322 43 566 25 845 X Y 5.5 8 First, we’ll use a predictive distribution of the missing values, given the observed values, to make random draws of the observed values and fill them in. Now we have one imputed data set!

Let’s Set That Aside… 322 43 566 25 845 X Y 5.5 8 And Do it Again!!!! 322 43 566 25 845 X Y 7.2 1.1

Set that aside…

Now we have 2 imputed data sets!!! 322 43 566 25 845 X Y 7.2 1.1 322 43 566 25 845 X Y 5.5 8

Do this m number of times for m imputed data sets

Inference with Multiple Imputation Now that we have our imputed data sets, how do we make use of them? (suppose in this case m = 2) 322 43 566 25 845 X Y 7.2 1.1 322 43 566 25 845 X Y 5.5 8

We analyze each separately 322 43 566 25 845 X Y 7.2 1.1 322 43 566 25 845 X Y 5.5 8 Slope4.932 S.E.4.287 Slope-0.8245 S.E.6.1845

Finally we pool the analyses together

Predicting the missing data given the observed data

Imagine, then, that we establish some distribution of parameters of interest before considering the data, P(θ), where θ is the set of parameters we are trying to estimate. This is called the prior distribution of θ. Then, we establish a distribution P(X obs |θ) We can finally use Bayes Theorem to establish P(θ|X obs ), make random draws for θ, and use these draws to make predictions of Y miss

How many imputations do we need?

General Methods for Multiple Imputation Regression based Chained Equations (MICE) or Fully Conditional Specification (FCS) Markov Chain Monte Carlo (MCMC)

We will look at part of a data set of CEO bonuses, with other predictor variables (sales, advanced degrees, age, etc.)

Regression Approach in SAS Uses predictive mean matching, which means that the actual imputed value is one chosen randomly from a set of observed values whose predicted value is close to the predicted value of the missing observation Is meant to try and keep imputed values plausible Based on the imputation model we build, posterior random draws are made for the regression parameters These draws are used to construct the predicted values for the missing observation

What parameters?

SAS example We will look at part of a data set of CEO bonuses, with other predictor variables (sales, advanced degrees, age, etc.) Since we plan to do regression on bonuses, and bonuses may have large variability as they get higher, we will take the log of bonuses before we do the imputation

Here’s the code for the entire process: proc mi data=bob1 out=mob seed=123 nimpute=10; monotone regpmm(logb=stock sales years mba mastphd age); var stock sales years mba mastphd age logb; run; proc reg data=mob outest=mp covout noprint; model logb=stock sales years mba mastphd age; by _imputation_; run; Imputation Code Regression Code proc mianalyze data=mp; modeleffects stock sales years mba mastphd age; run; Pooled Analysis Code

Here is the output ParameterVariance BetweenWithinTotal stock1.2212E-053.5E-054.9E-05 sales1.15E-121.02E- 11 1.15E-11 years6.388E-062.5E-053.2E-05 mba0.0014230.007940.0095 mastphd0.0012990.006330.00776 age1.0735E-052.5E-053.7E-05 ParameterEstimateStd Error 95% Confiden ce Limits Theta0t for H0:Pr > |t| Paramet er=Thet a0 stock-0.005560.0069 7 -0.01940.008240-0.80.4267 sales2.42E-053.4E- 06 0.0000 2 3.1E-0507.16<.0001 years0.0176940.0056 2 0.00660.0287903.150.0019 mba0.0143430.0974 9 -0.17740.2061200.150.8831 mastphd-0.001820.0880 9 -0.17530.171620-0.020.9835 age0.0148960.0060 8 0.0028 1 0.0269802.450.0163

Classification Variables Suppose that we want to impute a variable that takes one of two values, “male” or “female”, “smoker” or “non smoker”, “dead” or “alive” Or what if there are even more categories, such as dislike, like, and love? What if they are nominal, like chocolate, vanilla, and strawberry? We can hardly use continuous methods in these cases

We can use the “Logistic Regression Method”

This method also works for ordinal data Can be performed sequentially in SAS on multiple variables one at a time if data is monotone missing, which means an observation missing implies observations missing in all the rest of the variables for that subject Discriminant Function Method can be used for nominal variables

SAS Example I took the CEO data set and removed 57 values (no particular reason I chose 57) The following code runs the imputation proc mi data=bob nimpute=5 seed=231 out=lid; class mastphd; var age stock sales years mba mastphd; monotone logistic (mastphd=years sales stock age mba); run;

And we get this… WARNING: The maximum likelihood estimates for the logistic regression with observed observations may not exist for variable MastPHD. The posterior predictive distribution of the parameters used in the imputation process is based on the maximum likelihood estimates in the last maximum likelihood iteration.

The answer lies in the follies of logistic regression, as well as the redundancy of our model Table of MastPHD by MBA MastPH D(MastP HD) MBA(MBA) 01Total 0372 0 50.070 1000 67.640 1178193371 23.9625.9849.93 47.9852.02 32.36100 Total550193743 74.0225.98100 We have “perfect classification” in that no one without a masters/phd has an mba If we have perfect classification like this, then the algorithm that does logistic regression will not converge This is something you need to be careful about in general

Now that we’ve removed MBA, here’s the code proc mi data=bob nimpute=5 seed=231 out=lid; class mastphd; var age stock sales years mastphd; monotone logistic (mastphd=years sales stock age); run; proc logistic data=lid outest=rain covout noprint descending; class mastphd; model mastphd=age stock sales years; by _imputation_; run; proc mianalyze data=rain; modeleffects age stock sales years; run; Imputation Code Logistic Regression Code Pooled Analysis Code

So here are the results Variance Information ParameterVarianceDFRelativeFractionRelative IncreaseMissingEfficiency BetweenWithinTotalin Variance Information age7.2861E-050.000150.0002328.1850.6044150.4166920.923073 stock1.3164E-050.000420.000443084.80.0373550.0366340.992727 sales8.39E-135.41E- 11 5.51E-11119900.0186050.0184290.996328 years1.8885E-050.000140.00016204.210.1627330.1482590.971202 Parameter Estimates Parameter EstimateStd Error 95% Confidence Li mits DFMinimumMaximumTheta0t for H0: Pr > |t| Param eter=T heta0 age -0.0305920.01524-0.06180.0006128.185-0.040456-0.022810-2.010.0543 stock -0.0871360.02095-0.1282-0.04613084.8-0.090364-0.0821140-4.16<.0001 sales 3.554E-067.4E-06-1E-050.00002119900.0000026890.00000458100.480.6321 years 0.0052740.01273-0.01980.03036204.210.0017050.01007500.410.679

What about when more than one variable has missing values?

Multiple Imputation by Chained Equations (MICE) 1.Provides initial imputations of missing values 2.For one particular variable, removes them again 3.Builds model based on other variables, and uses posterior predictive distribution to impute random values 4.Does the same thing for another variable, only imputed values for first variable remain 5.Completes for all variables, repeats the process many times 6.This makes one imputed data set. Does so m times

Works well in simulations, handles many types of variables at once Can take a lot of time, and theoretical justification is not particularly strong

R Example This data set, “Nhanes”, has age group, body mas index, hypertensive status and serum cholesterol Body mass index and serum cholesterol are continuous, while hypertensive status (yes or no) is binary and age group is ordinal We will use the package “mice” and the function “mice” to complete the imputation and analysis

Code You need to install the ‘mice’ package nhanes$hyp<-as.factor(nhanes$hyp) bord<-mice(nhanes,m=40,seed=132, me=c("polr","pmm","logreg","norm")) complete(bord,12) bit<-with(bord,lm(chl~age+bmi+hyp)) summary(pool(bit))

Output est se t df Pr(>|t|) (Intercept) -39.104424 88.462185 -0.4420468 9.341691 0.66851235 age 40.287101 18.378020 2.1921350 6.268912 0.06894168 bmi 6.091045 2.610044 2.3336941 11.449700 0.03876241 hyp 5.410891 29.405394 0.1840102 8.038752 0.85856252 Body mass is a significant predictor of cholesterol, and age nearly is, but hypertensive status is not

Markov Chain Monte Carlo Approach Here, the process gives us data estimations via Markov chains A Markov chain holds the property that the probability of the next link in the chain depends only on the current link Basically, we perform a bunch of steps, and the probability of each step depends only on the previous step Eventually, theory holds that under certain conditions, the steps will converge to the state that we are trying to estimate, called the stationary distribution

But, there’s a catch… This approach assumes multivariate normality

Summary Though handling missing data is ultimately just a nuisance necessity and not the point of the analysis, it pays to give it the consideration it is due Whether or not you use multiple imputation, single imputation, or complete case analysis depends on how much missing data you have, and how big the sample is Having the actual data is still always better

Thank you!

Missing Data and Multiple Imputation By Jon Atwood Collaborator LISA.

Similar presentations

Presentation on theme: "Missing Data and Multiple Imputation By Jon Atwood Collaborator LISA."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Missing Data and Multiple Imputation By Jon Atwood Collaborator LISA.

Similar presentations

Presentation on theme: "Missing Data and Multiple Imputation By Jon Atwood Collaborator LISA."— Presentation transcript:

Similar presentations

About project

Feedback