Published by Erick Cummings. Modified over 9 years ago.
1
Missing Values C5.2 Data Screening
2
Missing Data
Use the summary() function to check your dataset for missing data:
summary(notypos)
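As a minimal sketch of what that check looks like, the example below builds a small made-up data frame called `toy` (a hypothetical stand-in, not the `notypos` dataset from the slides). For numeric columns that contain missing values, `summary()` adds an NA count to its output, and `colSums(is.na())` gives the same counts as a plain vector.

```r
# Hypothetical toy data; the slides use a dataset called notypos instead.
toy <- data.frame(
  q1 = c(1, 2, NA, 4, 5),
  q2 = c(3, NA, NA, 2, 1),
  q3 = c(5, 4, 3, 2, 1)
)

# Numeric columns with missing values get an extra "NA's" entry.
summary(toy)

# The same counts as a named vector, one entry per column.
na_counts <- colSums(is.na(toy))
na_counts
```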
3
Missing Data
Missing data is an important problem. First, ask yourself, “Why is this data missing?”
– Because you forgot to enter it?
– Because there’s a typo?
– Because people skipped one question? Or the whole end of the scale?
4
Missing Data
Two types of missing data:
– MCAR – missing completely at random (you want this)
– MNAR – missing not at random (eek!)
There are ways to test for the type, but usually you can see it:
– Randomly missing data appears all across your dataset.
– If everyone missed question 7, that’s not random.
– To look, click on the dataset or use the View() function.
5
Missing Data
– MCAR – probably caused by skipping a question or missing a trial.
– MNAR – the question itself may be causing the problem. For instance, what if you surveyed campus about alcohol abuse? What does it mean if everyone skips the same question?
6
Missing Data
How much can I have?
– It depends on your sample size: in large datasets, less than 5% is OK.
– In small samples, you may need to collect more data.
Please note: there is a difference between “missing data” and “did not finish the experiment”.
7
Missing Data
How do I check if it’s going to be a big deal?
– Try running your analysis on the dataset with missing data versus the dataset with the missing data filled in.
– In R that’s easy! You just swap the name of the dataset you are using, since we are saving each version separately as we go.
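A rough sketch of that comparison, under loud assumptions: the data are simulated, the names `withmiss` and `filled` are hypothetical, and the filled-in version uses simple mean substitution only as a stand-in for whatever replacement method you end up using. The point is only that the same analysis call is run twice with a different dataset name.

```r
set.seed(1)

# Hypothetical data: two groups of 20 with a continuous score.
withmiss <- data.frame(
  group = factor(rep(c("a", "b"), each = 20)),
  score = rnorm(40)
)
withmiss$score[c(3, 17, 25)] <- NA   # knock out a few values

# Stand-in "filled in" copy (mean substitution, for illustration only).
filled <- withmiss
filled$score[is.na(filled$score)] <- mean(withmiss$score, na.rm = TRUE)

# Same analysis, different dataset name.
t_miss <- t.test(score ~ group, data = withmiss)
t_fill <- t.test(score ~ group, data = filled)

# Compare the two p-values side by side.
c(with_missing = t_miss$p.value, filled_in = t_fill$p.value)
```

If the two results lead to the same conclusion, the missing data is probably not a big deal for that analysis.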
8
Missing Data
Deleting people / variables
You can exclude people “pairwise” or “listwise”:
– Pairwise – only excludes people from the analyses that use the variables they are missing
– Listwise – excludes them from all analyses
Variables – if it’s just an extraneous variable (like GPA), you can simply delete the variable.
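The difference between the two deletion styles can be sketched with base R on a small made-up data frame (`toy` is hypothetical). `na.omit()` drops a person from everything, while `cor()` with `use = "pairwise.complete.obs"` drops a person only from the correlations that involve their missing variable.

```r
# Hypothetical toy data: person 1 is missing z, person 5 is missing x.
toy <- data.frame(
  x = c(1, 2, 3, 4, NA, 6),
  y = c(2, 1, 4, 3, 6, 5),
  z = c(NA, 2, 3, 5, 4, 6)
)

# Listwise: anyone with any missing value is removed from all analyses.
listwise <- na.omit(toy)
nrow(listwise)

# Pairwise: each correlation uses everyone who has both of its variables.
cor_listwise <- cor(toy, use = "complete.obs")
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

# How many people contribute to the x-y correlation under each approach.
n_xy_pairwise <- sum(complete.cases(toy[, c("x", "y")]))
n_xy_listwise <- nrow(listwise)

# Deleting an extraneous variable outright.
toy_no_z <- toy[, c("x", "y")]
```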
9
Missing Data
What if you don’t want to delete people (because they are a special population or you can’t recruit others)?
– Several estimation methods can “fill in” missing data.
10
Missing Data
Mean substitution – the old way to fill in missing data
– Conservative – doesn’t change the mean values used to find significant differences
– Does shrink the variance, which may change the results of significance tests when a lot of data is missing
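A small sketch of both points, using a made-up vector (`x` and `x_filled` are hypothetical names): the mean stays where it was, while the variance gets smaller because the filled-in values sit exactly on the mean.

```r
# Hypothetical scores with two missing values.
x <- c(2, 4, NA, 8, NA, 6)

# Mean substitution: replace each NA with the mean of the observed values.
x_filled <- x
x_filled[is.na(x_filled)] <- mean(x, na.rm = TRUE)

# The mean is unchanged by the substitution.
mean_before <- mean(x, na.rm = TRUE)
mean_after  <- mean(x_filled)

# The variance shrinks: same squared deviations, larger n.
var_before <- var(x, na.rm = TRUE)
var_after  <- var(x_filled)

c(mean_before = mean_before, mean_after = mean_after,
  var_before = var_before, var_after = var_after)
```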
11
Missing Data
Multiple imputation / expectation maximization – now considered the best way to replace missing data
– Creates a set of expected values for each missing point
– Using matrix algebra, the program estimates the probability of each value and picks the most likely one
12
Missing Data
DO NOT mean-replace categorical variables.
– You can’t be 1.5 gender.
– So, either leave them out OR pairwise eliminate them (i.e., eliminate them only for the analyses they are used in).
13
Missing Data
Data
– Categorical / IVs → STOP
– Continuous / DVs
  – MNAR → STOP
  – MCAR
    – More than 5% missing → STOP
    – Less than 5% missing → MICE
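The decision chart on this slide can be written out as a tiny function. This is only an illustrative sketch: the function name `replace_decision` and its arguments are hypothetical, and treating exactly 5% as acceptable is an assumption, since the chart does not say which side the boundary falls on.

```r
# Hypothetical helper mirroring the chart; not part of the slides or of mice.
replace_decision <- function(is_continuous, is_mcar, percent_missing) {
  if (!is_continuous) return("STOP")        # categorical variables / IVs
  if (!is_mcar) return("STOP")              # data missing not at random
  if (percent_missing > 5) return("STOP")   # assumption: exactly 5% passes
  "MICE"                                    # continuous, MCAR, little missing
}

replace_decision(is_continuous = TRUE,  is_mcar = TRUE,  percent_missing = 2)
replace_decision(is_continuous = FALSE, is_mcar = TRUE,  percent_missing = 2)
replace_decision(is_continuous = TRUE,  is_mcar = TRUE,  percent_missing = 12)
```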
14
Missing Data
Figure out what you can replace.
– First, figure out the percent missing by column.
– Then, figure out the percent missing by row.
Let’s write a function!
15
Missing Data
Make up our own percent missing function.
percentmiss = function(x){  ##save a new function under the name percentmiss
  sum(is.na(x)) /           ##total up the number of NA values, then divide
    length(x) *             ##by the number of values, then multiply
    100                     ##by 100 to get the percent
}                           ##close function
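A quick way to convince yourself the function works is to try it on a short vector where you can count the answer by hand. The sketch below defines the function in lowercase, matching how the later slides call it; the test vectors are made up.

```r
# Percent of values in x that are NA.
percentmiss <- function(x) {
  sum(is.na(x)) / length(x) * 100
}

# Two of the four values are missing.
half_missing <- percentmiss(c(1, NA, 3, NA))
half_missing

# No missing values at all.
none_missing <- percentmiss(c(1, 2, 3, 4))
none_missing
```

Keeping each operator at the end of a line (rather than the start of the next) matters if you split the calculation across lines, because R treats a line that already forms a complete expression as finished.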
16
Missing Data
Let’s use apply to get percent missing by columns and rows.
apply(notypos, 2, percentmiss) ##columns
– We will have to exclude several of these columns.
17
Missing Data
Now, let’s use apply to get percent missing by rows.
apply(notypos, 1, percentmiss) ##rows
Too much info! Save the row percentages and tabulate them instead:
missing = apply(notypos, 1, percentmiss)
table(missing)
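Here is a small sketch of the column and row versions side by side, on a made-up data frame (`toy`, `bycolumn`, and `byrow` are hypothetical names). The second argument to `apply()` picks the direction: 2 runs the function down each column, 1 runs it across each row.

```r
# Percent of values in x that are NA.
percentmiss <- function(x) {
  sum(is.na(x)) / length(x) * 100
}

# Hypothetical toy data: 4 people, 4 questions.
toy <- data.frame(
  q1 = c(1, NA, 3, 4),
  q2 = c(NA, NA, 2, 1),
  q3 = c(5, 4, 3, 2),
  q4 = c(2, 2, 1, 5)
)

bycolumn <- apply(toy, 2, percentmiss)   # one value per question
byrow    <- apply(toy, 1, percentmiss)   # one value per person

bycolumn
table(byrow)   # how many people fall at each percent-missing level
```

The row version is the one to save for subsetting people, because it has one entry per row of the dataset.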
18
Missing Data
Install the mice package. Load the mice library.
Select only the data that you want to run mice on:
– Eliminate bad rows.
– Eliminate bad columns.
– Bring them all back together.
19
Missing Data
##subset out the bad rows
replacepeople = notypos[ missing < 6, ]
##note we are going to fudge a little bit
dontpeople = notypos[ missing >= 6, ]
20
Missing Data
##figure out the columns to exclude
replacecolumn = replacepeople[ , -c(1, 3, 13)]
dontcolumn = replacepeople[ , c(1, 3, 13)]
21
Missing Data
Now run mice! Save the result in a temporary placeholder:
tempnomiss = mice(DATASET) ##general form
tempnomiss = mice(replacecolumn) ##our data
This function figures out what to replace, and how, for you.
22
Missing Data
Now, put the replaced data back into your dataset.
nomiss = complete(tempnomiss, 1)
##general form: complete(result you saved from mice, which imputed dataset to use, from 1 up to the number of imputations, 5 by default)
summary(nomiss)
23
Missing Data
Put everything back together.
– We want to take our replaced data
– And add back in the columns we couldn’t replace (dontcolumn)
filledin_none = cbind(dontcolumn, nomiss)
– And add back in the rows we couldn’t replace (dontpeople)
filledin_missing = rbind(dontpeople, filledin_none)
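The whole sequence from the last few slides can be sketched end to end on simulated data. Everything about the data here is an assumption made for illustration: `toy` stands in for `notypos`, it has one `id` column that should not be imputed plus 20 numeric items, and only column 1 is excluded rather than columns 1, 3, and 13. The `printFlag` and `seed` arguments to `mice()` are used only to keep the output quiet and repeatable.

```r
# install.packages("mice")   # run once if the package is not installed
library(mice)

set.seed(42)
percentmiss <- function(x) sum(is.na(x)) / length(x) * 100

# Hypothetical data: an id column plus 20 numeric items (X1 to X20).
toy <- data.frame(id = 1:60, matrix(rnorm(60 * 20), ncol = 20))
toy$X2[c(4, 20)] <- NA                        # a little missing data
toy$X4[33] <- NA
toy[7, c("X1", "X2", "X3", "X4")] <- NA       # one person missing a lot

# Rows: keep people under the (fudged) 6% cutoff for replacement.
missing <- apply(toy, 1, percentmiss)
replacepeople <- toy[missing < 6, ]
dontpeople    <- toy[missing >= 6, ]

# Columns: set aside the id column, which should not be imputed.
replacecolumn <- replacepeople[, -1]
dontcolumn    <- replacepeople[, 1, drop = FALSE]

# Run mice and pull out the first imputed dataset.
tempnomiss <- mice(replacecolumn, printFlag = FALSE, seed = 42)
nomiss <- mice::complete(tempnomiss, 1)

# Put the columns, then the rows, back together.
filledin_none    <- cbind(dontcolumn, nomiss)
filledin_missing <- rbind(dontpeople, filledin_none)

summary(nomiss)
```

Writing `mice::complete()` with the package prefix is a defensive choice, since other packages also export a function named `complete()`. Note that `filledin_missing` still contains the missing values for the people who were set aside, which is why the slides keep both `filledin_none` and `filledin_missing`.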