Outliers Chapter 5.3 Data Screening
Outliers can Bias a Parameter Estimate
…and the Error associated with that Estimate
Outliers Outlier – case with extreme value on one variable or multiple variables Why? – Data input error – Not a population you meant to sample – From the population but has really long tails and very extreme values
Outliers Outliers – Two Types Univariate – for basic univariate statistics – Use these when you have ONE DV or Y variable. Multivariate – for some univariate statistics and all multivariate statistics – Use these when you have multiple continuous variables or lots of DVs.
Outliers Univariate In a normal z-distribution anyone who has a z- score of +/- 3 is less than.2% of the population. Therefore, we want to eliminate people who’s scores are SO far away from the mean that they are very strange.
Outliers Univariate outliers are fine and dandy, but you may have lots of data and don’t want to do each column one at a time. – Plus, the multivariate outlier analysis works just as well if it’s one column or 500, so let’s just do that.
Outliers Multivariate – Now we need some way to measure distance from the mean (because Z-scores are the distance from the mean), but the mean of means (or all the means at once!) Mahalanobis distance – Creates a distance from the centroid (mean of means)
Outliers Mahalanobis Centroid is created by plotting the 3D picture of the means of all the means and measuring the distance – Similar to Euclidean distance
Outliers Mahalanobis No set cut off rule – Use a chi-square table. – DF = # of variables (DVs, variables that you used to calculate Mahalanobis) – Use p<.001 NOTE: DF here has NOTHING to do with the DF for hypothesis testing.
Outliers So do I delete them? Yes: they are far away from the middle! No: they may not affect your analysis! It depends: I need the sample size! SO?! – Try it with and without them. See what happens. FISH!
Outliers Important side notes: – For ANOVA, t-tests, correlation: you will use a fake regression analyses – it’s considered fake because it’s not the real analysis, just a way to get the information you need to do data screening.
Outliers Important side notes: – For regression based tests: you can run the real regression analysis to get the same information. The rules are altered slightly, so make sure you make notes in the regression section on what’s different. You will also use other regression based values for this analysis.
Outliers Important side note: – Many functions in R have their own data screening options. This guide is for global screening not specific to one analysis.
Outliers First, figure out the factor columns, as all columns need to be int or num. – filledin_none[, -c(1,2)] – Use that dataset code in the next function.
Outliers Mahalanobis function mahalanobis( – Dataset name, – colMeans(dataset name, na.rm = TRUE), – cov(datasetname, use = “pairwise.complete.obs) – )
Outliers mahal = mahalanobis(filledin_none[, -c(1,2)], colMeans(filledin_none[, -c(1,2)], na.rm = TRUE), cov(filledin_none[, -c(1,2)], use="pairwise.complete.obs"))
Outliers Now, let’s get rid of people with bad scores – But what is a bad score? – Use a chi-square table. – DF = # of variables (DVs, variables that you used to calculate Mahalanobis) – Use p<.001 Oh, let’s make R do it.
Outliers Use the qchisq function, which finds the cut off score for you. – qchisq(1-pvalue, Number of columns) cutoff = qchisq(.999,ncol(dataset)) cutoff = qchisq(.999,ncol(filledin_none[, - c(1,2)]))
Outliers So, let’s see how many are bad – summary(mahal < cutoff) Let’s get rid of those peeps – noout = filledin_none[ mahal < cutoff, ]