Outliers Chapter 5.3 Data Screening. Outliers can Bias a Parameter Estimate.

Presentation transcript:

Outliers Chapter 5.3 Data Screening

Outliers can Bias a Parameter Estimate

…and the Error associated with that Estimate
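
To make the two slides above concrete, here is a minimal sketch with made-up scores (the values and the helper `se()` are assumptions for illustration, not from the slides). It shows how one extreme case shifts the mean and inflates its standard error.

```r
# Hypothetical scores; the added value mimics a data entry error (50 typed as 500)
scores_clean <- c(48, 52, 50, 47, 53, 49, 51, 50)
scores_outlier <- c(scores_clean, 500)

# Standard error of the mean: sd / sqrt(n) (illustrative helper)
se <- function(x) sd(x) / sqrt(length(x))

mean_clean <- mean(scores_clean)
mean_outlier <- mean(scores_outlier)
se_clean <- se(scores_clean)
se_outlier <- se(scores_outlier)

# Compare the parameter estimate and its error with and without the extreme case
print(c(mean_clean = mean_clean, mean_outlier = mean_outlier))
print(c(se_clean = se_clean, se_outlier = se_outlier))
```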

Outliers Outlier – case with an extreme value on one variable or on multiple variables Why? – Data input error – Not a population you meant to sample – From the population, but the distribution has really long tails and very extreme values

Outliers Outliers – Two Types Univariate – for basic univariate statistics – Use these when you have ONE DV or Y variable. Multivariate – for some univariate statistics and all multivariate statistics – Use these when you have multiple continuous variables or lots of DVs.

Outliers Univariate In a normal z-distribution, anyone with a z-score beyond +/- 3 is in roughly the most extreme 0.3% of the population. Therefore, we want to eliminate people whose scores are SO far away from the mean that they are very strange.
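
The z-score rule above could be sketched in R roughly as follows. The vector `x` is made-up data and the names `z` and `flagged` are assumptions for illustration; `scale()` is the base R function that centers and standardizes a variable.

```r
# Hypothetical scores for one variable; the last value is deliberately extreme
x <- c(10, 12, 11, 9, 10, 11, 12, 10, 9, 11,
       10, 12, 11, 10, 9, 11, 10, 12, 10, 11, 60)

# Convert to z-scores: (score - mean) / sd
z <- as.numeric(scale(x))

# Flag cases more than 3 standard deviations from the mean
flagged <- abs(z) > 3

# Show which cases were flagged and their raw scores
print(which(flagged))
print(x[flagged])
```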

Outliers Univariate outliers are fine and dandy, but you may have lots of data and don’t want to do each column one at a time. – Plus, the multivariate outlier analysis works just as well if it’s one column or 500, so let’s just do that.

Outliers Multivariate – Z-scores measure the distance from one mean; now we need some way to measure distance from all the means at once. Mahalanobis distance – Creates a distance from the centroid (the point made up of every variable's mean)

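Outliers Mahalanobis The centroid is the point you get by plotting the means of all the variables together (a 3D picture when there are three variables); Mahalanobis measures each case's distance from that point – Similar to Euclidean distance, but scaled by the variances and covariances of the variables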
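
As a rough illustration of the idea above, the sketch below computes the distance from the centroid by hand and compares it with base R's `mahalanobis()`, which returns the squared distance. The data frame `dat` and the names `d2_builtin`, `d2_manual`, and `euclid2` are made up for this example.

```r
# Hypothetical data: two continuous variables, six cases
dat <- data.frame(
  x1 = c(2, 4, 6, 8, 10, 12),
  x2 = c(1, 3, 2, 5, 4, 6)
)

center <- colMeans(dat)  # the centroid: one mean per variable
covmat <- cov(dat)       # variances and covariances of the variables

# Built-in: squared Mahalanobis distance of each row from the centroid
d2_builtin <- mahalanobis(dat, center, covmat)

# By hand: subtract the centroid, then weight by the inverse covariance matrix
diffs <- sweep(as.matrix(dat), 2, center)
d2_manual <- rowSums((diffs %*% solve(covmat)) * diffs)

# Squared Euclidean distance for comparison: same idea, but no covariance scaling
euclid2 <- rowSums(diffs^2)

print(cbind(d2_builtin, d2_manual, euclid2))
```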

Outliers Mahalanobis No set cutoff rule – Use a chi-square table. – DF = # of variables (DVs, variables that you used to calculate Mahalanobis) – Use p < .001. NOTE: DF here has NOTHING to do with the DF for hypothesis testing.

Outliers So do I delete them? Yes: they are far away from the middle! No: they may not affect your analysis! It depends: I need the sample size! SO?! – Try it with and without them. See what happens. FISH!
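
One way to "try it with and without them" is to fit the same model twice and compare the estimates. The sketch below uses made-up data and a simple regression purely as an example analysis; all variable names are assumptions.

```r
# Hypothetical data: y is close to 2 * x
x <- 1:10
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.1)

# Same data plus one extreme case
x_out <- c(x, 11)
y_out <- c(y, -20)

# Fit the same model without and with the extreme case
fit_without <- lm(y ~ x)
fit_with <- lm(y_out ~ x_out)

slope_without <- unname(coef(fit_without)[2])
slope_with <- unname(coef(fit_with)[2])

# See what happens to the estimate
print(c(slope_without = slope_without, slope_with = slope_with))
```

If the two results tell the same story, the decision matters less; if they differ a lot, that is worth reporting.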

Outliers Important side notes: – For ANOVA, t-tests, correlation: you will use a fake regression analysis – it’s considered fake because it’s not the real analysis, just a way to get the information you need to do data screening.

Outliers Important side notes: – For regression based tests: you can run the real regression analysis to get the same information. The rules are altered slightly, so make sure you make notes in the regression section on what’s different. You will also use other regression based values for this analysis.

Outliers Important side note: – Many functions in R have their own data screening options. This guide is for global screening not specific to one analysis.

Outliers First, figure out the factor columns, as all columns need to be int or num. – filledin_none[, -c(1,2)] – Use that dataset code in the next function.
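
The slide drops the first two columns by position. As an alternative sketch, the non-numeric columns can be found programmatically; the data frame `survey` and its columns are hypothetical stand-ins, not the `filledin_none` data.

```r
# Hypothetical dataset: two factor columns followed by two numeric columns
survey <- data.frame(
  id = factor(c("a", "b", "c", "d")),
  group = factor(c("ctl", "trt", "ctl", "trt")),
  q1 = c(3L, 4L, 2L, 5L),
  q2 = c(1.5, 2.5, 3.0, 4.5)
)

# str() shows the type of each column (Factor, int, num)
str(survey)

# TRUE for int/num columns, FALSE for factors
numeric_cols <- sapply(survey, is.numeric)

# Keep only the columns that can go into the Mahalanobis calculation
screen_data <- survey[, numeric_cols, drop = FALSE]
print(names(screen_data))
```

Here `survey[, -c(1,2)]` would select the same columns; the `sapply()` version just avoids counting column positions by hand.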

Outliers Mahalanobis function mahalanobis( – dataset name, – colMeans(dataset name, na.rm = TRUE), – cov(dataset name, use = "pairwise.complete.obs") – )

Outliers mahal = mahalanobis(filledin_none[, -c(1,2)], colMeans(filledin_none[, -c(1,2)], na.rm = TRUE), cov(filledin_none[, -c(1,2)], use="pairwise.complete.obs"))

Outliers Now, let’s get rid of people with bad scores – But what is a bad score? – Use a chi-square table. – DF = # of variables (DVs, variables that you used to calculate Mahalanobis) – Use p<.001 Oh, let’s make R do it.

Outliers Use the qchisq function, which finds the cutoff score for you. – qchisq(1 - p value, number of columns) – In general: cutoff = qchisq(.999, ncol(dataset)) – For this dataset: cutoff = qchisq(.999, ncol(filledin_none[, -c(1,2)]))

Outliers So, let’s see how many are bad – summary(mahal < cutoff) – the FALSE count is the number of outliers. Let’s get rid of those peeps – noout = filledin_none[mahal < cutoff, ]
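
Putting the steps above together, here is a minimal end-to-end sketch on simulated data. The data frame `sim`, its columns, and the planted extreme case in row 100 are all made up; only the sequence of calls follows the slides.

```r
set.seed(42)

# Hypothetical dataset: two factor columns, then three continuous variables
sim <- data.frame(
  id = factor(1:100),
  group = factor(rep(c("a", "b"), each = 50)),
  v1 = rnorm(100),
  v2 = rnorm(100),
  v3 = rnorm(100)
)

# Plant one case that is extreme on all three variables
sim[100, c("v1", "v2", "v3")] <- c(10, -10, 10)

# Step 1: drop the factor columns
screen <- sim[, -c(1, 2)]

# Step 2: squared Mahalanobis distance of each case from the centroid
mahal <- mahalanobis(screen,
                     colMeans(screen, na.rm = TRUE),
                     cov(screen, use = "pairwise.complete.obs"))

# Step 3: chi-square cutoff at p < .001 with df = number of variables
cutoff <- qchisq(1 - .001, ncol(screen))

# Step 4: count the outliers (FALSE) and keep everyone else
print(summary(mahal < cutoff))
noout <- sim[mahal < cutoff, ]
print(nrow(noout))
```

One caution, assuming the data may still contain missing values: a row with an NA gets an NA distance, and indexing with an NA produces a row of NAs, so it may be safer to check `sum(is.na(mahal))` before subsetting.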