Assumptions 5.4 Data Screening

Assumptions
Parametric tests based on the normal distribution assume:
– Independence
– Additivity and linearity
– Normality (something or other)
– Homogeneity (sphericity) and homoscedasticity

Independence
The errors in your model should not be related to each other. If this assumption is violated:
– Confidence intervals and significance tests will be invalid.
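These slides don't show a direct check for independence; one common option (my addition, not the slides' method) is the Durbin-Watson test from the car package, which tests a model's residuals for autocorrelation:

# Sketch: assumes 'fake' is the regression model fit later in these slides,
# so run this after that model exists.
library(car)
durbinWatsonTest(fake)  # a statistic near 2 suggests independent errors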

Additivity and Linearity
The outcome variable is, in reality, linearly related to any predictors. If you have several predictors, their combined effect is best described by adding their effects together. If this assumption is not met, your model is invalid.

Additivity
One problem with additivity is multicollinearity/singularity:
– The idea that variables are too correlated to be used together, as they do not each add something to the model.

Correlation
This analysis is only necessary if you have multiple continuous variables: regression, multivariate statistics, repeated measures, etc. You want to make sure your variables aren't so correlated that the math explodes.

Correlation
– Multicollinearity: r > .90
– Singularity: r > .95

Correlation
Run a bivariate correlation on all the variables and look at the scores to see if any are too high. If so:
– Combine them (average, total) – see the sketch below
– Use only one of them
Basically, you do not want to use the same variable twice: it reduces power and interpretability.
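A hypothetical sketch of the combine option, using made-up column names var1 and var2 (substitute your own overly correlated pair):

# average the two highly correlated variables into one composite,
# then drop the originals so the same information isn't in the model twice
noout$composite = rowMeans(noout[ , c("var1", "var2")], na.rm = TRUE)
noout = noout[ , !(names(noout) %in% c("var1", "var2"))]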

Additivity: Check
Use the cor() function to check correlations:
– correlations = cor(dataset name with no factors, use = "pairwise.complete.obs")
– correlations = cor(noout[ , -c(1,2)], use = "pairwise.complete.obs")

Additivity: Check
Whoa! Yikes! That's a lot of numbers. Use the symnum() function to view it compactly:
– symnum(correlations)
– Look for a * or B (the symbols marking the highest correlations).
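If the symnum() grid is still hard to scan, an optional sketch (my addition, not from the slides) lists the offending pairs directly:

# pull out the variable pairs above the r = .90 multicollinearity cutoff
high = which(abs(correlations) > .90 & upper.tri(correlations), arr.ind = TRUE)
data.frame(var1 = rownames(correlations)[high[ , 1]],
           var2 = colnames(correlations)[high[ , 2]],
           r = correlations[high])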

Linearity
The assumption that the relationship between variables is linear (not curved). Most parametric statistics have this assumption (ANOVA, regression, etc.).

Linearity: Univariate
You can create bivariate scatter plots and make sure you don't see curved lines or rainbows.
– ggplot2!
– Damn, that would take forever! (See the sketch below.)
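A minimal ggplot2 sketch for one pair, with made-up column names x1 and x2; the straight fit line gives the point cloud something to be compared against. (Base R's pairs(noout[ , -c(1,2)]) draws every bivariate scatterplot in one grid, if you'd rather not build them one at a time.)

library(ggplot2)
ggplot(noout, aes(x = x1, y = x2)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # curvature around this line suggests nonlinearity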

Linearity: Multivariate
All the combinations of the variables are linear (especially important for multiple regression and MANOVA). This check is much easier: it lets you check everything at once.
– If this analysis looks really bad, I'd go back and check the bivariate scatter plots to see if it's one variable. Or run nonparametrics.

Linearity: Check
A fake regression to the rescue!
– This analysis will let us check all the rest of the assumptions.
– It's fake because we aren't doing a real hypothesis test.

Fake Regression
A quick note: many of the statistical tests you would run have diagnostic plots and assumption checks built in. This approach lets you apply data screening to any analysis, so you learn one set of rules rather than one per analysis. (BUT there are still checks that only apply to ANOVA that you'd want to add when you run ANOVA.)

Fake Regression
First, let's create a random variable:
– We will use the chi-square distribution function.
– Why chi-square? Mahalanobis used chi-square too… what gives?

Fake Regression
For many of these assumptions, the errors should be chi-square distributed (i.e., lots of small errors, only a few big ones). However, the standardized errors should be normally distributed around zero. Don't get these two things confused: we want the actual error values to be chi-square distributed and the z-scored ones to be normal. Draw a picture (or see the sketch below).
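A quick sketch of that picture in base R, using df = 7 to match the slides' "magic number" on the next slide:

# left: raw errors should look chi-square (skewed right, mostly small values)
# right: standardized errors should look normal, centered on zero
par(mfrow = c(1, 2))
curve(dchisq(x, df = 7), from = 0, to = 25, main = "raw errors", ylab = "density")
curve(dnorm(x), from = -4, to = 4, main = "standardized errors", ylab = "density")
par(mfrow = c(1, 1))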

Fake Regression
Create a random chi-square variable with the same number of participants as our data.
– rchisq(number of random values, df)
– random = rchisq(nrow(noout), 7)  ## nrow = number of people, 7 = magic number

Fake Regression
Now what do I do with that?
– Run a fake regression with the new random variable as the DV.
– Use the lm() function.

Fake Regression
lm() arguments:
– lm(y ~ x, data = data) (there are loads more options; these are the ones you need)
– y = DV
– x = IV. In this example only, we can use a period (.) to represent all the columns; normally you would have to type them out by column name.
– data = data set name

Fake Regression
– fake = lm(random ~ ., data = noout)
I saved it as fake to be able to view the diagnostic plots.

Linearity: Check
Now that I have that done, let's make the linearity plot, called a normal probability plot, or just a P-P plot. (Strictly speaking, the qqnorm() function below draws the closely related Q-Q plot, but it is read the same way here.)

The P-P Plot

Linearity: Check
What is this thing plotting?
– The standardized residuals (draw).
– These are z-scored values of how far each person's predicted score is from their actual score.
– We use z-scores because they are easy to interpret and give us probabilities.

Linearity: Check
Get the standardized residuals out of your fake regression:
– standardized = rstudent(fake)
Plot that stuff:
– qqnorm(standardized)
Add a line to make it easy to interpret:
– abline(0, 1)

Normally Distributed Something or Other
This assumption tends to get incorrectly translated as "your data need to be normally distributed."

Normally Distributed Something or Other
We actually assume the sampling distribution is normal.
– So if our sample is not, that's OK, as long as we have enough people for the central limit theorem to kick in.
How can we tell?
– N > 30, OR
– Check the sample distribution as an approximation.

When Does the Assumption of Normality Matter?
In small samples.
– The central limit theorem allows us to forget about this assumption in larger samples.
– In practical terms, as long as your sample is fairly large, outliers are a much more pressing concern than normality.

Normality
Univariate: the individual variables are normally distributed.
– Check for univariate normality with histograms
– And with skew and kurtosis values.

Normality
Get skew and kurtosis:
– Use the moments package, it's happiness.
Code:
– skewness(dataset, na.rm = TRUE)
– kurtosis(dataset, na.rm = TRUE)
Our example:
– skewness(noout[ , -c(1,2)], na.rm = TRUE)
– kurtosis(noout[ , -c(1,2)], na.rm = TRUE)

Normality
What do these numbers mean?
– You are looking for absolute values less than 3, the same rule as for univariate outliers.
– One variable has a bad kurtosis value. Generally, since we have enough people, I'd ignore it, but it can be helpful in figuring out why the next graph is bad. (See the sketch below for a quick way to flag offenders.)
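An optional sketch (my addition) to pull out the offending variables by name. One caveat I'm assuming: moments' kurtosis() reports raw kurtosis, where a normal distribution scores about 3, so some people subtract 3 (excess kurtosis) before applying the cutoff.

library(moments)
sk = skewness(noout[ , -c(1,2)], na.rm = TRUE)
ku = kurtosis(noout[ , -c(1,2)], na.rm = TRUE)
names(sk)[abs(sk) > 3]      # variables with bad skew
names(ku)[abs(ku - 3) > 3]  # bad excess kurtosis, per the caveat above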

Normality
Multivariate: all the linear combinations of the variables need to be normal.
– Basically, if you ran the Mahalanobis outlier analysis, you also want to check multivariate normality.

Normality: Check
We are going to use those standardized residuals again to check normality:
– hist(standardized, breaks = 15)

Normality: Check
What to look for:
– See the numbers centered around zero at the bottom?
– You want an even spread around zero, so it shouldn't run from -2 to +4; that's not even.

Homogeneity
The assumption that the variances of the variables are roughly equal across groups. Ways to check (you do NOT want p < .001):
– Levene's: univariate
– Box's: multivariate
– We will run these with the analyses they match up to (a preview sketch follows).
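As a preview (a sketch of mine, not the slides' code; 'score' and 'group' are hypothetical column names, and group must be a factor), Levene's test lives in the car package:

library(car)
leveneTest(score ~ group, data = noout)  # you do NOT want p < .001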

Homogeneity
Sphericity: the assumption that the differences between time measurements in repeated measures have approximately the same variance. A difficult assumption…
– We will use Mauchly's test when we get to repeated measures.

Homoscedasticity
The spread of the variance of one variable is the same across all values of the other variable.
– It can't look like a snake that ate something, or like a megaphone.
The best way to check both of these assumptions is by looking at a residual scatterplot.

Spotting problems with Homogeneity or Homoscedasticity

Homog+s: Check
Create a scatterplot from the fake regression.
– X = standardized fitted values: the predicted score for each person in your regression.
– Y = standardized residuals: the difference between a person's actual score and their predicted score (y − ŷ).
– Standardize both for an easier scale to interpret.

Homog+s: Check
We are plotting them against each other. In theory, the residuals should be randomly distributed (which is why we created a random variable to test with). Therefore, they should look like a bunch of random dots (see below).

Homog+s: Check
Standardize the fitted values:
– fitvalues = scale(fake$fitted.values)
Plot those values:
– plot(fitvalues, standardized)
– abline(0, 0)

Homog+s: Check
Homogeneity: is the spread above the (0, 0) line the same as the spread below it?
– You do not want a very large spread on one side and a small spread on the other (it looks like it's raining).

Homog+s: Check
Homoscedasticity: is the spread equal all the way across the zero line?
– Look for megaphones or big lumps.
– It should look like a bunch of random dots. You do not want shapes: if you draw an imaginary line around all the dots, it should be a blob or block.
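To wrap up, here is the whole fake-regression screening routine from these slides collected in one place (just the slides' code, consolidated):

random = rchisq(nrow(noout), 7)              # random chi-square DV
fake = lm(random ~ ., data = noout)          # the fake regression
standardized = rstudent(fake)                # standardized residuals
fitvalues = scale(fake$fitted.values)        # standardized fitted values

qqnorm(standardized); abline(0, 1)           # linearity: dots should hug the line
hist(standardized, breaks = 15)              # normality: even spread around zero
plot(fitvalues, standardized); abline(0, 0)  # homogeneity/homoscedasticity: random blob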