Testing assumptions of simple linear regression

Testing assumptions of simple linear regression – Addendum

Now, how does one go about it? The approach taken in this course will be to teach you to control α. In other words, teach cautious ways to go about your business, so that if you get a result you can interpret it appropriately. This requires that you know what to do to protect α… and that means testing the assumptions of the procedure… and knowing what happens to α if they are violated.

Now, how does one go about it? And just as a “by the way”… there are lots of slides in here that we’ll “flash” by… but they provide a real step-by-step guide to completing some basic tests in the mid-term, so please be aware that the information is here!

Testing assumptions for simple regression - 1
Measurement level: the independent variable must be interval or dichotomous; the dependent variable must be interval.
How to test? You already know.
If the condition is violated? Don’t use regression!

Testing the assumptions for simple regression - 2
Normality (interval level variables): skewness and kurtosis must lie within acceptable limits (-1 to +1).
How to test? You can examine a histogram, but SPSS also provides procedures, and these have convenient rules that can be applied (see following slides).
If the condition is violated? The regression procedure can overestimate significance, so add a note of caution to the interpretation of results (increases the type I error rate).

Testing the assumptions - normality To compute skewness and kurtosis for the included cases, select Descriptive Statistics|Descriptives… from the Analyze menu.

Testing the assumptions - normality First, move the variables to the Variable(s) list box. In this case there are two interval variables (the IV and the DV) Second, click on the Options… button to specify the statistics to compute.

Testing the assumptions - normality First, mark the checkboxes for Kurtosis and Skewness. Second, click on the Continue button to complete the options.

Testing the assumptions - normality Click on the OK button to indicate the request for statistics is complete.

SPSS output to evaluate normality The simple linear regression requires that the interval level variables in the analysis be normally distributed. The skewness of NUMBER OF HOURS WORKED LAST WEEK for the sample (-0.333) is within the acceptable range for normality (-1.0 to +1.0), but the kurtosis (1.007) is outside the range. The assumption of normality is not satisfied for NUMBER OF HOURS WORKED LAST WEEK. The skewness of RS OCCUPATIONAL PRESTIGE SCORE (1980) for the sample (0.359) is within the acceptable range for normality (-1.0 to +1.0) and the kurtosis (-0.692) is within the range. The assumption of normality is satisfied for RS OCCUPATIONAL PRESTIGE SCORE (1980). The assumption of normality required by the simple linear regression is not satisfied. A note of caution should be added to any findings based on this analysis.
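For readers who prefer code to menus, here is a minimal Python sketch of the same skewness/kurtosis screen. The variable names hrs1 and prestg80 come from the slides; the data below are synthetic stand-ins, so in practice you would load your own GSS file instead.

```python
# A minimal sketch of the skewness/kurtosis screen in Python rather than SPSS.
# Synthetic stand-in data are generated below; with real data, substitute the
# GSS variables hrs1 and prestg80 from your own file.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame({"hrs1": rng.normal(40, 10, 300),
                   "prestg80": rng.normal(45, 12, 300)})

for var in ["hrs1", "prestg80"]:
    x = df[var].dropna()
    skew = stats.skew(x, bias=False)
    kurt = stats.kurtosis(x, bias=False)   # excess kurtosis, the quantity SPSS reports
    ok = abs(skew) <= 1 and abs(kurt) <= 1
    print(f"{var}: skewness={skew:.3f}, kurtosis={kurt:.3f} -> "
          f"{'within' if ok else 'outside'} the -1 to +1 rule of thumb")
```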

Testing the assumptions for simple regression – 3
Linearity and homoscedasticity for interval level variables.
How to test? Scatterplot (see following slides).
If the condition is violated? The procedure can underestimate significance – it loses power and increases the possibility of a type II error.

Testing the assumptions – linearity and homoscedasticity To request a scatterplot, select the Scatter… command from the Graphs menu.

Testing the assumptions – linearity and homoscedasticity First, select the Simple template for the scatterplot. Second, click on the Define button to enter the specifications for the scatterplot.

Testing the assumptions – linearity and homoscedasticity First, move the dependent variable “hrs1” to the text box for the Y Axis. Second, move the independent variable “prestg80” to the text box for the X Axis. Third, click on the OK button to complete the request.

The scatterplot for evaluating linearity The simple linear regression assumes that the relationship between the independent variable “RS OCCUPATIONAL PRESTIGE SCORE (1980)” and the dependent variable “NUMBER OF HOURS WORKED LAST WEEK” is linear. The assumption is usually evaluated by visual inspection of the scatterplot. Violation of the linearity assumption may result in an understatement of the strength of the relationship between the variables.

The scatterplot for evaluating homoscedasticity The simple linear regression assumes that the range of the variance for the dependent variable is uniform for all values of the independent variable. For an interval level independent variable, the assumption is evaluated by visual inspection of the scatterplot of the two variables. Violation of the homogeneity assumption may result in an understatement of the strength of the relationship between the variables.
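A rough Python equivalent of this scatterplot check is sketched below; the data are simulated, and hrs1/prestg80 simply mirror the variables used in the SPSS example.

```python
# A sketch of the same scatterplot check with matplotlib instead of SPSS
# (synthetic stand-in data; substitute the GSS variables prestg80 and hrs1).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
prestg80 = rng.normal(45, 12, 300)
hrs1 = 35 + 0.14 * prestg80 + rng.normal(0, 10, 300)

plt.scatter(prestg80, hrs1, s=10)
plt.xlabel("RS occupational prestige score (prestg80)")
plt.ylabel("Hours worked last week (hrs1)")
plt.title("Check linearity and homoscedasticity by eye")
plt.show()
```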

Testing the assumptions for simple regression – 4
Linearity and homoscedasticity for a dichotomous independent variable.
How to test? Linearity – with only 2 levels, it is not relevant here (see next slide). Homoscedasticity – via Levene’s test of homogeneity of variance in ANOVA (see following slides).
If the condition is violated? The procedure can underestimate significance – it loses power and increases the possibility of a type II error.

Testing the assumptions – linearity for a dichotomous IV When the independent variable is dichotomous, we do not have a meaningful scatterplot that we can interpret for linearity. The assumption of a linear relationship between the independent and dependent variable is only tested when the independent variable is interval level.

Testing the assumptions - homoscedasticity for a dichotomous IV To conduct the test of homoscedasticity, we will use the One-Way ANOVA procedure. Select the command Compare Means | One-Way ANOVA… from the Analyze menu.

Testing the assumptions - homoscedasticity for a dichotomous IV First, move the variable “prestg80” to the Dependent list box. Second, move the variable “compuse” to the Factor text box. Third, click on the Options… button to specify the statistics to compute.

Testing the assumptions - homoscedasticity for a dichotomous IV First, mark the Homogeneity-of-variance check box to request the Levene test. Second, click on the Continue button to complete the request.

Testing the assumptions - homoscedasticity for a dichotomous IV Click on the OK button to indicate the request for statistics is complete.

Result of test of homoscedasticity for a dichotomous independent variable The simple linear regression assumes that the variance for the dependent variable is uniform for all groups. This assumption is evaluated with Levene's test for equality of variances. The null hypothesis for this test states that the variances of all groups are equal. The desired outcome for this test is to fail to reject the null hypothesis. Since the probability associated with the Levene test (0.141) is greater than the level of significance (0.05), the null hypothesis is not rejected. The requirement for equal variances is satisfied.
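If you want to reproduce the Levene check outside SPSS, here is a hedged sketch with scipy; the group values are invented, and compuse/prestg80 are just the names used in the slides.

```python
# A sketch of the Levene check in Python: compare the spread of prestg80
# across the two categories of a dichotomous IV (compuse: 1 = yes, 2 = no).
# Synthetic stand-in data; substitute your own variables.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
prestg80_yes = rng.normal(48, 12, 150)   # computer users
prestg80_no = rng.normal(41, 12, 120)    # non-users

stat, p = stats.levene(prestg80_yes, prestg80_no, center="mean")
print(f"Levene W = {stat:.3f}, p = {p:.3f}")
if p > 0.05:
    print("Fail to reject equal variances: the homoscedasticity assumption is satisfied.")
else:
    print("Variances differ: add a note of caution to the interpretation.")
```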

Assumptions tested – run the analysis Now that you’ve tested the assumptions, here’s a quick run-through of how to run the test and how to interpret the results. First – an example with two interval level variables.

Running the analysis – interval IV’s To conduct a simple linear regression, select Regression | Linear… from the Analyze menu.

Running the analysis – interval IV’s First, move the dependent variable “hrs1” to the text box for the Dependent variable. Second, move the independent variable “prestg80” to the list of Independent variables. Third, click on the OK button to complete the request.

The existence of a relationship The determination of whether or not there is a relationship between the independent variable and the dependent variable is based on the significance of the regression in the ANOVA table. The probability of the F statistic for the regression relationship is 0.041, less than or equal to the level of significance of 0.05. We reject the null hypothesis that there is no relationship between the independent and the dependent variable.

The strength of the relationship The strength of the relationship is based on the R-square statistic, which is the square of R, the correlation coefficient. We evaluate the strength of the relationship using the rule of thumb for interpreting R:
Between 0 and ±0.20 - Very weak
Between ±0.20 and ±0.40 - Weak
Between ±0.40 and ±0.60 - Moderate
Between ±0.60 and ±0.80 - Strong
Between ±0.80 and ±1.00 - Very strong

The direction of the relationship The direction of the relationship, direct or inverse, is based on the sign of the B coefficient for the independent variable. Since 0.138 is positive, there is a positive relationship between occupational prestige and hours worked.

Interpret the intercept The intercept (Constant) is the position on the vertical y-axis where the regression line crosses the axis. It is interpreted as the value of the dependent variable when the value of the independent variable is zero. It is seldom a useful piece of information.

Interpret the slope The B coefficient of the independent variable is called the slope. It represents the amount of change in the dependent variable for a one-unit change in the independent variable. Each time that occupational prestige increases or decreases by one point, we would expect the subject to work 0.138 more or 0.138 fewer hours.

Significance test of the slope If there is no relationship between the variables, the slope would be zero. The hypothesis test of the slope tests the null hypothesis that the b coefficient, or slope, is zero. In simple linear regression, the significance of this test matches that of the overall test of relationship between dependent and independent variables. In multiple regression, the test of overall relationship will differ from the test of each individual independent variable.

Conclusion of the analysis For the population represented by this sample, there is a very weak relationship between "RS OCCUPATIONAL PRESTIGE SCORE (1980)" and "NUMBER OF HOURS WORKED LAST WEEK." Specifically, we would expect a one unit increase in occupational prestige score to produce a 0.138 increase in the number of hours worked in the past week. Because of the problems with normality noted earlier, the statistical conclusion must be expressed with caution.
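As a supplement to the SPSS walkthrough, the sketch below fits the same kind of simple regression with statsmodels and pulls out the F test, R-square, slope, and intercept discussed on the preceding slides. The data are synthetic, so the printed numbers will not match the 0.138 slope from the example.

```python
# A hedged sketch of the whole simple regression in Python with statsmodels;
# the GSS variables hrs1 and prestg80 are assumed, and synthetic data stand in here.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({"prestg80": rng.normal(45, 12, 255)})
df["hrs1"] = 35 + 0.14 * df["prestg80"] + rng.normal(0, 10, 255)

model = smf.ols("hrs1 ~ prestg80", data=df).fit()
print(model.summary())            # F test, R-square, and B coefficients in one table
print("Slope:", round(model.params["prestg80"], 3))
print("Intercept:", round(model.params["Intercept"], 3))
print("p-value for the slope:", round(model.pvalues["prestg80"], 3))
```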

Running the analysis – mixed IV’s Now an example with an interval dependent variable and a dichotomous independent variable...

SPSS output to evaluate normality The simple linear regression requires that the interval level variables in the analysis be normally distributed. The skewness of RS OCCUPATIONAL PRESTIGE SCORE (1980) for the sample (0.324) is within the acceptable range for normality (-1.0 to +1.0) and the kurtosis (-0.817) is within the range. The assumption of normality is satisfied for RS OCCUPATIONAL PRESTIGE SCORE (1980).

The strength of the relationship The strength of the relationship is based on the R-square statistic in the Model Summary table of the regression output. R-square is the square of R, the correlation coefficient. We evaluate the strength of the relationship using the rule of thumb for interpreting R:
Between 0 and ±0.20 - Very weak
Between ±0.20 and ±0.40 - Weak
Between ±0.40 and ±0.60 - Moderate
Between ±0.60 and ±0.80 - Strong
Between ±0.80 and ±1.00 - Very strong

The direction of the relationship The direction of the relationship, direct or inverse, is based on the sign of the B coefficient for the independent variable. Since -7.406 is negative, there is an inverse relationship between using a computer and occupational prestige. What this means exactly will depend on the way the computer use variable is coded.

Interpret the intercept The intercept (Constant) is the position on the vertical y-axis where the regression line crosses the axis. It is interpreted as the value of the dependent variable when the value of the independent variable is zero. It is seldom a useful piece of information.

Interpret the slope The b coefficient for the independent variable "R USE COMPUTER" is -7.406. The b coefficient is the amount of change in the dependent variable "RS OCCUPATIONAL PRESTIGE SCORE (1980)" associated with a one unit change in the independent variable. Since the independent variable is dichotomous, a one unit increase implies a change from the category YES(code value = 1) to the category NO(code value = 2).

Significance test of the slope If there is no relationship between the variables, the slope would be zero. The hypothesis test of the slope tests the null hypothesis that the b coefficient, or slope, is zero. In simple linear regression, the significance of this test matches that of the overall test of relationship between dependent and independent variables. In multiple regression, the test of overall relationship will differ from the test of each individual independent variable.

Conclusion... For the population represented by this sample, there is a weak relationship between "R USE COMPUTER" and "RS OCCUPATIONAL PRESTIGE SCORE (1980)." Specifically, given the coding (YES = 1, NO = 2) and the negative B coefficient, we would expect survey respondents who used a computer to average 7.406 points higher on occupational prestige score than survey respondents who did not use a computer. There are no problems with the assumptions, so there is no need to express caution in this case.

Simple linear regression chart - 1 The following is a guide to the decision process for answering simple linear regression questions.
Is the level of measurement okay? (Independent: interval or dichotomous; Dependent: interval.) If no: incorrect application of a statistic. If yes, continue.
Is the assumption of normality satisfied? (Skewness and kurtosis of the dependent variable between –1.0 and +1.0.) If no: add caution if the question turns out to be true. If yes, continue.

Simple linear regression chart - 2
Is the assumption of linearity satisfied? (Examine the scatterplot.) If no: add caution if the question turns out to be true. If yes, continue.
Is the assumption of homoscedasticity satisfied? (Levene test for a dichotomous independent variable; examine the scatterplot for an interval independent variable.) If no: add caution if the question turns out to be true. If yes, continue.

Simple linear regression chart - 3
Is the probability of the F for the regression relationship less than or equal to the level of significance? If no: fail to reject the null hypothesis. If yes, continue.
Do the size and direction of the intercept and the slope agree with the problem statement? If no: fail to reject the null hypothesis. If yes: reject the null hypothesis.

Multiple Linear Regression A practical guide, and how to test assumptions

First, what is multiple linear regression? First, some terminology… these 3 equations all say the same thing… β0, β1 and so on are called beta coefficients. Y’ = a + bX; Y’ = mX + b; Y’ = β0 + β1X. a = b = β0 = INTERCEPT; b = m = β1 = SLOPE.

First, what is multiple linear regression? Simple linear regression uses just one predictor or independent variable. Multiple linear regression just adds more IV’s (or predictors). Each IV or predictor brings another beta coefficient with it… Y’ = β0 + β1X1 + β2X2

Now, an example… So, now we can add the sex variable to our prediction equation from last week. Here is the one with just height in the model… Note R2 = .65 for the simple model.

Now, an example… But if we add sex… the slope of each line is the same, but it now fits both values of sex by adjusting the height of the line. R2 = .99 – a nice improvement in R2!

Now, an example… In terms of the equation, this is achieved by… When sex = 1: Y’ = β0 + β1X1 + (β2 × 1). When sex = 2: Y’ = β0 + β1X1 + (β2 × 2).

Now, an example… This is called “dummy coding” when the second variable is categorical (as sex is) The principle is similar when the second variable is continuous Adding more variables simply captures more variance on the dependent variable (potentially, of course)
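Here is a small illustrative sketch of the dummy-coding idea in Python. The slides do not name the dependent variable in the height-and-sex example, so weight is used as a hypothetical DV; the data are invented.

```python
# A small sketch of the dummy-coding idea: sex enters the model as a second
# predictor, shifting the height of the regression line for each category.
# Names (height, sex, weight) are illustrative; the data are made up.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 200
sex = rng.integers(1, 3, n)                              # coded 1 and 2, as in the slides
height = rng.normal(170, 8, n) - 6 * (sex - 1)
weight = -50 + 0.7 * height - 5 * (sex - 1) + rng.normal(0, 2, n)

df = pd.DataFrame({"weight": weight, "height": height, "sex": sex})
model = smf.ols("weight ~ height + C(sex)", data=df).fit()   # C() dummy-codes sex
print(model.params)          # one slope for height, one shift for the sex = 2 group
print("R-squared:", round(model.rsquared, 3))
```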

Note on graphs/charts for MLR I showed you the example in 2D, but with multiple regression an accurate chart is only possible in a number of dimensions equal to the total number of variables in the model (dependent plus independent). So, three dimensions would be needed here: Y’ = β0 + β1X1 + β2X2

[Figures: the regression surface drawn in three dimensions, with Y on the vertical axis and X1 and X2 on the horizontal axes; labelled points such as (0,0,0) and (1,0,1) mark coordinates on the plane.]

Assumptions of MLR Four assumptions of MLR (known by acronym “LINE”) Linearity: the residuals (differences between the obtained and predicted DV scores) should have a straight-line relationship with predicted DV scores Independence: the observations on the DV are uncorrelated with each other

Regression Assumptions Four assumptions of MLR: Normality: the observations on the DV are normally distributed for each combination of values for the IV’s Equality of variance: the variance of the residuals about predicted DV scores should be the same for all predicted scores (homoscedasticity…remember the cone shaped pattern?) We will not test MLR assumptions in this class (enough that you do them for SLR)

Other issues Sample size & # predictors: A crucial aspect of the worth of a prediction equation is whether it will generalize to other samples. With multiple regression (based on multiple correlation), minimizing the prediction errors of the regression line is like maximizing the correlation for that sample. So one would expect that on another sample, the correlation (and thus R2) would shrink.

Other issues Sample size & # predictors: Our problem is reducing the risk of shrinkage. The two most important factors are sample size (n) and the number of predictors (independent variables) (k).
Expect big shrinkage with ratios less than 5:1 (n:k).
Guttman (1941): 136 subjects, 84 predictors, obtained multiple r = .73; on a new independent sample, r = .04!
Stevens (1986): n:k should be 15:1 or greater in social science research.
Tabachnick & Fidell (1996): n > 50 + 8k.

Other issues Sample size & # predictors: What to do if you violate these rules: Report Adjusted R2 in addition to R2 when your sample size is too small, or close to it…small samples tend to overestimate R (and consequently R2)
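A tiny sketch of that Adjusted R2 correction, using the standard formula adj R2 = 1 - (1 - R2)(n - 1)/(n - k - 1); the numbers plugged in are only for illustration (the second call reuses the Guttman figures quoted above).

```python
# A sketch of the adjusted R-square correction for shrinkage.
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """r2: sample R-square, n: sample size, k: number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With a generous n:k ratio the penalty is small...
print(round(adjusted_r2(0.65, n=300, k=2), 3))       # about 0.648
# ...but with 84 predictors and 136 subjects (the Guttman example) it collapses.
print(round(adjusted_r2(0.73 ** 2, n=136, k=84), 3))
```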

Other Issues Outliers MLR is very sensitive to outliers. Check for outliers in the initial data screening process. What to do with outliers, when found, is a highly controversial topic: leave them, delete them, or transform them.

Other Issues Outliers and influential points are not necessarily the same thing Need to sort out whether an outlier is influential You can look for them, & test for them, in normal regression procedures

Other Issues List the data – check for errors. Measure influential data points with Cook’s distance (CD), a measure of the change in regression coefficients that would occur if the case was omitted… it reveals the cases most influential for the regression equation. A CD of 1 is generally thought to be large.
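A brief sketch of how Cook's distance can be obtained with statsmodels; the data and the planted point are invented, and the CD of 1 rule of thumb is the one mentioned above.

```python
# A sketch of computing Cook's distance with statsmodels (synthetic data;
# substitute your own fitted model). Values near or above 1 deserve a closer look.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(0, 1, 100)
y = 2 + 0.5 * x + rng.normal(0, 1, 100)
y[0] += 8                                        # plant one suspicious-looking point

model = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d = model.get_influence().cooks_distance[0]
print("Largest Cook's distance:", round(cooks_d.max(), 3))
print("Cases with CD > 1:", int((cooks_d > 1).sum()))
```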

Other Issues Multicollinearity The relationship between IV’s… when IV’s are highly correlated with one another. It results in a model that overestimates how much variance is truly being explained. Examine the correlation matrix produced by SPSS to detect any multicollinearity. If detected, it is generally best to re-run the MLR and eliminate one of the two IV’s from the model.
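Below is a hedged sketch of the correlation-matrix screen in pandas. The variance inflation factor (VIF) check at the end is an extra diagnostic that the slides themselves do not use; the variable names and data are made up.

```python
# A sketch of the correlation-matrix screen for multicollinearity in pandas,
# plus an optional VIF check (synthetic, deliberately collinear data).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(6)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)        # deliberately collinear with x1
x3 = rng.normal(size=200)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

print(df.corr().round(2))                          # scan for high pairwise r's

X = sm.add_constant(df)
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)                                        # VIFs well above ~10 flag trouble
```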

Multicollinearity Nasty word – but relatively simple meaning. Multicollinearity = high correlation among predictors. Consider these three correlation patterns among X1, X2, X3 and Y:
A: correlations with Y of .2, .1, .3; predictor intercorrelations .5, .4, .6
B: correlations with Y of .6, .5, .7; predictor intercorrelations .2, .3, …
C: correlations with Y of .6, .7, .8
Which would we expect to have the largest overall R2, and which would we expect to have the smallest?

Multicollinearity R will be at least .7 for B and C, but only at least .3 for A. There is no chance of R for A getting much larger, because the intercorrelations of the X’s are as large for A as for B and C. (Correlation patterns A, B and C as on the previous slide.)

Multicollinearity R will probably be largest for B: its predictors are correlated with Y and there is not much redundancy among them. R is probably greater in B than in C, as C has considerable redundancy in its predictors. (Correlation patterns A, B and C as on the previous slides.)

Multicollinearity Real-world example (Dizney & Gromen, 1967): x1 = reading proficiency; x2 = writing proficiency; y = a course in college German. Correlations: r(x1, x2) = .58, r(x1, y) = .33, r(x2, y) = .45. The multiple correlation was only .46.
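As a check on those figures, the multiple correlation can be computed directly from the correlation matrix (R² = r′ R⁻¹ r); the short sketch below reproduces the .46 quoted above.

```python
# Reproduce the Dizney & Gromen multiple correlation from the correlations alone.
import numpy as np

Rxx = np.array([[1.0, 0.58],
                [0.58, 1.0]])        # correlations among the two predictors
rxy = np.array([0.33, 0.45])         # correlations of each predictor with y

R_squared = rxy @ np.linalg.inv(Rxx) @ rxy
print(round(float(np.sqrt(R_squared)), 2))   # about 0.46
```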

Multiple predictors – unique vs. shared variance We must establish what unique variance on each predictor is related to variance on the criterion. Example 1 (graphical): y – freshman college GPA; predictor 1 – high school GPA; predictor 2 – SAT total score; predictor 3 – attitude toward education.

Multiple predictors – unique vs. shared variance Circle = variance for a variable; overlap = shared variance (only 2 predictors shown here). [Figure: Venn diagram of y, x1 and x2 – one region is the variance in y accounted for by predictor 2 after the effect of predictor 1 has been partialled out; another is the common variance in y that predictors 1 and 2 both account for.]

Multiple predictors – unique vs. shared variance Example 2 (words): y – freshman college GPA; predictor 1 – high school GPA; predictor 2 – SAT total score; predictor 3 – attitude toward education. Aaaaaagggghhhhhhhh!!!!

Multiple predictors – unique vs. shared variance The pieces are: the variance in college GPA predictable from variance in high school GPA; the residual variance in SAT related to variance in college GPA; and the residual variance in attitude related to variance in college GPA.

The Goal of MLR The big picture… What we’re trying to do is create a model predicting a DV that explains as much of the variance in that DV as possible, while at the same time: meeting the assumptions of MLR, managing the other issues (sample size, outliers, multicollinearity), and being parsimonious.

The Goal with MLR Types of research questions answered through MLR analysis: How accurately can something be predicted with a set of IV’s? (ex. predicting incoming students’ GPA) What is the relationship between a certain IV and the DV, while simultaneously controlling for confounding variables? (ex. relationship between TV broadcasting and game attendance while controlling for day of week, time of day, ticket price, teams’ won-loss records, etc…)

Entering IV’s into the MLR Model Standard regression Full model approach = all IV’s used in the model at the same time. Some researchers will simply end the analysis at this point; others will examine assumptions, outliers, and multicollinearity and create reduced models until they identify a single “best” model to predict the DV. This is the process we will use in this class, although testing assumptions is beyond our purposes here.

Entering IV’s into the MLR Model Sequential regression Sometimes called hierarchical regression. The researcher determines an order for entering variables into the regression equation, usually based on some sort of theoretical or logical rationale. The researcher then makes judgments about the unique contribution of each IV in determining which to include and which to delete from the model. This is not radically different from the previous process of reducing the full model, just a slightly different process – sequential regression is sometimes thought to be prone to researcher bias because of its subjectivity.

Entering IV’s into the MLR Model Statistical (often called Stepwise) regression IV’s are entered into the regression equation one at a time; each is assessed in terms of what it adds to the equation at that point of entry and how it affects multicollinearity and the assumptions. Forward regression = the IV with the highest correlation to the DV is added first, then the second-highest, etc. Backward regression = all variables are entered, then the one with the lowest correlation to the DV is taken away, then the second-lowest, etc. Stepwise regression = a combination of alternating forward and backward regression procedures. The use of stepwise regression is highly controversial in statistics.

Order of entry of predictors In real life, predictors will be at least a little bit correlated with each other. So, when the second predictor is added, it only adds to the total R2 by the amount of unique variance it shares with the criterion. So it can make a big difference which predictor is entered first. What determines the size of the difference?

Order of entry of predictors Example (real, again – Crowder, 1975): predict ratings of trainability of mentally retarded (TM) individuals based upon IQ and a test of social inference (TSI). r(IQ, TSI) = .59; r(TM, IQ) = .54; r(TM, TSI) = .566.
1st ordering (% of variance): TSI 32.04, IQ 6.52
2nd ordering (% of variance): IQ 29.16, TSI 9.40

Work with a small # of predictors Why? Parsimony (a principle of good science). A small # of predictors improves the n/k ratio. The greater the # of predictors, the higher the chance of shared variance among them. A large number of predictors can be replaced with a smaller number of “constructs” – “principal components” – see your next stats class… (factor analysis).

Regression procedures Forward selection 1st in will be the predictor with the largest correlation with y; then the one with the largest semi-partial correlation… and so on… It stops when the remaining predictors are no longer significant.

Regression procedures Backward selection All predictors are entered; a partial F for each predictor is calculated (as though it were the last one in). The smallest partial F is compared with a pre-selected threshold; if it is smaller, that predictor is removed. When the smallest F is still more significant than the threshold, stop.

Regression procedures Stepwise regression A variation on forward selection: at each stage, a test is made of the most superfluous predictor (smallest F). Popular, but easily abused… see the next slide.
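For completeness, here is a rough scikit-learn analogue of forward/backward selection. Note that SequentialFeatureSelector chooses variables by cross-validated score rather than the partial-F tests described on these slides, so treat it only as an illustration of the idea; the data are synthetic.

```python
# A rough analogue of forward / backward selection with scikit-learn
# (selection here is by cross-validated score, not partial-F tests).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(size=200)

selector = SequentialFeatureSelector(LinearRegression(),
                                     n_features_to_select=2,
                                     direction="forward")    # or "backward"
selector.fit(X, y)
print("Selected predictor columns:", np.flatnonzero(selector.get_support()))
```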

Regression procedures “…studies have shown that in some cases when an entry F test was made at the α-level, the appropriate probability was qα, where there were q entry candidates at that stage” – Draper & Smith (1981). In other words, if you have 10 predictors, and you think you are testing at .05, you might be testing at .50! This emphasizes the need for good theory and justification of the model, and subsequent rigorous testing of it.

Checking assumptions of the model Residual plots Plot studentized residuals against predicted values. If the model is good, this should show random variation of the residuals about 0. [Figure: studentized residuals (vertical axis) plotted against predicted y’s (horizontal axis), with random scatter about 0 – good.]

Checking assumptions of the model Residual plots Bad models: [Figures: residual plots showing violations – non-linearity; non-constant variance; and non-linearity combined with non-constant variance.]

Checking assumptions of the model Residual plots And in SPSS you do it like this… Analyze | Regression | Linear…; in the dialogue box, choose “Plots”, set Y = *SRESID and X = *ZPRED, and you get something like this…

Checking assumptions of the model Doesn’t look too good, does it?
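Outside SPSS, the same studentized-residual-versus-predicted plot can be produced with statsmodels and matplotlib, as in the sketch below (synthetic data; your own fitted model would replace it).

```python
# A sketch of the residual plot in Python: studentized residuals against
# predicted values, which should show random scatter about 0 for a good model.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(8)
X = sm.add_constant(rng.normal(size=(150, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=150)

model = sm.OLS(y, X).fit()
studentized = model.get_influence().resid_studentized_internal
plt.scatter(model.fittedvalues, studentized, s=10)
plt.axhline(0, linewidth=1)
plt.xlabel("Predicted y")
plt.ylabel("Studentized residual")
plt.title("Good models show random scatter about 0")
plt.show()
```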

Checking assumptions of the model Cross-validation The single most powerful and most useful check against shrinkage. Split the sample 80:20. Run the regression procedure on 80% of the sample (the derivation or screening sample). Apply the prediction equation to the other 20% (the validation or calibration sample); to do this, you can use Compute in SPSS to calculate a new variable which is a linear composite of the predictors. If there is little drop-off, your cross-validation has been successful.
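A minimal sketch of that 80:20 split-sample check in Python; the data are simulated, and with a real dataset you would compare the derivation-sample R2 with the validation-sample R2 exactly as described above.

```python
# A sketch of 80:20 split-sample cross-validation (synthetic data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(9)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.5, 0.0, 2.0, -1.0]) + rng.normal(size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
model = LinearRegression().fit(X_train, y_train)       # derivation (screening) sample
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)                  # validation (calibration) sample
print(f"R2 derivation = {r2_train:.3f}, R2 validation = {r2_test:.3f}")
print("Little drop-off means the cross-validation was successful.")
```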

The MLR Process Graph the data to look for outliers. Examine the correlation matrix to see which variables are correlating with the DV and to check for multicollinearity among IV’s. Find the R2 and Std. Error figures in the “Model Summary”. The “ANOVA” box performs an F-test of the null hypothesis that none of the IV’s is useful in predicting the DV. The p-value is in the far right column, but also know that the F-statistic next to it is the measure used to derive the p-value.

The MLR Process The coefficients box provides data on each of the IV’s while all are in the model at the same time… essentially it provides an examination of each IV’s unique contribution to predicting the DV while controlling for all of the other IV’s. The p-value (Sig.) tells us whether each IV is a significant predictor of the DV (compare each to the alpha level). The standardized beta coefficients allow us to compare the strength of each IV in predicting the DV. The unstandardized beta coefficients represent the slope and y-intercept of the regression line equation… these can be used as we did last week with SLR to predict the DV knowing the values of the IV’s. Am I satisfied with this model, or should I examine another model by reducing via IV elimination (again based on assumptions, outliers, multicollinearity, and the R2 value of the model)?

The MLR Process We’ll look over this last question next week with the final set of slides, designed to show how to play with model reduction (it really is basically just mucking about with the model until you think you have it right). I’ll use a modified set of the slides that follow this one to lecture from next week.

MLR Example Let’s go through this process in an example… RQ: What is the relationship between television broadcasting, both national and local, and game attendance in men’s college basketball (while controlling for potentially confounding variables)? Let’s establish alpha=.05 a priori

MLR Example Graph the data to look for outliers

MLR Example Examine the correlation matrix to see which variables are correlating with the DV and to check for multicollinearity among IV’s. (Note: the entire correlation matrix would not fit here, as it is 21 x 21 variables.) Correlations with Attendance: NATBROAD .436*, LOCBROAD .343*, MON -.035, TUES .020, WED .041, THURS -.081*, FRI -.099*, SAT .064*, SUN .024, TIME -.088*, HOMEWIN .455*, VISWIN .302*, RPI0203 -.442*, RPI0304 -.388*, CONFGAME .030, CONFRPI .712*, HISTORY .790*, UNIVTYPE -.097*, ENROLL .384*, POPUL -.050, INCOME .054. Very little multicollinearity… no pairwise correlations above r = .45 and few even above r = .3.

MLR Example Find the R2 and Std. Error #’s from the “Model Summary” What can we learn here?

MLR Example The “ANOVA” box performs an F-test to accept/reject the null hypothesis that none of the IV’s is purposeful in predicting the DV

MLR Example The coefficients box provides data on each of the IV’s while all are in the model at the same time…essentially provides an examination of each IV’s unique contribution to predicting the DV while controlling for all of the other IV’s The p-value (Sig.) tells us whether each IV is a significant predictor of the DV (compare each to alpha level) The standardized beta coefficient allows us to compare the strength of each IV in predicting the DV The unstandardized beta coefficients represent the slope and y-intercept of the regression line equation…these can be used as we did last week with SLR to predict the DV knowing the value of the IV’s

MLR Example

MLR Example The coefficients box (continued) Which variables correlate significantly with game attendance while controlling for other variables? Which variables are the strongest predictors of game attendance? How much does attendance increase or decrease if a game is broadcast on national television? on local television?

MLR Example Am I satisfied with this model, or should I examine another model by reducing via IV elimination? No data reduction was used with this study because of the research question being addressed (use of most of the IV’s merely for control purposes) Ignoring that, what reduced models might we want to examine if we were so inclined?

MLR Example Conclusions? In other words, how would we phrase our interpretation of the findings of this study?

Other Prediction Statistics Canonical correlation can be used to predict the value of two or more DV’s with two or more IV’s Ex. Using HS GPA & SAT to predict undergrad GPA and graduation rate

Other Prediction Statistics Discriminant analysis/function can be used to predict classification into categories of the DV with two or more IV’s Ex. Using IV’s like age, education, hours worked per week, and income to predict one’s life perspective (dull, routine, or exciting)

Other Prediction Statistics Logistic regression can be used to predict classification into a dichotomous DV with two or more IV’s Ex. Using variables to predict steroid user or non-user