Class 21: Tues., Nov. 23
Today: Multicollinearity, One-way analysis of variance
Schedule:
– Tues., Nov. 30th – Review, Homework 8 due
– Thurs., Dec. 2nd – Midterm II
– Tues., Dec. 7th, Thurs., Dec. 9th – Analysis of variance continued
– Mon., Dec. 13th – Rough draft of project due
– Tues., Dec. 21st – Final draft of project, Homework 9 due
Multicollinearity
Multicollinearity in multiple regression refers to a situation in which two or more explanatory variables are highly correlated. Multicollinearity between X1 and X2 makes it difficult to distinguish the effect of X1 on Y holding X2 fixed (the coefficient on X1) from the effect of X2 on Y holding X1 fixed (the coefficient on X2). Multicollinearity leads to high standard errors of the estimated coefficients.
House Prices Example
A real estate agent wanted to develop a model to predict the selling price of a home. The agent believed that the most important variables in determining the price of a house are its size, number of bedrooms, and lot size. Accordingly, the agent took a random sample of 100 recently sold homes and recorded the selling price (Y), the number of bedrooms (X1), the house size in square feet (X2), and the lot size in square feet (X3). The data is in housesellingprice.JMP.
The p-value for the overall F-test is very small – strong evidence that the model is useful: it provides better predictions than the sample mean of selling price. But none of the explanatory variables is useful for predicting Y once the other explanatory variables are taken into account – the p-values for the t-tests are all > 0.05. What is happening?
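This pattern is easy to reproduce. Below is a minimal simulation sketch (simulated data, not the house data; statsmodels assumed available) showing that when two predictors are nearly identical, the overall F-test can be highly significant while neither individual t-test is.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)    # x2 is nearly identical to x1
y = 2 * x1 + 2 * x2 + rng.normal(scale=3, size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.f_pvalue)     # overall F-test: tiny p-value, model is useful
print(fit.pvalues[1:])  # individual t-tests: typically both > 0.05
```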
Multicollinearity: House size and lot size are very highly correlated. House size and bedrooms (0.8465) and lot size and bedrooms (0.8374) are also fairly highly correlated.
Variance Inflation Factors: Method for Recognizing Multicollinearity
Variance inflation factor (VIF): an index of the effect of multicollinearity on the standard error of coefficient estimates in regression. A VIF of 9 for explanatory variable X1 implies that the standard error of the coefficient estimate of X1 is √9 = 3 times larger than it would be if X1 were uncorrelated with the other explanatory variables. If a coefficient has a VIF greater than 10, that indicates that multicollinearity is having a substantial impact on it. VIFs in JMP: after Fit Model, go to the parameter estimates area, right-click, click Columns, and then click VIF.
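Outside JMP, VIFs can be computed directly. A sketch with statsmodels, assuming the JMP table has been exported to a CSV with hypothetical column names bedrooms, house_size, and lot_size:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("housesellingprice.csv")   # assumed export of the JMP table
X = sm.add_constant(df[["bedrooms", "house_size", "lot_size"]])

# VIF for each explanatory variable (skip column 0, the intercept)
for i, name in enumerate(X.columns[1:], start=1):
    print(name, variance_inflation_factor(X.values, i))
```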
Coefficients on house size and lot size are being strongly affected by multicollinearity.
Methods for Dealing with Multicollinearity
1. Suffer: If prediction within the range of the data (interpolation) is the only goal, then leave it alone. Make sure, however, that the observations to be predicted are comparable to those used to construct the model – you will get very wide prediction intervals otherwise.
2. Transform or combine: In some cases, we can transform or combine two variables that are highly correlated.
3. Omit one: If X1 and X2 are highly correlated, we can omit X1. But note that the coefficient on X2 after X1 is omitted has a different interpretation. There is no good solution to multicollinearity if we want to understand the effect of X2 on Y holding X1 fixed.
Omitting lot size gives us a good estimate of the effect of house size – the 95% CI is (39.28, 82.72). But lot size is not held fixed in this regression. If lot size and house size essentially always increase in proportion together, we can view the coefficient on house size as the increase in mean house price for a one square foot increase in house size, with the corresponding increase in lot size, when bedrooms is held fixed.
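A sketch of this "omit one" comparison in statsmodels, using the same hypothetical CSV and column names as above (the numbers quoted on this slide come from the JMP output, not this code):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("housesellingprice.csv")
full = smf.ols("price ~ bedrooms + house_size + lot_size", data=df).fit()
reduced = smf.ols("price ~ bedrooms + house_size", data=df).fit()

# The house_size interval should be much narrower in the reduced model,
# but its coefficient no longer holds lot_size fixed.
print(full.conf_int().loc["house_size"])
print(reduced.conf_int().loc["house_size"])
```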
Analysis of Variance
The goal of analysis of variance is to compare the means of several (possibly many) groups. Analysis of variance is regression with only categorical explanatory variables. One-way analysis of variance: the groups are defined by one categorical variable.
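As a sketch of this equivalence, one-way ANOVA can be run as an ordinary regression on a categorical variable, here with a hypothetical file milgram.csv holding columns voltage and condition for the data discussed next:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("milgram.csv")                     # hypothetical file name
fit = smf.ols("voltage ~ C(condition)", data=df).fit()
print(anova_lm(fit))   # F-test that all group means are equal
```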
Milgram's Obedience Experiments
Subjects were recruited to take part in an experiment on memory and learning. The subject played the role of the teacher and conducted a paired-associate learning task with the student. The subject was instructed by the experimenter to administer a shock to the student each time the student gave a wrong response. Moreover, the subject was instructed to move one level higher on the shock generator each time the student gave a wrong answer. The subject was also instructed to announce the voltage level before administering a shock.
Four Experimental Conditions
1. Remote-Feedback condition: The student is placed in a room where he cannot be seen by the subject, nor can his voice be heard; his answers flash silently on a signal box. However, at 300 volts the laboratory walls resound as he pounds in protest. After 315 volts, no further answers appear, and the pounding ceases.
2. Voice-Feedback condition: Same as the remote-feedback condition except that vocal protests were introduced that could be heard clearly through the walls of the laboratory.
3. Proximity: Same as the voice-feedback condition except that the student was placed in the same room as the subject, a few feet away. Thus, he was visible as well as audible.
4. Touch-Proximity: Same as the proximity condition except that the student received a shock only when his hand rested on a shock plate. At the 150-volt level, the student demanded to be let free and refused to place his hand on the shock plate. The experimenter ordered the subject to force the victim's hand onto the plate.
Two Key Questions
1. Is there any difference among the mean voltage levels of the four conditions?
2. If there are differences, which conditions specifically are different?
Multiple Regression Model for Analysis of Variance
To answer these questions, we can fit a multiple regression model with voltage level as the response and one categorical explanatory variable (condition). We obtain a sample from each level of the categorical variable (group) and are interested in estimating the population means of the groups based on these samples. Assumptions of the multiple regression model for one-way analysis of variance:
– Linearity: automatically satisfied.
– Constant variance: the spread within each group is the same.
– Normality: the distribution within each group is normal.
– Independence: the sample consists of independent observations.
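The constant-variance and normality assumptions can be checked group by group. A sketch with scipy, using the same hypothetical milgram.csv as above (Levene's test and Shapiro-Wilk are one reasonable choice, not the only one):

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("milgram.csv")
groups = [g["voltage"].values for _, g in df.groupby("condition")]

# Compare spreads: group standard deviations and Levene's test
print(df.groupby("condition")["voltage"].std())
print(stats.levene(*groups))

# Normality within each group, e.g. Shapiro-Wilk per condition
for name, g in df.groupby("condition"):
    print(name, stats.shapiro(g["voltage"]))
```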
Comparing the Groups
The coefficient on Condition[Proximity] is the difference between the estimated mean of the proximity condition and the average of the means of all the conditions; a negative coefficient means proximity's mean is below that average. Adding the coefficient to the intercept gives the sample mean of the proximity group.
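JMP's parameterization for categorical variables is effect coding; it can be reproduced with patsy's Sum contrasts, again with the hypothetical milgram.csv:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("milgram.csv")
fit = smf.ols("voltage ~ C(condition, Sum)", data=df).fit()
print(fit.params)   # intercept = average of group means; others = deviations
```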
The Effect Test tests the null hypothesis that the means of all four conditions are the same versus the alternative hypothesis that at least two of the conditions have different means. The p-value of the Effect Test is very small – strong evidence that the population means are not the same for all four conditions.
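The Effect Test here is the usual one-way ANOVA F-test, so scipy's f_oneway gives the same F statistic and p-value directly (same hypothetical data file as above):

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("milgram.csv")
groups = [g["voltage"].values for _, g in df.groupby("condition")]
print(stats.f_oneway(*groups))   # F statistic and p-value for equal means
```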
1. Is there any difference among the mean voltage levels of the four conditions? Yes, there is strong evidence of a difference: the p-value of the Effect Test is very small.
2. If there are differences, which conditions specifically are different? This involves the problem of multiple comparisons. We will study this on Tuesday, December 7th, after the midterm.