Multiple Regression in Practice
  The value of the outcome variable depends on several explanatory variables.
  F-test: judges whether the explanatory variables in the model, taken together, adequately describe the outcome variable.
  t-test: applies to each individual explanatory variable. A significant t indicates that the explanatory variable has an effect on the outcome variable while controlling for the other X’s.
  t-ratio: used to judge the relative importance of each explanatory variable.
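To make the F-test and the t-ratios concrete, here is a minimal sketch of how both are computed from a least-squares fit. It assumes NumPy is available; the function name ols_summary and the synthetic data (x1 has a real effect, x2 has none) are illustrative and not taken from the slides.

```python
import numpy as np

def ols_summary(X, y):
    """Fit y = b0 + X b by least squares; return coefficients, t-ratios and F."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])          # add the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = float(resid @ resid)                    # residual sum of squares
    sst = float(((y - y.mean()) ** 2).sum())      # total sum of squares
    ssr = sst - sse                               # regression sum of squares
    mse = sse / (n - k - 1)                       # residual mean square
    se = np.sqrt(np.diag(mse * np.linalg.inv(A.T @ A)))
    t = beta / se                                 # one t-ratio per coefficient
    f = (ssr / k) / mse                           # overall F statistic
    return beta, t, f

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)           # x2 plays no part in y

beta, t, f = ols_summary(np.column_stack([x1, x2]), y)
print("coefficients:", np.round(beta, 2))
print("t-ratios:    ", np.round(t, 2))
print("F statistic: ", round(f, 1))
```

The F statistic is compared with an F distribution on k and n - k - 1 degrees of freedom, and each t-ratio with a t distribution on n - k - 1 degrees of freedom. In this sketch the t-ratio for x1 should be far larger in magnitude than the one for x2.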
Basic Assumptions
  The mean value of the outcome variable for a set of explanatory variables is described by the regression equation.
  Values are normally distributed around the regression line.
  The variance around the regression line is the same for all values of the explanatory variables.
  The explanatory variables are not correlated with one another.
Problem of Multicollinearity
  When explanatory variables are correlated with one another, it is difficult to interpret the effect of each explanatory variable on the outcome.
  Check by:
    Correlation coefficient matrix (see next slide).
    A significant F-test accompanied by insignificant t-tests.
    Large changes in the regression coefficients when variables are added or deleted.
    Variance inflation factor (VIF): a VIF greater than 4 or 5 indicates multicollinearity.
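Two of the checks above, the correlation matrix and the variance inflation factor, can be sketched in a few lines. The sketch uses the common definition VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing explanatory variable j on all the others. NumPy, the helper names r_squared and vif, and the synthetic data (x2 is built as a near copy of x1) are assumptions for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of a least-squares fit of y on X with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

def vif(X):
    """Variance inflation factor for each column of X."""
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)          # every column except j
        factors.append(1.0 / (1.0 - r_squared(others, X[:, j])))
    return np.array(factors)

# Synthetic explanatory variables (illustrative only).
rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)                # nearly a copy of x1
x3 = rng.normal(size=n)                           # unrelated to the others
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)               # correlation coefficient matrix
vifs = vif(X)
print(np.round(corr, 2))
print("VIF:", np.round(vifs, 1))
```

In this setup x1 and x2 should show VIF values well above the 4 to 5 threshold, while x3 should stay close to 1.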
Example of a Matrix Plot
  This matrix plot comprises several scatter plots that show visually whether variables are correlated.
  The arrow points at a scatter plot where two explanatory variables are strongly correlated.
Selecting the Most Economical Model
  The purpose is to find the smallest number of explanatory variables that make the maximum contribution to the outcome.
  After excluding variables that may be causing multicollinearity, examine the table of t-ratios in the full model. Those variables with a significant t are included in the subset.
  In the Analysis of Variance table, examine the column headed SEQ SS. Check that the candidate variables are indeed making a sizable contribution to the Regression Sum of Squares.
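The SEQ SS column can be reproduced by hand: each entry is the increase in the regression sum of squares when one more variable is added, in the order the variables were entered. The following is a rough sketch under that reading, assuming NumPy; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def regression_ss(cols, y):
    """Regression sum of squares for a model with an intercept plus cols."""
    A = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ beta
    return float(((fitted - y.mean()) ** 2).sum())

def sequential_ss(cols, y):
    """Extra regression SS gained as each variable enters, in the given order."""
    seq, previous = [], 0.0
    for i in range(1, len(cols) + 1):
        current = regression_ss(cols[:i], y)
        seq.append(current - previous)
        previous = current
    return seq

# Synthetic data (illustrative): x1 matters most, x2 less, x3 not at all.
rng = np.random.default_rng(2)
n = 200
x1, x2, x3 = (rng.normal(size=n) for _ in range(3))
y = 3.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

cols = [x1, x2, x3]
seq = sequential_ss(cols, y)
for name, value in zip(["x1", "x2", "x3"], seq):
    print(name, round(value, 1))
```

Sequential sums of squares depend on the order of entry, so a variable that looks unimportant when entered last may look important when entered first if it is correlated with the others.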
Stepwise Regression Analysis
  Stepwise starts by finding the explanatory variable with the highest R². It then checks each of the remaining variables until the two variables with the highest R² are found. It then repeats the process until the three variables with the highest R² are found, and so on.
  The overall R² gets larger as more variables are added.
  Stepwise may be useful in the early exploratory stage of data analysis, but it should not be relied upon for the confirmatory stage.
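The procedure described above can be sketched as a greedy forward selection on R². This is a simplification: statistical packages usually add entry and removal thresholds (for example on F or p-values), which are left out here. NumPy, the function names, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of a least-squares fit of y on X with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

def forward_selection(X, y, max_vars):
    """At each step add the variable that gives the highest R-squared."""
    remaining = list(range(X.shape[1]))
    chosen, path = [], []
    while remaining and len(chosen) < max_vars:
        scores = [(r_squared(X[:, chosen + [j]], y), j) for j in remaining]
        best_r2, best_j = max(scores)
        chosen.append(best_j)
        remaining.remove(best_j)
        path.append((list(chosen), best_r2))
    return path

# Synthetic data (illustrative): only columns 0 and 2 affect y.
rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(size=n)

path = forward_selection(X, y, max_vars=4)
for variables, r2 in path:
    print(variables, round(r2, 3))
```

The printed R² never decreases from one step to the next, which is why R² alone cannot tell you when to stop adding variables.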
Is the Model Adequate?
  Judged by the following:
    R² value: the increase in R² on adding another variable gives a useful hint.
    Adjusted R² is a more sensitive measure.
    Smallest value of s (the standard deviation of the residuals).
    Cp statistic: choose a model with a small Cp whose value is closest to p (the number of parameters in the model).
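The four criteria above can be computed side by side for competing subsets. The sketch below assumes the usual definitions: adjusted R² = 1 - (1 - R²)(n - 1)/(n - p), s = sqrt(SSE/(n - p)), and Mallows' Cp = SSE_p / MSE_full - (n - 2p), with p counting the intercept. NumPy, the function names, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def sse_of(X, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def adequacy(X, y, subset):
    """R2, adjusted R2, s and Mallows' Cp for the model using columns in subset."""
    n, k_full = X.shape
    p = len(subset) + 1                           # parameters, intercept included
    sse = sse_of(X[:, subset], y)
    sst = float(((y - y.mean()) ** 2).sum())
    mse_full = sse_of(X, y) / (n - k_full - 1)    # error variance from full model
    r2 = 1.0 - sse / sst
    return {
        "p": p,
        "r2": r2,
        "adj_r2": 1.0 - (1.0 - r2) * (n - 1) / (n - p),
        "s": (sse / (n - p)) ** 0.5,
        "cp": sse / mse_full - (n - 2 * p),
    }

# Synthetic data (illustrative): columns 0 and 1 matter, column 2 does not.
rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)

results = {
    "x1 only": adequacy(X, y, [1]),
    "x0 + x1": adequacy(X, y, [0, 1]),
    "full": adequacy(X, y, [0, 1, 2]),
}
for name, stats in results.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

By construction Cp equals p for the full model, so Cp is only informative when comparing subsets. A subset that omits an important variable shows a Cp far above its p.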