week 10
ANOVA F Test in Multiple Regression

In multiple regression, the ANOVA F test is designed to test the following hypotheses:
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \quad \text{versus} \quad H_a: \text{at least one } \beta_j \neq 0$$
This test assesses whether the model has any predictive ability. The test statistic is
$$F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n-k-1)}$$
If $H_0$ is true, this test statistic has an F distribution with $k$ and $n-k-1$ degrees of freedom.
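As a rough illustration of the overall F statistic, the sketch below fits a regression by least squares with NumPy and forms F = MSR/MSE. The data are simulated purely for illustration, and the function name `overall_f` is hypothetical; this is not the course's SAS output.

```python
import numpy as np

def overall_f(X, y):
    """Overall ANOVA F statistic for a regression of y on the k columns of X.

    X holds the predictors only; the intercept column is added here.
    Returns (F, numerator df, denominator df, SSR, SSE).
    """
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ b
    sse = float(np.sum((y - fitted) ** 2))         # error sum of squares
    ssr = float(np.sum((fitted - y.mean()) ** 2))  # regression sum of squares
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr / mse, k, n - k - 1, ssr, sse

# Simulated data (an assumption for illustration): only the first predictor matters.
rng = np.random.default_rng(0)
n, k = 25, 3
X = rng.normal(size=(n, k))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)

f_stat, df1, df2, ssr, sse = overall_f(X, y)
ssto = float(np.sum((y - y.mean()) ** 2))
print(f"F = {f_stat:.2f} on {df1} and {df2} df")
```

The P-value would come from comparing `f_stat` with the F distribution on `df1` and `df2` degrees of freedom, using a table or a statistics package.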
F-Test versus t-Tests in Multiple Regression

In multiple regression, the F-test assesses the overall model, while the t-tests assess individual coefficients.

If the F-test is significant and some or all of the t-tests are significant, then there are useful explanatory variables for predicting Y.

If the F-test is not significant (large P-value) and none of the t-tests is significant, then no explanatory variable contributes to the prediction of Y.

If the F-test is significant and none of the t-tests is significant, this is an indication of “multicollinearity”, i.e., correlated X’s. It means that individual X’s don’t contribute to the prediction of Y over and above the other X’s.
If the F-test is not significant but some of the t-tests are significant, this is an indication of one of two things:

(1) The model has no predictive ability, and the significant t-tests are type I errors, which we can expect to see when there are many predictors.

(2) The predictors were chosen poorly. If one useful predictor is added to many that are unrelated to the outcome, its contribution may not be enough for the model to have statistically significant predictive ability.
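To see why type I errors become likely with many predictors, the sketch below computes the chance of at least one false rejection among m tests at level α. It assumes the tests are independent, which the t-tests in a regression generally are not, so the numbers are only a rough guide. The function name is hypothetical.

```python
def prob_at_least_one_type_i_error(num_tests, alpha=0.05):
    """P(at least one rejection) when every null hypothesis is true.

    Assumes independent tests, which is a simplification for regression t-tests.
    """
    return 1.0 - (1.0 - alpha) ** num_tests

for m in (1, 5, 10, 20):
    print(m, round(prob_at_least_one_type_i_error(m), 3))
# prints:
# 1 0.05
# 5 0.226
# 10 0.401
# 20 0.642
```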
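CIs and PIs in Multiple Regression

The standard error of the estimated mean value of Y at new values of the explanatory variables ($X_h$) is:
$$s\{\hat{Y}_h\} = \sqrt{MSE \cdot X_h'(X'X)^{-1}X_h}$$
The 100(1-α)% CI for the mean value of Y at $X_h$ is:
$$\hat{Y}_h \pm t_{n-k-1;\,\alpha/2}\; s\{\hat{Y}_h\}$$
The standard error of the predicted value of Y at new values of the explanatory variables ($X_h$) is:
$$s\{\text{pred}\} = \sqrt{MSE \cdot \left(1 + X_h'(X'X)^{-1}X_h\right)}$$
The 100(1-α)% PI for the predicted value of Y at $X_h$ is:
$$\hat{Y}_h \pm t_{n-k-1;\,\alpha/2}\; s\{\text{pred}\}$$
Here $\hat{Y}_h = X_h'b$ is the fitted value at $X_h$, and $t_{n-k-1;\,\alpha/2}$ is the upper α/2 critical value of the t distribution with $n-k-1$ degrees of freedom.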
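The two standard errors differ only by the extra "1 +" term, which accounts for the variability of a new observation around its mean. A minimal NumPy sketch, using simulated data and a hypothetical function name, computes both; it stops short of the intervals themselves because the t critical value needs a table or a statistics package.

```python
import numpy as np

def mean_and_prediction_se(X, y, x_h):
    """Standard errors at x_h for the estimated mean and for a new observation.

    X is the n x (k+1) design matrix including the column of ones;
    x_h is the matching length-(k+1) vector (leading 1 for the intercept).
    Returns (fitted value, se of mean, se of prediction, MSE).
    """
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    mse = float(resid @ resid) / (n - p)   # n - k - 1 degrees of freedom
    quad = float(x_h @ xtx_inv @ x_h)      # X_h'(X'X)^{-1}X_h
    y_hat_h = float(x_h @ b)
    se_mean = np.sqrt(mse * quad)
    se_pred = np.sqrt(mse * (1.0 + quad))
    return y_hat_h, se_mean, se_pred, mse

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 30
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(size=n)

x_h = np.array([1.0, 0.5, -0.2])
y_hat_h, se_mean, se_pred, mse = mean_and_prediction_se(X, y, x_h)
print(y_hat_h, se_mean, se_pred)
```

The prediction standard error is always the larger of the two, so a PI is always wider than the CI for the mean at the same $X_h$.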
Example

Consider the house prices example. Suppose we are interested in predicting the price of a house with 2 bedrooms (bdr), 750 square feet (sqft), 1 fireplace (fp), 5 rooms (rms), storm windows (st = 1), a 25-foot lot, 1.5 baths, and a 1-car garage. Then $X_h$ is ….
Multicollinearity

Multicollinearity occurs when explanatory variables are highly correlated, in which case it is difficult or impossible to measure their individual influence on the response.

The fitted regression equation is unstable. The estimated regression coefficients vary widely from data set to data set (even if the data sets are very similar) and depend on which predictor variables are in the model.

The estimated regression coefficients may even have the opposite sign from what is expected (e.g., bedroom in the house price example).
The regression coefficients may not be statistically significantly different from 0 even when the corresponding explanatory variable is known to have a relationship with the response.

When some X’s are perfectly correlated, we can’t estimate β because X’X is singular.

Even when X’X is only close to singular, its determinant is close to 0 and the standard errors of the estimated coefficients are large.
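The effect on X’X can be seen numerically. The sketch below, on simulated data with hypothetical variable names, compares a perfectly collinear design, a nearly collinear one, and one with unrelated predictors. The diagonal of $(X'X)^{-1}$, multiplied by MSE, gives the coefficient variances, so large diagonal entries mean large standard errors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
ones = np.ones(n)
x1 = rng.normal(size=n)

# Perfect collinearity: the third column is an exact multiple of the second,
# so X (and therefore X'X) has rank 2 instead of 3 and X'X cannot be inverted.
X_perfect = np.column_stack([ones, x1, 2.0 * x1])
rank_perfect = np.linalg.matrix_rank(X_perfect)

# Near collinearity: x2 is x1 plus a little noise.
X_near = np.column_stack([ones, x1, x1 + rng.normal(scale=0.01, size=n)])

# Unrelated predictors, for comparison.
X_indep = np.column_stack([ones, x1, rng.normal(size=n)])

def coef_variance_factors(X):
    """Diagonal of (X'X)^{-1}; times MSE these are the coefficient variances."""
    return np.diag(np.linalg.inv(X.T @ X))

factor_near = coef_variance_factors(X_near)[1]
factor_indep = coef_variance_factors(X_indep)[1]
print(rank_perfect, factor_near, factor_indep)
```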
Assessing Multicollinearity

To assess multicollinearity we calculate the Variance Inflation Factor (VIF) for each of the predictor variables in the model. The variance inflation factor for the i-th predictor variable is defined as
$$VIF_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ is the coefficient of multiple determination obtained when the i-th predictor variable is regressed against the other predictor variables. A large value of $VIF_i$ is a sign of multicollinearity.
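The definition translates directly into code: regress each predictor on the others, take $R_i^2$, and form $1/(1 - R_i^2)$. The following is a minimal NumPy sketch on simulated data; the function names are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """R^2 from a least-squares regression of y on X (X includes the ones column)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    sse = float(resid @ resid)
    ssto = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - sse / ssto

def variance_inflation_factors(predictors):
    """VIF for each column of an n x k predictor matrix (no intercept column)."""
    n, k = predictors.shape
    vifs = []
    for i in range(k):
        target = predictors[:, i]
        others = np.delete(predictors, i, axis=1)
        design = np.column_stack([np.ones(n), others])
        vifs.append(1.0 / (1.0 - r_squared(design, target)))
    return np.array(vifs)

# Simulated predictors (illustration only): x2 is nearly a copy of x1, x3 is unrelated.
rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this setup the first two VIFs come out large while the third stays near 1, its minimum possible value.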
Indicator Variables

Often, a data set will contain categorical variables which are potential predictor variables. To include these categorical variables in the model we define dummy variables. A dummy variable takes only two values, 0 and 1. For a categorical variable with j categories we need j-1 indicator variables; the remaining category is the baseline, for which all the indicators are 0.
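A small sketch of the j-1 coding in plain Python. The function name and the colour data are hypothetical, and choosing the first level in sorted order as the baseline is an assumption; any category can serve as the baseline.

```python
def indicator_columns(values, baseline=None):
    """Code a categorical variable with j levels as j-1 indicator (0/1) variables.

    Returns the non-baseline levels (one per indicator) and one row of
    indicators per observation. Baseline observations get all zeros.
    """
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]  # assumed default: first level in sorted order
    non_baseline = [lvl for lvl in levels if lvl != baseline]
    rows = [[1 if v == lvl else 0 for lvl in non_baseline] for v in values]
    return non_baseline, rows

# Hypothetical three-category variable: 3 levels need 2 indicators.
colours = ["red", "green", "blue", "green"]
levels, rows = indicator_columns(colours)
print(levels)  # ['green', 'red']
print(rows)    # [[0, 1], [1, 0], [0, 0], [1, 0]]
```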
Example

Meadowfoam is a small plant found in the US Pacific Northwest. Its seed oil is unique among vegetable oils for its long carbon strings, and it is nongreasy and highly stable. A study was conducted to find out how to elevate meadowfoam production to a profitable crop.

In a growth chamber, plants were grown under 6 light intensities (in micromol/m^2/sec) and two timings of the onset of the light treatment, either late (coded 0) or early (coded 1). The response variable is the average number of flowers per plant for 10 seedlings grown under each of the 12 treatment conditions.

Because this is an experiment, we can draw causal conclusions. There are two explanatory variables, light intensity and timing. There are 24 data points, 2 at each treatment combination.
Questions of Interest

What is the effect of timing on seedling growth?

What are the effects of the different light intensities?

Does the effect of intensity depend on timing?
Indicator Variables in Meadowfoam Example

To include the variable timing in the model we define a dummy variable that takes the value 1 for early timing and the value 0 for late timing.

The variable intensity has 6 levels (150, 300, 450, 600, 750, 900). We will treat these levels as 6 categories. This is useful if we expect a complex relationship between the response variable and intensity, and if the goal is to determine which intensity level is “best”.

The cost of using dummy variables is degrees of freedom: a variable with several categories needs several dummy variables (here, 5 for the 6 intensity levels).

We define the dummy variables as follows….
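The degrees-of-freedom cost can be counted directly. The sketch below builds the design rows for the 24 observations under two assumptions that the slides do not state: intensity 150 is taken as the baseline category, and the model is additive (intercept, timing, and intensity, with no interaction terms).

```python
intensities = [150, 300, 450, 600, 750, 900]
timings = [0, 1]   # 0 = late, 1 = early
replicates = 2     # 2 data points at each treatment combination

baseline = 150     # assumed baseline category; the slides do not specify one
dummy_levels = [lvl for lvl in intensities if lvl != baseline]

# One row per observation: intercept, timing dummy, then 5 intensity dummies.
design = []
for level in intensities:
    for early in timings:
        for _ in range(replicates):
            row = [1, early] + [1 if level == d else 0 for d in dummy_levels]
            design.append(row)

n = len(design)
params_categorical = len(design[0])           # 1 + 1 + 5
error_df_categorical = n - params_categorical

# For comparison: intensity entered as a single numeric predictor.
params_numeric = 3                            # intercept, timing, intensity slope
error_df_numeric = n - params_numeric

print(n, params_categorical, error_df_categorical, error_df_numeric)
# prints: 24 7 17 21
```

Treating intensity as categorical uses 4 more degrees of freedom than a single slope would, which is the price of not assuming a straight-line relationship.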
Partial F-test

The partial F-test is designed to test whether a subset of the β’s are all 0 simultaneously. The approach has two steps. First, we fit a model with all of the predictor variables; we call this model the “full model”. Then we fit a model without the predictor variables whose coefficients we are interested in testing; we call this model the “reduced model”. We then compare the SSR and SSE in these two models….
Test Statistic for Partial F-test

To test whether the coefficients of some of the explanatory variables are all 0 we use the following test statistic:
$$F = \frac{\text{Extra SS}/\text{Extra df}}{MSE_{full}}$$
where Extra SS = $SSE_{red} - SSE_{full}$, Extra df = number of parameters being tested, and $MSE_{full} = SSE_{full}/(n-k-1)$ with k the number of predictors in the full model. If the null hypothesis is true, this statistic has an F distribution with Extra df and $n-k-1$ degrees of freedom.

To get the Extra SS in SAS we can simply fit two regressions (reduced and full), or we can look at the Type I SS, which are also called Sequential Sums of Squares. The Sequential SS gives the additional contribution to SSR that each variable makes over and above the variables previously listed. The Sequential SS depends on the order in which the variables are stated in the model statement; the variables whose coefficients we want to test must be listed last.
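The two-regression route can be sketched in NumPy as follows. The data are simulated and the function names are hypothetical; the example tests whether the coefficients of the last two predictors are both 0, and also checks that the sequential sums of squares for those two variables, entered last, add up to the Extra SS.

```python
import numpy as np

def sse(X, y):
    """Error sum of squares from a least-squares fit of y on X (X includes ones)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return float(resid @ resid)

def partial_f(X_reduced, X_full, y):
    """Partial F statistic comparing a reduced model nested in a full model.

    Returns (F, Extra df, error df of the full model, Extra SS).
    """
    n = len(y)
    sse_full = sse(X_full, y)
    extra_ss = sse(X_reduced, y) - sse_full
    extra_df = X_full.shape[1] - X_reduced.shape[1]
    error_df = n - X_full.shape[1]            # n - k - 1
    f_stat = (extra_ss / extra_df) / (sse_full / error_df)
    return f_stat, extra_df, error_df, extra_ss

# Simulated data (illustration only): x2 and x3 have no real effect.
rng = np.random.default_rng(2)
n = 40
ones = np.ones(n)
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X_red = np.column_stack([ones, x1])
X_mid = np.column_stack([ones, x1, x2])
X_full = np.column_stack([ones, x1, x2, x3])

f_stat, extra_df, error_df, extra_ss = partial_f(X_red, X_full, y)

# Sequential (Type I) SS for x2 and then x3, listed last in the model.
seq_ss_x2 = sse(X_red, y) - sse(X_mid, y)
seq_ss_x3 = sse(X_mid, y) - sse(X_full, y)
print(f_stat, extra_df, error_df, extra_ss, seq_ss_x2 + seq_ss_x3)
```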
Partial Correlation

Recall that, for simple regression, the correlation between X and Y is
$$r = \pm\sqrt{R^2} = \pm\sqrt{\frac{SSR}{SSTO}}$$
with the sign of the estimated slope.

When the full model has only 1 additional predictor variable compared with the reduced model, the coefficient of partial correlation is
$$r_{\text{partial}} = \pm\sqrt{\frac{SSE_{red} - SSE_{full}}{SSE_{red}}}$$
It is negative if the coefficient of the additional predictor variable is negative and positive otherwise. It is a measure of the contribution of the additional predictor variable, given that the others are in the model.
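A minimal sketch of the partial correlation calculation from the two error sums of squares. The function name and the numbers in the example are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def partial_correlation(sse_reduced, sse_full, added_coefficient):
    """Coefficient of partial correlation for one predictor added to a reduced model.

    The squared value is the fraction of the reduced model's SSE that the
    added predictor explains; the sign follows the added predictor's coefficient.
    """
    r_squared_partial = (sse_reduced - sse_full) / sse_reduced
    sign = -1.0 if added_coefficient < 0 else 1.0
    return sign * math.sqrt(r_squared_partial)

# Hypothetical values: SSE drops from 100 to 64 when the predictor is added,
# and its estimated coefficient is negative.
r_partial = partial_correlation(100.0, 64.0, added_coefficient=-0.8)
print(round(r_partial, 3))  # -0.6
```

Here the added predictor accounts for 36% of the variation left unexplained by the reduced model.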