Analysis of Variance: Some Review and Some New Ideas
Remember the concepts of variance and the standard deviation:
Variance is the square of the standard deviation.
Standard deviation (s) is the square root of the sum of the squared deviations from the mean divided by the number of cases. See p. 47 in the text.
We now want to use these concepts in regression analysis. We will be learning a new statistical test, the F test, which we will use to assess the statistical significance of a regression equation as a whole (not just the individual coefficients).
We will also use Analysis of Variance (ANOVA) to compare differences among more than two means. To date we have compared means with a t test.
Equations: Mean, Variance, Standard Deviation, Coefficient of Variation
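One way to write these four quantities, consistent with the definitions in this review, is sketched here. It assumes division by the number of cases N, as the text describes, and the usual definition of the coefficient of variation as the standard deviation divided by the mean. The symbols y_i, \bar{y}, s and CV are notation chosen for this sketch.

```latex
\begin{align*}
\text{Mean:} \quad & \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i \\
\text{Variance:} \quad & s^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y})^2 \\
\text{Standard deviation:} \quad & s = \sqrt{s^2} \\
\text{Coefficient of variation:} \quad & CV = \frac{s}{\bar{y}}
\end{align*}
```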
Steps for calculating variance
1. Calculate the mean of a variable.
2. Find the deviations from the mean: subtract the variable mean from each case.
3. Square each of the deviations from the mean.
4. The variance is the mean of the squared deviations from the mean, so sum the squared deviations from step 3 and divide by the number of cases.
(When we did these steps before we were interested in going on to calculate a standard deviation and coefficient of variation. Now we'll just stick with variance.)
Calculating Variance: Worked Example
Applying steps 1 through 4 to a variable with 20 cases:
The sum of the squared deviations = 198.950
Variance = 198.950/20 = 9.948
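The four steps can be traced in a few lines of Python. This is a minimal sketch: the list `values` is hypothetical illustration data, not the 20 cases behind the figures above, and the variance divides by the number of cases as the steps describe.

```python
# Hypothetical values for illustration only (not the data from the slides).
values = [4.0, 7.0, 1.0, 9.0, 4.0]
n = len(values)

# Step 1: calculate the mean of the variable.
mean = sum(values) / n

# Step 2: subtract the mean from each case to get the deviations.
deviations = [v - mean for v in values]

# Step 3: square each deviation.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: sum the squared deviations and divide by the number of cases.
sum_of_squares = sum(squared_deviations)
variance = sum_of_squares / n

print("mean:", mean)
print("sum of squared deviations:", sum_of_squares)
print("variance:", variance)
```

The intermediate `sum_of_squares` is kept as its own variable because it is the quantity the regression material builds on.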
A New Concept: Sum of Squares
The sum of the squared deviations from the mean is called the Sum of Squares.
Remember: when we know nothing else about an interval variable, the best estimate of it is its mean. By extension, the sum of squares around the mean is the best estimate of the sum of squared deviations if we know nothing else about the variable.
But when we have more information, for example in a statistically significant bivariate regression model, we can improve on the best estimate of the dependent variable by using the information from the independent variable to estimate it.
The regression equation is a better estimator of food costs than the mean of food costs.
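Calculating Total Sum of Squares
The variance in the output below is computed by dividing the sum of squares by N-1 rather than N, so to recover the sum of squares, multiply the variance by N-1:
Total Sum of Squares = 8127.019*(638-1), or about 5,176,911

Statistics: TOTAL FOOD COSTS
N Valid      638
N Missing    0
Mean         270.2310
Variance     8127.019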
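The same multiplication as a short Python check, using the N and variance from the output above. The variable names are chosen for this sketch.

```python
# Figures taken from the TOTAL FOOD COSTS output.
n_cases = 638
sample_variance = 8127.019

# Total sum of squares = variance * (N - 1).
total_ss = sample_variance * (n_cases - 1)

print(round(total_ss, 3))
```

Because the variance is rounded to three decimals in the output, the product can differ slightly in the decimals from a total sum of squares reported directly by the software.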
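Calculations for the Regression Sum of Squares
The regression sum of squares (RSS) equals the sum of the squared deviations between yhat (predicted y) and ymean:
$RSS = \sum (\hat{y} - \bar{y})^2$
Residual Sum of Squares = TSS - RSS, where TSS is the total sum of squares.
(Here RSS stands for the regression sum of squares, not the residual sum of squares.)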
Now we want to estimate how much better
To do that, we use the sum of squares calculations.
We partition the total sum of squares (TSS), i.e., the sum of squared deviations from the mean, into two parts:
The first part is the sum of squared deviations accounted for by the regression equation (the Regression Sum of Squares).
The second part is the sum of squared deviations left over, i.e., not accounted for by the regression equation. More formally, TSS - Regression Sum of Squares = the Residual Sum of Squares.
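The partition can be checked numerically. The sketch below fits a bivariate least-squares line to a small hypothetical data set (`x` and `y` are made up for illustration, not the food cost data) and confirms that the regression and residual sums of squares add up to the total.

```python
# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(y)

x_mean = sum(x) / n
y_mean = sum(y) / n

# Least-squares slope (b) and intercept (a) for y = a + b*x.
b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
a = y_mean - b * x_mean
y_hat = [a + b * xi for xi in x]

# Total: squared deviations of y from its mean.
total_ss = sum((yi - y_mean) ** 2 for yi in y)
# Regression: squared deviations of predicted y from the mean of y.
regression_ss = sum((yh - y_mean) ** 2 for yh in y_hat)
# Residual: squared deviations of y from predicted y.
residual_ss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

print("TSS:", total_ss)
print("Regression SS:", regression_ss)
print("Residual SS:", residual_ss)
print("Regression + Residual:", regression_ss + residual_ss)
```

Here the residual sum of squares is computed directly from the prediction errors rather than by subtraction, so the last line is an independent check of the partition.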
Now let's look at what we've accomplished
To do that, we'll calculate an F test.
We need to add information about degrees of freedom. Remember the concept: how many values are free to vary once the statistic is fixed.
If we know all the values, we can calculate the mean. If we know the mean, and we know all the values but one, we can calculate that last value. So the last value is not free to vary, and fixing the mean uses up 1 degree of freedom.
For the F test, we need information about the degrees of freedom in the regression model. The formula is k-1, where k is the number of parameters to be estimated. For the bivariate model, the parameters are a and b, so 2-1 = 1.
Degrees of freedom continued
For the Residual Sum of Squares, the degrees of freedom is N-k, so for this model, 638-2 = 636.
We then calculate a mean square for each part by dividing the sum of squares by its degrees of freedom.
The F statistic is the regression mean square divided by the residual mean square.
The probability of the F statistic is read from the F distribution table, using both degrees of freedom.
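As a sketch of the arithmetic, the F statistic for the food cost model can be computed from the sums of squares quoted in this review (Regression Sum of Squares 2070301.432, Total Sum of Squares 5176911.308). The variable names are chosen for this example, and the probability lookup is left out because it needs an F table or a statistics package.

```python
# Sums of squares quoted in this review for the food cost model.
total_ss = 5176911.308
regression_ss = 2070301.432
residual_ss = total_ss - regression_ss

n_cases = 638
k = 2  # parameters estimated in the bivariate model: a and b

# Degrees of freedom for each part.
df_regression = k - 1
df_residual = n_cases - k

# Mean square = sum of squares divided by its degrees of freedom.
regression_ms = regression_ss / df_regression
residual_ms = residual_ss / df_residual

# F = regression mean square / residual mean square.
f_statistic = regression_ms / residual_ms

print("df:", df_regression, df_residual)
print("mean squares:", round(regression_ms, 3), round(residual_ms, 3))
print("F:", round(f_statistic, 2))
```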
Another Way to Think about R Square
The Regression Sum of Squares divided by the Total Sum of Squares is a measure of the proportion of variance explained by the model.
So 2070301.432/5176911.308 = .3999, or about 40%.
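The ratio is easy to verify, and the same two sums of squares tie R Square to the F statistic. The sketch below assumes the degrees of freedom for the bivariate model given earlier (1 and 636). The identity F = (R²/df regression) / ((1 - R²)/df residual) follows from dividing both mean squares by the total sum of squares.

```python
# Sums of squares quoted above for the food cost model.
regression_ss = 2070301.432
total_ss = 5176911.308

# R Square: proportion of the total sum of squares explained by the model.
r_squared = regression_ss / total_ss

# Degrees of freedom for the bivariate model (k - 1 and N - k).
df_regression = 1
df_residual = 636

# The F statistic expressed through R Square.
f_from_r_squared = (r_squared / df_regression) / ((1 - r_squared) / df_residual)

print("R Square:", round(r_squared, 4))
print("F from R Square:", round(f_from_r_squared, 2))
```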