Analysis of Covariance
ANOVA is a class of statistics developed to evaluate controlled experiments. Experimental control, random selection of subjects, and random assignment of subjects to subgroups are devices to control or hold constant all the other (UNMEASURED) influences on the dependent (Y_ij) variable so that the effects of the independent (X_ij) variable can be assessed. Without experimental control, random selection, and random assignment, other (non-random) differences besides the treatment variable enter the picture. Remember: inferential statistics only assess the likelihood that chance could have affected the sample results; they do not take into account non-random factors.
For example, without randomly selecting students and compelling them to take PPD 404, then randomly assigning them to an instructor, plus controlling their lives for an entire semester (e.g., forbidding them to work), differences that are not random creep in. To some extent, this problem of uncontrolled, non-random differences can be compensated for by introducing covariates as statistical controls. Covariates are continuous variables that hold constant non-random differences. For example, by asking students how many hours per week they were working, we could add this variable to our ANOVA model. Let's look briefly at the analysis of covariance with one of the classic examples in the statistical literature.
The data are from an experiment involving the use of two drugs for treating leprosy. Drug A and Drug B were experimental drugs; Drug C was a placebo. Subjects were children in the Philippines suffering from leprosy. Thirty children who were taken to a clinic were given Drug A, B, or C (the treatments) in order of their arrival. Thus each subgroup consisted of 10 children. The outcome measure, Y_ij, was a microscopic count of leprosy bacilli in samples taken from six body sites on each child at the end of the experiment. Data are in the following table.
———————————————————————————————————————————————————————
     Group A               Group B               Group C
 Y   Y²   X   X²  XY   Y   Y²   X   X²  XY   Y   Y²   X   X²  XY
———————————————————————————————————————————————————————
 [Individual observations did not survive in this copy.]
———————————————————————————————————————————————————————
 n_1 = 10              n_2 = 10              n_3 = 10       N = 30
 ΣY = 237    ΣX = 322
 Ȳ = 7.90    X̄ = 10.73
———————————————————————————————————————————————————————
First, let's perform a one-way analysis of variance on these data. As a short-cut for calculating the sum of squares, we will use the following algorithm:

SS = ΣY² − (ΣY)²/N

This is read: the sum of squares equals the sum of the squared Y-values minus the square of the summed Y-values divided by N. This short-cut was developed to speed up calculations in the days before widespread use of computer software. Three variations of this short-cut will be used.
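To see why the short-cut works, here is a small Python sketch (an illustration added here, not part of the original SAS-based materials) comparing it with the definitional sum of squared deviations on a made-up sample:

```python
# Short-cut form: SS = sum of squared values minus (sum of values)^2 / N.
def shortcut_ss(values):
    n = len(values)
    return sum(v * v for v in values) - sum(values) ** 2 / n

# Definitional form: sum of squared deviations from the mean.
def deviation_ss(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

# A small hypothetical sample (not the leprosy data).
sample = [6, 0, 2, 8, 11]
assert abs(shortcut_ss(sample) - deviation_ss(sample)) < 1e-9  # both ≈ 79.2
```

The two forms are algebraically identical; the short-cut simply avoids computing the mean and thirty subtractions by hand.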
First, let's find the total sum of squares. This is the sum of all the squared Y-values less the square of the summed Y-values divided by N:

SS_Total(Y) = 3161 − [(237)²/30]
SS_Total(Y) = 3161 − (56,169/30)
SS_Total(Y) = 3161 − 1872.3
SS_Total(Y) = 1288.7 ≈ 1289
Next, the between sum of squares can be obtained by applying the short-cut equation:

SS_Between(Y) = [(53)²/10 + (61)²/10 + (123)²/10] − [(237)²/30]
SS_Between(Y) = 2165.9 − 1872.3
SS_Between(Y) = 293.6 ≈ 294
Finally, the sum of squares within:

SS_Within(Y) = SS_Total(Y) − SS_Between(Y)
SS_Within(Y) = 1289 − 294
SS_Within(Y) = 995

Degrees of freedom are as before: N − 1 for total, J − 1 for between, and N − J for within.
These results can be assembled in the usual ANOVA summary table.

——————————————————————————————————————————————————————————————
Source               SS     df     Mean Square        F
——————————————————————————————————————————————————————————————
Between Groups      294      2          147.00      3.989
Within Groups       995     27           36.85
Total              1289     29
——————————————————————————————————————————————————————————————
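As a check on the arithmetic, the one-way table can be reproduced in a few lines of Python from the summary sums given in the text (again a sketch added here, not part of the original materials). Note that carried at full precision the F-ratio comes out 3.98, versus the 3.989 obtained when the sums of squares are first rounded to integers:

```python
# One-way ANOVA recomputed from the summary sums in the text:
# group sums of Y (53, 61, 123), sum of squared Y-values (3161), N = 30.
group_sums_y = [53, 61, 123]      # Drugs A, B, C
sum_y_sq = 3161                   # ΣY² over all 30 children
n, N, J = 10, 30, 3

correction = sum(group_sums_y) ** 2 / N                 # (ΣY)²/N = 1872.3
ss_total = sum_y_sq - correction                        # ≈ 1289
ss_between = sum(s ** 2 / n for s in group_sums_y) - correction   # ≈ 294
ss_within = ss_total - ss_between                       # ≈ 995

f_ratio = (ss_between / (J - 1)) / (ss_within / (N - J))
print(round(ss_total), round(ss_between), round(ss_within), round(f_ratio, 2))
# → 1289 294 995 3.98
```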
Because we do not want to "jump to conclusions" with the experimental drugs, an alpha level of 0.05 is too lenient. Let's set alpha at 0.01. This means that we have only one chance in 100 of wrongly rejecting the null hypothesis (ruling out chance as the explanation for differences in the effectiveness of the drugs). With alpha = 0.01, the critical value of F with 2 and 27 degrees of freedom is 5.49 (Appendix 3, p. 545). Since F is only 3.989, we CANNOT reject the null hypothesis. There is no evidence that either Drug A or Drug B is different from the placebo, Drug C, nor is there evidence that Drugs A and B differ from one another.
The researchers wanted to be sure that the children in each of the three groups were equally ill at the beginning of the experiment. Perhaps one of the drugs was effective, but, because the children who received it were more sick than those in the other groups, its effects were masked. A measure of illness at the START of the experiment was added to the statistical analysis as a control variable—as a covariate. This covariate was the count of bacilli at the same six body sites, but these counts were taken BEFORE any drugs were given. These data are in the table under columns headed X.
The general linear model that includes the influence of this covariate is written:

Y_ij = α + τ_j X_1ij + β X_2ij + ε_ij

where β is a linear coefficient expressing the influence of the covariate, X_2ij, on the dependent variable, Y_ij. If the covariate has no influence, β = 0.0. Therefore, the β X_2ij products all would be 0.0, and this term would drop out, leaving:

Y_ij = α + τ_j X_1ij + ε_ij
To adjust for the presence of the covariate, we need to calculate sums of squares and degrees of freedom for the covariate, X_2, AS WELL AS for the covariance between X_2 and Y. We construct covariance sums of squares from the cross-products, XY (the final column in each of the three table panels). Total sum of squares, between sum of squares, and within sum of squares for the covariate, X_2, are straightforward. For the total sum of squares for X:

SS_Total(X) = 4122 − [(322)²/30]
SS_Total(X) = 4122 − (103,684/30)
SS_Total(X) = 4122 − 3456.1
SS_Total(X) = 665.9 ≈ 666
For the between sum of squares:

SS_Between(X) = [(93)²/10 + (100)²/10 + (129)²/10] − [(322)²/30]
SS_Between(X) = 3529 − 3456
SS_Between(X) = 73

And for the within sum of squares:

SS_Within(X) = SS_Total(X) − SS_Between(X)
SS_Within(X) = 666 − 73
SS_Within(X) = 593
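The covariate sums of squares can be verified the same way, using only the sums reported in the text (a Python sketch, not part of the original materials):

```python
# Sums of squares for the covariate X:
# ΣX = 322, ΣX² = 4122, group sums of X = 93, 100, 129.
group_sums_x = [93, 100, 129]
sum_x_sq = 4122
n, N = 10, 30

correction_x = sum(group_sums_x) ** 2 / N               # (322)²/30 ≈ 3456.1
ss_total_x = sum_x_sq - correction_x                    # ≈ 666
ss_between_x = sum(s ** 2 / n for s in group_sums_x) - correction_x   # ≈ 73
ss_within_x = ss_total_x - ss_between_x                 # ≈ 593
print(round(ss_total_x), round(ss_between_x), round(ss_within_x))  # → 666 73 593
```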
For the cross-products, we use the same approach; e.g., for the cross-product total sum of squares:

SS_Total(XY) = 3277 − [(322)(237)/30]
SS_Total(XY) = 3277 − (76,314/30)
SS_Total(XY) = 3277 − 2544
SS_Total(XY) = 733
For the cross-product between sum of squares:

SS_Between(XY) = [(53)(93)/10 + (61)(100)/10 + (123)(129)/10] − [(322)(237)/30]
SS_Between(XY) = 2689.6 − 2543.8
SS_Between(XY) = 145.8 ≈ 146

And for the cross-product within sum of squares:

SS_Within(XY) = SS_Total(XY) − SS_Between(XY)
SS_Within(XY) = 733 − 146
SS_Within(XY) = 587
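The same pattern carries over to the cross-products, again computable directly from the sums given in the text (an illustrative Python sketch):

```python
# Cross-product sums of squares, using ΣXY = 3277 and the group sums
# of X (93, 100, 129) and Y (53, 61, 123).
sum_xy = 3277
sum_x, sum_y = 322, 237
group_sums_x = [93, 100, 129]
group_sums_y = [53, 61, 123]
n, N = 10, 30

correction_xy = sum_x * sum_y / N                       # (322)(237)/30 ≈ 2544
ss_total_xy = sum_xy - correction_xy                    # ≈ 733
ss_between_xy = sum(sx * sy / n
                    for sx, sy in zip(group_sums_x, group_sums_y)) - correction_xy  # ≈ 146
ss_within_xy = ss_total_xy - ss_between_xy              # ≈ 587
print(round(ss_total_xy), round(ss_between_xy), round(ss_within_xy))  # → 733 146 587
```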
Adjustments to the simple ANOVA results for the presence of the covariate should also look familiar. We need to adjust the within sum of squares, the between sum of squares, the within degrees of freedom, and the between degrees of freedom. Total sum of squares and total degrees of freedom are unchanged because (a) we are still trying to account for total variance in the dependent variable, Y ij, and (b) we have the same number of subjects, 30.
The within sum of squares adjustment is:

SS_Within(Adj) = SS_Within(Y) − [(SS_Within(XY))² / SS_Within(X)]
SS_Within(Adj) = 995 − [(587)²/593]
SS_Within(Adj) = 995 − (344,569/593)
SS_Within(Adj) = 995 − 581
SS_Within(Adj) = 414
The adjustment for the between sum of squares is:

SS_Between(Adj) = SS_Total(Y) − SS_Within(Adj)
SS_Between(Adj) = 1289 − 414
SS_Between(Adj) = 875

We lose one within degree of freedom for the presence of the covariate. The adjustment is

df_Within(Adj) = N − J − K

where K is the number of covariates. Here, df_Within(Adj) = 30 − 3 − 1 = 26.
Because of the IDENTITY between degrees of freedom, the adjustment for the between degrees of freedom is simply

df_Between(Adj) = df_Total − df_Within(Adj)
df_Between(Adj) = 29 − 26 = 3

The analysis of covariance results are contained in the following table.
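The adjustment arithmetic is easy to verify in Python, using the sums of squares computed above (a sketch added for illustration, not part of the original materials):

```python
# ANCOVA adjustment from the within and total sums of squares in the text.
ss_within_y, ss_total_y = 995, 1289
ss_within_x, ss_within_xy = 593, 587
N, J, K = 30, 3, 1                     # subjects, groups, covariates

ss_within_adj = ss_within_y - ss_within_xy ** 2 / ss_within_x   # ≈ 414
ss_between_adj = ss_total_y - ss_within_adj                     # ≈ 875
df_within_adj = N - J - K                                       # 26
df_between_adj = (N - 1) - df_within_adj                        # 3

f_ratio = (ss_between_adj / df_between_adj) / (ss_within_adj / df_within_adj)
print(round(f_ratio, 2))  # → 18.32
```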
——————————————————————————————————————————————————————————————
Source               SS     df     Mean Square        F
——————————————————————————————————————————————————————————————
Between Groups      875      3          291.67      18.32
Within Groups       414     26           15.92
Total              1289     29
——————————————————————————————————————————————————————————————

The presence of the covariate in the general linear model makes quite a difference. The within sum of squares has been reduced to 414 from 995 with the loss of only one degree of freedom. The between sum of squares, reflecting the differences among drugs, has nearly tripled, from 294 to 875 with a gain of only one degree of freedom. As a result, the F-ratio is now 18.32.
With alpha at 0.01 and 3 and 26 degrees of freedom, the critical value is now 4.64 (Appendix 3, p. 545). Thus, we REJECT the null hypothesis that none of the three drugs was different from any other:

H_0: μ_1 = μ_2 = μ_3

We conclude that the effect of at least ONE of the drugs differed from that of the others when we control for the seriousness of illness at the start of the experiment. To determine which drug(s) differ, we need to perform a comparison test such as the Scheffé test.
First, we need to "adjust" the subgroup means for the effect of the covariate. To do this, we need to calculate the value of the coefficient, β. With this value we can calculate adjusted subgroup means. The algorithm is:

β = SS_Within(XY) / SS_Within(X)

From the within sums of squares calculated above,

β = 587 / 593 = 0.99

The adjustment for the covariate is:

Ȳ_adj = Ȳ_j − β(X̄_j − X̄)

where Ȳ_j and X̄_j are the subgroup means and X̄ is the grand mean of the covariate.
The adjustments for the influence of the covariate are:

Ȳ_adj(A) = 5.3 − [0.99(9.3 − 10.73)] = 6.72
Ȳ_adj(B) = 6.1 − [0.99(10.0 − 10.73)] = 6.82
Ȳ_adj(C) = 12.3 − [0.99(12.9 − 10.73)] = 10.15

We can test the significance of difference between pairs of these subgroup means using the post hoc comparison method described earlier. Visual inspection of these adjusted means shows that children receiving Drug A and Drug B had fewer leprosy bacilli at the end of the experiment than did those children receiving Drug C, the placebo, controlling for pre-treatment illness.
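These adjusted means can be reproduced with the rounded coefficient 0.99 and grand mean 10.73 used in the text (a Python sketch; the subgroup means follow from the group sums divided by n = 10):

```python
# Adjusted subgroup means: Ybar_adj = Ybar_j - b * (Xbar_j - Xbar),
# using the rounded b = 0.99 and grand mean Xbar = 10.73 as in the text.
b = 0.99
grand_mean_x = 10.73
means_y = [5.3, 6.1, 12.3]     # 53/10, 61/10, 123/10 for Drugs A, B, C
means_x = [9.3, 10.0, 12.9]    # 93/10, 100/10, 129/10

adjusted = [round(y - b * (x - grand_mean_x), 2)
            for y, x in zip(means_y, means_x)]
print(adjusted)  # → [6.72, 6.82, 10.15]
```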
Using SAS for Analysis of Variance and Covariance

LIBNAME perm 'a:\';
LIBNAME library 'a:\';
OPTIONS NODATE NONUMBER PS=66;

PROC GLM DATA=perm.drugtest;
  CLASS drug;
  MODEL posttest = drug;
  TITLE1 'One-Way Analysis of Variance Example';
  TITLE2;
  TITLE3 'PPD 404';
RUN;

PROC GLM DATA=perm.drugtest;
  CLASS drug;
  MODEL posttest = drug pretest;
  TITLE1 'Analysis of Covariance Example';
  TITLE2;
  TITLE3 'PPD 404';
RUN;
One-Way Analysis of Variance Example
PPD 404

General Linear Models Procedure
Class Level Information

Class    Levels    Values
DRUG          3    A B C

Number of observations in data set = 30

Dependent Variable: POSTTEST

[The numeric columns of this listing did not survive in this copy. The Model, Error, and Corrected Total lines report the sums of squares (294, 995, and 1289), degrees of freedom (2, 27, and 29), and F-ratio (3.99) obtained by hand above, followed by R-Square, C.V., Root MSE, the Y mean, and the Type I and Type III sums of squares for DRUG.]
Analysis of Covariance Example
PPD 404

General Linear Models Procedure
Class Level Information

Class    Levels    Values
DRUG          3    A B C

Number of observations in data set = 30

Dependent Variable: POSTTEST

[The numeric columns of this listing did not survive in this copy. The Model line reports SS = 875 with 3 df and the Error line SS = 414 with 26 df, matching the adjusted sums of squares above, with F = 18.32, followed by the Type I and Type III sums of squares for DRUG and PRETEST.]