Download presentation
Presentation is loading. Please wait.
Published byApril McKinney Modified over 9 years ago
1
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 1 Stats 330: Lecture 18
2
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 2 Anova Models These are linear (regression) models where all the explanatory variables are categorical. If there is just one categorical explanatory variable, then we have the “one-way anova” model discussed in STATS 201/8 If there are two categorical explanatory variables, then we have the “two-way anova” model, also discussed in STATS 201/8 However, we shall regard these as just another type of regression model
3
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 3 Example: One way model In an experiment to study the effect of carcinogenic substances, six different substances were applied to cell cultures. The response variable (ratio) is the ratio of damaged to undamaged cells, and the explanatory variable (treatment) is the substance On website – carcinogenic substances data
4
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 4 Data ratio treatment 1 0.08 control 2 0.08 choral hydrate 3 0.10 diazapan 4 0.10 hydroquinone 5 0.07 econidazole 6 0.17 colchicine 7 0.08 control 8 0.10 choral hydrate 9 0.08 diazapan 10 0.10 hydroquinone 11 0.08 econidazole 12 0.19 colchicine 13 0.09 control 14 0.08 choral hydrate 150.12 diazapan... More data
5
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 5 Distributions skewed? boxplot(ratio~treatment, data=cancer.df, ylab = "ratio", main = "Ratios for different substances")
6
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 6 The model where the mean depends on the substance. Thus, We make the usual assumptions about the errors (normal, equal variance, independent etc)
7
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 7 Offset form
8
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 8 Dummy variable form Define: CH= 1 if treatment = Choral Hydrate, 0 else D = 1 if treatment = diazapan, 0 else H = 1 if treatment = hydroquinone, 0 else E = 1 if treatment = econidazole, 0 else C = 1 if treatment = colchicine, 0 else Then
9
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 9 Estimation To estimate the offsets and the baseline (control) mean, we use lm as usual. We have to rearrange the levels to make the control the baseline carcin.df = read.table(file.choose(), header=T) carcin.df$treatment = factor(carcin.df$treatment, levels = c("control", "chloralhydrate", "colchicine", "diazapan", "econidazole", "hydroquinone")) summary(lm(ratio~treatment, data=carcin.df))
10
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 10 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.23660 0.02037 11.616 < 2e-16 *** treatmentchloralhydrate 0.03240 0.02880 1.125 0.26158 treatmentcolchicine 0.21160 0.02880 7.346 2.02e-12 *** treatmentdiazapan 0.04420 0.02880 1.534 0.12599 treatmenteconidazole 0.02820 0.02880 0.979 0.32838 treatmenthydroquinone 0.07540 0.02880 2.618 0.00931 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.144 on 294 degrees of freedom Multiple R-squared: 0.1903, Adjusted R-squared: 0.1766 F-statistic: 13.82 on 5 and 294 DF, p-value: 3.897e-12 lm output
11
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 11 Non- normal? Variances about equal ignore
12
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 12 boxcoxplot(ratio~ treatment, data=carcin.df)
13
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 13 Analyzing ¼ power WB test: previous p=0.00 Current p=0.06 Normality better
14
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 14 Analyzing ¼ power: summary Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.68528 0.01244 55.105 < 2e-16 *** treatmentchloralhydrate 0.01744 0.01759 0.992 0.3222 treatmentcolchicine 0.12120 0.01759 6.891 3.37e-11 *** treatmentdiazapan 0.02993 0.01759 1.702 0.0898. treatmenteconidazole 0.01470 0.01759 0.836 0.4039 treatmenthydroquinone 0.04104 0.01759 2.333 0.0203 * Residual standard error: 0.08793 on 294 degrees of freedom Multiple R-squared: 0.1714, Adjusted R-squared: 0.1573 F-statistic: 12.16 on 5 and 294 DF, p-value: 1.008e-10 Residual standard error: 0.08793 on 294 degrees of freedom Multiple R-Squared: 0.1714, Adjusted R-squared: 0.1573 F-statistic: 12.16 on 5 and 294 DF, p-value: 1.008e-10
15
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 15 Testing equality of means The standard F-test for equality of means is computed using the anova function Here comparing equal means model (Null model) with different means model – only one term in model > quarter.lm <- lm(ratio^(1/4)~treatment, data=carcin.df) > anova(quarter.lm) Analysis of Variance Table Response: ratio^(1/4) Df Sum Sq Mean Sq F value Pr(>F) treatment 5 0.47017 0.09403 12.161 1.008e-10 *** Residuals 294 2.27337 0.00773 Highly significant differences
16
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 16 Oneway plot plot (s20x) > onewayPlot(quarter.lm) Tukey: all cover true values with 95% prob
17
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 17 Two factors: example Experiment to study weight gain in rats –Response is weight gain over a fixed time period –This is modelled as a function of diet (Beef, Cereal, Pork) and amount of feed (High, Low)
18
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 18 Data > diets.df gain source level 1 73 Beef High 2 98 Cereal High 3 94 Pork High 4 90 Beef Low 5 107 Cereal Low 6 49 Pork Low 7 102 Beef High 8 74 Cereal High 9 79 Pork High 10 76 Beef Low... 60 observations in all
19
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 19 Two factors: the model If the (continuous) response depends on two categorical explanatory variables, then we assume that the response is normally distributed with a mean depending on the combination of factor levels: if the factors are A and B, the mean at the i th level of A and the j th level of B is ij Other standard assumptions (equal variance, normality, independence) apply
20
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 20 Diagramatically… Source = Beef Source = Cereal Source = Pork Level =High 11 12 13 Level =Low 21 22 23
21
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 21 Decomposition of the means We usually want to split each “cell mean” up into 4 terms: –A term reflecting the overall baseline level of the response –A term reflecting the effect of factor A (row effect) –A term reflecting the effect of factor B (column effect) –A term reflecting how A and B interact.
22
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 22 Mathematically… Overall Baseline: 11 (mean when both factors are at their baseline levels) Effect of i th level of factor A (row effect): i1 11 The i th level of A, at the baseline of B, expressed as a deviation from the overall baseline) Effect of j th level of factor B (column effect) : 1j - 11 (The j th level of B, at the baseline of A, expressed as a deviation from the overall baseline) Interaction: what’s left over (see next slide)
23
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 23 Interactions Each cell (except the first row and column) has an interaction: Interaction = cell mean - baseline - row effect - column effect cell mean = baseline + row effect + column effect + interaction
24
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 24 Notation Overall baseline: = 11 Main effects of A i = i1 - 11 Main effects of B j = 1j - 11 AB interactions: ij = ij - i1 - 1j + 11 Thus, ij = i + j + ij
25
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 25 Importance of interactions If the interactions are all zero, then the effect of changing levels of A is the same for all levels of B In mathematical terms, ij – i ’ j doesn’t depend on j Equivalently, effect of changing levels of B is the same for all levels of A If interactions are zero, relationship between factors and response is simple
26
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 26 Why are comparisons simple when interactions are zero? Doesn’t depend on i!
27
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 27 Splitting up the mean: rats Cell Means BeefCerealPorkBaseline col High10085.999.5100 Low79.283.978.779.2 Baseline row10085.999.5100 Factors are : level (amount of food) and source (diet) Row effect for Low: 79.2 – 100 = -20.8 Col effect for Cereal: 85.9 - 100 = -14.1 Col effect for Pork: 99.5 - 100 = -0.5 Low-Cereal interaction: 83.9 - 79.2 - 85.9 + 100 = 18.8 Low-Cereal interaction: 78.7 - 79.2 - 99.5 + 100 = 0
28
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 28 Exploratory plots > plot.design(diets.df) More gain on high amount of feed and Beef diet
29
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 29 dotplot(source~gain|level, data=diets.df)
30
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 30 Fit model > diets.lm<-lm(gain~source+level + source:level, data=diets.df) > summary(diets.lm) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.000e+02 4.632e+00 21.589 < 2e-16 sourceCereal -1.410e+01 6.551e+00 -2.152 0.03585 sourcePork -5.000e-01 6.551e+00 -0.076 0.93944 levelLow -2.080e+01 6.551e+00 -3.175 0.00247 sourceCereal:levelLow 1.880e+01 9.264e+00 2.029 0.04736 sourcePork:levelLow -3.052e-14 9.264e+00 -3.29e-15 1.00000 Residual standard error: 14.65 on 54 degrees of freedom Multiple R-Squared: 0.2848, Adjusted R-squared: 0.2185 F-statistic: 4.3 on 5 and 54 DF, p-value: 0.002299
31
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 31 Fitting as a regression model Note that this is equivalent to fitting a regression with dummy variables R2, C2, C3 R2 = 1 if obs is in row 2, zero otherwise C2 = 1 if obs is in column 2, zero otherwise C3 = 1 if obs is in column 3, zero otherwise The regression is Y ~ R2 + C2 + C3 + I(R2*C2) + I(R2*C3)
32
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 32 > R2 = ifelse(diets.df$level=="Low",1,0) > C2 = ifelse(diets.df$source=="Cereal",1,0) > C3 = ifelse(diets.df$source=="Pork",1,0) > reg.lm = lm(gain ~ R2 + C2 + C3 + I(R2*C2) + I(R2*C3), data=diets.df) > summary(reg.lm) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.000e+02 4.632e+00 21.589 < 2e-16 *** C2 -1.410e+01 6.551e+00 -2.152 0.03585 * C3 -5.000e-01 6.551e+00 -0.076 0.93944 R2 -2.080e+01 6.551e+00 -3.175 0.00247 ** I(R2 * C2) 1.880e+01 9.264e+00 2.029 0.04736 * I(R2 * C3) -2.709e-14 9.264e+00 -2.92e-15 1.00000 Regression summary
33
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 33 Testing for zero interactions > anova(diets.lm) Analysis of Variance Table Response: gain Df Sum Sq Mean Sq F value Pr(>F) source 2 266.5 133.3 0.6211 0.5411319 level 1 3168.3 3168.3 14.7666 0.0003224 *** source:level 2 1178.1 589.1 2.7455 0.0731879. Residuals 54 11586.0 214.6 Some evidence of interaction
34
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 34 Interaction plot > interaction.plot(source,level,gain) Non-parallel lines indicate interaction
35
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 35 Do we need Source in the model? > model1<-lm(gain~source*level) # note shorthand > model2<-lm(gain~level) > anova(model2,model1) Analysis of Variance Table Model 1: gain ~ level Model 2: gain ~ source * level Res.Df RSS Df Sum of Sq F Pr(>F) 1 58 13030.7 2 54 11586.0 4 1444.7 1.6833 0.1673 Not significant! No significant effect of Source
36
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 36 Notations: review For two factors A and B Baseline: = 11 A main effect: i = i1 - 11 B main effect: j = 1j - 11 AB interaction: ij = ij - i1 - 1j + 11 Then ij = + i + j + ij
37
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 37 Zero interaction model If we have only one observation per factor level combination, we can’t estimate the interactions and the error variance We have to assume that the interactions are zero and fit an “additive model” gain ~ level + source Can test zero interactions In a reduced form) using the “Tukey one-degree of freedom test” –
38
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 38 Possible models for two factors For two factors A and B possible models are: Y~1 (Fit single mean only) Y~A (cell means depend on A alone) Y~B (Cell means depend on B alone) Y~A+B (Cell means have no interaction) Y~A*B (General model, cell means have no restrictions)
39
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 39 In terms of “effects” General model is Y~A+B+A:B (equivalently Y~A*B) Mathematical form is E(Y ij ) = + i + j + ij Y~1 implies i =0, j =0, ij =0 Y~A implies j =0, ij =0 Y~B implies i =0, ij =0 Y~A+B implies ij =0
40
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 40 Interpreting Anova All F-tests essentially compare a model to a sub- model, using an estimate of 2 in the denominator: The anova function can do this explicitly, as in anova(model1, model2), with the estimate of 2 coming from the bigger model. When we use just 1 argument, as in anova(model1), the models being compared are selected implicitly
41
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 41 Interpreting Anova (cont) For example, consider a model with 2 factors A and B: > anova(lm(y~A+B+A:B)) Analysis of Variance Table Response: y Df Sum Sq Mean Sq F value Pr(>F) A 1 12.774 12.774 9.9147 0.003978 ** B 2 4.031 2.015 1.5642 0.227629 A:B 2 6.898 3.449 2.6768 0.086985. Residuals 27 34.788 1.288 Full-model estimate of 2
42
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 42 First line The first line of the table compares the model y~A with a null model (all means the same), using an estimate of 2 =1.288 from the full model y~A+B+A:B > model1<-lm(y~A) > model0<-lm(y~1) > anova(model0,model1) Analysis of Variance Table Model 1: y ~ 1 Model 2: y ~ A Res.Df RSS Df Sum of Sq F Pr(>F) 1 32 58.491 2 31 45.716 1 12.774 8.6623 0.006105 ** Difference in Numerator of F test
43
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 43 Second line The second line of the table compares the “no interaction” model y~A+B with the model y~A, using an estimate of 2 from the full model y~A+B+A:B > model2<-lm(y~A+B) > model1<-lm(y~A) > anova(model1,model2) Analysis of Variance Table Model 1: y ~ A Model 2: y ~ A + B Res.Df RSS Df Sum of Sq F Pr(>F) 1 31 45.716 2 29 41.685 2 4.031 1.4021 0.2623 Difference in Numerator of F test in line 2
44
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 44 Third line The third line of the table compares full model y~A+B+A:B with the “no interaction” model y~A+B, using an estimate of 2 from the full model > model2<-lm(y~A+B) > model3<-lm(y~A+B+A:B) > anova(model2,model3) Analysis of Variance Table Model 1: y ~ A + B Model 2: y ~ A + B + A:B Res.Df RSS Df Sum of Sq F Pr(>F) 1 29 41.685 2 27 34.788 2 6.898 2.6768 0.08699. Difference in Numerator of F test in line 3
45
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 45 To summarise: Terms are added line by line The F-test compares the current model with the previous model At each stage, the estimate of 2 is obtained from the full model.
46
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 46 More than two factors: example An experiment was conducted to compare different diets for feeding chickens. The diets depended on 3 variables: –Source of Protein (variable protein) : either “groundnut” or “soybean” –Level of protein (variable protlevel): either 0, 1 or 2 –Level of fish solubles (variable fish) :either high or low Response variable was weight gain (variable chickweight)
47
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 47 data chickweight protein protlevel fish 1 6559 groundnut 0 Low 2 7075 groundnut 0 High 3 6564 groundnut 1 Low 4 7528 groundnut 1 High 5 6738 groundnut 2 Low 6 7333 groundnut 2 High 7 7094 soybean 0 Low 8 8005 soybean 0 High 9 6943 soybean 1 Low 10 7359 soybean 1 High 11 6748 soybean 2 Low 12 6764 soybean 2 High... 24 observations in all
48
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 48 Data characteristics There are 3 factors –protein with 2 levels (groundnut, soybean) –protlevel with 3 levels (0,1,2) –fish with 2 levels high, low There are 2 x 3 x 2 = 12 factor level combinations, so 12 means Each combination is observed twice, so 24 observations in all
49
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 49 Interactions Let ijk be the population mean of all observations taken at level i of protein, level j of protlevel and level k of fish We can split this mean up into 8 terms: An overall baseline = 111 3 “main effects” e.g. i = i1 1 - 111 3 “two-way interactions” e.g. ij ij i 1j1 A “3-way interaction” ijk ijk i 1j1 11k ij 1jk i1k
50
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 50 Interactions (cont) Then ijk i j k ij jk ik ijk As before, if any one of the subscripts i, j, k is 1 then the corresponding interaction is zero. Interpretation: e.g. if the protlevel x fish and the 3-way interactions are all zero, then the effect of changing levels of fish is the same for all levels of protlevel.
51
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 51 Why? Doesn’t depend on j!
52
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 52 Estimating terms > model1<-lm(chickweight~protein*protlevel*fish, data=chickwts.df) > summary(model1) Estimate Std. Error t value Pr(>|t|) (Intercept) 6927.0 223.5 30.992 8e-13 proteinsoybean 904.0 316.1 2.860 0.0144 protlevel1 266.5 316.1 0.843 0.4156 protlevel2 -80.0 316.1 -0.253 0.8045 fishLow -501.5 316.1 -1.587 0.1386 proteinsoybean:protlevel1 -772.0 447.0 -1.727 0.1098 proteinsoybean:protlevel2 -1089.0 447.0 -2.436 0.0314 proteinsoybean:fishLow -256.0 447.0 -0.573 0.5774 protlevel1:fishLow -99.0 447.0 -0.221 0.8285 protlevel2:fishLow 245.5 447.0 0.549 0.5929 proteinsoybean:protlevel1:fishLow 127.0 632.2 0.201 0.8441 proteinsoybean:protlevel2:fishLow 435.0 632.2 0.688 0.5045 Residual standard error: 316.1 on 12 degrees of freedom Multiple R-Squared: 0.7531, Adjusted R-squared: 0.5269 F-statistic: 3.328 on 11 and 12 DF, p-value: 0.02482
53
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 53 Anova for the chick weights > anova(model1) Analysis of Variance Table Response: chickweight Df Sum Sq Mean Sq F value Pr(>F) protein 1 373003 373003 3.7334 0.077286. protlevel 2 636519 318260 3.1854 0.077679. fish 1 1423014 1423014 14.2429 0.002653 ** protein:protlevel 2 858702 429351 4.2974 0.039134 * protein:fish 1 7073 7073 0.0708 0.794706 protlevel:fish 2 309421 154710 1.5485 0.252201 protein:protlevel:fish 2 50036 25018 0.2504 0.782453 Residuals 12 1198926 99911 --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 Suggests model protein*protlevel + fish
54
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 54 Check > anova(lm(chickweight~protein*protlevel + fish), lm(chickweight~protein*protlevel*fish)) Analysis of Variance Table Model 1: chickweight ~ protein * protlevel + fish Model 2: chickweight ~ protein * protlevel * fish Res.Df RSS Df Sum of Sq F Pr(>F) 1 17 1565456 2 12 1198926 5 366530 0.7337 0.6121 Not significant, but interpret with caution Effect of fish the same for each protein/protlevel combination
55
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 55 Interpretation of interactions If a factor (say A) does not interact with the others, the effect of changing levelsof A is the same for all levels of the other factors If the 3 way interactions are zero, then the interaction between A and B is the same for all levels of C
56
© Department of Statistics 2012 STATS 330 Lecture 18 Slide 56 Summary Anova models are interpreted just like regressions, except –No question of planarity (linear by definition ) –Need to interpret interactions –Judge effect of factors by anova –Factors in anova added one at a time –Suitable for completely randomised experiments where it is reasonable to assume observations are independent
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.