Lecture 3: Linear Models II
Olivier MISSA, Advanced Research Skills
2 Outline
"Refresher" on different types of model:
Two-way Anova
Ancova
3 Two-Way Anova
Used when your observations are categorized according to two different factors.
Example: Blood calcium concentration in a population of birds. 10 males and 10 females were randomly split into 2 groups and either given a hormonal treatment or not.
> dataset <- read.table("2wayAnova.csv", header=TRUE, sep=",")
> attach(dataset)
> names(dataset)
[1] "plasma"  "sex"     "hormone"
> tapply(plasma, list(sex, hormone), length)
       no yes
female  5   5
male    5   5
Balanced design: equal replication among groupings.
4 Two-Way Anova
> stripchart(plasma ~ hormone, col=c("orange","blue"))
> h.ave <- tapply(plasma, hormone, mean)
> h.ave
 no yes
> stripchart(plasma ~ sex, col=c("red","green"))
> s.ave <- tapply(plasma, sex, mean)
> s.ave
female   male
5 Two-Way Anova
> h.ave
 no yes
> gd.ave <- mean(plasma); gd.ave
[1]
> summary(aov(plasma ~ hormone))
            Df Sum Sq Mean Sq F value Pr(>F)
hormone                                 e-07 ***
Residuals
> SS.h <- sum(10*(h.ave - gd.ave)^2)  ## sum of squares due to hormone treatment (10 birds per level)
> SS.h
[1]
> TSS <- sum((plasma-gd.ave)^2)       ## total sum of squares
> TSS
[1]
6 Two-Way Anova
> s.ave
female   male
> gd.ave
[1]
> summary(aov(plasma ~ sex))
            Df Sum Sq Mean Sq F value Pr(>F)
sex
Residuals
> SS.s <- sum(10*(s.ave - gd.ave)^2)  ## sum of squares due to sex (10 birds per level)
> SS.s
[1]
> TSS <- sum((plasma-gd.ave)^2)       ## total sum of squares
> TSS
[1]
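The by-hand sums of squares on the last two slides can be checked against `aov()`. The sketch below is only an illustration: it uses simulated data (the data frame `sim` and all effect sizes are hypothetical, not the lecture's 2wayAnova.csv values) and assumes the same balanced layout of 5 birds per sex-by-hormone cell.

```r
## Hypothetical simulated stand-in for the lecture data (values are made up)
set.seed(1)
sim <- expand.grid(rep = 1:5,
                   sex = c("female", "male"),
                   hormone = c("no", "yes"))   # factors by default
sim$plasma <- 15 + 10 * (sim$hormone == "yes") +
              2 * (sim$sex == "male") + rnorm(20, sd = 4)

gd.ave <- mean(sim$plasma)                          # grand mean
h.ave  <- tapply(sim$plasma, sim$hormone, mean)     # mean per hormone level
n.h    <- tapply(sim$plasma, sim$hormone, length)   # 10 observations per level

SS.h <- sum(n.h * (h.ave - gd.ave)^2)    # sum of squares due to hormone
TSS  <- sum((sim$plasma - gd.ave)^2)     # total sum of squares

## Same quantities taken from the one-way Anova table
tab      <- summary(aov(plasma ~ hormone, data = sim))[[1]]
SS.h.aov <- tab[["Sum Sq"]][1]           # first row is the hormone term

all.equal(SS.h, SS.h.aov)                # by-hand SS matches aov()
all.equal(TSS, sum(tab[["Sum Sq"]]))     # hormone SS + residual SS = TSS
```

The same steps with `sex` in place of `hormone` give the sum of squares due to sex.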
7 Two-Way Anova
> summary(aov(plasma ~ hormone))
            Df Sum Sq Mean Sq F value Pr(>F)
hormone                                 e-07 ***
Residuals
> summary(aov(plasma ~ sex))
            Df Sum Sq Mean Sq F value Pr(>F)
sex
Residuals
> summary(aov(plasma ~ hormone + sex))
            Df Sum Sq Mean Sq F value Pr(>F)
hormone                                 e-07 ***
sex
Residuals
Analysing the two factors at the same time can make a big difference to the outcome.
8 Two-Way Anova
Including an interaction term is important too.
> summary(aov(plasma ~ hormone + sex))
            Df Sum Sq Mean Sq F value Pr(>F)
hormone                                 e-07 ***
sex
Residuals
> summary(aov(plasma ~ hormone * sex))
            Df Sum Sq Mean Sq F value Pr(>F)
hormone                                 e-07 ***
sex
hormone:sex
Residuals
9 Two-Way Anova
What is the meaning of this interaction term?
It assesses whether the effects of the two factors depend on each other.
> gd.ave
[1]
> h.diff <- h.ave - gd.ave; h.diff
 no yes
> s.diff <- s.ave - gd.ave; s.diff
female   male
> predicted <- predict(aov(plasma~hormone+sex))  ## fitted values of the additive model
10 Two-Way Anova
What is the meaning of the interaction term?
It assesses whether the effects of the two factors depend on each other.
> pred.ave <- tapply(predicted, list(sex, hormone), mean)
> pred.ave
       no yes
female
male
> averages <- tapply(plasma, list(sex, hormone), mean)
> averages
       no yes
female
male
> sum((averages-pred.ave)^2)*5  ## SS due to interaction (5 birds per cell)
[1]
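The calculation above says that, in a balanced design, the interaction sum of squares is the squared gap between the observed cell means and the cell means predicted by the additive model, multiplied by the number of observations per cell. A minimal sketch on simulated data (the data frame `sim` and its effect sizes are hypothetical, not the lecture's values) checks this against the Anova table.

```r
## Hypothetical simulated data with a built-in sex-by-hormone interaction
set.seed(2)
sim <- expand.grid(rep = 1:5,
                   sex = c("female", "male"),
                   hormone = c("no", "yes"))
sim$plasma <- 15 + 10 * (sim$hormone == "yes") + 2 * (sim$sex == "male") +
              4 * (sim$hormone == "yes" & sim$sex == "male") +
              rnorm(20, sd = 3)

## Cell means predicted by the additive model vs observed cell means
additive <- aov(plasma ~ hormone + sex, data = sim)
pred.ave <- tapply(predict(additive), list(sim$sex, sim$hormone), mean)
averages <- tapply(sim$plasma,        list(sim$sex, sim$hormone), mean)

SS.int <- 5 * sum((averages - pred.ave)^2)   # 5 observations per cell

## Interaction row of the full two-way Anova table
tab        <- summary(aov(plasma ~ hormone * sex, data = sim))[[1]]
SS.int.aov <- tab[["Sum Sq"]][3]             # third row is hormone:sex

all.equal(SS.int, SS.int.aov)
```

This shortcut relies on the design being balanced; with unequal cell sizes the multiplication by a single cell count no longer applies.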
11 Two-Way Anova
Graphical representations
Observed averages
> averages
       no yes
female
male
> barplot(averages, beside=T, col=c("red","green"),...)
Predicted averages
> pred.ave
       no yes
female
male
> barplot(pred.ave, beside=T, col=c("red2","green3"),...)
12 Two-Way Anova
Graphical representations
> interaction.plot(hormone, sex, plasma, col=c("red","green2"),... )
> interaction.plot(sex, hormone, plasma, col=c("orange","blue"),...)
13 Two-Way Anova as a linear model
> mod <- lm(plasma ~ sex + hormone)
> summary(mod)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                 e-07 ***
sexmale
hormoneyes                                  e-07 ***
---
Residual standard error:  on 17 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic:  on 2 and 17 DF, p-value: 1.307e-06
> pred.ave
       no yes
female
male
14 Two-Way Anova as a linear model
> mod2 <- lm(plasma ~ sex * hormone)
> summary(mod2)
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)                                        e-06 ***
sexmale
hormoneyes                                         e-05 ***
sexmale:hormoneyes
Residual standard error:  on 16 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic:  on 3 and 16 DF, p-value: 7.89e-06
> averages
       no yes
female
male
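Showing `averages` next to `summary(mod2)` suggests how the coefficients relate to the cell means. With R's default treatment contrasts, the intercept is the mean of the reference cell (female, no hormone) and the other coefficients are differences from it. The sketch below illustrates this on simulated data (`sim` and its values are hypothetical).

```r
## Hypothetical simulated data, 5 birds per cell
set.seed(3)
sim <- expand.grid(rep = 1:5,
                   sex = c("female", "male"),
                   hormone = c("no", "yes"))
sim$plasma <- 15 + 10 * (sim$hormone == "yes") + 2 * (sim$sex == "male") +
              4 * (sim$hormone == "yes" & sim$sex == "male") +
              rnorm(20, sd = 3)

mod2     <- lm(plasma ~ sex * hormone, data = sim)
b        <- coef(mod2)
averages <- tapply(sim$plasma, list(sim$sex, sim$hormone), mean)

## Rebuild the four cell means from the four coefficients
rebuilt <- rbind(
  female = c(no  = b[["(Intercept)"]],
             yes = b[["(Intercept)"]] + b[["hormoneyes"]]),
  male   = c(no  = b[["(Intercept)"]] + b[["sexmale"]],
             yes = sum(b))               # all four coefficients together
)

all.equal(rebuilt, averages, check.attributes = FALSE)
```

In the additive model `plasma ~ sex + hormone` the same arithmetic reproduces `pred.ave` instead, because that model has no interaction coefficient to absorb the remaining gap.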
15 Are the assumptions met?
1: Is the response variable continuous?
Answer: YES!
2: Are the residuals normally distributed?
> shapiro.test(mod$residuals)
        Shapiro-Wilk normality test
data:  mod$residuals
W = , p-value =
> plot(mod, which=2)  ## qqplot of std.residuals
Answer: YES!
16 Are the assumptions met?
3a: Are the residuals independent and identically distributed?
> plot(mod, which=1)  ## residuals vs fitted values
> plot(mod, which=3)  ## sqrt(abs(standardized residuals)) vs fitted values
Answer: Females treated with the hormone seem to vary more in their response.
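The graphical checks above can be complemented with a couple of numbers. The following is a rough sketch on simulated data (`sim` is hypothetical and has equal variance in every cell by construction, so no group should stand out here): it runs the Shapiro-Wilk test on the residuals and tabulates the residual spread per sex-by-hormone cell, which is one way to look for a group that varies more than the others.

```r
## Hypothetical simulated data, 5 birds per cell, equal variance by construction
set.seed(5)
sim <- expand.grid(rep = 1:5,
                   sex = c("female", "male"),
                   hormone = c("no", "yes"))
sim$plasma <- 15 + 10 * (sim$hormone == "yes") + 2 * (sim$sex == "male") +
              rnorm(20, sd = 4)

mod <- lm(plasma ~ sex + hormone, data = sim)
res <- residuals(mod)

sw <- shapiro.test(res)     # normality of the residuals
sw$p.value                  # a large p-value gives no evidence against normality

## Standard deviation of the residuals within each cell
res.sd <- tapply(res, list(sim$sex, sim$hormone), sd)
res.sd
```

With only 5 observations per cell these standard deviations are noisy, so they are best read alongside the residual plots rather than instead of them.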
17 Ancova
Used when one predictor variable is continuous or discrete, and the other predictor variable is categorical.
Example: Fruit production in a biennial plant. 40 plants were allocated to two treatments, grazed and ungrazed. The grazed plants were exposed to rabbits during the first two weeks of stem elongation, then allowed to regrow protected by a fence.
> dataset2 <- read.table("ipomopsis.txt", header=TRUE, sep="\t")
> attach(dataset2)
> names(dataset2)
[1] "Root"    "Fruit"   "Grazing"
> str(dataset2)
'data.frame':   40 obs. of  3 variables:
 $ Root   : num
 $ Fruit  : num
 $ Grazing: Factor w/ 2 levels "Grazed","Ungrazed":
18 Ancova
> summary(dataset2)
      Root            Fruit           Grazing
 Min.   : 4.43   Min.   : 14.7   Grazed  :20
 1st Qu.:        1st Qu.: 41.1   Ungrazed:20
 Median : 7.12   Median : 60.9
 Mean   : 7.18   Mean   :
 3rd Qu.:        3rd Qu.: 76.2
 Max.   :10.25   Max.   :116.0
> stripchart(Fruit ~ Grazing, col=c("blue","green2"),...)
> summary( lm(Fruit ~ Grazing) )
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)                                     e-15 ***
GrazingUngrazed                                      *
---
Residual standard error:  on 38 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic:  on 1 and 38 DF, p-value:
19 Ancova
> plot(Fruit ~ Root, col=c("blue","green2")[as.numeric(Grazing)],...)
## a few graphical functions omitted
Figure annotations: slope, slope 24.00
> summary( lm(Fruit ~ Root * Grazing) )
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)                                          e-11 ***
Root                                              < 2e-16 ***
GrazingUngrazed
Root:GrazingUngrazed
Residual standard error:  on 36 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic:  on 3 and 36 DF, p-value: < 2.2e-16
20 Ancova
> anova( lm(Fruit ~ Root * Grazing) )
Analysis of Variance Table
Response: Fruit
             Df Sum Sq Mean Sq F value Pr(>F)
Root                                  < 2.2e-16 ***
Grazing                                    e-12 ***
Root:Grazing
Residuals
> anova( lm(Fruit ~ Grazing * Root) )
Analysis of Variance Table
Response: Fruit
             Df Sum Sq Mean Sq F value Pr(>F)
Grazing                                    e-09 ***
Root                                  < 2.2e-16 ***
Grazing:Root
Residuals
The order of the variables matters in the F table: anova() uses sequential sums of squares, so each term is tested after allowing only for the terms listed before it.
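The order dependence of the F table can be reproduced whenever the predictors are not orthogonal. The sketch below uses simulated Ancova-style data (`sim2` and every coefficient in it are hypothetical, not the ipomopsis values) in which the two grazing groups differ in average root size, so that `Root` and `Grazing` share some of the explained variation.

```r
## Hypothetical simulated data: grazed plants tend to have larger roots
set.seed(4)
n       <- 20
Grazing <- factor(rep(c("Grazed", "Ungrazed"), each = n))
Root    <- c(rnorm(n, mean = 8, sd = 1), rnorm(n, mean = 6.5, sd = 1))
Fruit   <- -125 + 24 * Root + 36 * (Grazing == "Ungrazed") +
           rnorm(2 * n, sd = 6)
sim2    <- data.frame(Root, Fruit, Grazing)

a1 <- anova(lm(Fruit ~ Root + Grazing, data = sim2))  # Root entered first
a2 <- anova(lm(Fruit ~ Grazing + Root, data = sim2))  # Grazing entered first

## Sequential SS for Root depends on whether Grazing was fitted before it
c(Root.first = a1["Root", "Sum Sq"], Root.second = a2["Root", "Sum Sq"])

## The residual SS and the total SS do not depend on the order
all.equal(a1["Residuals", "Sum Sq"], a2["Residuals", "Sum Sq"])
all.equal(sum(a1[["Sum Sq"]]), sum(a2[["Sum Sq"]]))
```

Only the way the explained variation is split between the terms changes; the fitted model itself is identical in both orders.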
21 Ancova
A safer test of significance: dropping each term from the model in turn.
> drop1( lm(Fruit ~ Root * Grazing), test="F" )
Single term deletions
Model: Fruit ~ Root * Grazing
             Df Sum of Sq RSS AIC F value Pr(F)
Root:Grazing
> drop1( lm(Fruit ~ Root + Grazing), test="F" )
Single term deletions
Model: Fruit ~ Root + Grazing
        Df Sum of Sq RSS AIC F value Pr(F)
Root                                  < 2.2e-16 ***
Grazing                                    e-13 ***
AIC: Akaike Information Criterion
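Each row of a `drop1()` table is an F test of the full model against the model without that one term, so it does not depend on the order in which the terms were written. A small sketch on simulated data (`sim2` and its coefficients are hypothetical) compares one `drop1()` row with the explicit nested-model comparison.

```r
## Hypothetical simulated Ancova-style data
set.seed(6)
n       <- 20
Grazing <- factor(rep(c("Grazed", "Ungrazed"), each = n))
Root    <- c(rnorm(n, mean = 8, sd = 1), rnorm(n, mean = 6.5, sd = 1))
Fruit   <- -125 + 24 * Root + 36 * (Grazing == "Ungrazed") +
           rnorm(2 * n, sd = 6)
sim2    <- data.frame(Root, Fruit, Grazing)

full <- lm(Fruit ~ Root + Grazing, data = sim2)
d    <- drop1(full, test = "F")      # single term deletions

## Explicit comparison: model without Grazing vs full model
cmp  <- anova(lm(Fruit ~ Root, data = sim2), full)

F.drop1  <- d["Grazing", "F value"]
F.nested <- cmp[2, "F"]
all.equal(F.drop1, F.nested)

## The same F appears in the sequential table only for the last term entered
F.seq.last <- anova(full)["Grazing", "F value"]
all.equal(F.drop1, F.seq.last)
```

For `Root`, which is entered first, the sequential table and `drop1()` will generally disagree, which is the situation the slide warns about.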
22 Ancova
> mod3 <- lm(Fruit ~ Root + Grazing)
> summary(mod3)
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)                                     e-15 ***
Root                                         < 2e-16 ***
GrazingUngrazed                                 e-13 ***
---
Residual standard error:  on 37 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic:  on 2 and 37 DF, p-value: < 2.2e-16
> anova(mod3)
Analysis of Variance Table
Response: Fruit
          Df Sum Sq Mean Sq F value Pr(>F)
Root                               < 2.2e-16 ***
Grazing                                 e-13 ***
Residuals
After simplifying the model, the p-values in the summary table agree with those in the F table.
23 Are the assumptions met?
1: Is the response variable continuous?
Answer: YES!
2: Are the residuals normally distributed?
> shapiro.test(mod3$residuals)
        Shapiro-Wilk normality test
data:  mod3$residuals
W = , p-value =
> plot(mod3, which=2)  ## qqplot of std.residuals
Answer: YES!
24 Are the assumptions met?
3a: Are the residuals independent and identically distributed?
> plot(mod3, which=1)  ## residuals vs fitted values
> plot(mod3, which=3)  ## sqrt(abs(standardized residuals)) vs fitted values
Answer: Looks OK.
25 Any major influential points?
> plot(mod3, which=5)  ## residuals vs leverage
Answer: No!
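The residuals-vs-leverage plot can be backed up by the quantities it is built from. The sketch below uses simulated data (`sim2` is hypothetical) and two common rules of thumb as cut-offs; both thresholds are assumptions made for illustration, not values taken from the lecture.

```r
## Hypothetical simulated Ancova-style data
set.seed(7)
n       <- 20
Grazing <- factor(rep(c("Grazed", "Ungrazed"), each = n))
Root    <- c(rnorm(n, mean = 8, sd = 1), rnorm(n, mean = 6.5, sd = 1))
Fruit   <- -125 + 24 * Root + 36 * (Grazing == "Ungrazed") +
           rnorm(2 * n, sd = 6)
sim2    <- data.frame(Root, Fruit, Grazing)

mod3 <- lm(Fruit ~ Root + Grazing, data = sim2)

lev  <- hatvalues(mod3)        # leverage of each observation
cook <- cooks.distance(mod3)   # Cook's distance of each observation
p    <- length(coef(mod3))     # number of estimated coefficients (3)

## Rule-of-thumb flags (assumed cut-offs): Cook's distance above 0.5,
## or leverage above twice the average leverage p/n
flagged <- which(cook > 0.5 | lev > 2 * p / nrow(sim2))
flagged
```

An empty or very short `flagged` vector is consistent with a plot in which no point falls beyond the Cook's distance contours.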