Lecture 3 Linear Models II Olivier MISSA, Advanced Research Skills.



2 Outline
"Refresher" on different types of model:
- Two-way Anova
- Ancova

3 Two-Way Anova
Used when your observations are categorized according to two different factors.
Example: blood calcium concentration in a population of birds; 10 males and 10 females, randomly split into 2 groups and either given a hormonal treatment or not.

> dataset <- read.table("2wayAnova.csv", header=T, sep=",")
> attach(dataset)
> names(dataset)
[1] "plasma"  "sex"     "hormone"
> tapply(plasma, list(sex, hormone), length)
       no yes
female  5   5
male    5   5

This is a balanced design: equal replication in every grouping.
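The balanced layout that tapply reports can be sketched outside R. A minimal Python sketch (the labels are a hypothetical recreation of the slide's design, not the actual dataset) counts replicates per (sex, hormone) cell:

```python
from collections import Counter

# Hypothetical recreation of the slide's design: 10 females and 10 males,
# each sex split 5/5 between the no-hormone and hormone groups.
sex = ["female"] * 10 + ["male"] * 10
hormone = (["no"] * 5 + ["yes"] * 5) * 2

# Count replicates per (sex, hormone) cell, like tapply(..., length) in R.
cell_counts = Counter(zip(sex, hormone))
# Every one of the 4 cells holds 5 replicates -> a balanced design.
```

A design is balanced exactly when all cell counts are equal; balance is what lets the sums of squares on the next slides be computed cell by cell.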

4 Two-Way Anova
> stripchart(plasma ~ hormone,
+   col=c("orange","blue"))
> h.ave <- tapply(plasma, hormone, mean)
> h.ave
    no   yes

> stripchart(plasma ~ sex,
+   col=c("red","green"))
> s.ave <- tapply(plasma, sex, mean)
> s.ave
female   male

5 Two-Way Anova
> h.ave
    no   yes
> gd.ave <- mean(plasma); gd.ave
[1]
> summary(aov(plasma ~ hormone))
            Df Sum Sq Mean Sq F value Pr(>F)
hormone                                 e-07 ***
Residuals
> SS.h <- 10*(        )^2 +   ## sum of squares due to
+         10*(        )^2     ## hormone treatment
[1]
> TSS <- sum((plasma - gd.ave)^2)
[1]
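The hand computation of SS.h generalizes: the between-group sum of squares for a factor is the sum over groups of n_g times the squared gap between the group mean and the grand mean. A Python sketch of that arithmetic (illustrative toy numbers, not the lecture's plasma values):

```python
from statistics import mean

def factor_ss(values, labels):
    """Between-group sum of squares for one factor:
    sum over groups of n_g * (group mean - grand mean)^2."""
    grand = mean(values)
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    return sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())

# Toy data: two groups of 3 with means 2 and 6; the grand mean is 4.
values = [1, 2, 3, 5, 6, 7]
labels = ["no", "no", "no", "yes", "yes", "yes"]
ss_factor = factor_ss(values, labels)               # 3*(2-4)^2 + 3*(6-4)^2 = 24
tss = sum((v - mean(values)) ** 2 for v in values)  # total sum of squares = 28
```

The residual sum of squares is then TSS minus the factor's sum of squares, which is exactly the decomposition the aov table reports.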

6 Two-Way Anova
> s.ave
female   male
> gd.ave
[1]
> summary(aov(plasma ~ sex))
            Df Sum Sq Mean Sq F value Pr(>F)
sex
Residuals
> SS.s <- 10*(        )^2 +   ## sum of squares due to sex
+         10*(        )^2
[1]
> TSS <- sum((plasma - gd.ave)^2)
[1]

7 Two-Way Anova
> summary(aov(plasma ~ hormone))
            Df Sum Sq Mean Sq F value Pr(>F)
hormone                                 e-07 ***
Residuals
> summary(aov(plasma ~ sex))
            Df Sum Sq Mean Sq F value Pr(>F)
sex
Residuals
> summary(aov(plasma ~ hormone + sex))
            Df Sum Sq Mean Sq F value Pr(>F)
hormone                                 e-07 ***
sex
Residuals

Analysing the two factors at the same time can make a big difference to the outcome.

8 Two-Way Anova
Including an interaction term is important too.
> summary(aov(plasma ~ hormone + sex))
            Df Sum Sq Mean Sq F value Pr(>F)
hormone                                 e-07 ***
sex
Residuals
> summary(aov(plasma ~ hormone * sex))
            Df Sum Sq Mean Sq F value Pr(>F)
hormone                                 e-07 ***
sex
hormone:sex
Residuals

9 Two-Way Anova
What is the meaning of this interaction term? It assesses whether the effects of the two factors are mutually interdependent.
> gd.ave
[1]
> h.diff <- h.ave - gd.ave
    no   yes
> s.diff <- s.ave - gd.ave
female   male
> predicted <- predict(aov(plasma ~ hormone + sex))

10 Two-Way Anova
What is the meaning of the interaction term? It assesses whether the effects of the two factors are mutually interdependent.
> pred.ave <- tapply(predicted, list(sex, hormone), mean)
> pred.ave
       no  yes
female
male
> averages <- tapply(plasma, list(sex, hormone), mean)
> averages
       no  yes
female
male
> sum((averages - pred.ave)^2)*5
[1]          ## SS due to interaction
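The quantity sum((averages - pred.ave)^2)*5 is the interaction sum of squares: the squared gaps between observed cell means and the additive model's predictions, scaled by the 5 replicates per cell. A Python sketch on a made-up balanced 2x2 layout (cell values chosen for easy arithmetic, not the plasma data):

```python
from statistics import mean

# Made-up balanced 2x2 data: 2 replicates per (sex, hormone) cell.
data = {
    ("female", "no"):  [1, 3],   # cell mean 2
    ("female", "yes"): [5, 7],   # cell mean 6
    ("male",   "no"):  [2, 4],   # cell mean 3
    ("male",   "yes"): [4, 6],   # cell mean 5
}

cell_mean = {k: mean(v) for k, v in data.items()}
grand = mean(cell_mean.values())   # with a balanced design this equals the grand mean
row = {s: mean(cell_mean[(s, h)] for h in ("no", "yes")) for s in ("female", "male")}
col = {h: mean(cell_mean[(s, h)] for s in ("female", "male")) for h in ("no", "yes")}

# Additive prediction for each cell: grand mean + row effect + column effect.
additive = {(s, h): grand + (row[s] - grand) + (col[h] - grand)
            for (s, h) in cell_mean}

n_per_cell = 2
ss_interaction = n_per_cell * sum((cell_mean[k] - additive[k]) ** 2 for k in cell_mean)
```

When the cell means are exactly the additive predictions, this sum is zero and the factors act independently; any systematic departure inflates it.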

11 Two-Way Anova: graphical representations
Observed averages:
> averages
       no  yes
female
male
> barplot(averages, beside=T, col=c("red","green"), ...)
Predicted averages:
> pred.ave
       no  yes
female
male
> barplot(pred.ave, beside=T, col=c("red2","green3"), ...)

12 Two-Way Anova: graphical representations
> interaction.plot(hormone, sex, plasma, col=c("red","green2"), ...)
> interaction.plot(sex, hormone, plasma, col=c("orange","blue"), ...)

13 Two-Way Anova as a linear model
> mod <- lm(plasma ~ sex + hormone)
> summary(mod)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                               e-07 ***
sexmale
hormoneyes                                e-07 ***
---
Residual standard error:        on 17 degrees of freedom
Multiple R-squared:        , Adjusted R-squared:
F-statistic:        on 2 and 17 DF, p-value: 1.307e-06
> pred.ave
       no  yes
female
male
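R's lm uses treatment contrasts by default: the intercept is the baseline cell (here female, no hormone) and each remaining coefficient is an offset from that baseline. The additive model's fitted cell means (pred.ave) can be rebuilt directly from the coefficients. A sketch with hypothetical coefficient values (the slide's actual numbers were not preserved in the transcript):

```python
# Hypothetical coefficients for lm(plasma ~ sex + hormone) under treatment
# contrasts; the baseline level is sex = "female", hormone = "no".
b_intercept, b_sexmale, b_hormoneyes = 10.0, 1.5, 4.0

def fitted_cell_mean(sex, hormone):
    """Additive model's prediction for one (sex, hormone) cell."""
    return (b_intercept
            + (b_sexmale if sex == "male" else 0.0)
            + (b_hormoneyes if hormone == "yes" else 0.0))

# Rebuild the 2x2 table of fitted cell means, like pred.ave on the slide.
pred_ave = {(s, h): fitted_cell_mean(s, h)
            for s in ("female", "male") for h in ("no", "yes")}
```

Because the model is additive, the male/female gap is the same at both hormone levels; that constraint is exactly what the interaction term on the next slide relaxes.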

14 Two-Way Anova as a linear model
> mod2 <- lm(plasma ~ sex * hormone)
> summary(mod2)
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)                                      e-06 ***
sexmale
hormoneyes                                       e-05 ***
sexmale:hormoneyes
Residual standard error:        on 16 degrees of freedom
Multiple R-squared:        , Adjusted R-squared:
F-statistic:        on 3 and 16 DF, p-value: 7.89e-06
> averages
       no  yes
female
male

15 Are the assumptions met?
1: Is the response variable continuous? Answer: YES!
2: Are the residuals normally distributed? Answer: YES!
> shapiro.test(mod$residuals)
        Shapiro-Wilk normality test
data:  mod$residuals
W =        , p-value =
> plot(mod, which=2)   ## qqplot of std. residuals

16 Are the assumptions met?
3a: Are the residuals independent and identically distributed?
> plot(mod, which=1)   ## residuals vs fitted values
> plot(mod, which=3)   ## sqrt(abs(standardized residuals)) vs fitted values
Answer: Females treated with the hormone seem to vary more in their response.

17 Ancova
Used when one predictor variable is continuous or discrete, and the other predictor variable is categorical.
Example: fruit production in a biennial plant; 40 plants were allocated to two treatments, grazed and ungrazed. The grazed plants were exposed to rabbits during the first two weeks of stem elongation, then allowed to regrow protected by a fence.
> dataset2 <- read.table("ipomopsis.txt", header=T, sep="\t")
> attach(dataset2)
> names(dataset2)
[1] "Root"    "Fruit"   "Grazing"
> str(dataset2)
'data.frame':   40 obs. of  3 variables:
 $ Root   : num
 $ Fruit  : num
 $ Grazing: Factor w/ 2 levels "Grazed","Ungrazed":

18 Ancova
> summary(dataset2)
      Root            Fruit            Grazing
 Min.   : 4.43   Min.   : 14.7   Grazed  :20
 1st Qu.:        1st Qu.: 41.1   Ungrazed:20
 Median : 7.12   Median : 60.9
 Mean   : 7.18   Mean   :
 3rd Qu.:        3rd Qu.: 76.2
 Max.   :10.25   Max.   :116.0
> stripchart(Fruit ~ Grazing, col=c("blue","green2"), ...)
> summary( lm(Fruit ~ Grazing) )
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)                                   e-15 ***
GrazingUngrazed                                      *
---
Residual standard error:        on 38 degrees of freedom
Multiple R-squared:        , Adjusted R-squared:
F-statistic:        on 1 and 38 DF, p-value:

19 Ancova
> plot(Fruit ~ Root, col=c("blue","green2")[as.numeric(Grazing)], ...)
  ## a few graphical functions omitted
> summary( lm(Fruit ~ Root * Grazing) )
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)                                        e-11 ***
Root                                             < 2e-16 ***
GrazingUngrazed
Root:GrazingUngrazed
Residual standard error:        on 36 degrees of freedom
Multiple R-squared:        , Adjusted R-squared:
F-statistic:        on 3 and 36 DF, p-value: < 2.2e-16
[Plot annotation: the slope of each fitted line; one slope is 24.00]
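Fitting Fruit ~ Root * Grazing allows each Grazing group its own intercept and slope; with only one grouping factor, that is numerically the same as fitting an ordinary least-squares line within each group. A Python sketch of the per-group fit (toy numbers, not the ipomopsis data):

```python
from statistics import mean

def ols_line(x, y):
    """Least-squares intercept and slope for a simple linear regression y ~ x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Fit each group separately: equivalent to the separate-slopes (interaction) model.
grazed_b0, grazed_b1 = ols_line([1, 2, 3], [2, 4, 6])       # toy: exactly y = 2x
ungrazed_b0, ungrazed_b1 = ols_line([1, 2, 3], [5, 7, 9])   # toy: exactly y = 3 + 2x
```

In the toy data the two slopes coincide and only the intercepts differ, which is the situation where the Root:Grazing interaction term would be non-significant and could be dropped.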

20 Ancova
> anova( lm(Fruit ~ Root * Grazing) )
Analysis of Variance Table
Response: Fruit
             Df Sum Sq Mean Sq F value    Pr(>F)
Root                                    < 2.2e-16 ***
Grazing                                     e-12 ***
Root:Grazing
Residuals
> anova( lm(Fruit ~ Grazing * Root) )
Analysis of Variance Table
Response: Fruit
             Df Sum Sq Mean Sq F value    Pr(>F)
Grazing                                     e-09 ***
Root                                    < 2.2e-16 ***
Grazing:Root
Residuals

The order of the variables matters in the F table.

21 Ancova
A safer test of significance: drop each term from the model in turn.
> drop1( lm(Fruit ~ Root * Grazing), test="F" )
Single term deletions
Model: Fruit ~ Root * Grazing
             Df Sum of Sq    RSS    AIC F value Pr(F)
Root:Grazing
> drop1( lm(Fruit ~ Root + Grazing), test="F" )
Single term deletions
Model: Fruit ~ Root + Grazing
        Df Sum of Sq    RSS    AIC F value     Pr(F)
Root                                        < 2.2e-16 ***
Grazing                                         e-13 ***
AIC = Akaike Information Criterion

22 Ancova
> mod3 <- lm(Fruit ~ Root + Grazing)
> summary(mod3)
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)                                   e-15 ***
Root                                        < 2e-16 ***
GrazingUngrazed                               e-13 ***
---
Residual standard error:        on 37 degrees of freedom
Multiple R-squared:        , Adjusted R-squared:
F-statistic:        on 2 and 37 DF, p-value: < 2.2e-16
> anova(mod3)
Analysis of Variance Table
Response: Fruit
          Df Sum Sq Mean Sq F value    Pr(>F)
Root                                 < 2.2e-16 ***
Grazing                                  e-13 ***
Residuals

After simplifying the model, the p-values in the summary table agree with those in the F table.

23 Are the assumptions met?
1: Is the response variable continuous? Answer: YES!
2: Are the residuals normally distributed? Answer: YES!
> shapiro.test(mod3$residuals)
        Shapiro-Wilk normality test
data:  mod3$residuals
W =        , p-value =
> plot(mod3, which=2)   ## qqplot of std. residuals

24 Are the assumptions met?
3a: Are the residuals independent and identically distributed?
> plot(mod3, which=1)   ## residuals vs fitted values
> plot(mod3, which=3)   ## sqrt(abs(standardized residuals)) vs fitted values
Answer: Looks OK.

25 Any major influential points?
> plot(mod3, which=5)   ## Residuals vs Leverage
Answer: No!
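The leverage (hat value) on the x-axis of plot(mod3, which=5) measures how far a point's predictor values sit from the bulk of the data; high-leverage points can drag the fitted line toward themselves. For a single continuous predictor the hat value has a simple closed form; a sketch (generic formula with toy numbers, not the ipomopsis data):

```python
from statistics import mean

def leverages(x):
    """Hat values for simple linear regression: h_i = 1/n + (x_i - xbar)^2 / Sxx.
    They always sum to the number of fitted parameters (here 2: intercept + slope)."""
    n = len(x)
    xb = mean(x)
    sxx = sum((xi - xb) ** 2 for xi in x)
    return [1 / n + (xi - xb) ** 2 / sxx for xi in x]

# The outlying x = 10 receives by far the largest hat value.
h = leverages([1, 2, 3, 4, 10])
```

A common rule of thumb flags points whose hat value exceeds two or three times the average leverage p/n, which is the kind of point the Residuals vs Leverage plot is designed to expose.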