Analysis of Covariance
Harry R. Erwin, PhD
School of Computing and Technology
University of Sunderland

Resources
Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley.
Freund, RJ, and WJ Wilson (1998) Regression Analysis. Academic Press.
Gonick, L, and Woollcott Smith (1993) The Cartoon Guide to Statistics. HarperResource (for fun).

Introduction
Analysis of covariance (ANCOVA) combines regression and ANOVA:
–The response variable is continuous.
–There are one or more explanatory factors (the treatments).
–There are one or more continuous explanatory variables (the covariates).
ANCOVA is usually done in a treatment study where the covariates are included to sharpen the basic treatment/control comparison.
An interaction between the slope for a covariate and the treatment is not wanted. (Life is hard.)
The maximal model estimates a slope and an intercept for each combination of factor levels. Model simplification is the goal.

Context
The goal of analysis of covariance is to reduce the error variance. This increases the power of the tests and narrows the confidence intervals.
There may be measurable variables that affect the response but have nothing to do with the factors (treatments) in the experiment. Analysis of covariance adjusts for those variables.

The Covariance Model
For one treatment factor and one continuous control variable, x_ij, the model is:
–y_ij = β0 + τ_i + β1·x_ij + ε_ij
This says the response is a constant (β0) plus a second constant (τ_i, depending on the factor level) plus a third constant (β1) times the control variable (or covariate) plus an error (ε_ij).
The interest is in the differences between the treatment means (the τ_i), not in β0 or β1. You want to be able to reduce your model.
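A minimal R sketch of this model, fitted to simulated data (every name and number below is made up for illustration; nothing here comes from the book example):
set.seed(1)
n <- 20                                    # observations per treatment group
x <- runif(2 * n, 5, 10)                   # covariate x_ij
treat <- factor(rep(c("A", "B"), each = n))
y <- 10 + ifelse(treat == "B", 5, 0) +     # β0 = 10, τ_A = 0, τ_B = 5
     2.4 * x + rnorm(2 * n)                # β1 = 2.4, plus the error ε_ij
summary(lm(y ~ treat + x))                 # common slope, one intercept per level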

Assumptions in ANCOVA
1. The covariate x_ij is not affected by the experimental factors.
2. The regression relationship measured by β1 must be the same for all factor levels.
You need to verify these assumptions.

General Approach to ANCOVA
First look at the effect of x_ij. If it isn't significant, do an ANOVA and be done with it.
Check that x_ij is not significantly affected by the factor levels.
Test that β1 does not differ significantly across factor levels. A significant difference is an interaction (a bad thing) between the factors and the covariates.
Order matters: the covariates come after the factors in the model because they're less important.
If both checks pass, do the ANCOVA. (A sketch of these checks in R follows.)
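The checking sequence might be coded like this (a sketch with generic, made-up names: response y, covariate x, treatment factor f):
set.seed(2)
f <- factor(rep(c("ctl", "trt"), each = 15))    # treatment factor
x <- rnorm(30, mean = 10)                       # covariate
y <- 2 + 3 * x + (f == "trt") * 4 + rnorm(30)   # simulated response
summary(lm(y ~ x))        # any effect of the covariate at all?
summary(aov(x ~ f))       # assumption 1: x should not differ across levels of f
common <- lm(y ~ f + x)   # single common slope
full <- lm(y ~ f * x)     # separate slope for each factor level
anova(common, full)       # assumption 2: significant => slopes differ, stop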

Example
The response variable is weight, the explanatory factor is sex, and the continuous explanatory variable is age:
–weight_male = a_male + b_male × age
–weight_female = a_female + b_female × age
There are six possible models (listed below). The goal is to eliminate as many parameters as possible: reduce the model until all remaining parameters are significant.
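In R formula notation, the six candidates run from most to least complex (a sketch; the data frame df below is made up just so the calls run):
df <- data.frame(sex = factor(rep(c("female", "male"), each = 10)),
                 age = rep(1:10, 2))
df$weight <- 3 + 2 * df$age + rnorm(20)   # simulated weights
lm(weight ~ sex * age, data = df)   # separate intercepts and separate slopes
lm(weight ~ sex + age, data = df)   # separate intercepts, common slope
lm(weight ~ sex:age, data = df)     # common intercept, separate slopes
lm(weight ~ age, data = df)         # one line for both sexes
lm(weight ~ sex, data = df)         # two means, age ignored
lm(weight ~ 1, data = df)           # a single grand mean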

Book Example
Notes:
–Use plots to get insight into the significance of explanatory variables.
–Note the use of lm() in the models. It fits the same model as aov(), but with a different report (see the sketch below).
–Order matters—the data are non-orthogonal!
–summary.aov() gives the ANOVA-style table.
–Eliminate interaction terms first.
–anova() is used to compare models.
–summary.lm() provides the parameter estimates.
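aov() and lm() fit the same underlying model; only the default report differs. A sketch, using the compensation data introduced on the next slides:
m <- aov(Fruit ~ Grazing * Root, data = compensation)
summary(m)       # ANOVA-style table
summary.lm(m)    # regression-style coefficients from the same fit
anova(lm(Fruit ~ Grazing * Root, data = compensation))   # same table as summary(m)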

Background
This experiment studies the ability of a plant to regrow and produce seeds after grazing.
The pre-grazing size is the diameter of the top of the rootstock.
Grazing has two levels: grazed or ungrazed.
The response is the weight of seeds produced at the end of the growing season.
The size of the plant is believed to matter, as well as whether it was grazed.

Step 1
compensation <- read.table("compensation.txt", header = T)
attach(compensation)
names(compensation)
[1] "Root"    "Fruit"   "Grazing"
par(mfrow=c(2,2))       # room for several panels
plot(Root, Fruit)       # seed production against rootstock size
plot(Grazing, Fruit)    # Grazing is a factor, so this draws boxplots

Plot 1: Fruit against Root (scatterplot) and Fruit by Grazing (boxplots), from the code above.

Step 2
model <- lm(Fruit ~ Root * Grazing)    # wrong way--inflates the Grazing sum of squares!
summary.aov(model)
             Df Sum Sq Mean Sq F value    Pr(>F)
Root                                  < 2.2e-16 ***
Grazing                                    e-12 ***
Root:Grazing
Residuals
model <- lm(Fruit ~ Grazing * Root)    # correct way! Grazing is more important.
summary.aov(model)
             Df Sum Sq Mean Sq F value    Pr(>F)
Grazing                                    e-09 ***
Root                                  < 2.2e-16 ***
Grazing:Root
Residuals

Check to see if the interaction term is important
model2 <- lm(Fruit ~ Grazing + Root)   # drop the interaction
anova(model, model2)                   # use anova() to compare the two models
Analysis of Variance Table
Model 1: Fruit ~ Grazing * Root
Model 2: Fruit ~ Grazing + Root        # the simpler model
  Res.Df RSS Df Sum of Sq F Pr(>F)
The interaction is not significant, so the simpler model is preferred.

Report
summary.lm(model2)
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)                                     e-15 ***
GrazingUngrazed                                 e-13 ***
Root                                         < 2e-16 ***
Residual standard error: on 37 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 37 DF, p-value: < 2.2e-16
Row 1 is the intercept for the factor level that comes first in the alphabet (Grazed, as opposed to Ungrazed).
Row 2 is the difference Ungrazed – Grazed.
Row 3 is the slope of the graph of seed production against rootstock size.
Row 4 (when present) would be the difference in slopes, if the interaction term were significant. (Not significant here! 8)
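The two fitted lines can be recovered from the coefficient vector rather than by retyping numbers (a sketch, continuing the session above):
b <- coef(model2)
b[1]          # intercept of the Grazed line (first factor level)
b[1] + b[2]   # intercept of the Ungrazed line
b[3]          # the common slope, shared by both lines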

What's Going On?
sf <- split(Fruit, Grazing)    # seed production, split by grazing treatment
sr <- split(Root, Grazing)     # rootstock size, split by grazing treatment
plot(Root, Fruit, type="n", ylab="Seed production", xlab="Initial root diameter")
points(sr[[1]], sf[[1]], pch=16)   # Grazed plants as filled circles
points(sr[[2]], sf[[2]])           # Ungrazed plants as open circles
b <- coef(model2)
abline(b[1], b[3])                 # Grazed line (slope 23.56)
abline(b[1] + b[2], b[3], lty=2)   # Ungrazed line: same slope, different intercept

Plot 2: Seed production against initial root diameter, with the two fitted parallel lines (solid = Grazed, dashed = Ungrazed).

Suppose we ignored the initial root size?
tapply(Fruit, Grazing, mean)
  Grazed Ungrazed      # the Grazed mean comes out higher--the opposite of the true situation!
summary(aov(Fruit ~ Grazing))
            Df Sum Sq Mean Sq F value Pr(>F)
Grazing                                      *
Residuals
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Order Matters for Non-Orthogonal Data!
The total variation in the response (SSY) is the sum of:
–the variation explained by the treatment (SSA), plus
–the variation explained by the covariate, plus
–the variation explained by the interaction between the factor levels and the covariate (hopefully small), plus
–the residual (error) variation.
Because the factor levels and the covariate are correlated in non-orthogonal data, these sums of squares are assigned sequentially, so fitting the covariate first inflates the variation attributed to the treatment, potentially producing an invalid positive result. So put the treatment variable first in the model.

Because Order Matters!
Do you fit the categorical (treatment, T) or the continuous (control, L) explanatory variable first? With non-orthogonal data, order matters, so use a logical order: fit the treatment variable first. You're interested in the effect of the treatment, not of the control variable.
If the interaction between the treatment and control variables is significant, stop! It means the slopes differ significantly, which is a (nasty) problem.

Reading the Summary
summary.lm(model2)
Call:
lm(formula = Fruit ~ Grazing + Root)
Residuals:
    Min      1Q  Median      3Q     Max
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)                                     e-15 ***
GrazingUngrazed                                 e-13 ***
Root                                         < 2e-16 ***
Residual standard error: on 37 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 37 DF, p-value: < 2.2e-16

Using split()
split() applies to a vector or dataframe. sd <- split(d, f) divides the data in a dataframe (or vector), d, based on the factor, f.
sd will be a list of vectors, one per level of the factor (in alphabetical order by default).
Each vector in sd can be plotted with its own symbol to give insight into the differences between factor levels, as in the book example above. A tiny illustration follows.
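A tiny illustration with made-up data (nothing here comes from the book example):
d <- c(2, 4, 6, 8)
f <- factor(c("a", "b", "a", "b"))
sd <- split(d, f)
sd$a    # 2 6 -- the elements of d where f == "a"
sd$b    # 4 8 -- the elements of d where f == "b"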

The Moral
If you have covariates, use them. They will narrow your confidence intervals or reveal that you have a problem.
Order matters (it always does in regression). Use a logical order, and start by removing the highest-order interaction terms first.
If the treatment (categorical) variable interacts significantly with the control (continuous) variable, stop!