Linear Models in R Fish 552: Lecture 10

Supplemental Readings
Practical Regression and ANOVA using R (Faraway, 2002)
– Chapters 2, 3, 6, 7, 8, 10, 16
– http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf
– (This is essentially an older version, available as a free copy, of Julian Faraway's excellent book Linear Models with R)
QERM 514 Lecture Notes (Nesse, 2008)
– Chapters 3–9
– Hans Nesse practically wrote a book for this course!

Linear models
Today's lecture will focus on a single equation and how to fit it: the classic linear model,
y_i = β_0 + β_1 x_i1 + ... + β_p x_ip + ε_i,   ε_i ~ N(0, σ²)
where the response y_i is normally distributed.
– Categorical predictor(s): ANOVA
– Continuous predictor(s): Regression
– Mixed categorical/continuous predictor(s): Regression/ANCOVA

Typical Goals
What is this model for?
– Estimation: What parameters in a particular model best fit the data? (e.g. formulaic relationships such as the Ricker model)
– Inference: How certain are those estimates, and what can be interpreted from them?
– Adequacy: Is the model a reasonable choice for the data?
– Prediction: What predictions can be made for new observations?

Outline
– ANOVA
– Regression (ANCOVA)
– Model adequacy
* This lecture will not go into detail on statistical models or concepts, but rather presents the R functions used to fit those models

One-way ANOVA
The classical (null) hypothesis in a one-way ANOVA is that the means of all the groups are the same:
y_ij = μ_j + ε_ij, where i = 1, 2, ..., n_j indexes the observations within group j = 1, 2, ..., J
H0: μ_1 = μ_2 = ... = μ_J
H1: at least two of the μ_j's are different

Archaeological metals
Archaeological investigations work to identify similarities and differences between sites, and traces of metals found in artifacts give some indication of manufacturing techniques. The data set metals gives the percentage of iron found in pottery from four Roman-era sites.
> metals <- read.table("...", header = TRUE)   # data URL not preserved in the transcript
> head(metals, n = 3)
    Al   Fe   Mg   Ca   Na Site
1  ...  ...  ...  ...  ...    L
2  ...  ...  ...  ...  ...    L
3  ...  ...  ...  ...  ...    L
Site will automatically get coded as a factor

The model statement
The functions in R that fit the more common statistical models take as their first argument a model statement in a compact symbolic form. We fit the ANOVA model by specifying the model:
Fe ~ Site
We've actually briefly seen this symbolic form in the first plotting lecture:
plot(y ~ x, data = ...)
where y is the response (dependent variable) and x is the predictor (independent variable).

The model statement
Look up help on ?formula for a full explanation. Some common model statements:
y ~ x1 - 1            "-" means leave something out: fit the slope but not the intercept
y ~ x1 + x2           model with covariates x1 and x2
y ~ x1 + x2 + x1:x2   model with covariates x1 and x2 and an interaction between x1 and x2
y ~ x1 * x2           "*" denotes factor crossing, so this is equivalent to the statement above
y ~ (x1 + x2 + x3)^2  "^" indicates crossing to the specified degree: the three main effects for x1, x2, and x3 plus all possible second-order interactions
y ~ I(x1 + x2)        I() means treat something "as is": a model with a single covariate that is the sum of x1 and x2 (this way we don't have to create the variable x1 + x2 ourselves)
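One way to convince yourself of these equivalences (not from the original slides; the data frame d here is made up purely for illustration) is to inspect the column names of the design matrix that R builds from each formula:
# Toy data (hypothetical) to illustrate how formulas expand
d <- data.frame(y = rnorm(4), x1 = 1:4, x2 = c(2, 5, 3, 8))
colnames(model.matrix(y ~ x1 + x2 + x1:x2, data = d))
# [1] "(Intercept)" "x1"    "x2"    "x1:x2"
colnames(model.matrix(y ~ x1 * x2, data = d))   # same expansion as above
# [1] "(Intercept)" "x1"    "x2"    "x1:x2"
colnames(model.matrix(y ~ I(x1 + x2), data = d))
# [1] "(Intercept)" "I(x1 + x2)"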

aov() / summary()
The simplest way to fit an ANOVA model is with the aov function:
> Fe.aov <- aov(Fe ~ Site, data = metals)
> summary(Fe.aov)
            Df Sum Sq Mean Sq F value   Pr(>F)
Site         3    ...     ...     ... 1.68e-12 ***
Residuals   22    ...     ...
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Reading the table:
– Df: degrees of freedom for the group factor and for the residuals
– Sum Sq (Site): sum of squared differences between the group means and the overall mean
– Sum Sq (Residuals): sum of squared differences between observations within a group and their respective group mean
– Mean Sq: mean squared error, the sum of squares divided by its degrees of freedom
– F value: partial F-statistic comparing the full model to the reduced model
– Pr(>F): probability of observing F or higher
– Signif. codes: a quick visual guide to which p-values are low

One-way ANOVA
In the previous slide we fit the model with the aov() function. We can also fit the ANOVA model with the functions lm() and anova(); depending on what analysis we are conducting, we might choose either approach.
> Fe.lm <- lm(Fe ~ Site, data = metals)
> anova(Fe.lm)
Analysis of Variance Table
Response: Fe
          Df Sum Sq Mean Sq F value   Pr(>F)
Site       3    ...     ...     ... 1.68e-12 ***
Residuals 22    ...     ...
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Coding categorical variables
The lm() function is also used to fit regression models. (In fact, regression and ANOVA are really the same thing.) It all has to do with how a categorical variable is coded in the model. The ANOVA model can be written in the familiar regression form by cleverly selecting the predictors. Under treatment coding the dummy variables x1, x2, ..., x_{J-1} are:
Group   x1   x2   ...  x_{J-1}
1        0    0   ...   0
2        1    0   ...   0
3        0    1   ...   0
...
J        0    0   ...   1

treatment coding
Coding schemes describe how each group is represented by the values of x1, x2, ..., x_{J-1}. In R the default coding scheme for unordered factors is the treatment coding. This is likely what you learned in your introductory statistics courses.
– Recall that in this scheme the estimate of the intercept β0 represents the mean of the baseline group, and the estimate of each remaining β_j describes the difference between the mean of its group and the baseline group
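To see treatment coding in action, one option (a sketch, assuming the metals data from earlier is loaded with its default contrasts) is to look at the design matrix that lm() builds:
# Site has levels A, C, I, L; A is the baseline under treatment coding
head(model.matrix(~ Site, data = metals), n = 3)
#   (Intercept) SiteC SiteI SiteL
# 1           1     0     0     1
# 2           1     0     0     1
# 3           1     0     0     1
# The first rows of metals are from site L, so only the SiteL dummy is 1;
# an observation from the baseline site A would have all three dummies equal to 0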

Coding schemes
Under treatment coding the group means are:
μ1 = β0            (μ1 is the group chosen as the baseline)
μ2 = β0 + β1
μ3 = β0 + β2
...
μJ = β0 + β_{J-1}
This may seem trivial, but it's very important to know how categorical variables are being coded in a linear model when interpreting parameters. To find out the current coding scheme:
> options()$contrasts
        unordered           ordered
"contr.treatment"      "contr.poly"

Other coding schemes
There are several other coding schemes:
– helmert: Awkward interpretation; improves matrix computations
– poly: For ordered factor levels; β0 = constant effect, β1 = linear effect, β2 = quadratic effect, ...
– SAS: Same as treatment, but the last level of a group is used as the baseline (treatment always uses the first level)
– sum: When the group sample sizes are equal, the estimate of the intercept represents the grand mean and the β_j represent the differences of those levels from the grand mean
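The contrast matrices behind these schemes can be inspected directly with the contr.*() functions; for example, for a four-level factor (a short sketch, not from the original slides):
contr.treatment(4)  # default: first level is the baseline (columns are dummies for levels 2-4)
#   2 3 4
# 1 0 0 0
# 2 1 0 0
# 3 0 1 0
# 4 0 0 1
contr.sum(4)        # last level gets -1 in every column; the intercept becomes the grand mean
#   [,1] [,2] [,3]
# 1    1    0    0
# 2    0    1    0
# 3    0    0    1
# 4   -1   -1   -1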

Changing coding schemes
The C function is used to specify the contrast of a factor:
C(object, contr)
> ( metals$Site <- C(metals$Site, sum) )
 [1] L L L L L L L L L L L L L L C C I I I I I A A A A A
attr(,"contrasts")
[1] contr.sum
Levels: A C I L
The functions contr.<scheme>() (e.g. contr.treatment) will create the matrix of contrasts used in lm() and other functions.

One-way ANOVA
summary() on an lm object gives more output:
> summary(Fe.lm)
Call:
lm(formula = Fe ~ Site, data = metals)
Residuals:
   Min     1Q Median     3Q    Max
   ...    ...    ...    ...    ...
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)      ...        ...     ...  ...e-05 ***
SiteC            ...        ...     ...  ...e-06 ***
SiteI            ...        ...     ...      ...
SiteL            ...        ...     ...  ...e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: ... on 22 degrees of freedom
Multiple R-squared: ..., Adjusted R-squared: ...
F-statistic: ... on 3 and 22 DF, p-value: 1.679e-12
Notes:
– Residuals: a five-number summary of the residuals
– Recall that each t-test tests whether that β_j is significantly different from zero; here, sites C and L were significantly different from the baseline site, A
– Got to report R-squared, right?

Multiple comparisons
When comparing the means of the levels of a factor in an analysis of variance, simple pairwise t-tests will inflate the probability of declaring a significant difference when none is in fact present. There are several ways around this, the most common being Tukey's honest significant difference:
TukeyHSD(object)
where object needs to be a fitted model object from aov().

TukeyHSD()
> TukeyHSD(Fe.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = Fe ~ Site, data = metals)
$Site
    diff  lwr  upr p adj
C-A  ...  ...  ...   ...
I-A  ...  ...  ...   ...
L-A  ...  ...  ...   ...
I-C  ...  ...  ...   ...
L-C  ...  ...  ...   ...
L-I  ...  ...  ...   ...
For a significant difference, the (lwr, upr) interval should not contain zero.

plot(TukeyHSD(Fe.aov))

Model assumptions
1. Independence
   – Within and between samples
2. Normality
   – Histograms, QQ-plots
   – Tests for normality:
     Kolmogorov-Smirnov test: ks.test() (null hypothesis: the data follow a specified distribution)
     Shapiro-Wilk test: shapiro.test() (null hypothesis: the data came from a normal distribution)
3. Homogeneity of variance
   – Boxplots
   – Tests for equal variances (null hypothesis: all the variances are equal):
     Bartlett's test: bartlett.test()
     Fligner-Killeen test: fligner.test()
Load the MASS library

In-class exercise 1
Check the assumptions of the archaeological ANOVA model using plots and tests. Recall that the normality test is conducted on the residuals of the model, so you will need to figure out how to extract these from Fe.lm.
– Are the assumptions met?

ANCOVA / Regression
With a basic understanding of the lm() function it's not hard to fit other linear models. Regression models with mixed categorical and continuous variables can be fit with the lm() function. There is also a suite of functions associated with lm() objects which we use for common model evaluation and prediction routines (a few are previewed below).
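As a preview, here is a minimal sketch of that suite, applied to the one-way ANOVA fit from earlier (assuming Fe.lm is still in the workspace):
coef(Fe.lm)       # estimated coefficients
confint(Fe.lm)    # confidence intervals for the coefficients
fitted(Fe.lm)     # fitted values, one per observation
residuals(Fe.lm)  # residuals, used for checking model assumptions
predict(Fe.lm, newdata = data.frame(Site = "C"))  # predicted mean for a new site-C pot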

Marmot data
The length of yellow-bellied marmot whistles in response to simulated predators.
> head(marmot)
  len rep dist  type loc
1 ... ...  ... Human   A
2 ... ...  ... Human   A
3 ... ...  ... Human   A
4 ... ...  ... Human   A
5 ... ...  ... Human   A
6 ... ...  ... Human   A

Marmot data
– len: length of marmot whistle (response variable)
– rep: number of repetitions of whistle per bout (continuous)
– dist: distance to challenge when whistle began (continuous)
– type: type of challenge (Human, RC Plane, Dog) (categorical)
– loc: test location (A, B, C) (categorical)

Exploring potential models
Basic exploratory data analysis should always be performed before starting to fit a model, and you should always try to fit a meaningful model (see the sketch below). When there are at least two categorical predictors, an interaction plot is useful for determining whether the effect of x1 on y depends on the level of x2 (an interaction):
interaction.plot(x.factor, trace.factor, response)
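A minimal exploratory pass over the marmot data might look like this (a sketch; the column names are those listed above):
# Pairwise scatterplots of the response and the continuous predictors
pairs(marmot[, c("len", "rep", "dist")])
# Boxplots of the response against each categorical predictor
boxplot(len ~ type, data = marmot, ylab = "Whistle length")
boxplot(len ~ loc,  data = marmot, ylab = "Whistle length")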

interaction.plot(marmot$loc, marmot$type, marmot$len)
There is slight evidence for an interaction. No RCPlane challenges were tested at location C, which will prevent us from fitting an interaction between these two variables.

Exploring potential models
We can also examine potential interactions between continuous and categorical variables with simple bivariate plots conditioned on factors (here assuming the columns of marmot have been attached):
# Set up a blank plot (type = 'n' draws axes but no points)
plot(dist, len, xlab = "Distance", ylab = "Length", type = 'n')
# Add points for x and y, split by the levels of a factor
points(dist[type == "Dog"],     len[type == "Dog"],     pch = 17, col = "blue")
points(dist[type == "Human"],   len[type == "Human"],   pch = 18, col = "red")
points(dist[type == "RCPlane"], len[type == "RCPlane"], pch = 19, col = "green")
# levels(type) is a quick way to extract the names of the factor levels
legend("bottomleft", bty = 'n', levels(type),
       col = c("blue", "red", "green"), pch = 17:19)

Humans have a much stronger linear effect

A model
Suppose after some exploratory data analysis and model fitting we arrived at the model:
– Length ~ Location + Distance + Type + Distance*Type
We can fit this model simply by:
> ( interactionModel <- lm(len ~ loc + type*dist, data = marmot) )
Call:
lm(formula = len ~ loc + type * dist, data = marmot)
Coefficients:
     (Intercept)              locB              locC         typeHuman
             ...               ...               ...               ...
     typeRCPlane              dist    typeHuman:dist  typeRCPlane:dist
             ...               ...               ...               ...

The summary() command works just as before:
> summary(interactionModel)
Call:
lm(formula = len ~ loc + type * dist, data = marmot)
Residuals:
   Min     1Q Median     3Q    Max
   ...    ...    ...    ...    ...
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)           ...        ...     ...  ...e-15 ***
locB                  ...        ...     ...      ...
locC                  ...        ...     ...      ...
typeHuman             ...        ...     ...      ...  *
typeRCPlane           ...        ...     ...      ...
dist                  ...        ...     ...      ...
typeHuman:dist        ...        ...     ...      ...  **
typeRCPlane:dist      ...        ...     ...      ...
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: ... on 132 degrees of freedom
Multiple R-squared: ..., Adjusted R-squared: ...
F-statistic: ... on 7 and 132 DF, p-value: 8.208e-08

Extracting model components
All the components of the summary() output are also stored in a list:
> names(interactionModel)
 [1] "coefficients"  "residuals"     "effects"
 [4] "rank"          "fitted.values" "assign"
 [7] "qr"            "df.residual"   "contrasts"
[10] "xlevels"       "call"          "terms"
[13] "model"
> interactionModel$call
lm(formula = len ~ loc + type * dist, data = marmot)
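Many of these components also have extractor functions, which are generally preferred over $; for example (a short sketch):
# Equivalent ways to get the estimated coefficients
interactionModel$coefficients
coef(interactionModel)
# Residuals and fitted values, e.g. for a quick residual plot
plot(fitted(interactionModel), residuals(interactionModel),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero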

Comparing models
Suppose that the model without the interaction was also a potential model. We can look at the t-values to test whether a single β_j = 0, but we need to perform a partial F-test to test whether several predictors are simultaneously 0.
– This is what the ANOVA table tests: H0: reduced model; H1: full model
– Two lm() objects need to be given to anova()
> nonInteractionModel <- lm(len ~ loc + type + dist, data = marmot)
> anova(nonInteractionModel, interactionModel)
Analysis of Variance Table
Model 1: len ~ loc + type + dist
Model 2: len ~ loc + type * dist
  Res.Df RSS Df Sum of Sq   F Pr(>F)
1    ... ...
2    ... ...  2       ... ...    ... ***
The low p-value is evidence for the interaction.

AIC
AIC is a more sound way to select a model. In its simplest form,
AIC = -2 log(L̂) + 2p
where p = the number of parameters and L̂ is the maximized likelihood. The function AIC() will extract the AIC from a linear model. Note that there is also the function extractAIC(), which evaluates the log-likelihood based on the model deviance (as for generalized linear models) and uses a different penalty.
– Be careful!
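For example (a sketch using the two marmot models fit above); because the two functions use different constants and penalties, only compare models within one function:
AIC(nonInteractionModel, interactionModel)  # log-likelihood based AIC for both models
extractAIC(interactionModel)                # returns c(equivalent df, AIC); deviance-based scale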

AICc
AIC corrected is a better choice, especially when there are a lot of parameters relative to the size of the data:
AICc = AIC + 2p(p + 1) / (n - p - 1)
AICc should always be used, since AIC and AICc yield equivalent results as n gets large.
The second term is the correction term: what will it converge to as n gets very large?

Hands-on exercise 2
Compute the AIC and AICc for the two marmot models that were fit.
– For AIC use the AIC() function, but also do the computation by hand
– R does not have a built-in function for AICc in its base packages, so you will also have to do this computation by hand. Hint: use the logLik() function

Checking assumptions
Suppose that we decided on the marmot model that included the interaction term between distance and type of challenge. The same assumptions must be met, and they can be evaluated by plotting the model object:
plot(interactionModel)
– Clicking on the plot window will let you scroll through the diagnostic plots
– Specifying which = in the command lets you select a particular plot:
plot(interactionModel, which = 1)
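A common idiom (not from the slides) is to show all four default diagnostic plots at once in a 2 x 2 grid instead of clicking through them:
op <- par(mfrow = c(2, 2))  # split the plotting device into a 2 x 2 grid
plot(interactionModel)      # residuals vs fitted, normal QQ, scale-location, residuals vs leverage
par(op)                     # restore the previous plotting settings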

Checking the constant variance assumption: unusual observations are flagged (high leverage)

Normal QQ-plot: very heavy tails, so the normality assumption is not met

Scale-location plot: similar to the fitted vs. residual plot

Check for influential observations: Cook's distance is a function of the residual and leverage, so the isoclines trace out the Cook's distance for any point in a region

Parameter confidence intervals
Confidence intervals for model parameters can easily be obtained with the confint() function:
> confint(interactionModel)
                 2.5 % 97.5 %
(Intercept)        ...    ...
locB               ...    ...
locC               ...    ...
typeHuman          ...    ...
typeRCPlane        ...    ...
dist               ...    ...
typeHuman:dist     ...    ...
typeRCPlane:dist   ...    ...

Other useful functions
– addterm: forward selection using AIC (MASS package)
– dropterm: backwards selection using AIC (MASS package)
– stepAIC: step-wise selection using AIC (MASS package)
– cooks.distance: Cook's distance (use to check for influential observations)
– predict: use the model to predict future observations (see the sketch below)
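For example, a minimal sketch of predict() on the marmot model; the values in newObs are made up for illustration:
# A hypothetical new observation: a human challenge at location B, 20 units away
newObs <- data.frame(loc = "B", type = "Human", dist = 20)
predict(interactionModel, newdata = newObs, interval = "prediction")
# returns the predicted whistle length with a 95% prediction interval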