
BIO503: Lecture 4 Statistical models in R Stefan Bentink

Statistical tests in R
Just some examples:
> t.test()
> pairwise.t.test()
> chisq.test()
> fisher.test()
> ks.test()
> …
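As a minimal sketch (not from the original slides), here is one of these tests applied to a made-up 2x2 table of counts:
> m <- matrix(c(12, 5, 7, 14), nrow=2)  # hypothetical contingency table
> chisq.test(m)   # chi-squared test of independence
> fisher.test(m)  # exact test, preferable for small counts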

One sample t-test
> data(ChickWeight)
> t.test(ChickWeight[, 1], mu = 100)

        One Sample t-test

data:  ChickWeight[, 1]
t = …, df = 577, p-value = 5.529e-13
alternative hypothesis: true mean is not equal to 100
95 percent confidence interval:
 … …
sample estimates:
mean of x
        …

Two sample t-test
> t.test(ChickWeight$weight[ChickWeight$Diet == "1"],
+        ChickWeight$weight[ChickWeight$Diet == "2"])

        Welch Two Sample t-test

data:  ChickWeight$weight[ChickWeight$Diet == "1"] and ChickWeight$weight[ChickWeight$Diet == "2"]
t = …, df = …, p-value = …
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 … …
sample estimates:
mean of x mean of y
       …         …

Linear Regression Models
The simple linear regression model is

  y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

where y is the dependent variable, x the independent variable, \beta_0 the intercept, \beta_1 the regression coefficient and \varepsilon_i the residual error. Using the method of least squares, we can derive the following estimators:

  \hat\beta_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2,
  \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}

Our goal is to test the hypothesis H_0: \beta_1 = 0. We can do this with a t test: T = \hat\beta_1 / SE(\hat\beta_1), which under the null hypothesis follows a t distribution with n - 2 df.

Installing ISwR Package Please install the ISwR package on your computer. This package contains all the data sets used in Peter Dalgaard's book Introductory Statistics with R. To load the package into your current R session: > library(ISwR) To find out more information, including what objects are contained in a package: > library(help=ISwR)
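If the package is not installed yet, it can first be fetched from CRAN (assuming an internet connection):
> install.packages("ISwR")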

Example Dataset tlc
We'll be using the dataset tlc from the ISwR package. To load this dataset:
> data(tlc) # tlc = total lung capacity
What kind of object is tlc?
> class(tlc)
To learn about this dataset:
> help(tlc)
The attach command places the data frame on the search path, so its columns can be referenced directly:
> attach(tlc)
> age

Linear Regression with lm
Is there a linear relationship between height and Total Lung Capacity (TLC)?
> lmObject <- lm(tlc ~ height, data=tlc)
(Note that tlc is both the name of the data frame and of the column holding the TLC measurement.) What kind of object is lmObject?
> class(lmObject)
A model object encapsulates the results of a model fit. The desired quantities can be obtained using extractor functions. A basic extractor function is summary:
> summary(lmObject)
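Besides summary, other standard extractor functions include (these are base R, not specific to this example):
> coef(lmObject)     # estimated coefficients
> confint(lmObject)  # confidence intervals for the coefficients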

Interpreting the Output from lm
Call: stores the call to the lm function. Handy for keeping track of what model you fit.
Residuals: summary statistics of the residuals from the model. Recall: the residuals should be approximately Normally distributed if the model is a good fit.
Coefficients: the estimates from the model, with their standard errors, t statistics and p-values.
Residual standard error: the estimated standard deviation of the residuals; plugging the estimates into \sqrt{RSS / (n - 2)} gives the RSE.

Interpreting the Output from lm
Multiple R-squared: the squared Pearson correlation coefficient of y and x (e.g. tlc, height). Do cor(tlc[,4], height)^2 to check.
F statistic: the F test that the regression coefficient is zero.
Adjusted R-squared: R-squared penalized for the number of parameters in the model.

Visualizing Our Fitted Model
So what does it really mean to have fit a linear model of TLC and height? Plot the data:
> TLC <- tlc[,4]
> plot(height, TLC)
Add the regression line we fit with our model:
> abline(lmObject)

Values Fitted by the Linear Model Retrieve the fitted values using the extractor function fitted: > fitted(lmObject) Convince yourself that these are the points that fall on the line we just made in the previous plot. > plot(height, TLC) > abline(lmObject) > points(height, fitted(lmObject), pch=20, col="red") To grab the residual values: > resid(lmObject)

Eyeballing Good Fit We can use the information from the residuals and fitted values to create plots to see how good our model fit is. > plot(height, TLC) > abline(lmObject) Use the segments function to draw vertical lines from the fitted values to the real data values. > segments(height, fitted(lmObject), height, TLC) We can also take a look at the residuals. > plot(height, resid(lmObject)) And use a QQ plot to see if the residuals are normally distributed. > qqnorm(resid(lmObject)) > qqline(resid(lmObject))

Confidence Interval Bands
Confidence bands are added to regression lines to reflect the uncertainty about a true parameter that cannot be observed, i.e. the true regression line. A narrower confidence band suggests a well-determined line. Using the predict function without any other input arguments yields the fitted values predicted by the model:
> predict(lmObject)
To compute the confidence interval bands for the fitted values, you need to specify the interval argument:
> predict(lmObject, interval="confidence")

Visualizing Confidence Interval Bands Go ahead and plot these values: > pp <- predict(lmObject, interval="confidence") > plot(height, pp[,1], type="l", ylim=range(pp)) > lines(height, pp[,2], col="red", lwd=3) > lines(height, pp[,3], col="blue", lwd=3) What's the problem?

Predicting on New Data
Our problem would be solved if the height values were ordered sequentially. One solution: predict fitted values on new (ordered) height values. Create a sequence of numbers that goes approximately from min(height) to max(height).
> range(height)
> new <- data.frame(height = seq(from=120, to=200, by=2))
Compute new predictions:
> pp.new <- predict(lmObject, new, interval="confidence")

Plotting Confidence Interval Bands Now we can make the plot. First the fitted values: > plot(new$height, pp.new[,1], type="l") Then the upper interval band: > lines(new$height, pp.new[,2], col="red", lwd=2) Finally the lower interval band: > lines(new$height, pp.new[,3], col="blue", lwd=2)

Prediction Interval Bands Prediction interval bands reflect the uncertainty about future observations. Prediction bands are generally wider than the confidence interval bands. > predict(lmObject, interval="prediction") Plot the fitted data with the prediction interval bands and confidence interval bands superimposed. > pp.pred <- predict(lmObject, new, interval="prediction") > pp.conf <- predict(lmObject, new, interval="confidence")

Visualizing Fitted Values, Prediction and Confidence Bands Note: instead of using lines individually, we can also use the function matlines which plots columns of a matrix. > help(matlines) > plot(new$height, pp.pred[,1], ylim=range(pp.pred, pp.conf)) Add the prediction bands: > matlines(new$height, pp.pred[,-1], type=c("l", "l"), lwd=c(3,3), col=c("red", "blue"), lty=c(1,1)) Add the confidence bands: > matlines(new$height, pp.conf[,-1], type=c("l", "l"), lwd=c(3,3), col=c("red", "blue"), lty=c(2,2))

Tutorial 1

ANOVA Models
The (one-way) ANOVA model generalizes linear regression: one factor with G levels (i = 1, …, G) and J replicates for each level (j = 1, …, J). The ANOVA model:

  y_{ij} = \mu + \alpha_i + \varepsilon_{ij}

Analysis of Variance Models
ANOVA is all about splitting variation up. With G levels and J replicates per level:
Total variation: SST = \sum_i \sum_j (y_{ij} - \bar{y}_{..})^2
Variation between groups: SSB = J \sum_i (\bar{y}_{i.} - \bar{y}_{..})^2
Variation within groups: SSW = \sum_i \sum_j (y_{ij} - \bar{y}_{i.})^2

ANOVA Models
Question: is there a significant relationship between our factor A and the response variable Y? If yes, then ideally the variation within groups should be small and the variation between groups should be large. Our test statistic:

  F = [SSB / (G - 1)] / [SSW / (n - G)]

with between-groups df = G - 1 (the number of levels minus one) and within-groups df = n - G. General idea: under the null hypothesis that the factor A has no effect, F is close to 1. Large values of F indicate the factor is significant.

ANOVA Example
Let's use a different dataset:
> library(MASS)
> data(ChickWeight)
> attach(ChickWeight)
The factor Diet has 4 levels.
> levels(Diet)
> anova(lm(weight ~ Diet, data=ChickWeight))
Analysis of Variance Table

Response: weight
           Df Sum Sq Mean Sq F value Pr(>F)
Diet        3      …       …       …  …e-07
Residuals   …      …       …
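As a sketch (not on the original slide), the F value can be reproduced by splitting the variation up by hand; note that the groups here are unbalanced, so the observed group sizes replace the common J from the formulas above:
> G <- nlevels(Diet); n <- length(weight)
> grand <- mean(weight)
> means <- tapply(weight, Diet, mean)    # group means
> ni <- table(Diet)                      # group sizes
> SSB <- sum(ni * (means - grand)^2)     # between-group variation
> SSW <- sum((weight - means[Diet])^2)   # within-group variation
> (SSB / (G - 1)) / (SSW / (n - G))      # should match the F value above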

Two-way ANOVA
We can fit a two-way ANOVA:
> anova(lm(weight ~ Diet + Chick, data=ChickWeight))
Analysis of Variance Table

Response: weight
           Df Sum Sq Mean Sq F value Pr(>F)
Diet        …      …       …       …  …e-07
Chick       …      …       …       …      …
Residuals   …      …       …
The table is sequential: each line tests a term given the terms listed above it. The Diet line compares weight ~ 1 against weight ~ Diet; the Chick line compares weight ~ Diet against weight ~ Diet + Chick.
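The same comparison can be made explicitly by fitting both models and handing them to anova(); this reproduces the test in the Chick line:
> m1 <- lm(weight ~ Diet, data=ChickWeight)
> m2 <- lm(weight ~ Diet + Chick, data=ChickWeight)
> anova(m1, m2)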

Tutorial 2

Multiple Linear Regression
- Multiple explanatory variables to explain a response variable.
- Can I explain the values of the response variable by the levels of the explanatory variables?
- Do I need all explanatory variables to explain the response variable?

Specifying Models
In R we use model formulae to specify the model we want to fit to our data.
y ~ x              Simple linear regression
y ~ x - 1          Simple linear regression without the intercept (line goes through origin)
y ~ x1 + x2 + x3   Multiple regression
y ~ x + I(x^2)     Quadratic regression
log(y) ~ x1 + x2   Multiple regression of a transformed variable
For factors A, B:
y ~ A              1-way ANOVA
y ~ A + B          2-way ANOVA
y ~ A*B            2-way ANOVA + interaction term
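For example, the quadratic formula can be tried on the tlc data from earlier; I() ensures that ^ is interpreted arithmetically rather than as formula syntax:
> quad <- lm(tlc ~ height + I(height^2), data=tlc)
> summary(quad)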

Fit multiple regression model to a data.frame
> ## Get data.frame
> cfseal <- read.table("cfseal.txt", header=T,
+                      sep="\t")
> heart.log <- log(cfseal$heart)
> cfseal.log <- cfseal
> cfseal.log[,1] <- heart.log
> colnames(cfseal.log)[1] <- "heart.log"
> ## Fit model: the dot means "all remaining columns as predictors"
> seal.lm <- lm(heart.log ~ ., data=cfseal.log)

Update models and model selection
Some handy functions to know about:
new.model <- update(old.model, new.formula)
Model selection functions: drop1, add1 and step ship with base R; dropterm, addterm and stepAIC are in the MASS package.
Similarly, anova(modObj, test="Chisq") gives sequential chi-squared tests for the terms of a fitted model.
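A minimal sketch with the seal model from the previous slide (assuming cfseal.txt contains a column named weight; substitute any column actually present in your file):
> library(MASS)                                  # for stepAIC
> seal.small <- update(seal.lm, . ~ . - weight)  # drop one predictor
> anova(seal.small, seal.lm)                     # F test of the dropped term
> stepAIC(seal.lm)                               # stepwise selection by AIC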

Generalized Linear Models Linear regression models hinge on the assumption that the response variable follows a Normal distribution. Generalized linear models are able to handle non-Normal response variables and transformations to linearity.
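The call has the same shape as lm, with an extra family argument naming the response distribution and link (myData is a placeholder here):
glm(y ~ x1 + x2, family=binomial, data=myData)   # binary response
glm(y ~ x1 + x2, family=poisson, data=myData)    # count response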

Logistic Regression
When faced with a binary response Y ∈ {0, 1}, we use logistic regression:

  logit(p) = log(p / (1 - p)) = \beta_0 + \beta_1 x, where p = P(Y = 1)

Logistic regression

Problem 2 – Logistic Regression
Read in the anaesthetic data set, data file: anaesthetic.txt. Covariates:
move: binary numeric vector for patient movement (1 = movement, 0 = no movement)
conc: anaesthetic concentration
Goal: estimate how the probability of movement varies with increasing concentration of the anaesthetic agent.

Fit the Logistic Regression Model
> anes.logit <- glm(move ~ conc, family=binomial(link=logit),
+                   data=anesthetic)
The output summary looks like this:
> summary(anes.logit)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)        …          …       …       … **
conc               …          …       …       … **
Estimates of P(Y=1) are given by:
> fitted.values(anes.logit)

Estimating the Log Odds
To get back the log odds (the linear predictor):
> anes.logit$linear.predictors
> plot(anesthetic$conc, anes.logit$linear.predictors)
> abline(coefficients(anes.logit))
It looks like the odds of not moving increase markedly once the concentration of the anaesthetic agent exceeds about 0.8.
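To translate log odds back into probabilities, apply the inverse logit (plogis in R); a small sketch using the fit above, noting that the concentration at which P(move) = 0.5 is -intercept/slope:
> b <- coefficients(anes.logit)
> plogis(anes.logit$linear.predictors)  # same as fitted.values(anes.logit)
> -b[1] / b[2]                          # concentration where P(move) = 0.5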

Problem 3 – Multiple Logistic Regression
Read in the data set birthwt.txt. We fit a logistic regression using the glm function with the binomial family.
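A minimal sketch, assuming birthwt.txt is tab-separated with the columns of the classic birthwt data (low = low birth weight indicator, plus covariates such as age, lwt and smoke):
> birthwt <- read.table("birthwt.txt", header=T, sep="\t")
> bw.logit <- glm(low ~ age + lwt + smoke, family=binomial,
+                 data=birthwt)
> summary(bw.logit)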

Problem 4 - Poisson Regression Poisson regression is often used for the analysis of count data or the calculation of rates associated with a rare event or disease. Example: schooldata.csv. We can fit the Poisson regression model using the glm function and the poisson family.
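A minimal sketch with hypothetical column names (cases = event count, population = exposure); for rates, the exposure enters as a log offset:
> school <- read.csv("schooldata.csv")
> pois.fit <- glm(cases ~ ses + offset(log(population)),
+                 family=poisson, data=school)
> summary(pois.fit)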

Survival Analysis
> library(survival)
Example: the aml leukaemia data.
Kaplan-Meier curve (fit here to the first 11 observations, one of the two treatment groups):
> fit1 <- survfit(Surv(aml$time[1:11], aml$status[1:11]) ~ 1)
> summary(fit1)
> plot(fit1)
Log-rank test comparing the treatment groups:
> survdiff(Surv(time, status) ~ x, data=aml)

Survival analysis
Fit a Cox proportional hazards model and plot the group-specific survival curves:
> cp <- coxph(Surv(aml$time, aml$status) ~ x, data=aml)
> summary(cp)
> plot(survfit(Surv(aml$time, aml$status) ~ x,
+              data=aml), col=c("red","green"), lwd=2)