Canadian Bioinformatics Workshops www.bioinformatics.ca.

Module 4: Regression

Module 4: Regression bioinformatics.ca
Regression
What is regression? One of the most widely used statistical methodologies. It describes how one variable (or set of variables) depends on another variable (or set of variables). Examples:
- Weight vs. height
- Yield vs. fertilizer

Outline
- Introduction
- Simple linear regression
- Multiple linear regression
- For both cases we will discuss: assumptions; fitting a model in R and interpreting the output; model assessment
- Some model selection procedures

Regression
- Regression aims to predict a response that can take on continuous values.
- Often characterized as quantitative prediction, rather than qualitative.
- Simple linear regression is part of a much more general methodology: generalized linear models.
- Very closely related to the t-test and ANOVA.

Simple Linear Regression Model
Linear regression assumes a particular model:
y_i = α + β x_i + ε_i
The ε_i are "errors" - not in the sense of being "wrong", but in the sense of creating deviations from the idealized model. The ε_i are assumed to be independent and N(0, σ²) (normally distributed); once estimated they are also called residuals. The model has two parameters: the regression coefficient β and the intercept α.
x_i is the independent variable. Depending on the context, it is also known as the "predictor variable," "regressor," "controlled variable," "manipulated variable," "explanatory variable," "exposure variable," or "input variable."
y_i is the dependent variable, also known as the "response variable," "regressand," "measured variable," "observed variable," "responding variable," "explained variable," "outcome variable," "experimental variable," or "output variable."

Simple Linear Regression
Characteristics:
- Only two variables are of interest
- One variable is a response and one a predictor
- No adjustment is needed for confounding or other between-subject variation
Assumptions:
- Linearity
- σ² is constant, independent of x
- The ε_i are independent of each other
- For proper statistical inference (CIs, p-values), the ε_i are normally distributed
- No outliers
- x is measured without error

A Simple Example
Investigate the relationship between yield (litres) and fertilizer (kg/ha) for tomato plants. Varying amounts of fertilizer were randomly assigned to 11 plots of land, and the yields were measured at the end of the season. Interest also lies in predicting the yield when 16 kg/ha is applied.
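The yield data itself appears only as an image in the original slides, so this sketch uses R's built-in cars dataset (stopping distance vs. speed) as a stand-in to show the same fitting workflow with lm():

```r
# Stand-in data: R's built-in `cars` dataset (the workshop's tomato-yield
# data is not reproduced in the transcript).
fit <- lm(dist ~ speed, data = cars)  # response ~ predictor
summary(fit)                          # estimates, residual summary, R^2, etc.
```

The same call, lm(yield ~ fertilizer, data = ...), would fit the tomato example once the data are loaded.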

We are interested in fitting the line y = α + β x.

Linear regression
Linear regression analysis includes:
A. Estimation of the parameters;
B. Characterization of the goodness of fit.

Linear regression: estimation
For a linear model, we estimate parameters a and b in the fitted line ŷ = a + b x.
Estimation: choose the parameters a and b so that the sum of squared errors, SSE = Σ (y_i − ŷ_i)², is as small as possible. We call these the least squares estimates. The method of least squares has an analytic solution for the linear case.
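The analytic least-squares solution mentioned above can be checked directly against lm(); a sketch using the built-in cars dataset as stand-in data:

```r
# Closed-form least-squares estimates for simple linear regression:
#   b = Sxy / Sxx,  a = ybar - b * xbar
x <- cars$speed
y <- cars$dist
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)
fit <- lm(dist ~ speed, data = cars)
c(a = a, b = b)  # analytic estimates
coef(fit)        # identical values from lm()
```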

Linear regression: residuals
The residual for each observation is e_i = y_i − ŷ_i, the vertical distance between the observed point and the fitted line.

The R output shows:
- The model we fit
- A summary of the residuals
- The parameter estimates (with standard errors, t values, and p-values)
- Other useful quantities (residual standard error, R²)

The fitted line

Interpretation of the R output
The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1 unit. The estimated intercept is the estimated yield when the amount of fertilizer is 0. The estimated standard error is an estimate of the standard deviation of the estimate over all possible repetitions of the experiment. It can be used to construct an approximate confidence interval:
estimate ± t* × SE, where t* is the appropriate quantile of a t-distribution with n − 2 degrees of freedom.
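The interval construction above can be sketched in R (cars as stand-in data) and compared with the built-in confint():

```r
fit   <- lm(dist ~ speed, data = cars)
b     <- coef(fit)["speed"]
se    <- summary(fit)$coefficients["speed", "Std. Error"]
tcrit <- qt(0.975, df = df.residual(fit))  # n - 2 degrees of freedom
c(b - tcrit * se, b + tcrit * se)          # 95% CI built by hand
confint(fit, "speed")                      # same interval via confint()
```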

Hypothesis testing in LM
In linear regression problems, one hypothesis of interest is whether the true slope is zero. Compute the test statistic
t = b / SE(b),
which is compared to a t-distribution with n − 2 = 9 degrees of freedom. Here the p-value is found to be very small. We can conclude that there is strong evidence that the true slope is not zero.
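The test statistic can be computed by hand and checked against the t value that summary() reports; a sketch with the cars stand-in data:

```r
fit    <- lm(dist ~ speed, data = cars)
est    <- summary(fit)$coefficients["speed", ]
t_stat <- est["Estimate"] / est["Std. Error"]        # t = b / SE(b)
p_val  <- 2 * pt(-abs(t_stat), df = df.residual(fit))  # two-sided p-value
c(t = unname(t_stat), p = unname(p_val))
```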

What about predictions?
What would be the future yield when 16 kg/ha of fertilizer is applied? Interpretation: the 95% confidence interval for the mean tomato yield when 16 kg/ha of fertilizer is applied is between … and … litres.

Prediction interval for a single observation
We can also compute prediction intervals for a single future observation. Prediction intervals for one observation are wider than confidence intervals for the mean, because they account for both the uncertainty in the estimated mean and the scatter of individual observations around it.
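A sketch contrasting the two interval types with predict(), again using the cars dataset as stand-in data:

```r
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 16)                       # a new predictor value
ci  <- predict(fit, new, interval = "confidence")   # CI for the mean response
pi  <- predict(fit, new, interval = "prediction")   # PI for one observation
ci
pi  # the prediction interval is the wider of the two
```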

Linear regression: quality control
Two parts:
- Is the model adequate? Residuals.
- Are the parameter estimates good? Prediction confidence limits; mean square error; cross-validation.

Linear regression: quality control
Residual plots allow us to validate the underlying assumptions:
- The relationship between response and regressor should be linear (at least approximately).
- The error term ε should have zero mean.
- The error term ε should have constant variance.
- The errors should be normally distributed (required for tests and intervals).
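A minimal residual-versus-fitted plot in R (cars stand-in data) to eyeball these assumptions:

```r
fit <- lm(dist ~ speed, data = cars)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # residuals should scatter evenly around zero,
                        # with no funnel shape (constant variance) or curve
```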

Linear regression: quality control
Residual-versus-fitted plots (source: Montgomery et al., 2001, Introduction to Linear Regression Analysis): check constant variance and linearity, and look for potential outliers.


Linear regression: Q-Q plot
Plotting the residuals against similarly distributed normal deviates checks the normality assumption (source: Montgomery et al., 2001, Introduction to Linear Regression Analysis; panels contrast adequate and inadequate fits).
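In R a normal Q-Q plot of the residuals takes two lines (cars stand-in data):

```r
fit <- lm(dist ~ speed, data = cars)
qqnorm(resid(fit))  # sample quantiles vs. theoretical normal quantiles
qqline(resid(fit))  # points near this line support the normality assumption
```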


Linear regression: evaluating accuracy
If the model is valid, i.e. nothing terrible shows up in the residuals, we can use it to predict. But how good is the prediction?

Another Example
Relationship between mercury in food and in the blood. Outliers?


The New Fitted Line With Prediction Intervals
# Sort on X so the interval lines draw left to right
o <- order(merc2[, 1])
mercn <- merc2[o, ]
# Compute confidence and prediction intervals
pc <- predict(Merc_fit, mercn, interval = "confidence")
pp <- predict(Merc_fit, mercn, interval = "prediction")
plot(mercn, xlab = "Mercury in Food", ylab = "Mercury in Blood")
matlines(mercn[, 1], pc, lty = c(1, 2, 2), col = "black")
matlines(mercn[, 1], pp, lty = c(1, 3, 3), col = "red")

Multiple Linear Regression
Similar to simple linear regression, but with multiple predictors. Not to be confused with multivariate regression, which has multiple responses. Many of the concepts carry over directly from simple linear regression. The model becomes:
y_i = β₀ + β₁ x_1i + β₂ x_2i + … + β_p x_pi + ε_i

Model Assumptions
Marginal linearity. Random sampling. No outlier or influential points. Constant variance. Independence of observations. Normality of the errors. Predictors are measured without error.

An Example: the Stackloss Dataset
The data sets stack.loss and stack.x contain information on ammonia loss in a manufacturing plant (oxidation of ammonia to nitric acid), measured on 21 consecutive days. stack.x is a matrix with 21 rows and 3 columns representing three predictors: air flow to the plant (Air.Flow), cooling water inlet temperature in °C (Water.Temp), and acid concentration as a percentage (Acid.Conc., coded by subtracting 50 and then multiplying by 10). stack.loss is a vector of length 21 containing 10 times the percentage of ammonia lost (the response variable).
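These data ship with R (also combined into the stackloss data frame), so the multiple regression can be fit directly:

```r
# stackloss combines stack.x and stack.loss into one 21-row data frame
fit <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
summary(fit)  # one coefficient per predictor, plus the intercept
```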


Would a transformation be appropriate?


Careful with the interpretation! In multiple regression, each coefficient is the estimated change in the response for a one-unit change in that predictor while holding the other predictors fixed.

Model Selection
Prefer a simple (parsimonious) model: only include variables that significantly improve the model; this lowers the risk of over-fitting. One simple approach is to fit a model with all of the variables and ask whether we can drop one. In our example, we can compare the model with all three predictors to the model with two (Acid.Conc. omitted).
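The comparison described above can be carried out with an extra-sum-of-squares F-test via anova():

```r
fit_full <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
fit_red  <- lm(stack.loss ~ Air.Flow + Water.Temp, data = stackloss)
anova(fit_red, fit_full)  # F-test: does adding Acid.Conc. improve the fit?
```

A large p-value here would suggest the reduced model is adequate.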

Variable Selection: Procedure
Model selection follows five general steps:
1. Specify the maximum model (i.e. the largest set of predictors).
2. Specify a criterion for selecting a model.
3. Specify a strategy for selecting variables.
4. Specify a mechanism for fitting the models - usually least squares.
5. Assess the goodness-of-fit of the models and the predictions.

Some Criteria that can be Used
- R²: the proportion of total variation in the data that is explained by the predictors.
- F_p: hypothesis tests to find the set of p variables that is not statistically different from the full model.
- MSE_p: the set of p variables that gives the smallest estimated residual variance about the regression line.
- C_p and AIC/BIC: a combination of fit and a penalty for the number of predictors.

Choosing which Subset to Examine
- All possible subsets
- Forward addition
- Backward elimination
- Stepwise selection
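For example, backward elimination with the AIC criterion is available through R's step() function, applied here to the stackloss model:

```r
fit_full <- lm(stack.loss ~ ., data = stackloss)  # start from the maximum model
fit_back <- step(fit_full, direction = "backward", trace = 0)  # AIC-driven
formula(fit_back)  # predictors retained after backward elimination
```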

Regression: summary
Regression is a statistical technique for investigating and modeling the relationship between variables, which allows:
- Parameter estimation
- Hypothesis testing
- Using the model (prediction)
It is a powerful framework that can be readily generalized. You need to be familiar with your data, explore it in various ways, and check the model assumptions carefully!

We are on a Coffee Break & Networking Session