
Presentation on theme: "Canadian Bioinformatics Workshops www.bioinformatics.ca."— Presentation transcript:

1 Canadian Bioinformatics Workshops www.bioinformatics.ca

2 Module #: Title of Module

3 Module 4 Regression

4 Module 4: Regression bioinformatics.ca Regression What is regression? One of the most widely used statistical methodologies. Describes how one variable (or set of variables) depends on another variable (or set of variables). Examples: – Weight vs. height – Yield vs. fertilizer

5 Outline Introduction Simple Linear Regression Multiple Linear Regression For both cases we will discuss: – Assumptions – Fitting a model in R and interpreting the output – Model assessment Some model selection procedures.

6 Regression – Regression aims to predict a response that takes on continuous values. – Often characterized as quantitative rather than qualitative prediction. – Simple linear regression is part of a much more general methodology: Generalized Linear Models. – Very closely related to the t-test and ANOVA.

7 Simple Linear Regression Model Linear regression assumes a particular model: y_i = α + β x_i + ε_i. The ε_i are "errors" - not in the sense of being "wrong", but in the sense of creating deviations from the idealized model. The ε_i are assumed to be independent and N(0, σ²) (normally distributed); their fitted counterparts are called residuals. This model has two parameters: the regression coefficient β and the intercept α. x_i is the independent variable. Depending on the context, it is also known as a "predictor variable," "regressor," "controlled variable," "manipulated variable," "explanatory variable," "exposure variable," and/or "input variable." y_i is the dependent variable, also known as the "response variable," "regressand," "measured variable," "observed variable," "responding variable," "explained variable," "outcome variable," "experimental variable," and/or "output variable."

8 Simple Linear Regression Characteristics: – Only two variables are of interest – One variable is a response and one a predictor – No adjustment is needed for confounding or other between-subject variation Assumptions: – Linearity – σ² is constant, independent of x – The ε_i are independent of each other – For proper statistical inference (CIs, p-values), the ε_i are normally distributed – No outliers – x is measured without error

9 A Simple Example Investigate the relationship between yield (litres) and fertilizer (kg/ha) for tomato plants. A varying amount of fertilizer was randomly assigned to 11 plots of land and the yield measured at the end of the season. Interest also lies in predicting the yield when 16 kg/ha are applied. At the end of the experiment, the yields were measured and the following data were obtained.
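The transcript does not reproduce the slide's data table, so the sketch below uses made-up fertilizer/yield values (illustrative only, not the workshop's numbers) to show how such a model is fit in R:

```r
# Hypothetical data: 11 plots, fertilizer in kg/ha, yield in litres.
# These numbers are illustrative -- the slide's actual table is not
# reproduced in this transcript.
fertilizer <- c(12, 5, 15, 17, 20, 14, 6, 23, 11, 13, 8)
yield      <- c(24, 18, 31, 33, 35, 30, 20, 45, 25, 27, 21)

# Fit the simple linear regression: yield = a + b * fertilizer
fit <- lm(yield ~ fertilizer)
summary(fit)
```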

10 We are interested in fitting the line y = a + b·x

11 Linear regression Linear regression analysis includes: A. Estimation of the parameters; B. Characterization of goodness of fit.

12 Linear regression: estimation For a linear model y = a + b·x with estimated parameters a and b, estimation chooses a and b so that the SSE (sum of squared errors) is as small as possible. We call these the least squares estimates. The method of least squares has an analytic solution in the linear case.
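The analytic least squares solution can be checked against `lm()` directly; a sketch with illustrative data (not the slide's values):

```r
# Illustrative data (not the slide's values)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1)

# Closed-form least squares estimates:
#   b = Sxy / Sxx,   a = mean(y) - b * mean(x)
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

# They match what lm() computes
fit <- lm(y ~ x)
c(a = a, b = b)
coef(fit)
```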

13 Linear regression: residuals

14 The R output shows: the model we fit, a summary of the residuals, the parameter estimates, and other useful quantities.

15 The fitted line

16 Interpretation of the R output The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1 unit. The estimated intercept is the estimated yield when the amount of fertilizer is 0. The estimated standard error is an estimate of the standard deviation of the estimate over all possible replications of the experiment. It can be used to construct an approximate confidence interval: estimate ± t quantile × standard error.
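In R that interval can be computed by hand and compared against `confint()`; a sketch with illustrative data (not the slide's):

```r
# Illustrative data (not the slide's values)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
y <- c(14, 16, 19, 20, 23, 24, 26, 29, 30, 33, 34)
fit <- lm(y ~ x)

est <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]
n   <- length(x)

# 95% CI for the slope: estimate +/- t_{0.975, n-2} * SE
ci <- est + c(-1, 1) * qt(0.975, df = n - 2) * se
ci
confint(fit, "x", level = 0.95)  # same interval
```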

17 Hypothesis testing in LM In linear regression problems, one hypothesis of interest is whether the true slope is zero. Compute the test statistic t = (estimated slope) / (its standard error). This is compared to a t-distribution with n - 2 = 9 degrees of freedom. The p-value is found to be very small (less than 0.0001). We can conclude that there is strong evidence that the true slope is not zero.
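The t statistic and p-value that `summary()` reports can be reproduced by hand; a sketch with illustrative data (11 points, so n - 2 = 9 df as on the slide):

```r
# Illustrative data (not the slide's values); 11 points -> n - 2 = 9 df
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
y <- c(13, 17, 18, 21, 22, 26, 27, 30, 31, 35, 36)
fit <- lm(y ~ x)

b  <- coef(summary(fit))["x", "Estimate"]
se <- coef(summary(fit))["x", "Std. Error"]

# t statistic for H0: slope = 0, referred to a t with n - 2 df
tstat <- b / se
pval  <- 2 * pt(-abs(tstat), df = length(x) - 2)
c(t = tstat, p = pval)
```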

18 What about predictions? What would be the future yield when 16 kg/ha of fertilizer are applied? Interpretation? The 95% confidence interval for the mean tomato yield when 16 kg/ha of fertilizer are applied is between 28.81 and 32.15 litres.

19 Prediction Interval for a Single Observation We can also compute prediction intervals for a single future observation. Prediction intervals for a single observation are wider than confidence intervals for the mean.
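In R both intervals come from `predict()`; a sketch with hypothetical fertilizer data (standing in for the slide's table, which the transcript omits) showing that the prediction interval is the wider of the two:

```r
# Hypothetical data standing in for the slide's fertilizer example
fertilizer <- c(12, 5, 15, 17, 20, 14, 6, 23, 11, 13, 8)
yield      <- c(24, 18, 31, 33, 35, 30, 20, 45, 25, 27, 21)
fit <- lm(yield ~ fertilizer)

new <- data.frame(fertilizer = 16)
ci <- predict(fit, new, interval = "confidence")  # mean yield at 16 kg/ha
pi <- predict(fit, new, interval = "prediction")  # one future plot at 16 kg/ha
ci
pi
# Both are centred on the same fitted value; the prediction interval
# adds the variability of a single new observation, so it is wider.
```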

20 Linear regression: quality control Two parts: Is the model adequate? – Residuals Are the parameter estimates good? – Prediction confidence limits – Mean square error – Cross-validation

21 Linear regression: quality control Residual plots allow us to validate the underlying assumptions: – The relationship between response and regressor should be linear (at least approximately). – The error term ε should have zero mean. – The error term ε should have constant variance. – The errors should be normally distributed (required for tests and intervals).
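The residuals-vs-fitted plot is the standard first check; a sketch on simulated data (illustrative, not the slide's example):

```r
# Simulated illustrative fit (not the slide's data)
set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30, sd = 1)
fit <- lm(y ~ x)

# Residuals vs. fitted values: look for curvature (non-linearity),
# a funnel shape (non-constant variance), and isolated extreme points
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# plot(fit) produces this and the other standard diagnostic plots
```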

22 Linear regression: quality control Check constant variance and linearity, and look for potential outliers. Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis

23

24 Linear regression: Q-Q plot Plotting the residuals vs. similarly distributed normal deviates checks the normality assumption (panels: adequate vs. inadequate). Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis
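In R this check is a normal Q-Q plot of the residuals; a sketch on simulated data (illustrative only):

```r
# Simulated illustrative fit (not the slide's data)
set.seed(2)
x <- 1:40
y <- 1 + 0.8 * x + rnorm(40)
fit <- lm(y ~ x)

# Normal Q-Q plot of the residuals: points close to the reference line
# support the normality assumption
qqnorm(resid(fit))
qqline(resid(fit))
```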

25

26 Linear regression: evaluating accuracy If the model is valid, i.e. nothing terrible shows up in the residuals, we can use it to predict. But how good is the prediction?

27 Another Example Relationship between mercury in food and in the blood. Outliers?

28

29

30 The New Fitted Line With Prediction Intervals
# sort on X
o <- order(merc2[, 1])
mercn <- merc2[o, ]
# compute prediction and confidence intervals
pc <- predict(Merc_fit, mercn, interval = "confidence")
pp <- predict(Merc_fit, mercn, interval = "prediction")
plot(mercn, xlab = "Mercury in Food", ylab = "Mercury in Blood")
matlines(mercn[, 1], pc, lty = c(1, 2, 2), col = "black")
matlines(mercn[, 1], pp, lty = c(1, 3, 3), col = "red")

31 Multiple Linear Regression Similar to simple linear regression, but with multiple predictors. Not to be confused with multivariate regression, which has multiple responses. Many of the concepts carry over directly from simple linear regression. The model becomes: y_i = β0 + β1 x_i1 + β2 x_i2 + ... + βp x_ip + ε_i.

32 Model Assumptions Marginal linearity. Random sampling. No outliers or influential points. Constant variance. Independence of observations. Normality of the errors. Predictors are measured without error.

33 An Example: the Stackloss Dataset The data sets stack.loss and stack.x contain information on ammonia loss in a manufacturing plant (oxidation of ammonia to nitric acid) measured on 21 consecutive days. The stack.x data set is a matrix with 21 rows and 3 columns representing three predictors: air flow to the plant (Air.Flow), cooling water inlet temperature in °C (Water.Temp), and acid concentration as a percentage (Acid.Conc., coded by subtracting 50 and then multiplying by 10). The stack.loss data set is a vector of length 21 containing the percent of ammonia lost ×10 (the response variable).
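These data ship with R (stack.x and stack.loss, also combined as the stackloss data frame), so the three-predictor fit can be sketched directly:

```r
# stackloss is a built-in R data frame combining stack.x and stack.loss:
# 21 days, three predictors, and the response stack.loss
str(stackloss)

# Multiple linear regression with all three predictors
fit <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
summary(fit)
```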

34

35

36 Would a transformation be appropriate?

37

38 Careful with the interpretation!

39 Model Selection Prefer a simple (parsimonious) model. Only include variables that significantly improve the model. One simple way to do it: fit a model with all of the variables and ask whether we can drop one. This lowers the risk of over-fitting. In our example we can compare the model that has all three predictors with one that has two (Acid.Conc. omitted).
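That comparison of nested models is a partial F-test, available through `anova()`; a sketch on the built-in stackloss data:

```r
# Full model vs. the model with Acid.Conc. omitted
full    <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)
reduced <- lm(stack.loss ~ Air.Flow + Water.Temp,              data = stackloss)

# Partial F-test: does Acid.Conc. significantly improve the fit?
anova(reduced, full)
```

A large p-value in the second row of the ANOVA table means the reduced model is not significantly worse, so the extra predictor can be dropped.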

40 Variable Selection: Procedure Model selection follows five general steps: 1. Specify the maximum model (i.e. the largest set of predictors). 2. Specify a criterion for selecting a model. 3. Specify a strategy for selecting variables. 4. Specify a mechanism for fitting the models - usually least squares. 5. Assess the goodness-of-fit of the models and the predictions.

41 Some Criteria that can be used R²: the proportion of total variation in the data that is explained by the predictors. Fp: hypothesis tests to find the set of p variables that is not statistically different from the full model. MSEp: the set of p variables that gives the smallest estimated residual variance about the regression line. Cp and AIC/BIC: a combination of fit and a penalty for the number of predictors.

42 Choosing which Subset to Examine All possible subsets Forward addition Backward elimination Stepwise selection
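Backward elimination by AIC is available in R through `step()`; a sketch on the stackloss data from the earlier example:

```r
# Start from the full three-predictor model
full <- lm(stack.loss ~ Air.Flow + Water.Temp + Acid.Conc., data = stackloss)

# Backward elimination: repeatedly drop the predictor whose removal
# most lowers AIC, stopping when no further drop helps
sel <- step(full, direction = "backward", trace = 0)
formula(sel)
```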

43 Regression: summary Regression is a statistical technique for investigating and modeling the relationship between variables, which allows: Parameter estimation Hypothesis testing Use of the model (prediction) It's a powerful framework that can be readily generalized. You need to be familiar with your data, examine it in various ways, and check the model assumptions carefully!

44 We are on a Coffee Break & Networking Session

