Simple Linear Regression: An Introduction
Dr. Tuan V. Nguyen, Garvan Institute of Medical Research, Sydney

Give a man three weapons – correlation, regression and a pen – and he will use all three (Anon, 1978)

An example: age and cholesterol levels in 18 individuals

ID   Age   Chol (mg/ml)
 1    46   3.5
 2    20   1.9
 3    52   4.0
 4    30   2.6
 5    57   4.5
 6    25   3.0
 7    28   2.9
 8    36   3.8
 9    22   2.1
10    43   3.8
11    57   4.1
12    33   3.0
13    22   2.5
14    63   4.6
15    40   3.2
16    48   4.2
17    28   2.3
18    49   4.0

Read data into R

id <- 1:18
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
plot(chol ~ age, pch=16)

Questions of interest
– Association between age and cholesterol levels
– Strength of the association
– Prediction of cholesterol for a given age
These questions are addressed by correlation and regression analysis.

Variance and covariance: algebra
Let x and y be two random variables from a sample of n observations.
– Measure of variability of x and y: the variance
– Measure of covariation between x and y: the covariance
Algebraically, if x and y are independent: var(x + y) = var(x) + var(y)
In general: var(x + y) = var(x) + var(y) + 2cov(x, y)
where var(x) = Σ(x_i − x̄)² / (n − 1) and cov(x, y) = Σ(x_i − x̄)(y_i − ȳ) / (n − 1).
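
These identities can be checked numerically in R (a minimal sketch, using the age and chol vectors read in above; var() and cov() use the n − 1 denominator):

# Sample variances and covariance
var(age)
var(chol)
cov(age, chol)

# var(x + y) = var(x) + var(y) + 2*cov(x, y)
var(age + chol)
var(age) + var(chol) + 2 * cov(age, chol)   # should match the line above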

Variance and covariance: geometry
The independence or dependence between x and y can be represented geometrically, treating x and y as the sides of a triangle with resultant h:
– If x and y are independent (right angle between them): h² = x² + y²
– If x and y are dependent (angle H between them): h² = x² + y² − 2xy·cos(H)

Meaning of variance and covariance
– Variance is always positive.
– If the covariance = 0, x and y are uncorrelated (no linear association).
– Covariance is a sum of cross-products: it can be positive or negative.
– Negative covariance: deviations in the two distributions are in opposite directions (e.g., genetic covariation).
– Positive covariance: deviations in the two distributions are in the same direction.
– Covariance is a measure of the strength of association.

Covariance and correlation
Covariance is unit-dependent. The coefficient of correlation (r) between x and y is a standardized covariance, defined by:
r = cov(x, y) / (s_x · s_y)
where s_x and s_y are the standard deviations of x and y.
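
As a quick sketch in R, the standardized covariance should reproduce the value returned by cor() for the age and cholesterol data:

# Correlation as a standardized covariance
r <- cov(age, chol) / (sd(age) * sd(chol))
r
cor(age, chol)   # same value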

Positive and negative correlation
[Scatter plots illustrating a positive correlation (r = 0.9) and a negative correlation (r = -0.9).]

Test of hypothesis of correlation
Hypothesis: H0: ρ = 0 versus H1: ρ ≠ 0.
The standard error of r is: SE(r) = √[(1 − r²) / (n − 2)]
The t-statistic: t = r / SE(r)
This statistic has a t distribution with n − 2 degrees of freedom.
Fisher's z-transformation: z = 0.5·ln[(1 + r) / (1 − r)]
Standard error of z: SE(z) = 1 / √(n − 3)
A 95% CI for z can then be constructed as z ± 1.96 / √(n − 3), and back-transformed to the correlation scale.
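
A sketch of these tests in R for the age and cholesterol data: cor.test() performs the t-test directly, and the Fisher z interval can be built by hand and back-transformed with tanh().

# t-test of H0: rho = 0 (t with n - 2 df)
cor.test(age, chol)

# Fisher's z-transformation and an approximate 95% CI for the correlation
n <- length(age)
r <- cor(age, chol)
z <- 0.5 * log((1 + r) / (1 - r))    # equivalently atanh(r)
se.z <- 1 / sqrt(n - 3)
ci.z <- z + c(-1.96, 1.96) * se.z
tanh(ci.z)                           # back to the correlation scale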

An illustration of correlation analysis
Age (x) and cholesterol (y; mg/100 ml) for the 18 individuals; the table of individual values, means, SDs and Cov(x, y) is omitted here.
t-statistic = 0.56 / 0.26 = 2.17; the critical t-value with 17 df and alpha = 5% is 2.11.
Conclusion: there is a significant association between age and cholesterol.

Simple linear regression analysis
– Assessment: quantify the relationship between two variables.
– Prediction: make predictions and validate a test.
– Control: adjust for confounding effects (in the case of multiple variables).
In simple linear regression, only two variables are of interest: one response variable and one predictor variable; no adjustment is made for confounders or covariates.

Relationship between age and cholesterol

Linear regression: model
Y: random variable representing a response.
X: random variable representing a predictor variable (predictor, risk factor).
– Both Y and X can be categorical (e.g., yes/no) or continuous (e.g., age).
– If Y is categorical, the model is a logistic regression model; if Y is continuous, it is a simple linear regression model.
Model: Y = α + βX + ε
– α: intercept
– β: slope / gradient
– ε: random error (variation between subjects in Y even if X is constant, e.g., variation in cholesterol among patients of the same age).

Linear regression: assumptions
– The relationship is linear in the parameters.
– X is measured without error.
– The values of Y are independent of each other (e.g., Y_1 is not correlated with Y_2).
– The random error term (ε) is normally distributed with mean 0 and constant variance.

Expected value and variance
If the assumptions are tenable, then:
– The expected value of Y is: E(Y | x) = α + βx
– The variance of Y is: var(Y) = var(ε) = σ²

Estimation of model parameters
Given two points A(x1, y1) and B(x2, y2) in a two-dimensional space, we can derive the equation of the line connecting them:
Gradient: m = (y2 − y1) / (x2 − x1)
Equation: y = mx + a, where a is the intercept on the y-axis.
What happens if we have more than 2 points?

Estimation of α and β
For a series of pairs (x1, y1), (x2, y2), (x3, y3), …, (xn, yn), let a and b be sample estimates of the parameters α and β. We then have a sample equation: Y* = a + bx.
Aim: find the values of a and b so that the differences between the observed and fitted values (Y − Y*) are, collectively, as small as possible.
Let SSE = Σ(Y_i − a − bx_i)². The values of a and b that minimise SSE are called the least squares estimates.

Criteria of estimation
[Scatter plot of Chol (y) against Age (x), showing the vertical distances d_i between each observed y_i and the fitted line.]
The goal of the least squares estimator (LSE) is to find a and b such that the sum of d² is minimal.

Estimation of α and β
After some calculus, the results can be shown to be:
b = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²
a = ȳ − b·x̄
where x̄ and ȳ are the sample means of x and y.
When the regression assumptions are valid, the estimators of α and β have the following properties:
– Unbiased
– Uniformly minimal variance (i.e., efficient)
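
These closed-form expressions can be verified against lm() (a sketch; age and chol as before):

# Least squares estimates from the formulas above
Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (chol - mean(chol)))
b <- Sxy / Sxx                     # slope
a <- mean(chol) - b * mean(age)    # intercept
c(intercept = a, slope = b)

coef(lm(chol ~ age))               # should agree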

Goodness-of-fit
Now we have the equation Y = a + bX + e.
Question: how well does the regression equation describe the actual data?
Answer: the coefficient of determination (R²), the proportion of the variation in Y that is explained by the variation in X.

Partitioning of variations: concept
– SST = sum of squared differences between y_i and the mean of y.
– SSR = sum of squared differences between the predicted values of y and the mean of y.
– SSE = sum of squared differences between the observed and predicted values of y.
SST = SSR + SSE
The coefficient of determination is: R² = SSR / SST

Partitioning of variations: geometry
[Scatter plot of Chol (Y) against Age (X) with the fitted line and the mean of Y, showing how the total deviation (SST) splits into a regression component (SSR) and a residual component (SSE).]

Partitioning of variations: algebra
– Total variation: SST = Σ(y_i − ȳ)²
– Attributed to the model: SSR = Σ(ŷ_i − ȳ)²
– Residual sum of squares: SSE = Σ(y_i − ŷ_i)²
SST = SSR + SSE, so SSR = SST − SSE.
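
A sketch of this partitioning in R, fitting the model and computing the three sums of squares directly:

reg  <- lm(chol ~ age)
yhat <- fitted(reg)
SST  <- sum((chol - mean(chol))^2)
SSR  <- sum((yhat - mean(chol))^2)
SSE  <- sum((chol - yhat)^2)
c(SST = SST, SSR = SSR, SSE = SSE)
SSR + SSE        # equals SST (up to rounding)
SSR / SST        # coefficient of determination R^2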

Analysis of variance
SS increases in proportion to the sample size (n). Mean squares (MS) normalise for the degrees of freedom (df):
– MSR = SSR / p (where p = number of predictors; p = 1 in simple linear regression)
– MSE = SSE / (n − p − 1)
– MST = SST / (n − 1)
Analysis of variance (ANOVA) table:

Source      d.f.        Sum of squares (SS)   Mean squares (MS)   F-test
Regression  p           SSR                   MSR                 MSR/MSE
Residual    n − p − 1   SSE                   MSE
Total       n − 1       SST

Hypothesis tests in regression analysis
Sample data: Y = a + bX + e
Population: Y = α + βX + ε
H0: β = 0, i.e., there is no linear association between the outcome and the predictor variable.
In lay language, the p-value asks: "if there were truly no association, what is the chance of observing sample data at least as inconsistent with that null hypothesis as the data we actually observed?"

Inference about the slope (parameter β)
Recall that ε is assumed to be normally distributed with mean 0 and variance σ². The estimate of σ² is the MSE (or s²).
It can be shown that:
– The expected value of b is β, i.e., E(b) = β.
– The standard error of b is: SE(b) = s / √Σ(x_i − x̄)², where s = √MSE.
The test of whether β = 0 is then t = b / SE(b), which follows a t-distribution with n − 2 degrees of freedom.
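
A sketch of these calculations in R; the hand-computed standard error and t-statistic should match the age row of summary(reg):

reg  <- lm(chol ~ age)
n    <- length(age)
MSE  <- sum(residuals(reg)^2) / (n - 2)          # estimate of sigma^2
se.b <- sqrt(MSE / sum((age - mean(age))^2))     # SE of the slope
b    <- unname(coef(reg)["age"])
c(slope = b, se = se.b, t = b / se.b)

summary(reg)$coefficients                        # compare with the "age" row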

Confidence interval around a predicted value
Observed value: y_i. Predicted value: ŷ_i = a + bx_i.
The standard error of the predicted mean value is:
SE(ŷ_i) = s·√[1/n + (x_i − x̄)² / Σ(x_j − x̄)²]
An interval estimate for the mean of Y at x_i is then: ŷ_i ± t(n − 2; 0.975)·SE(ŷ_i).
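
In R, predict() returns these interval estimates directly (a sketch; "confidence" gives the interval for the mean response at a given age, "prediction" the wider interval for a new individual):

reg <- lm(chol ~ age)
new <- data.frame(age = c(30, 50))
predict(reg, newdata = new, interval = "confidence", level = 0.95)
predict(reg, newdata = new, interval = "prediction", level = 0.95)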

Checking assumptions
– Assumption of constant variance
– Assumption of normality
– Correctness of the functional form
– Model stability
All of these can be checked with graphical analyses. The residuals from the model, or functions of the residuals, play an important role in all of the model diagnostic procedures.

Checking assumptions
– Constant variance: plot the studentized residuals versus the fitted values, and examine whether the variability of the residuals remains relatively constant across the range of fitted values.
– Normality: plot the residuals versus their expected values under normality (normal probability plot). If the residuals are normally distributed, they should fall along a 45° line.
– Correct functional form: plot the residuals versus the fitted values, and examine the plot for evidence of a non-linear trend across the range of fitted values.
– Model stability: check whether one or more observations are influential, using Cook's distance.

Checking assumptions (cont.)
– Cook's distance (D) measures how much the fitted values of the regression model change if the i-th observation is removed from the data set.
– Leverage measures how extreme the value of x_i is relative to the remaining values of x.
– The studentized residual measures how extreme the value of y_i is relative to the remaining values of y.
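
All three diagnostics are available as standard extractor functions in R (a minimal sketch for the cholesterol model; the cutoff in the last line is only a common rule of thumb):

reg <- lm(chol ~ age)
cooks.distance(reg)    # Cook's distance for each observation
hatvalues(reg)         # leverage
rstudent(reg)          # studentized residuals
which(cooks.distance(reg) > 4 / length(chol))   # flag potentially influential points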

Remedial measures
– Non-constant variance: transforming the response variable (y) to a new scale (e.g., the logarithm) is often helpful. If no transformation can resolve the non-constant variance problem, use a more robust estimator such as iteratively weighted least squares.
– Non-normality: non-normality and non-constant variance often go hand in hand, so the same remedies apply.
– Outliers: check the data for accuracy; consider a robust estimator.
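
Two hedged sketches of these remedies: a log transformation of the response, and a robust fit by iterated re-weighted least squares using rlm() from the MASS package (assumed to be available).

# Log-transform the response to stabilise the variance
reg.log <- lm(log(chol) ~ age)
summary(reg.log)

# Robust regression (M-estimation via iterated weighted least squares)
library(MASS)
reg.rob <- rlm(chol ~ age)
summary(reg.rob)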

Regression analysis using R

id <- 1:18
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Fit the linear regression model
reg <- lm(chol ~ age)
summary(reg)

ANOVA result

> anova(reg)
Analysis of Variance Table

Response: chol
          Df Sum Sq Mean Sq F value    Pr(>F)
age        1                         1.058e-08 ***
Residuals 16
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Results of R analysis

> summary(reg)

Call:
lm(formula = chol ~ age)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value  Pr(>|t|)
(Intercept)                                        ***
age                                      1.058e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 16 degrees of freedom
Multiple R-Squared: , Adjusted R-squared:
F-statistic: on 1 and 16 DF, p-value: 1.058e-08

Diagnostics: influential data

par(mfrow=c(2,2))
plot(reg)

A non-linear illustration: BMI and sexual attractiveness
– Study of 44 university students
– Body mass index (BMI) measured
– Sexual attractiveness (SA) score measured

id <- 1:44
bmi <- c(11.00, 12.00, 12.50, 14.00, 14.00, 14.00, 14.00, 14.00, 14.00, 14.80, 15.00, 15.00, 15.50, 16.00, 16.50, 17.00, 17.00, 18.00, 18.00, 19.00, 19.00, 20.00, 20.00, 20.00, 20.50, 22.00, 23.00, 23.00, 24.00, 24.50, 25.00, 25.00, 26.00, 26.00, 26.50, 28.00, 29.00, 31.00, 32.00, 33.00, 34.00, 35.50, 36.00, 36.00)
sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5, 3.2, 3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3, 6.5, 4.9, 5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7, 3.5, 4.0, 3.7, 3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1, 2.1, 2.0, 1.8, 1.7)

Linear regression analysis of BMI and SA

reg <- lm(sa ~ bmi)
summary(reg)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                  e-09 ***
bmi                                               *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 42 degrees of freedom
Multiple R-Squared: , Adjusted R-squared:
F-statistic: on 1 and 42 DF, p-value:

BMI and SA: analysis of residuals

plot(reg)

BMI and SA: a simple plot

par(mfrow=c(1,1))
reg <- lm(sa ~ bmi)
plot(sa ~ bmi, pch=16)
abline(reg)

Re-analysis of sexual attractiveness data

# Fit 3 regression models
linear <- lm(sa ~ bmi)
quad <- lm(sa ~ poly(bmi, 2))
cubic <- lm(sa ~ poly(bmi, 3))

# Make a new BMI axis
bmi.new <- 10:40

# Get predicted values
quad.pred <- predict(quad, data.frame(bmi = bmi.new))
cubic.pred <- predict(cubic, data.frame(bmi = bmi.new))

# Plot predicted values
abline(reg)
lines(bmi.new, quad.pred, col="blue", lwd=3)
lines(bmi.new, cubic.pred, col="red", lwd=3)
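
A possible follow-up (a sketch): compare the three nested fits with F-tests and adjusted R² to see whether the curvature is worth keeping.

anova(linear, quad, cubic)         # sequential F-tests for the added terms
summary(linear)$adj.r.squared
summary(quad)$adj.r.squared
summary(cubic)$adj.r.squared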

Some comments: Interpretation of correlation
– Correlation lies between −1 and +1.
– A very small correlation does not mean that there is no association between the two variables; the relationship may be non-linear.
– For curvilinear relationships, a rank correlation may be better than Pearson's correlation (see the sketch below).
– A small correlation (e.g., 0.1) may be statistically significant, but clinically unimportant.
– R² is another measure of the strength of association. An r = 0.7 may sound impressive, but the corresponding R² is only 0.49!
– Correlation does not mean causation.
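
A sketch of the rank-based alternative in R, using the BMI and sexual attractiveness data above (where the relationship is curvilinear):

cor(bmi, sa)                              # Pearson
cor(bmi, sa, method = "spearman")         # Spearman rank correlation
cor.test(bmi, sa, method = "spearman")    # with a significance test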

Some comments: Interpretation of correlation
– Be careful with multiple correlations. For p variables there are p(p − 1)/2 possible pairwise correlations, and false positives become a problem.
– A correlation between two variables cannot be inferred from their correlations with a third variable: r(age, weight) = 0.05 and r(weight, fat) = 0.03 do not imply that r(age, fat) is near zero; in fact, r(age, fat) = 0.79.

Some comments: Interpretation of regression
– The fitted regression line is only an estimate of the relation between these variables in the population; there is uncertainty associated with the estimated parameters.
– The regression line should not be used to predict y for x values outside the range of the observed data (extrapolation).
– A statistical model is an approximation; the "true" relation may be non-linear, but a linear model is often a reasonable approximation.

Some comments: Reporting results
– Results should be reported in sufficient detail: the nature of the response and predictor variables, any transformation, checking of assumptions, etc.
– The regression coefficients (a, b), their standard errors, and R² are useful summaries.

Some final comments
Equations are the cornerstone on which the edifice of science rests. Equations are like poems, or even an onion. So be careful when building your equations!