Published by Harriet Ford. Modified over 9 years ago.
Simple Linear Regression: An Introduction
Dr. Tuan V. Nguyen, Garvan Institute of Medical Research, Sydney
Give a man three weapons – correlation, regression and a pen – and he will use all three (Anon, 1978)
An example: age and cholesterol levels in 18 individuals
    ID  Age  Chol (mg/ml)
     1   46   3.5
     2   20   1.9
     3   52   4.0
     4   30   2.6
     5   57   4.5
     6   25   3.0
     7   28   2.9
     8   36   3.8
     9   22   2.1
    10   43   3.8
    11   57   4.1
    12   33   3.0
    13   22   2.5
    14   63   4.6
    15   40   3.2
    16   48   4.2
    17   28   2.3
    18   49   4.0
Read data into R
    id <- 1:18
    age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
    chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
    # Scatter plot of cholesterol against age
    plot(chol ~ age, pch=16)
Questions of interest
Association between age and cholesterol levels
Strength of association
Prediction of cholesterol for a given age
Methods: correlation and regression analysis
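Variance and covariance: algebra
Let x and y be two random variables from a sample of n observations.
Measure of variability of x and y: variance.
Measure of covariation between x and y: covariance.
Algebraically:
If x and y are independent: $\mathrm{var}(x + y) = \mathrm{var}(x) + \mathrm{var}(y)$
In general: $\mathrm{var}(x + y) = \mathrm{var}(x) + \mathrm{var}(y) + 2\,\mathrm{cov}(x, y)$
Where:
$\mathrm{var}(x) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
$\mathrm{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$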
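The variance-of-a-sum identity can be checked numerically. The sketch below uses the age and cholesterol data from the example slide; the variable names `cov.manual`, `lhs` and `rhs` are illustrative and not part of the original slides.

```r
# Age and cholesterol data from the example slide
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)

# Sample covariance written out as a sum of cross-products
cov.manual <- sum((age - mean(age)) * (chol - mean(chol))) / (n - 1)

# Variance of a sum: var(x + y) = var(x) + var(y) + 2 cov(x, y)
lhs <- var(age + chol)
rhs <- var(age) + var(chol) + 2 * cov(age, chol)

print(c(cov.manual = cov.manual, cov.builtin = cov(age, chol)))
print(c(lhs = lhs, rhs = rhs))
```

Both pairs of numbers should agree up to floating-point error, because R's `var()` and `cov()` use the same n – 1 denominator.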
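Variance and covariance: geometry
The independence or dependence between x and y can be represented geometrically by a triangle with sides x and y and third side h:
Independent (x and y at right angles): $h^2 = x^2 + y^2$
Dependent (angle H between x and y): $h^2 = x^2 + y^2 - 2xy\cos(H)$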
Meaning of variance and covariance
Variance is never negative.
If x and y are independent, their covariance is 0. The converse does not hold in general: zero covariance only means there is no linear association.
Covariance is a sum of cross-products: it can be positive or negative.
Negative covariance = deviations in the two distributions are in opposite directions, e.g. genetic covariation.
Positive covariance = deviations in the two distributions are in the same direction.
Covariance = a (unit-dependent) measure of strength of association.
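Covariance and correlation
Covariance is unit-dependent.
The coefficient of correlation (r) between x and y is a standardized covariance.
r is defined by: $r = \frac{\mathrm{cov}(x, y)}{SD_x \times SD_y}$, where $SD_x$ and $SD_y$ are the standard deviations of x and y.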
Positive and negative correlation
[Figure: two scatter plots, one with r = 0.9 and one with r = -0.9]
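Test of hypothesis of correlation
Hypothesis: $H_0: \rho = 0$ versus $H_1: \rho \neq 0$, where $\rho$ is the population correlation estimated by r.
Standard error of r: $SE(r) = \sqrt{\frac{1 - r^2}{n - 2}}$
The t-statistic: $t = \frac{r}{SE(r)}$
This statistic has a t distribution with n – 2 degrees of freedom.
Fisher's z-transformation: $z = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right)$
Standard error of z: $SE(z) = \frac{1}{\sqrt{n - 3}}$
Then the 95% CI of z can be constructed as: $z \pm 1.96 \times SE(z)$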
An illustration of correlation analysis
    ID    Age (x)  Cholesterol (y; mg/100ml)
     1     46       3.5
     2     20       1.9
     3     52       4.0
     4     30       2.6
     5     57       4.5
     6     25       3.0
     7     28       2.9
     8     36       3.8
     9     22       2.1
    10     43       3.8
    11     57       4.1
    12     33       3.0
    13     22       2.5
    14     63       4.6
    15     40       3.2
    16     48       4.2
    17     28       2.3
    18     49       4.0
    Mean   38.83    3.33
    SD     13.60    0.84
Cov(x, y) = 10.68
r = Cov(x, y) / (SD_x × SD_y) ≈ 0.937 (from unrounded values), so r² ≈ 0.8775
SE(r) = sqrt((1 – 0.8775) / 16) ≈ 0.0875
t-statistic = 0.937 / 0.0875 ≈ 10.7
Critical t-value with n – 2 = 16 df and alpha = 5% (two-sided) is 2.12
Conclusion: There is a significant association between age and cholesterol.
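The correlation analysis can be reproduced step by step in R. This is a minimal sketch using the slide data; names such as `se.r`, `t.stat` and `ci.r` are illustrative. The built-in `cor.test()` is called at the end as a cross-check, since it reports the same t-statistic and a Fisher-z based confidence interval.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)

# Correlation as standardized covariance
r <- cov(age, chol) / (sd(age) * sd(chol))

# t-test of H0: no correlation, with n - 2 degrees of freedom
se.r    <- sqrt((1 - r^2) / (n - 2))
t.stat  <- r / se.r
p.value <- 2 * pt(-abs(t.stat), df = n - 2)

# Fisher's z-transformation and 95% CI, back-transformed to the r scale
z    <- 0.5 * log((1 + r) / (1 - r))
se.z <- 1 / sqrt(n - 3)
ci.z <- z + c(-1, 1) * qnorm(0.975) * se.z
ci.r <- tanh(ci.z)

print(c(r = r, t = t.stat, p = p.value))
print(ci.r)

# Cross-check against the built-in test
ct <- cor.test(age, chol)
print(ct)
```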
Simple linear regression analysis
Assessment
–Quantify the relationship between two variables
Prediction
–Make prediction and validate a test
Control
–Adjusting for confounding effect (in the case of multiple variables)
In simple linear regression, only two variables are of interest: one response variable and one predictor variable.
No adjustment is needed for confounding or covariates.
Relationship between age and cholesterol
Linear regression: model
Y: random variable representing a response
X: random variable representing a predictor variable (predictor, risk factor)
–Both Y and X can be a categorical variable (e.g., yes / no) or a continuous variable (e.g., age).
–If Y is categorical, the model is a logistic regression model; if Y is continuous, a simple linear regression model.
Model: $Y = \alpha + \beta X + \varepsilon$
$\alpha$: intercept
$\beta$: slope / gradient
$\varepsilon$: random error (variation between subjects in y even if x is constant, e.g., variation in cholesterol for patients of the same age)
Linear regression: assumptions
The relationship is linear in terms of the parameters;
X is measured without error;
The values of Y are independent of each other (e.g., $Y_1$ is not correlated with $Y_2$);
The random error term ($\varepsilon$) is normally distributed with mean 0 and constant variance.
Expected value and variance
If the assumptions are tenable, then:
The expected value of Y is: $E(Y \mid x) = \alpha + \beta x$
The variance of Y (at a given x) is: $\mathrm{var}(Y) = \mathrm{var}(\varepsilon) = \sigma^2$
Estimation of model parameters
Given two points $A(x_1, y_1)$ and $B(x_2, y_2)$ in a two-dimensional space, we can derive an equation connecting the points.
Gradient: $m = \frac{dy}{dx} = \frac{y_2 - y_1}{x_2 - x_1}$
Equation: y = mx + a, where a is the intercept (the value of y at x = 0)
What happens if we have more than 2 points?
Estimation of $\alpha$ and $\beta$
For a series of pairs: $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$
Let a and b be sample estimates for parameters $\alpha$ and $\beta$.
We have a sample equation: $Y^* = a + bx$
Aim: find the values of a and b so that the discrepancy $(Y - Y^*)$ is minimal.
Let $SSE = \sum_{i=1}^{n}(Y_i - a - b x_i)^2$.
Values of a and b that minimise SSE are called least squares estimates.
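Criteria of estimation
[Figure: Chol plotted against Age with the fitted line and an observed value $y_i$]
The goal of the least squares estimator (LSE) is to find a and b such that the sum of the squared vertical distances $d^2$ between the observed values and the line is minimal.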
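Estimation of $\alpha$ and $\beta$
After some calculus operations, the results can be shown to be:
$b = \frac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$
Where: $S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$ and $S_{xy} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
When the regression assumptions are valid, the estimators of $\alpha$ and $\beta$ have the following properties:
–Unbiased
–Uniformly minimal variance (i.e. efficient)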
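The closed-form least squares estimates can be computed directly and compared with `lm()`. A minimal sketch using the age and cholesterol data; `Sxx`, `Sxy`, `a` and `b` are illustrative names for the quantities in the standard formulas b = Sxy / Sxx and a = mean(y) – b × mean(x).

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Sums of squares and cross-products about the means
Sxx <- sum((age - mean(age))^2)
Sxy <- sum((age - mean(age)) * (chol - mean(chol)))

# Least squares estimates of slope and intercept
b <- Sxy / Sxx
a <- mean(chol) - b * mean(age)
print(c(intercept = a, slope = b))

# Same estimates from R's model-fitting function
fit <- lm(chol ~ age)
print(coef(fit))
```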
Goodness-of-fit
Now, we have the equation Y = a + bX + e
Question: how well does the regression equation describe the actual data?
Answer: the coefficient of determination ($R^2$): the proportion of variation in Y that is explained by the variation in X.
Partitioning of variations: concept
SST = sum of squared differences between $y_i$ and the mean of y.
SSR = sum of squared differences between the predicted value of y and the mean of y.
SSE = sum of squared differences between the observed and predicted values of y.
SST = SSR + SSE
The coefficient of determination is: $R^2 = SSR / SST$
Partitioning of variations: geometry
[Figure: Chol (Y) plotted against Age (X), showing the mean of Y and the SSR, SSE and SST components]
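Partitioning of variations: algebra
Some statistics:
Total variation: $SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$
Attributed to the model: $SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$
Residual sum of squares: $SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
SST = SSR + SSE
SSR = SST – SSE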
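The partition of variation can be verified on the age and cholesterol data. This sketch assumes the model is fitted with `lm()`; `y.hat` and `R2` are illustrative names.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

fit   <- lm(chol ~ age)
y.hat <- fitted(fit)

SST <- sum((chol - mean(chol))^2)   # total variation
SSR <- sum((y.hat - mean(chol))^2)  # variation attributed to the model
SSE <- sum((chol - y.hat)^2)        # residual variation

# Coefficient of determination
R2 <- SSR / SST

print(c(SST = SST, SSR = SSR, SSE = SSE, SSR.plus.SSE = SSR + SSE))
print(c(R2 = R2, R2.from.lm = summary(fit)$r.squared))
```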
Analysis of variance
SS increases in proportion to sample size (n)
Mean squares (MS): normalise for degrees of freedom (df)
–MSR = SSR / p (where p = number of predictors, i.e. the regression degrees of freedom; p = 1 in simple linear regression)
–MSE = SSE / (n – p – 1)
–MST = SST / (n – 1)
Analysis of variance (ANOVA) table:
    Source      d.f.       Sum of squares (SS)  Mean squares (MS)  F-test
    Regression  p          SSR                  MSR                MSR/MSE
    Residual    n – p – 1  SSE                  MSE
    Total       n – 1      SST
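The ANOVA table entries follow directly from the sums of squares. A short sketch on the age and cholesterol data, with p = 1 predictor; `F.stat` and `p.value` are illustrative names, and `anova()` is used only as a cross-check.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)
p <- 1  # one predictor in simple linear regression

fit <- lm(chol ~ age)
SSR <- sum((fitted(fit) - mean(chol))^2)
SSE <- sum(residuals(fit)^2)

# Mean squares and F-test
MSR    <- SSR / p
MSE    <- SSE / (n - p - 1)
F.stat <- MSR / MSE
p.value <- pf(F.stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

print(c(MSR = MSR, MSE = MSE, F = F.stat, p = p.value))

# Cross-check with R's ANOVA table
aov.tab <- anova(fit)
print(aov.tab)
```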
Hypothesis tests in regression analysis
Now, we have
Sample data: Y = a + bX + e
Population: $Y = \alpha + \beta X + \varepsilon$
$H_0: \beta = 0$. There is no linear association between the outcome and predictor variable.
In layman's language: "if there were truly no association, what is the chance of observing a sample at least as inconsistent with the null hypothesis as the one we observed?"
Inference about slope (parameter $\beta$)
Recall that $\varepsilon$ is assumed to be normally distributed with mean 0 and variance $\sigma^2$.
The estimate of $\sigma^2$ is MSE (or $s^2$).
It can be shown that
–The expected value of b is $\beta$, i.e. $E(b) = \beta$
–The standard error of b is: $SE(b) = \frac{s}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$
Then the test of whether $\beta = 0$ is: t = b / SE(b), which follows a t-distribution with n – 2 degrees of freedom.
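The slope test can be worked through by hand and compared with the coefficient table from `summary()`. A minimal sketch on the age and cholesterol data; `s`, `se.b` and `t.b` are illustrative names, and the degrees of freedom are n – 2 because two parameters are estimated.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)

fit <- lm(chol ~ age)
b   <- coef(fit)[["age"]]

# Residual standard deviation s = sqrt(MSE)
s <- sqrt(sum(residuals(fit)^2) / (n - 2))

# Standard error of the slope and the t-test of H0: beta = 0
Sxx  <- sum((age - mean(age))^2)
se.b <- s / sqrt(Sxx)
t.b  <- b / se.b
p.b  <- 2 * pt(-abs(t.b), df = n - 2)

print(c(b = b, se = se.b, t = t.b, p = p.b))

# The same row from R's coefficient table
coef.tab <- summary(fit)$coefficients
print(coef.tab["age", ])
```

With a single predictor, the square of this t-statistic equals the F-statistic of the ANOVA table.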
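Confidence interval around predicted value
Observed value is $Y_i$. Predicted value is $\hat{Y}_i = a + b x_i$.
The standard error of the predicted (mean) value at $x_i$ is: $SE(\hat{Y}_i) = s\sqrt{\frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}}$, where $s^2$ is the MSE. For predicting a new individual observation, add 1 under the square root.
Interval estimation for $Y_i$ values: $\hat{Y}_i \pm t_{n-2,\,0.975} \times SE(\hat{Y}_i)$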
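Interval estimates around predicted values can be obtained with `predict()`. The sketch below computes the standard error of the fitted mean by hand, using the standard textbook formula, and compares it with R's result; the ages 30, 40 and 50 are hypothetical values chosen for illustration, as are the names `new.age`, `se.mean`, `conf.int` and `pred.int`.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)

fit <- lm(chol ~ age)
s   <- summary(fit)$sigma            # residual standard error
Sxx <- sum((age - mean(age))^2)

# Hypothetical ages at which to predict cholesterol
new.age <- c(30, 40, 50)
y.hat   <- coef(fit)[["(Intercept)"]] + coef(fit)[["age"]] * new.age

# Standard error of the fitted mean, and 95% confidence limits
se.mean <- s * sqrt(1 / n + (new.age - mean(age))^2 / Sxx)
t.crit  <- qt(0.975, df = n - 2)
manual  <- cbind(fit = y.hat,
                 lwr = y.hat - t.crit * se.mean,
                 upr = y.hat + t.crit * se.mean)
print(manual)

# Built-in intervals: for the mean response and for a new individual
new      <- data.frame(age = new.age)
conf.int <- predict(fit, new, interval = "confidence")
pred.int <- predict(fit, new, interval = "prediction")
print(conf.int)
print(pred.int)
```

The prediction interval is wider than the confidence interval because it also includes the variation of an individual around the fitted mean.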
Checking assumptions Assumption of constant variance Assumption of normality Correctness of functional form Model stability All can be conducted with graphical analysis. The residuals from the model or a function of the residuals play an important role in all of the model diagnostic procedures.
Checking assumptions
Assumption of constant variance
–Plot the studentized residuals versus their predicted values. Examine whether the variability of the residuals remains relatively constant across the range of fitted values.
Assumption of normality
–Plot the residuals versus their expected values under normality (normal probability plot). If the residuals are normally distributed, the points should fall along a 45° line.
Correct functional form?
–Plot the residuals versus fitted values. Examine the residual plot for evidence of a non-linear trend in the value of the residuals across the range of fitted values.
Model stability
–Check whether one or more observations are influential. Use Cook's distance.
Checking assumptions (Cont)
Cook's distance (D) is a measure of the magnitude by which the fitted values of the regression model change if the ith observation is removed from the data set.
Leverage is a measure of how extreme the value of $x_i$ is relative to the remaining values of x.
The Studentized residual provides a measure of how extreme the value of $y_i$ is relative to the remaining values of y.
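These three diagnostics are available in base R as `cooks.distance()`, `hatvalues()` and `rstudent()`. The sketch below applies them to the age and cholesterol model and also recomputes leverage and Cook's distance from their standard formulas. The 4/n cut-off is a common rule of thumb and an assumption here, not something stated in the slides.

```r
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)

fit <- lm(chol ~ age)

# Built-in influence measures
D     <- cooks.distance(fit)  # change in fitted values when observation i is dropped
h     <- hatvalues(fit)       # leverage: how extreme x_i is
r.stu <- rstudent(fit)        # studentized residual: how extreme y_i is

# Leverage by hand for simple linear regression
h.manual <- 1 / n + (age - mean(age))^2 / sum((age - mean(age))^2)

# Cook's distance by hand, from standardized residuals and leverage
# (2 = number of estimated coefficients: intercept and slope)
D.manual <- rstandard(fit)^2 / 2 * h / (1 - h)

print(round(data.frame(age, chol, leverage = h, rstudent = r.stu, cooks.D = D), 3))

# Observations flagged by the 4/n rule of thumb (an assumed cut-off)
flagged <- which(D > 4 / n)
print(flagged)
```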
Remedial measures
Non-constant variance
–Transforming the response variable (y) to a new scale (e.g. logarithm) is often helpful.
–If no transformation can resolve the non-constant variance problem, use a more robust estimator such as iteratively reweighted least squares.
Non-normality
–Non-normality and non-constant variance often go hand in hand.
Outliers
–Check for accuracy
–Use a robust estimator
Regression analysis using R
    id <- 1:18
    age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
    chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
    # Fit linear regression model
    reg <- lm(chol ~ age)
    summary(reg)
ANOVA result
    > anova(reg)
    Analysis of Variance Table

    Response: chol
              Df  Sum Sq Mean Sq F value    Pr(>F)
    age        1 10.4944 10.4944  114.57 1.058e-08 ***
    Residuals 16  1.4656  0.0916
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Results of R analysis
    > summary(reg)

    Call:
    lm(formula = chol ~ age)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.40729 -0.24133 -0.04522  0.17939  0.63040

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 1.089218   0.221466   4.918 0.000154 ***
    age         0.057788   0.005399  10.704 1.06e-08 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.3027 on 16 degrees of freedom
    Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698
    F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08
Diagnostics: influential data
    # Show the four standard diagnostic plots in a 2 x 2 grid
    par(mfrow=c(2,2))
    plot(reg)
A non-linear illustration: BMI and sexual attractiveness
–Study on 44 university students
–Measure body mass index (BMI)
–Sexual attractiveness (SA) score
    id <- 1:44
    bmi <- c(11.00, 12.00, 12.50, 14.00, 14.00, 14.00, 14.00, 14.00, 14.00, 14.80,
             15.00, 15.00, 15.50, 16.00, 16.50, 17.00, 17.00, 18.00, 18.00, 19.00,
             19.00, 20.00, 20.00, 20.00, 20.50, 22.00, 23.00, 23.00, 24.00, 24.50,
             25.00, 25.00, 26.00, 26.00, 26.50, 28.00, 29.00, 31.00, 32.00, 33.00,
             34.00, 35.50, 36.00, 36.00)
    sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5,
            3.2, 3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3,
            6.5, 4.9, 5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7,
            3.5, 4.0, 3.7, 3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1,
            2.1, 2.0, 1.8, 1.7)
Linear regression analysis of BMI and SA
    reg <- lm(sa ~ bmi)
    summary(reg)

    Residuals:
         Min       1Q   Median       3Q      Max
    -2.54204 -0.97584  0.05082  1.16160  2.70856

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  4.92512    0.64489   7.637 1.81e-09 ***
    bmi         -0.05967    0.02862  -2.084   0.0432 *
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.354 on 42 degrees of freedom
    Multiple R-Squared: 0.09376, Adjusted R-squared: 0.07218
    F-statistic: 4.345 on 1 and 42 DF, p-value: 0.04323
BMI and SA: analysis of residuals
    plot(reg)
BMI and SA: a simple plot
    par(mfrow=c(1,1))
    reg <- lm(sa ~ bmi)
    # Scatter plot with the fitted straight line
    plot(sa ~ bmi, pch=16)
    abline(reg)
Re-analysis of sexual attractiveness data
    # Fit 3 regression models
    linear <- lm(sa ~ bmi)
    quad <- lm(sa ~ poly(bmi, 2))
    cubic <- lm(sa ~ poly(bmi, 3))
    # Make new BMI axis (variable name missing in the source; bmi.new is an assumed name)
    bmi.new <- 10:40
    # Get predicted values
    quad.pred <- predict(quad, data.frame(bmi=bmi.new))
    cubic.pred <- predict(cubic, data.frame(bmi=bmi.new))
    # Plot predicted values on the existing scatter plot of sa against bmi
    abline(reg)
    lines(bmi.new, quad.pred, col="blue", lwd=3)
    lines(bmi.new, cubic.pred, col="red", lwd=3)
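One way to judge whether the curved fits improve on the straight line is to compare the nested models. The sketch below is an addition, not part of the original analysis: it compares R² values and runs sequential F-tests with `anova()`; the name `r2` is illustrative.

```r
bmi <- c(11.00, 12.00, 12.50, 14.00, 14.00, 14.00, 14.00, 14.00, 14.00, 14.80,
         15.00, 15.00, 15.50, 16.00, 16.50, 17.00, 17.00, 18.00, 18.00, 19.00,
         19.00, 20.00, 20.00, 20.00, 20.50, 22.00, 23.00, 23.00, 24.00, 24.50,
         25.00, 25.00, 26.00, 26.00, 26.50, 28.00, 29.00, 31.00, 32.00, 33.00,
         34.00, 35.50, 36.00, 36.00)
sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5,
        3.2, 3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3,
        6.5, 4.9, 5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7,
        3.5, 4.0, 3.7, 3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1,
        2.1, 2.0, 1.8, 1.7)

# The three nested polynomial models from the slide
linear <- lm(sa ~ bmi)
quad   <- lm(sa ~ poly(bmi, 2))
cubic  <- lm(sa ~ poly(bmi, 3))

# Proportion of variation explained by each model
r2 <- c(linear = summary(linear)$r.squared,
        quad   = summary(quad)$r.squared,
        cubic  = summary(cubic)$r.squared)
print(round(r2, 4))

# Sequential F-tests: does each extra polynomial term reduce SSE significantly?
cmp <- anova(linear, quad, cubic)
print(cmp)
```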
Some comments: Interpretation of correlation
Correlation lies between –1 and +1.
A very small correlation does not mean that there is no association between the two variables. The relationship may be non-linear.
For a curvilinear (but monotonic) relationship, a rank correlation is better than Pearson's correlation.
A small correlation (e.g. 0.1) may be statistically significant, but clinically unimportant.
$R^2$ is another measure of strength of association. An r = 0.7 may sound impressive, but $R^2$ is 0.49!
Correlation does not mean causation.
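The difference between Pearson's and a rank correlation shows up clearly on a curved but monotonic relationship. The data below are a hypothetical toy example generated for illustration only (they are not from the slides): y grows exponentially with x.

```r
# Hypothetical toy data: a monotonic but strongly curved relationship
x <- 1:20
y <- exp(x / 4)

# Pearson's r measures linear association only
r.pearson <- cor(x, y, method = "pearson")

# Spearman's rank correlation depends only on the ordering of the values
r.spearman <- cor(x, y, method = "spearman")

print(c(pearson = r.pearson, spearman = r.spearman))
```

Because y always increases with x, the ranks agree perfectly and the Spearman correlation is 1, while Pearson's r falls below 1 since the points do not lie on a straight line.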
Some comments: Interpretation of correlation
Be careful with multiple correlations. For p variables, there are p(p – 1)/2 possible pairs of correlations, and false positives are a problem.
The correlation between two variables cannot be inferred from their correlations with a third variable.
–r(age, weight) = 0.05; r(weight, fat) = 0.03; it does not mean that r(age, fat) is near zero.
–In fact, r(age, fat) = 0.79.
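The scale of the multiple-correlation problem can be sketched with a little arithmetic. The example assumes p = 10 variables and, as a simplification, treats the pairwise tests as independent, which is not strictly true for correlations computed on the same data; `n.pairs` and `fwer` are illustrative names.

```r
# Number of variables (an assumed value for illustration)
p <- 10

# Number of distinct pairs: p(p - 1)/2
n.pairs <- choose(p, 2)

# Chance of at least one false positive at alpha = 0.05,
# under the simplifying assumption that the tests are independent
alpha <- 0.05
fwer  <- 1 - (1 - alpha)^n.pairs

# Bonferroni-adjusted significance threshold for each individual test
alpha.bonf <- alpha / n.pairs

print(c(pairs = n.pairs, fwer = fwer, alpha.bonferroni = alpha.bonf))
```

Base R's `p.adjust()` applies this and other corrections directly to a vector of p-values.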
Some comments: Interpretation of regression
The fitted line (regression) is only an estimate of the relation between these variables in the population.
There is uncertainty associated with the estimated parameters.
The regression line should not be used to make predictions for x values outside the range of values in the observed data.
A statistical model is an approximation; the "true" relation may be nonlinear, but a linear model is often a reasonable approximation.
Some comments: Reporting results
Results should be reported in sufficient detail: nature of the response variable and predictor variable; any transformation; checking of assumptions, etc.
Regression coefficients (a, b), their associated standard errors, and $R^2$ are a useful summary.
Some final comments Equations are the cornerstone on which the edifice of science rests. Equations are like poems, or even an onion. So, be careful with your building of equations!