Correlation and Linear Regression Peter T. Donnan Professor of Epidemiology and Biostatistics Statistics for Health Research.

1 Correlation and Linear Regression. Peter T. Donnan, Professor of Epidemiology and Biostatistics. Statistics for Health Research.

2 CONTENTS Correlation coefficients: meaning, values, role, significance. Regression: line of best fit, prediction, significance.

3 INTRODUCTION Correlation: the strength of the linear relationship between two variables. Regression analysis: determines the nature of the relationship. Example: is there a relationship between the number of units of alcohol consumed and the likelihood of developing cirrhosis of the liver?

4 PEARSON’S COEFFICIENT OF CORRELATION (r) Measures the strength of the linear relationship between one dependent and one independent variable; curvilinear relationships need other techniques. Values lie between +1 and -1: perfect positive correlation, r = +1; perfect negative correlation, r = -1; no linear relationship, r = 0.
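As an illustration of the three reference values on this slide, here is a minimal Python sketch of Pearson's r computed from its textbook definition. The function name `pearson_r` and the small data sets are made up for the example; they are not taken from the presentation.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                      # hypothetical data
print(pearson_r(x, [2, 4, 6, 8, 10]))    # perfect positive linear relationship
print(pearson_r(x, [10, 8, 6, 4, 2]))    # perfect negative linear relationship
print(pearson_r(x, [4, 1, 0, 1, 4]))     # U-shaped relationship: no linear trend
```

The last data set is perfectly related to x by a curve, yet r is zero, which is why curvilinear relationships need other techniques.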

5 PEARSON’S COEFFICIENT OF CORRELATION [Figure: example plots for r = +1, r = -1, r = 0.6 and r = 0]

6 SCATTER PLOT [Figure: scatter plot of BMD against calcium intake. BMD is the dependent variable, about which inferences are made; calcium intake is the independent variable.]

7 NON-NORMAL DATA

8 NORMALISED

9 SPSS OUTPUT: SCATTER PLOT

10 SPSS OUTPUT: CORRELATIONS

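11 Interpreting correlation. A large r does not necessarily imply: (1) a strong correlation, since the statistical significance of r, not r itself, increases with sample size, so r should be read alongside the number of observations; (2) cause and effect. For example, there is a strong correlation between the number of televisions sold and the number of cases of paranoid schizophrenia, but this does not show that watching TV causes paranoid schizophrenia; the association may be due to an indirect relationship.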

12 Interpreting correlation. Variation in the dependent variable is due to: the relationship with the independent variable, $r^2$; random factors, $1 - r^2$. $r^2$ is the Coefficient of Determination, or variation explained. E.g. $r = 0.661$ gives $r^2 = 0.661^2 = 0.44$, so less than half of the variation (44%) in the dependent variable is due to the independent variable.
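The arithmetic in this example can be checked in a couple of lines. This is only a sketch of the calculation; the variable names are chosen for the illustration.

```python
r = 0.661                        # correlation coefficient from the slide
r_squared = r ** 2               # coefficient of determination
unexplained = 1 - r_squared      # share attributed to random factors

print(f"variation explained: {r_squared:.0%}")
print(f"variation unexplained: {unexplained:.0%}")
```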

14 Agreement. Correlation should never be used to determine the level of agreement between repeated measures (measuring devices, users, techniques). It measures only the degree of linear relationship: you can have high correlation with poor agreement.
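A small sketch of how high correlation can coexist with poor agreement. The two "devices" and their readings are hypothetical: device B is assumed to read exactly 10 units higher than device A on every subject.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical repeated measurements of the same five subjects
device_a = [50, 60, 70, 80, 90]
device_b = [a + 10 for a in device_a]   # systematic bias of +10 units

r = pearson_r(device_a, device_b)
mean_difference = sum(b - a for a, b in zip(device_a, device_b)) / len(device_a)

print(f"r = {r:.3f}")                               # perfect linear relationship
print(f"mean difference = {mean_difference:.1f}")   # yet the devices never agree
```

Looking at the differences between paired readings, rather than at r, is what exposes the disagreement here.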

15 Non-parametric correlation. Makes no distributional assumptions; carried out on ranks. Spearman’s ρ: easy to calculate. Kendall’s τ: has some advantages over ρ (its distribution has better statistical properties; easier to identify concordant / discordant pairs). Usually both lead to the same conclusions.
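To make the rank-based idea concrete, the following sketch computes Spearman's ρ from rank differences and Kendall's τ from concordant and discordant pairs. It assumes there are no tied values (ties need the corrected formulas), and the function names and data are invented for the example.

```python
from itertools import combinations

def ranks(values):
    """Rank values from 1 (smallest) upwards; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n(n^2 - 1)); valid without ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) / number of pairs; no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

x = [1, 2, 3, 4, 5]            # hypothetical data
y = [2, 1, 4, 3, 5]
print(spearman_rho(x, y), kendall_tau(x, y))
```

The two coefficients are on different scales, so their values differ even when, as the slide notes, they usually lead to the same conclusion.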

16 Role of regression. Shows how one variable changes with another, by determining the line of best fit: linear or curvilinear.

17 Line of best fit. Simplest case: linear. Line of best fit between the dependent variable Y (BMD) and the independent variable X (dietary intake of calcium): Y = a + bX, where a is the value of Y when X = 0 and b is the change in Y when X increases by 1.

18 Role of regression. Used to predict the value of the dependent variable when the value of the independent variable(s) is known, within the range of the known data; extrapolation is risky (e.g. the relation between age and bone age). Does not imply causality.

19 SPSS OUTPUT: REGRESSION

20 Multiple regression. More than one independent variable, e.g. BMD dependent on: age, gender, calorific intake, use of bisphosphonates, exercise, etc.

21 Summary. Correlation: strength of linear relationship between two variables; Pearson’s (parametric); Spearman’s / Kendall’s (non-parametric); interpret with care! Regression: line of best fit; prediction; multiple regression; logistic regression.

22 Regression: Checking the Model. Peter T. Donnan, Professor of Epidemiology and Biostatistics. Statistics for Health Research.

23 Objectives of session: recognise the need to check fit of the model; carry out checks of assumptions in SPSS for simple linear regression; understand predictive model; understand residuals.

24 How is the fitted line obtained? Use the method of least squares (LS): seek to minimise the sum of squared vertical differences between each point and the fitted line. This results in parameter estimates, or regression coefficients, of slope (b) and intercept (a): y = a + bx.
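A minimal sketch of the least-squares calculation for a single explanatory variable, using the standard closed-form estimates. The function names and the five data points are hypothetical. The last two lines check that nudging either coefficient away from the estimate increases the sum of squared vertical differences.

```python
def least_squares(x, y):
    """Return (a, b): the intercept and slope that minimise the sum of
    squared vertical differences between the points and the line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept: line passes through the means
    return a, b

def sum_squared_errors(x, y, a, b):
    """Sum of squared vertical differences for the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 4, 5]                  # hypothetical data
y = [2.1, 3.9, 6.2, 7.8, 10.0]
a, b = least_squares(x, y)
best = sum_squared_errors(x, y, a, b)

print(f"a = {a:.2f}, b = {b:.2f}, SSE = {best:.3f}")
print(sum_squared_errors(x, y, a + 0.1, b) > best)   # shifted intercept fits worse
print(sum_squared_errors(x, y, a, b + 0.1) > best)   # shifted slope fits worse
```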

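25 Consider the fitted line y = a + bx [Figure: fitted line with the explanatory variable (x) on the horizontal axis and the dependent variable (y) on the vertical axis; the intercept a is marked.]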

26 Consider the regression of minimum LDL cholesterol achieved on age. Select Regression, then Linear…, with Dependent (y): Min LDL achieved and Independent (x): Age_Base.

27 Output from SPSS linear regression

Coefficients(a)
Model 1            B (Unstandardized)   Std. Error   Beta (Standardized)   t        Sig.
(Constant)         2.024                .105                               19.340   .000
Age at baseline    -.008                .002         -.121                 -4.546   .000
a. Dependent Variable: Min LDL achieved

N.B. 0.008 may look very small, but it represents the DECREASE in LDL achieved for each increase of one unit of age, i.e. ONE year.

28 H0: slope b = 0. Test t = slope/se = -4.546 with p<0.001, so statistically significant (SPSS computes t from the unrounded estimates; the rounded values -0.008/0.002 give only about -4). Predicted LDL = 2.024 - 0.008 × Age. Output from SPSS linear regression: same Coefficients table as on slide 27.

29 Prediction equation from linear regression: predicted LDL achieved = 2.024 - 0.008 × Age. So for a man aged 65 the predicted LDL achieved = 2.024 - 0.008 × 65 = 1.504.

Age   Predicted Min LDL
45    1.664
55    1.584
65    1.504
75    1.424
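The prediction table can be reproduced directly from the fitted equation. In this sketch the coefficients are the rounded values from the SPSS output, and the function name `predicted_min_ldl` is invented for the example.

```python
INTERCEPT = 2.024   # (Constant) from the SPSS coefficients table
SLOPE = -0.008      # Age at baseline coefficient (change in LDL per year)

def predicted_min_ldl(age):
    """Predicted minimum LDL achieved for a given age at baseline."""
    return INTERCEPT + SLOPE * age

for age in (45, 55, 65, 75):
    print(age, round(predicted_min_ldl(age), 3))
```

As with any regression prediction, this is only meaningful for ages within the range of the data used to fit the model.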

30 Assumptions of Regression 1. Relationship is linear 2. Outcome variable and hence residuals or error terms are approx. Normally distributed

31 Use Graphs and Scatterplot to obtain the Lowess line of fit

32 1. Create Scatterplot and then double-click to enter chart editor. 2. Choose icon ‘Add fit line at total’. 3. Then select type of fit such as Lowess.

33 Linear assumption: fitted Lowess smoothed line. The Lowess smoothed line (red) gives a good eyeball check of the linear assumption (green).

34 Definition of a residual. A residual is the difference between the actual value and the predicted value (fitted line), i.e. the unexplained variation: $r_i = y_i - E(y_i)$, or $r_i = y_i - (a + b x_i)$.
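A short sketch of the definition in code: fit a line by least squares, then subtract each fitted value from the observed value. The data are hypothetical. The final line illustrates a general property of least-squares fits with an intercept, namely that the residuals sum to zero apart from floating-point rounding.

```python
x = [1, 2, 3, 4, 5]                  # hypothetical data
y = [2.1, 3.9, 6.2, 7.8, 10.0]

# Least-squares estimates of slope b and intercept a
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

# Residual = actual value minus predicted value on the fitted line
fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print([round(r, 2) for r in residuals])
print(abs(sum(residuals)) < 1e-9)    # residuals sum to (numerically) zero
```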

35 Residuals

36 To assess the residuals in SPSS linear regression, select Plots…: plot the normalised (standardised) residual against the normalised (standardised) predicted value of LDL, and select the histogram of residuals and the normal probability plot.

37 In SPSS linear regression, select Statistics…: select confidence intervals for regression coefficients; Model fit; Durbin-Watson for serial correlation; and casewise diagnostics for identification of outliers.

38 Output: Scatterplot of residuals vs. predicted. Note: 1) mean of residuals = 0; 2) most of the data lie within ±3 SDs of the mean.

39 Assumptions of Regression 1. Relationship is linear 2. Outcome variable and hence residuals or error terms are approx. Normally distributed

40 Output: Histogram of standardised residuals. Plot of residuals with normal curve superimposed.

41 Output: Cumulative probability plot. Look for deviation from the diagonal line to indicate non-normality.

42 Output: Description of residuals. Descriptive statistics for residuals. Subjects with standardised residuals > 3: worth investigation?

Casewise Diagnostics(a)
Case Number   Std. Residual   Min LDL   Predicted   Residual
164           5.660           5.5840    1.518153    4.0658471
209           4.395           4.5260    1.368685    3.1573148
250           3.143           3.7875    1.529325    2.2581750
268           3.064           3.8730    1.671664    2.2013357
274           3.227           4.0953    1.777153    2.3180975
362           4.095           4.5350    1.593460    2.9415398
517           3.636           4.3240    1.711788    2.6122125
849           3.968           4.3290    1.478113    2.8508873
1047          4.207           4.4360    1.413686    3.0223141
1075          3.885           4.4040    1.613219    2.7907805
1103          3.519           3.9905    1.462584    2.5279157
1229          3.016           3.7660    1.599254    2.1667456
1290          3.975           4.2345    1.379107    2.8553933
a. Dependent Variable: Min LDL achieved
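The Std. Residual column appears to be each raw residual divided by the standard error of the estimate reported in the Model Summary on slide 43 (0.7184048); the ratios reproduce the tabulated values to three decimals. The sketch below checks this for three of the listed cases and applies the "> 3" rule. The function name and the choice of cases are made up for the example.

```python
STD_ERROR_OF_ESTIMATE = 0.7184048   # from the Model Summary on slide 43

# (case number, raw residual) pairs copied from the Casewise Diagnostics table
cases = [(164, 4.0658471), (209, 3.1573148), (1229, 2.1667456)]

def standardised_residual(residual, std_error=STD_ERROR_OF_ESTIMATE):
    """Raw residual expressed in units of the standard error of the estimate."""
    return residual / std_error

for case, residual in cases:
    z = standardised_residual(residual)
    flag = "investigate" if abs(z) > 3 else "ok"
    print(case, round(z, 3), flag)
```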

43 Output: Model fit and serial correlation

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .121(a)   .015       .014                .7184048                     2.034
a. Predictors: (Constant), Age at baseline

R: correlation between min LDL achieved and age at baseline, here 0.121. R Square: % variation explained, here 1.5%, not particularly high. Durbin-Watson test: serial correlation of residuals; the statistic should be approximately 2 if there is no serial correlation.
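For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below implements that standard formula; the short residual sequences are hypothetical and chosen only to show values below and above 2.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: values near 2 suggest no serial correlation,
    values towards 0 positive and towards 4 negative serial correlation."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

drifting = [2, 1, -1, -2]        # neighbours similar: positive serial correlation
alternating = [1, -1, 1, -1]     # neighbours opposite: negative serial correlation

print(durbin_watson(drifting))      # below 2
print(durbin_watson(alternating))   # above 2
```

The value of 2.034 in the SPSS output is close to 2, which is consistent with little serial correlation in the residuals.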

44 Summary. After fitting any regression model, check assumptions: functional form (linearity is the default, often not the best fit; consider quadratic…); check residuals for approx. normality; check residuals for outliers (> 3 SDs). All accomplished within SPSS.

45 Practical on Model Checking. Read in ‘LDL Data.sav’. 1) Fit an age squared term in the min LDL model and check the fit of the model compared to the linear fit (Hint: use Transform/Compute to create the age squared term and fit age and age squared). 2) Fit separate linear regressions of min Chol achieved with predictors of a) baseline Chol, b) APOE_lin, c) adherence. Check assumptions and interpret results.

