Published by Trevor Carson. Modified over 8 years ago.
1
Correlation and Linear Regression
Peter T. Donnan
Professor of Epidemiology and Biostatistics
Statistics for Health Research
2
CONTENTS
Correlation
  - coefficients: meaning, values, role, significance
Regression
  - line of best fit
  - prediction
  - significance
3
INTRODUCTION
Correlation
  - measures the strength of the linear relationship between two variables
Regression analysis
  - determines the nature of the relationship
Example: is there a relationship between the number of units of alcohol consumed and the likelihood of developing cirrhosis of the liver?
4
PEARSON’S COEFFICIENT OF CORRELATION (r)
Measures the strength of the linear relationship between one dependent and one independent variable
  - curvilinear relationships need other techniques
Values lie between +1 and -1
  - perfect positive correlation: r = +1
  - perfect negative correlation: r = -1
  - no linear relationship: r = 0
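As a rough illustration of the values listed above, the following sketch computes Pearson's r from its usual definition (sum of cross-products divided by the square root of the product of the sums of squares). The function name `pearson_r` and all data are invented for illustration and are not taken from the slides.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # points on a rising straight line
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # points on a falling straight line
# A symmetric curve (y = x squared): a clear relationship, but not a linear one
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_pos, r_neg, r_curve)
```

The third case shows why curvilinear relationships need other techniques: y depends on x exactly, yet r comes out at zero.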
5
PEARSON’S COEFFICIENT OF CORRELATION
[Figure: four scatter plots illustrating r = +1, r = -1, r = 0.6 and r = 0]
6
SCATTER PLOT
[Figure: scatter plot of BMD (dependent variable, the one we make inferences about) against calcium intake (independent variable)]
7
NON-NORMAL DATA
8
NORMALISED
9
SPSS OUTPUT: SCATTER PLOT
10
SPSS OUTPUT: CORRELATIONS
11
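Interpreting correlation
A large or statistically significant r does not necessarily imply:
  - strong correlation
      - the statistical significance of r (not its size) increases with sample size, so a weak correlation can be highly significant in a large sample
  - cause and effect
      - a strong correlation between the number of televisions sold and the number of cases of paranoid schizophrenia does not mean that watching TV causes paranoid schizophrenia
      - it may be due to an indirect relationship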
12
Interpreting correlation
Variation in the dependent variable is due to:
  - relationship with the independent variable: r²
  - random factors: 1 - r²
r² is the Coefficient of Determination, or variation explained
  - e.g. r = 0.661, so r² = 0.661² = 0.44
  - less than half of the variation (44%) in the dependent variable is due to the independent variable
13
14
Agreement
Correlation should never be used to determine the level of agreement between repeated measures from different:
  - measuring devices
  - users
  - techniques
It measures the degree of linear relationship
You can have high correlation with poor agreement
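A small sketch of why correlation is not agreement: two hypothetical devices where the second always reads 10 units higher than the first. The device names and readings are invented for illustration; the point is only that r is perfect while the readings never agree.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical readings on the same five subjects (invented data)
device_a = [50, 60, 70, 80, 90]
device_b = [a + 10 for a in device_a]  # device B is systematically 10 units higher

r = pearson_r(device_a, device_b)
# Mean difference between the devices: a simple measure of (dis)agreement
mean_difference = sum(b - a for a, b in zip(device_a, device_b)) / len(device_a)

print(r, mean_difference)
```

Here the correlation is perfect, yet every pair of readings differs by 10 units, so looking at the differences between the measures is more informative about agreement than r.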
15
Non-parametric correlation
Make no distributional assumptions
Carried out on ranks
Spearman’s
  - easy to calculate
Kendall’s
  - has some advantages over Spearman’s
      - its distribution has better statistical properties
      - easier to identify concordant / discordant pairs
Usually both lead to the same conclusions
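The sketch below illustrates both rank-based coefficients under simplifying assumptions: the data are invented, there are no tied values, Spearman's coefficient is taken as Pearson's r applied to the ranks, and Kendall's tau is computed by counting concordant and discordant pairs. The helper names are hypothetical.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks 1..n of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's coefficient: Pearson's r carried out on the ranks."""
    return pearson_r(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall's tau from concordant and discordant pairs (assumes no ties)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical data: y rises with x, but along a curve (y = x cubed)
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

pearson = pearson_r(x, y)
spearman = spearman_rho(x, y)
tau = kendall_tau(x, y)
print(pearson, spearman, tau)
```

Because the relationship is monotonic but not linear, both rank-based coefficients reach 1 while Pearson's r falls short of it.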
16
Role of regression
Shows how one variable changes with another
By determining the line of best fit
  - linear
  - curvilinear
17
Line of best fit
Simplest case: linear
Line of best fit between:
  - dependent variable Y: BMD
  - independent variable X: dietary intake of calcium
Y = a + bX
  - a: value of Y when X = 0
  - b: change in Y when X increases by 1
18
Role of regression
Used to predict the value of the dependent variable when the value of the independent variable(s) is known
  - within the range of the known data
  - extrapolation is risky!
  - e.g. relation between age and bone age
Does not imply causality
19
SPSS OUTPUT: REGRESSION
20
Multiple regression
More than one independent variable
BMD dependent on:
  - age
  - gender
  - calorific intake
  - use of bisphosphonates
  - exercise
  - etc.
21
Summary
Correlation
  - strength of linear relationship between two variables
  - Pearson’s: parametric
  - Spearman’s / Kendall’s: non-parametric
  - interpret with care!
Regression
  - line of best fit
  - prediction
  - multiple regression
  - logistic
22
Regression: Checking the Model
Peter T. Donnan
Professor of Epidemiology and Biostatistics
Statistics for Health Research
23
Objectives of session
  - Recognise the need to check fit of the model
  - Carry out checks of assumptions in SPSS for simple linear regression
  - Understand predictive model
  - Understand residuals
24
How is the fitted line obtained?
  - Use method of least squares (LS)
  - Seek to minimise the sum of squared vertical differences between each point and the fitted line
  - Results in parameter estimates, or regression coefficients, of slope (b) and intercept (a): y = a + bx
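For simple linear regression the least squares estimates have a closed form: the slope b is the sum of cross-products divided by the sum of squares of x, and the intercept a is the mean of y minus b times the mean of x. The sketch below applies these formulas to invented data; the function name `least_squares_line` is hypothetical.

```python
def least_squares_line(x, y):
    """Return (a, b), the intercept and slope of the least squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: the line passes through (mean_x, mean_y)
    return a, b


# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

a, b = least_squares_line(x, y)
# Vertical differences between each observed point and the fitted line
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print(a, b, sum(residuals))
```

The residuals from a least squares line with an intercept sum to zero (up to rounding error), which is why the mean of the residuals is 0 in the residual plots.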
25
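Consider fitted line of y = a + bx
[Figure: fitted line with the explanatory variable (x) on the horizontal axis, the dependent variable (y) on the vertical axis, and the intercept a marked]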
26
Consider the regression of minimum LDL cholesterol achieved on age
In SPSS:
  - Select Regression
  - Linear…
  - Dependent (y): Min LDL achieved
  - Independent (x): Age_Base
27
Output from SPSS linear regression

Coefficients(a)
Model                 B        Std. Error   Beta     t         Sig.
1  (Constant)         2.024    .105                  19.340    .000
   Age at baseline    -.008    .002         -.121    -4.546    .000
B and Std. Error are the Unstandardized Coefficients; Beta is the Standardized Coefficient.
a. Dependent Variable: Min LDL achieved

N.B. 0.008 may look very small but represents the DECREASE in LDL achieved for each increase of one unit of age, i.e. ONE year
28
Output from SPSS linear regression (same Coefficients table as on the previous slide)
H0: slope b = 0
Test t = slope/se = -0.008/0.002 = -4.546 (SPSS computes t from the unrounded estimates), with p < 0.001, so statistically significant
Predicted LDL = 2.024 - 0.008 x Age
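The test of the slope can be retraced roughly as follows. This is only a sketch: it uses the rounded coefficients printed in the table, and it approximates the p-value with the standard normal distribution, which assumes a large sample (the sample size is not stated on the slide).

```python
from statistics import NormalDist

slope = -0.008       # B for "Age at baseline", as printed (rounded)
std_error = 0.002    # Std. Error, as printed (rounded)

# With the rounded values the ratio is about -4, not the -4.546 printed by SPSS,
# because SPSS divides the unrounded estimates
t_rounded = slope / std_error
t_reported = -4.546

# Two-sided p-value under a normal approximation (assumption: large sample)
p_approx = 2 * (1 - NormalDist().cdf(abs(t_reported)))

print(t_rounded, p_approx)
```

Either way the statistic is far from zero and the approximate p-value is well below 0.001, matching the ".000" shown in the Sig. column.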
29
Prediction equation from linear regression
Predicted LDL achieved = 2.024 - 0.008 x Age
So for a man aged 65 the predicted LDL achieved = 2.024 - 0.008 x 65 = 1.504

Age    Predicted Min LDL
45     1.664
55     1.584
65     1.504
75     1.424
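The prediction equation can be written as a small function. The coefficients come from the SPSS output; the function and variable names are made up for this sketch.

```python
INTERCEPT = 2.024   # (Constant) from the Coefficients table
SLOPE = -0.008      # Age at baseline from the Coefficients table


def predicted_min_ldl(age):
    """Predicted minimum LDL achieved for a given age at baseline (years)."""
    return INTERCEPT + SLOPE * age


# Reproduce the predictions for the ages shown on the slide
predictions = {age: round(predicted_min_ldl(age), 3) for age in (45, 55, 65, 75)}
print(predictions)  # {45: 1.664, 55: 1.584, 65: 1.504, 75: 1.424}
```

All four ages lie within the plausible range of the data; predicting for ages far outside that range would be extrapolation.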
30
Assumptions of Regression
1. Relationship is linear
2. Outcome variable, and hence residuals or error terms, are approx. Normally distributed
31
Use Graphs and Scatterplot to obtain the Lowess line of fit
32
1. Create Scatterplot and then double-click to enter chart editor
2. Choose icon ‘Add fit line at total’
3. Then select type of fit such as Lowess
33
Linear assumption: Fitted lowess smoothed line
Lowess smoothed line (red) gives a good eyeball examination of the linear assumption when compared with the linear fit (green)
34
Definition of a residual
A residual is the difference between the actual value and the predicted value (fitted line), i.e. the unexplained variation
r_i = y_i - E(y_i)
or
r_i = y_i - (a + b x_i)
35
Residuals
36
To assess the residuals in SPSS linear regression, select Plots…
  - Normalised or standardised predicted value of LDL
  - Normalised residual
  - Select histogram of residuals and normal probability plot
37
In SPSS linear regression, select Statistics…
  - Select confidence intervals for regression coefficients
  - Model fit
  - Select Durbin-Watson for serial correlation and casewise diagnostics for identification of outliers
38
Output: Scatterplot of residuals vs. predicted
Note:
1) Mean of residuals = 0
2) Most of the data lie within + or - 3 SDs of the mean
39
Assumptions of Regression
1. Relationship is linear
2. Outcome variable, and hence residuals or error terms, are approx. Normally distributed
40
Output: Histogram of standardised residuals
Plot of residuals with normal curve superimposed
41
Output: Cumulative probability plot
Look for deviation from the diagonal line to indicate non-normality
42
Output: Description of residuals
Descriptive statistics for residuals
Subjects with standardised residuals > 3: worth investigation?

Casewise Diagnostics(a)
Case Number   Std. Residual   Min LDL   Predicted   Residual
164           5.660           5.5840    1.518153    4.0658471
209           4.395           4.5260    1.368685    3.1573148
250           3.143           3.7875    1.529325    2.2581750
268           3.064           3.8730    1.671664    2.2013357
274           3.227           4.0953    1.777153    2.3180975
362           4.095           4.5350    1.593460    2.9415398
517           3.636           4.3240    1.711788    2.6122125
849           3.968           4.3290    1.478113    2.8508873
1047          4.207           4.4360    1.413686    3.0223141
1075          3.885           4.4040    1.613219    2.7907805
1103          3.519           3.9905    1.462584    2.5279157
1229          3.016           3.7660    1.599254    2.1667456
1290          3.975           4.2345    1.379107    2.8553933
a. Dependent Variable: Min LDL achieved
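The first row of the casewise table can be retraced by hand. The sketch below assumes that the standardised residual is the raw residual divided by the Std. Error of the Estimate from the Model Summary output (.7184048); the printed numbers are consistent with that, but it is an inference from the output rather than something stated on the slide.

```python
STD_ERROR_OF_ESTIMATE = 0.7184048   # from the SPSS Model Summary output

# Case 164 from the Casewise Diagnostics table
observed = 5.5840       # Min LDL achieved
predicted = 1.518153    # value on the fitted line

residual = observed - predicted                    # actual minus predicted
standardised = residual / STD_ERROR_OF_ESTIMATE    # assumed definition, see lead-in
is_outlier = abs(standardised) > 3                 # the "> 3 SDs" rule from the slides

print(round(residual, 6), round(standardised, 3), is_outlier)
```

The result agrees with the printed residual (4.0658471) and standardised residual (5.660) to the precision shown, and the case is flagged by the 3 SD rule.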
43
Output: Model fit and serial correlation

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .121(a)   .015       .014                .7184048                     2.034
a. Predictors: (Constant), Age at baseline

R: correlation between min LDL achieved and age at baseline, here 0.121 in magnitude (the standardised coefficient is -.121)
R²: % variation explained, here 1.5%, not particularly high
Durbin-Watson test: serial correlation of residuals; should be approximately 2 if there is no serial correlation
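To make the Durbin-Watson value less of a black box, the sketch below computes the statistic from its usual definition: the sum of squared differences between successive residuals divided by the sum of squared residuals. Both residual series are invented for illustration and have nothing to do with the LDL data.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals in their time or case order."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual series (invented data)
alternating = [1, -1, 1, -1]                  # neighbours have opposite signs
drifting = [0.5, 0.6, 0.4, -0.5, -0.6, -0.4]  # neighbours tend to be similar

dw_alternating = durbin_watson(alternating)
dw_drifting = durbin_watson(drifting)
print(dw_alternating, dw_drifting)
```

Values well below 2 suggest positive serial correlation, values well above 2 suggest negative serial correlation, and the 2.034 in the SPSS output sits close to the no-correlation value of 2.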
44
Summary
After fitting any regression model, check assumptions:
  - Functional form: linearity is the default but often not the best fit; consider quadratic…
  - Check residuals for approx. normality
  - Check residuals for outliers (> 3 SDs)
  - All accomplished within SPSS
45
Practical on Model Checking
Read in ‘LDL Data.sav’
1) Fit an age squared term in the min LDL model and check the fit of the model compared to the linear fit
   (Hint: use Transform/Compute to create the age squared term and fit age and age²)
2) Fit separate linear regressions of min Chol achieved with predictors:
   1) baseline Chol
   2) APOE_lin
   3) adherence
Check assumptions and interpret results