Stats Club: Correlation (Marnie Brennan)
Have you used correlation before?
References
Petrie and Sabin, Medical Statistics at a Glance: Chapter 26 (good)
Petrie and Watson, Statistics for Veterinary and Animal Science: Chapters 10 and 12 (12.7) (good)
Kirkwood and Sterne, Essential Medical Statistics: Chapters 10 and 30
Thrusfield, Veterinary Epidemiology: Chapter 14 (pages 263-264)
Other reads
Mathematics Learning Support Centre, Statistics: Correlation (good). http://www.lboro.ac.uk/media/wwwlboroacuk/content/mlsc/downloads/Correlation.pdf
Bewick, V, Cheek, L and Ball, J (2003) Statistics review 7: Correlation and regression. Critical Care, Vol. 7, 451-459.
Swinscow, TDV (1976) Statistics at square one: XVIII – Correlation. British Medical Journal, Vol. 2, 680-681.
Swinscow, TDV (1976) Statistics at square one: XIX – Correlation (continued). British Medical Journal, Vol. 2, 747-748.
Swinscow, TDV (1976) Statistics at square one: XIX – Correlation (concluded). British Medical Journal, Vol. 2, 802-803.
What is correlation? Use it to look at the relationship between two continuous (numerical) variables To see if you have a linear relationship between them If you interchanged the x and y axes, you would still have the same relationship To measure the degree of the relationship
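The point about interchanging the axes can be checked with a minimal sketch. This assumes Python and made-up data: the dog weights and heart rates are hypothetical, and the helper name pearson_r is introduced here for illustration only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: body weight (kg) and heart rate (beats/min) for six dogs
weight = [5.0, 12.0, 20.0, 28.0, 35.0, 42.0]
heart_rate = [130.0, 118.0, 105.0, 98.0, 90.0, 84.0]

r_xy = pearson_r(weight, heart_rate)  # weight on the x axis
r_yx = pearson_r(heart_rate, weight)  # axes interchanged

print(round(r_xy, 3), round(r_yx, 3))
```

Both calls give the same value because the formula treats the two variables symmetrically; neither is labelled as the predictor.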
How does it differ from other calculations? T-tests and ANOVAs These measure the differences between subsets within your data/between groups Regression (this will be covered in a later session) This describes the linear relationship between two variables Describes how one variable (independent) predicts the other (dependent) You cannot interchange the variables between the x and y axes
When would you use correlation? To see if there is a relationship between two variables, and how strong it is As a precursor to linear regression If variables are highly correlated (or collinear), this will affect how they behave in a linear regression calculation Therefore, you need to know whether variables are correlated or not before you do a linear regression Other reasons??
When wouldn’t you use correlation? To compare two different methods of measurement on the same thing (reliability) E.g. How many neutrophils are in a blood sample? Compare using an IDEXX machine versus counting on a blood smear To compare the same method of measurement but used multiple times (reproducibility) E.g. How many neutrophils are in a blood sample? Preparing two blood smears from the same sample, and comparing the results Use Kappa (and associated) analysis for these instead – covered in a previous session Categorical data
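One reason correlation is a poor tool for comparing two measurement methods can be sketched with made-up numbers. The neutrophil counts and the 1.5x bias below are hypothetical, and pearson_r is an illustrative helper, not something from the references.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical neutrophil counts (x10^9/L) from an analyser on five samples
machine = [3.1, 4.5, 6.0, 7.2, 9.8]
# Hypothetical smear counts that always read 50% higher than the analyser
smear = [1.5 * m for m in machine]

r = pearson_r(machine, smear)
mean_diff = sum(s - m for s, m in zip(smear, machine)) / len(machine)

print("r =", round(r, 3))
print("mean difference =", round(mean_diff, 2))
```

The correlation is essentially perfect even though the two methods never give the same answer, so a high r says nothing about whether the methods agree.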
First step – descriptive stats Scatter plot What does the relationship look like?
Shape and what does it mean? Does it increase or decrease as the values get higher? Does the relationship look linear? How ‘steep’ is the slope of the shape? Are there any outliers? Taken from Petrie and Sabin
Scatter plot examples
Demonstration if required…. GenStat Graphics, 2D Scatter Plot (Y variate, X variate) Run
Measurement for correlation Correlation coefficient Numerical representation of the degree of association between the variables Between -1 and +1 Positive – one variable increases as the other increases Negative – one variable decreases as the other increases It is dimensionless (spooky….) No units of measurement
What are the assumptions that have to be satisfied? There is a linear relationship between the variables There are no outliers present There are no subgroups within the data that affect the relationship e.g. sex There are not multiple observations from the same subjects Taken from Petrie and Sabin
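Why the outlier assumption matters can be illustrated with a small sketch. The numbers are invented for the purpose, and pearson_r is an illustrative helper.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: five points with almost no relationship
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 1.0, 4.0, 2.0, 3.0]
r_without = pearson_r(x, y)

# The same data plus one extreme point far from the rest
r_with = pearson_r(x + [20.0], y + [20.0])

print("without outlier:", round(r_without, 2))
print("with outlier:", round(r_with, 2))
```

A single extreme point is enough to turn a weak correlation into an apparently very strong one, which is why the scatter plot comes first.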
Calculation of correlation coefficient Pearson’s coefficient – represented as r for the sample (ρ, rho, for the population) Our null hypothesis is that there is no relationship i.e. the correlation coefficient is zero For the hypothesis test, at least one variable has to be normally distributed Taken from Petrie and Sabin
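For reference, the standard textbook form of Pearson's coefficient and the usual test statistic for the null hypothesis of zero correlation are sketched below. The symbols x_i, y_i, n and t are notation introduced here and are not taken from the slide.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
\qquad
t = r \sqrt{\frac{n - 2}{1 - r^2}}
```

Under the null hypothesis, t is compared with a t-distribution on n - 2 degrees of freedom to obtain the p-value.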
P-values and confidence intervals P-values are interpreted as standard (you can also calculate confidence intervals, but these are not normally included) You state your cut off e.g. if p<0.05, reject the null hypothesis and conclude that there is a relationship A significant result is not necessarily CAUSAL It only shows there is an ASSOCIATION
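If a confidence interval is wanted, one common approach is Fisher's z-transformation. The sketch below assumes that method and roughly normally distributed data; the values r = 0.60 and n = 30 are hypothetical, and fisher_ci is a name made up for this example.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson correlation,
    using Fisher's z-transformation (requires more than 3 pairs)."""
    if n <= 3:
        raise ValueError("need more than 3 pairs of observations")
    z = math.atanh(r)              # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)    # approximate standard error on the z scale
    # Build the interval on the z scale, then transform back to the r scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Hypothetical result: r = 0.60 from 30 pairs of observations
lower, upper = fisher_ci(0.60, 30)
print(round(lower, 2), round(upper, 2))
```

The interval is not symmetric around r because correlations are bounded by -1 and +1; it also shows how imprecise an estimate from a small sample can be.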
Minitab
Minitab – Pearsons
SPSS
SPSS - Pearsons
GenStat Stat, summary, correlations
What if our assumptions cannot be met? Use Spearman’s rank correlation coefficient (or Kendall’s coefficient) Rank each variable from lowest to highest, and then calculate the correlation coefficient on the ranks Still between -1 and +1 Remove outliers Rule of thumb – if the outliers are more than 2 standard deviations away from the group mean, you can remove them Transform data into a linear relationship????
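The "rank, then correlate" recipe for Spearman's coefficient can be sketched as follows. This assumes tied values share their average rank, which is the usual convention; the dose and response figures are hypothetical, and the function names are illustrative.

```python
import math

def average_ranks(values):
    """Rank values from lowest (1) to highest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rs(x, y):
    """Spearman's coefficient: Pearson's coefficient calculated on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: response rises steadily with dose, but along a curve
dose = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
response = [d ** 3 for d in dose]

r = pearson_r(dose, response)
rs = spearman_rs(dose, response)
print("Pearson:", round(r, 3), "Spearman:", round(rs, 3))
```

Because the relationship is curved, Pearson's coefficient falls short of 1, while Spearman's coefficient only looks at the ordering and so picks up the consistent increase.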
Minitab – Spearman’s
SPSS – Spearman’s
GenStat Stat, summary, correlations Rank??
How does this fit with what you do or have seen/experienced?
Summary Make sure you are using correlation in the correct way Eyeball first with a scatter plot – look for strength, slope, relationship, outliers If it doesn’t fit the assumptions, use a non-parametric equivalent e.g. Spearman’s
Next time… Linear regression……