Université d’Ottawa / University of Ottawa 2001 Bio 4118 Applied Biostatistics
L10.1 Correlation
The underlying principle of correlation analysis
Measuring the strength of a correlation
Assumptions
Confidence intervals and hypothesis testing
Comparing correlations
Non-parametric correlations
Power in correlation analysis
L10.2 The underlying principle of correlation analysis
Correlation measures the extent to which two variables covary, in particular the strength of the linear association between them.
No causal relationship is implied, so there is no distinction between dependent and independent variables.
[Figure: scatterplot of X2 against X1]
L10.3 When do we use correlation?
Do use it to determine the strength of association between two variables.
Do not use it if you want to predict the value of X given Y, or vice versa.
[Figure: two panels, "Correlation" (X2 against X1) and "Regression" (Y against X)]
L10.4 Simple linear correlation versus simple linear regression
Calculations are the same.
In correlation analysis, both X and Y must be sampled randomly.
Correlation deals with association (importance).
Regression deals with prediction (intensity).
L10.5 Lab example: fork length and round weight of sturgeon
Since the two variables are not causally related, use correlation to measure the strength of association.
L10.6 Regression: fork length and age of sturgeon
The two variables are causally related.
The relationship between the two provides an estimate of growth rates, and we can use the relationship to predict the size of sturgeon of a given age.
L10.7 Measuring the strength of a correlation
The test statistic is the product-moment correlation coefficient r.
[Figure: scatterplot of X2 against X1]
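As an illustration of how r is obtained, the sketch below computes the product-moment correlation as the sum of cross-products divided by the square root of the product of the two sums of squares. The function name `pearson_r` and the data are hypothetical; they do not come from the slides.

```python
import math

def pearson_r(x1, x2):
    """Product-moment correlation coefficient of two equal-length samples."""
    n = len(x1)
    if n != len(x2) or n < 2:
        raise ValueError("need two samples of equal length (n >= 2)")
    mean1 = sum(x1) / n
    mean2 = sum(x2) / n
    # Sum of cross-products and sums of squares about the means.
    sp = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2))
    ss1 = sum((a - mean1) ** 2 for a in x1)
    ss2 = sum((b - mean2) ** 2 for b in x2)
    return sp / math.sqrt(ss1 * ss2)

# Hypothetical data, for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(x1, x2)
print(round(r, 4))
```

Swapping the two arguments gives the same value, which reflects the absence of a dependent/independent distinction in correlation.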
L10.8 Measuring the strength of a correlation
r always lies between -1 and 1.
$r^2$ is the coefficient of determination, which measures the proportion of the variance in $X_1$ (or $X_2$) "explained" by variation in $X_2$ (or $X_1$).
[Figure: scatterplots of X2 against X1 for r = 0.9, r = 0.5, r = 0, r = -0.5 and r = -0.9]
L10.9 Assumptions of correlation analysis I: Bivariate normality
For each value of $X_1$, the $X_2$ values are normally distributed, and vice versa.
[Figure: panels for r = 0.8 and r = 0]
L10.10 Assumptions of correlation analysis II: Homoscedasticity
The variance of $X_1$ does not depend on the value of $X_2$, and vice versa.
But the variances of $X_1$ and $X_2$ need not be equal.
[Figure: two scatterplots of X2 against X1, "Homoscedastic" and "Heteroscedastic"]
L10.11 Assumptions of correlation analysis III: Linearity
The relationship between $X_1$ and $X_2$ is linear.
[Figure: two scatterplots of X2 against X1, "Linear" and "Nonlinear"]
L10.12 Violation of assumptions: fork length and age of sturgeon
The relationship between fork length and age appears non-linear.
The variance in fork length appears to increase with age.
L10.13 If parametric correlation assumptions aren’t met...
Try transforming the data (e.g. log transform).
Try a non-parametric correlation analysis.
L10.14 Confidence intervals for correlation coefficients
Confidence limits for the z-transformed correlation are given by $z \pm Z_{\alpha(2)}\,\sigma_z$, where $z = \tfrac{1}{2}\ln\frac{1+r}{1-r}$ and $\sigma_z = \sqrt{1/(N-3)}$.
Convert each limit back to an untransformed limit by $r = \frac{e^{2z}-1}{e^{2z}+1}$.
[Figure: scatterplots of X2 against X1 contrasting a smaller CI with a larger CI]
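A minimal sketch of the interval calculation, assuming the standard Fisher z-transformation with standard error $\sqrt{1/(N-3)}$. The function name `fisher_ci` and the inputs r = 0.5, N = 30 are hypothetical.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, conf=0.95):
    """Confidence limits for a correlation via the Fisher z-transformation."""
    if n <= 3 or not -1.0 < r < 1.0:
        raise ValueError("need n > 3 and -1 < r < 1")
    z = math.atanh(r)                 # 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(n - 3)       # standard error of z
    z_crit = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    # Limits on the z scale, back-transformed with tanh.
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

lower, upper = fisher_ci(0.5, 30)     # hypothetical r and N
print(round(lower, 3), round(upper, 3))
```

The interval is asymmetric about r after back-transformation, and it narrows as N increases.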
L10.15 Hypothesis testing I
$H_0: \rho = 0$
The standard error of the correlation coefficient is given by $s_r = \sqrt{(1-r^2)/(N-2)}$.
Calculate $t = r/s_r$ and compare to the t-distribution with N - 2 df.
[Figure: scatterplots of X2 against X1, "Reject H0" and "Accept H0", showing observed and expected]
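The test of $H_0: \rho = 0$ can be sketched as follows, assuming the usual standard error $\sqrt{(1-r^2)/(N-2)}$. The function name `correlation_t` and the inputs are hypothetical. The Python standard library has no t-distribution, so the sketch stops at the statistic and its degrees of freedom, to be compared with a tabled critical value.

```python
import math

def correlation_t(r, n):
    """Return (t, df) for testing H0: rho = 0 from sample r and size n."""
    if n < 3 or not -1.0 < r < 1.0:
        raise ValueError("need n >= 3 and -1 < r < 1")
    s_r = math.sqrt((1.0 - r * r) / (n - 2))  # standard error of r
    return r / s_r, n - 2

t, df = correlation_t(0.5, 27)  # hypothetical r and N
print(round(t, 3), df)
```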
L10.16 Hypothesis testing II
$H_0: \rho = \rho_0$
Transform r and $\rho_0$ to $z = \tfrac{1}{2}\ln\frac{1+r}{1-r}$ and $\zeta_0 = \tfrac{1}{2}\ln\frac{1+\rho_0}{1-\rho_0}$.
Calculate $Z = (z - \zeta_0)\sqrt{N-3}$ and compare to the standard normal (Z) distribution; N - 3 enters through the standard error $\sqrt{1/(N-3)}$, not as degrees of freedom.
[Figure: scatterplots of X2 against X1, "Reject H0" and "Accept H0", showing observed and expected]
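A sketch of the test against a non-zero hypothesized correlation, assuming the Fisher transformation and a two-tailed normal test. The function name `z_test_rho0` and the inputs are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_rho0(r, rho0, n):
    """Return (Z, two-tailed p) for H0: rho = rho0."""
    if n <= 3:
        raise ValueError("need n > 3")
    # Difference of the transformed values divided by sqrt(1 / (n - 3)).
    z_stat = (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p

z_stat, p = z_test_rho0(0.8, 0.5, 28)  # hypothetical r, rho0 and N
print(round(z_stat, 3), round(p, 4))
```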
L10.17 Comparing 2 correlations
$H_0: \rho_1 = \rho_2$
Transform $r_1$ and $r_2$ to $z_1 = \tfrac{1}{2}\ln\frac{1+r_1}{1-r_1}$ and $z_2 = \tfrac{1}{2}\ln\frac{1+r_2}{1-r_2}$.
Calculate $Z = \frac{z_1 - z_2}{\sqrt{\frac{1}{n_1-3} + \frac{1}{n_2-3}}}$ and compare to the Z distribution.
[Figure: scatterplots of X2 against X1, "Reject H0" and "Accept H0", with correlations r1 and r2]
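The two-sample comparison can be sketched like this, assuming independent samples and the standard error $\sqrt{1/(n_1-3) + 1/(n_2-3)}$ for the difference of transformed correlations. The function name `compare_two_correlations` and the inputs are hypothetical.

```python
import math
from statistics import NormalDist

def compare_two_correlations(r1, n1, r2, n2):
    """Return (Z, two-tailed p) for H0: rho1 = rho2, independent samples."""
    if n1 <= 3 or n2 <= 3:
        raise ValueError("both samples need n > 3")
    se_diff = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z_stat = (math.atanh(r1) - math.atanh(r2)) / se_diff
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))
    return z_stat, p

z_stat, p = compare_two_correlations(0.8, 28, 0.5, 28)  # hypothetical inputs
print(round(z_stat, 3), round(p, 3))
```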
L10.18 Comparing multiple correlations
$H_0: \rho_i = \rho_j = \rho_k = \dots$, based on $n_i, n_j, n_k, \dots$ observations.
Z-transform all $r_i$ to $z_i$ and calculate $\chi^2 = \sum (n_i - 3) z_i^2 - \frac{\left[\sum (n_i - 3) z_i\right]^2}{\sum (n_i - 3)}$, and compare to the $\chi^2$ distribution with df = k - 1, where k is the number of correlations.
L10.19 Computing common correlations
If $H_0: \rho_i = \rho_j = \rho_k = \dots$ is accepted, then each $r_i$ estimates the same (population) correlation $\rho$.
To estimate $\rho$, first calculate the weighted z-score $z_w = \frac{\sum (n_i - 3) z_i}{\sum (n_i - 3)}$.
Then back-transform to get the common correlation $r_w = \frac{e^{2 z_w} - 1}{e^{2 z_w} + 1}$.
[Figure: scatterplot of X2 against X1, "Accept H0", with correlations r1, r2 and r3]
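The homogeneity test and the common correlation share the same weighted sums, so they can be sketched together. This assumes weights of $n_i - 3$ on the Fisher-transformed correlations. The function name `homogeneity_and_common_r` and the inputs are hypothetical, and the chi-square statistic is left for comparison with a tabled critical value because the standard library has no chi-square distribution.

```python
import math

def homogeneity_and_common_r(rs, ns):
    """Return (chi2, df, r_w) for k correlations rs with sample sizes ns."""
    if len(rs) != len(ns) or len(rs) < 2:
        raise ValueError("need at least two correlations with matching sizes")
    weights = [n - 3 for n in ns]
    zs = [math.atanh(r) for r in rs]
    sum_w = sum(weights)
    sum_wz = sum(w * z for w, z in zip(weights, zs))
    sum_wz2 = sum(w * z * z for w, z in zip(weights, zs))
    chi2 = sum_wz2 - sum_wz ** 2 / sum_w   # homogeneity statistic
    z_w = sum_wz / sum_w                   # weighted mean z
    return chi2, len(rs) - 1, math.tanh(z_w)

chi2, df, r_w = homogeneity_and_common_r([0.2, 0.8], [20, 20])  # hypothetical
print(round(chi2, 3), df, round(r_w, 3))
```

The common correlation is only meaningful when the homogeneity hypothesis has not been rejected.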
L10.20 Non-parametric correlations
Use when one or more assumptions are not met.
Essentially a parametric correlation of the ranks.
The most common statistic is the Spearman rank correlation.
[Figure: scatterplot of X2 against X1 alongside a plot of rank X2 against rank X1]
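Since the rank correlation is a parametric correlation applied to ranks, a sketch needs only a ranking step followed by the product-moment formula. Tied values receive the average of their ranks, which is an assumption of this sketch. The function names and data are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_r(x1, x2):
    """Product-moment correlation of the ranks of x1 and x2."""
    a, b = average_ranks(x1), average_ranks(x2)
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sp = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    ssa = sum((p - ma) ** 2 for p in a)
    ssb = sum((q - mb) ** 2 for q in b)
    return sp / math.sqrt(ssa * ssb)

# Hypothetical monotonic but non-linear data.
rs = spearman_r([1, 2, 3, 4, 5], [1, 8, 27, 64, 125])
print(rs)
```

A perfectly monotonic relationship gives a rank correlation of 1 even when it is far from linear.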
L10.21 Power and sample size in correlation
If we test $H_0: \rho = 0$ with sample size n, we can determine the power $1 - \beta$ by using the Z-transformation of the critical value of the correlation for the given $\alpha$ ($z_\alpha$) and of the sample correlation r ($z_r$).
[Figure: scatterplot of X2 against X1 and a probability curve plotted against Z]
L10.22 Power and sample size in correlation
Once $Z_{\beta(1)}$ is determined, we can calculate the probability of obtaining a Z-value of this size or greater, i.e. $\beta$.
Power is then $1 - \beta$.
[Figure: scatterplot of X2 against X1 and a probability curve plotted against Z]
L10.23 Power and sample size in correlation: an example
Correlation of wing length and tail length in a sample of 12 birds: the calculation gives $1 - \beta = 0.98$.
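The power calculation can be sketched under one explicit convention: $Z_{\beta(1)} = (z_r - z_\alpha)\sqrt{n-3}$, with $\beta$ the probability of a normal deviate at least that large. That convention is an assumption, chosen to agree with the description of $\beta$ above. The slide text gives only n = 12 and the resulting power; the sample correlation r = 0.870 and the tabled critical value 0.576 (two-tailed 5% level, 10 df) used below are assumed inputs, as is the function name `correlation_power`.

```python
import math
from statistics import NormalDist

def correlation_power(r, r_crit, n):
    """Approximate power of the test of H0: rho = 0.

    r      -- sample correlation
    r_crit -- tabled critical value of r for the chosen alpha and n - 2 df
    n      -- sample size
    """
    if n <= 3:
        raise ValueError("need n > 3")
    z_beta = (math.atanh(r) - math.atanh(r_crit)) * math.sqrt(n - 3)
    beta = 1.0 - NormalDist().cdf(z_beta)  # P(Z >= z_beta)
    return 1.0 - beta

# Assumed inputs: r and r_crit are not given in the slide text.
power = correlation_power(r=0.870, r_crit=0.576, n=12)
print(round(power, 2))
```

With these assumed inputs the result rounds to 0.98, the power quoted for the 12-bird example.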
L10.24 Minimal sample size
Given a desired power $1 - \beta$, how large a sample is required to reject $H_0: \rho = 0$ if it is false with a specified $\rho_0$?
Calculate: $n = \left(\frac{Z_{\beta(1)} + Z_\alpha}{\zeta_0}\right)^2 + 3$, where $\zeta_0$ is the z-transform of $\rho_0$.
[Figure: scatterplots of X2 against X1, "Reject H0?", showing observed and expected]
L10.25 Minimal sample size: an example
We want to reject $H_0: \rho = 0$ 99% of the time when $|\rho| > 0.5$ and $\alpha(2) = 0.05$.
So $\beta(1) = 0.01$ and, for $\rho_0 = 0.50$, we have $\zeta_0 = 0.5493$, $Z_{\beta(1)} = 2.3263$ and $Z_{\alpha(2)} = 1.9600$.
Hence $n = \left(\frac{2.3263 + 1.9600}{0.5493}\right)^2 + 3 = 63.9$.
So, a sample size of at least 64 should be used.
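The sample-size example can be checked numerically. The sketch assumes the formula $n = ((Z_{\beta(1)} + Z_{\alpha(2)})/\zeta_0)^2 + 3$ with a two-tailed $\alpha$, and the function name `min_sample_size` is hypothetical.

```python
import math
from statistics import NormalDist

def min_sample_size(rho0, alpha=0.05, power=0.99):
    """Smallest n expected to reject H0: rho = 0 with the given power."""
    if not 0.0 < abs(rho0) < 1.0:
        raise ValueError("rho0 must be non-zero and strictly between -1 and 1")
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1.0 - alpha / 2.0)  # two-tailed critical deviate
    z_beta = nd.inv_cdf(power)               # one-tailed deviate for beta
    zeta0 = math.atanh(abs(rho0))            # z-transform of rho0
    return math.ceil(((z_beta + z_alpha) / zeta0) ** 2 + 3.0)

n_required = min_sample_size(0.5, alpha=0.05, power=0.99)
print(n_required)  # 64
```

Weaker correlations need far larger samples, because $\zeta_0$ appears squared in the denominator.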
L10.26 Power and sample size in comparing 2 correlations
The power of a test for a difference between two correlation coefficients is $1 - \beta$, where $\beta$ is the one-tailed probability of $Z_{\beta(1)} = \frac{|z_1 - z_2|}{\sigma_{z_1 - z_2}} - Z_\alpha$, with $\sigma_{z_1 - z_2} = \sqrt{\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}}$.
[Figure: scatterplots of X2 against X1, "Reject H0" and "Accept H0", with correlations r1 and r2]
L10.27 An example
What is the power to detect a difference between two correlations?
From the table of normal deviates we obtain $\beta$, so power = $1 - \beta$ = 0.22.
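A sketch of the power calculation for two correlations, assuming $Z_{\beta(1)} = |z_1 - z_2| / \sigma_{z_1 - z_2} - Z_\alpha$ with a two-tailed $\alpha$ and $\beta$ the upper-tail probability of that deviate. The slide text does not give the inputs behind its 0.22, so the values below are hypothetical and are not meant to reproduce it; the function name `power_two_correlations` is also hypothetical.

```python
import math
from statistics import NormalDist

def power_two_correlations(r1, n1, r2, n2, alpha=0.05):
    """Approximate power of the test of H0: rho1 = rho2."""
    if n1 <= 3 or n2 <= 3:
        raise ValueError("both samples need n > 3")
    nd = NormalDist()
    se_diff = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z_alpha = nd.inv_cdf(1.0 - alpha / 2.0)
    z_beta = abs(math.atanh(r1) - math.atanh(r2)) / se_diff - z_alpha
    beta = 1.0 - nd.cdf(z_beta)  # P(Z >= z_beta)
    return 1.0 - beta

# Hypothetical correlations and sample sizes.
power = power_two_correlations(0.8, 28, 0.5, 28)
print(round(power, 2))
```

Even a sizeable difference between correlations can leave power near one half at moderate sample sizes, which is why low values such as 0.22 are common in this kind of comparison.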