Chapter 11: Association Between Two Variables
CONTENTS
§1 Simple linear correlation: r
§2 Rank correlation: r_s
§3 Association between two categorical variables (r, τ)
§4 Case discussion
Terminology
Association: 关联性
Correlation: 相关
Intensity: 密切程度 (强度)
Correlation coefficient: 相关系数
Pearson correlation coefficient: Pearson 相关系数
Product moment correlation coefficient: 积矩相关系数
Rank correlation coefficient: 秩相关系数
Spearman correlation coefficient: Spearman 相关系数
§1 Simple Linear Correlation
1. Concept and descriptive statistics of the relationship between two continuous variables
2. Statistical inference about the correlation coefficient
3. Notes in application
1. Concept and descriptive statistics of the relationship between two continuous variables
The linear relationship between X and Y:
X, Y: random variables that follow a normal distribution.
Both X and Y are measured on the same subject.
Cases on Studying Relationships
Relationship between salt intake (X) and blood pressure (Y);
Relationship between blood pressure (X) and body mass index (BMI) (Y);
Relationship between height (X) and weight (Y);
Relationship between blood pressure (X) and age (Y);
···
Correlational techniques
Correlation: to determine the strength and direction of the relationship between variables.
Regression: to make predictions from one variable to the other, based on the functional relationship between the variables.
Table 11-1 Concentration of thrombin (u/ml) and blood clotting time (seconds) from 15 healthy adults (Example 11-1)
Columns: Subject; Concentration of thrombin (u/ml); Clotting time (seconds)
Figure 11-1 Scatter plot of relationship between thrombin concentration and clotting time
Figure 11-2 Common types of relationship between two variables
a: Positive correlation
b: Perfect positive correlation
c: Negative correlation
d: Perfect negative correlation
e: Zero correlation
f: Non-linear correlation
Simple linear correlation coefficient (r)
Synonyms: Pearson correlation coefficient; product moment correlation coefficient.
Definition: a statistical index that describes the intensity and the direction of the linear association between two variables.
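Symbol
r: sample statistic; ρ: population parameter
Equation for calculating the correlation coefficient:
$$r = \frac{l_{XY}}{\sqrt{l_{XX}\, l_{YY}}} = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \, \sum (Y-\bar{Y})^2}}$$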
0 < r < 1: positive correlation
-1 < r < 0: negative correlation
r = 1: perfect positive correlation
r = -1: perfect negative correlation
r = 0: zero correlation (no linear correlation)
Intensity: the absolute value of r
|r| → 1: stronger linear association
|r| → 0: weaker linear association
Direction: the sign of r
+: positive correlation
-: negative correlation
Procedure for calculating the correlation coefficient
1) Draw a scatter plot and check for a linear trend.
2) Calculate r.
Calculation table, columns: Subject i; Concentration of thrombin x (u/ml); Clotting time y (seconds); x²; y²; x×y. The last row gives the column sums.
2) Calculation of r
$l_{XX} = 0.404$, $l_{YY} =$ …, $l_{XY} =$ …
$$r = \frac{l_{XY}}{\sqrt{l_{XX}\, l_{YY}}} = -0.926$$
X and Y show a strong negative linear relationship.
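The calculation of r from the corrected sums of squares and cross-products can be sketched in a few lines of Python. This is only an illustration: the function name `pearson_r` and the five data pairs are hypothetical, and the data are not the values of Table 11-1.

```python
import math

def pearson_r(x, y):
    """Pearson's r from l_XX, l_YY and l_XY (corrected sums of squares and products)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    l_xx = sum(v * v for v in x) - sum_x ** 2 / n
    l_yy = sum(v * v for v in y) - sum_y ** 2 / n
    l_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    return l_xy / math.sqrt(l_xx * l_yy)

# Hypothetical data for illustration only (NOT the data of Table 11-1).
x = [0.6, 0.8, 1.0, 1.2, 1.4]       # e.g. a concentration
y = [17.0, 16.0, 15.0, 13.5, 13.0]  # e.g. a time in seconds

r = pearson_r(x, y)
print(f"r = {r:.3f}")  # negative sign: y tends to fall as x rises
```

The sign of the result gives the direction and its absolute value the intensity, as described above.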
2. Statistical inference about the correlation coefficient: hypothesis test
1) Establish the hypotheses and determine the significance level α
H0: ρ = 0, no linear association between X and Y
H1: ρ ≠ 0, a linear association between X and Y exists
α = 0.05 (two-sided probability of a type I error)
2) Calculate the test statistic
Method 1: t-test
$$t = \frac{r}{\sqrt{(1-r^2)/(n-2)}}, \qquad \nu = n-2$$
For Example 11-1, ν = 15 - 2 = 13. From the t distribution table, the critical value is t_{0.05/2(13)} = 2.160 < |t| = 8.874, so P < 0.05 and the correlation coefficient is statistically significant at α = 0.05. Thrombin concentration and clotting time are negatively related.
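As a rough check of Method 1, the sketch below evaluates the usual statistic t = r / sqrt((1 - r²)/(n - 2)) for r = -0.926 and n = 15. The function name `t_statistic` is hypothetical. Because r is entered rounded to three decimals, the result (about 8.84 in absolute value) is slightly below the 8.874 quoted on the slide, which was presumably computed from an unrounded r.

```python
import math

def t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

r, n = -0.926, 15    # values quoted for Example 11-1 (r rounded to 3 decimals)
t = t_statistic(r, n)
t_crit = 2.160       # two-sided critical value t_{0.05/2(13)} from the slide

reject_h0 = abs(t) > t_crit
print(f"t = {t:.3f}, df = {n - 2}, reject H0: {reject_h0}")
```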
Method 2: Consult Appendix Table 13 (page 486). The critical value of r is r_{0.05(13)} = 0.514 < |r| = 0.926, so P < 0.05. The conclusion is that there is a linear association between clotting time and thrombin concentration.
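R.A. Fisher's Z transformation
$$z = \frac{1}{2}\ln\frac{1+r}{1-r}, \qquad r = \frac{e^{2z}-1}{e^{2z}+1}$$
Calculation of the CI for ρ:
r → (Fisher's transformation) → z, which approximately follows a normal distribution → CI for Z: $z \pm u_{\alpha/2}/\sqrt{n-3}$ → (back-transformation) → CI for ρ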
Example 3 (continuation of Example 2): The researcher obtained a sample correlation coefficient r = 0.82 (P < 0.01) and wants to estimate the strength of the correlation further.
99% CI for Z: (0.186, 2.134).
Conclusion: the 99% CI of the correlation coefficient between forearm length and height is (0.184, 0.972).
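The interval can be reproduced approximately with a short Python sketch of Fisher's transformation. Two things here are assumptions rather than facts from the slides: the sample size n = 10 is not stated and is only inferred from the width of the quoted interval, and the function name `fisher_ci` is hypothetical. Small differences in the third decimal are expected from rounding.

```python
import math

def fisher_ci(r, n, u):
    """CI for rho via Fisher's z = atanh(r), with standard error 1/sqrt(n - 3)."""
    z = math.atanh(r)                  # same as 0.5 * ln((1 + r) / (1 - r))
    half_width = u / math.sqrt(n - 3)
    z_lo, z_hi = z - half_width, z + half_width
    # Back-transform the limits of z to the correlation scale.
    return (z_lo, z_hi), (math.tanh(z_lo), math.tanh(z_hi))

# r from Example 3; n = 10 is an ASSUMPTION inferred from the quoted interval.
# u = 2.576 is the two-sided 99% standard normal critical value.
(z_lo, z_hi), (rho_lo, rho_hi) = fisher_ci(r=0.82, n=10, u=2.576)
print(f"99% CI for Z:   ({z_lo:.3f}, {z_hi:.3f})")
print(f"99% CI for rho: ({rho_lo:.3f}, {rho_hi:.3f})")
```

The interval for ρ is asymmetric around r = 0.82 because the back-transformation is non-linear.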
§2 Spearman Rank Correlation
1. Calculation of the Spearman correlation coefficient
2. Statistical inference about the Spearman correlation coefficient
Terminology:
rank: 秩, 等级
rank correlation coefficient: 秩相关系数
Spearman rank correlation is a nonparametric method, applied when the two variables are distributed far from normal.
1. Spearman Rank Correlation Coefficient (r_s)
1) Rank the values of each of the two variables according to their magnitude.
2) Calculate the Spearman rank correlation coefficient based on the ranks.
Table 11-2 Hemorrhage degrees and thrombocyte counts (10^9/L) from 12 children with acute leukemia
Columns: (1) Patient; (2) platelet count; (3) rank p_x; (4) (p_x)²; (5) bleeding degree; (6) rank p_y; (7) (p_y)²; (8) p_x × p_y. The last row gives the column totals.
For equal ranks, the mean rank is used instead. The six '–' bleeding degrees occupy ranks 1 to 6, so each receives the mean rank (1+2+3+4+5+6)/6 = 3.5.
Calculation of r_s (numerical values are from Table 11-2; page 212)
$$r_s = \frac{\sum p_x p_y - (\sum p_x)(\sum p_y)/n}{\sqrt{\left[\sum p_x^2 - (\sum p_x)^2/n\right]\left[\sum p_y^2 - (\sum p_y)^2/n\right]}} = -0.422$$
Explanation of the Spearman rank correlation coefficient r_s
(1) -1 ≤ r_s ≤ 1, with a meaning similar to that of r.
(2) Difference between r_s and r (r_s ≠ r): r_s is calculated from the ranks; r is calculated from the original data values.
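The two steps (ranking with mean ranks for ties, then applying the product moment formula to the ranks) can be sketched as follows. The function names `mean_ranks` and `spearman_rs`, the integer coding of bleeding degree, and the eight data pairs are all hypothetical; they are not the data of Table 11-2.

```python
import math

def mean_ranks(values):
    """Rank from smallest to largest; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """Spearman's r_s: the product moment formula applied to the ranks."""
    p, q = mean_ranks(x), mean_ranks(y)
    n = len(p)
    l_pp = sum(v * v for v in p) - sum(p) ** 2 / n
    l_qq = sum(v * v for v in q) - sum(q) ** 2 / n
    l_pq = sum(a * b for a, b in zip(p, q)) - sum(p) * sum(q) / n
    return l_pq / math.sqrt(l_pp * l_qq)

# Hypothetical data: platelet counts and bleeding degree coded 0 ('-') to 3 ('+++').
platelet = [121, 138, 165, 310, 426, 540, 740, 1060]
bleeding = [3, 2, 0, 1, 0, 0, 0, 0]

rs = spearman_rs(platelet, bleeding)
print(f"r_s = {rs:.3f}")
```

Six tied smallest values receive the mean rank 3.5, which matches the tie rule stated for Table 11-2.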
2. Statistical inference about the Spearman rank correlation coefficient r_s
1) Set up the hypotheses and determine the significance level
H0: ρ_s = 0
H1: ρ_s ≠ 0
α = 0.05 (two-sided)
2) Calculate the test statistic and obtain the critical value:
a) If n ≤ 50, consult Appendix Table 14 (critical values of r_s).
b) If n > 50, calculate t by equations 11-5 and 11-6.
For a): r_s = -0.422, n = 12. From Appendix Table 14 (page 487), the critical value is r_{s,0.05(12)} = 0.587 > |r_s| = 0.422, so P > 0.05 and we fail to reject H0.
b) Calculating t by equations 11-5 and 11-6 (if n > 50). For illustration, r_s = -0.422 is used to calculate the t value as follows:
$$t = \frac{r_s}{\sqrt{(1-r_s^2)/(n-2)}}, \qquad \nu = n-2$$
3) Conclusion: no statistically significant association was found between hemorrhage degree and thrombocyte count. (Both methods lead to the same conclusion.)
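As a worked check of step b), the numbers of this example can be substituted into the usual t statistic for a correlation coefficient. This assumes that equations 11-5 and 11-6 take that standard form. The sample here has n = 12, so the t approach serves only as an illustration.

```latex
t = \frac{r_s}{\sqrt{(1 - r_s^2)/(n - 2)}}
  = \frac{-0.422}{\sqrt{(1 - 0.422^2)/(12 - 2)}}
  = \frac{-0.422}{\sqrt{0.0822}}
  \approx -1.47,
\qquad \nu = 12 - 2 = 10

|t| = 1.47 < t_{0.05/2(10)} = 2.228 \;\Rightarrow\; P > 0.05
```

H0 is not rejected, in agreement with the table look-up in a).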
§3 Association between two categorical variables (r, τ)
1. Association measures for a 2×2 table of cross-classification data
2. Association measures for a 2×2 table of pair-designed data
3. Association measures for an R×C table of cross-classification data
1. Association measures for a 2×2 table of cross-classification data
Table 11-3 Diarrhea and feeding patterns of infants
Feeding pattern       Diarrhea: Yes   Diarrhea: No   Total
Artificial feeding    30              10             40
Breast feeding        17              25             42
Total                 47              35             82
Table 11-4 Data layout for a 2×2 cross classification (observed count: A_ij; probability: π_ij; i, j = 1, 2)
Variable X (row i)    Variable Y (column j)           Total
                      Y1             Y2
X1                    A_11 (π_11)    A_12 (π_12)      n_1. (π_1.)
X2                    A_21 (π_21)    A_22 (π_22)      n_2. (π_2.)
Total                 n_.1 (π_.1)    n_.2 (π_.2)      n (π = 1.0)
The marginal probabilities are estimated by π_i. ≈ n_i./n and π_.j ≈ n_.j/n.
Observed number in cell (i, j): A_ij, i, j = 1, 2.
Under independence (H0), the joint probability of a particular combination of results is, by the multiplication rule:
$$\pi_{ij} = \pi_{i\cdot} \times \pi_{\cdot j} \qquad (11\text{-}7)$$
or, as the expected number:
$$T_{ij} = \frac{n_{i\cdot} \times n_{\cdot j}}{n} \qquad (11\text{-}8)$$
1) Hypothesis test
H0: the two variables are independent
H1: the two variables are associated
Observed counts, with expected counts in parentheses:
Feeding pattern       Yes           No            Total
Artificial feeding    30 (22.93)    10 (17.07)    40
Breast feeding        17 (24.07)    25 (17.93)    42
Total                 47            35            82
2) Test statistic:
$$\chi^2 = \sum \frac{(A_{ij} - T_{ij})^2}{T_{ij}}, \qquad \nu = (2-1)(2-1) = 1$$
3) Pearson's contingency coefficient:
$$r = \sqrt{\frac{\chi^2}{\chi^2 + n}}$$
There is a weak association.
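The expected counts, the χ² statistic and Pearson's contingency coefficient for the feeding pattern data can be recomputed with a short Python sketch. The variable names are illustrative, and the contingency coefficient is assumed to be the standard form sqrt(χ² / (χ² + n)).

```python
import math

# Observed counts: rows = artificial / breast feeding, columns = diarrhea yes / no.
observed = [[30, 10],
            [17, 25]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: T_ij = n_i. * n_.j / n
expected = [[ri * cj / n for cj in col_totals] for ri in row_totals]

chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

# Pearson's contingency coefficient (assumed standard form).
contingency = math.sqrt(chi2 / (chi2 + n))

print([[round(t, 2) for t in row] for row in expected])
print(f"chi2 = {chi2:.2f}, contingency coefficient = {contingency:.3f}")
```

The expected counts agree with the values in parentheses in the table. The χ² value exceeds 3.84, the 0.05 critical value for one degree of freedom, while the coefficient of roughly 0.33 indicates only a weak association.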
2. Association measures for a 2×2 table of pair-designed data
Table 11-5 Results for bacillus diphtheriae in two culture media (from Example 11-7)
Layout: rows are culture medium A, columns are culture medium B. Column totals: 24, 32; overall total: 56.
1) Hypothesis test
H0: the results of the two media are independent
H1: the results of the two media are associated
2) Test statistic: χ²
3) Pearson's contingency coefficient: there is a weak association.
3. Association measures for an R×C table of cross-classified data
Table 11-6 Cross classification by type of thyroid enlargement and ancestral residence
Rows: ancestral residence (A, B, C, Total). Columns: type of thyroid enlargement (widespread, nodular, mixed, Total).
Question: Is the type of thyroid enlargement associated with ancestral residence?
1) Establish the hypotheses
H0: the two variables are independent
H1: the two variables are associated
2) Test statistic: by formula (7-10)
3) Contingency coefficient: by formula (11-9)
4) Conclusion: the type of thyroid enlargement is associated with ancestral residence.
Notes in application
1. r = 0 does not necessarily mean zero correlation (the correlation might be non-linear).
2. Correlation analysis is not suitable when the levels of either variable X or Y are artificially selected.
3. Outliers can heavily influence the correlation coefficient.
4. Correlation ≠ cause-effect association; correlation ≠ intrinsic association.
5. Statistical significance differs from intensity of correlation:
Statistical significance of the correlation coefficient: the probability of obtaining such an r from a population with ρ = 0 is small.
Intensity of correlation: the absolute value of r.
(a) Zero correlation changed to strong correlation. The degree of correlation is influenced by an extreme value (outlier).
(b) Strong correlation changed to zero correlation. Note: a scatter diagram can help you find outliers.
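Case (a) can be imitated numerically. In the sketch below, six hypothetical points have a correlation of exactly zero, and adding one extreme point pushes r above 0.9. All data and the function name `pearson_r` are made up for illustration and are not taken from the figures.

```python
import math

def pearson_r(x, y):
    """Pearson's r from corrected sums of squares and cross-products."""
    n = len(x)
    l_xx = sum(v * v for v in x) - sum(x) ** 2 / n
    l_yy = sum(v * v for v in y) - sum(y) ** 2 / n
    l_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return l_xy / math.sqrt(l_xx * l_yy)

# Hypothetical data chosen so that the six points are uncorrelated.
x = [1, 2, 3, 4, 5, 6]
y = [1, 3, 2, 2, 3, 1]
r_without = pearson_r(x, y)

# One extreme observation (outlier) far from the rest of the data.
r_with = pearson_r(x + [20], y + [20])

print(f"without outlier: r = {r_without:.3f}")
print(f"with outlier:    r = {r_with:.3f}")
```

A scatter plot of the seven points would show the single point responsible for the apparent correlation.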
A survey on the relationship between students' height and family income in a primary school. Panels: after pooling; in each stratum.
You may miss another type of relationship. The absence of a linear correlation (P > α) does not mean zero correlation.
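A small worked example, using made-up values, shows how an exact functional relationship can still give r = 0. Take five points on the parabola y = x²:

```latex
x = (-2, -1, 0, 1, 2), \qquad y = x^2 = (4, 1, 0, 1, 4), \qquad \bar{x} = 0, \quad \bar{y} = 2

l_{XY} = \sum (x - \bar{x})(y - \bar{y})
       = (-2)(2) + (-1)(-1) + (0)(-2) + (1)(-1) + (2)(2) = 0

r = \frac{l_{XY}}{\sqrt{l_{XX}\, l_{YY}}} = 0
```

Here y is completely determined by x, yet the linear correlation coefficient is zero, which is why the scatter plot should be inspected before r is interpreted.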
The SAS CORR Procedure
PROC CORR DATA=SAS-data-set;
    VAR variable1 variable2;
RUN;
SUMMARY
1. Simple linear correlation coefficient: r
2. Spearman rank correlation coefficient: r_s
3. Association between two categorical variables (r or τ)
Assignments
1. Patterns of relationship between two continuous variables.
2. Properties of the simple linear correlation coefficient r.
3. How many kinds of correlation coefficients are there? What type of variables is required for each of them?
4. 1, 5, 6. (pp )
Reading materials
(1) 《卫生统计学》 (Health Statistics; chief editor: Fang Jiqian), Chapter 11 (written by Qiu Xiaoqiang and Wang Tong) (pp )
(2) Biostatistical Analysis, Chapter 19: 19.1, 19.2, 19.9, (pp )
Thanks for your attention!