Principles of Biostatistics, Chapter 17: Correlation. 宇传华. Free Online Statistics Resources (8)

1 Principles of Biostatistics, Chapter 17: Correlation. 宇传华 http://statdtedm.6to23.com Free Online Statistics Resources (8)

2 Terminology (English term: Chinese term)
scatter plot: 散点图
correlation: 相关
linear correlation: 直线相关
correlation coefficient: 相关系数
Pearson’s correlation coefficient: Pearson 相关系数
Spearman’s rank correlation coefficient: Spearman 等级相关系数

3 CONTENTS
§17.1 The Two-Way Scatter Plot
§17.2 Pearson’s Correlation Coefficient: r
§17.3 Spearman’s Correlation Coefficient: r_s
§17.4 Further Application

4 Correlation (coefficient)
The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.
The population correlation is denoted by ρ (Greek letter, set in the Symbol font, pronounced "rho"). The sample correlation is denoted by r (Latin letter). ρ (r) can take on any value from -1 to 1.
The absolute value of ρ (r) indicates the strength of the relationship:
ρ (r) = -1 indicates a perfect negative linear relationship
ρ (r) = 1 indicates a perfect positive linear relationship
ρ (r) = 0 indicates no linear relationship
The sign of ρ (r) indicates the direction of the relationship:
-1 < ρ < 0 indicates a negative linear relationship
0 < ρ < 1 indicates a positive linear relationship

5 One statistical technique often employed to measure the association between two variables is known as correlation analysis. Before we conduct correlation analysis, we should always create a two-way scatter plot (scatter diagram). The X variable goes on the horizontal axis and the Y variable on the vertical axis; each point on the graph represents one pair of values (X_i, Y_i). From the scatter plot, we can often determine whether a linear relationship exists between X and Y.

6 §17.1 The Two-Way Scatter Plot. Table: Relationship between thrombin concentration (X) and clotting time (Y)

7 Scatter Plot

8 [Figure: six example scatter plots] Perfect positive correlation, r = 1; strong positive correlation, r = 0.99; positive correlation, r = 0.80; strong negative correlation, r = -0.98; no correlation, r = 0.00; non-linear correlation.

9 The importance of a scatter plot. In the next chapter (simple linear regression), we also need a scatter plot to see whether the relationship between X and Y is linear and, if so, whether it is positive. So, before any correlation or regression analysis, we should make a scatter plot.

10 §17.2 Pearson’s correlation coefficient (r)
Synonyms: product-moment correlation coefficient; simple linear correlation coefficient.
Definition: r is a statistical index that describes the intensity (strength) and the direction of the association between two variables (X, Y).
r is a dimensionless number; it has no units of measurement. -1 ≤ r ≤ 1

11 X, Y: random variables that jointly follow a bivariate normal distribution. Both X_i and Y_i are measured on the same subject i.

12 How do we calculate r?
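The formula shown on this slide did not survive in the transcript. As a hedged reconstruction, the standard product-moment formula can be written with the l_XX, l_YY, l_XY notation that the deck uses on a later slide. The expanded sums below are the usual textbook definitions and are an assumption about what the slide displayed.

```latex
r = \frac{l_{XY}}{\sqrt{l_{XX}\, l_{YY}}}, \qquad
l_{XX} = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}, \quad
l_{YY} = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n}, \quad
l_{XY} = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}
```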

13 Calculation table (x: concentration of thrombin, u/ml; y: clotting time, seconds)
Subject i   x      y     x²      y²     x×y
1           1.1    14    1.21    196    15.4
2           1.2    13    1.44    169    15.6
3           1.0    15    1.00    225    15.0
4           0.9    15    0.81    225    13.5
5           1.2    13    1.44    169    15.6
6           1.1    14    1.21    196    15.4
7           0.9    16    0.81    256    14.4
8           0.6    17    0.36    289    10.2
9           1.0    14    1.00    196    14.0
10          0.9    16    0.81    256    14.4
11          1.1    15    1.21    225    16.5
12          0.9    16    0.81    256    14.4
13          1.1    14    1.21    196    15.4
14          1.0    15    1.00    225    15.0
15          0.7    17    0.49    289    11.9
Sum         14.7   224   14.81   3368   216.7
(Σx)        (Σy)   (Σx²)   (Σy²)   (Σxy) are the column sums in the last row.

14 l_XX = 0.404, l_YY = 22.933, l_XY = -2.82
2) Calculation of r: r = l_XY / √(l_XX × l_YY) = -2.82 / √(0.404 × 22.933) ≈ -0.926
X and Y show a strong negative relationship.
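The arithmetic behind these values can be checked with a short Python sketch. It uses the 15 (x, y) pairs from the calculation table and the usual computational formulas for l_XX, l_YY and l_XY; the variable names are illustrative and not taken from the deck.

```python
import math

# Thrombin concentration x (u/ml) and clotting time y (seconds), 15 subjects
x = [1.1, 1.2, 1.0, 0.9, 1.2, 1.1, 0.9, 0.6, 1.0, 0.9, 1.1, 0.9, 1.1, 1.0, 0.7]
y = [14, 13, 15, 15, 13, 14, 16, 17, 14, 16, 15, 16, 14, 15, 17]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(v * v for v in x)
sum_y2 = sum(v * v for v in y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Corrected sums of squares and cross-products
l_xx = sum_x2 - sum_x ** 2 / n
l_yy = sum_y2 - sum_y ** 2 / n
l_xy = sum_xy - sum_x * sum_y / n

# Pearson's correlation coefficient
r = l_xy / math.sqrt(l_xx * l_yy)
print(f"l_XX={l_xx:.3f}  l_YY={l_yy:.3f}  l_XY={l_xy:.2f}  r={r:.3f}")
```

The three l values agree with the slide, and r comes out at about -0.926.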

15 Inference about the correlation coefficient r: hypothesis test
1) Establish the hypotheses and determine the significance level α
H0: ρ = 0, no linear association between X and Y
H1: ρ ≠ 0, a linear association between X and Y exists
α = 0.05 (two-sided probability of type I error)

16 2) Calculate the test statistic: t = r / √((1 - r²)/(n - 2)), with ν = n - 2 degrees of freedom.
For the above example, ν = 15 - 2 = 13. From the t distribution table (Table A4, Appendix), the critical value is t_0.05/2(13) = 2.160 < |t| = 8.874, so P < 0.05. The correlation coefficient is statistically significant at α = 0.05: concentration of thrombin and clotting time are negatively related.
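As a rough check of this step, the sketch below recomputes the t statistic in Python. The slide's own formula was lost in the transcript, so the usual statistic for testing ρ = 0, t = r·√(n − 2)/√(1 − r²), is assumed here. The l values and the critical value 2.160 are the ones quoted on the slides.

```python
import math

# Values quoted on the slides
l_xx, l_yy, l_xy = 0.404, 22.933, -2.82
n = 15
t_critical = 2.160  # t_{0.05/2}(13), as read from the t table on the slide

r = l_xy / math.sqrt(l_xx * l_yy)

# Assumed test statistic for H0: rho = 0 (standard form, not recovered from the slide)
df = n - 2
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

reject_h0 = abs(t) > t_critical
print(f"r={r:.4f}  t={t:.3f}  df={df}  reject H0: {reject_h0}")
```

The result is close to the |t| = 8.874 reported on the slide; the small difference in the last digits comes from rounding r before it is substituted.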

17 §17.3 Spearman’s Rank Correlation Coefficient: r_s
("Rank" can be rendered in Chinese as 秩 or 等级.) Spearman’s rank correlation (a nonparametric method) is applied when the two variables are distributed far from normal, i.e. when the normality requirement is not satisfied.

18 The steps of the hypothesis test
Rank the values of each of the two variables by magnitude: (X_i, Y_i) → (X_ir, Y_ir)
Calculate Spearman’s rank correlation coefficient from the ranks

19 Table: Hemorrhage degree and thrombocyte count (10⁹/L) in 12 children with acute leukemia
Patient i   Platelet X_i   Rank X_ir   (X_ir)²   Bleeding Y_i   Rank Y_ir   (Y_ir)²   X_ir×Y_ir
(1)         (2)            (3)         (4)       (5)            (6)         (7)       (8)
1           121            1           1         +++            11.5        132.25    11.5
2           138            2           4         ++             9.0         81.00     18.0
3           165            3           9         +              7.0         49.00     21.0
4           310            4           16        –              3.5         12.25     14.0
5           426            5           25        ++             9.0         81.00     45.0
6           540            6           36        ++             9.0         81.00     54.0
7           740            7           49        –              3.5         12.25     24.5
8           1060           8           64        –              3.5         12.25     28.0
9           1260           9           81        –              3.5         12.25     31.5
10          1290           10          100       –              3.5         12.25     35.0
11          1438           11          121       +++            11.5        132.25    126.5
12          2004           12          144       –              3.5         12.25     42.0
Total                      78          650                      78          630       451
For tied (equal) values, the mean rank is used instead. Six ‘–’s: mean rank = (1+2+3+4+5+6)/6 = 3.5
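The mean-rank rule for ties can be sketched in Python as follows. The function name `average_ranks` and the integer coding of the bleeding grades ('–' = 0, '+' = 1, '++' = 2, '+++' = 3) are illustrative assumptions; only the ordering of the grades matters for ranking.

```python
def average_ranks(values):
    """Rank values from smallest to largest; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = ((start + 1) + (end + 1)) / 2  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


# Data from the table; bleeding grades use the hypothetical integer coding above
platelets = [121, 138, 165, 310, 426, 540, 740, 1060, 1260, 1290, 1438, 2004]
bleeding = [3, 2, 1, 0, 2, 2, 0, 0, 0, 0, 3, 0]

x_ranks = average_ranks(platelets)
y_ranks = average_ranks(bleeding)
print(y_ranks)
```

The six '–' patients each receive rank 3.5, the three '++' patients rank 9.0, and the two '+++' patients rank 11.5, matching column (6) of the table.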

20 Calculation of r_s (numerical values are from the table above)
Column totals: ΣX_ir = 78, Σ(X_ir)² = 650, ΣY_ir = 78, Σ(Y_ir)² = 630, Σ(X_ir×Y_ir) = 451

21 Because there are tied ranks in Y, we cannot use the latter formula.
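The formulas on these two slides were not preserved in the transcript. The Python sketch below assumes that r_s is Pearson's formula applied to the ranks, and that the "latter formula" is the common shortcut 1 − 6Σd²/(n(n² − 1)), which is exact only when there are no ties. Both are assumptions about what the slides showed; the totals are the ones listed in the table.

```python
import math

# Column totals from the rank table (n = 12 patients)
n = 12
sum_x, sum_x2 = 78, 650      # sums of X ranks and squared X ranks
sum_y, sum_y2 = 78, 630      # sums of Y ranks and squared Y ranks
sum_xy = 451                 # sum of rank products

# Pearson's formula applied to the ranks (valid with or without ties)
l_xx = sum_x2 - sum_x ** 2 / n
l_yy = sum_y2 - sum_y ** 2 / n
l_xy = sum_xy - sum_x * sum_y / n
r_s = l_xy / math.sqrt(l_xx * l_yy)

# Assumed shortcut formula; sum of squared rank differences d = X rank - Y rank
sum_d2 = sum_x2 + sum_y2 - 2 * sum_xy
r_s_shortcut = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(f"r_s (ranks, ties handled) = {r_s:.3f}")
print(f"shortcut formula          = {r_s_shortcut:.3f}")
```

With these totals the rank-based formula gives about -0.422, while the shortcut gives about -0.322. The gap illustrates why the shortcut is avoided when many ranks are tied.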

22 Explanation of Spearman’s rank correlation coefficient: r_s
(1) -1 ≤ r_s ≤ 1, with a meaning similar to that of r.
(2) Difference between r_s and r (r_s ≠ r): r_s is calculated from the ranks; r is calculated from the original data values.

23 Statistical inference about r_s
1) Set up the hypotheses and determine the significance level: H0: ρ_s = 0; H1: ρ_s ≠ 0; α = 0.05
2) Calculate the test statistic
3) Conclusion: no association between platelet count and bleeding.
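The test statistic used on this slide was not preserved. One common approach, assumed here, reuses the t statistic from Pearson's test with r_s in place of r; textbooks may instead use a table of critical r_s values for small n. The critical value 2.228 is the standard two-sided 5% point of the t distribution with 10 degrees of freedom and is not quoted in the deck.

```python
import math

n = 12
# r_s recomputed from the rank totals in the table: l_XY = -56, l_XX = 143, l_YY = 123
r_s = -56 / math.sqrt(143 * 123)

# Assumed test statistic (same form as for Pearson's r)
df = n - 2
t = r_s * math.sqrt(df) / math.sqrt(1 - r_s ** 2)

t_critical = 2.228  # t_{0.05/2}(10), standard table value (not from the slides)
reject_h0 = abs(t) > t_critical
print(f"r_s={r_s:.3f}  t={t:.3f}  df={df}  reject H0: {reject_h0}")
```

Under these assumptions |t| is about 1.47, below the critical value, which is consistent with the slide's conclusion that H0 is not rejected.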

24 Notices in application
1. r = 0 does not mean no correlation (there might be a non-linear correlation). [Figure: three X-Y scatter plots] H0: ρ = 0
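A small made-up example illustrates the point: below, Y is completely determined by X (Y = X²), yet Pearson's r is zero because the relationship is not linear. The data and the helper `pearson_r` are illustrative and not from the deck.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data: a perfect U-shaped (quadratic) relationship
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

This is why a scatter plot should be inspected before concluding that two variables are unrelated.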

25 Notices in application
2. When the levels of either variable X or Y are artificially selected, Pearson’s correlation analysis is not suitable (but Spearman’s rank correlation analysis can be used). Pearson’s correlation analysis requires that both X and Y follow a normal distribution.

26 Notices in application
3. Outliers can heavily affect the correlation coefficient.
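To see how strong the effect can be, the sketch below uses made-up data: five points lying exactly on a line, then the same points plus one outlier. The data and the helper `pearson_r` are illustrative assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data: five points exactly on the line y = 2x
x_clean = [1, 2, 3, 4, 5]
y_clean = [2, 4, 6, 8, 10]

# The same data with a single outlying observation (6, 0) appended
x_outlier = x_clean + [6]
y_outlier = y_clean + [0]

r_clean = pearson_r(x_clean, y_clean)
r_outlier = pearson_r(x_outlier, y_outlier)
print(f"without outlier: r = {r_clean:.3f}")
print(f"with outlier:    r = {r_outlier:.3f}")
```

A single added point moves r from 1 to roughly 0.14 in this example.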

27 Notices in application
4. Correlation ≠ cause-effect association; correlation ≠ intrinsic association.
5. Statistical significance (P value) ≠ intensity of correlation (absolute value of r). A correlation coefficient is statistically significant when the probability of obtaining such an r from a population with ρ = 0 is small (the P value is small). The intensity of the correlation is given by the absolute value of r.
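The distinction in point 5 can be illustrated with two hypothetical cases, using the usual t statistic for testing ρ = 0 (assumed here, with ν = n − 2). The r and n values are invented for illustration; 2.447 is the standard two-sided 5% critical value for 6 degrees of freedom, and for about 1000 degrees of freedom the critical value is approximately 1.96.

```python
import math


def t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Hypothetical case A: weak correlation, very large sample
t_weak = t_statistic(0.10, 1000)   # compare with about 1.96

# Hypothetical case B: fairly strong correlation, very small sample
t_strong = t_statistic(0.60, 8)    # compare with 2.447 (df = 6)

print(f"r = 0.10, n = 1000: t = {t_weak:.3f}")
print(f"r = 0.60, n = 8:    t = {t_strong:.3f}")
```

Case A is statistically significant even though the association is weak, while case B is not significant even though r is six times larger.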

28 §17.4 Further Application
SAS code for the textbook’s Table 17.1 and Table 17.2:
/* Read 20 (X, Y) pairs, one pair per data line */
DATA EXP17_12;
  INPUT X Y;
  CARDS;
77 118
69 65
32 184
85 8
94 43
99 12
89 55
13 208
95 7
95 9
54 9
89 124
95 10
87 6
91 33
98 16
73 32
47 145
76 87
90 9
;
/* Pearson and Spearman correlation between X and Y */
PROC CORR PEARSON SPEARMAN;
  VAR X Y;
RUN;

29 The CORR Procedure
2 Variables: X Y
Simple Statistics
Variable   N    Mean       Std Dev    Median     Minimum    Maximum
X          20   77.40000   23.65409   88.00000   13.00000   99.00000
Y          20   59.00000   63.86581   32.50000   6.00000    208.00000
Pearson Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0
       X          Y
X      1.00000    -0.79107
                  <.0001
Y      -0.79107   1.00000
       <.0001
Spearman Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0
       X          Y
X      1.00000    -0.54319
                  0.0133
Y      -0.54319   1.00000
       0.0133
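As an independent cross-check of the Pearson coefficient in this output, the same 20 pairs from the EXP17_12 data step can be run through a plain Python calculation. This is only a sketch for readers without SAS; the helper `pearson_r` is illustrative and is not part of the SAS program.

```python
import math


def pearson_r(x, y):
    """Pearson's product-moment correlation from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# The 20 (X, Y) pairs from the EXP17_12 data step
pairs = [
    (77, 118), (69, 65), (32, 184), (85, 8), (94, 43),
    (99, 12), (89, 55), (13, 208), (95, 7), (95, 9),
    (54, 9), (89, 124), (95, 10), (87, 6), (91, 33),
    (98, 16), (73, 32), (47, 145), (76, 87), (90, 9),
]
x = [p[0] for p in pairs]
y = [p[1] for p in pairs]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
r = pearson_r(x, y)
print(f"mean X = {mean_x:.1f}, mean Y = {mean_y:.1f}, Pearson r = {r:.5f}")
```

The means and the Pearson coefficient agree with the "Simple Statistics" and "Pearson Correlation Coefficients" sections of the SAS output.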

30 SAS code for the textbook’s Table 17.3 and Table 17.4:
/* Read 20 (X, Y) pairs, one pair per data line */
DATA EXP17_34;
  INPUT X Y;
  CARDS;
5 600
100 3
98 67
84 170
100 6
99 15
70 120
50 170
26 300
6 830
100 10
37 800
35 500
96 60
55 100
90 10
96 5
99 5
99 8
95 120
;
/* Pearson and Spearman correlation between X and Y */
PROC CORR PEARSON SPEARMAN;
  VAR X Y;
RUN;

31 The CORR Procedure
2 Variables: X Y
Simple Statistics
Variable   N    Mean        Std Dev     Median     Minimum   Maximum
X          20   72.00000    33.79193    92.50000   5.00000   100.00000
Y          20   194.95000   268.92211   83.50000   3.00000   830.00000
Pearson Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0
       X          Y
X      1.00000    -0.87681
                  <.0001
Y      -0.87681   1.00000
       <.0001
Spearman Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0
       X          Y
X      1.00000    -0.88969
                  <.0001
Y      -0.88969   1.00000
       <.0001

32 SUMMARY
1. Simple linear correlation coefficient: r. Condition: both X and Y follow a normal distribution.
2. Spearman’s rank correlation coefficient: r_s. It does not require that X or Y follow a normal distribution.

33 Assignment: Review Exercises 5 (p. 412)

