Correlation
We now investigate the relationships that can exist among continuous variables.
Correlation analysis: correlation is defined as the quantification of the degree to which two random variables are related, provided that the relationship is linear.
17.1 Two-Way Scatter Plot
Suppose that we are interested in a pair of continuous random variables.
Example: the relationship between the percentage of children who have been immunized against the infectious diseases diphtheria, pertussis, and tetanus (DPT) and the mortality rate.
Data for a random sample of 20 countries are shown in Table 17.1 and plotted in Figure 17.1.
X: the percentage of children immunized by age one year
Y: the under-five mortality rate
Before we do any analysis, we should create a two-way scatter plot of the data. Does a relationship exist between x and y?
In this sample, the mortality rate tends to decrease as the percentage of children immunized increases.
17.2 Pearson’s Correlation Coefficient
In the underlying population from which the sample of points (x_i, y_i) is selected, the population correlation between the variables X and Y is denoted by the Greek letter ρ (read "rho").
ρ quantifies the strength of the linear relationship between the outcomes x and y.
The estimator of ρ is known as Pearson’s coefficient of correlation, or simply the correlation coefficient (r).
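The sample correlation coefficient is denoted by r:
$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
Here \bar{x} and \bar{y} are the sample means, and s_x and s_y are the sample standard deviations, of the x and y values.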
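The definition of r can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `pearson_r` and the five data pairs are hypothetical and are not the values of Table 17.1.

```python
import math

def pearson_r(x, y):
    """Sample correlation: r = (1/(n-1)) * sum(z_x * z_y), with z-scores based on s_x and s_y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Sample standard deviations (divisor n - 1)
    s_x = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    products = (((xi - xbar) / s_x) * ((yi - ybar) / s_y) for xi, yi in zip(x, y))
    return sum(products) / (n - 1)

# Hypothetical illustration data (NOT Table 17.1)
immunized = [20, 40, 60, 80, 95]     # percentage immunized
mortality = [200, 150, 90, 40, 10]   # under-five mortality rate
r = pearson_r(immunized, mortality)
print(round(r, 3))  # negative: mortality falls as immunization rises
```

Because both variables are standardized before multiplying, the result carries no units.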
The correlation coefficient is a dimensionless number; it has no units of measurement.
The values r = 1 and r = -1 occur when there is an exact linear relationship between x and y (Figure 17.2 (a), (b)).
If y tends to increase as x increases, r is greater than 0; x and y are said to be positively correlated (r > 0).
If y tends to decrease as x increases, r is less than 0; the two variables are negatively correlated (r < 0).
If r = 0, there is no linear relationship between x and y, and the variables are uncorrelated (Figure 17.2 (c), (d)).
In this sample:
- Strong linear relationship
- Negative association: the mortality rate decreases as the percentage of immunization increases
The correlation coefficient merely tells us that a linear relationship exists between two variables; it does not specify whether the relationship is cause-and-effect.
We would also like to be able to draw conclusions about the unknown population correlation ρ using the sample correlation coefficient r.
H0: ρ = 0 (no association between X and Y)
H1: ρ ≠ 0 (association between X and Y)
The estimated standard error of r:
$$\widehat{se}(r) = \sqrt{\frac{1-r^2}{n-2}}$$
The test statistic (under H0):
$$t = \frac{r}{\sqrt{(1-r^2)/(n-2)}} = r\sqrt{\frac{n-2}{1-r^2}}$$
If we assume that the pairs of observations were obtained randomly and that both X and Y are normally distributed, this statistic has a t distribution with n - 2 degrees of freedom under H0.
If ρ is equal to some other value, represented by ρ0, the sampling distribution of r is skewed, and the test statistic no longer follows a t distribution.
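As a rough sketch of the test of H0: ρ = 0, the statistic can be computed directly from r and n. The function name `correlation_t_statistic` and the value r = -0.5 are hypothetical, chosen only for illustration; n = 20 matches the sample size of the example. The p-value is left out because the Python standard library has no t distribution; the statistic would be compared with a t table at n - 2 degrees of freedom.

```python
import math

def correlation_t_statistic(r, n):
    """Return (t, df) for testing H0: rho = 0, where t = r / sqrt((1 - r^2) / (n - 2))."""
    if n < 3:
        raise ValueError("need at least 3 pairs of observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    se = math.sqrt((1.0 - r ** 2) / (n - 2))  # estimated standard error of r
    return r / se, n - 2

# Hypothetical sample correlation (assumption, not the value from the text)
t, df = correlation_t_statistic(-0.5, 20)
print(round(t, 3), df)
```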
The coefficient of correlation r has several limitations:
1. It quantifies only the strength of the linear relationship between two variables; if the relationship is nonlinear, r may fail to detect it.
2. Care must be taken when the data contain any outliers, or pairs of observations that lie considerably outside the range of the other data points; extreme values can lead to misleading results.
3. The estimated correlation should never be extrapolated beyond the observed ranges of the variables; the relationship between X and Y may change outside of this region.
4. A high correlation between two variables does not imply a cause-and-effect relationship.
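The first two limitations can be illustrated with small made-up data sets. In this sketch (all names and numbers are hypothetical), an exact quadratic relationship yields r = 0, and a single outlying pair turns an uncorrelated sample into one with r close to 1.

```python
import math

def pearson_r(x, y):
    """Pearson's r written as S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    s_yy = sum((yi - ybar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# 1. Nonlinear relationship: y is fully determined by x, yet r = 0
x_quad = [-2, -1, 0, 1, 2]
y_quad = [v ** 2 for v in x_quad]
r_quadratic = pearson_r(x_quad, y_quad)

# 2. Outlier: five uncorrelated pairs, then the same pairs plus one extreme point
x_base = [1, 2, 3, 4, 5]
y_base = [2, 1, 3, 1, 2]
r_base = pearson_r(x_base, y_base)
r_with_outlier = pearson_r(x_base + [20], y_base + [20])

print(r_quadratic, round(r_base, 3), round(r_with_outlier, 3))
```

A scatter plot of each data set would reveal both problems at a glance, which is one reason to plot the data before computing r.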
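17.3 Spearman’s Rank Correlation Coefficient
Pearson’s correlation coefficient is very sensitive to outlying values. We may be interested in calculating a measure of association that is more robust.
One approach is to rank the two sets of outcomes x and y separately and to compute the correlation of the ranks; the result is known as Spearman’s rank correlation coefficient (a nonparametric method).
Spearman’s rank correlation coefficient:
$$r_s = \frac{\sum_{i=1}^{n}(x_{ri}-\bar{x}_r)(y_{ri}-\bar{y}_r)}{\sqrt{\sum_{i=1}^{n}(x_{ri}-\bar{x}_r)^2\,\sum_{i=1}^{n}(y_{ri}-\bar{y}_r)^2}}$$
where x_{ri} and y_{ri} are the ranks associated with the ith subject rather than the actual observations, and \bar{x}_r and \bar{y}_r are the means of those ranks.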
An equivalent method for computing r_s is provided by
$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}$$
n: the number of data points in the sample
d_i: the difference between the rank of x_i and the rank of y_i
-1 ≤ r_s ≤ 1
High degree of correlation between x and y: r_s close to -1 or 1
A lack of linear association between the ranks of the two variables: r_s close to 0
When the data are ordinal, or when the conditions required for Pearson’s r do not hold, we should use r_s.
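Both ways of computing r_s can be checked against each other on a small example. The sketch below is illustrative only: the helper names and the five data pairs are hypothetical, and tied values receive the average of their ranks (a common convention, assumed here). The shortcut based on d_i agrees exactly with the correlation of the ranks only when there are no ties.

```python
import math

def average_ranks(values):
    """Rank from 1 (smallest) to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_from_ranks(x, y):
    """Pearson's formula applied to the ranks of x and y."""
    xr, yr = average_ranks(x), average_ranks(y)
    n = len(xr)
    mx, my = sum(xr) / n, sum(yr) / n
    num = sum((a - mx) * (b - my) for a, b in zip(xr, yr))
    den = math.sqrt(sum((a - mx) ** 2 for a in xr) * sum((b - my) ** 2 for b in yr))
    return num / den

def spearman_from_differences(x, y):
    """Shortcut r_s = 1 - 6 * sum(d_i^2) / (n (n^2 - 1)); exact only without ties."""
    xr, yr = average_ranks(x), average_ranks(y)
    n = len(xr)
    d_squared = sum((a - b) ** 2 for a, b in zip(xr, yr))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical data without ties (NOT Table 17.1)
x = [10, 30, 50, 70, 90]
y = [200, 180, 140, 150, 5]
rs_ranks = spearman_from_ranks(x, y)
rs_d = spearman_from_differences(x, y)
print(round(rs_ranks, 3), round(rs_d, 3))
```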
Spearman’s rank correlation coefficient may also be thought of as a measure of the concordance (agreement) of the ranks for the outcomes x and y.

Case I
Nation            Percentage immunized   Rank   Mortality rate   Rank   d_i
Ethiopia          13                     1      6
Cambodia          32                     2      7
Senegal           47                     3      8
…
Czech Republic    99                     20     208
Case II
Nation            Percentage immunized   Rank   Mortality rate   Rank   d_i
Ethiopia          13                     1      208              20     -19
Cambodia          32                     2      184              19     -17
Senegal           47                     3      145              18     -15
…
Czech Republic    99                     20     6
If n is not too small and if we can assume that pairs of ranks are chosen randomly, we can test the null hypothesis H0: ρ = 0.
The test statistic is
$$t_s = r_s\sqrt{\frac{n-2}{1-r_s^2}}$$
which is compared with the t distribution with n - 2 degrees of freedom.
This testing procedure does not require that X and Y be normally distributed.
About r_s:
- It is much less sensitive to outlying values than Pearson’s correlation coefficient.
- It can be used when one or both of the relevant variables are ordinal.
- It relies on ranks rather than on actual observations.
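The difference in sensitivity to outlying values can be seen on a made-up sample in which y increases with x but the last observation is extreme. The data and helper names below are hypothetical, and the simple ranking helper assumes there are no ties.

```python
import math

def pearson_r(x, y):
    """Pearson's r written as S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    s_yy = sum((yi - ybar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def simple_ranks(values):
    """Ranks 1..n for data WITHOUT ties (assumption of this sketch)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

def spearman_rs(x, y):
    """Spearman's r_s as Pearson's r computed on the ranks."""
    return pearson_r(simple_ranks(x), simple_ranks(y))

# Hypothetical data: perfectly increasing, but the last y value is an outlier
x = list(range(1, 11))
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]

r = pearson_r(x, y)
rs = spearman_rs(x, y)
print(round(r, 3), round(rs, 3))
```

The outlier changes the actual y values but not their ordering, so the ranks, and therefore r_s, are unaffected, while r is pulled well below 1.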