Numerical Analysis EE, NCKU Tien-Hao Chang (Darby Chang)
Correlation coefficient. What we need is a single summary number that answers the following questions: Does a relationship exist? If so, is it a positive or a negative relationship? Is it a strong or a weak relationship? The correlation coefficient is a single summary number that gives you a good idea of how closely one variable is related to another variable.
Correlation coefficient: two-way scatter plot. Suppose that we are interested in the relationship between a pair of continuous random variables, for example the percentage of children who have been immunized against DPT (diphtheria, pertussis, and tetanus) and the mortality rate. Data for a random sample of 20 countries are shown in the next slide. X: the percentage of children immunized by age one year. Y: the under-five mortality rate. Before we do any analysis, we should create a two-way scatter plot of the data. Does a relationship exist between x and y? The mortality rate tends to decrease as the percentage of children immunized increases.
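Pearson’s CC. In the underlying population from which the sample of points $(x_i, y_i)$ is selected, the population correlation between the variables X and Y is $\rho = E[(X - \mu_x)(Y - \mu_y)] / (\sigma_x \sigma_y)$, where $\mu_x, \mu_y$ are the population means and $\sigma_x, \sigma_y$ the population standard deviations. It quantifies the strength of the linear relationship between the outcomes x and y. The estimator of ρ, denoted r, is known as Pearson’s coefficient of correlation, or simply the correlation coefficient: $r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$, where $\bar{x}$ and $\bar{y}$ are the sample means.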
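As a rough illustration, Pearson’s r can be computed with the usual sample formula: the sum of cross-deviations from the means, divided by the square root of the product of the two sums of squared deviations. The sketch below is one possible implementation, not the course's own code; the function name `pearson_r` and the data are hypothetical and chosen only to show the two exact-line cases.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-deviations and sums of squared deviations.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when x or y is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only (not the immunization data set).
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # points on an exact increasing line
y_down = [10, 8, 6, 4, 2]   # points on an exact decreasing line

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(r_up, r_down)
```

The two data sets lie on exact straight lines, so r should come out at the two extremes of its range, one positive and one negative.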
The correlation coefficient is a dimensionless number; it has no units of measurement. The values r = 1 and r = -1 occur when there is an exact linear relationship between x and y. If y tends to increase as x increases, r is greater than 0, and x and y are said to be positively correlated. If y tends to decrease as x increases, r is less than 0, and the two variables are negatively correlated. If r = 0, there is no linear relationship between x and y, and the variables are uncorrelated. http://cclearn.npue.edu.tw/tuition/ccchen-web/教育統計學/7.pdf
http://upload.wikimedia.org/wikipedia/commons/0/02/Correlation_examples.png
CC is not a percent. In addition to telling you whether two variables are related, whether the relationship is positive or negative, and how large the relationship is, the correlation coefficient carries one more important piece of information: how much of the variation in one variable is related to changes in the other variable. It does not state this directly, though. A correlation coefficient is a “ratio”, not a percent. Many students think that r = 0.90 means that 90% of the changes in one variable are accounted for by, or related to, the other variable. Even worse, some think it means that any predictions you make will be 90% accurate. Neither is correct!
Correlation coefficient: coefficient of determination. However, it is very easy to translate the correlation coefficient into a percentage. All you have to do is square the correlation coefficient, that is, multiply it by itself. So if the symbol for the correlation coefficient is r, the symbol for this new statistic is simply r², read “r squared”. r², also called the coefficient of determination, tells you how much of the variation in one variable is directly related to (or accounted for by) the variation in the other variable.
The correlation coefficient is r = 0.80. By squaring r to get r² = 0.64, you find that fully 64% of the variation in scores on Variable B is directly related to the scores on Variable A.
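The conversion from r to a percentage is a one-line calculation. The sketch below uses a hypothetical helper, `percent_variation_explained`, to show that r = 0.80 corresponds to 64% and that r = 0.90 corresponds to 81% rather than 90%.

```python
def percent_variation_explained(r):
    """Return 100 * r**2, the coefficient of determination as a percentage."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    return 100.0 * r * r

# The sign of r is lost when squaring: r = -0.50 and r = +0.50 give the same r^2.
for r in (0.80, 0.90, -0.50):
    print(f"r = {r:+.2f} -> r^2 = {percent_variation_explained(r):.0f}%")
```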
Statistical test
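Correlation coefficient: statistical inference. To test whether the correlation between two variables is significant, the hypotheses concern the population correlation ρ: H0: ρ = 0, H1: ρ ≠ 0. The test statistic (under H0) is $t = r\sqrt{\frac{n-2}{1-r^2}}$, which follows a t distribution with n-2 degrees of freedom. http://zoro.ee.ncku.edu.tw/mlb2009/res/14-ch5.pdf (pp. 9-14)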
Test the significance of the correlation coefficient for the age and blood pressure data. Suppose that n = 6, r = 0.897, and α = 0.05.
Step 1: State the hypotheses. H0: ρ = 0, H1: ρ ≠ 0.
Step 2: Find the critical values. Since α = 0.05 and there are 6 - 2 = 4 degrees of freedom, the critical values are t = +2.776 and t = -2.776.
Step 3: Compute the test value. $t = r\sqrt{(n-2)/(1-r^2)} = 0.897\sqrt{4/(1 - 0.897^2)} \approx 4.059$.
Step 4: Make the decision. Reject the null hypothesis, since the test value falls in the critical region (4.059 > 2.776).
Step 5: Summarize the results. There is a significant linear relationship between the variables age and blood pressure.
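The five steps above can be sketched in a few lines of code, assuming the standard test statistic t = r·sqrt((n-2)/(1-r²)) with n-2 degrees of freedom. The function name `correlation_t_statistic` is hypothetical, and the critical value 2.776 is taken directly from the example rather than computed.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, to be compared with t(n - 2)."""
    if n < 3:
        raise ValueError("need at least 3 pairs of observations")
    if abs(r) >= 1:
        raise ValueError("the statistic is undefined when |r| = 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Values from the age and blood pressure example.
n, r = 6, 0.897
t_critical = 2.776  # two-sided, alpha = 0.05, 4 degrees of freedom (from the example)

t_value = correlation_t_statistic(r, n)
reject_h0 = abs(t_value) > t_critical  # two-sided decision rule
print(f"t = {t_value:.3f}, reject H0: {reject_h0}")
```

In practice the critical value would usually come from a t table or a statistics package instead of being hard-coded.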
Correlation coefficient: limitations. It quantifies only the strength of the linear relationship between two variables. Care must be taken when the data contain outliers, that is, pairs of observations that lie considerably outside the range of the other data points. A high correlation between two variables does not imply a cause-and-effect relationship.
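The first two limitations are easy to reproduce numerically. In the sketch below, all data are hypothetical: a parabola gives r = 0 even though y is completely determined by x, and a single outlying pair turns a clearly negative correlation into a positive one. The helper `pearson_r` assumes the usual sample formula.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient (usual sample formula)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Limitation 1: only linear association is measured.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]          # exact, but nonlinear, dependence on x
r_parabola = pearson_r(x, y)

# Limitation 2: sensitivity to outliers (hypothetical data).
x_clean = [1, 2, 3, 4, 5]
y_clean = [5, 3, 4, 2, 1]          # y tends to decrease as x increases
r_clean = pearson_r(x_clean, y_clean)
r_outlier = pearson_r(x_clean + [20], y_clean + [20])  # add one extreme pair

print(r_parabola, r_clean, r_outlier)
```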
Four sets of data with the same correlation of 0.816 http://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Anscombe%27s_quartet_3.svg/2000px-Anscombe%27s_quartet_3.svg.png
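The figure's claim can be checked numerically. The slides do not list the data, so the values below are an assumption: they are the widely reproduced Anscombe quartet values and should be verified against a primary source before being relied on. The helper `pearson_r` is a hypothetical name and uses the usual sample formula.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient (usual sample formula)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Assumed data: Anscombe's quartet as commonly reproduced (not from the slides).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by sets I, II, III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

correlations = {name: pearson_r(x, y) for name, (x, y) in quartet.items()}
for name, r in correlations.items():
    print(f"set {name}: r = {r:.3f}")
```

The four scatter plots look very different, which is why plotting the data before interpreting r is worthwhile.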
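Spearman’s rank CC. Pearson’s correlation coefficient is very sensitive to outlying values, so we may be interested in a measure of association that is more robust. One approach is to rank the two sets of outcomes x and y separately and compute the correlation coefficient of the ranks, known as Spearman’s rank correlation coefficient: $r_s = \frac{\sum_{i=1}^{n}(x_{ri} - \bar{x}_r)(y_{ri} - \bar{y}_r)}{\sqrt{\sum_{i=1}^{n}(x_{ri} - \bar{x}_r)^2 \sum_{i=1}^{n}(y_{ri} - \bar{y}_r)^2}}$, where $x_{ri}$ and $y_{ri}$ are the ranks associated with the ith subject rather than the actual observations, and $\bar{x}_r$ and $\bar{y}_r$ are the mean ranks.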
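A minimal sketch of this idea: rank each variable separately, then apply the Pearson formula to the ranks. The function names and the data are hypothetical, and assigning tied values the average of their ranks is one common convention, assumed here rather than taken from the slides.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient (usual sample formula)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0   # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_r(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: y increases with x, but the last value is an outlier.
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 5, 6, 8, 90]
r_pearson = pearson_r(x, y)
r_spearman = spearman_r(x, y)
print(r_pearson, r_spearman)
```

Because the ranks of y ignore how extreme the last value is, the rank correlation reflects only the monotonic ordering, while Pearson's r is pulled down by the outlier.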
About Correlation Coefficient
Statistical inference.
Basic tests: tests about proportions; tests about one mean; tests of the equality of two means; tests for variances.
References: http://zoro.ee.ncku.edu.tw/mlb2009/res/14-ch5.pdf (pp. 27-33) http://www.math.isu.edu.tw/finance/course/sta/ch8.ppt http://www.tnb.org.tw/Image/ttest.ppt http://www.mis.ncyu.edu.tw/course/download/cftai/Chapter%206.%20Continuous%20Probability%20Distribution.PPT
More advanced tests: ANOVA (analysis of variance); goodness of fit (Wilcoxon test, Kolmogorov-Smirnov test, …).
Multivariate analysis.
Statistics: ANOVA; multiple linear regression (http://www.sjsu.edu/faculty/gerstman/biostat-text/Gerstman_PP15.ppt http://www.stat.nuk.edu.tw/Ray-Bing/regression/regression/Chapter3.ppt); PCA (principal component analysis); ICA (independent component analysis); LDA (linear discriminant analysis). All techniques so far belong to statistics. You can find them in most statistical software, such as MATLAB, R (http://www.r-project.org/), SPSS…
Machine learning: Naïve Bayes (http://zoro.ee.ncku.edu.tw/mlb2009/res/11-ch4.pdf pp. 13-27); LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/); RVKDE (http://mbi.ee.ncku.edu.tw/wiki/doku.php?id=rvkde).