LECTURE 5 Correlation
Correlation The Pearson Product-Moment Correlation Coefficient (r), or correlation coefficient for short, is a measure of the degree of linear relationship between two variables. The correlation coefficient may take on any value between plus and minus one (-1 ≤ r ≤ +1). The sign of the correlation coefficient (+, -) defines the direction of the relationship, either positive or negative. A positive correlation coefficient means that as the value of one variable increases, the value of the other variable also increases; as one decreases, the other decreases. A negative correlation coefficient indicates that as one variable increases, the other decreases, and vice versa.
Correlation Taking the absolute value of the correlation coefficient measures the strength of the relationship. A correlation coefficient of r = 0.50 indicates a stronger degree of linear relationship than one of r = 0.40. Likewise, a correlation coefficient of r = -0.50 shows a greater degree of relationship than one of r = 0.40, because |-0.50| > |0.40|. A correlation coefficient of zero (r = 0.0) indicates the absence of a linear relationship, and correlation coefficients of r = +1.0 and r = -1.0 indicate a perfect linear relationship.
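As an illustrative sketch (the data below are made up, not from the lecture), Pearson's r can be computed directly from its definition: the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient for paired samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sums of squared deviations for each variable
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# A perfect positive linear relationship gives r = +1.0
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
# A perfect negative linear relationship gives r = -1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```

The two limiting cases match the slide: when every point falls exactly on a straight line, r reaches +1.0 or -1.0 depending on the line's slope.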
Assumptions in Correlation Analysis Related Pairs: data must be collected from related pairs. Scale of Measurement: data should be interval or ratio in nature. An exception to the preceding rule occurs when a nominal (categorical) scale is dichotomous (has two levels), or when a larger number means that the object has more of something (ordinal). Normality: data should be normally distributed. Linearity: the relationship between the two variables must be linear. Homoscedasticity: the variability of one variable should be roughly the same across all values of the other.
Scatterplot The scatterplot can be used to examine the linearity and homoscedasticity of the data. To check linearity, the points on the scatterplot should cluster around a straight line. To check that homoscedasticity has not been violated, the points on the scatterplot should be scattered uniformly around that line. The scatterplot also illustrates how the correlation coefficient changes as the linear relationship between the two variables is altered. When r = 0.0 the points scatter widely about the plot, the majority falling roughly in the shape of a circle. As the linear relationship increases, the circle becomes more and more elliptical in shape until the limiting case is reached (r = 1.00 or r = -1.00) and all the points fall on a straight line.
Scatterplots [Figure: eight example scatterplots with r = 0.17, r = -0.33, r = 0.39, r = 0.42, r = -0.54, r = 0.85, r = -0.94, and r = 1.00, becoming progressively more elliptical as |r| increases]
The Correlation Matrix The correlation matrix is a convenient way of summarizing a large number of correlation coefficients by putting them in a single table. A correlation matrix is a table of all possible correlation coefficients between a set of variables. For example, if there are five questions on the questionnaire there are 5 * 5 = 25 cells in the table, although only 5 * 4 / 2 = 10 of them are distinct pairwise correlations: the diagonal entries are all 1.0 (each variable correlated with itself) and the matrix is symmetric. Each computed correlation is placed in the table, with the variables as both rows and columns, at the intersection of the row and column variable names.
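The structure described above can be sketched in code. This is a minimal illustration with hypothetical data for three of the questionnaire variables (the values are invented, not the lecture's data set); each cell holds the correlation between its row and column variables, and the diagonal is always 1.0.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient for paired samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical responses for three questionnaire variables
data = {
    "AGE":   [19, 21, 24, 30, 35],
    "VISIT": [5, 8, 12, 15, 20],
    "KNOW":  [3, 4, 6, 7, 9],
}
names = list(data)

# Build the full matrix: cell (row, col) = r between the row and column variables
matrix = {r: {c: round(pearson_r(data[r], data[c]), 2) for c in names}
          for r in names}

# Print it with variables as both rows and columns, as in SPSS output
print("      " + "  ".join(f"{c:>5}" for c in names))
for r in names:
    print(f"{r:>5} " + "  ".join(f"{matrix[r][c]:5.2f}" for c in names))
```

Note that the matrix is symmetric (the AGE/VISIT cell equals the VISIT/AGE cell), which is why only the lower or upper triangle is usually reported.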
The Correlation Matrix - Example For example, consider a questionnaire with the following variables. AGE - What is your age? _____ KNOW - Number of correct answers out of 10 possible on a geography quiz which consisted of correctly locating 10 states on a map of the United States. VISIT - How many states have you visited? _____ COMAIR - Have you ever flown on a commercial airliner? _____ SEX - 1 = Male, 2 = Female. One could calculate the correlation between AGE and KNOW, AGE and VISIT, AGE and COMAIR, AGE and SEX, KNOW and VISIT, etc.
The Correlation Matrix - SPSS To calculate a correlation matrix using SPSS, select Analyze, Correlate, Bivariate. Then select the variables that are to be included in the correlation matrix.
The Correlation Matrix – SPSS Option For quantitative, normally distributed variables, choose the Pearson correlation coefficient. If your data are not normally distributed or have ordered categories (Ordinal), choose Kendall’s tau-b or Spearman. Test of Significance: You can select two-tailed or one-tailed probabilities. If the direction of association is known in advance, select One-tailed. Otherwise, select Two-tailed. Flag significant correlations. Correlation coefficients significant at the 0.05 level are identified with a single asterisk, and those significant at the 0.01 level are identified with two asterisks.
The Correlation Matrix - Output [Figure: SPSS output showing the correlation matrix for the questionnaire variables]
The Correlation Matrix - Analysis The strongest relationship was between the number of states visited and whether or not the student had flown on a commercial airplane (r = .42), which indicates that if a student had flown he/she was more likely to have visited more states. Age was positively correlated with number of states visited (r = .22) and with flying on a commercial airplane (r = .19), with older students more likely both to have visited more states and to have flown, although the relationships were not very strong. The greater the number of states visited, the more states the student was likely to correctly identify on the map, although again the relationship was weak (r = .28). Sex of the participant was slightly correlated with both age (r = .17), indicating that females were slightly older than males, and number of states visited (r = -.16), indicating that females had visited fewer states than males (female is coded 2 and male is coded 1).
Partial Correlation Partial correlation provides us with a single measure of linear association between two variables while adjusting for the effects of one or more additional variables. For example, there seems to be a high correlation between consumption of slimming products and weight loss. There might be an additional variable that increases or decreases the relationship between these two variables (i.e., amount of exercise). By controlling for amount of exercise, we can better analyze whether there is indeed a high correlation between consumption of slimming products and weight loss.
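The first-order partial correlation can be sketched from the standard formula, which adjusts the zero-order correlation between x and y for their shared correlation with the control variable z. The data below are invented solely to illustrate the slimming-product example; they are not real measurements.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient for paired samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(x, y, z):
    """First-order partial correlation of x and y, controlling for z:
    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz**2) * (1 - ryz**2))

# Hypothetical data: product consumption, weight loss, and exercise hours
product  = [1, 3, 2, 5, 4]
weight   = [1, 4, 3, 7, 6]
exercise = [2, 3, 4, 6, 7]

# Zero-order correlation versus correlation with exercise controlled for
print("r(product, weight)          =", round(pearson_r(product, weight), 3))
print("r(product, weight | exercise) =", round(partial_r(product, weight, exercise), 3))
```

If the partial correlation is much smaller than the zero-order correlation, the apparent relationship is largely explained by the control variable; if it stays high, the two variables are related even after adjusting for it.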
Correlations and Causation It is possible for two variables to be related (correlated) without one variable causing the other. For example, suppose there exists a high correlation between the amount of ice cream sold and the number of drowning deaths. Does that mean that one should not eat ice cream before one swims? Not necessarily. Both of the above variables are related to a common variable, the heat of the day. The hotter the temperature, the more ice cream is sold and also the more people are swimming, and thus the more drowning deaths. This is an example of correlation without causation.