1
Correlation and the Pearson r
Dr. Richard Jackson. This module covers the concept of correlation and the widely used statistic known as the Pearson r. © Mercer University 2005 All Rights Reserved
2
Correlation Definition: Relationship between 2 (or more) Variables
Correlation may be defined as the relation that may exist between two or more variables. An example might be the relationship that exists between the height and weight of a group of individuals, or the age and cholesterol levels of a group of individuals.
3
Pearson r Relationship Between 2 Variables
Continuous and Normally Distributed; Linear Relationship. The most widely used correlation coefficient is known as the Pearson r. The Pearson r measures the relationship between two variables. The requirements are that the variables must both be continuous and normally distributed, and that the relationship be linear. To determine whether the relationship between two variables tends to be linear, a scatter diagram should be constructed.
4
Calculated Value of r Varies from –1.00 to +1.00
Indicates Direction and Magnitude of Relationship. The Pearson r is calculated from a formula, and the coefficient varies from minus 1 to plus 1. The calculated value for the Pearson r indicates two things: the magnitude and the direction of the relationship between the two variables. The closer the Pearson r is to minus 1 or plus 1, the greater the magnitude of the relationship. The sign indicates the direction of the relationship. If the sign is plus, the relationship between the two variables is positive, such as the height and weight of a group of individuals. If the sign is negative, the relationship is inverse, such as the average speed of an automobile and its miles per gallon, or the pressure and volume of a gas.
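The module does not show the computational formula for the Pearson r, so here is a minimal sketch of it as the covariance of the two variables divided by the product of their deviation terms. The function name and the data are my own illustrative inventions, not from the module:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: sum of cross-deviations divided by the
    square roots of the summed squared deviations of x and of y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical dose (x) and blood-level (y) data:
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)   # about 0.77: a fairly strong positive relationship
```

A value near +1 or -1 indicates a strong relationship; the sign gives its direction, exactly as described above.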
5
Scatter Diagram Graphic Representation of each Subject on 2 Variables
A scatter diagram is a graphic representation of each subject on two variables.
6
Two Variables Usually Designated X and Y
X Usually Independent Variable (Dose); Y Usually Dependent Variable (Blood Level). The two variables are usually designated as x and y; x is usually the independent variable, such as dose, and y is usually the dependent variable, such as blood level. The next slide is an example of a perfect positive relationship. We have five individuals and two measures on each of those five individuals: both an x measure and a y measure. We could, for example, let x be dose and y be the resultant blood level.
7
Perfect Positive Relationship
 X   Y
20  20
30  30
40  40
50  50
60  60
8
Perfect Positive Relationship
This represents the scatter diagram of these 5 individuals, and you can see it slopes upward to the right. This indicates a positive relationship. Since the data fall exactly on a straight line, it is an example of a perfect positive relationship, and the Pearson r calculated from these data would be plus 1.0. r = +1.00
9
Perfect Negative Relationship
 X   Y
20  60
30  50
40  40
50  30
60  20
This is an example of 5 sets of data that illustrates a perfect negative relationship. As you can see, as we increase the value on the x variable, we decrease the value on the y variable.
10
Perfect Negative Relationship
This represents the scatter diagram for a perfect negative relationship. It slopes downward to the right, indicating that when we increase our x values we decrease our y values, and the Pearson r calculated on these data would be minus 1.0. r = -1.00
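As a numerical check on the two examples above, a small sketch of the Pearson r (the helper function is my own, not from the module) reproduces the stated coefficients for the perfect positive data and the perfect negative data from these slides:

```python
import math

def pearson_r(x, y):
    """Pearson r: cross-deviations over the root of the squared deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

x = [20, 30, 40, 50, 60]
y_pos = [20, 30, 40, 50, 60]   # perfect positive data from the slides
y_neg = [60, 50, 40, 30, 20]   # perfect negative data from the slides

print(pearson_r(x, y_pos))     # 1.0: all points on an upward-sloping line
print(pearson_r(x, y_neg))     # -1.0: all points on a downward-sloping line
```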
11
No Relationship
 X   Y
20  30
30  50
40  60
50  40
60  20
These data are illustrative of no relationship existing between the two variables.
12
No Relationship. This scatter diagram shows that no relationship really exists between the two variables, inasmuch as the points in the diagram are somewhat randomly distributed. r = 0.00
13
Points on Scatter diagram
The Closer the Points Fall Along a Straight Line, the Closer r Is to +1 or -1; the More Distributed Away From a Straight Line, the Closer r Is to 0. On a scatter diagram, the closer the points tend to fall along a straight line, the closer the Pearson r is to either plus 1 or minus 1. The direction of the relationship is determined by whether the line is upward sloping to the right or downward sloping to the right. The more distributed the points on a scatter diagram are away from a straight line, the closer the Pearson r will be to 0. Of course, if they do not fall on a straight line at all and are just randomly scattered about the scatter diagram, the Pearson r would be zero. To the extent that they do form a straight line, the Pearson r moves closer to either plus 1 or minus 1.
14
Clinical Example
Age:         22  23  24  25  30  32  34  40  45
Cholesterol: 190 200 215 220 210 240
Let's take a look at a clinical example to measure the relationship between age and cholesterol level. Here we see approximately 10 patients. Their ages are recorded in the left-hand column and cholesterol levels in the right-hand column. It is not necessary that they be arranged in any ascending or descending order. If the Pearson r were calculated on these data, it would come out positive. The plus sign indicates a positive relationship between the two variables, in that as one increases the other tends to increase. The numerical value indicates the degree of relationship between those two variables. r =
15
Null Hypothesis of r: H0: R = 0 (R is the Correlation in the Population)
Calculated r in Example = ; p = ; Conclusion: Accept H0, No Correlation; If Increase N to >12, Reject H0.

To analyze this Pearson r, it is necessary to determine whether the Pearson r that we have calculated is significantly different from 0. We begin by stating the null hypothesis, which in this case is capital R equals 0. The capital R represents the correlation that exists in the population; the symbol for the Pearson r calculated on the sample is a small letter r. What we want to do is see whether the Pearson r that we calculated is significantly different from zero. Even if there is no correlation between the two variables in the population, it is possible that a sample we select could show a high correlation simply by chance alone.

So, like the Student t, the F, and chi square, we calculate a Pearson r and the p value associated with it. If the p value is less than 0.05, we reject the null hypothesis; in this case, the alternative hypothesis would be that R is not equal to zero. If the p is greater than 0.05, we accept the null hypothesis that the correlation in the population is zero.

The p associated with this study's Pearson r is greater than 0.05; therefore we accept the null hypothesis that capital R, the correlation in the population, is equal to zero. The conclusion is that there is no correlation or relationship between these two variables. Now, the calculated value of the Pearson r may look like a moderate degree of correlation, somewhere between 0 and plus 1, but remember that you must test the significance of the correlation to make any conclusion. When we test the significance and accept the null hypothesis, the conclusion is that there is no correlation between those two variables, even though the calculated value looks like a moderate correlation.
Now it is interesting to note that if we increase our sample size to greater than 12 and obtain the same correlation coefficient, we would be led to reject the null hypothesis and accept the alternative hypothesis that R is not equal to zero. In other words, the Pearson r would be significant: even though the Pearson r was the same number, increasing the sample size might lead us to a different conclusion. So you cannot just look at the Pearson r in the sample; you have to look at the p value associated with it to tell whether or not it is "statistically significant," and by statistically significant we mean significantly different from zero. If it is not, we conclude that there is no correlation.
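The module does not show a formula for testing r, but a standard approach (not given in the module; the r and sample sizes below are hypothetical) converts r into a t statistic with N - 2 degrees of freedom, t = r * sqrt(N - 2) / sqrt(1 - r^2), and compares it to a critical t. The sketch below illustrates the sample-size effect described above: the same r = 0.5 fails to reach significance with 10 pairs but is significant with 30. The critical t values are taken from a standard two-tailed t table at alpha = 0.05:

```python
import math

def t_from_r(r, n):
    """t statistic for testing H0: R = 0, with df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Two-tailed critical t at alpha = 0.05, keyed by df (standard t table).
T_CRIT = {8: 2.306, 28: 2.048}

r = 0.5
print(t_from_r(r, 10))   # about 1.63 < 2.306 -> accept H0 with n = 10
print(t_from_r(r, 30))   # about 3.06 > 2.048 -> reject H0 with n = 30
```

The identical r changes conclusion purely because N changed, which is exactly why the p value, not the raw r, must drive the decision.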
16
Statistix Program Calculates r Provides p Associated with r
The Statistix program, the software package we have, does calculate the Pearson r, and it provides you with the p associated with that calculated Pearson r. A description of how to use it is provided later.
17
Using Table for Calculated r (See Table I)
Compare Calculated r with Tabled Value. Exceeds: Reject H0; Less Than: Accept. df = N - 2. As Sample Size Increases, the Tabled Value Decreases; Therefore, More Likely to Reject H0.

Like the t test and chi square, it is possible to use a table of values for the Pearson r to determine significance. As before, we calculate a statistic, in this case the Pearson r, and compare it to our table value. If the calculated value exceeds the table value, we reject the null hypothesis; if it is less than the table value, we accept the null hypothesis.

Let's take a look at Table 1, which lists the table values of the Pearson r. The df for the Pearson r is equal to N - 2, where N equals the number of pairs of data. Across the top of the page you see the significance level; one would most likely choose an a priori significance level of 0.05. As you can see, as we increase the sample size, or in other words as the df increase, the table value for r decreases. Therefore, as you increase the sample size you are more likely to reject the null hypothesis and declare a Pearson r to be significant: the table value decreases quite a bit, so you are more likely to exceed it with your calculated value and therefore more inclined to conclude that the Pearson r is significantly different from zero.
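The critical r values in such a table can be recovered from the two-tailed critical t values via r_crit = t / sqrt(t^2 + df). This sketch (the critical t values are from a standard t table; the example r of 0.5 is hypothetical) shows the tabled value shrinking as df grows, so the same calculated r can change conclusion:

```python
import math

def r_crit(t_crit, df):
    """Critical Pearson r from the two-tailed critical t at the same df."""
    return t_crit / math.sqrt(t_crit ** 2 + df)

# Two-tailed critical t at alpha = 0.05, keyed by df = N - 2.
T_CRIT = {8: 2.306, 18: 2.101, 28: 2.048}

r = 0.5  # hypothetical calculated Pearson r
for df, t in sorted(T_CRIT.items()):
    print(df, round(r_crit(t, df), 3), r > r_crit(t, df))
# df =  8: tabled r about 0.632 -> r = 0.5 is not significant
# df = 18: tabled r about 0.444 -> r = 0.5 is significant
# df = 28: tabled r about 0.361 -> r = 0.5 is significant
```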
18
Effect of Range on r Value of r Decreases With Range
Example: MCAT Scores. The range of measures also has an impact on the calculated value of the Pearson r. If the two measures have very narrow ranges, the value of the Pearson r tends to decrease. A good example of this is the correlation that exists between MCAT scores and grades in medical school. Quite often, this correlation has been criticized as not being very strong. The reason is that the Pearson r is decreased because the range of grades in medical school is very small, as most students make A's; further, medical schools only accept applicants from the top of the MCAT distribution, so the range of those scores is also very small. Now, if everybody who took the MCAT exam were accepted to medical school, you would have a wide range of values on the MCAT, and, I assure you, you would probably have a wide range of grades in school as well. The Pearson r would then probably be high, indicating a high degree of relationship between these two variables.
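The range effect can be demonstrated numerically. In this sketch (entirely made-up data: a linear trend plus a fixed alternating disturbance, with my own helper function for r), the full range of x yields a high r, while restricting the analysis to only the top x values, analogous to admitting only top MCAT scorers, sharply reduces it:

```python
import math

def pearson_r(x, y):
    """Pearson r: cross-deviations over the root of the squared deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

# y follows x, plus a fixed alternating disturbance of +/- 5.
x = list(range(50))
y = [xi + (5 if xi % 2 == 0 else -5) for xi in x]

r_full = pearson_r(x, y)                  # wide range of x: r is high
r_restricted = pearson_r(x[-5:], y[-5:])  # only the top x values: r drops
print(r_full, r_restricted)
```

The disturbance is the same size in both cases; only the range of x changed, and the narrow range makes the same scatter look far less linear.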
19
Other Correlation Coefficients (Interpretation Same)
Spearman (Data Ranked); Phi (Fourfold) Coefficient (Both Variables Dichotomous); Point Biserial (One Variable Continuous, the Other Dichotomous). There are several other correlation coefficients that appear in the literature. They include the Spearman correlation coefficient, the phi (fourfold) coefficient, and the point biserial. All are interpreted the same way as the Pearson r.
20
Spearman: Ranked on Each Variable
X  Y
1  5
2  2
3  3
4  4
5  1
The Spearman correlation coefficient is used when the data on both variables are ordinal or ranked. As you can see, we have an x variable here with subjects ranked 1 through 5, and the y variable shows their ranks on the second measure. The Spearman correlation coefficient, which is interpreted just like the Pearson r, would be used for these data.
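For ranked data with no ties, the Spearman coefficient can be computed with the difference formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), and it equals the Pearson r applied to the ranks. A sketch using the data from this slide (the helper names are my own):

```python
import math

def spearman_rho(x_ranks, y_ranks):
    """Spearman coefficient via the difference formula (no tied ranks)."""
    n = len(x_ranks)
    d2 = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

x = [1, 2, 3, 4, 5]        # ranks on the first variable (from the slide)
y = [5, 2, 3, 4, 1]        # ranks on the second variable

print(spearman_rho(x, y))  # about -0.6
print(pearson_r(x, y))     # about -0.6: same value on untied rank data
```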
21
Phi (Fourfold) Coefficient
          Sick  Not Sick
Drug        5      45
Placebo    10      40
The phi coefficient, or fourfold coefficient, is used when both variables are dichotomous, or divided into two groups. An example would be the study described earlier for chi square, where we had a drug group and a placebo group and tested whether individuals got sick or did not get sick. If we were determining whether there is a difference between the two groups, drug vs. placebo, we would use chi square. The phi coefficient tells you the degree of relationship between the two variables. Obviously, if there is a difference there is probably going to be a relationship, and the phi coefficient would be a strong one. The phi coefficient tells you about the relationship that exists between the two variables; chi square tells you whether there is a difference.
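For a 2 x 2 table with cells a, b across the top row and c, d across the bottom, the phi coefficient is (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)). A sketch using the table above (the cell labeling and the sign's interpretation depend on how the groups are coded, which is my own choice here, not the module's):

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi (fourfold) coefficient for a 2x2 table laid out as:
         a  b      (e.g. drug:    sick, not sick)
         c  d      (e.g. placebo: sick, not sick)"""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Table from the slide: drug 5 sick / 45 not sick, placebo 10 sick / 40 not.
phi = phi_coefficient(5, 45, 10, 40)
print(phi)   # about -0.14: a weak inverse relation under this coding
```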
22
Point Biserial (1 Continuous, 1 Dichotomous)
Student:     1   2   3   4   5
Question #1: R W
Test Score: 93  91  71  95  80
The point biserial correlation coefficient is used when one variable is continuous and the other is dichotomous, or divided into two groups. The point biserial is used quite often in the analysis of questions on exams. This slide shows an analysis of question number 1 on an exam taken by 5 students. If you have a good question on an exam, the students who score high on the exam as a whole should get the question right, and those who score low on the entire exam should get the question wrong. This particular situation illustrates the point biserial: those people who scored high got the question right, and those with low scores on the exam got the question wrong.
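The point biserial equals the Pearson r with the dichotomous variable coded 0/1. The per-student right/wrong pattern is not fully shown on the slide, so the sketch below uses a hypothetical assignment in which the three high scorers got the question right (the 0/1 coding and helper are my own):

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

scores = [93, 91, 71, 95, 80]   # test scores from the slide
right = [1, 1, 0, 1, 0]         # hypothetical: 1 = right, 0 = wrong

r_pb = pearson_r(right, scores)
print(r_pb)   # about 0.94: high scorers got the question right
```

A strongly positive point biserial like this is what a "good" exam question looks like under the slide's criterion.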
23
Multiple Correlation: Determines the Relationship Between Several Variables at Once and One Variable. Example: Relationship Between Age, Weight, Sex, and Dose of Drug on Blood Level. The point biserial, the phi coefficient, and the Spearman correlation coefficient are all interpreted just like the Pearson r. There is another type of correlation that you might see reported in the literature, called multiple correlation. It is used to determine the relationship between several variables at once and one particular variable. For example, one could do a multiple correlation to determine the relationship that exists between age, weight, sex, and dose of a drug, all four of those variables taken together, and the blood level of the drug.
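For the two-predictor case, the multiple correlation can be written in terms of the pairwise Pearson r's: R^2 = (r_y1^2 + r_y2^2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12^2). A sketch with made-up data in which y is exactly the sum of the two predictors, so the multiple R comes out to 1 (the variable names, data, and helper functions are all hypothetical):

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

def multiple_r(x1, x2, y):
    """Multiple correlation of y on two predictors, from pairwise r's."""
    ry1, ry2, r12 = pearson_r(y, x1), pearson_r(y, x2), pearson_r(x1, x2)
    r2 = (ry1 ** 2 + ry2 ** 2 - 2 * ry1 * ry2 * r12) / (1 - r12 ** 2)
    return math.sqrt(r2)

x1 = [1, 2, 3, 4, 5]                 # e.g. a hypothetical "dose" variable
x2 = [2, 1, 4, 3, 5]                 # e.g. a hypothetical "weight" variable
y = [a + b for a, b in zip(x1, x2)]  # y determined jointly by x1 and x2

print(multiple_r(x1, x2, y))         # about 1.0: the predictors fully explain y
```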
24
How to Use Statistix to Analyze Data with the Pearson r
Enter the data for the two variables.
Select Statistics, Linear Models, Correlations (Pearson).
Highlight and move the variable names to the Correlation Variables box.
Check the Compute p Values box, then OK.
Analyze the Pearson r data in the problems at the end of the Linear Regression module.