Inferential Statistics

Inferential Statistics
Correlations Inferential Statistics

Overview Correlation coefficients Scatterplots Calculating Pearson’s r
Interpreting correlation coefficients Calculating & interpreting coefficient of determination Determining statistical significance Calculating Spearman’s correlation coefficient

Correlation Reflects the degree of relation between variables
Calculation of correlation coefficient Direction + (positive) or – (negative) Strength (i.e., magnitude) Further away from zero, the stronger the relation Form of the relationship Correlation – reflects the degree of relation between variables. We calculate a correlation when we want to examine it there is an association or link between two variables. As previously discuss, a correlation coefficient ranges from to If you ever calculate a correlation coefficient and end up with something like 1.23 or -1.88, you know you have done something wrong. The +/- sign of the coefficient let’s us know what the direction of the correlation is. Positive means that the variables are going in the same direction (tend to both increase and decrease in the same direction). For example, as study time increases, grades tend to increase. Negative means the variables are going in opposite directions. For example, as depression increases, amount of work completed decreases. The strength or magnitude of the correlation is gleaned from the absolute value of the correlation coefficient. The further away from zero, the stronger the relation. Both and represent perfect relationships and are the strongest possible correlations. They represent a highly consistent relation between the two variables. The closer to zero, the more inconsistency in the relationship – in which the variables don’t seem to be associated. We can also get information about the form of the relationship. That is, is it linear or curvilinear? Typically, this pattern can be seen in scatterplots. For this class, we will focus on linear relationships.

Check yourself Indicate whether the following statements suggest a positive or negative relationship: High school students with lower IQs have lower GPAs More densely populated areas have higher crime rates Heavier automobiles yield poorer gas mileage More anxious people willingly spend more time performing a simple repetitive task Go through the items above to determine whether they represent positive or negative correlations. Answers: High school students with lower IQs have lower GPAs. There is a positive correlation such that among high school students lower IQ is associated with lower GPA. More densely populated areas have higher crime rates. Positive – as population increases, crime rates tend to also increase. Heavier automobiles yield poorer gas mileage. Negative – as automobile weight increases, gas mileages tends to decrease. More anxious people willingly spend more time performing a simple repetitive task. Positive – as anxiety increases, willingness to spend time on repetitive tasks increases.

Scatterplots Scatterplots can be used to represent the relation between variables. One variable is labeled X and one is labeled Y. Typically, the X variable is the predictor variable (in this case wine consumed) and the Y variable is the criteria or outcome variable (in this case hear disease deaths). Each data point is a point on the graph is essentially a participant. Researchers plot participants’ scores on each of the variable to visually see if there is a association between the two variables. In this case, we can see that there is a negative relationship between the two.

Correlation & Scatterplots
Exam1 X Exam2 Y Participant1 100 95 Participant2 60 65 Participant3 75 80 Participant4 85 Participant5 Participant6 70 Participant7 r = .91 Good idea to graph the data Determine the form of the relationship (linear or nonlinear) Determine if there are any outliers (extreme scores) Rough guess as to what the correlation coefficient will be Benefits of scatterplot Form of relation Any possible outliers? Rough guess of r

Number of Arrests X GPA Y Participant1 4.0 Participant2 5 3.7 Participant3 10 2.8 Participant4 20 2.5 Participant5 30 1.0 Graph these data.

Number of Arrests X GPA Y Participant1 4.0 Participant2 5 3.7 Participant3 10 2.8 Participant4 20 2.5 Participant5 30 1.0 # of times arrested

Pearson’s r Formula SP = Sum of products (of deviations)
SSx = Sum of Squares of X SSy = Sum of Squares of Y By far the most common correlation is the Pearson correlation (or the Pearson product moment correlation) which measures the degree of straight-line relationships between two variables on ratio scales of measurement. To calculate Person correlation, it is necessary to introduce one new concept: the sum of products of deviation, or SP. In the past, we used a similar concept, sum of squared deviations to measure variability of a single variable. Now we will use SP to measure the amount of covariability between two variables. The denominator is the square root of the sum of squares (SS) for the X scores times the sum of squares (SS) for the Y scores. Notice the subscript is used to differentiate the two sums of squares. We will get back to the denominator in a second.

Pearson’s r Calculating SP Find X & Y deviations for each individual
Definitional formula Computational formula Find X & Y deviations for each individual Find product of deviations for each individual Sum the products Similar to calculating SS, there are two formulas we can use to calculate the sums of products (SP). One is the definitional formula. First, find the deviations from the mean for X and Y separately, Second, find the product of deviations for each individual, Third, sum the products. The other is the computational formula.

Example #1 Calculating SP – Definitional Formula
Step 2: Multiply the deviations from the mean X Y 2 4 3 5 7 X-MX Y-MY 2-3 = -1 2-4 = -2 4-4 = 0 3-3 = 0 3-4 = -1 5-3 = 2 7-4 = 3 (X-MX)(Y-MY) (-1)(-2) = 2 (-1)(0) = 0 (0)(-1) = 0 (2)(3) = 6 Step 3: Sum the products Here, you can see the steps for the definitional formula. First, we find the deviations for X and Y separately. So, we determine the mean for our X scores, which in this case is 3. We subtract three from each X score to find how much each score deviates from the X mean. Then we repeat this process for the Y scores. We determine the mean for our Y scores, which is 4. Next, we subtract the mean (4) from each Y score to determine how much each score deviates from the Y mean. Second, we find the product of the deviations – we multiply the deviations from the mean for our X scores with the deviations from the mean for our Y scores. Lastly, we sum the products. This is our SP. SX = 12 SY = 16 MX = SX/n = 12/4 = 3 MY = SY/n = 16/4 = 4 Step 1: Find deviations for X and Y separately

Example #1 Calculating SP – Computational Formula
Y 2 4 3 5 7 XY 4 8 9 35 For the computational formula, the sum of products equals the sum of X times Y minus the sum or X times the sum of Y divided by N The sum of X is 12 The sum of Y is 16 The sum of X times Y is 56 So to put in the formulas it would be: 56-(12*16)/4 56-192/4 56-48=8 SX = 12 SY = 16 MX = SX/n = 12/4 = 3 MY = SY/n = 16/4 = 4

Calculating Pearson’s r
Calculate SP Calculate SS for X Calculate SS for Y Plug numbers into formula Here are all the steps needed to calculate Pearon’s r. First, we calculate SP using either the definitional or computational formula. Note, SP can be a positive or negative number. Second, we calculate SS for X scores using either the definitional or computational formula. Third, we calculate SS for Y scores using either the definitional or computational formula. Lastly, we plug each of the numbers into the formula.

Calculating Pearson’s r
Calculate SP Calculate SS for X Calculate SS for Y Plug numbers into formula Here are the numbers from the previous problem in which we’ve already determined that SP is 8. Try completing steps 2-4 on your own. X Y 2 4 3 5 7

Example #1 - Answers Calculating Pearson’s r
Y 2 4 3 5 7 X-MX Y-MY 2-3 = -1 2-4 = -2 4-4 = 0 3-3 = 0 3-4 = -1 5-3 = 2 7-4 = 3 (X-MX)(Y-MY) (-1)(-2) = 2 (-1)(0) = 0 (0)(-1) = 0 (2)(3) = 6 XY 4 8 9 35 X2 Y2 4 16 9 25 49 (X-MX)2 (Y-MY)2 1 4 9 Here are the answers. You can see that I’ve used both the definitional and computational formulas. SX = 12 SY = 16 MX = SX/n = 12/4 = 3 MY = SY/n = 16/4 = 4

Pearson’s r r = covariability of X and Y
variability of X and Y separately It is important to think about what that correlation coefficient really is.

Using Pearson’s r Prediction Validity Reliability
Although Pearon’s r can be used in many different ways, here are a few examples. It is important for prediction. If two variables are known to be related in some systematic way, it is then possible to use one of the variables to make accurate predications about the other. It is also important for determining validity. Remember, back we discussed scale validation – that is, are we measuring what we are suppose to be measuring. We talked about several types of variability, such as construct validity and predictive validity. It can also be used to say something about reliability. That is, the consistency between two measures. For example, remember test-retest and interrater reliability. Ideally, we would want a strong positive correlation between scores on tests or observers’ ratings.

Verbal Descriptions 1) r = -.84 between total mileage & auto resale value 2) r = -.35 between the number of days absent from school & performance on a math test 3) r = -.05 between height & IQ 4) r = .03 between anxiety level & college GPA 5) r = .56 between age of schoolchildren & reading comprehension level Take a moment to interpret these correlation coefficients. Cars with more total miles tend to have lower resale values. Students with more absences from school tend to score lower on math tests. Little or no relationship exists between height and IQ Little or no relationship exists between anxiety level and college IQ Older schoolchildren tend to have better reading comprehension It is important that in your verbal descriptions to steer clear of any causal claims.

Interpreting correlations
Describe a relationship between 2 vars Correlation does not equal causation Directionality Problem Third-variable Problem Restricted range Obscures relationship When interpreting correlation coefficients it is important to remember that ‘correlation does not equal causation’. We previously discussed the directionality problem – that is, if the two variables are measured at similar points in time, we cannot be sure which variable is the cause and which is the effect. Does depression cause low self-esteem? Or does low self-esteem cause depression? There is also the third variable problem, in which is possible that a third unmeasured variable may be responsible for both. Perhaps it is chemical imbalance in serotonin levels that influences both depression and self-esteem. We also have to be careful concerning the restriction of range, arises when the range between the lowest and highest scores on one or both variables is limited. This will reduce the accuracy of the correlation coefficient, producing a coefficient that is smaller than it would be if the range were not restricted.

Outliers Can have BIG impact on correlation coefficient We also have to beware of outliers (i.e., extreme scores), which can have a big impact on the correlation coefficient. Here you can see that from subjections A through E, the correlation is Basically, no correlation between the two variables. However, which subject F is added into the mix, we see that the correlation jumps to .86 suggesting a strong positive correlation. It is clear though that this relationship is being driven by one subject who is very different from the other subjects. Source: Figure is from page 462 of the Gravetter and Wallnau textbook.

Strength & Prediction Coefficient of determination r2 Proportion of variability in one variable that can be determined from the relationship w/ the other variable r = .60, then r2 = .36 or 36% 36% of the total variability in X is consistently associated with variability in Y “predicted” and “accounted for” variability When multiplied by 100, the coefficient of determination indicates the percentage of variance accounted for. Essentially, it is the proportion of variablity in one variable that can be determined from the relationship with the other variable. It is used to determine effect size for correlations. Low values for r2 indicate that factors, other than the two variables of interest, are influencing the relationship. For example, if we had a correlation of .60 – then 36% of the variability in the Y scores can be predicted from the relationship with X scores. Example (hours watching TV/amount of aggression) r = .875 r2 = 76.56% The relationship between hours spent watching TV and amount of aggression accounts for a large percentage of the variance 76.56% of the variability in the Y scores can be predicted from the relationship with X There are few other factors that influence this relationship

Mini-Review Correlations Calculation of Pearson’s r Using Pearson’s r
Sum of product deviations Using Pearson’s r Verbal descriptions Interpretation of Pearson’s r

Example #2 Practice – Calculate Pearson’s r
Calculate SP Calculate SS for X Calculate SS for Y Plug numbers into formula Ex 1 X Y 2 9 1 10 3 6 8 4 Complete this problem.

Example #2 SP = S(X-MX)(Y-MY) SSY SSX X Y 2 9 1 10 3 6 8 4 X-MX Y-MY
8 4 X-MX Y-MY 2-2 = 0 9-7 = 2 1-2 = -1 10-7 = 3 3-2 = 1 6-7 = -1 0-2 = -2 8-7 = 1 4-2 = 2 2-7 = -5 (X-MX)(Y-MY) (0)(2) = 0 (-1)(3) = -3 (1)(-1) = -1 (-2)(1) = -2 (2)(-5) = -10 XY 18 10 8 X2 Y2 4 81 1 100 9 36 64 16 (X-MX)2 (Y-MY)2 4 1 9 25 SX = 10 SY = 35 MX = SX/n = 10/5 = 2 MY = SY/n = 35/5 = 7 SSX

Hypothesis Testing Making inferences based on sample information
Is it a statistically significant relationship? Or simply chance? We’ve already discussed using a sample mean to answer questions about a population mean So, we know some of the basic principles H0: r = 0 In population – no effect, no difference, no relationship H1: r not equal 0 In population – a change, a relationship, a difference

Conceptually - Degrees of freedom
Knowing M (the mean) restricts variability in sample 1 score will be dependent on others n = 5, SX = 20 If we know first 3 scores If we know first 4 scores Score X1 = 6 X2 = 4 X3 = 2 X4 X5 Σx = 20 = 3 With n=5, there can be only 4 df = 5 With a population, we find the deviation for each score by measuring its distance from the population mean (x – “mew”) With the sample, the population mean is unknown – must measure distances from sample mean. This requires that you know the sample mean to compute deviations. Knowing the value of mean places a restriction on the variability of the scores in the sample. Meaning, one score will always be restricted (i.e., dependent on the other ones). Because we have to know M for computation of s2 (which is variance), one score will always be restricted (i.e. dependent on the other ones) A sample with n scores will generally have n-1 scores that can vary, but the last score is restricted. We call this the sample has n-1 degrees of freedom For example, if we have 5 scores and we know the sum of the 5 scores is 20. If you know the first 3 scores, scores 4 and 5 could be almost any combination If you know the first 4 scores, the 5th score is determined --- here it must be 3

Correlations – Degrees of freedom
There are no degrees of freedom when our sample size is 2. When there are only two points on a scatterplot, they will fit perfectly on a straight line. Thus, for correlations df = n – 2 For correlations, the degrees of freedom = n – 2 With n = 2 (essentially as sample size of two people), there are no degrees of freedom. When there are only two points, they will ALWAYS fit perfectly on a straight line. Because the first two points always produce a perfect correlation – the sample is free to vary only when the data set contains more than two points.

Using table to determine significance
Find degrees of freedom Correlations: df = n – 2 Use level of significance (e.g., a = .05) for two-tailed test to find column in Table Determine critical value Value calculated r must equal or exceed to be significant Compare calculated r w/ critical value If calculated r less than critical value = not significant APA The correlation between hours watching television and amount of aggression is not significant, r (3) = -.80, p > .05. In order to determine statistical significant consult a table – like the one found here - Step 1: Examine the df column (N – 2). N stands for the number of pairs of scores. (Example, 5 pairs of scores, then 3 degrees of freedom) Step 2: Find the column for .05, two-tailed test Table for critical values for the Pearson correlation Step 3: Determine the critical value (the value that our calculated r must equal or exceed to be significant). For example, the critical value = .8783 Step 4: Compare your calculated r with the table value to determine significance For example, r = .875, which is LESS than Not significant. Step 5: Write an APA format conclusion. IF NOT SIGNIFICANT: The correlation between hours watching television and amount of aggression is not significant, r(3) = .88, p > .05. IF SIGNIFICANT: The more hours people spend watching television, the greater the amount of aggression, r(3) = .90, p > .05. OR there was a strong positive correlation between hours spent watching television and aggression amounts, r(3) = .900, p > .05. Notice that our correlation coefficient of .875 is statistically significant, but the correlation is strong. This is because our sample size is so small. Notice, the effect that sample size has on statistical significance. Think about sample size

Difference between pair of ranks
Spearman correlation Used when: Ordinal data If 1 variable is on ratio scale, then change scores for that variable into ranks Difference between pair of ranks Spearman correlation is another type of correlation coefficient. When both of our variables are on an ordinal scale of measurement, then Spearman is the appropriate correlation. It can also be used when we have one ordinal level variable in one interval/ration level variable. We would actually need to first convert the interval/ratio variable into an ordinal level variable. It can also be calculated for nonlinear relationships – “measure consistency of a relationship between x and y, independent of the specific form of the relationship.”

Example #3: Spearman correlation
1st race 2nd race 4 3 1 2 9 8 6 5 7 Try calculating this correlation. First, notice that we have two variables place in race 1 and place in race 2. There are nine individuals who ran in two races and we would like to know if there is a relationship between where they place in the first race and where they place in the second race. Place is on an ordinal scale of measurement.

Example #3: Spearman - Answers
1st race 2nd race 4 3 1 2 9 8 6 5 7 D 1 -1 2 -2 D2 1 4 In terms of calculation, we determine the difference (D) between places for the two races by subtracting the two columns. Next, we square the difference (D2). Then we find the sum of D2 Next, plug the numbers into the formula. Following our order of operations. We multiply the sum of D2 (18) by 6 = 108. Next, we tackle the bottom half of the formula by squaring n, which is 9, = 81. Subtract 1 from 81 to get 80. Multiply 80 by n (9) to get 720. Divide 108 by 720 to get .15. Subtract .15 from 1 to get .85.

Example #4: Spearman Correlation
Two movie raters each watched the same six movies. Is there are relationship between the raters’ rankings? Example Rater 1 Rater 2 1 6 2 4 3 5 Try this problem.

Example #4: Spearman - answers
Rater 1 Rater 2 1 6 2 4 3 5 D D2 -5 25 -2 4 1 3 9 5 Here are the answers.

Spearman rs (from SPSS)
Pearson r (from SPSS) Here you can see the SPSS output for both Pearson r and Spearman. In the first box, you can see that correlation between anxiety and noise is .87 (strong, positive correlation) and that it is statistically significant at the .001 level. Spearman rs (from SPSS)

Example #5: Pearson’s r Participant Motivation (X) Depression (Y)
Sketch a scatterplot. Calculate the correlation coefficient. Determine if it is statistically significant at the .05 level for a 2-tailed test. Write an APA format conclusion.

Inferential Statistics

Similar presentations

Presentation on theme: "Inferential Statistics"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Inferential Statistics

Similar presentations

Presentation on theme: "Inferential Statistics"— Presentation transcript:

Similar presentations

About project

Feedback