1
Correlation and regression
2
Lecture: correlation; regression. Exercise: group tasks on correlation and regression; free experiment supervision/help.
3
Last week we covered four types of non-parametric statistical tests. They make no assumptions about the data's characteristics. Use them if any of the three properties below are true: (a) the data are not normally distributed (e.g. skewed); (b) the data show inhomogeneity of variance; (c) the data are measurements on an ordinal scale (can be ranked).
4
Non-parametric tests make few assumptions about the distribution of the data being analyzed. They get around this by not using the raw scores but by ranking them: the lowest score gets rank 1, the next lowest rank 2, and so on. How the ranking is carried out differs from test to test, but the principle is the same. The analysis is carried out on the ranks, not the raw data. Ranking the data means we lose information – we do not know the distances between the ranks. This means that non-parametric tests are less powerful than parametric tests: they are less likely to discover an effect in our data (an increased chance of a type II error).
5
Examples of parametric tests and their non-parametric equivalents:
Parametric test                              Non-parametric counterpart
Pearson correlation                          Spearman's correlation
(No equivalent test)                         Chi-square test
Independent-means t-test                     Mann-Whitney test
Dependent-means t-test                       Wilcoxon test
One-way independent-measures ANOVA           Kruskal-Wallis test
One-way repeated-measures ANOVA              Friedman's test
6
Just as with parametric tests, which non-parametric test to use depends on the experimental design (repeated measures or independent groups) and the number of conditions/levels of the IV.
7
Mann-Whitney: two conditions, two groups, each participant gives one score. Wilcoxon: two conditions, one group, each participant gives two scores (one per condition). Kruskal-Wallis: 3+ conditions, different people in all conditions, each participant gives one score. Friedman's ANOVA: 3+ conditions, one group, each participant gives 3+ scores (one per condition).
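The slides use SPSS for these tests; purely as an illustrative aside (not from the original slides), the same four tests are available in Python's SciPy library. The data below are invented, and the mapping of scenarios to functions follows the list above.

```python
# A minimal sketch of the four non-parametric tests, with made-up example data.
from scipy import stats

group_a = [3, 5, 2, 6, 4]           # one score per participant, group A
group_b = [7, 9, 6, 8, 10]          # one score per participant, group B

# Mann-Whitney: two conditions, two different groups
print(stats.mannwhitneyu(group_a, group_b))

# Wilcoxon: two conditions, the same participants measured twice
before = [4, 6, 5, 7, 3]
after  = [5, 8, 6, 9, 4]
print(stats.wilcoxon(before, after))

# Kruskal-Wallis: 3+ conditions, different people in each
group_c = [12, 14, 11, 13, 15]
print(stats.kruskal(group_a, group_b, group_c))

# Friedman's ANOVA: 3+ conditions, the same participants in all of them
cond1, cond2, cond3 = [4, 6, 5, 7], [5, 8, 6, 9], [7, 9, 8, 10]
print(stats.friedmanchisquare(cond1, cond2, cond3))
```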
8
Which nonparametric test?
1. Differences in fear ratings for 3-, 5- and 7-year-olds in response to sinister noises from under their bed
2. Effects of cheese, Brussels sprouts, wine and curry on the vividness of a person's dreams
3. Number of people spearing their eardrums after enforced listening to Britney Spears, Beyonce, Robbie Williams and Boyzone
4. Pedestrians rate the aggressiveness of owners of different types of car. Group A rate Micra owners; group B rate 4x4 owners; group C rate Subaru owners; group D rate Mondeo owners.
Consider: how many groups? How many levels of the IV / conditions?
9
1. Differences in fear ratings for 3-, 5- and 7-year-olds in response to sinister noises from under their bed [3 groups, one score each, 3 conditions – Kruskal-Wallis].
2. Effects of cheese, Brussels sprouts, wine and curry on the vividness of a person's dreams [one group, 4 scores each, 4 conditions – Friedman's ANOVA].
3. Number of people spearing their eardrums after enforced listening to Britney Spears, Beyonce, Robbie Williams and Boyzone [one group, 4 scores each, 4 conditions – Friedman's ANOVA].
4. Pedestrians rate the aggressiveness of owners of different types of car. Group A rate Micra owners; group B rate 4x4 owners; group C rate Subaru owners; group D rate Mondeo owners. [4 groups, one score each – Kruskal-Wallis]
11
We often want to know if there is a relationship between two variables Do people who drive fast cars get into accidents more often? Do students who give the teacher red apples get higher grades? Do blondes have more fun? Etc.
12
Correlation coefficient: A succinct measure of the strength of the relationship between two variables (e.g. height and weight, age and reaction time, IQ and exam score).
13
A correlation is a measure of the linear relationship between variables. Two variables can be related in different ways: 1) positively related: the faster the car, the more accidents; 2) not related: the speed of the car has no bearing on the number of accidents; 3) negatively related: the faster the car, the fewer accidents.
14
We describe the relationship between variables statistically by looking at two measures: covariance and the correlation coefficient. We represent relationships graphically using scatterplots. The simplest way to decide whether two variables are associated is to evaluate whether they covary. Recall: the variance of one variable is the average amount the scores in the sample vary from the mean – if variance is high, the scores in the sample are very different from the mean.
15
Low and high variance around the mean of a sample
16
If we want to know whether two variables are related, we want to know whether changes in the scores of one variable are met with similar changes in the other variable. Therefore, when one variable deviates from its mean, we would expect the scores of the other variable to deviate from their mean in a similar way. Example: we take 5 people, show them a commercial, and measure how many packets of sweets they buy the week after. If the number of times the commercial was seen is related to the number of packets of sweets bought, the scores should vary around the means of the two samples in a similar way.
17
It looks like a relationship exists
18
How do we calculate the exact similarity between the patterns of deviation in the two variables (samples)? We calculate the covariance. Step 1: for each participant, multiply the deviation of the score from the mean in one sample by the corresponding deviation in the other sample. Note that if both deviations are positive or both are negative, we get a positive value (+ × + = + and − × − = +); if one deviation is positive and the other negative, we get a negative value (+ × − = −).
19
Step 2: divide by the number of observations (score pairs) minus 1: N − 1. This is the same equation as for calculating variance, except that we multiply each deviation by the corresponding deviation of the score in the other sample, rather than squaring the deviations within one sample.
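A minimal sketch of these two steps (not part of the slides, which do the calculation by hand or in SPSS); the commercial/sweets numbers here are invented for illustration.

```python
# Step 1: cross-products of paired deviations; Step 2: divide by N - 1.
adverts = [5, 4, 4, 6, 8]        # hypothetical: times the commercial was seen
packets = [8, 9, 10, 13, 15]     # hypothetical: packets of sweets bought

n = len(adverts)
mean_x = sum(adverts) / n
mean_y = sum(packets) / n

# Step 1: multiply each deviation in one sample by the matching deviation in the other
cross_products = [(x - mean_x) * (y - mean_y) for x, y in zip(adverts, packets)]

# Step 2: divide the summed cross-products by N - 1
covariance = sum(cross_products) / (n - 1)
print(covariance)   # positive value -> the two variables rise and fall together
```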
20
Positive covariance indicates that as one variable deviates from its mean, the other deviates in the same direction (faster cars go with more accidents). Negative covariance indicates that as one variable deviates from its mean, the other deviates in the opposite direction (faster cars go with fewer accidents).
21
Covariance, however, depends on the scale of measurement used – it is not a scale-independent measure. To overcome this problem, we standardize the covariance, so that it is comparable across experiments, no matter what units of measurement we use.
22
We do this by expressing deviations between scores and means in units of standard deviation. Recall: any score can be expressed in terms of how many SDs it is away from the mean (the z-score). We therefore divide the covariance by the SDs of both samples – there are two samples, so we need the SD of each to standardize the covariance.
23
This standardized covariance is known as the correlation coefficient. It is also called Pearson's correlation coefficient and is one of the most important formulas in statistics.
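The formula itself is shown as an image in the original slides and is not reproduced in this transcript; a reconstruction of the standardized covariance it refers to is:

```latex
% Pearson's correlation coefficient as standardized covariance
r = \frac{\operatorname{cov}(X,Y)}{s_X \, s_Y}
  = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{(N-1)\, s_X \, s_Y}
```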
24
When we standardize covariance we end up with a value that lies between -1 and +1. If r = +1, we have a perfect positive relationship.
25
+1 (perfect positive correlation: as X increases, so does Y):
26
When we standardize covariance we end up with a value that lies between -1 and +1. If r = -1, we have a perfect negative relationship.
27
Perfect negative correlation: As X increases, Y decreases, or vice versa
28
If r = 0 there is no correlation between the two samples – changes in sample X are not associated with systematic changes in sample Y, or vice versa. Recall that we can use the correlation coefficient as a measure of effect size: an r of ±0.1 is a small effect, ±0.3 a medium effect and ±0.5 a large effect.
30
Before performing a correlational analysis we plot a scatterplot to get an idea of how the variables covary. A scatterplot is a graph of the scores on one variable plotted against the scores on another variable. Further variables can be included in a 3D plot.
31
A scatterplot shows: whether there is a relationship between the variables; what kind of relationship it is; and whether any cases (scores) are markedly different – outliers – which cause problems. We normally plot the IV on the x-axis and the DV on the y-axis.
32
A 2D scatterplot
33
A 3D scatterplot
34
Using SPSS to obtain scatterplots: (a) simple scatterplot: Graphs > Legacy Dialogs > Scatter/Dot...
35
Using SPSS to obtain scatterplots: (a) simple scatterplot: Graphs > Chart Builder. 1. Pick Scatter/Dot. 2. Drag the "Simple Scatter" icon into the chart preview window. 3. Drag the X and Y variables into the x-axis and y-axis boxes in the chart preview window.
36
Using SPSS to obtain scatterplots: (b) scatterplot with regression line: Analyze > Regression > Curve Estimation... In the output, "Constant" is the intercept with the y-axis and "b1" is the slope.
37
Having looked at the data visually, we can conduct a correlation analysis in SPSS. The procedure is on page 123 in chapter 4 of Field's book in the compendium. Note that there are two types of correlation: bivariate and partial. Bivariate correlation is the correlation between two variables; partial correlation is the same, but controlling for one or more additional variables.
38
Using SPSS to obtain correlations: Analyze > Correlate > Bivariate...
39
There are various types of correlation coefficient, for different purposes: Pearson's "r": used when both X and Y variables are (a) continuous; (b) (ideally) measurements on interval or ratio scales; (c) normally distributed – e.g. height, weight, I.Q. Spearman's rho: used in the same circumstances as Pearson's r, except that the data need only be on an ordinal scale – e.g. attitudes, personality scores.
40
r is a parametric test: the data have to have certain characteristics (parameters) before it can be used. rho is a non-parametric test - less fussy about the nature of the data on which it is performed. Both are dead easy to calculate in SPSS
42
Calculating Pearson's r: a worked example: Is there a relationship between the number of parties a person gives each month, and the amount of flour they purchase from Møller-Mogens?
43
Our algorithm for the correlation coefficient from before, slightly modified:
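The equation itself appears as an image in the original slides and is not reproduced in this transcript; the computational ("raw score") form it most likely refers to, given the columns of the table that follows, is:

```latex
% Computational formula for Pearson's r using sums of raw scores
r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{N}}
         {\sqrt{\left(\sum X^{2} - \frac{(\sum X)^{2}}{N}\right)
                \left(\sum Y^{2} - \frac{(\sum Y)^{2}}{N}\right)}}
```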
44
Month   Flour (X)   No. of parties (Y)   X²     Y²     XY
A       37          75                   1369   5625   2775
B       41          78                   1681   6084   3198
C       48          88                   2304   7744   4224
D       32          80                   1024   6400   2560
E       36          78                   1296   6084   2808
F       30          71                   900    5041   2130
G       40          75                   1600   5625   3000
H       45          83                   2025   6889   3735
I       39          74                   1521   5476   2886
J       34          74                   1156   5476   2516
N = 10, ΣX = 382, ΣY = 776, ΣX² = 14876, ΣY² = 60444, ΣXY = 29832
45
Using our values from the bottom row of the table (N = 10, ΣX = 382, ΣY = 776, ΣX² = 14876, ΣY² = 60444, ΣXY = 29832):
r = (ΣXY − (ΣX)(ΣY)/N) / √[(ΣX² − (ΣX)²/N)(ΣY² − (ΣY)²/N)]
  = (29832 − (382 × 776)/10) / √[(14876 − 382²/10)(60444 − 776²/10)]
46
  = (29832 − 29643.20) / √(283.60 × 226.40)
  = 188.80 / 253.39
  ≈ 0.75
r is 0.75. This is a positive correlation: people who buy a lot of flour from Møller-Mogens also hold a lot of parties (and vice versa).
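As a quick check (not part of the original slides), the same value comes out of SciPy's pearsonr when given the data from the table:

```python
from scipy import stats

flour   = [37, 41, 48, 32, 36, 30, 40, 45, 39, 34]   # X values from the table
parties = [75, 78, 88, 80, 78, 71, 75, 83, 74, 74]   # Y values from the table

r, p = stats.pearsonr(flour, parties)
print(round(r, 2))   # 0.75, matching the hand calculation
```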
47
How to interpret the size of a correlation: r² (r × r, "r-squared") is the "coefficient of determination". It tells us what proportion of the variation in the Y scores is associated with changes in X. E.g., if r is 0.2, r² is 4% (0.2 × 0.2 = 0.04 = 4%). Only 4% of the variation in Y scores is attributable to Y's relationship with X. Thus, knowing a person's Y score tells you essentially nothing about what their X score might be!
48
Our correlation of 0.75 gives an r² of 56%. An r of 0.9 gives an r² of (0.9 × 0.9 = .81) = 81%. Note that correlations become much stronger the closer they are to 1 (or −1): correlations of .6 or −.6 (r² = 36%) are much better than correlations of .3 or −.3 (r² = 9%), not merely twice as strong!
50
We use Spearman's correlation coefficient when the data have violated parametric assumptions (e.g. a non-normal distribution). Spearman's correlation coefficient works by ranking the data in the samples, just like the other non-parametric tests.
51
Spearman's rho measures the degree of monotonicity rather than linearity in the relationship between two variables – i.e., the extent to which there is some kind of consistent change in Y associated with changes in X. Hence it copes better than Pearson's r when the relationship is monotonic but non-linear – e.g.:
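An illustrative sketch (not from the slides): on data that rise monotonically but along a curve, rho stays at 1.0 while r drops below 1. The cubic relationship here is just an invented example.

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [v ** 3 for v in x]          # strictly increasing, but curved

print(stats.pearsonr(x, y))      # r < 1: the relationship is not a straight line
print(stats.spearmanr(x, y))     # rho = 1.0: perfectly monotonic
```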
53
Some pertinent notes on interpreting correlations: Correlation does not imply causality: X might cause Y. Y might cause X. Z (or a whole set of factors) might cause both X and Y.
54
Factors affecting the size of a correlation coefficient: 1. Sample size and random variation: The larger the sample, the more stable the correlation coefficient. Correlations obtained with small samples are unreliable.
55
Conclusion: you need a large sample before you can be really sure that your sample r is an accurate reflection of the population r. Limits within which 80% of sample r's will fall, when the true (population) correlation is 0:
Sample size:   80% limits for r:
5              −0.69 to +0.69
15             −0.35 to +0.35
25             −0.26 to +0.26
50             −0.18 to +0.18
100            −0.13 to +0.13
200            −0.09 to +0.09
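A small simulation (not from the slides) reproduces the flavour of this table: repeatedly draw pairs of uncorrelated samples and look at the middle 80% of the resulting r's. The sample sizes, number of repetitions, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

for n in (5, 100):
    # correlate two independent (truly uncorrelated) samples, many times over
    rs = [np.corrcoef(rng.normal(size=n), rng.normal(size=n))[0, 1]
          for _ in range(10_000)]
    lo, hi = np.percentile(rs, [10, 90])   # limits containing 80% of sample r's
    print(f"N = {n:3d}: 80% of r's fall between {lo:+.2f} and {hi:+.2f}")
```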
56
2. Linearity of the relationship: Pearson’s r measures the strength of the linear relationship between two variables; r will be misleading if there is a strong but non-linear relationship. e.g.:
57
3. Range of talent (variability): The smaller the amount of variability in X and/or Y, the lower the apparent correlation. e.g.:
58
4. Homoscedasticity (equal variability): r describes the average strength of the relationship between X and Y. Hence scores should have a constant amount of variability at all points in their distribution. [Figure: scatterplot with regression line – low variability of Y (small Y − Y′) in one region, high variability of Y (large Y − Y′) in another]
59
5. Effect of discontinuous distributions: a few outliers can distort things considerably. There is no real correlation between X and Y in the case below.
60
Deciding what is a "good" correlation: A moderate correlation could be due to either: (a) sampling variation (and hence a "fluke"); or (b) a genuine association between the variables concerned. How can we tell which of these is correct?
61
Distribution of r's obtained using samples drawn from two uncorrelated populations of scores: [Diagram: sampling distribution of r centred on r = 0 – small correlations are likely to occur by chance; large positive or large negative correlations are unlikely to occur by chance]
62
For an N of 20: [the 2.5% tails of the distribution lie beyond r = −0.44 and r = +0.44.] For a sample size of 20, 5 out of 100 random samples are likely to produce an r of 0.44 or larger in magnitude merely by chance (i.e., even though in the population there is no correlation at all!)
63
Thus we arbitrarily decide that: (a) If our sample correlation is so large that it would occur by chance only 5 times in a hundred, we will assume that it reflects a genuine correlation in the population from which the sample came. (b) If a correlation like ours is likely to occur by chance more often than this, we assume it has arisen merely by chance, and that it is not evidence for a correlation in the parent population.
64
How do we know how likely it is to obtain a sample correlation as large as ours by chance? Tables (on the website) give this information for different sample sizes. An illustration of how to use these tables: suppose we take a sample of 20 people, and measure their eye-separation and back hairiness. Our sample r is .75. Does this reflect a true correlation between eye-separation and hairiness in the parent population, or has our r arisen merely by chance (i.e. because we have a freaky sample)?
65
Step 1: Calculate the "degrees of freedom" (DF = the number of pairs of scores, minus 2). Here, we have 20 pairs of scores, so DF = 18. Step 2: Find a table of "critical values for Pearson's r".
66
Part of a table of "critical values for Pearson's r":
Level of significance (two-tailed)
df      .05      .01      .001
17      .4555    .5751    .6932
18      .4438    .5614    .6787
19      .4329    .5487    .6652
20      .4227    .5368    .6524
With 18 df, a correlation of .4438 or larger will occur by chance with a probability of 0.05: i.e., if we took 100 samples of 20 people, about 5 of those samples are likely to produce an r of .4438 or larger (even though there is actually no correlation in the population!)
67
Part of a table of "critical values for Pearson's r":
Level of significance (two-tailed)
df      .05      .01      .001
17      .4555    .5751    .6932
18      .4438    .5614    .6787
19      .4329    .5487    .6652
20      .4227    .5368    .6524
With 18 df, a correlation of .5614 or larger will occur by chance with a probability of 0.01: i.e., if we took 100 samples of 20 people, about 1 of those 100 samples is likely to give an r of .5614 or larger (again, even though there is actually no correlation in the population!)
68
Part of a table of "critical values for Pearson's r":
Level of significance (two-tailed)
df      .05      .01      .001
17      .4555    .5751    .6932
18      .4438    .5614    .6787
19      .4329    .5487    .6652
20      .4227    .5368    .6524
With 18 df, a correlation of .6787 or larger will occur by chance with a probability of 0.001: i.e., if we took 1000 samples of 20 people, about 1 of those 1000 samples is likely to give an r of .6787 or larger (again, even though there is actually no correlation in the population!)
69
The table shows that an r of .6787 is likely to occur by chance only once in a thousand times. Our obtained r is .75, which is larger than .6787. Hence our obtained r of .75 is likely to occur by chance less than one time in a thousand (p < 0.001).
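For completeness (this is not in the slides, which use printed tables), the critical-value lookup can be mimicked in software by converting r into a t statistic with N − 2 degrees of freedom; the sketch below assumes the usual t-transformation of r.

```python
from scipy import stats
import math

r, n = 0.75, 20
df = n - 2
t = r * math.sqrt(df / (1 - r ** 2))   # t-transformation of the sample correlation
p = 2 * stats.t.sf(abs(t), df)         # two-tailed p-value from the t distribution
print(p)                               # well below 0.001, matching the table-based conclusion
```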
70
Conclusion: any sample correlation could in principle occur due to chance or because it reflects a true relationship in the population from which the sample was taken. Because our r of .75 is so unlikely to occur by chance, we can safely assume that there really is a relationship between eye-separation and back-hairiness.
71
Important point: do not confuse statistical significance with practical importance. We have just assessed "statistical significance" – the likelihood that our obtained correlation has arisen merely by chance. Our r of .75 is "highly significant" (i.e., highly unlikely to have arisen by chance). However, a weak correlation can be statistically significant if the sample size is large enough: with 100 df, an r of .1946 is "significant" in the sense that it is unlikely to have arisen by chance (r's bigger than this will occur by chance only 5 times in 100).
72
The coefficient of determination (r²) shows that an r of 0.1946 is not a strong relationship in a practical sense: r² = 0.1946 × 0.1946 = 0.0379 = 3.79%. Knowledge of one of the variables would account for only 3.79% of the variance in the other – completely useless for predictive purposes!
73
My scaly butt is of large size!
75
The relationship between two variables (e.g. height and weight; age and I.Q.) can be described graphically with a scatterplot. [Scatterplot: x-axis = reaction time (msec), from short to long; y-axis = age (years), from young to old; each point represents an individual's performance – each person supplies two scores, age and r.t.]
76
We are often interested in seeing whether or not a linear relationship exists between two variables. Here, there is a strong positive relationship between reaction time and age:
77
Here is an equally strong but negative relationship between reaction time and age:
78
And here, there is no statistically significant relationship between RT and age:
79
If we find a reasonably strong linear relationship between two variables, we might want to fit a straight line to the scatterplot. There are two reasons for wanting to do this: (a) For description: The line acts as a succinct description of the "idealized" relationship between our two variables, a relationship which we assume the real data reflect somewhat imperfectly. (b) For prediction: We could use the line to obtain estimates of values for one of the variables, on the basis of knowledge of the value of the other variable (e.g. if we knew a person's height, we could predict their weight).
80
Linear Regression is an objective method of fitting a line to a scatterplot - better than trying to do it by eye! Which line is the best fit to the data?
81
The recipe for drawing a straight line: to draw a line, we need two values: (a) the intercept – the point at which the line intercepts the vertical axis of the graph; (b) the slope of the line. [Figures: lines with the same intercept but different slopes; lines with different intercepts but the same slope]
82
The formula for a straight line: Y = a + b * X Y is a value on the vertical (Y) axis; a is the intercept (the point at which the line intersects the vertical axis of the graph [Y-axis]); b is the slope of the line; X is any value on the horizontal (X) axis.
83
Linear regression step-by-step: 10 individuals do two tests: a stress test and a statistics test. What is the relationship between stress and statistics performance?
Subject   Stress (X)   Test score (Y)
A         18           84
B         31           67
C         25           63
D         29           89
E         21           93
F         32           63
G         40           55
H         36           70
I         35           53
J         27           77
84
Draw a scatterplot to see what the data look like:
85
There is a negative relationship between stress scores and statistics scores: people who scored high on the statistics test tend to have low stress levels, and people who scored low on the statistics test tend to have high stress levels.
86
Calculating the regression line: We need to find "a" (the intercept) and "b" (the slope) of the line. Work out "b" first, and "a" second.
87
To calculate "b", the slope of the line:
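The slope formula appears as an image in the original slides; a reconstruction consistent with the columns of the table that follows (ΣX, ΣX², ΣY, ΣXY) is:

```latex
% Least-squares slope from raw-score sums
b = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{N}}
         {\sum X^{2} - \frac{(\sum X)^{2}}{N}}
```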
88
Subject   X (stress)   X²            Y (test)   XY
A         18           18² = 324     84         18 × 84 = 1512
B         31           31² = 961     67         31 × 67 = 2077
C         25           25² = 625     63         25 × 63 = 1575
D         29           29² = 841     89         29 × 89 = 2581
E         21           21² = 441     93         21 × 93 = 1953
F         32           32² = 1024    63         32 × 63 = 2016
G         40           40² = 1600    55         40 × 55 = 2200
H         36           36² = 1296    70         36 × 70 = 2520
I         35           35² = 1225    53         35 × 53 = 1855
J         27           27² = 729     77         27 × 77 = 2079
ΣX = 294, ΣX² = 9066, ΣY = 714, ΣXY = 20368
89
We also need: N = the number of pairs of scores = 10 in this case. (ΣX)² = "the sum of X, squared" = 294 × 294 = 86436. NB: (ΣX)² means "square the sum of X": add together all of the X values to get a total, and then square this total. ΣX² means "sum the squared X values": square each X value, and then add together these squared X values to get a total.
90
Working through the formula for b:
b = (20368 − (294 × 714)/10) / (9066 − 294²/10)
  = (20368 − 20991.60) / (9066 − 8643.60)
  = −623.60 / 422.40
  = −1.476
91
b = -1.476. b is negative, because the regression line slopes downwards from left to right: As stress scores (X) increase, statistics scores (Y) decrease.
92
Now work out a: Ȳ, the mean of the Y scores, = 71.4; X̄, the mean of the X scores, = 29.4; b = −1.476. Therefore a = Ȳ − b × X̄ = 71.4 − (−1.476 × 29.4) = 114.80
93
The complete regression equation now looks like: Y' = 114.80 + ( -1.476 * X) To draw the line, input any three different values for X, in order to get associated values for Y'. For X = 10, Y' = 114.80 + (-1.476 * 10) = 100.04 For X = 30, Y' = 114.80 + (-1.476 * 30) = 70.52 For X = 50, Y' = 114.80 + (-1.476 * 50) = 41.00
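As an optional check (not in the original slides), the same slope, intercept and predictions can be computed directly from the raw data using the formulas above:

```python
# Least-squares regression of test score (Y) on stress (X), done from the raw sums.
stress = [18, 31, 25, 29, 21, 32, 40, 36, 35, 27]   # X, from the table
score  = [84, 67, 63, 89, 93, 63, 55, 70, 53, 77]   # Y, from the table

n = len(stress)
sum_x, sum_y = sum(stress), sum(score)
sum_x2 = sum(x * x for x in stress)
sum_xy = sum(x * y for x, y in zip(stress, score))

b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)   # slope
a = sum_y / n - b * (sum_x / n)                                # intercept
print(round(b, 3), round(a, 2))          # about -1.476 and 114.80

for x in (10, 30, 50):
    print(x, round(a + b * x, 2))        # close to the slide's Y' values (tiny rounding differences)
```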
94
Regression line for predicting test scores (Y) from stress scores (X): [Plot with stress score (X) on the x-axis (0–50) and test score (Y) on the y-axis (0–120); intercept = 114.80; plotted points: X = 10, Y′ = 100.04; X = 30, Y′ = 70.52; X = 50, Y′ = 41.00]
95
Important: This is the regression line for predicting statistics test score on the basis of knowledge of a person's stress score; this is the "regression of Y on X". To predict stress score on the basis of knowledge of statistics test score (the "regression of X on Y"), we can't use this regression line!
96
To predict Y from X requires a line that minimizes the deviations of the predicted Y's from actual Y's. To predict X from Y requires a line that minimizes the deviations of the predicted X's from actual X's - a different task (although somewhat similar)! Solution: To calculate regression of X on Y, swap the column labels (so that the "X" values are now the "Y" values, and vice versa); and re-do the calculations. So X is now test results, Y is now stress score
98
Simple regression in SPSS: page 155 in Field's book in the compendium. More advanced types of regression handle non-linear relationships. There is also multiple regression: regression with more than two variables (several predictors).
99
Analyze > Regression > Linear...
100
The output comes in several tables: the first provides the R and R-square values, and the standard error of the estimate.
101
The second provides an ANOVA. The ANOVA provides an estimate of whether the regression model is significantly better than if we simply used the mean values of the samples, i.e. whether our regression model predicts the variation in the DV significantly well or not.
102
The third table provides the information needed for prediction. Here we can see that an increase of 1 in X is associated with an increase of 0.096 in Y, and that a = 134.140 (the intercept with the Y-axis).
103
This gives us the formula: Y = 134.14 + (0.09612*X)