1
Correlation and regression
2
Lecture: correlation; regression. Exercise: group tasks on correlation and regression; free experiment supervision/help.
3
Last week we covered four types of non-parametric statistical tests. They make no assumptions about the data's characteristics. Use them if any of the three properties below are true: (a) the data are not normally distributed (e.g. skewed); (b) the data show inhomogeneity of variance; (c) the data are measurements on an ordinal scale (can be ranked).
4
Non-parametric tests make few assumptions about the distribution of the data being analyzed. They get around this by not using the raw scores but by ranking them: the lowest score gets rank 1, the next lowest rank 2, and so on. How the ranking is carried out differs from test to test, but the principle is the same. The analysis is carried out on the ranks, not the raw data. Ranking the data means we lose information – we do not know the distances between the ranks. This means that non-parametric tests are less powerful than parametric tests: they are less likely to discover an effect in our data (an increased chance of a type II error).
5
Examples of parametric tests and their non-parametric equivalents:
Parametric test                              Non-parametric counterpart
Pearson correlation                          Spearman's correlation
(No equivalent test)                         Chi-square test
Independent-means t-test                     Mann-Whitney test
Dependent-means t-test                       Wilcoxon test
One-way independent-measures ANOVA           Kruskal-Wallis test
One-way repeated-measures ANOVA              Friedman's test
6
Just as with parametric tests, which non-parametric test to use depends on the experimental design (repeated measures or independent groups) and the number of conditions/levels of the IV.
7
Mann-Whitney: two conditions, two groups, each participant gives one score. Wilcoxon: two conditions, one group, each participant gives two scores (one per condition). Kruskal-Wallis: 3+ conditions, different people in all conditions, each participant gives one score. Friedman's ANOVA: 3+ conditions, one group, each participant gives 3+ scores (one per condition).
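The slides use SPSS for these tests; purely as an illustrative aside (not from the original slides), the same four tests are available in Python's SciPy library. The data below are invented, and the mapping of scenarios to functions follows the list above.

```python
# A minimal sketch of the four non-parametric tests, with made-up example data.
from scipy import stats

group_a = [3, 5, 2, 6, 4]           # one score per participant, group A
group_b = [7, 9, 6, 8, 10]          # one score per participant, group B

# Mann-Whitney: two conditions, two different groups
print(stats.mannwhitneyu(group_a, group_b))

# Wilcoxon: two conditions, the same participants measured twice
before = [4, 6, 5, 7, 3]
after  = [5, 8, 6, 9, 4]
print(stats.wilcoxon(before, after))

# Kruskal-Wallis: 3+ conditions, different people in each
group_c = [12, 14, 11, 13, 15]
print(stats.kruskal(group_a, group_b, group_c))

# Friedman's ANOVA: 3+ conditions, the same participants in all of them
cond1, cond2, cond3 = [4, 6, 5, 7], [5, 8, 6, 9], [7, 9, 8, 10]
print(stats.friedmanchisquare(cond1, cond2, cond3))
```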
8
Which nonparametric test?
1. Differences in fear ratings for 3-, 5- and 7-year-olds in response to sinister noises from under their bed
2. Effects of cheese, Brussels sprouts, wine and curry on the vividness of a person's dreams
3. Number of people spearing their eardrums after enforced listening to Britney Spears, Beyonce, Robbie Williams and Boyzone
4. Pedestrians rate the aggressiveness of owners of different types of car. Group A rate Micra owners; group B rate 4x4 owners; group C rate Subaru owners; group D rate Mondeo owners.
Consider: how many groups? How many levels of the IV / conditions?
9
1. Differences in fear ratings for 3-, 5- and 7-year-olds in response to sinister noises from under their bed [3 groups, one score each, 3 conditions – Kruskal-Wallis].
2. Effects of cheese, Brussels sprouts, wine and curry on the vividness of a person's dreams [one group, 4 scores each, 4 conditions – Friedman's ANOVA].
3. Number of people spearing their eardrums after enforced listening to Britney Spears, Beyonce, Robbie Williams and Boyzone [one group, 4 scores each, 4 conditions – Friedman's ANOVA].
4. Pedestrians rate the aggressiveness of owners of different types of car. Group A rate Micra owners; group B rate 4x4 owners; group C rate Subaru owners; group D rate Mondeo owners. [4 groups, one score each – Kruskal-Wallis]
11
We often want to know if there is a relationship between two variables Do people who drive fast cars get into accidents more often? Do students who give the teacher red apples get higher grades? Do blondes have more fun? Etc.
12
Correlation coefficient: A succinct measure of the strength of the relationship between two variables (e.g. height and weight, age and reaction time, IQ and exam score).
13
A correlation is a measure of the linear relationship between variables. Two variables can be related in different ways: 1) positively related: the faster the car, the more accidents; 2) not related: the speed of the car has no bearing on the number of accidents; 3) negatively related: the faster the car, the fewer accidents.
14
We describe the relationship between variables statistically by looking at two measures: covariance and the correlation coefficient. We represent relationships graphically using scatterplots. The simplest way to decide whether two variables are associated is to evaluate whether they covary. Recall: the variance of one variable is the average amount the scores in the sample vary from the mean – if variance is high, the scores in the sample are very different from the mean.
15
Low and high variance around the mean of a sample
16
If we want to know whether two variables are related, we want to know whether changes in the scores of one variable are met with similar changes in the other variable. Therefore, when one variable deviates from its mean, we would expect the scores of the other variable to deviate from their mean in a similar way. Example: we take 5 people, show them a commercial, and measure how many packets of sweets they buy the week after. If the number of times the commercial was seen is related to the number of packets of sweets bought, the scores should vary around the means of the two samples in a similar way.
17
It looks like a relationship exists
18
How do we calculate the exact similarity between the patterns of deviation in the two variables (samples)? We calculate the covariance. Step 1: for each participant, multiply the deviation of the score from the mean in one sample by the corresponding deviation in the other sample. Note that if both deviations are positive or both are negative, we get a positive value (+ × + = + and − × − = +); if one deviation is positive and the other negative, we get a negative value (+ × − = −).
19
Step 2: divide by the number of observations (score pairs) minus 1: N − 1. This is the same equation as for calculating variance, except that we multiply each deviation by the corresponding deviation of the score in the other sample, rather than squaring the deviations within one sample.
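A minimal sketch of these two steps (not part of the slides, which do the calculation by hand or in SPSS); the commercial/sweets numbers here are invented for illustration.

```python
# Step 1: cross-products of paired deviations; Step 2: divide by N - 1.
adverts = [5, 4, 4, 6, 8]        # hypothetical: times the commercial was seen
packets = [8, 9, 10, 13, 15]     # hypothetical: packets of sweets bought

n = len(adverts)
mean_x = sum(adverts) / n
mean_y = sum(packets) / n

# Step 1: multiply each deviation in one sample by the matching deviation in the other
cross_products = [(x - mean_x) * (y - mean_y) for x, y in zip(adverts, packets)]

# Step 2: divide the summed cross-products by N - 1
covariance = sum(cross_products) / (n - 1)
print(covariance)   # positive value -> the two variables rise and fall together
```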
20
Positive covariance indicates that as one variable deviates from its mean, the other deviates in the same direction (faster cars go with more accidents). Negative covariance indicates that as one variable deviates from its mean, the other deviates in the opposite direction (faster cars go with fewer accidents).
21
Covariance, however, depends on the scale of measurement used – it is not a scale-independent measure. To overcome this problem, we standardize the covariance, so that it is comparable across experiments, no matter what units of measurement we use.
22
We do this by expressing deviations between scores and means in units of standard deviation. Recall: any score can be expressed in terms of how many SDs it is away from the mean (the z-score). We therefore divide the covariance by the SDs of both samples – there are two samples, so we need the SD of each to standardize the covariance.
23
This standardized covariance is known as the correlation coefficient. It is also called Pearson's correlation coefficient and is one of the most important formulas in statistics.
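The formula itself is shown as an image in the original slides and is not reproduced in this transcript; a reconstruction of the standardized covariance it refers to is:

```latex
% Pearson's correlation coefficient as standardized covariance
r = \frac{\operatorname{cov}(X,Y)}{s_X \, s_Y}
  = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{(N-1)\, s_X \, s_Y}
```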
24
When we standardize covariance we end up with a value that lies between -1 and +1. If r = +1, we have a perfect positive relationship.
25
+1 (perfect positive correlation: as X increases, so does Y):
26
When we standardize covariance we end up with a value that lies between -1 and +1. If r = -1, we have a perfect negative relationship.
27
Perfect negative correlation: As X increases, Y decreases, or vice versa
28
If r = 0 there is no correlation between the two samples – changes in sample X are not associated with systematic changes in sample Y, or vice versa. Recall that we can use the correlation coefficient as a measure of effect size: an r of ±0.1 is a small effect, ±0.3 a medium effect and ±0.5 a large effect.
30
Before performing a correlational analysis we plot a scatterplot to get an idea of how the variables covary. A scatterplot is a graph of the scores on one variable plotted against the scores on another variable. Further variables can be included in a 3D plot.
31
A scatterplot shows: whether there is a relationship between the variables; what kind of relationship it is; and whether any cases (scores) are markedly different – outliers – which cause problems. We normally plot the IV on the x-axis and the DV on the y-axis.
32
A 2D scatterplot
33
A 3D scatterplot
34
Using SPSS to obtain scatterplots: (a) simple scatterplot: Graphs > Legacy Dialogs > Scatter/Dot...
35
Using SPSS to obtain scatterplots: (a) simple scatterplot: Graphs > Chart Builder. 1. Pick Scatter/Dot. 2. Drag the "Simple Scatter" icon into the chart preview window. 3. Drag the X and Y variables into the x-axis and y-axis boxes in the chart preview window.
36
Using SPSS to obtain scatterplots: (b) scatterplot with regression line: Analyze > Regression > Curve Estimation... In the output, "Constant" is the intercept with the y-axis and "b1" is the slope.
37
Having looked at the data visually, we can conduct a correlation analysis in SPSS. The procedure is on page 123 in chapter 4 of Field's book in the compendium. Note that there are two types of correlation: bivariate and partial. Bivariate correlation is the correlation between two variables; partial correlation is the same, but controlling for one or more additional variables.
38
Using SPSS to obtain correlations: Analyze > Correlate > Bivariate...
39
There are various types of correlation coefficient, for different purposes: Pearson's "r": used when both X and Y variables are (a) continuous; (b) (ideally) measurements on interval or ratio scales; (c) normally distributed – e.g. height, weight, I.Q. Spearman's rho: used in the same circumstances as Pearson's r, except that the data need only be on an ordinal scale – e.g. attitudes, personality scores.
40
r is a parametric test: the data have to have certain characteristics (parameters) before it can be used. rho is a non-parametric test - less fussy about the nature of the data on which it is performed. Both are dead easy to calculate in SPSS
42
Calculating Pearson's r: a worked example: Is there a relationship between the number of parties a person gives each month, and the amount of flour they purchase from Møller-Mogens?
43
Our algorithm for the correlation coefficient from before, slightly modified:
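The equation itself appears as an image in the original slides and is not reproduced in this transcript; the computational ("raw score") form it most likely refers to, given the columns of the table that follows, is:

```latex
% Computational formula for Pearson's r using sums of raw scores
r = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{N}}
         {\sqrt{\left(\sum X^{2} - \frac{(\sum X)^{2}}{N}\right)
                \left(\sum Y^{2} - \frac{(\sum Y)^{2}}{N}\right)}}
```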
44
Month   Flour (X)   No. of parties (Y)   X²     Y²     XY
A       37          75                   1369   5625   2775
B       41          78                   1681   6084   3198
C       48          88                   2304   7744   4224
D       32          80                   1024   6400   2560
E       36          78                   1296   6084   2808
F       30          71                   900    5041   2130
G       40          75                   1600   5625   3000
H       45          83                   2025   6889   3735
I       39          74                   1521   5476   2886
J       34          74                   1156   5476   2516
N = 10, ΣX = 382, ΣY = 776, ΣX² = 14876, ΣY² = 60444, ΣXY = 29832
45
Using our values from the bottom row of the table (N = 10, ΣX = 382, ΣY = 776, ΣX² = 14876, ΣY² = 60444, ΣXY = 29832):
r = (ΣXY − (ΣX)(ΣY)/N) / √[(ΣX² − (ΣX)²/N)(ΣY² − (ΣY)²/N)]
  = (29832 − (382 × 776)/10) / √[(14876 − 382²/10)(60444 − 776²/10)]
46
  = (29832 − 29643.20) / √(283.60 × 226.40)
  = 188.80 / 253.39
  ≈ 0.75
r is 0.75. This is a positive correlation: people who buy a lot of flour from Møller-Mogens also hold a lot of parties (and vice versa).
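As a quick check (not part of the original slides), the same value comes out of SciPy's pearsonr when given the data from the table:

```python
from scipy import stats

flour   = [37, 41, 48, 32, 36, 30, 40, 45, 39, 34]   # X values from the table
parties = [75, 78, 88, 80, 78, 71, 75, 83, 74, 74]   # Y values from the table

r, p = stats.pearsonr(flour, parties)
print(round(r, 2))   # 0.75, matching the hand calculation
```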
47
How to interpret the size of a correlation: r² (r × r, "r-squared") is the "coefficient of determination". It tells us what proportion of the variation in the Y scores is associated with changes in X. E.g., if r is 0.2, r² is 4% (0.2 × 0.2 = 0.04 = 4%). Only 4% of the variation in Y scores is attributable to Y's relationship with X. Thus, knowing a person's Y score tells you essentially nothing about what their X score might be!
48
Our correlation of 0.75 gives an r² of 56%. An r of 0.9 gives an r² of (0.9 × 0.9 = .81) = 81%. Note that correlations become much stronger the closer they are to 1 (or −1): correlations of .6 or −.6 (r² = 36%) are much better than correlations of .3 or −.3 (r² = 9%), not merely twice as strong!
50
We use Spearman's correlation coefficient when the data have violated parametric assumptions (e.g. a non-normal distribution). Spearman's correlation coefficient works by ranking the data in the samples, just like the other non-parametric tests.
51
Spearman's rho measures the degree of monotonicity rather than linearity in the relationship between two variables – i.e., the extent to which there is some kind of consistent change in Y associated with changes in X. Hence it copes better than Pearson's r when the relationship is monotonic but non-linear – e.g.:
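An illustrative sketch (not from the slides): on data that rise monotonically but along a curve, rho stays at 1.0 while r drops below 1. The cubic relationship here is just an invented example.

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [v ** 3 for v in x]          # strictly increasing, but curved

print(stats.pearsonr(x, y))      # r < 1: the relationship is not a straight line
print(stats.spearmanr(x, y))     # rho = 1.0: perfectly monotonic
```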
53
Some pertinent notes on interpreting correlations: Correlation does not imply causality: X might cause Y. Y might cause X. Z (or a whole set of factors) might cause both X and Y.
54
Factors affecting the size of a correlation coefficient: 1. Sample size and random variation: The larger the sample, the more stable the correlation coefficient. Correlations obtained with small samples are unreliable.
55
Conclusion: you need a large sample before you can be really sure that your sample r is an accurate reflection of the population r. Limits within which 80% of sample r's will fall, when the true (population) correlation is 0:
Sample size:   80% limits for r:
5              −0.69 to +0.69
15             −0.35 to +0.35
25             −0.26 to +0.26
50             −0.18 to +0.18
100            −0.13 to +0.13
200            −0.09 to +0.09
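A small simulation (not from the slides) reproduces the flavour of this table: repeatedly draw pairs of uncorrelated samples and look at the middle 80% of the resulting r's. The sample sizes, number of repetitions, and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)

for n in (5, 100):
    # correlate two independent (truly uncorrelated) samples, many times over
    rs = [np.corrcoef(rng.normal(size=n), rng.normal(size=n))[0, 1]
          for _ in range(10_000)]
    lo, hi = np.percentile(rs, [10, 90])   # limits containing 80% of sample r's
    print(f"N = {n:3d}: 80% of r's fall between {lo:+.2f} and {hi:+.2f}")
```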
56
2. Linearity of the relationship: Pearson’s r measures the strength of the linear relationship between two variables; r will be misleading if there is a strong but non-linear relationship. e.g.:
57
3. Range of talent (variability): The smaller the amount of variability in X and/or Y, the lower the apparent correlation. e.g.:
58
4. Homoscedasticity (equal variability): r describes the average strength of the relationship between X and Y. Hence scores should have a constant amount of variability at all points in their distribution. [Figure: scatterplot with regression line – low variability of Y (small Y − Y′) in one region, high variability of Y (large Y − Y′) in another]
59
5. Effect of discontinuous distributions: a few outliers can distort things considerably. There is no real correlation between X and Y in the case below.
60
Deciding what is a "good" correlation: A moderate correlation could be due to either: (a) sampling variation (and hence a "fluke"); or (b) a genuine association between the variables concerned. How can we tell which of these is correct?
61
Distribution of r's obtained using samples drawn from two uncorrelated populations of scores: [Diagram: sampling distribution of r centred on r = 0 – small correlations are likely to occur by chance; large positive or large negative correlations are unlikely to occur by chance]
62
For an N of 20: [the 2.5% tails of the distribution lie beyond r = −0.44 and r = +0.44.] For a sample size of 20, 5 out of 100 random samples are likely to produce an r of 0.44 or larger in magnitude merely by chance (i.e., even though in the population there is no correlation at all!)
63
Thus we arbitrarily decide that: (a) If our sample correlation is so large that it would occur by chance only 5 times in a hundred, we will assume that it reflects a genuine correlation in the population from which the sample came. (b) If a correlation like ours is likely to occur by chance more often than this, we assume it has arisen merely by chance, and that it is not evidence for a correlation in the parent population.
64
How do we know how likely it is to obtain a sample correlation as large as ours by chance? Tables (on the website) give this information for different sample sizes. An illustration of how to use these tables: suppose we take a sample of 20 people, and measure their eye-separation and back hairiness. Our sample r is .75. Does this reflect a true correlation between eye-separation and hairiness in the parent population, or has our r arisen merely by chance (i.e. because we have a freaky sample)?
65
Step 1: Calculate the "degrees of freedom" (DF = the number of pairs of scores, minus 2). Here, we have 20 pairs of scores, so DF = 18. Step 2: Find a table of "critical values for Pearson's r".
66
Part of a table of "critical values for Pearson's r":
Level of significance (two-tailed)
df      .05      .01      .001
17      .4555    .5751    .6932
18      .4438    .5614    .6787
19      .4329    .5487    .6652
20      .4227    .5368    .6524
With 18 df, a correlation of .4438 or larger will occur by chance with a probability of 0.05: i.e., if we took 100 samples of 20 people, about 5 of those samples are likely to produce an r of .4438 or larger (even though there is actually no correlation in the population!)
67
Part of a table of "critical values for Pearson's r":
Level of significance (two-tailed)
df      .05      .01      .001
17      .4555    .5751    .6932
18      .4438    .5614    .6787
19      .4329    .5487    .6652
20      .4227    .5368    .6524
With 18 df, a correlation of .5614 or larger will occur by chance with a probability of 0.01: i.e., if we took 100 samples of 20 people, about 1 of those 100 samples is likely to give an r of .5614 or larger (again, even though there is actually no correlation in the population!)
68
Part of a table of "critical values for Pearson's r":
Level of significance (two-tailed)
df      .05      .01      .001
17      .4555    .5751    .6932
18      .4438    .5614    .6787
19      .4329    .5487    .6652
20      .4227    .5368    .6524
With 18 df, a correlation of .6787 or larger will occur by chance with a probability of 0.001: i.e., if we took 1000 samples of 20 people, about 1 of those 1000 samples is likely to give an r of .6787 or larger (again, even though there is actually no correlation in the population!)
69
The table shows that an r of .6787 is likely to occur by chance only once in a thousand times. Our obtained r is .75, which is larger than .6787. Hence our obtained r of .75 is likely to occur by chance less than one time in a thousand (p < 0.001).
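For completeness (this is not in the slides, which use printed tables), the critical-value lookup can be mimicked in software by converting r into a t statistic with N − 2 degrees of freedom; the sketch below assumes the usual t-transformation of r.

```python
from scipy import stats
import math

r, n = 0.75, 20
df = n - 2
t = r * math.sqrt(df / (1 - r ** 2))   # t-transformation of the sample correlation
p = 2 * stats.t.sf(abs(t), df)         # two-tailed p-value from the t distribution
print(p)                               # well below 0.001, matching the table-based conclusion
```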
70
Conclusion: any sample correlation could in principle occur due to chance or because it reflects a true relationship in the population from which the sample was taken. Because our r of .75 is so unlikely to occur by chance, we can safely assume that there really is a relationship between eye-separation and back-hairiness.
71
Important point: do not confuse statistical significance with practical importance. We have just assessed "statistical significance" – the likelihood that our obtained correlation has arisen merely by chance. Our r of .75 is "highly significant" (i.e., highly unlikely to have arisen by chance). However, a weak correlation can be statistically significant if the sample size is large enough: with 100 df, an r of .1946 is "significant" in the sense that it is unlikely to have arisen by chance (r's bigger than this will occur by chance only 5 times in 100).
72
The coefficient of determination (r²) shows that an r of 0.1946 is not a strong relationship in a practical sense: r² = 0.1946 × 0.1946 = 0.0379 = 3.79%. Knowledge of one of the variables would account for only 3.79% of the variance in the other – completely useless for predictive purposes!
73
My scaly butt is of large size!
75
The relationship between two variables (e.g. height and weight; age and I.Q.) can be described graphically with a scatterplot. [Scatterplot: x-axis = reaction time (msec), from short to long; y-axis = age (years), from young to old; each point represents an individual's performance – each person supplies two scores, age and r.t.]
76
We are often interested in seeing whether or not a linear relationship exists between two variables. Here, there is a strong positive relationship between reaction time and age:
77
Here is an equally strong but negative relationship between reaction time and age:
78
And here, there is no statistically significant relationship between RT and age:
79
If we find a reasonably strong linear relationship between two variables, we might want to fit a straight line to the scatterplot. There are two reasons for wanting to do this: (a) For description: The line acts as a succinct description of the "idealized" relationship between our two variables, a relationship which we assume the real data reflect somewhat imperfectly. (b) For prediction: We could use the line to obtain estimates of values for one of the variables, on the basis of knowledge of the value of the other variable (e.g. if we knew a person's height, we could predict their weight).
80
Linear Regression is an objective method of fitting a line to a scatterplot - better than trying to do it by eye! Which line is the best fit to the data?
81
The recipe for drawing a straight line: to draw a line, we need two values: (a) the intercept – the point at which the line intercepts the vertical axis of the graph; (b) the slope of the line. [Figures: lines with the same intercept but different slopes; lines with different intercepts but the same slope]
82
The formula for a straight line: Y = a + b * X Y is a value on the vertical (Y) axis; a is the intercept (the point at which the line intersects the vertical axis of the graph [Y-axis]); b is the slope of the line; X is any value on the horizontal (X) axis.
83
Linear regression step-by-step: 10 individuals do two tests: a stress test and a statistics test. What is the relationship between stress and statistics performance?
Subject   Stress (X)   Test score (Y)
A         18           84
B         31           67
C         25           63
D         29           89
E         21           93
F         32           63
G         40           55
H         36           70
I         35           53
J         27           77
84
Draw a scatterplot to see what the data look like:
85
There is a negative relationship between stress scores and statistics scores: people who scored high on the statistics test tend to have low stress levels, and people who scored low on the statistics test tend to have high stress levels.
86
Calculating the regression line: We need to find "a" (the intercept) and "b" (the slope) of the line. Work out "b" first, and "a" second.
87
To calculate "b", the slope of the line:
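The slope formula appears as an image in the original slides; a reconstruction consistent with the columns of the table that follows (ΣX, ΣX², ΣY, ΣXY) is:

```latex
% Least-squares slope from raw-score sums
b = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{N}}
         {\sum X^{2} - \frac{(\sum X)^{2}}{N}}
```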
88
Subject   X (stress)   X²            Y (test)   XY
A         18           18² = 324     84         18 × 84 = 1512
B         31           31² = 961     67         31 × 67 = 2077
C         25           25² = 625     63         25 × 63 = 1575
D         29           29² = 841     89         29 × 89 = 2581
E         21           21² = 441     93         21 × 93 = 1953
F         32           32² = 1024    63         32 × 63 = 2016
G         40           40² = 1600    55         40 × 55 = 2200
H         36           36² = 1296    70         36 × 70 = 2520
I         35           35² = 1225    53         35 × 53 = 1855
J         27           27² = 729     77         27 × 77 = 2079
ΣX = 294, ΣX² = 9066, ΣY = 714, ΣXY = 20368
89
We also need: N = the number of pairs of scores = 10 in this case. (ΣX)² = "the sum of X, squared" = 294 × 294 = 86436. NB: (ΣX)² means "square the sum of X": add together all of the X values to get a total, and then square this total. ΣX² means "sum the squared X values": square each X value, and then add together these squared X values to get a total.
90
Working through the formula for b:
b = (20368 − (294 × 714)/10) / (9066 − 294²/10)
  = (20368 − 20991.60) / (9066 − 8643.60)
  = −623.60 / 422.40
  = −1.476
91
b = -1.476. b is negative, because the regression line slopes downwards from left to right: As stress scores (X) increase, statistics scores (Y) decrease.
92
Now work out a: Ȳ, the mean of the Y scores, = 71.4; X̄, the mean of the X scores, = 29.4; b = −1.476. Therefore a = Ȳ − b × X̄ = 71.4 − (−1.476 × 29.4) = 114.80
93
The complete regression equation now looks like: Y' = 114.80 + ( -1.476 * X) To draw the line, input any three different values for X, in order to get associated values for Y'. For X = 10, Y' = 114.80 + (-1.476 * 10) = 100.04 For X = 30, Y' = 114.80 + (-1.476 * 30) = 70.52 For X = 50, Y' = 114.80 + (-1.476 * 50) = 41.00
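As an optional check (not in the original slides), the same slope, intercept and predictions can be computed directly from the raw data using the formulas above:

```python
# Least-squares regression of test score (Y) on stress (X), done from the raw sums.
stress = [18, 31, 25, 29, 21, 32, 40, 36, 35, 27]   # X, from the table
score  = [84, 67, 63, 89, 93, 63, 55, 70, 53, 77]   # Y, from the table

n = len(stress)
sum_x, sum_y = sum(stress), sum(score)
sum_x2 = sum(x * x for x in stress)
sum_xy = sum(x * y for x, y in zip(stress, score))

b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)   # slope
a = sum_y / n - b * (sum_x / n)                                # intercept
print(round(b, 3), round(a, 2))          # about -1.476 and 114.80

for x in (10, 30, 50):
    print(x, round(a + b * x, 2))        # close to the slide's Y' values (tiny rounding differences)
```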
94
Regression line for predicting test scores (Y) from stress scores (X): [Plot with stress score (X) on the x-axis (0–50) and test score (Y) on the y-axis (0–120); intercept = 114.80; plotted points: X = 10, Y′ = 100.04; X = 30, Y′ = 70.52; X = 50, Y′ = 41.00]
95
Important: This is the regression line for predicting statistics test score on the basis of knowledge of a person's stress score; this is the "regression of Y on X". To predict stress score on the basis of knowledge of statistics test score (the "regression of X on Y"), we can't use this regression line!
96
To predict Y from X requires a line that minimizes the deviations of the predicted Y's from actual Y's. To predict X from Y requires a line that minimizes the deviations of the predicted X's from actual X's - a different task (although somewhat similar)! Solution: To calculate regression of X on Y, swap the column labels (so that the "X" values are now the "Y" values, and vice versa); and re-do the calculations. So X is now test results, Y is now stress score
98
Simple regression in SPSS: page 155 in Field's book in the compendium. More advanced types of regression handle non-linear relationships. There is also multiple regression: regression with more than two variables (several predictors).
99
Analyze > Regression > Linear...
100
The output comes in several tables: the first provides the R and R-square values, and the standard error of the estimate.
101
The second provides an ANOVA. The ANOVA provides an estimate of whether the regression model is significantly better than if we simply used the mean values of the samples, i.e. whether our regression model predicts the variation in the DV significantly well or not.
102
The third table provides the information needed for prediction. Here we can see that an increase of 1 in X is associated with an increase of 0.096 in Y, and that a = 134.140 (the intercept with the Y-axis).
103
This gives us the formula: Y = 134.14 + (0.09612*X)