Linear Regression:
The relationship between two variables (e.g. height and weight; age and IQ) can be described graphically with a scatterplot.
[Figure: scatterplot. x-axis: reaction time (msec), from short to long; y-axis: age (years), from young to old. Each point is an individual's performance (each person supplies two scores, age and RT).]
Often in psychology, we are interested in seeing whether or not a linear relationship exists between two variables. Here, there is a strong positive relationship between RT and age:
Here is an equally strong but negative relationship between RT and age:
Here, there is no relationship between RT and age:
If we find a reasonably strong linear relationship between two variables, we might want to fit a straight line to the scatterplot. There are two reasons for wanting to do this:
(a) for description: the line acts as a succinct description of the "idealised" relationship between our two variables, a relationship which we assume the real data reflect somewhat imperfectly.
(b) for prediction: we could use the line to obtain estimates of values for one of the variables, on the basis of knowledge of the value of the other variable (e.g. if we knew a person's height, we could predict their weight).
Linear Regression is an objective method of fitting a line to our scatterplot - better than trying to do it by eye! Which line is the best fit to the data?
The recipe for drawing a straight line: To draw a line, we need two values:
(a) the intercept - the point at which the line intersects the vertical axis of the graph;
(b) the slope of the line.
[Figures: lines with the same intercept but different slopes; lines with different intercepts but the same slope.]
The formula for a straight line: Y = a + b * X Y is a value on the vertical (Y) axis; a is the intercept (the point at which the line intersects the vertical axis of the graph); b is the slope of the line; X is any value on the horizontal (X) axis.
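The effect of the two values can be checked numerically. The following is a minimal sketch; the function name `line` and the particular intercepts and slopes are illustrative choices, not values from the handout.

```python
def line(a: float, b: float, x: float) -> float:
    """Return Y = a + b * X for intercept a and slope b."""
    return a + b * x


xs = [0, 1, 2, 3]

# Same intercept (a = 2), different slopes: every line starts at Y = 2 when X = 0.
same_intercept = {b: [line(2, b, x) for x in xs] for b in (0.5, 1, 2)}

# Different intercepts, same slope (b = 1): the lines are parallel.
same_slope = {a: [line(a, 1, x) for x in xs] for a in (0, 2, 4)}

print(same_intercept)
print(same_slope)
```

When X = 0 the formula reduces to Y = a, which is why the intercept is the height at which the line meets the vertical axis.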
Linear regression step-by-step: 10 individuals do two tests: a stress test, and a statistics test. What is the relationship between stress and statistics performance?

subject   stress (X)   test score (Y)
A         18           84
B         31           67
C         25           63
D         29           89
E         21           93
F         32           63
G         40           55
H         36           70
I         35           53
J         27           77
Draw a scatterplot to see what the data look like:
There is a negative relationship between stress scores and statistics scores: people who scored high on the statistics test tend to have low stress levels, and people who scored low on the statistics test tend to have high stress levels.
Calculating the regression line: We need to find "a" (the intercept) and "b" (the slope) of the line. Work out "b" first, and "a" second.
To calculate “b”, the slope of the line:
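subject   stress (X)   X²     test (Y)   XY
A         18           324    84         18 * 84 = 1512
B         31           961    67         31 * 67 = 2077
C         25           625    63         25 * 63 = 1575
D         29           841    89         29 * 89 = 2581
E         21           441    93         21 * 93 = 1953
F         32           1024   63         32 * 63 = 2016
G         40           1600   55         40 * 55 = 2200
H         36           1296   70         36 * 70 = 2520
I         35           1225   53         35 * 53 = 1855
J         27           729    77         27 * 77 = 2079

ΣX = 294   ΣX² = 9066   ΣY = 714   ΣXY = 20368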
We also need:
N = the number of pairs of scores, = 10 in this case.
(ΣX)² = "the sum of X, squared" = 294 * 294 = 86436
NB: (ΣX)² means "square the sum of X"; add together all of the X values to get a total, and then square this total. ΣX² means "sum the squared X values"; square each X value, and then add together these squared X values to get a total.
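The sums needed for the slope can be checked with a short script. This is a minimal sketch: the data are the ten pairs of scores listed earlier, and the variable names are illustrative.

```python
stress = [18, 31, 25, 29, 21, 32, 40, 36, 35, 27]  # X, subjects A to J
score = [84, 67, 63, 89, 93, 63, 55, 70, 53, 77]   # Y, subjects A to J

n = len(stress)                                      # N, the number of pairs
sum_x = sum(stress)                                  # sum of X
sum_y = sum(score)                                   # sum of Y
sum_x_sq = sum(x * x for x in stress)                # square each X, then sum
sq_sum_x = sum_x ** 2                                # sum the X values, then square
sum_xy = sum(x * y for x, y in zip(stress, score))   # sum of the X * Y products

print(n, sum_x, sum_y, sum_x_sq, sq_sum_x, sum_xy)
```

The two quantities `sum_x_sq` and `sq_sum_x` differ only in the order of squaring and summing, yet they give very different totals.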
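Working through the formula for b:
b = (N * ΣXY - ΣX * ΣY) / (N * ΣX² - (ΣX)²)
b = (10 * 20368 - 294 * 714) / (10 * 9066 - 86436)
b = (203680 - 209916) / (90660 - 86436)
b = -6236 / 4224 = -1.476
b is negative, because the regression line slopes downwards from left to right: as stress scores (X) increase, statistics scores (Y) decrease.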
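Now work out a: a = Ȳ - (b * X̄)
Ȳ is the mean of the Y scores: 714 / 10 = 71.4
X̄ is the mean of the X scores: 294 / 10 = 29.4
b = -1.476
Therefore a = 71.4 - (-1.476 * 29.4) = 114.80 (to two decimal places, using the unrounded value of b)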
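The slope and intercept can be computed together as follows. This is a sketch under the assumption that the slope comes from the raw-score formula b = (N * ΣXY - ΣX * ΣY) / (N * ΣX² - (ΣX)²) and the intercept from a = mean of Y minus b times mean of X; the function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line predicting ys from xs."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x_sq = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Slope first: raw-score formula.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x_sq - sum_x ** 2)
    # Intercept second: mean of Y minus slope times mean of X.
    a = sum_y / n - b * (sum_x / n)
    return a, b


stress = [18, 31, 25, 29, 21, 32, 40, 36, 35, 27]  # X
score = [84, 67, 63, 89, 93, 63, 55, 70, 53, 77]   # Y

a, b = fit_line(stress, score)
print(f"b = {b:.3f}, a = {a:.2f}")
```

Carrying the unrounded slope into the intercept calculation avoids small rounding discrepancies in the second decimal place of a.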
The complete regression equation: Y' = 114.80 + (-1.476 * X)
To draw the line, input any three different values for X, in order to get associated values for Y'.
For X = 10, Y' = 114.80 + (-1.476 * 10) = 100.04
For X = 30, Y' = 114.80 + (-1.476 * 30) = 70.52
For X = 50, Y' = 114.80 + (-1.476 * 50) = 41.00
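Obtaining the three points can be sketched in a few lines. The coefficients below (a = 114.80, b = -1.476) are the rounded values that result from fitting the ten pairs of scores above; the function name `predict_score` is illustrative.

```python
A = 114.80   # intercept, rounded to two decimal places
B = -1.476   # slope, rounded to three decimal places


def predict_score(stress: float) -> float:
    """Return the predicted test score Y' for a given stress score X."""
    return A + B * stress


for x in (10, 30, 50):
    print(f"X = {x}, Y' = {predict_score(x):.2f}")
```

Two points are enough to fix a straight line; a third acts as a check, since all three should fall on the same line.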
[Figure: regression line for predicting test scores (Y) from stress scores (X). x-axis: stress score (X); y-axis: test score (Y). The plot marks the intercept and the three predicted points for X = 10, X = 30 and X = 50.]
This is the regression line for predicting test score on the basis of knowledge of a person's stress score; this is the "regression of Y on X". To predict stress score on the basis of knowledge of test score (the "regression of X on Y"), we can't use this regression line! To predict Y from X requires a line that minimises the deviations of the predicted Y's from actual Y's. To predict X from Y requires a line that minimises the deviations of the predicted X's from actual X's - a different task! Solution: to calculate regression of X on Y, swap the column labels (so that the "X" values are now the "Y" values, and vice versa); and re-do the calculations.
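The difference between the two regressions can be seen by swapping the columns and refitting. This is a minimal sketch using the ten pairs of scores above; `fit_line` is an illustrative helper, not part of any package.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line predicting ys from xs."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x_sq = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x_sq - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)
    return a, b


stress = [18, 31, 25, 29, 21, 32, 40, 36, 35, 27]
score = [84, 67, 63, 89, 93, 63, 55, 70, 53, 77]

# Regression of Y on X: predict test score from stress score.
a_yx, b_yx = fit_line(stress, score)

# Regression of X on Y: swap the columns, so test score is now the predictor.
a_xy, b_xy = fit_line(score, stress)

print(f"test score from stress: a = {a_yx:.2f}, b = {b_yx:.3f}")
print(f"stress from test score: a = {a_xy:.2f}, b = {b_xy:.3f}")

# Simply turning the first line around would give a slope of 1 / b_yx,
# which is not the slope of the second regression.
print(f"1 / b_yx = {1 / b_yx:.3f}")
```

The two slopes would only be reciprocals of each other if every point lay exactly on the line.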
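Regression lines for predicting stress score from test score, and vice versa:
Predicting stress score from test score: Y' = 55.04 + (-0.359 * X), where X is the test score.
Predicting test score from stress score: Y' = 114.80 + (-1.476 * X), where X is the stress score.
(The previous graph redrawn, so that in both cases the predicted variable is on the vertical axis of the graph)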
Linear Regression using SPSS: Analyze... > Regression... > Curve Estimation
In the SPSS output, look for:
b, the slope;
a, the intercept;
R²: how much variation in test score is accounted for by its relationship with stress?
ANOVA: is our regression any better at predicting test score than simply using the mean test score?
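The quantities behind the R² and ANOVA entries can be worked out by hand. The sketch below is not SPSS output: it assumes the usual definitions (R² as the proportion of the total sum of squares explained by the regression, and F as the regression mean square divided by the residual mean square with N - 2 degrees of freedom), applied to the ten pairs of scores above. It uses the deviation form of the slope formula, which is algebraically equivalent to the raw-score form.

```python
stress = [18, 31, 25, 29, 21, 32, 40, 36, 35, 27]  # X
score = [84, 67, 63, 89, 93, 63, 55, 70, 53, 77]   # Y

n = len(stress)
mean_x = sum(stress) / n
mean_y = sum(score) / n

# Slope and intercept (deviation form of the least-squares formulas).
s_xx = sum((x - mean_x) ** 2 for x in stress)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(stress, score))
b = s_xy / s_xx
a = mean_y - b * mean_x

predicted = [a + b * x for x in stress]

# Total variation: what we get wrong by always predicting the mean test score.
ss_total = sum((y - mean_y) ** 2 for y in score)
# Residual variation: what we still get wrong when using the regression line.
ss_residual = sum((y - p) ** 2 for y, p in zip(score, predicted))
ss_regression = ss_total - ss_residual

r_squared = ss_regression / ss_total
f_ratio = (ss_regression / 1) / (ss_residual / (n - 2))

print(f"R squared = {r_squared:.3f}, F(1, {n - 2}) = {f_ratio:.2f}")
```

SPSS additionally reports a significance level for F; computing that requires the F distribution and is left out of this sketch.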