PY1PR2 lecture 3: Simple regression

Dr David Field

Summary
Andy Field covers simple regression in the first half of chapter 7. Regression is a technique you can use when you want to predict the value of one variable from the value of another variable. This lecture covers: the relationship between correlation and regression; finding the “line of best fit”; interpreting regression results and using them to make predictions; and assessing how good the fit is.

Regression vs. Correlation
The three highlighted scatter plots all show examples of perfect positive correlation. But there is an obvious difference between them: the slope. Simple regression assesses the slope of the relationship between two variables, while correlation assesses how strong / tight the relationship is.

How regression measures the slope
Make the assumption that the relationship can be described by a straight line. If the scatter plot of two variables shows a curved relationship, you can’t use simple regression. Find the “best fitting” line; the gradient of that line is then the measure of the slope. There are an infinite number of possible straight lines you could draw on the scatter plot of two variables, and regression includes a method of finding the best one.

Example data
Today’s lecture, and the regression workshop, will make use of some real data to illustrate how simple regression is used to explore the relationship between variables. The sample is a set of rich, developed nations. The variables are: average annual income per head; income inequality; % of population suffering from any form of mental illness during the past year (WHO World Mental Health Survey); and life expectancy (workshop). Source: Wilkinson & Pickett (2009), The Spirit Level. Note that although I am using examples where the data points come from countries, the procedures of regression are exactly the same when the data points come from individual people, as is usual in Psychology.
Until recently it was hard to compare levels of mental illness between different countries because nobody had collected strictly comparable data, but the World Health Organisation has now established world mental health surveys that are starting to provide data. They show that different societies have very different levels of mental illness. In some countries only 5 or 10% of the adult population has suffered from any mental illness in the past year, but in the USA more than 25% have.

Variation between the per capita incomes of rich countries

Income Gap – the 20:20 ratio
Average annual income is much larger in some developed countries than others. Countries also differ in how wide the spread around the average income is: in some countries there is a great deal of variation around the mean (large SD), in others there is a small SD. Economists think about this in terms of inequality. How rich is rich? Subjectively, in some countries rich means double the average income; in other countries rich means 4 times the average income.
Quantifying inequality as an index: the top 20% of earners in a country are defined as “rich” and the bottom 20% as “poor”. The mean income of the top 20% is then divided by the mean income of the bottom 20%, producing an “inequality ratio” (the 20:20 ratio).
Among the rich developed countries the 20:20 ratio varies from as little as 3 or 4 to as much as 8 or 9. For example, in Japan and Sweden the income gap is fairly small: the richest 20% are less than 4 times as rich as the poorest 20%; but in Britain the richest 20% are over 7 times as rich as the poorest 20%, and in the USA they are over 8 times as rich.
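As a rough sketch of the index just described, the following computes a 20:20 ratio from a list of individual incomes. The function name and the income figures are made up for illustration; they are not from the lecture data.

```python
def ratio_20_20(incomes):
    """Mean income of the top 20% divided by the mean income of the bottom 20%."""
    ordered = sorted(incomes)
    k = max(1, len(ordered) // 5)   # number of earners in each 20% group
    bottom = ordered[:k]            # the "poor" group
    top = ordered[-k:]              # the "rich" group
    return (sum(top) / k) / (sum(bottom) / k)


# Hypothetical incomes (thousands per year), invented for this example
incomes = [10, 12, 15, 18, 20, 22, 25, 30, 40, 60]
print(ratio_20_20(incomes))
```

With ten earners, each group contains two people, so the ratio here is the mean of 40 and 60 divided by the mean of 10 and 12.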

If we assume for a minute that within a country incomes are normally distributed around the mean income, then we can use our old friend the Standard Normal Distribution to see that the top 20% of earners have z scores above about 0.84, and the bottom 20% have z scores below about -0.84. I obtained this value by looking up the z score that leaves an area under the curve of 20% in the tail. In fact, the distribution would be positively skewed because incomes can get very big, but they can’t get less than zero. To calculate the 20:20 ratio you take the mean of the incomes to the left of the left hand line, and the mean of the incomes to the right of the right hand line. You then divide the latter (bigger) number by the former (smaller) number. The resulting ratio behaves much like the SD, in that it measures how much incomes vary around the mean, but it probably places more emphasis on the extremes than the SD does.
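The cut-off z score for the top and bottom 20% can be checked directly rather than read from a table. This is a minimal sketch using Python's standard library, under the same assumption of normally distributed incomes; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

# z score with 80% of the area below it (so 20% in the upper tail)
z_top = standard_normal.inv_cdf(0.80)
# z score with 20% of the area below it (the lower tail)
z_bottom = standard_normal.inv_cdf(0.20)

print(round(z_top, 2), round(z_bottom, 2))
```

Because the distribution is symmetric, the two cut-offs are the same distance from zero.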

Income and inequality as predictor variables
Wilkinson & Pickett (2009) analysed the relationship between income, income inequality, and the prevalence of a range of health and social problems in different countries, e.g. homicide rate, imprisonment rate, teenage birth rate, social mobility. To illustrate linear regression, we will focus on the relationship between income, income inequality, and a psychological variable: % of population suffering from any form of mental illness during the past year.
Problems listed on the slide: life expectancy (reverse coded), teenage births, obesity, mental health, homicides, imprisonment (log transformed), mistrust, social mobility, education (reverse coded), infant mortality rate.

The scatter plot indicates that mental health problems are more prevalent in more unequal societies
To quantify the relationship with regression, the first step is to find the “line of best fit”

The line of best fit can be estimated by eye
This line is obviously a poor summary of the trend on the graph. The slope of the line is much too steep.

This line looks like it captures the relationship fairly well

The solid red line captures the relationship fairly well
But possibly the green dotted line does a better job. Regression uses a mathematical technique to decide which of all the possible lines is the best fitting line.

The method of least squares
7.2.2 The method of least squares can be used to decide which of these two lines provides a better model of the data.
The red line on the left and the blue line on the right both have the same slope, but the blue line has been moved up relative to the red line so that it intercepts the y axis at a higher value. Which of the two lines gives a better fit to the data? To assess this, regression begins by calculating a set of deviation scores, or residuals, for the line on the left. Some of the residuals will be negative and others positive. This is just the same as the first step in calculating the covariance, as described in lecture 2 on correlation. Logically, if the residuals are small overall, then the line is a good fit, whereas if they are large, the line is a bad fit. Because positive and negative numbers will tend to cancel each other out and sum to zero, we square the difference scores before adding them up. This ensures the sum is positive and emphasizes the effect of data points that are far from the line. Once we have the sum of squares for the red line on the left, we can calculate the sum of squares for the blue line on the right. The line with the smaller sum of squares is the better fitting line.

Calculate the difference between each data point and the line, in terms of the predicted variable (mental health problems): the observed value minus the value on the line. Positive numbers mean the model has underestimated, negative numbers mean it has overestimated. To measure the fit, square the difference scores and sum them (why square the difference scores?).

Next, calculate the sum of squared differences from the line on the right. Whichever line has the smaller total squared difference score is the better model of the data. There are always an infinite number of lines to compare, so regression uses a mathematical technique to find the one that minimizes the sum of squared differences.
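The comparison of two candidate lines, and the search for the best one, can be sketched as follows. The data points and the two candidate intercepts are invented for illustration, and the closed-form formulas in `least_squares` are the standard textbook solution for simple regression rather than anything shown on the slides.

```python
def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared vertical distances between the data and the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


def least_squares(xs, ys):
    """Intercept and slope of the line that minimizes the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Made-up data, not the lecture's country data
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [7.0, 9.0, 15.0, 17.0, 22.0]

# Two candidate lines with the same slope but different intercepts
red = sum_squared_residuals(xs, ys, b0=6.0, b1=3.8)
blue = sum_squared_residuals(xs, ys, b0=9.0, b1=3.8)

b0, b1 = least_squares(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)

print("red:", red, "blue:", blue, "best:", best)
```

In this example the lower line has the smaller sum of squares, and no line can do better than the one returned by `least_squares`.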

Describing the line of best fit
7.2.1 Once you have obtained the line of best fit it can be drawn on a scatter plot, and it can be described by two numbers. The first number, called the intercept or b0, is the value of the predicted variable (e.g. mental health) when the predictor has a value of 0, which is where the line crosses the vertical axis of the graph. The second number, called b1, is the gradient or slope of the line, and tells you how much the y axis value of the best fit line changes when you increase the value on the x axis by 1 unit. Together, they are known as the regression coefficients.
b is really the Greek letter beta, but SPSS has started using the English letter b instead.

The value of the intercept, b0, for both lines is 8. The solid red line has a positive value of b1 (gradient or slope). For the dashed line, b1 is a negative number, indicating that as the predictor increases the predicted value decreases.

These 3 lines all have the same positive slope value, b1
But each has a different intercept, b0

The line of best fit as a “model” of the data
2.2 A statistical model is a way of describing the most important aspects of a set of data that is simpler than the data itself: a straight line is simpler than the scatter plot it summarizes. Like all statistical models, the line of best fit can be used to predict values of the outcome for a specific value of the predictor:
outcome = b0 + (b1 * specific value of predictor)
See later for examples.

Best fit line for prediction of mental health by inequality
b1 = 3.7, b0 = 6.6.
b0 has the same units as the predicted variable (% in this case), while b1 is in those units per unit of the predictor variable: 3.7 percentage points per unit of the inequality ratio. In other words, moving from an inequality ratio of 3 to 4 raises the predicted rate of mental illness by 3.7 percentage points.
Note that I have re-expressed the inequality ratio relative to the inequality ratio of Japan. To do this I simply subtracted the value for Japan from all the values, i.e., I subtracted 3.4 from all of them. So Japan itself now has a value of 0. I did this because of a problem caused by the regression analysis wanting to assume that the predictor can take a value of zero. But inequality is a ratio with an absolute minimum possible value of 1 (perfectly implemented communism?!). Without my adjustment, the value of b0 was negative, which is obviously meaningless because you can’t have a negative quantity of mental illness. Don’t worry, this is a technicality, and something you will not be examined on. The problem of meaningless b0 values is common with psychological variables in regression, and generally you can just ignore b0 if it is meaningless. You have to think about b0 on a case by case basis when doing regression.

Using the model to predict values
7.4.3 Data for the % of population suffering from any form of mental illness during the past year were not available for Greece, but we do know that Greece has an income inequality ratio of 6.19, which is 3.28 units higher than the most equal country, Japan. We can use the formula for straight lines, combined with the values of b0 and b1, to predict the level of mental illness in Greece:
b0 + (b1 * 3.28)
6.6 + (3.7 * 3.28) = 18.74%
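The same calculation can be written as a small function, which makes it easy to reuse for other countries. This is only a sketch: the coefficients and the Greek predictor value are the ones quoted in the lecture, while the function and constant names are made up for illustration.

```python
B0 = 6.6  # intercept from the lecture: predicted % when re-expressed inequality is 0
B1 = 3.7  # slope from the lecture: percentage points per unit of inequality


def predict_mental_illness(inequality_above_japan):
    """Predicted % of the population with mental illness in the past year."""
    return B0 + B1 * inequality_above_japan


greece = predict_mental_illness(3.28)
print(f"{greece:.2f}%")  # 18.74%
```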

You can also check the predicted value for Greece graphically
Draw a line up from the x axis to meet the regression line at the point corresponding to the predictor value for Greece. Then draw a line across to the y axis and read off the predicted value.

Making predictions for extreme values of the predictor
Imagine we obtained a measurement of income inequality for two new countries: a capitalist country with no welfare state and zero taxation, inequality ratio 18 (much higher than any country in our sample); and a communist country, inequality ratio 1.5 (less than half that of the most equal country in our sample, Japan). We could use the equation for straight lines to predict levels of mental illness for both countries. Doing this is referred to as “extrapolation”. Can you think of any problems with doing this, or reasons for caution?
To make predictions for values of the predictor variable that lie outside the sampled range, you have to make the assumption that the relationship of the predictor and predicted remains linear beyond the range of values you have sampled. Often, you have no way of knowing that this assumption is true. In the example, it is quite possible that the artificial nature of a communist system would result in an increase rather than a decrease in mental health problems.
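One way to see the danger is to run the two hypothetical countries through the model and flag any prediction that falls outside the sampled range. This is a sketch under stated assumptions: the coefficients and the value subtracted for Japan (3.4) are taken from the lecture, but the bounds in `SAMPLED_RANGE` are an assumed, illustrative range and the function name is made up.

```python
B0, B1 = 6.6, 3.7           # coefficients quoted in the lecture
JAPAN_RATIO = 3.4           # value subtracted from every ratio in the lecture
SAMPLED_RANGE = (3.4, 9.0)  # ASSUMED bounds of the sampled ratios, for illustration


def predict(ratio):
    """Return (predicted %, True if the ratio lies outside the sampled range)."""
    value = B0 + B1 * (ratio - JAPAN_RATIO)
    extrapolated = not (SAMPLED_RANGE[0] <= ratio <= SAMPLED_RANGE[1])
    return value, extrapolated


for ratio in (18.0, 1.5):
    value, extrapolated = predict(ratio)
    label = "extrapolated" if extrapolated else "within sampled range"
    print(ratio, round(value, 2), label)
```

Under these assumptions the prediction for the ratio of 1.5 comes out below zero, an impossible amount of mental illness, which is one concrete reason for caution when extrapolating.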

How well does the model fit?
7.2.3 Regression is guaranteed to find the best fitting straight line, but it might still be a poor fit if the two variables are only weakly related. There are two ways of assessing the model. The first is R2, the proportion of the total variance in the predicted variable that the best fitting line accounts for. The second is a null hypothesis test: what is the probability of obtaining the b1 value in the sample if the true value of b1 in the population is zero? A b1 of 0 means that as the value of the predictor increases the value of the predicted stays the same (as inequality increases mental illness stays the same). Let’s compare the fit of two models that predict the proportion of the population with mental health problems.
R2 is similar to the coefficient of determination for correlation in lecture 2.

Predicting mental health from income
b1 = 0.65, b0 = -1.7 (zero income = less than zero mental health problems?!)
The b1 value of 0.65 is interpreted as “for every extra 1000 dollars of income, the percentage of people in a country suffering mental health problems increases by 0.65 percentage points”. The b0 value should probably not be interpreted theoretically, even if we think of it as 0 rather than an (impossible) negative number. Here’s why: the sample contains a minimum income of about 20,000 US dollars, and the b0 value is the predicted % of the population suffering mental health problems when income is zero. Obviously, this is extrapolating way beyond the range of the predictor variable that we have sampled, and we cannot be certain that the linear relationship with the predicted variable will hold up for that range of the predictor. It may well be that in very poor countries mental health problems are common. On the other hand, these data could be taken to suggest that in a primitive society without money mental illness would be virtually non-existent! This is obviously a speculative point of view, so we would need more evidence to support it (i.e. mental health data for countries with almost zero income per year).

Assessing the two models of mental health: R2
To assess the line of best fit, it is compared to an even simpler model of the predicted variable. If I had mental health data for the 11 countries on the scatter plot, but no data about income or inequality, and I was asked to predict the level of mental health problems in Greece, my best guess would be the mean of the mental health data. The simplest possible model of the predicted variable is its mean. Calculate the sum of the squared deviations from the mean (total sum of squares). Calculate the sum of the squared deviations from the line of best fit (residual sum of squares). If the line is a good model, the residual sum should be much smaller than the total sum.

The line of best fit versus the mean
Can you find any countries where the mean is a better model of mental health problems than the line of best fit for inequality? (The plot marks the mean of Y.)
The mean is a slightly better model for Italy. For all the others, inequality is a better model. The plot is a graphical representation of the total sum of squares and the residual sum of squares. Mathematically, you square the difference scores and then add them up to arrive at the two sums of squares you need.

The line of best fit versus the mean
Can you find any countries where the mean is a better model of mental health problems than the line of best fit for income? For how many countries is the model obviously better than the mean?
The mean is a better model of mental health than income for New Zealand, and it is also a bit better for Belgium. But there are also lots of countries where the mean is about as good a predictor as the income model. Income is only doing quite a bit better than the mean for the USA and Spain.

Calculating R2
R2 is a descriptive statistic that describes how much better the model is at explaining variation in the predicted variable than using the mean as a model. It is expressed as a proportion of the total variation (variance) in the predicted variable, and therefore has a maximum of 1 and a minimum of 0.
R2 = (total sum of squares - residual sum of squares) / total sum of squares
Note: the square root of R2 is the Pearson correlation of the two variables (ignoring its sign).
If R2 has a value of 1 then this is the same as a correlation of 1 or -1, i.e. all the points lie exactly on a straight line. It is more important to understand what R2 means than how to calculate it, but the calculation is simple. The total sum of squares is the total variation in y, i.e. very similar to the variance. The residual sum of squares is the variation left over by the model, the variation the model can’t explain (also called error, or epsilon). If the model is perfect then the residual sum of squares will be zero. If it is as bad as it could possibly be then the residual sum of squares will be the same as the total sum of squares (or very close to it). Don’t worry too much about the link between R2 and the Pearson correlation.
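The calculation can be sketched in a few lines, together with a check that R2 equals the squared Pearson correlation. The data are invented for illustration (they are not the lecture's country data), and the function names are hypothetical.

```python
import math


def r_squared(xs, ys):
    """R2 = (total sum of squares - residual sum of squares) / total sum of squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx                 # slope of the line of best fit
    b0 = mean_y - b1 * mean_x      # intercept of the line of best fit
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_residual = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return (ss_total - ss_residual) / ss_total


def pearson_r(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data for illustration only
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [7.0, 9.0, 15.0, 17.0, 22.0]

print(r_squared(xs, ys), pearson_r(xs, ys) ** 2)
```

The two printed values agree, which is the link between R2 and the correlation mentioned above.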

Which variable is a better predictor?
Not going through the calculations for R2, as they are very similar to the ones you saw for covariance last week.
Income: R2 = 0.16, correlation = 0.39
Inequality: R2 = 0.55, correlation = 0.74

Is b1 significantly different from zero?
7.2.4 If the true value of b1 in the population is zero, what is the probability that a random sample of this size would have a value of b1 as big as or bigger than the observed value? The p value is provided by a t test:
t = b1 / standard error of b1
The standard error of b1 is based on the SD of the residual (deviation) scores: if the data points are close to the line of best fit the SE will be small; if they are far away the SE will be large. Otherwise the test is identical to the t test for the difference between two sample means. If p < 0.05 we support the hypothesis that the predictor variable is useful in estimating the value of the outcome.
You can ask if b0 is significantly different from zero, but for most applications in social science we are not very interested in the intercept, so here we will only assess the statistical significance of b1. Note that the same b1 (slope) value can have different SE, t and p values. This is because the same slope can have the data points close to it, or scattered widely around it, and yet still be the line of best fit in both cases.
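To make the t statistic concrete, here is a minimal sketch that computes b1, its standard error and t for a small data set. The data are invented, the function name is hypothetical, and the standard error formula used (residual SD divided by the square root of the sum of squared deviations of the predictor) is the usual textbook one for simple regression, which the lecture describes only in words.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (b1, standard error of b1, t) for a simple regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    ss_residual = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    residual_sd = math.sqrt(ss_residual / (n - 2))  # n - 2 degrees of freedom
    se_b1 = residual_sd / math.sqrt(sxx)
    return b1, se_b1, b1 / se_b1


# Made-up data for illustration only
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [7.0, 9.0, 15.0, 17.0, 22.0]

b1, se_b1, t = slope_t_statistic(xs, ys)
print(b1, se_b1, t)
```

Scattering the same points more widely around the same line would leave b1 unchanged but increase the residual SD, and therefore shrink t.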

Are the predictors statistically significant?
Income: t(9) = 1.29, p = 0.23, NS. Inequality: t(9) = 3.3, p = 0.009.
The degrees of freedom are in brackets after the t. If this were a one sample t test or related means t test, the DOF would be 10. Here it is 9 because the straight line has 2 parameters, b0 and b1, so it “uses” 2 of the 11 DOF available. The “model” you are testing in a one sample t test has only one parameter (it’s just the sample mean, or the mean difference score in a related means t test), hence the DOF is 10 in that case, not 9. What this all means is that for the income predictor the b1 value is only 1.29 times the size of its SE, whereas for the inequality predictor b1 is 3.3 times as large as its SE. If the true value of b1 in the population is 0 (i.e. variation in the predictor between countries makes no difference at all to mental health), a b1 as large as the one observed for inequality would occur by chance in only 0.9% of samples, whereas one as large as that observed for income would occur in 23% of samples.
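The reported p values can be roughly reproduced from the t values and the 9 degrees of freedom. This sketch integrates the t distribution numerically using only the standard library; the function names are made up, and in practice a statistics package would give the p value directly.

```python
import math


def t_pdf(t, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p: 1 minus the area between -|t| and +|t| (Simpson's rule).

    steps must be even for Simpson's rule.
    """
    a, b = -abs(t), abs(t)
    h = (b - a) / steps
    total = t_pdf(a, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    return 1 - total * h / 3


p_income = two_tailed_p(1.29, df=9)      # t value reported for income
p_inequality = two_tailed_p(3.3, df=9)   # t value reported for inequality
print(p_income, p_inequality)
```

The results should land close to the reported values of 0.23 and 0.009; small differences are expected because the t values quoted in the lecture are rounded.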

The simple linear regression model
As we saw earlier, simple regression is simply a model of the data as a straight line:
Outcome = b0 + (b1 * specific value of predictor)
But that equation is not quite complete, because the regression model needs to reflect the fact that the data points rarely all lie exactly on the line. Therefore, you will usually see the regression equation written as some symbolic variant of:
Outcome = b0 + (b1 * specific value of predictor) + “residual error”
Residual error refers to the deviation score from the model (the vertical lines drawn on the scatter plots earlier).

Things to bear in mind
Often, you can’t be sure that the predicted variable is being caused by the predictor. It might be the other way round (and you can swap the predictor and predicted around and run the analysis again). In some cases, e.g. inequality and mental health, it does not make sense to run the regression the other way around: nobody would claim that mental health problems cause inequality. But you might make an argument that a 3rd variable causes the variation in both observed variables. The predicted variable should be a continuous variable measured on an interval or ratio scale. If a scatter plot suggests a non-linear relationship, you can’t use simple regression.

If you’d like to evaluate the effects of inequality and other variables yourself, more data is available in the book or on the website, e.g. more unequal US states have higher levels of health and social problems.

Statistical terms for revision
model; method of least squares; residual; regression coefficients; intercept, b0; slope or gradient, b1; best fitting line; extrapolation; R2
