Simple Linear Regression 1
2 I want to start this section with a story. Imagine we take everyone in the class and line them up from shortest to tallest. As you look to the front of the class from your seat, the shortest will be on the left and the tallest will be on the right. In fact, in a face-to-face class we will line you up. Compare yourself to other people: if you are taller than someone, move to the right; if shorter, move to the left. Now, imagine we have everyone lined up in order from shortest to tallest. If you are back in your seat and you look at the line-up (you have to use your imagination because you cannot be both in the line-up and in your seat), I bet the line-up looks like the following (when thinking about the height of the people): [Figure: the class line-up by height, with 5’6” and 6’1” marked on the height axis]
3 On the previous screen you see most people are between 5’6” and 6’1”. There are some who are shorter and some who are taller. This is not rocket science, right? From the line-up we could calculate the average height for the group. Now, instead of looking at the height of people, let’s look at the size of their feet. With everyone in the same order as for height, I would venture to say that the size of the feet gets larger as we go from left to right in the room. Imagine you are walking across the room looking down at people’s feet. I say the feet probably look like the following (I only show three, but I want you to fill in the rest): [Figure: three feet, increasing in size from left to right]
Overview 4 Imagine you are the first person to get into the room each day. Say you have a class roster so you know the names of all the other people in the class. Also say on each successive day you will try to guess the height of the person who comes into the room first after you. At this point in the story you have to guess without any clue about who will come into the room. I tell you that the best guess you could make each day is to just guess the average height. While you would likely be wrong each day, at least the days of guessing too low and the days of guessing too high would even out. Other methods of guessing the height might have you always guess too high a value or too low a value.
Overview 5 Now, let’s change the story somewhat. Say before the person enters the room and before you have to guess the height you can see the person’s feet. Would knowing the size of their feet help you guess the height of the person? Since there is a pattern that people with larger feet tend to be taller, you could guess that the height is above average if the foot size is above average and that the height is below average if the foot size is below average. While you would probably still not guess the height exactly, you would improve on just guessing the average height. So, since foot size and height are related, knowing foot size can help us predict height.
Overview 6 Note in this example that I am not saying that foot size is the cause of height, just that foot size and height are related. Regression analysis is a method to assist us in seeing if variables are related. In this context, when we say related we often use the phrase that variables are correlated. Also note that correlation is not causation. Foot size does not cause height. In fact, foot size and height are both influenced by other variables such as nutrition and family genes. In business we often seek out relationships between variables to assist us in making sense of the world. The aim is to come up with stories similar to the foot size/height story.
Overview 7 Consider an example about a group of college graduates. The graduates do not all have the same starting salary, so an investigation might occur as to why not. In the investigation one might think about other variables that might influence starting salary. Starting salaries could be influenced by, among other things, the GPA of the student, the number of student groups the student was in, or even the work experience of the graduate. The GPA variable might be important because graduates with a larger GPA may tend to have a larger starting salary.
Overview 8 In the example so far, starting salary is called the response variable because the values for starting salary are thought to respond to the values of the other variables. The response variable is often called the y variable and in a graph is put on the vertical axis. GPA, number of student groups and work experience are all examples of explanatory variables. When we use just one explanatory variable with the response variable we have a situation where we can conduct SIMPLE LINEAR REGRESSION. The explanatory variable would be called the x variable and put on the horizontal axis. When two or more explanatory variables are used we could do MULTIPLE REGRESSION. For now we stick with simple linear regression.
Using a Sample to Estimate the Model 9 On the next slide I show some data and a scatterplot for the example we have been developing. Note that a sample has been taken from 7 graduates, and in the data the GPA and starting salary of each graduate are in a row of the table. Each point in the scatterplot is a (GPA, starting salary) pair for a graduate. With a sample of data we can estimate the regression line as ŷ = b0 + b1x, where b0 is the y intercept of the line and is the value of ŷ when x is 0. The slope b1 is a number that represents the expected change in ŷ when x increases by 1 unit. By the way, ŷ is called y hat, and the hat tells us the value comes from the estimated regression line rather than from the data.
10 [Table and scatterplot: Graduate, GPA, and Start Salary (in thousands of dollars) for the 7 sampled graduates]
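Least Squares Method 11 [Figure: scatterplot of Y against X with three upward-sloping candidate lines, Line 1 above the points, Line 2 among the points, and Line 3 below the points]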
Least Squares 12 On the previous slide I show a more generic scatterplot and I put three lines in the graph. All three lines are decent in the sense that with the upward slope they all show the same basic idea as the dots in the graph: as x rises, y rises (meaning x and y are positively related). In theory we could find the equation for each line by algebra, or something like that. Then for each line we would have a b0 and b1 value. Now line 1 is bad because it is too high. What I mean here is that if we used the line to predict y we would always predict too high a number. Similarly, with line 3 we would be too low all the time.
Least Squares 13 Line 2 is “among” the data points, and when you make predictions with the line sometimes you will be too high and sometimes too low. But no straight line can be exactly perfect (unless all the points are truly on a straight line, which will likely not happen in business and social research). Line 2 is my interpretation of the line that would be picked by what is called the least squares method. Call the value on the line at a given x ŷ. The least squares line is placed in such a way that the sum of the squared vertical differences between each dot and the line is minimized. Since each dot has a y, the least squares method picks a b0 and b1 such that the differences y minus ŷ, when squared and then summed across all data points, are as small as possible: Σ(y – ŷ)² is minimized.
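As a rough illustration of the idea on this slide, here is a minimal Python sketch of the standard closed-form least squares calculation for one x variable. The GPA and salary numbers are hypothetical, made up for the sketch (they are not the class data set), and the function names `fit_least_squares` and `sum_squared_errors` are my own. The sketch also shifts the fitted line up and down, in the spirit of lines 1 and 3, to show that the sum of squared differences gets larger.

```python
def fit_least_squares(xs, ys):
    """Return (b0, b1) for the line that minimizes the sum of squared (y - y_hat)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation in x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: forces the line through the point (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_errors(xs, ys, b0, b1):
    """Sum of squared vertical distances from each point to the line b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# HYPOTHETICAL data (salary in thousands of dollars), not the class data set.
gpa = [2.21, 2.60, 2.85, 3.10, 3.30, 3.55, 3.82]
salary = [27.6, 29.8, 31.2, 32.3, 33.9, 35.1, 36.4]

b0, b1 = fit_least_squares(gpa, salary)
best = sum_squared_errors(gpa, salary, b0, b1)
too_high = sum_squared_errors(gpa, salary, b0 + 2.0, b1)  # like "line 1"
too_low = sum_squared_errors(gpa, salary, b0 - 2.0, b1)   # like "line 3"

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"SSE least squares line: {best:.3f}")
print(f"SSE line shifted up:    {too_high:.3f}")
print(f"SSE line shifted down:  {too_low:.3f}")
```

Any other choice of b0 and b1 gives a sum of squared differences at least as large as the least squares line does on the same data.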
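14 [Figure: Excel regression output for the starting salary data, showing the estimated coefficients b0 (Intercept) and b1 (GPA)]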
Least Squares 15 For now we will assume Microsoft Excel or some other program can show us the estimated regression line using least squares. We just want to use what we get. On the previous slide I show the Excel output. Note in cell B25 you see the word Coefficients. In cells A26:A27 you see the words Intercept and GPA, and the estimated values of b0 and b1 are in cells B26:B27. This means ŷ = b0 + b1Xi has been estimated to be Starting salary = b0 + b1(GPA), with b0 and b1 replaced by the numbers in those cells. Note the data had starting salary measured in thousands. This means, for example, the data had 29.8 but the real value is 29,800.
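Prediction with least squares 16 Remember our estimated line is Starting Salary = b0 + b1(GPA), using the coefficients from the Excel output. Say we want to predict salary if the GPA is 2.7. Starting Salary = b0 + b1(2.7), and since the data are in thousands, this predicted starting salary is $30,223.82.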
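The mechanics of a prediction can be sketched in a few lines of Python. The coefficients below are HYPOTHETICAL placeholders chosen only to show the arithmetic; they are not the values from the Excel output, so the result will not match the $30,223.82 above. The function name `predict_salary_thousands` is also my own.

```python
def predict_salary_thousands(b0, b1, gpa):
    """Plug a GPA into the estimated line y_hat = b0 + b1 * gpa (salary in thousands)."""
    return b0 + b1 * gpa


# HYPOTHETICAL coefficients for illustration only (not the Excel estimates).
B0_HYPOTHETICAL = 15.0
B1_HYPOTHETICAL = 5.5

y_hat = predict_salary_thousands(B0_HYPOTHETICAL, B1_HYPOTHETICAL, 2.7)
dollars = y_hat * 1000  # the data are in thousands, so convert back to dollars

print(f"Predicted starting salary: {y_hat:.2f} thousand, or ${dollars:,.2f}")
```

The last step matters: because salary was recorded in thousands, the number that comes out of the line has to be multiplied by 1,000 before reading it as dollars.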
Interpolation and Extrapolation 17 You will notice in our example data set that the smallest value for x was 2.21 and the largest value was 3.82. When we want to predict a value of y, ŷ, for a given x, if the x is within the range of the data values for x (2.21 to 3.82 in our example) then we are interpolating. But if an x is outside our range for x we are extrapolating. Extrapolation should be used with a great deal of caution. Maybe the relationship between x and y is different outside the range of our data. If so, and we use the estimated line, we may be way off in our predictions. Note the intercept has to be interpreted with similar caution: unless the range of x values in our data includes zero, the relationship between x and y could be very different in the x = 0 neighborhood than the one suggested by least squares.
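A small Python sketch of the check described on this slide, using the x range 2.21 to 3.82 from our example. The function name `prediction_type` is my own; the idea is simply to flag a prediction as extrapolation before trusting it.

```python
# Range of GPA values in the example data set (from the slide).
X_MIN, X_MAX = 2.21, 3.82


def prediction_type(x, x_min=X_MIN, x_max=X_MAX):
    """Say whether predicting at x stays inside the observed x range."""
    if x_min <= x <= x_max:
        return "interpolation"
    return "extrapolation"


for gpa in (2.7, 4.0, 0.0):
    print(f"GPA {gpa}: {prediction_type(gpa)}")
```

Note that a GPA of 0 is flagged as extrapolation, which is the reason the intercept needs a cautious interpretation here.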
Variation 18 Remember to calculate the standard deviation of a variable we take each value, subtract off the mean, square the result, and add up the squares. (We also then divide by something and take a square root, but that is not important in this discussion.) In a regression setting on the response variable Y we define the total sum of squares SST as Σ(Yi – Ybar)². SST can be rewritten as SST = Σ(Yi – Ŷi + Ŷi – Ybar)² = Σ(Ŷi – Ybar)² + Σ(Yi – Ŷi)² = SSR + SSE. Note: you may recall from algebra that (a + b)² = a² + 2ab + b². In our story here the 2ab terms add up to 0 when summed over all the data points. While this is not true in general in algebra, it is in this context of least squares regression. If this note makes no sense to you do not worry, just use SST = SSR + SSE.
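For those who want to see the algebra behind the note, here is a sketch of the expansion with a = Yi – Ŷi and b = Ŷi – Ybar, written in LaTeX notation (Ybar appears as \bar{Y}). It assumes the line was fit by least squares with an intercept, which is what makes the middle term vanish.

```latex
\begin{aligned}
SST = \sum_i (Y_i - \bar{Y})^2
  &= \sum_i \bigl[(Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})\bigr]^2 \\
  &= \sum_i (Y_i - \hat{Y}_i)^2
     + 2\sum_i (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y})
     + \sum_i (\hat{Y}_i - \bar{Y})^2 \\
  &= SSE + 0 + SSR
\end{aligned}
```

The middle sum is zero because the least squares residuals Yi – Ŷi add up to zero and are uncorrelated with X, while Ŷi – Ybar is just b1 times (Xi – Xbar).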
Variation 19 So we have SST = Σ(Yi – Ybar)², SSR = Σ(Ŷi – Ybar)² and SSE = Σ(Yi – Ŷi)². On the next slide I have a graph of the data with the regression line put in and a line showing the mean of Y. For each point we could look at how far the point is from the mean line. This is what SST is looking at. SSR indicates that, of all the difference between the point and the mean, the regression line is able to account for some of that variation. The rest of the difference is SSE.
Variation 20 [Figure: scatterplot of Y against X with the least squares regression line (Ŷi) and a horizontal line at Ybar, marking two examples of what goes into SSR and two examples of what goes into SSE]
The Coefficient of Determination 21 The coefficient of determination, often denoted r², measures the proportion of the variation in Y that is explained by the explanatory variable X in the regression model. r² = SSR/SST. In our example from above we have r² = SSR/SST = 0.98 rounded to 2 decimals. This means that 98% of the variation in starting salary is explained by the variability in the GPA of students. That leaves only 2% of the variability in starting salary due to other factors.
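Here is a minimal Python sketch of how SST, SSR, SSE and r² could be computed once a line has been fit. The data are HYPOTHETICAL (made up for the sketch, not the class data set), so the r² it prints is not the 0.98 from our example. The function names are my own.

```python
def fit_least_squares(xs, ys):
    """Closed-form least squares estimates (b0, b1) for one x variable."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def variation_summary(xs, ys):
    """Return (SST, SSR, SSE, r_squared) for the least squares line."""
    b0, b1 = fit_least_squares(xs, ys)
    y_bar = sum(ys) / len(ys)
    fitted = [b0 + b1 * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)               # total variation
    ssr = sum((f - y_bar) ** 2 for f in fitted)           # explained by the line
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # left unexplained
    return sst, ssr, sse, ssr / sst


# HYPOTHETICAL data (salary in thousands of dollars), not the class data set.
gpa = [2.21, 2.60, 2.85, 3.10, 3.30, 3.55, 3.82]
salary = [27.6, 29.8, 31.2, 32.3, 33.9, 35.1, 36.4]

sst, ssr, sse, r2 = variation_summary(gpa, salary)
print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
print(f"SSR + SSE = {ssr + sse:.3f}")
print(f"r squared = {r2:.4f}")
```

Printing SSR + SSE next to SST is a quick way to confirm the decomposition holds for the fitted line.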
Coefficient of Determination 22 Say we didn’t have an X variable to help us predict the Y variable. Then a reasonable way to predict Y would be to just use its average or mean value. But, with a regression, by using an X variable it is thought we can do better than just using the mean of Y as a predictor. In a simple linear regression r² is an indicator of the strength of the relationship between two variables because using the regression model, instead of just using the mean of Y, reduces the variability in predicting Y by the percentage obtained. In different areas of study (like marketing, management, and so on) the idea of what a good r² is varies. But as a rule of thumb, an r² of 0.8 or above is usually taken to mean a strong relationship.
Correlation 23 Remember the correlation coefficient r was used to understand the direction and strength of the relationship between two variables. The coefficient r when squared is the r² in regression, and r has the same sign as the slope b1. Regression and correlation are related in this way in simple linear regression.
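To see the link between correlation and regression numerically, here is a Python sketch that computes r directly from its usual formula and, separately, r² as SSR/SST from a fitted line. The data are HYPOTHETICAL and the function names are my own; the point is only that the two numbers agree.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient r between x and y."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def regression_r_squared(xs, ys):
    """r squared computed as SSR / SST from the least squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    ssr = sum((b0 + b1 * x - y_bar) ** 2 for x in xs)
    sst = sum((y - y_bar) ** 2 for y in ys)
    return ssr / sst


# HYPOTHETICAL data (salary in thousands of dollars), not the class data set.
gpa = [2.21, 2.60, 2.85, 3.10, 3.30, 3.55, 3.82]
salary = [27.6, 29.8, 31.2, 32.3, 33.9, 35.1, 36.4]

r = pearson_r(gpa, salary)
r2 = regression_r_squared(gpa, salary)
print(f"r = {r:.4f}, r * r = {r * r:.4f}, SSR/SST = {r2:.4f}")
```

Keep in mind r² alone loses the direction: a downward-sloping relationship has a negative r but the same kind of positive r².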
Residuals 24 A residual = observed value minus the predicted value = y – ŷ. Back on slide 14 I had the data set and we see, for example, the individual with GPA = 2.6 and starting salary = 29.8. So y = 29.8. In the equation ŷ = b0 + b1(GPA) with GPA = 2.6 we get a y hat = b0 + b1(2.6) of about 29.65, and the residual would be 29.8 – 29.65 = 0.15. Individual points with large residuals are points the line fits poorly; they may be outliers and deserve a closer look, since they can be influential data points.
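A short Python sketch of computing residuals for every point and picking out the largest one. The data set is HYPOTHETICAL (only the single pair 29.8 observed and about 29.65 predicted comes from the slide), and the function names are my own.

```python
def fit_least_squares(xs, ys):
    """Closed-form least squares estimates (b0, b1) for one x variable."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def residuals(xs, ys, b0, b1):
    """Observed y minus predicted y_hat for each data point."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# The single residual from the slide: observed 29.8, predicted about 29.65.
slide_residual = 29.8 - 29.65
print(f"Residual from the slide: {slide_residual:.2f}")

# HYPOTHETICAL data (salary in thousands of dollars), not the class data set.
gpa = [2.21, 2.60, 2.85, 3.10, 3.30, 3.55, 3.82]
salary = [27.6, 29.8, 31.2, 32.3, 33.9, 35.1, 36.4]

b0, b1 = fit_least_squares(gpa, salary)
res = residuals(gpa, salary, b0, b1)

# Index of the point the line fits worst (largest residual in absolute value).
worst = max(range(len(res)), key=lambda i: abs(res[i]))
print(f"Sum of residuals: {sum(res):.6f}")
print(f"Largest residual: GPA {gpa[worst]}, residual {res[worst]:.3f}")
```

For a least squares line with an intercept the residuals add up to (essentially) zero, which matches the earlier idea that line 2 is sometimes too high and sometimes too low.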