1
The Basics of Regression continued
2
Overview
The text uses one example to guide you, but I will use a different one. Remember that we use statistics to try to understand more about a variable in the world. Let’s focus on income as the variable of interest. Obviously, not everyone has the same income, and we might want to understand why. We might also want to predict someone’s income. The easiest way to predict a person’s income is to take the average income of the group. Regression techniques are an attempt to improve on that prediction: the idea is that by including another variable in the study we can predict better than by just picking the average.
3
graph
[Figure: scatterplot with income on the y axis and years of schooling on the x axis.]
In a graph we put the variable of interest on the y axis. Here it is thought that knowing a person’s years of schooling will help us better understand income, so schooling is put on the x axis. In other words, each value of income is ‘matched’ with an amount of schooling. Note in the scatterplot that the higher the schooling, the higher the income. Thus, knowing schooling will permit better prediction of income.
4
graph
[Figure: scatterplot of income (y axis) against years of schooling (x axis), with the regression line drawn through the points.]
A point of interest in regression is to come up with the mathematical formula for a line that best describes the data points. The line would then be used to make predictions about the y values given x values.
5
Math form
It is thought that in the population the variables x and y are related in the following general form: y = B0 + B1 x + e, where e is an error term that captures all those influences on y not picked up by x, B0 is the y intercept of the line, and B1 is the slope of the line. When we have a sample of data from a population, we say in general that the regression line is estimated to be ŷ = b0 + b1 x, where the ‘hat’ marks an estimated value and b0 and b1 are the sample estimates of B0 and B1.
6
ordinary least squares
The typical method used to pick the line through the data is called ordinary least squares (OLS). This method picks the line that minimizes the sum of squared vertical deviations of the data points from the line. The line has desirable properties (not proven here): 1) It is unbiased - if many samples were taken, the average of the intercepts and slopes from the samples would be the population intercept and slope. 2) It is consistent - as the sample gets ‘large’, the estimated intercept and slope get closer and closer to the population intercept and slope.
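As a rough illustration of the least squares idea, the sketch below computes the intercept and slope from the standard closed-form formulas. The function name `ols_fit` and the five schooling/income data points are made up for illustration and do not come from the slides.

```python
def ols_fit(x, y):
    """Return (b0, b1), the least squares intercept and slope for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squared deviations of x, and sum of cross-products of x and y
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx           # slope
    b0 = y_bar - b1 * x_bar  # intercept: the line passes through (x_bar, y_bar)
    return b0, b1


# Hypothetical data: years of schooling and income (units are arbitrary)
schooling = [10, 12, 14, 16, 18]
income = [16, 19, 23, 26, 31]

b0, b1 = ols_fit(schooling, income)
print(f"estimated line: income = {b0:.2f} + {b1:.2f} * schooling")
```

Any other line drawn through these points would have a larger sum of squared deviations than the one returned here.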
7
confidence interval
When we have a sample from a population and we use OLS to get the slope, we know that value is dependent on the sample. If we had a different sample, the slope estimate would be different. So instead of using a point estimate of the population slope we often use a confidence interval. A 95% confidence interval means we can be 95% confident that the true unknown slope is in the interval. We form the interval from b1 - (1.96 times the standard error of b1) to b1 + (1.96 times the standard error of b1). The value 1.96 comes from the normal distribution and is appropriate for large samples.
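The interval can be sketched in a few lines. The example below assumes the usual simple-regression formula for the standard error of the slope (residual variance with n - 2 degrees of freedom, divided by the sum of squared deviations of x). The function name and the data are hypothetical. With only five points the 1.96 multiplier is too small, so treat the numbers as an illustration of the arithmetic only.

```python
import math


def slope_confidence_interval(x, y, multiplier=1.96):
    """Return (b1, se_b1, lower, upper) for the least squares slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Sum of squared residuals around the fitted line
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # Standard error of the slope (assumed textbook formula)
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    return b1, se_b1, b1 - multiplier * se_b1, b1 + multiplier * se_b1


# Hypothetical data: years of schooling and income
schooling = [10, 12, 14, 16, 18]
income = [16, 19, 23, 26, 31]

b1, se_b1, lower, upper = slope_confidence_interval(schooling, income)
print(f"slope {b1:.3f}, standard error {se_b1:.4f}")
print(f"95% interval: {lower:.3f} to {upper:.3f}")
```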
8
Microsoft Excel
Many computer programs, including Microsoft Excel, will give the standard error of the slope estimate and/or the confidence interval as well. Note that if the confidence interval includes the value 0, we cannot rule out that the true slope is zero. This is a sign the x variable may not really help us understand the y variable.
9
hypothesis test about the slope
In regression analysis, if the slope is zero then the x variable doesn’t help us understand the y variable. So in our test of hypothesis the null hypothesis is that the population slope is zero. If we reject this null hypothesis, we can conclude that the x variable does help us understand the y variable. The test statistic is the slope estimate divided by its standard error. It does not have a normal distribution but a t distribution, which is close to the normal in large samples. We use the t distribution to test the null hypothesis.
10
p value
Microsoft Excel gives a p-value for the slope estimate. This value is the probability of getting this slope estimate, or a more extreme one, under the assumption that the true population slope is zero. The logic here is: 1) slope estimates far from 0 are unlikely when the true slope is zero, so they have low p-values; 2) we conventionally (and somewhat arbitrarily) choose .05 as a cut-off value. If the p-value for the slope estimate is less than .05, we reject the null hypothesis of no influence, because it is hard to believe that our one sample produced such an unlikely result if the true slope really were zero.
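To make the p-value logic concrete, here is a minimal sketch that turns a slope estimate and its standard error into a two-sided p-value. It uses the normal distribution as a large-sample stand-in for the t distribution, so it will not exactly match a t-based p-value in small samples. The function name and the input numbers are hypothetical.

```python
from statistics import NormalDist


def slope_p_value(b1, se_b1):
    """Two-sided p-value for the null hypothesis that the true slope is zero.

    Assumption: the normal distribution approximates the t distribution,
    which is reasonable only when the sample is large.
    """
    test_statistic = b1 / se_b1
    # Probability of a statistic at least this far from zero in either direction
    return 2 * (1 - NormalDist().cdf(abs(test_statistic)))


# Hypothetical slope estimate and standard error
p = slope_p_value(0.5, 0.4)
print(f"p-value: {p:.4f}")
print("reject the null at .05" if p < 0.05 else "do not reject the null at .05")
```

A slope estimate exactly 1.96 standard errors from zero gives a p-value of about .05 under this approximation, which is the same cut-off used for the 95% confidence interval.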
11
Goodness of fit R²
When we look at the data points and their relationship to the line, we talk about how good the fit of the line is. R² is a numerical summary of the goodness of fit. Its value ranges from 0 to 1, with values closer to 1 indicating a better fit. In fact, R² can be interpreted as the percentage of the variation in the y variable that is explained by the x variable.
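One common way to compute this summary is one minus the ratio of the squared errors around the line to the squared errors around the average of y. The sketch below follows that definition. The function name and the data are hypothetical.

```python
def r_squared(x, y):
    """Share of the variation in y explained by the least squares line on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Squared prediction errors when using the regression line
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # Squared prediction errors when using only the average of y
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


# Hypothetical data: years of schooling and income
schooling = [10, 12, 14, 16, 18]
income = [16, 19, 23, 26, 31]

r2 = r_squared(schooling, income)
print(f"R squared: {r2:.4f}")
```

The denominator is the error made by predicting everyone's income with the group average, so this value measures how much the line improves on that baseline.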
12
Forecasting
Once we have the regression estimate and have done a hypothesis test to be confident that x helps explain y, we may want to forecast values of y for given x values. Say we have income = -1.75 + 1.75 (years of schooling). Then if years of schooling is 16, income is predicted to be -1.75 + 1.75(16) = 26.25. Note here that the intercept of the line is minus 1.75 and the slope is 1.75. This is only a coincidence.
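The forecast is just the estimated line evaluated at a chosen x. The sketch below uses the intercept of -1.75 from the slide and assumes a slope of 1.75, which is the value implied by the prediction of 26.25 at 16 years of schooling. The function name `predict_income` is hypothetical.

```python
def predict_income(years_of_schooling, intercept=-1.75, slope=1.75):
    """Predicted income from the estimated line b0 + b1 * x.

    The slope default of 1.75 is an assumption, inferred from the slide's
    intercept (-1.75) and its prediction (26.25 at 16 years of schooling).
    """
    return intercept + slope * years_of_schooling


for years in (12, 16, 18):
    print(years, "years of schooling ->", predict_income(years))
```

Forecasts are most trustworthy for x values inside the range of schooling observed in the sample.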