
1 Regression

2 Idea behind Regression: We have a scatter of points, and we want to find the line that best fits that scatter.

3 For example, we might want to know the relationship between Exam score and hours studied, or Wheat yield and fertilizer usage, or Job performance and job training, or Sales revenue and advertising expenditure.

4 Imagine that there is a true relationship behind the variables in which we are interested. That relationship is known perhaps to some supreme being. However, we are mere mortals, and the best we can do is to estimate that relationship based on a sample of observations.

5 The subscript i indicates which observation or which point we are considering. X_i is the value of the independent variable for observation i. Y_i is the value of the dependent variable. α is the true intercept. β is the true slope. ε_i is the random error. Perhaps the supreme being feels that the world would be too boring if a particular number of hours studied was always associated with the same exam score, a particular amount of job training always led to the same job performance, etc. So the supreme being tosses in a random error. Then the equation of the true relationship is: Y_i = α + β X_i + ε_i

6 Again the equation of the true relationship is: Y_i = α + β X_i + ε_i. Our estimated equation is: Y_i = a + b X_i + e_i. a is our estimated intercept. b is our estimated slope. e_i is the estimation error.

7 Let’s look at our regression line and one particular observation. On the graph, X_i is the observed value of the independent variable, Y_i is the observed value of the dependent variable, and Ŷ_i is the predicted value of the dependent variable given by the estimated equation of the line, Ŷ_i = a + b X_i. The estimation error, e_i, is the gap between the observed value and the predicted value of the dependent variable: e_i = Y_i − Ŷ_i.

8 Fitting a scatter of points with a line by eye is too subjective. We need a more rigorous method. We will consider three possible criteria.

9 Criterion 1: minimize the sum of the vertical errors, Σ e_i. Problem: The best fit by this criterion may not be very good. For points below the estimated regression line, we have a negative error e_i. Positive and negative errors cancel each other out. So the points could be far from the line, but we may still have a small sum of vertical errors.

10 Criterion 2: minimize the sum of the absolute values of the vertical errors, Σ |e_i|. This avoids our previous problem of positive and negative errors canceling each other out. However, the absolute value function is not differentiable, so using calculus to minimize will not work.

11 Criterion 3: minimize the sum of the squares of the vertical errors, Σ e_i². This also avoids the problem of positive and negative errors canceling each other out. In addition, the square function is differentiable, so using calculus to minimize will work.

12 Minimizing the sum of the squared errors is the criterion that we will be using. The technique is called least squares or ordinary least squares (OLS).

13 Using calculus, it can be shown that the values of a and b that give the line with the best fit can be calculated as:
b = Σ(X_i − X̄)(Y_i − Ȳ) / Σ(X_i − X̄)² and a = Ȳ − b X̄.

14 Sometimes we omit the subscripts, since they are understood, and it’s less cumbersome without them. Then the equations are:
b = Σ(X − X̄)(Y − Ȳ) / Σ(X − X̄)² and a = Ȳ − b X̄.

15 Another equivalent formula for b that is sometimes used is:
b = (Σ XY − n X̄ Ȳ) / (Σ X² − n X̄²).
You may use either formula for b in this class.

16 Example: Determine the least squares regression line for Y = wheat yield and X = fertilizer, using the following data.

  X     Y
 100    40
 200    50
 300    50
 400    70
 500    65
 600    65
 700    80

17 We need the sums of the X’s, the Y’s, the XY’s, and the X²’s.

   X      Y       XY         X²
  100     40     4,000      10,000
  200     50    10,000      40,000
  300     50    15,000      90,000
  400     70    28,000     160,000
  500     65    32,500     250,000
  600     65    39,000     360,000
  700     80    56,000     490,000
 2,800   420   184,500   1,400,000

18 We also need the means of X and of Y. From the column totals, X̄ = Σ X / n = 2,800 / 7 = 400 and Ȳ = Σ Y / n = 420 / 7 = 60.

19 Next, we calculate the estimated slope b:
b = (Σ XY − n X̄ Ȳ) / (Σ X² − n X̄²) = (184,500 − 7(400)(60)) / (1,400,000 − 7(400)²) = 16,500 / 280,000 ≈ 0.059.

20 Then we calculate the estimated intercept a:
a = Ȳ − b X̄ = 60 − (0.059)(400) = 60 − 23.6 = 36.4.

21 So our estimated regression line is Ŷ = 36.4 + 0.059 X.
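The hand calculation above is easy to check by machine. Here is a minimal Python sketch (standard library only; variable names are mine) that recomputes a and b for the wheat data using both slope formulas from slides 13 and 15:

```python
# Wheat example: X = fertilizer, Y = yield.
X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)
mx = sum(X) / n   # X-bar = 400
my = sum(Y) / n   # Y-bar = 60

# Deviation-form slope: b = sum((X - X-bar)(Y - Y-bar)) / sum((X - X-bar)^2)
b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
a = my - b * mx   # intercept: a = Y-bar - b * X-bar

# Equivalent sums-form slope (slide 15).
b_alt = (sum(x * y for x, y in zip(X, Y)) - n * mx * my) / \
        (sum(x * x for x in X) - n * mx * mx)
```

Both formulas give b ≈ 0.059 and a ≈ 36.4, matching the hand calculation.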

22 Given certain assumptions, the OLS estimators can be shown to have certain desirable properties. The assumptions are: the Y values are independent of each other; the conditional distributions of Y given X are normal; and the conditional standard deviations of Y given X are equal for all values of X.

23 Gauss-Markov Theorem: If the previous assumptions hold, then the OLS estimators are best linear unbiased estimators (BLUE). Linear means that the estimators are linear functions of the observed Y values. (There are no Y²s or square roots of Y, etc.) Unbiased means that the expected values of the estimators are equal to the parameters you are trying to estimate. Best means that the estimator has the lowest variance of any linear unbiased estimator of the parameter.

24 Let’s look at our wheat example using our graph. Consider the fertilizer amount X_i = 700. The predicted value of Y corresponding to X = 700 is Ŷ = 36.4 + 0.059(700) = 77.7. The observed value of Y corresponding to X = 700 is Y_i = 80. The average of all Y values is Ȳ = 60.

25 The difference between the predicted value of Y and the average value is called the explained deviation, Ŷ_i − Ȳ. The difference between the observed value of Y and the predicted value is the unexplained deviation, Y_i − Ŷ_i. The difference between the observed value of Y and the average value is the total deviation, Y_i − Ȳ.

26 If we sum the squares of those deviations, we get
SSR = Σ(Ŷ_i − Ȳ)², the regression (explained) sum of squares,
SSE = Σ(Y_i − Ŷ_i)², the error (unexplained) sum of squares, and
SST = Σ(Y_i − Ȳ)², the total sum of squares,
and it can be shown that SST = SSR + SSE.

27 The Sums of Squares are often reported in a Regression ANOVA Table.

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square
Regression            SSR              1                    MSR = SSR / 1
Error                 SSE              n − 2                MSE = SSE / (n − 2)
Total                 SST              n − 1                MST = SST / (n − 1)

28 There are two measures of how well our regression line fits our data. The first measure is the standard error of the estimate or the standard error of the regression, s_e or SER. The s_e or SER tells you the typical error of fit, or how far the observed value of Y is from the predicted value of Y. The second measure of “goodness of fit” is the coefficient of determination or R². The R² tells you the proportion of the total variation in the dependent variable that is explained by the regression on the independent variable (or variables).

29 The standard error of the estimate or standard error of the regression is
s_e = SER = √( Σ e_i² / (n − 2) ) = √( SSE / (n − 2) ).
There is a 2 in the denominator because we estimated 2 parameters, the intercept a and the slope b. Later, we’ll have more parameters and this will change.

30 The coefficient of determination is R² = SSR / SST = 1 − SSE / SST.

31 If the line fits the scatter of points perfectly, the points are all on the regression line and R 2 = 1. If the line doesn’t fit at all and the scatter is just a jumble of points, then R 2 = 0.

32 Let’s return to our data and calculate s_e or SER and R², using the sums we computed earlier: Σ X = 2,800, Σ Y = 420, Σ XY = 184,500, and Σ X² = 1,400,000.

33 First, let’s add a column for Y².

   X      Y       XY         X²        Y²
  100     40     4,000      10,000    1,600
  200     50    10,000      40,000    2,500
  300     50    15,000      90,000    2,500
  400     70    28,000     160,000    4,900
  500     65    32,500     250,000    4,225
  600     65    39,000     360,000    4,225
  700     80    56,000     490,000    6,400
 2,800   420   184,500   1,400,000   26,350

34 Remember that a = 36.4 and b = 0.059. Then
SSE = Σ Y² − a Σ Y − b Σ XY = 26,350 − (36.4)(420) − (0.059)(184,500) = 176.5, so
s_e = SER = √( SSE / (n − 2) ) = √( 176.5 / 5 ) ≈ 5.94.

35 Again, a = 36.4 and b = 0.059. Then
SST = Σ Y² − n Ȳ² = 26,350 − 7(60)² = 1,150 and SSR = b (Σ XY − n X̄ Ȳ) = (0.059)(16,500) = 973.5, so
R² = SSR / SST = 973.5 / 1,150 ≈ 0.846.
So about 85% of the variation in wheat yield is explained by the regression on fertilizer.
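As a numerical check, this Python sketch recomputes SSE, s_e, and R² from the raw data. (Exact arithmetic gives s_e ≈ 5.96 and R² ≈ 0.845; the slide values 5.94 and 0.846 differ slightly because a and b were rounded before being plugged in.)

```python
import math

X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

# Fit by least squares (unrounded a and b).
b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
a = my - b * mx

pred = [a + b * x for x in X]                      # predicted values Y-hat
sse = sum((y - p) ** 2 for y, p in zip(Y, pred))  # unexplained sum of squares
sst = sum((y - my) ** 2 for y in Y)               # total sum of squares

se = math.sqrt(sse / (n - 2))  # standard error of the estimate
r2 = 1 - sse / sst             # coefficient of determination
```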

36 SSR, SSE, and SST for the wheat example: On the previous slide, we found SSR = 973.5, SST = 1,150, and R² = 973.5 / 1,150 = 0.846. The sum of squares error, SSE, is the difference SSE = SST − SSR = 1,150 − 973.5 = 176.5.
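The decomposition SST = SSR + SSE can be verified directly from the data. A short Python sketch (variable names are mine):

```python
X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
a = my - b * mx
pred = [a + b * x for x in X]

ssr = sum((p - my) ** 2 for p in pred)            # explained sum of squares
sse = sum((y - p) ** 2 for y, p in zip(Y, pred))  # unexplained sum of squares
sst = sum((y - my) ** 2 for y in Y)               # total sum of squares
```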

37 What is the square root of R 2 ? It is the sample correlation coefficient, usually denoted by lower case r.

38 If you don’t already have R² calculated, the sample correlation coefficient r can also be calculated from this formula:
r = (Σ XY − n X̄ Ȳ) / √( (Σ X² − n X̄²)(Σ Y² − n Ȳ²) ).

39 For example, in our wheat problem,
r = (184,500 − 7(400)(60)) / √( (1,400,000 − 7(400)²)(26,350 − 7(60)²) ) = 16,500 / √( (280,000)(1,150) ) ≈ 0.92.
Note that this agrees with √R² = √0.846 ≈ 0.92.
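The sums-form correlation formula is easy to check in Python; this sketch applies it to the wheat data:

```python
import math

X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n

# Sums-form correlation coefficient (slide 38 formula).
sxy = sum(x * y for x, y in zip(X, Y)) - n * mx * my
sxx = sum(x * x for x in X) - n * mx * mx
syy = sum(y * y for y in Y) - n * my * my
r = sxy / math.sqrt(sxx * syy)
```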

40 The sample correlation coefficient r is often used to estimate the population correlation coefficient ρ (rho).

41 ρ = 1: There is a perfect positive linear relation. ρ = −1: There is a perfect negative linear relation. ρ = 0: There is no linear relation. The correlation coefficient (and the covariance) tell how the variables move with each other.

42 Correlation Coefficient Graphs: scatter plots illustrating ρ ≈ 1, ρ ≈ 0.8, ρ ≈ 0.5, ρ ≈ 0, and ρ ≈ −1.

43 R² adjusted or corrected for degrees of freedom:
adjusted R² = 1 − (1 − R²)(n − 1) / (n − 2).
It is possible to compare specifications that would otherwise not be comparable by using the adjusted R². The “2” is because we are estimating 2 parameters, α and β. This will change when we are estimating more parameters.

44 Adjusted R² for the wheat example:
adjusted R² = 1 − (1 − 0.846)(7 − 1) / (7 − 2) = 1 − (0.154)(1.2) ≈ 0.815.
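The adjusted R² always sits at or below the plain R², since it penalizes for the parameters estimated. A Python sketch for the wheat data:

```python
X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum((x - mx) ** 2 for x in X)
a = my - b * mx
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
sst = sum((y - my) ** 2 for y in Y)

r2 = 1 - sse / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)  # penalize for estimating 2 parameters
```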

45 Test on the correlation coefficient: H₀: ρ = 0 versus H₁: ρ ≠ 0. The test statistic is
t = r √(n − 2) / √(1 − r²),
which has a t distribution with n − 2 degrees of freedom when H₀ is true.

46 Test at the 5% level for the wheat example. Recall that r = 0.92 and n = 7. Then
t = (0.92)√5 / √(1 − 0.92²) ≈ 5.25.
From our t table, we see that for 5 dof and a 2-tailed test at the 5% level, our cut-off points are −2.571 and 2.571. Since our t value of 5.25 is in the critical region, we reject H₀ and accept H₁ that the population correlation ρ is not zero.
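This test is mechanical once r is in hand. A Python sketch (the 2.571 critical value is taken from the t table, as on the slide; exact arithmetic gives t ≈ 5.23, versus 5.25 from the rounded r):

```python
import math

X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sxy = sum(x * y for x, y in zip(X, Y)) - n * mx * my
sxx = sum(x * x for x in X) - n * mx * mx
syy = sum(y * y for y in Y) - n * my * my
r = sxy / math.sqrt(sxx * syy)

t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)  # test statistic for H0: rho = 0
T_CRIT = 2.571   # two-tailed 5% critical value for 5 dof (from the t table)
reject_h0 = abs(t) > T_CRIT
```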

47 If our regression line slope estimate b is close to zero, that would indicate that the true slope β might be zero. To test whether β equals zero, we need to know the distribution of b. If ε is normally distributed with mean 0 and standard deviation σ_ε, then b is normally distributed with mean β and standard deviation (or standard error)
σ_b = σ_ε / √( Σ(X_i − X̄)² ).

48 Since we usually don’t know σ_ε, we estimate it using SER = s_e, and use a t with n − 2 dof instead of the Z. So for our test statistic for H₀: β = 0, we have
t = b / s_b, where s_b = s_e / √( Σ(X_i − X̄)² ).

49 For the wheat example, test H₀: β = 0 at the 5% level. We have s_b = 5.94 / √280,000 ≈ 0.0112, so t = 0.059 / 0.0112 ≈ 5.27. From our t table, we see that for 5 dof and a 2-tailed test at the 5% level, our cut-off points are −2.571 and 2.571. Since our t value of 5.27 is in the critical region, we reject H₀ and accept H₁ that the slope β is not zero.

50 This is not a coincidence. When dealing with a regression with a single X value on the right side of the equation, testing whether there is a linear correlation between the 2 variables (ρ = 0) and testing whether the slope is zero (β = 0) are equivalent. Our values differ only because of rounding error.
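That equivalence is easy to verify: computed without intermediate rounding, the two t statistics come out identical. A Python sketch:

```python
import math

X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sxx = sum((x - mx) ** 2 for x in X)
syy = sum((y - my) ** 2 for y in Y)
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))

# t statistic from the test on the slope (H0: beta = 0).
b = sxy / sxx
a = my - b * mx
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
se = math.sqrt(sse / (n - 2))
t_slope = b / (se / math.sqrt(sxx))

# t statistic from the test on the correlation (H0: rho = 0).
r = sxy / math.sqrt(sxx * syy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
```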

51 We can do an ANOVA test based on the amount of variation in the dependent variable Y that is explained by the regression. This is referred to as testing the significance of the regression.
H₀: there is no linear relationship between X and Y (this is the same thing as β equals zero).
H₁: there is a linear relationship between X and Y (this is the same thing as β is not zero).
The test statistic is F = MSR / MSE, with 1 and n − 2 degrees of freedom.

52 Example: Test the significance of the regression in the wheat problem at the 5% level. Recall SSR = 973.5 and SSE = 176.5. Then
F = MSR / MSE = (973.5 / 1) / (176.5 / 5) = 973.5 / 35.3 ≈ 27.58.
The F table shows that for 1 and 5 degrees of freedom, the 5% critical value is 6.61. Since our F has a value of 27.58, we reject H₀: no linear relation and accept H₁: there is a linear relation between wheat yield and fertilizer.
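The F test can be reproduced directly from the ANOVA quantities. A Python sketch (the 6.61 critical value is from the F table, as on the slide; exact arithmetic gives F ≈ 27.4, versus 27.58 from the rounded sums of squares):

```python
import math

X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sxx = sum((x - mx) ** 2 for x in X)
b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sxx
a = my - b * mx
pred = [a + b * x for x in X]

ssr = sum((p - my) ** 2 for p in pred)
sse = sum((y - p) ** 2 for y, p in zip(Y, pred))
msr = ssr / 1        # regression mean square, 1 dof
mse = sse / (n - 2)  # error mean square, n - 2 dof
F = msr / mse
reject_h0 = F > 6.61  # 5% critical value for F(1, 5), from the F table
```

Note that F equals the square of the slope t statistic, as slide 53 points out.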

53 For a regression with just one independent variable X on the right side of the equation, testing the significance of the regression is equivalent to testing whether the slope is zero. Therefore, you might expect there to be a relationship between the statistics used for these tests, and there is one. The F-statistic for this test is the square of the t-statistic for the test on β. In our wheat example, the t-statistic for the test on β was 5.27 and the critical value or cut-off point was 2.571. For the F-test, the statistic was 27.58 ≈ (5.27)² and the critical value or cut-off point was 6.61 ≈ (2.571)². (The numbers don’t match exactly because of rounding error.)

54 We can also calculate confidence intervals for the slope β:
b ± t_{n−2, α/2} · s_b.

55 Calculate a 95% confidence interval for the slope β for the wheat example. Recall that b = 0.059, n = 7, and s_b = 0.0112. We also found the critical values for a 2-tailed t with 5 dof are 2.571 and −2.571. So the interval is
0.059 ± (2.571)(0.0112), which runs from approximately 0.031 to 0.087.
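The interval is simple to recompute. A Python sketch (exact arithmetic gives endpoints near 0.030 and 0.088; the slide's 0.031 to 0.087 reflects the rounded b and s_b):

```python
import math

X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sxx = sum((x - mx) ** 2 for x in X)
b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sxx
a = my - b * mx
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
se = math.sqrt(sse / (n - 2))
sb = se / math.sqrt(sxx)         # standard error of the slope

T_CRIT = 2.571                   # two-tailed 5% t value, 5 dof
lo, hi = b - T_CRIT * sb, b + T_CRIT * sb  # 95% CI for the slope
```

Since the whole interval lies above zero, the 5% test would reject β = 0, in line with slide 56.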

56 Our 95% confidence interval means that we are 95% sure that the true slope of the relationship is between 0.031 and 0.087. Since zero is not in this interval, the results also imply that for a 5% test level, we would reject H₀: β = 0 and accept H₁: β ≠ 0.

57 Sometimes we want to calculate forecasting intervals for predicted Y values.

For example, perhaps we’re working for an agricultural agency. A farmer calls to ask us for an estimate of the wheat yield that might be expected based on a particular fertilizer usage level on the farmer’s wheat field. We might reply that we are 95% certain that the yield would be between 60 and 80 bushels per acre.

A representative from a cereal company might ask for an estimate of the average wheat yield that might be expected based on that same fertilizer usage level on many wheat fields. To that question, we might reply that we are 95% certain that the yield would be between 65 and 75 bushels per acre.

Our intervals would both be centered around the same number (70 in this example), but we can give a more precise prediction for an average of many fields than we can for an individual field.

58 The width of our forecasting intervals also depends on how far the specified value of the independent variable is from the values in our sample. Recall that the fertilizer values in our wheat problem had a mean of 400 and were all between 100 and 700. If someone asks about applying 2000 units of fertilizer to a field, we would probably feel less comfortable with our prediction than we would if the person asked about applying 500 units of fertilizer. The closer the value of X is to the mean value of our sample, the more comfortable we are with our numbers, and the narrower the interval required for a particular confidence level.

59 Forecasting intervals for the individual case and for the mean of many cases: the graph shows the regression line together with the upper and lower endpoints of the forecasting interval for the individual case and the upper and lower endpoints of the forecasting interval for the mean of many cases. Notice that the intervals for the individual case are wider than those for the mean of many cases. Also, all the intervals are narrower near the sample mean of the independent variable.

60 For the given level of X requested by our callers, we would have the following: both intervals are centered at the predicted value of 70; the forecasting interval for the individual case runs from 60 to 80, and the interval for the mean of many cases runs from 65 to 75.

61 Formulae for forecasting intervals, for a given value X_g of the independent variable, with Ŷ = a + b X_g:
forecasting interval for the individual case: Ŷ ± t_{n−2, α/2} · s_e · √( 1 + 1/n + (X_g − X̄)² / Σ(X_i − X̄)² )
forecasting interval for the mean of many cases: Ŷ ± t_{n−2, α/2} · s_e · √( 1/n + (X_g − X̄)² / Σ(X_i − X̄)² )

62 Example: If 550 pounds of fertilizer are applied in our wheat example, find the 95% forecasting interval for the mean wheat yield if we fertilized many fields.
Ŷ = 36.4 + (0.059)(550) = 68.85, and
68.85 ± (2.571)(5.94) √( 1/7 + (550 − 400)² / 280,000 ) = 68.85 ± 7.2,
so the interval runs from about 61.6 to 76.1.

63 Example: If 550 pounds of fertilizer are applied in our wheat example, find the 95% forecasting interval for the wheat yield if we fertilized one field.
68.85 ± (2.571)(5.94) √( 1 + 1/7 + (550 − 400)² / 280,000 ) = 68.85 ± 16.9,
so the interval runs from about 52.0 to 85.7.
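Both forecasting intervals share the same center and differ only by the extra 1 under the square root. This Python sketch computes the two margins at X = 550 (variable names are mine; the 2.571 critical value is from the t table, as on the slides):

```python
import math

X = [100, 200, 300, 400, 500, 600, 700]
Y = [40, 50, 50, 70, 65, 65, 80]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sxx = sum((x - mx) ** 2 for x in X)
b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sxx
a = my - b * mx
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(X, Y))
se = math.sqrt(sse / (n - 2))
T_CRIT = 2.571                        # two-tailed 5% t, 5 dof

xg = 550                              # fertilizer level to forecast at
yhat = a + b * xg                     # common center of both intervals
lever = 1 / n + (xg - mx) ** 2 / sxx  # distance-from-mean term

m_mean = T_CRIT * se * math.sqrt(lever)       # margin, mean of many fields
m_indiv = T_CRIT * se * math.sqrt(1 + lever)  # margin, a single field
```

As expected, the single-field margin is the larger of the two.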

64 Notice that, as we stated previously, the interval for the mean of many cases is narrower than the interval for the individual case.

