Regression Analysis: A statistical procedure used to find relations among a set of variables B. Klinkenberg G376 07.

1 Regression Analysis: A statistical procedure used to find relations among a set of variables B. Klinkenberg G376 07

2 In regression analysis, there is a dependent variable (e.g., income), which is the one you are trying to explain, and one or more independent variables (e.g., education level) that are related to it. You can express the relation as an equation, such as: y = a + bx

3 y is the dependent variable (e.g., income is dependent upon educational level)
x is the independent variable (e.g., educational level)
a is a constant (base income assuming you have no education)
b is the slope of the line (how much education affects income)
For every increase of 1 in x, y changes by an amount equal to b.
Some relationships are perfectly linear and fit this equation exactly. Your cell phone bill, for instance, may be:
Total Charges = Base Fee + 30¢ × (overage minutes)
If you know the base fee and the number of overage minutes, you can predict the total charges exactly.
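As a minimal sketch of the exact linear relation y = a + bx, the cell phone bill can be written as a small function. The function name and the base fee of 40.00 are made up for illustration; only the 30¢ rate comes from the slide.

```python
def total_charges(base_fee, overage_minutes, rate_per_minute=0.30):
    """Exact linear relation y = a + b*x.

    a = base_fee, b = rate_per_minute, x = overage_minutes.
    """
    return base_fee + rate_per_minute * overage_minutes


# Hypothetical base fee; the slide does not give one.
bill_10 = total_charges(base_fee=40.00, overage_minutes=10)
bill_11 = total_charges(base_fee=40.00, overage_minutes=11)

print(f"10 overage minutes: {bill_10:.2f}")
# One extra unit of x changes y by exactly b (the slope).
print(f"change for one extra minute: {bill_11 - bill_10:.2f}")
```

Because there is no error term here, knowing a, b, and x determines y completely.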

4 Other relationships may not be so exact. Weight, for instance, is to some degree a function of height, but there are variations that height does not explain. On average, you might have an equation like: Weight = -222 + 5.7 × Height. If you take a sample of actual heights and weights, you might see something like the graph to the right.

5 The line in the graph shows the average relation described by the equation. Often, none of the actual observations lie on the line. The difference between the line and any individual observation is the error or unexplained residual, e. The new equation is: Weight = -222 + 5.7 × Height + e. This equation does not mean that people who are short enough will have a negative weight. The observations that contributed to this analysis were all for heights between 5’ and 6’4”. The model will likely provide a reasonable estimate for anyone in this height range. You cannot, however, extrapolate the results to heights outside of those observed. The regression results are only valid for the range of actual observations.
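The residual and the "no extrapolation" rule can be sketched in a few lines. The slide does not state units; this sketch assumes height in inches and weight in pounds (5’ to 6’4” is 60 to 76 inches, which gives predictions of roughly 120 to 211). The function names and the sample observation are hypothetical.

```python
MIN_HEIGHT_IN = 60  # 5'0" expressed in inches (assumed unit)
MAX_HEIGHT_IN = 76  # 6'4" expressed in inches (assumed unit)


def in_observed_range(height_in):
    """True if the height lies inside the range the model was fitted on."""
    return MIN_HEIGHT_IN <= height_in <= MAX_HEIGHT_IN


def predict_weight(height_in):
    """Predicted weight from the slide's equation; refuses to extrapolate."""
    if not in_observed_range(height_in):
        raise ValueError("height is outside the range of the observations")
    return -222 + 5.7 * height_in


def residual(observed_weight, height_in):
    """e = observed value minus the value on the regression line."""
    return observed_weight - predict_weight(height_in)


# Hypothetical observation: a person 70 inches tall who weighs 160.
print(predict_weight(70))
print(residual(160, 70))
```

A negative residual means the observation lies below the line; a height such as 30 inches is rejected rather than producing a negative predicted weight.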

6 Age vs income

7 Regression finds the line that best fits the observations. Simply summing the residuals would not be a useful measure of fit, because the negative residuals (for points below the line) cancel the positive residuals (for points above the line); for the best-fit line with a constant term, the sum is exactly zero. So, instead, regression uses the sum of the squares of the residuals. An Ordinary Least Squares (OLS) regression finds the line that results in the lowest sum of squared residuals.
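A minimal sketch of a one-variable OLS fit, using the standard closed-form formulas for the slope and intercept. The height and weight values are invented for illustration and the function names are hypothetical.

```python
def fit_ols(xs, ys):
    """Return (a, b) minimising the sum of squared residuals of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared differences between observations and the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up sample (heights in inches, weights in pounds).
heights = [62, 65, 68, 70, 73, 75]
weights = [128, 155, 160, 182, 190, 210]

a, b = fit_ols(heights, weights)
residuals = [w - (a + b * h) for h, w in zip(heights, weights)]
best_ssr = sum_squared_residuals(heights, weights, a, b)

print(f"a = {a:.2f}, b = {b:.2f}")
print(f"sum of residuals = {sum(residuals):.6f}")   # cancels to ~0
print(f"sum of squared residuals = {best_ssr:.2f}")
```

Nudging a or b away from the fitted values can only increase the sum of squared residuals, which is what "best fit" means here.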

8 How Good is the Model? One of the measures of how well the model explains the data is the R² value. Differences between observations that are not explained by the model (i.e., the residuals) remain in the error term. The R² value tells you what proportion of the variation in the dependent variable is explained by the model. An R² of 0.68 means that 68% of the variance in the observed values of the dependent variable is explained by the model, and 32% remains unexplained in the error (residual) term.
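The R² value can be computed directly from the residuals. This sketch uses the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares; the five data points are made up, and the predictions come from the line y = 2.2 + 0.6x, which is the OLS fit for them.

```python
def r_squared(observed, predicted):
    """Share of the variance in `observed` that the predictions explain."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Made-up data; predictions are from the fitted line y = 2.2 + 0.6 * x.
xs = [1, 2, 3, 4, 5]
observed = [2, 4, 5, 4, 5]
predicted = [2.2 + 0.6 * x for x in xs]

r2 = r_squared(observed, predicted)
print(f"R^2 = {r2:.2f}")
```

Here 60% of the variance is explained by the line and 40% stays in the residual term. Predicting the mean for every observation would give an R² of 0, and a perfect fit would give 1.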

9 Unknowable concerns Some of the error is random, and no model will explain it. Someone may be very ambitious, whereas someone else may be willing to sacrifice promotions for more time off. This typically can’t be observed or measured, and these types of effects will vary randomly and unpredictably. Some variance will always remain in the error term. As long as it is random, it is of no concern.

10 Some of the error isn’t error Some of the error is best described as unexplained residual: if we added additional variables (such as age, when relating education to income), we might be able to reduce the residual. (See the discussion below on omitted variables.)

11 “p-values” and Significance Levels Each independent variable has another number attached to it in the regression results: its “p-value” or significance level. The p-value is a probability. It tells you how likely it is that a coefficient at least as large as the one estimated would emerge by chance if there were no real relationship between that independent variable and the dependent variable. A p-value of 0.05 means that, if there were no real relationship, a coefficient this large would emerge randomly only 5% of the time; it does not mean that there is a 95% chance that the relationship is real. It is generally accepted practice to consider variables with a p-value of less than 0.1 (or, more strictly, 0.05) as significant, though the only basis for this cutoff is convention.
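One way to make "emerged by chance" concrete is a permutation test: shuffle the dependent variable many times, refit the slope each time, and count how often a slope at least as large as the observed one appears. This is a swapped-in illustration, not how standard regression output computes p-values (those usually come from a t-test on the coefficient). The data, function names, and number of permutations are all assumptions.

```python
import random


def slope(xs, ys):
    """OLS slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def permutation_p_value(xs, ys, n_permutations=5000, seed=0):
    """Share of random shuffles whose |slope| is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between x and y
        if abs(slope(xs, shuffled)) >= observed:
            count += 1
    # The +1 terms count the observed arrangement itself.
    return (count + 1) / (n_permutations + 1)


# Made-up data: years of education and income (in thousands).
education = [8, 10, 12, 12, 14, 16, 16, 18]
income = [22, 27, 31, 34, 40, 45, 49, 55]
p_related = permutation_p_value(education, income)

# Made-up data with an observed slope of exactly zero.
p_unrelated = permutation_p_value([1, 2, 3, 4], [1, 2, 2, 1])

print(f"p (related data)   = {p_related:.4f}")
print(f"p (unrelated data) = {p_unrelated:.4f}")
```

A small value says that shuffled data rarely produce a slope this steep; it does not say how probable it is that the relationship is real.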

12 Some Things to Watch Out For: Omitted Variables, Endogeneity, Other issues

13 Omitted Variables If independent variables that have significant relationships with the dependent variable are left out of the model, the model explains less of the variation, and when an omitted variable is also related to the included variables, their coefficients absorb part of its effect and are biased. In the income example, age is obviously an important variable, since in general the older you get, regardless of your initial income, the more you make.
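The effect of leaving out a relevant variable can be sketched with synthetic data. Here income is constructed, by assumption, as exactly 5 + 2 × education + 0.5 × age, and age is correlated with education. All numbers and function names are invented for illustration.

```python
def centered(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slope_one_predictor(x, y):
    """OLS slope of y on a single predictor x."""
    cx, cy = centered(x), centered(y)
    return dot(cx, cy) / dot(cx, cx)


def slopes_two_predictors(x1, x2, y):
    """OLS slopes of y on x1 and x2, via the 2x2 normal equations."""
    c1, c2, cy = centered(x1), centered(x2), centered(y)
    s11, s22, s12 = dot(c1, c1), dot(c2, c2), dot(c1, c2)
    s1y, s2y = dot(c1, cy), dot(c2, cy)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Synthetic data: older people in this sample also tend to have more education.
education = [10, 12, 12, 14, 16, 16, 18, 20]
age = [25, 30, 28, 35, 40, 38, 45, 50]
income = [5 + 2 * e + 0.5 * a for e, a in zip(education, age)]

b_edu_short = slope_one_predictor(education, income)   # age omitted
b_edu_full, b_age_full = slopes_two_predictors(education, age, income)

print(f"education slope, age omitted:  {b_edu_short:.3f}")
print(f"education slope, age included: {b_edu_full:.3f}")
print(f"age slope:                     {b_age_full:.3f}")
```

With age omitted, the education coefficient is inflated because it picks up the part of the age effect that moves together with education; with age included, the constructed coefficients are recovered.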

14 Endogeneity Regression measures the effect of changes in the independent variable on the dependent variable. Endogeneity occurs when that relationship is either backwards or circular, meaning that changes in the dependent variable cause changes in the independent variable. For example, in predicting the value of a home, the perceived quality of the local school might affect home values (e.g., people are willing to pay more to move into a neighbourhood if the school is perceived to be a “good” one). But the perceived quality is likely also related to the actual quality, and the actual quality is at least partially a result of funding levels. Funding levels are often related to the property tax base, or the value of local homes. So good schools increase home values, but high home values also improve schools. This circular relationship, if it is strong, can bias the results of the regression. There are strategies for reducing the bias if removing the endogenous variable is not an option.

15 Other issues There are several other types of biases or sources of distortion that can exist in a regression model for a variety of reasons. Spatial autocorrelation is one significant bias that can greatly affect aspatial regression. There are tests to measure the levels of bias, and there are strategies that can be used to reduce it. Eventually, though, one may have to accept a certain amount of bias in the final model, especially when there are data limitations. In that case, the best that can be done is to describe the problem and the effects it might have when presenting the model.
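The slide does not name a specific test. One commonly used measure of spatial autocorrelation is Moran's I, which could be applied to the residuals of a regression. The sketch below is an illustrative choice only: it assumes locations arranged along a line where each location neighbours the ones directly beside it, and the residual values are made up.

```python
def chain_weights(n):
    """Neighbour matrix for n locations in a row: 1 if adjacent, else 0."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


def morans_i(values, weights):
    """Moran's I: positive when neighbours are alike, negative when unlike."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))
    return (n / total_weight) * cross / sum(d * d for d in z)


w = chain_weights(8)

# Made-up residuals: similar values sit next to each other.
clustered = [3, 2, 2, 1, -1, -2, -2, -3]
# Made-up residuals: neighbours always have opposite signs.
alternating = [1, -1, 1, -1, 1, -1, 1, -1]

i_clustered = morans_i(clustered, w)
i_alternating = morans_i(alternating, w)

print(f"clustered residuals:   I = {i_clustered:.3f}")
print(f"alternating residuals: I = {i_alternating:.3f}")
```

Residuals that cluster in space (a clearly positive I) suggest that the model is missing something spatially structured.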

16 In summary If you read over the web pages (URLs present on the lecture page) you should be able to gain a basic understanding of what regression analysis is, and how to interpret the findings.

