
1 Regression: An Introduction LIR 832

2 Regression Introduced Topics of the day: A. What does OLS do? Why use OLS? How does it work? B. Residuals: What we don’t know. C. Moving to the Multi-variate Model D. Quality of Regression Equations: R 2

3 Regression Example #1 Just what is regression and what can it do? To address this, consider the study of truck driver turnover in the first lecture…

4

5 Regression Example #2 Suppose that we are interested in understanding the determinants of teacher pay. What we have is a data set on average per- pupil expenditures and average teacher pay by state…

6

7 Regression Example #2 Descriptive Statistics: pay, expenditures Variable N Mean Median TrMean StDev SE Mean pay 51 24356 23382 23999 4179 585 expendit 51 3697 3554 3596 1055 148 Variable Minimum Maximum Q1 Q3 pay 18095 41480 21419 26610 expendit 2297 8349 2967 4123

8 Regression Example #2 Covariances: pay, expenditures pay expendit pay 17467605 expendit 3679754 1112520 Correlations: pay, expenditures Pearson correlation of pay and expenditures = 0.835 P-Value = 0.000

9 Regression Example #2

10 The regression equation is pay = 12129 + 3.31 expenditures Predictor Coef SE Coef T P Constant 12129 1197 10.13 0.000 expendit 3.3076 0.3117 10.61 0.000 S = 2325 R-Sq = 69.7% R-Sq(adj) = 69.1% pay = 12129 + 3.31 expenditures is the equation of a line and we can add it to our plot of the data.

11 Regression Example #2 Pay = 12129 +3.31*Expenditures

12 Regression: What Can We Learn? What can we learn from the regression? Q1: What is the relationship between per pupil expenditures and teacher pay? A: For every additional dollar of expenditure, pay increases by $3.31.

13 Regression: What Can We Learn? Q2: Given our sample, is it reasonable to suppose that increased per-pupil expenditures are associated with higher pay? H 0 : expenditures make no difference: β ≤ 0 H A : expenditures increase pay: β > 0 P( (b - β)/SE(b) > (3.3076 - 0)/0.3117 ) = P( z > 10.61 ) A: Reject our null; it is reasonable to believe there is a positive relationship.
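The test statistic here is just the estimated coefficient divided by its standard error; a quick Python check using the values from the regression output (3.3076 and 0.3117):

```python
# t-statistic for H0: beta <= 0, using coefficient and SE from the output
b1 = 3.3076
se = 0.3117
t = (b1 - 0) / se
print(round(t, 2))   # -> 10.61, matching the reported T value
```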

14 Regression: What Can We Learn? Q3: What proportion of the variance in teacher pay can we explain with our regression line? A: R-Sq = 69.7%

15 Regression: What Can We Learn? Q4: We can also make predictions from the regression model. What would teacher pay be if we spent $4,000 per pupil? A: pay = 12129 + 3.31 expenditures pay = 12129 + 3.31*4000 = $25,369 What if we had per pupil expenditures of $6400 (Michigan’s amount)? Pay = 12129 + 3.31*6400 = $33,313
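Both predictions can be reproduced directly from the fitted line (a quick sketch using the rounded slope 3.31 from the output above):

```python
# Predictions from the fitted line: pay = 12129 + 3.31 * expenditures
def predicted_pay(expenditures):
    return 12129 + 3.31 * expenditures

print(round(predicted_pay(4000)))   # -> 25369
print(round(predicted_pay(6400)))   # -> 33313  (Michigan's per-pupil figure)
```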

16 Regression: What Can We Learn? Q5: For the states where we have data, we can also observe the difference between our prediction and the actual amount. A: Take the case of Alaska: expenditures: $8,349; actual pay: $41,480 predicted pay = 12129 + 3.3076*8,349 ≈ 39,744 difference between actual and predicted pay: 41,480 - 39,744 ≈ $1,736

17 Regression: What Can We Learn? Note that we have under-predicted actual pay. Why might this occur? This difference is called the residual; it is a measure of the imperfection of our model. What is the residual for the state of Maine? Per-pupil expenditure is $3,346; actual teacher pay is $19,583.

18 Regression: What Can We Learn? Residual (e) = Actual - Predicted
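A quick check of this definition for the Alaska and Maine cases, using the full-precision slope 3.3076 from the regression output:

```python
# Residual = actual - predicted, from the fitted line pay = 12129 + 3.3076*exp
def predicted_pay(expenditures):
    return 12129 + 3.3076 * expenditures

# Alaska: expenditures $8,349, actual pay $41,480 (under-predicted)
print(round(41480 - predicted_pay(8349)))   # -> 1736
# Maine: expenditures $3,346, actual pay $19,583 (over-predicted)
print(round(19583 - predicted_pay(3346)))   # -> -3613
```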

19

20 Regression Nomenclature

21 Components of a Regression Model Dependent variable: we are trying to explain the movement of the dependent variable around its mean. Explanatory variable(s): We use these variables to explain the movement of the dependent variable. Error Term: This is the difference between what we can account for with our explanatory variables and the actual value taken on by the dependent variable. Parameter: The measure of the relationship between an explanatory variable and a dependent variable.

22 Regression Models are Linear Q: What do we mean by “linear”? A: The equation takes the form:

23 Regression Example #3 Using numbers, let's make up an equation for a compensation bonus system in which everyone starts with a bonus of $500 annually and then receives an additional $100 for every point earned. Now create a table relating job points to bonus income.

24 Regression Example #3

25

26 Basic model takes the form: Y = β 0 + β 1 *X + ε or, for the bonus pay example, Pay = $500 + $100*points + ε

27 Regression Example #3 This is the equation of a line where: $500 is the minimum bonus when the individual has no bonus points. This is the intercept of the line $100 is the increase in the total bonus for every additional job point. This is the slope of the line Or: β 0 is the intercept of the vertical axis (Y axis) when X = 0 β 1 is the change in Y for every 1 unit change in X, or:

28 Regression Example #3 For points on the line: Let X 1 = 10 & X 2 = 20 Using our line: Y 1 = $500 + $100*10 = $1,500 Y 2 = $500 +$100*20 = $2,500
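The bonus schedule and the two points on the line can be tabulated with a few lines of Python (a sketch of the table the example asks for):

```python
# Bonus system: $500 base plus $100 per job point earned.
def bonus(points):
    return 500 + 100 * points

for p in (0, 10, 20):
    print(p, bonus(p))   # 0 -> 500, 10 -> 1500, 20 -> 2500
```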

29 Regression Example #3

30 1. The change in bonus pay for a 1 point increase in job points: 2. What do we mean by “linear”? Equation of a line: Y = β 0 + β 1 *X + ε is the equation of a line

31 Regression Example #3 Equation of a line which is linear in coefficients but not variables: Y = β 0 + β 1 *X + β 2 *X 2 + ε Think about a new bonus equation: Base Bonus is still $500 You now get $0 per bonus point and $10 per bonus point squared

32 Regression Example #3

33

34

35 Linearity of Regression Models Y = β 0 + β 1 *X^β 2 + ε is not linear in the coefficients, so OLS cannot estimate it directly. Regression has to be linear in the coefficients, not in the variables. We can mimic curves and much else if we are clever.
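A small sketch (with a made-up set of job-point values) shows why the quadratic bonus schedule from the earlier example is still estimable by OLS: it is linear in the coefficients once X and X² are treated as two separate regressors.

```python
import numpy as np

# Quadratic bonus schedule: base $500, $0 per point, $10 per point squared.
points = np.array([0, 1, 2, 3, 4, 5], dtype=float)   # hypothetical point values
bonus = 500 + 0 * points + 10 * points**2

# Design matrix with columns [1, X, X**2]: linear in (b0, b1, b2).
X = np.column_stack([np.ones_like(points), points, points**2])
coef, *_ = np.linalg.lstsq(X, bonus, rcond=None)
print(np.round(coef, 6))   # recovers (b0, b1, b2) = (500, 0, 10)
```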

36 The Error Term The error term is the difference between what has occurred and what we predict as an outcome. Our models are imperfect because of: omitted “minor” influences; measurement error in Y and the X’s; issues of functional form (a linear model for a non-linear relationship); and the pure randomness of behavior.

37 The Error Term Our full equation is Y = β 0 + β 1 *X + ε However, we often write the deterministic part of our model as: E(Y|X) = β 0 + β 1 *X Our use of “conditional on X” is similar to conditional probabilities. Essentially we are saying: this is our best guess about Y given the value of X.

38 The Error Term This is also written as Ŷ = b 0 + b 1 *X. Note that Ŷ is called “Y-hat,” the estimate of Y. So we can write the full model as Y = Ŷ + e. What does this mean in practice? The same X value may produce somewhat different Y values. Our predictions are imperfect!

39 Populations, Samples, and Regression Analysis Population Regression: Y = β 0 + β 1 X 1 + ε The population regression is the equation for the entire group of interest. Similar in concept to μ, the population mean The population regression is indicated with Greek letters. The population regression is typically not observed.

40 Populations, Samples, and Regression Analysis Sample Regressions: As with means, we take samples and use these samples to learn about (make inferences about) populations (and population regressions) The sample regression is written as y i = b 0 + b 1 x 1i + e i or as

41 Populations, Samples, and Regression Analysis

42 As with all sample results, there are lots of samples which might be drawn from a population. These samples will typically provide somewhat different estimates of the coefficients. This is, once more, sampling variation.

43 Populations and Samples: Regression Example Illustrative Exercise: 1. Estimate a simple regression model for all of the data on managers and professionals, then take random 10% subsamples of the data and compare the estimates! 2. Sample estimates are generated by assigning a number between 0 and 1 to every observation using a uniform distribution. We then choose observations for all of the numbers between 0 and 0.1, 0.1 and 0.2, 0.2 and 0.3, etc.
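A hypothetical simulation of this exercise (the data below are invented to mimic the managers/professionals regression, not the actual CPS extract) shows the same mechanics: a full-sample estimate near the true slope, and 10% subsamples whose estimates scatter around it.

```python
import numpy as np

# Simulated "population" obeying weekearn = -485 + 87.5*ed + noise.
rng = np.random.default_rng(0)
n = 47576
ed = rng.integers(8, 21, size=n).astype(float)          # years of education
earn = -485 + 87.5 * ed + rng.normal(0, 530, size=n)    # population model

def ols_slope(x, y):
    # b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
    return ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()

full_slope = ols_slope(ed, earn)

# Assign each observation a U(0,1) draw; take three 10% subsamples.
u = rng.uniform(size=n)
sub_slopes = []
for lo in (0.0, 0.1, 0.2):
    mask = (u >= lo) & (u < lo + 0.1)
    sub_slopes.append(ols_slope(ed[mask], earn[mask]))

print(round(full_slope, 1), [round(s, 1) for s in sub_slopes])
# full-sample estimate sits near 87.5; subsample estimates vary around it
```

This is sampling variation in miniature: each subsample gives a somewhat different slope, just as each of the five samples on the following slides does.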

44 Populations and Samples: Regression Example POPULATION ESTIMATES: Results for: lir832-managers-and- professionals-2000.mtw The regression equation is weekearn = - 485 + 87.5 years ed 47576 cases used 7582 cases contain missing values Predictor Coef SE Coef T P Constant -484.57 18.18 -26.65 0.000 years ed 87.492 1.143 76.54 0.000 S = 530.5 R-Sq = 11.0% R-Sq(adj) = 11.0% Analysis of Variance Source DF SS MS F P Regression 1 1648936872 1648936872 5858.92 0.000 Residual Error 47574 13389254994 281441 Total 47575 15038191866

45 Side Note: Reading Output The regression equation is weekearn = - 485 + 87.5 years ed [equation with dependent variable] 47576 cases used 7582 cases contain missing values [number of observations and number with missing data - why is the latter important] Predictor Coef SE Coef T P Constant -484.57 18.18 -26.65 0.000 years ed 87.492 1.143 76.54 0.000 [detailed information on estimated coefficients, standard error, t against a null of zero, and a p against a null of 0] S = 530.5 R-Sq = 11.0% R-Sq(adj) = 11.0% [two goodness of fit measures]

46 Side Note: Reading Output Analysis of Variance Source DF SS MS F P Regression 1 1648936872 1648936872 5858.92 0.000 [ESS] Residual Error 47574 13389254994 281441 [SSR] Total 47575 15038191866 [TSS] [This tells us the number of degrees of freedom, the explained sum of squares, the residual sum of squares, the total sum of squares and some test statistics]

47 Populations and Samples: Regression Example

48 SAMPLE 1 RESULTS The regression equation is weekearn = - 333 + 79.2 Education 4719 cases used 726 cases contain missing values Predictor Coef SE Coef T P Constant -333.24 58.12 -5.73 0.000 Educatio 79.208 3.665 21.61 0.000 S = 539.5 R-Sq = 9.0% R-Sq(adj) = 9.0%

49 Populations and Samples: Regression Example

50 SAMPLE 2 RESULTS The regression equation is weekearn = - 489 + 88.2 Education 4792 cases used 741 cases contain missing values Predictor Coef SE Coef T P Constant -488.51 56.85 -8.59 0.000 Educatio 88.162 3.585 24.59 0.000 S = 531.7 R-Sq = 11.2% R-Sq(adj) = 11.2%

51 Populations and Samples: Regression Example

52 SAMPLE 3 RESULTS The regression equation is weekearn = - 460 + 85.9 Education 4652 cases used 773 cases contain missing values Predictor Coef SE Coef T P Constant -460.15 56.45 -8.15 0.000 Educatio 85.933 3.565 24.10 0.000 S = 525.2 R-Sq = 11.1% R-Sq(adj) = 11.1%

53 Populations and Samples: Regression Example SAMPLE 4 RESULTS The regression equation is weekearn = - 502 + 88.4 Education 4708 cases used 787 cases contain missing values Predictor Coef SE Coef T P Constant -502.18 57.51 -8.73 0.000 Educatio 88.437 3.632 24.35 0.000 S = 535.6 R-Sq = 11.2% R-Sq(adj) = 11.2%

54 Populations and Samples: Regression Example SAMPLE 5 RESULTS The regression equation is weekearn = - 485 + 87.9 Education 4737 cases used 787 cases contain missing values Predictor Coef SE Coef T P Constant -485.19 56.60 -8.57 0.000 Educatio 87.875 3.572 24.60 0.000 S = 523.4 R-Sq = 11.3% R-Sq(adj) = 11.3%

55 Populations and Samples: Regression Example

56 Populations and Samples: A Recap of the Example

57 The sample estimates are not exactly equal to the population estimates. Different samples produce different estimates of the slope and intercept.

58 Ordinary Least Squares (OLS): How We Determine the Estimates The residual is a measure of what we do not know: e i = y i - b 0 - b 1 x 1i We want the e i to be as small as possible. How do we choose (b 0, b 1 )? AKA: Criteria for the sample regression: Choose among lines so that the average value of the residual is zero. Statistically, this occurs for any line that passes through the point of means (X-bar, Y-bar). Problem: there are infinitely many lines which meet this criterion.

59 Example of a Possible Regression Line Mean = $24,356 Mean = $3,696

60 Problem: Many Lines Meet That Criteria Mean = $24,356 Mean = $3,696

61 OLS: Choosing the Coefficients Among these lines, find the (b 0, b 1 ) pair which minimizes the sum of squared residuals: We want to make the difference between the prediction and the actual value, (Y - E(Y|X)), as small as possible. Squaring puts greater weight on avoiding large individual differences between actual and predicted values. So we will choose the middle course: middle-sized errors, rather than a combination of large and small errors.

62 OLS: Choosing the Coefficients

63 What are the characteristics of a sample regression? It can be shown that, if these two conditions hold, our regression line is the Best Linear Unbiased Estimator (B-L-U-E). This result is called the Gauss-Markov Theorem.

64 OLS: Choosing the Coefficients Descriptive Statistics: pay, expenditures Variable N Mean Median TrMean StDev SE Mean pay 51 24356 23382 23999 4179 585 expendit 51 3697 3554 3596 1055 148 Variable Minimum Maximum Q1 Q3 pay 18095 41480 21419 26610

65 OLS: Choosing the Coefficients The regression equation is pay = 12129 + 3.31 expenditures Predictor Coef SE Coef T P Constant 12129 1197 10.13 0.000 expendit 3.3076 0.3117 10.61 0.000 S = 2325 R-Sq = 69.7% R-Sq(adj) = 69.1% Analysis of Variance Source DF SS MS F P Regression 1 608555015 608555015 112.60 0.000 Residual Error 49 264825250 5404597 Total 50 873380265

66

67 OLS: Choosing the Coefficients Mean of the residuals equal to zero? Descriptive Statistics: Residual Variable N Mean Median TrMean StDev SE Mean Residual 51 -0 -218 -107 2301 322 Variable Minimum Maximum Q1 Q3 Residual -3848 5529 -2002 1689

68 OLS: Choosing the Coefficients Passes Through the Point of Means? pay = 12129 + 3.3076 expenditures Variable N Mean pay 51 24356 expendit 51 3697 $24,356 = 12129 + 3.3076*3697 $24,356 = 12129 + 12,228.20 $24,356 ≈ $24,357.20 Not too bad with rounding!

69 OLS: Demonstrating Residuals [Scatterplot of pay against expenditures with the fitted line; two residuals, e 1 and e 2, are marked. Mean pay = $24,356; mean expenditures = $3,696]

70 How Does OLS Know Which Line is BLUE? If we are trying to minimize the sum of squared residuals, we can manipulate the model to find the following: y i = b 0 + b 1 x 1i + e i e i = y i - b 0 - b 1 x 1i Therefore: since we want to minimize the sum of squared residuals, OLS chooses the b 0 and b 1 that minimize the sum of squares of the right-hand side (since we know the y's and x's).

71 How Does OLS Calculate the Coefficients? The formulas used for the coefficients are as follows:

72 Illustrative Example: Attendance and Output We want to build a model of output based on attendance. We hypothesize the following: output = β 0 + β 1 *attendance + ε

73 Example Results The regression equation is output = 15.7 + 3.32 attend Predictor Coef SE Coef T P Constant 15.733 3.247 4.85 0.017 attend 3.3190 0.6392 5.19 0.014 S = 3.079 R-Sq = 90.0% R-Sq(adj) = 86.6% Analysis of Variance Source DF SS MS F P Regression 1 255.56 255.56 26.96 0.014 Residual Error 3 28.44 9.48 Total 4 284.00 Obs attend output Fit SE Fit Residual St Resid 1 8.00 40.00 42.28 2.57 -2.28 -1.35 2 3.00 28.00 25.69 1.72 2.31 0.90 3 2.00 20.00 22.37 2.16 -2.37 -1.08 4 6.00 39.00 35.65 1.64 3.35 1.29 5 4.00 28.00 29.01 1.43 -1.01 -0.37

74 Computing the Coefficients So, b 1 = 3.31897. Thus, b 0 = ybar - b 1 *xbar = 31 - 3.31897*4.6 = 15.733
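The hand computation above can be reproduced with the textbook formulas, using the five attendance/output observations listed in the regression output:

```python
# OLS formulas: b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2),
#               b0 = ybar - b1 * xbar
attend = [8, 3, 2, 6, 4]
output = [40, 28, 20, 39, 28]

xbar = sum(attend) / len(attend)     # 4.6
ybar = sum(output) / len(output)     # 31.0
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(attend, output))   # 77.0
sxx = sum((x - xbar) ** 2 for x in attend)                           # 23.2
b1 = sxy / sxx                       # 3.31897...
b0 = ybar - b1 * xbar                # 15.733...
print(round(b1, 5), round(b0, 3))    # -> 3.31897 15.733
```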

75 Example: Residual Analysis Variable N Mean Median TrMean StDev SE Mean C15 5 0.00 -1.01 0.00 2.66 1.19 Variable Minimum Maximum Q1 Q3 C15 -2.37 3.35 -2.33 2.83

76 Exercise We are interested in the relationship between the number of weeks an employee has been in a firm-sponsored training course and output. We have data on three employees. Compute the coefficients for the following model: output = β 0 + β 1 *training + ε

77 Exercise: Worksheet Using the data, calculate b1 and b0:

78 OLS: The Intercept (b 0 ) Why you shouldn’t spend too much time worrying about the value of the intercept: b 0 = 24356 - 3.3076*3697 ≈ 12129 Note that b 0 is the value for pay if expenditures were equal to 0, something we may never observe.

79 Multiple Regression Few outcomes are determined by a single factor: 1. We know that gender plays an important role in determining pay. Is gender the only factor? 2. What is likely to matter in determining attendance at a work site: our program holidays weather illness demographics of the labor force

80 Multiple Regression A complete model of an outcome will depend not only on inclusion of our explanatory variable of interest, but also on including other variables which we believe influence our outcome. Getting the “correct” estimates of our coefficients depends on specifying the balance of the equation correctly. This raises the bar in our work.

81 Multiple Regression: Example An example with Weekly Earnings: 1. Regress weekly earnings of managers on education 2. Add age and gender to the model 3. Add weekly hours to the model

82 Example: Weekly Earnings The regression equation is weekearn = - 485 + 87.5 years ed 47576 cases used 7582 cases contain missing values Predictor Coef SE Coef T P Constant -484.57 18.18 -26.65 0.000 years ed 87.492 1.143 76.54 0.000 S = 530.5 R-Sq = 11.0% R-Sq(adj) = 11.0% Analysis of Variance Source DF SS MS F P Regression 1 1648936872 1648936872 5858.92 0.000 Residual Error 47574 13389254994 281441 Total 47575 15038191866

83 Example: Weekly Earnings The regression equation is weekearn = - 402 + 76.4 years ed + 6.29 age - 319 Female 47576 cases used 7582 cases contain missing values Predictor Coef SE Coef T P Constant -401.76 18.87 -21.29 0.000 age 6.2874 0.2021 31.11 0.000 Female -318.522 4.625 -68.87 0.000 years ed 76.432 1.089 70.16 0.000 S = 500.4 R-Sq = 20.8% R-Sq(adj) = 20.8% Analysis of Variance Source DF SS MS F P Regression 3 3126586576 1042195525 4162.27 0.000 Residual Error 47572 11911605290 250391 Total 47575 15038191866

84 Example: Weekly Earnings The regression equation is weekearn = - 1055 + 65.7 years ed + 6.87 age - 229 Female + 18.2 uhour-cd 44839 cases used 10319 cases contain missing values Predictor Coef SE Coef T P Constant -1054.63 19.48 -54.15 0.000 age 6.8736 0.1932 35.57 0.000 Female -229.466 4.490 -51.10 0.000 uhour-cd 18.2205 0.2183 83.47 0.000 years ed 65.701 1.041 63.12 0.000 S = 459.1 R-Sq = 31.8% R-Sq(adj) = 31.8% Analysis of Variance Source DF SS MS F P Regression 4 4415565740 1103891435 5237.13 0.000 Residual Error 44834 9450180490 210782 Total 44838 13865746230

85 Example: Weekly Earnings In the last model, how does age affect weekly earnings? How does gender affect weekly earnings? How do average weekly hours of work affect weekly earnings? How does the estimated effect of education change as we add these “control variables”?

86 Interpreting the Coefficients In the last model, the coefficient on education indicates that for every additional year of education a manager earns an additional $65.09 per week, holding age, gender, and hours of work constant: E(Weekly Income|education, age, gender, hours of work) Alternatively, it is the difference in weekly earnings between two individuals who, except for a one-year difference in years of education, are the same age and gender and work the same weekly hours (otherwise equivalent managers).
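A sketch of these mechanics on simulated data (the CPS extract is not reproduced here; the "true" coefficients below are invented to mimic the reported model): fitting all the regressors at once is what delivers the "holding other factors constant" interpretation.

```python
import numpy as np

# Hypothetical data generated from weekearn = -400 + 75*ed + 6.3*age - 320*female + noise
rng = np.random.default_rng(1)
n = 5000
ed = rng.integers(10, 21, size=n).astype(float)
age = rng.integers(22, 65, size=n).astype(float)
female = rng.integers(0, 2, size=n).astype(float)
earn = -400 + 75 * ed + 6.3 * age - 320 * female + rng.normal(0, 500, size=n)

# Multiple regression via least squares on the design matrix [1, ed, age, female]
X = np.column_stack([np.ones(n), ed, age, female])
coef, *_ = np.linalg.lstsq(X, earn, rcond=None)
print(np.round(coef, 1))   # estimates land close to (-400, 75, 6.3, -320)
```

Each slope is the effect of its own variable with the others held fixed, which is what the comparative-static reading of the coefficients relies on.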

87 Interpreting the Coefficients The coefficient on gender indicates that women managers earn $229.79 less than male managers who are otherwise similar in education, age and weekly hours of work. Note the similarity to the comparative static exercises in labor economics in which we attempt to tease out the effect of one factor holding all other factors constant. What is the effect of raising the demand for labor, holding supply of labor constant? What is the effect on the wage of an improvement in working conditions, holding other compensation related factors constant (Theory of compensating differentials).

88 The Effect of Adding Variables The addition of factors to a model doesn’t always make a difference. Example: Model teacher pay as a function of expenditures per pupil. Does region make a difference?

89 The Effect of Adding Variables Regression Analysis: pay versus expenditure The regression equation is pay = 12129 + 3.31 expenditures Predictor Coef SE Coef T P Constant 12129 1197 10.13 0.000 expendit 3.3076 0.3117 10.61 0.000 S = 2325 R-Sq = 69.7% R-Sq(adj) = 69.1%

90 The Effect of Adding Variables Regression Analysis: pay versus expenditures, NE, S The regression equation is pay = 13269 + 3.29 expenditures - 1674 NE - 1144 S Predictor Coef SE Coef T P Constant 13269 1395 9.51 0.000 expendit 3.2888 0.3176 10.35 0.000 NE -1673.5 801.2 -2.09 0.042 S -1144.2 861.1 -1.33 0.190 S = 2270 R-Sq = 72.3% R-Sq(adj) = 70.5% Region matters, but its influence on the expenditure/pay relationship is de minimis.

91 Evaluating the Results We will consider a number of criteria in analyzing a regression equation. Before touching the data Is the equation supported by sound theory? Are all the obviously important variables included in the model? Should we be using OLS to estimate this Model (what is OLS)? Has the correct form been used to estimate the model?

92 Evaluating the Results The data itself: Is the data set a reasonable size and accurate? The results: How well does the estimated regression fit the data? Do the estimated coefficients correspond to the expectations developed by the researcher before the data was collected? Does the regression appear to be free of major econometric problems?

93 Evaluating the Results: R-Squared (Goodness of Fit) R 2 (also seen as r 2 ), the Coefficient of Determination: We would like a simple measure which tells us how well our equation fits our data. This is R 2 (the Coefficient of Determination). For example: in our teacher pay model, R 2 = 69.7%. For attendance/output, R 2 = 90.0%. For our weekly earnings models, R 2 varies from 10.6% to 31.9%.

94 R-Squared (Goodness of Fit) What is R 2 ? The percentage of the total movement of the dependent variable around its mean (variance *n) explained by the explanatory variable.

95 R-Squared (Goodness of Fit) Concept of R 2 : Our dependent variable, Y, moves around its mean. We are trying to explain that movement with our X’s. If we are doing well, then most of the movement of Y should be explained (predicted) by the X’s. That suggests that explained movement should be large and unexplained movement should be small.

96 R-Squared (Goodness of Fit)

97 Note: 0 ≤ R 2 ≤ 1 Suppose that we have a regression which explains nothing. Then the ESS = 0 and the measure is equal to zero. Now suppose we have a model which fits the data exactly. Every movement in y is correctly predicted. Then the ESS = TSS and our measure is equal to 1.

98 R-Squared (Goodness of Fit) In other words, as we approach R 2 =1, our ability to explain movement in the dependent variable increases. Most of our results will fall into the middle range between 0 and 1.
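The decomposition can be verified on the attendance/output example, using the coefficients computed earlier (b0 = 15.7328, b1 = 3.31897):

```python
# R-squared = 1 - RSS/TSS: share of Y's movement around its mean explained.
attend = [8, 3, 2, 6, 4]
output = [40, 28, 20, 39, 28]
b0, b1 = 15.7328, 3.31897          # coefficients from the earlier computation

ybar = sum(output) / len(output)
fitted = [b0 + b1 * x for x in attend]
tss = sum((y - ybar) ** 2 for y in output)                # total SS = 284
rss = sum((y - f) ** 2 for y, f in zip(output, fitted))   # residual SS ~ 28.44
r_sq = 1 - rss / tss
print(round(r_sq, 3))   # -> 0.9, matching the reported R-Sq = 90.0%
```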

99 R-Squared (Goodness of Fit)

100

101 Returning to Weekly Earnings of Managers Examples The regression equation is weekearn = - 485 + 87.5 years ed [equation with dependent variable] 47576 cases used 7582 cases contain missing values [number of observations and number with missing data - why is the latter important] Predictor Coef SE Coef T P Constant -484.57 18.18 -26.65 0.000 years ed 87.492 1.143 76.54 0.000 [detailed information on estimated coefficients, standard error, t against a null of zero, and a p against a null of 0] S = 530.5 R-Sq = 11.0% R-Sq(adj) = 11.0% [two goodness of fit measures]

102 Returning to Weekly Earnings of Managers Examples Regression Analysis: weekearn versus Education The regression equation is weekearn = - 442 + 85.2 Education 47576 cases used 7582 cases contain missing values Predictor Coef SE Coef T P Constant -442.42 17.99 -24.59 0.000 Educatio 85.228 1.136 75.01 0.000 S = 531.7 R-Sq = 10.6% R-Sq(adj) = 10.6% Analysis of Variance Source DF SS MS F P Regression 1 1590256151 1590256151 5625.76 0.000 Residual Error 47574 13447935715 282674 Total 47575 15038191866

103 Returning to Weekly Earnings of Managers Examples Regression Analysis: weekearn versus Education, age, female The regression equation is weekearn = - 382 + 75.0 Education + 6.53 age - 320 female 47576 cases used 7582 cases contain missing values Predictor Coef SE Coef T P Constant -382.38 18.78 -20.36 0.000 Educatio 74.967 1.079 69.45 0.000 age 6.5320 0.2020 32.34 0.000 female -319.952 4.628 -69.14 0.000 S = 500.9 R-Sq = 20.6% R-Sq(adj) = 20.6% Analysis of Variance Source DF SS MS F P Regression 3 3103974768 1034658256 4124.34 0.000 Residual Error 47572 11934217098 250866 Total 47575 15038191866

104 Returning to Weekly Earnings of Managers Examples Regression Analysis: weekearn versus Education, age, female, hours The regression equation is weekearn = - 1053 + 65.1 Education + 7.07 age - 230 female + 18.3 hours 44839 cases used 10319 cases contain missing values Predictor Coef SE Coef T P Constant -1053.01 19.43 -54.20 0.000 Educatio 65.089 1.029 63.27 0.000 age 7.0741 0.1929 36.68 0.000 female -229.786 4.489 -51.19 0.000 hours 18.3369 0.2180 84.11 0.000 S = 459.0 R-Sq = 31.9% R-Sq(adj) = 31.9%

105 Returning to Weekly Earnings of Managers Examples So the fit of the final model, with a control for hours of work, is considerably better than the fit for a model which added gender and age and much better than the fit of a model with just education as an explanatory variable.

106 Adjusted R-Squared (“R-bar Squared”) First limitation of R 2 : 1. As we add variables, the magnitude of ESS never falls and typically increases. If we just use R 2 as a criterion for adding variables to a model, we will keep adding variables ad infinitum: R 2 never falls and usually increases as one adds variables. 2. Instead, use the measure R-bar squared. This measure is calculated as:

107 Adjusted R-Squared (“R-bar Squared”)

108 As k, the number of regressors, becomes large, R-bar-squared becomes smaller, all else constant. It imposes a penalty on adding variables which have very little to do with the dependent variable. If you add irrelevant variables, R 2 may remain the same or increase, but R-bar-squared may well fall.
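The usual formula, R-bar-squared = 1 - (RSS/(n - k - 1)) / (TSS/(n - 1)), can be checked against the teacher-pay ANOVA table (n = 51 states, k = 1 regressor):

```python
# RSS and TSS taken from the teacher-pay Analysis of Variance output.
rss, tss = 264825250, 873380265
n, k = 51, 1

r_sq = 1 - rss / tss
r_bar_sq = 1 - (rss / (n - k - 1)) / (tss / (n - 1))
print(round(r_sq, 3), round(r_bar_sq, 3))   # -> 0.697 0.691, as reported
```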

109 Adjusted R-Squared (“R-bar Squared”) Regression Analysis: pay versus expenditure The regression equation is pay = 12129 + 3.31 expenditures Predictor Coef SE Coef T P Constant 12129 1197 10.13 0.000 expendit 3.3076 0.3117 10.61 0.000 S = 2325 R-Sq = 69.7% R-Sq(adj) = 69.1%

110 Adjusted R-Squared (“R-bar Squared”) Regression Analysis: pay versus expenditures, NE, S The regression equation is pay = 13269 + 3.29 expenditures - 1674 NE - 1144 S Predictor Coef SE Coef T P Constant 13269 1395 9.51 0.000 expendit 3.2888 0.3176 10.35 0.000 NE -1673.5 801.2 -2.09 0.042 S -1144.2 861.1 -1.33 0.190 S = 2270 R-Sq = 72.3% R-Sq(adj) = 70.5%

111 Adjusted R-Squared (“R-bar Squared”) Note that the increase in R-bar-squared is more modest than the increase in R 2. This is because the explanatory power of region is modest and the effect of that power in reducing the RSS is being counter-balanced by the increase in the number of parameters.

112 Adjusted R-Squared (“R-bar Squared”) We need to be careful in using R-bar-squared to compare regressions. It can be good for comparing specifications, such as with the variables in our specification for managers. It confirms our view that weekly pay is influenced by education but also by age, gender and hours (note that both R 2 and R-bar-squared increase). It is not good for comparing different equations with different data sets.

113 Example: Teachers’ Pay Our model using state average earnings and expenditures has a R-sq of 72.3% Regression Analysis: pay versus expenditures, NE, S The regression equation is pay = 13269 + 3.29 expenditures - 1674 NE - 1144 S Predictor Coef SE Coef T P Constant 13269 1395 9.51 0.000 expendit 3.2888 0.3176 10.35 0.000 NE -1673.5 801.2 -2.09 0.042 S -1144.2 861.1 -1.33 0.190 S = 2270 R-Sq = 72.3% R-Sq(adj) = 70.5%

114 Example: Teachers’ Pay Now consider a micro-data model: Use our CPS data set for 2000 and merge the expenditure data into data on individual teachers. Using STATE DATA: . reg teacherpay expenditures Source | SS df MS Number of obs = 51 -------------+------------------------------ F( 1, 49) = 112.60 Model | 608555015 1 608555015 Prob > F = 0.0000 Residual | 264825250 49 5404596.94 R-squared = 0.6968 -------------+------------------------------ Adj R-squared = 0.6906 Total | 873380265 50 17467605.3 Root MSE = 2324.8 ------------------------------------------------------------------------------ teacherpay | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- expenditures | 3.307585 .3117043 10.61 0.000 2.681192 3.933978 _cons | 12129.37 1197.351 10.13 0.000 9723.205 14535.54 ------------------------------------------------------------------------------

115 Example: Teachers’ Pay Now Shift to Year 2000 micro-data and append state expenditures on education: . summ weekearn age female uhour1 expenditure if pocc1 >= 151 & pocc1 <= 159 Variable | Obs Mean Std. Dev. Min Max -------------+----------------------------------------------------- weekearn | 7579 702.5348 429.1002 .02 2884.61 age | 7903 41.63254 11.9243 15 90 female | 7903 .732127 .4428791 0 1 uhour1 | 7903 35.51411 15.84321 -4 99 expenditures | 7903 3786.745 990.0271 2297 8349

116 Example: Teachers’ Pay Now Estimate a Regression Equation Similar to the State Data Equation Note the number of observations: . reg weekearn expenditure NE Midwest South if pocc1 >= 151 & pocc1 <= 159 Source | SS df MS Number of obs = 7579 -------------+------------------------------ F( 4, 7574) = 29.17 Model | 21170657.3 4 5292664.32 Prob > F = 0.0000 Residual | 1.3741e+09 7574 181429.033 R-squared = 0.0152 -------------+------------------------------ Adj R-squared = 0.0147 Total | 1.3953e+09 7578 184126.967 Root MSE = 425.94 ------------------------------------------------------------------------------ weekearn | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- expenditures | .0390489 .0060053 6.50 0.000 .0272769 .0508209 NE | 58.93258 15.97818 3.69 0.000 27.61091 90.25425 Midwest | 32.89631 14.26238 2.31 0.021 4.938092 60.85454 South | 2.824219 13.6974 0.21 0.837 -24.02649 29.67493 _cons | 533.6598 24.94855 21.39 0.000 484.7538 582.5659 ------------------------------------------------------------------------------ For every $1 in expenditures we get 3.9¢ in teacher pay per week or, on a 52-week basis, about $2.03!

117 Example: Teachers’ Pay Build a more suitable model and R-sq increases: . reg weekearn expenditure female black NE Midwest South age coned if pocc1 >= 151 & pocc1 <= 159 Source | SS df MS Number of obs = 7479 -------------+------------------------------ F( 8, 7579) = 16.17 Model | 19477648.0 8 2434706.00 Prob > F = 0.0000 Residual | 61297869.1 407 150609.015 R-squared = 0.2411 -------------+------------------------------ Adj R-squared = 0.2262 Total | 80775517.1 415 194639.80 Root MSE = 388.08 ------------------------------------------------------------------------------ weekearn | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- expenditures | .0636406 .0726116 0.88 0.381 -.0791 .2063811 female | -88.36832 43.00688 -2.05 0.041 -172.9117 -3.824972 black | 72.28883 48.35753 1.49 0.136 -22.77287 167.3505 NE | -84.77944 116.1567 -0.73 0.466 -313.1213 143.5625 Midwest | -38.8376 54.29961 -0.72 0.475 -145.5803 67.9051 South | -.8350449 48.33866 -0.02 0.986 -95.85964 94.18955 age | 8.797438 1.653155 5.32 0.000 5.547649 12.04723 coned | 81.40359 11.48141 7.09 0.000 58.83332 103.9739 _cons | -1089.701 300.7268 -3.62 0.000 -1680.873 -498.5296 ------------------------------------------------------------------------------

118 Example: Teachers’ Pay Why the difference in R-sq? Different levels of aggregation of the data lead to different total variance: micro-data has much more variance than state-average data (why might this be?). Time-series data often has an R-sq of .98 or .99. As a result, we cannot use R-sq to compare results across different data sets or types of regressions. It can be useful for comparing specifications within a particular model.

119 Correlation & R-Squared R 2 and ρ: What is the relationship? ρ is the population value of the correlation; in the sample, the symbol for correlation is r. If r is the correlation between X and Y, then R 2, the goodness-of-fit measure of a regression equation, is r 2. Note that this ONLY holds for bi-variate relationships. An example for the relationship between education expenditures and teacher pay:

120 Correlation & R-Squared: Example Results for: Teacher Expenditure.MTW Correlations: pay, expenditures Pearson correlation of pay and expenditures = 0.835 P-Value = 0.000 Regression Analysis: pay versus expenditures The regression equation is pay = 12129 + 3.31 expenditures Predictor Coef SE Coef T P Constant 12129 1197 10.13 0.000 expendit 3.3076 0.3117 10.61 0.000 S = 2325 R-Sq = 69.7% R-Sq(adj) = 69.1% r 2 =.835 2 =.697225 = R 2
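The identity r² = R² for a bivariate regression can be checked both with the reported teacher-pay correlation (r = 0.835) and from scratch with the attendance/output data:

```python
import math

# Teacher pay: r = 0.835, so r^2 should equal the reported R-Sq of 69.7%.
print(round(0.835 ** 2, 3))   # -> 0.697

# Attendance example: Pearson r computed directly, then squared.
attend = [8, 3, 2, 6, 4]
output = [40, 28, 20, 39, 28]
xbar, ybar = sum(attend) / 5, sum(output) / 5
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(attend, output))
sxx = sum((x - xbar) ** 2 for x in attend)
syy = sum((y - ybar) ** 2 for y in output)
r = sxy / math.sqrt(sxx * syy)
print(round(r ** 2, 3))       # -> 0.9, the attendance R-squared
```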

