1 The Power of Regression
Previous research literature claim: foreign-owned manufacturing plants have greater levels of strike activity than domestic plants (in Canada, strike rates of 25.5% versus 20.3%).
Budd's claim: foreign-owned plants are also larger and located in strike-prone industries. Need multivariate regression analysis!
2 The Power of Regression
Dependent Variable: Strike Incidence
                                 (1)       (2)       (3)
U.S. Corporate Parent          0.230**   0.201*      …
  (Canadian Parent omitted)    (0.117)   (0.119)   (0.132)
Number of Employees (1000s)                …**     0.094**
                                         (0.019)   (0.020)
Industry Effects?                No                  Yes
Sample Size                    2,170
* Statistically significant at the 0.10 level; ** at the 0.05 level (two-tailed tests).
3 Important Regression Topics
Prediction: various confidence and prediction intervals
Diagnostics: are the assumptions for estimation & testing fulfilled?
Specifications: quadratic terms? logarithmic dependent variables?
Additional hypothesis tests: partial F tests
Dummy dependent variables: probit and logit models
4 Confidence Intervals
The true population [whatever] is within the following interval (1−α)% of the time:
Estimate ± t_{α/2} × Standard Error of the Estimate
Just need:
 Estimate
 Standard Error
 Shape / distribution (including degrees of freedom)
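As a quick numerical sketch of the recipe (the estimate, standard error, and degrees of freedom below are made up for illustration):

from scipy import stats

# Hypothetical values for illustration only
estimate, std_err, df = 0.201, 0.119, 40
alpha = 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df)   # t_{alpha/2} with n-k-1 d.f.
print(estimate - t_crit * std_err, estimate + t_crit * std_err)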
5 Prediction Interval for New Observation at x_p
1. Point estimate: ŷ_p
2. Standard error: S_new = S · √(1 + 1/n + (x_p − x̄)² / Σ(x_i − x̄)²)
3. Shape: t distribution with n−k−1 d.f.
4. So the prediction interval for a new observation is ŷ_p ± t_{α/2} · S_new (Siegel, p. …)
6 Prediction Interval for Mean Observations at x_p
1. Point estimate: ŷ_p
2. Standard error: S_mean = S · √(1/n + (x_p − x̄)² / Σ(x_i − x̄)²)
3. Shape: t distribution with n−k−1 d.f.
4. So the prediction interval for the mean at x_p is ŷ_p ± t_{α/2} · S_mean (Siegel, p. 483)
7 Earlier Example: Hours of Study (x) and Exam Score (y)
Regression Statistics: Multiple R 0.770; R Squared 0.594; Adj. R Squared 0.543; Standard Error …; Obs. 10
ANOVA: df  SS  MS  F  Significance (Regression / Residual / Total)
Coeff.  Std. Error  t stat  p value  Lower 95%  Upper 95%: Intercept, hours
1. Find 95% CI for Joe's exam score (studies for 20 hours)
2. Find 95% CI for mean score for those who studied for 20 hours
x̄ = 18.80
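A sketch of how to compute both intervals with statsmodels; the hours/score data here are hypothetical stand-ins (chosen so that x̄ = 18.80 as on the slide), not the actual ten observations:

import numpy as np
import statsmodels.api as sm

# Hypothetical stand-in data: 10 students' study hours and exam scores
hours = np.array([5, 10, 12, 15, 18, 20, 22, 25, 28, 33])   # mean = 18.80
score = np.array([55, 60, 58, 70, 72, 75, 78, 80, 85, 90])

model = sm.OLS(score, sm.add_constant(hours)).fit()

# Intervals at x_p = 20 hours of study (constant term first)
pred = model.get_prediction([[1, 20]]).summary_frame(alpha=0.05)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper"]])  # CI for the mean score
print(pred[["mean", "obs_ci_lower", "obs_ci_upper"]])    # prediction interval for Joe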
8 Diagnostics / Misspecification
For estimation & testing to be valid…
y = b0 + b1x1 + b2x2 + … + bkxk + e makes sense
Errors (e_i) are independent
 of each other
 of the independent variables
Homoskedasticity: error variance is independent of the independent variables
 σ_e² is a constant; Var(e_i) does not vary with x_i (i.e., not heteroskedasticity)
Violations render our inferences invalid and misleading!
9 Common Problems
Misspecification: omitted variable bias; nonlinear rather than linear relationship; levels, logs, or percent changes?
Data problems: skewed variables and outliers; multicollinearity; sample selection (non-random data); missing data
Problems with residuals (error terms): non-independent errors; heteroskedasticity
10 Omitted Variable Bias (Question 3 from Sample Exam B)
wage = … + …·union
      (1.65)  (0.66)
wage = … + …·union + …·ability
      (1.49)  (0.56)   (1.56)
wage = … + …·union + …·revenue
      (0.70)  (0.45)   (0.08)
H. Farber thinks the average union wage is different from the average nonunion wage because unionized employers are more selective and hire individuals with higher ability. M. Friedman thinks the average union wage is different from the average nonunion wage because unionized employers have different levels of revenue per employee.
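A minimal simulation sketch of the mechanism Farber describes (all numbers hypothetical): when ability raises wages and is correlated with union status, omitting it biases the union coefficient.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process: ability raises wages and is
# positively correlated with union membership
ability = rng.normal(size=n)
union = (ability + rng.normal(size=n) > 0).astype(float)
wage = 10 + 1.0 * union + 2.0 * ability + rng.normal(size=n)

short = sm.OLS(wage, sm.add_constant(union)).fit()
full = sm.OLS(wage, sm.add_constant(np.column_stack([union, ability]))).fit()

print(short.params[1])  # biased well above the true effect of 1.0
print(full.params[1])   # close to 1.0 once ability is controlled for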
11 Checking the Assumptions
How to check the validity of the assumptions?
Cynicism, realism, and theory
Robustness checks: check different specifications, but don't just choose the best one!
Automated variable selection methods, e.g., stepwise regression (Siegel, p. 547)
Misspecification and other tests
Examine diagnostic plots
12 Diagnostic Plots Increasing spread might indicate heteroskedasticity. Try transformations or weighted least squares.
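A sketch of the residuals-versus-fitted plot behind these slides, using simulated data built to show the "fan" shape (all values hypothetical):

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Hypothetical data whose error variance grows with x
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)
y = 2 + 3 * x + rng.normal(scale=x, size=200)   # heteroskedastic errors

model = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(model.fittedvalues, model.resid, alpha=0.5)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()   # increasing spread from left to right suggests heteroskedasticity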
13 Diagnostic Plots
"Tilt" from outliers might indicate skewness. Try a log transformation.
14 Problematic Outliers
Stock Performance and CEO Golf Handicaps (New York Times, …)
Without 7 "outliers": Number of obs = 44, R-squared = …
 stockrating | Coef.  Std. Err.  t  P>|t|
 handicap    |
 _cons       |
With the 7 "outliers": Number of obs = 51, R-squared = …
 stockrating | Coef.  Std. Err.  t  P>|t|
 handicap    |
 _cons       |
15 Are They Really Outliers??
Stock Performance and CEO Golf Handicaps (New York Times, …)
The diagnostic plot is OK. BE CAREFUL!
16 Diagnostic Plots
Curvature might indicate nonlinearity. Try a quadratic specification.
17 Diagnostic Plots Good diagnostic plot. Lacks obvious indications of other problems.
18 Adding Squared (Quadratic) Term
Job Performance regression on Salary (in $1,000s) (Egg Data)
Number of obs = ; F(2, 573) = ; Prob > F = ; R-squared = ; Adj R-squared = ; Root MSE =
Source | SS  df  MS (Model / Residual / Total)
job performance | Coef.  Std. Err.  t  P>|t|
 salary         |
 salary squared |
 _cons          |
Salary Squared = Salary² [= salary^2 in Excel]
19 Quadratic Regression
Quadratic regression (nonlinear):
Job perf = … + …·salary − …·salary squared
20 Quadratic Regression
Job perf = … + …·salary − …·salary squared
The effect of salary will eventually turn negative. But where?
Max at: salary* = −(linear coeff.) / (2 × quadratic coeff.)
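A sketch with made-up salary/performance data (the Egg Data values are not reproduced here), fitting the squared term and locating the turning point:

import numpy as np
import statsmodels.api as sm

# Hypothetical hump-shaped relationship: true maximum at salary = 80
rng = np.random.default_rng(2)
salary = rng.uniform(20, 100, 576)                  # in $1,000s
perf = 1 + 0.16 * salary - 0.001 * salary**2 + rng.normal(scale=0.5, size=576)

X = sm.add_constant(np.column_stack([salary, salary**2]))
fit = sm.OLS(perf, X).fit()

b1, b2 = fit.params[1], fit.params[2]
print(-b1 / (2 * b2))   # Max = -(linear coeff.) / (2 * quadratic coeff.), ~80 here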
21 Another Specification Possibility
If data are very skewed, can try a log specification.
Can use logs instead of levels for independent and/or dependent variables.
Note that the interpretation of the coefficients will change.
Re-familiarize yourself with Siegel, pp. …
22 Quick Note on Logs
a is the natural logarithm of x if e^a = x
The natural logarithm is abbreviated "ln": ln(x) = a
In Excel, use the ln function. We call this the "log" but don't use the "log" function!
Usefulness: spreads out small values and narrows large values, which can reduce skewness.
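A quick sketch of that usefulness claim, using a hypothetical lognormal sample standing in for the CPS earnings on the next slides:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
earnings = rng.lognormal(mean=6.5, sigma=0.8, size=15000)  # right-skewed, like earnings

print(stats.skew(earnings))           # strongly right-skewed
print(stats.skew(np.log(earnings)))   # roughly symmetric after taking logs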
23 Earnings Distribution Weekly Earnings from the March 2002 CPS, n=15,000 Skewed to the right
24 Residuals from Levels Regression Residuals from a regression of Weekly Earnings on demographic characteristics Skewed to the right— use of t distribution is suspect
25 Log Earnings Distribution Natural Logarithm of Weekly Earnings from the March 2002 CPS, i.e., =ln(weekly earnings) Not perfectly symmetrical, but better
26 Residuals from Log Regression Residuals from a regression of Log Weekly Earnings on demographic characteristics Almost symmetrical —use of t distribution is probably OK
27 Hypothesis Tests
We've been doing hypothesis tests for single coefficients:
H0: β_i = 0    reject if |t| > t_{α/2, n−k−1}
HA: β_i ≠ 0
What about testing more than one coefficient at the same time? e.g., want to see if an entire group of 10 dummy variables for 10 industries should be in the model.
Joint tests can be conducted using partial F tests.
28 Partial F Tests
H0: β1 = β2 = β3 = … = βC = 0
HA: at least one βi ≠ 0
How to test this? Consider two regressions:
One as if H0 is true, i.e., β1 = β2 = β3 = … = βC = 0. This is a "restricted" (or constrained) model.
Plus a "full" (or unconstrained) model in which the computer can estimate what it wants for each coefficient.
29 Partial F Tests
Statistically, need to distinguish between:
 full regression "no better" than the restricted regression, versus
 full regression "significantly better" than the restricted regression
To do this, look at the variance of the prediction errors. If this declines significantly, then reject H0.
From ANOVA, we know the ratio of two variances has an F distribution, so use an F test.
30 Partial F Tests
F = [ (SS_residual(restricted) − SS_residual(full)) / C ] / [ SS_residual(full) / (n−k−1) ]
SS_residual = Sum of Squares Residual; C = # constraints
The partial F statistic has C, n−k−1 degrees of freedom.
Reject H0 if F > F_{α, C, n−k−1}
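In practice the restricted and full regressions can be compared directly; a sketch with statsmodels on hypothetical data (the coal-mining variables are not reproduced):

import numpy as np
import statsmodels.api as sm

# Hypothetical data: five regressors, of which the last two may not matter
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = 1 + X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=200)

full = sm.OLS(y, sm.add_constant(X)).fit()
restricted = sm.OLS(y, sm.add_constant(X[:, :3])).fit()   # H0: beta4 = beta5 = 0

f_stat, p_value, df_diff = full.compare_f_test(restricted)
print(f_stat, p_value, df_diff)   # reject H0 if p_value < alpha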
31 Coal Mining Example (Again)
Regression Statistics: R Squared 0.955; Adj. R Squared 0.949; Standard Error …; Obs. 47
ANOVA: df  SS  MS  F  Significance (Regression / Residual / Total)
Coeff.  Std. Error  t stat  p value  Lower 95%  Upper 95%:
 Intercept | hours | tons | unemp | WWII | Act1952 | Act1969
32 Minitab Output
Predictor: Constant, hours, tons, unemp, WWII, Act1952, Act1969 (Coef, StDev, T, P)
S = …   R-Sq = 95.5%   R-Sq(adj) = 94.9%
Analysis of Variance: Source (Regression / Error / Total), DF, SS, MS, F, P
33 Is the Overall Model Significant?
H0: β1 = β2 = β3 = … = β6 = 0
HA: at least one βi ≠ 0
Note: for testing the overall model, C = k, i.e., testing all coefficients together.
From the previous slides, we have SS_residual for the "full" (or unconstrained) model: SS_residual = 467,…
But what about for the restricted (H0 true) regression? Estimate a constant-only regression.
34 Constant-Only Model
Regression Statistics: R Squared 0; Adj. R Squared 0; Standard Error …; Obs. 47
ANOVA: Regression df = 0, SS = 0, MS = 0; Residual; Total
Coeff.  Std. Error  t stat  p value  Lower 95%  Upper 95%: Intercept
35 Partial F Tests
H0: β1 = β2 = β3 = … = β6 = 0
HA: at least one βi ≠ 0
F = [(SS_restricted − SS_full)/6] / [SS_full/40] = …
Reject H0 if F > F_{α, C, n−k−1} = F_{0.05, 6, 40} = 2.34
… > 2.34, so reject H0. Yes, the overall model is significant.
36 Select F Distribution 5% Critical Values
(Table of 5% critical values by numerator and denominator degrees of freedom.)
37 A Small Shortcut
Regression Statistics: R Squared 0.955; Adj. R Squared 0.949; Standard Error …; Obs. 47
ANOVA: df  SS  MS  F  Significance (Regression / Residual / Total)
For the constant-only model, SS_residual = SS Total = 10,442,…
So to test the overall model, you don't need to run a constant-only model: the full regression's SS Total already equals the restricted SS residual.
38 An Even Better Shortcut
In fact, the F test reported in the regression's ANOVA table is exactly the test for the overall model being significant (recall Unit 8), so no extra computation is needed at all.
39 Testing Any Subset
The partial F test can be used to test any subset of variables. For example:
H0: β_WWII = β_Act1952 = β_Act1969 = 0
HA: at least one βi ≠ 0
40 Restricted Model
Restricted regression with β_WWII = β_Act1952 = β_Act1969 = 0:
Regression Statistics: R Squared …; Adj. R Squared …; Standard Error …; Obs. 47
ANOVA: df  SS  MS  F  Significance (Regression / Residual / Total)
Coeff.  Std. Error  t stat  p value: Intercept | hours | tons | unemp
41 Partial F Tests
H0: β_WWII = β_Act1952 = β_Act1969 = 0
HA: at least one βi ≠ 0
F = [(SS_restricted − SS_full)/3] / [SS_full/40] = 3.950
Reject H0 if F > F_{α, C, n−k−1} = F_{0.05, 3, 40} = 2.84
3.950 > 2.84, so reject H0. Yes, the subset of three coefficients is jointly significant.
42 Regression and Two-Way ANOVA
Treatments: A, B, C; Blocks: 1-5
"Stack" the data using dummy variables: each observation becomes one row with treatment dummies (A, B, C), block dummies (B2, B3, B4, B5), and the Value.
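A sketch of the stacking step with pandas (the block/treatment values are hypothetical):

import pandas as pd

# Hypothetical two-way layout: 5 blocks x 3 treatments
wide = pd.DataFrame(
    {"A": [10, 12, 11, 13, 12], "B": [14, 15, 13, 16, 15], "C": [9, 10, 8, 11, 10]},
    index=pd.Index([1, 2, 3, 4, 5], name="block"),
)

# One row per observation, then dummy columns (A and block 1 as omitted baselines)
long = wide.stack().rename("value").reset_index()
long.columns = ["block", "treatment", "value"]
stacked = pd.get_dummies(long, columns=["treatment", "block"], drop_first=True, dtype=float)
print(stacked)   # value, treatment_B, treatment_C, block_2 ... block_5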
43 Recall Two-Way Results
ANOVA: Two-Factor Without Replication
Source of Variation: SS  df  MS  F  P-value  F crit
 Blocks | Treatment | Error | Total
44 Regression and Two-Way ANOVA
Number of obs = ; F(6, 8) = ; Prob > F = ; R-squared = ; Adj R-squared = ; Root MSE =
Source | SS  df  MS (Model / Residual / Total)
treatment | Coef.  Std. Err.  t  P>|t|  [95% Conf. Int]
 b | c | b2 | b3 | b4 | b5 | _cons |
45 Regression and Two-Way ANOVA
Regression excerpt for the full model: Source | SS df MS (Model / Residual / Total)
Regression excerpt for b2 = b3 = … = 0: Source | SS df MS (Model / Residual / Total)
Regression excerpt for b = c = 0: Source | SS df MS (Model / Residual / Total)
Use these SS residual values to do partial F tests and you will get exactly the same answers as the Two-Way ANOVA tests.
46 Select F Distribution 5% Critical Values
(Table of 5% critical values by numerator and denominator degrees of freedom.)
47 3 Seconds of Calculus
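Presumably the three seconds are the log-derivative identity that drives the coefficient interpretations on the next slide:

\[
\frac{d \ln y}{dx} = \frac{1}{y}\frac{dy}{dx}
\quad\Longrightarrow\quad
\Delta \ln y \approx \frac{\Delta y}{y} = \text{the proportional change in } y
\]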
48 Regression Coefficients
y = b0 + b1x (linear form): a 1 unit change in x changes y by b1
log(y) = b0 + b1x (semi-log form): a 1 unit change in x changes y by b1 (×100) percent
log(y) = b0 + b1·log(x) (double-log form): a 1 percent change in x changes y by b1 percent
49 Log Regression Coefficients
wage = … + 1.39·union
 Predicted wage is $1.39 higher for unionized workers (on average)
log(wage) = … + 0.15·union (semi-elasticity)
 Predicted wage is approximately 15% higher for unionized workers (on average)
log(wage) = … + 0.3·log(profits) (elasticity)
 A one percent increase in profits increases predicted wages by approximately 0.3 percent
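The 15% reading is an approximation; for a dummy variable in a semi-log model the exact percent effect is

\[
100 \times \bigl(e^{b_1} - 1\bigr) = 100 \times \bigl(e^{0.15} - 1\bigr) \approx 16.2\%,
\]

which is close to b1 × 100 only when b1 is small.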
50 Multicollinearity
Auto repair records, weight, and engine size
Number of obs = 69; F(2, 66) = 6.84; Prob > F = …; R-squared = …; Adj R-squared = …; Root MSE = …
repair | Coef.  Std. Err.  t  P>|t|
 weight |
 engine |
 _cons  |
51 Multicollinearity
Two (or more) independent variables are so highly correlated that a multiple regression can't disentangle the unique contributions of each.
Large standard errors and lack of statistical significance for individual coefficients, but joint significance.
Identifying multicollinearity: some say "rule of thumb |r| > 0.70" (or 0.80), but better to look at the results.
OK for prediction; bad for assessing theory.
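Beyond pairwise correlations, a common complementary diagnostic (not on the slide) is the variance inflation factor; a sketch with hypothetical weight/engine data:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical data: engine size tracks vehicle weight almost perfectly
rng = np.random.default_rng(5)
weight = rng.normal(3000, 500, 69)
engine = 0.06 * weight + rng.normal(0, 10, 69)

X = sm.add_constant(pd.DataFrame({"weight": weight, "engine": engine}))
for i in range(1, X.shape[1]):   # skip the constant
    print(X.columns[i], variance_inflation_factor(X.values, i))  # VIF >> 10 flags trouble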
52 Prediction With Multicollinearity
Prediction at the mean (weight = 3019 and engine = 197):
Model for prediction    Predicted Repair   Lower 95% Limit (Mean)   Upper 95% Limit (Mean)
Multiple Regression
Weight Only
Engine Only
53 Dummy Dependent Variables
y = b0 + b1x1 + … + bkxk + e, where y is a {0,1} indicator variable
Examples:
 Do you intend to quit? yes/no
 Did the worker receive training? yes/no
 Do you think the President is doing a good job? yes/no
 Was there a strike? yes/no
 Did the company go bankrupt? yes/no
54 Linear Probability Model
Mathematically / computationally, can estimate a regression as usual (the monkeys won't know the difference).
This is called a "linear probability model": the right-hand side is linear, and it is estimating probabilities:
P(y=1) = b0 + b1x1 + … + bkxk
b1 = 0.15 (for example) means that a one unit change in x1 increases the probability that y = 1 by 0.15 (fifteen percentage points).
55 Linear Probability Model
Excel won't know the difference, but perhaps it should. Linear probability model problems:
σ_e² = P(y=1)·[1 − P(y=1)], but P(y=1) = b0 + b1x1 + … + bkxk, so σ_e² varies with the x's (heteroskedasticity)
Predicted probabilities are not bounded by 0,1
R² is not an accurate measure of predictive ability; can use a pseudo-R² measure such as percent correctly predicted
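A sketch of both problems on hypothetical data; the robust standard errors are a standard patch for the built-in heteroskedasticity, not something the slide prescribes:

import numpy as np
import statsmodels.api as sm

# Hypothetical binary outcome with one strong predictor
rng = np.random.default_rng(6)
x = rng.normal(size=500)
y = (x + rng.normal(size=500) > 0).astype(float)

lpm = sm.OLS(y, sm.add_constant(x)).fit(cov_type="HC1")  # heteroskedasticity-robust SEs
p_hat = lpm.fittedvalues
print(((p_hat < 0) | (p_hat > 1)).sum(), "fitted 'probabilities' fall outside [0, 1]")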
56 Logit Model & Probit Model
Solution to these problems is to use nonlinear functional forms that bound P(y=1) between 0 and 1.
Logit model (logistic regression): P(y=1) = e^(b0 + b1x1 + … + bkxk) / (1 + e^(b0 + b1x1 + … + bkxk))
Probit model: P(y=1) = Φ(b0 + b1x1 + … + bkxk), where Φ is the normal cumulative distribution function
Recall, ln(x) = a when e^a = x
57 Logit Model & Probit Model
Nonlinear, so you need a statistical package to do the calculations.
Can do individual tests (z-tests, not t-tests) and joint statistical testing as with other regressions; also confidence intervals.
Need to convert coefficients to marginal effects for interpretation.
Should be aware of these models, though in many cases a linear probability model works just fine.
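A sketch of fitting both models and recovering marginal effects with statsmodels; the data are hypothetical stand-ins for the FMLA example on the next slide:

import numpy as np
import statsmodels.api as sm

# Hypothetical data: awareness depends on union status and age
rng = np.random.default_rng(7)
n = 1189
union = rng.integers(0, 2, n).astype(float)
age = rng.uniform(18, 65, n)
know = (-2 + 0.8 * union + 0.03 * age + rng.logistic(size=n) > 0).astype(float)

X = sm.add_constant(np.column_stack([union, age]))
logit = sm.Logit(know, X).fit()
probit = sm.Probit(know, X).fit()

print(logit.summary())                          # z statistics, not t statistics
print(logit.get_margeff(at="mean").summary())   # marginal effects at the sample means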
58 Example
Dep. var.: 1 if you know of the FMLA, 0 otherwise
Probit estimates: Number of obs = 1189; LR chi2(14) = …; Prob > chi2 = …; Log likelihood = …; Pseudo R2 = …
FMLAknow | Coef.  Std. Err.  z  P>|z|  [95% Conf. Int]
 union | age | agesq | nonwhite | income | incomesq | [other controls omitted] | _cons |
59 Marginal Effects
For numerical interpretation / prediction, need to convert coefficients to marginal effects.
Example, logit model: ln[P(y=1) / (1 − P(y=1))] = b0 + b1x1 + … + bkxk
So b1 gives the effect on the log odds, ln[P/(1−P)], not on P(y=1). Probit is similar.
Can re-arrange to find the effect on P(y=1). Usually do this at the sample means.
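For the logit, the re-arrangement gives the standard result (not spelled out on the slide):

\[
\frac{\partial P(y=1)}{\partial x_1} = b_1 \, P(y=1)\,\bigl(1 - P(y=1)\bigr),
\]

typically evaluated at the sample means of the x's.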
60 Marginal Effects
For numerical interpretation / prediction, need to convert coefficients to marginal effects.
Probit estimates: Number of obs = 1189; LR chi2(14) = …; Prob > chi2 = …; Log likelihood = …; Pseudo R2 = …
FMLAknow | dF/dx  Std. Err.  z  P>|z|  [95% Conf. Int]
 union | age | agesq | nonwhite | income | incomesq | [other controls omitted]
61 But Linear Probability Model is OK, Too
                 Probit Coeff.    Probit Marginal   Regression
Union            0.238 (0.101)    (0.040)           (0.035)
Nonwhite         (0.098)          (0.037)           (0.033)
Income           (0.393)          (0.157)           (0.091)
Income Squared   (2.853)          (1.138)           (0.316)
So regression is usually OK, but you should still be familiar with logit and probit methods.