Slide 1: The Power of Regression
- Previous research literature claim: foreign-owned manufacturing plants have greater levels of strike activity than domestic plants (in Canada, strike rates of 25.5% versus 20.3%).
- Budd's claim: foreign-owned plants are larger and located in strike-prone industries.
- Need multivariate regression analysis!
Slide 2: The Power of Regression
Dependent variable: strike incidence

                                   (1)       (2)       (3)
  U.S. corporate parent          0.230**   0.201*    0.065
  (Canadian parent omitted)      (0.117)   (0.119)   (0.132)
  Number of employees (1000s)      ---     0.177**   0.094**
                                           (0.019)   (0.020)
  Industry effects?                No        No        Yes
  Sample size                    2,170     2,170     2,170

* Statistically significant at the 0.10 level; ** at the 0.05 level (two-tailed tests).
Slide 3: Important Regression Topics
- Prediction: various confidence and prediction intervals
- Diagnostics: are the assumptions for estimation and testing fulfilled?
- Specifications: quadratic terms? Logarithmic dependent variables?
- Additional hypothesis tests: partial F tests
- Dummy dependent variables: probit and logit models
Slide 4: Confidence Intervals
The true population [whatever] is within the following interval 100(1 - α)% of the time:

  Estimate ± t_{α/2} × (Standard Error of the Estimate)

So we just need:
- the estimate
- its standard error
- the shape / distribution (including degrees of freedom)
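A minimal sketch of this recipe in Python (scipy is an assumption, not course software; the numbers are the hours coefficient, standard error, and degrees of freedom from the study-hours regression on slide 7):

```python
# Generic confidence interval: Estimate +/- t_{alpha/2} * Standard Error.
# Numbers are the 'hours' row from slide 7: estimate 2.122, SE 0.621, 8 d.f.
from scipy import stats

estimate, std_error, df = 2.122, 0.621, 8
alpha = 0.05                                   # 95% interval

t_crit = stats.t.ppf(1 - alpha / 2, df)        # t_{alpha/2} critical value
print(estimate - t_crit * std_error,           # lower bound, ~0.691
      estimate + t_crit * std_error)           # upper bound, ~3.554
```

The bounds reproduce the Lower 95% / Upper 95% columns for "hours" on slide 7.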
Slide 5: Prediction Interval for a New Observation at x_p
1. Point estimate: $\hat{y}_p = b_0 + b_1 x_p$
2. Standard error: $S\sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
3. Shape: t distribution with n - k - 1 d.f.
4. So the prediction interval for a new observation is
   $\hat{y}_p \pm t_{\alpha/2}\, S\sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
(Siegel, p. 481)
Slide 6: Prediction Interval for the Mean of Observations at x_p
1. Point estimate: $\hat{y}_p = b_0 + b_1 x_p$
2. Standard error: $S\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
3. Shape: t distribution with n - k - 1 d.f.
4. So the interval for the mean of observations at x_p is
   $\hat{y}_p \pm t_{\alpha/2}\, S\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
(Siegel, p. 483)
Slide 7: Earlier Example
Hours of Study (x) and Exam Score (y) Example

Regression Statistics
  Multiple R        0.770
  R Squared         0.594
  Adj. R Squared    0.543
  Standard Error   10.710
  Obs.                 10

ANOVA
              df      SS         MS         F     Significance
  Regression   1   1340.452   1340.452   11.686      0.009
  Residual     8    917.648    114.706
  Total        9   2258.100

             Coeff.   Std. Error   t stat   p value   Lower 95%   Upper 95%
  Intercept  39.401     12.153      3.242     0.012     11.375      67.426
  hours       2.122      0.621      3.418     0.009      0.691       3.554

1. Find the 95% CI for Joe's exam score (he studies for 20 hours).
2. Find the 95% CI for the mean score of those who studied for 20 hours.
(x̄ = 18.80)
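A sketch of both intervals in statsmodels. The ten (hours, score) pairs are invented (the hours values are chosen so that x̄ = 18.8 as on the slide; the slide's raw scores are not shown):

```python
# Prediction interval (new observation) and confidence interval (mean)
# at hours = 20, as asked in questions 1 and 2 above.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"hours": [5, 10, 12, 15, 18, 20, 22, 25, 28, 33],
                   "score": [50, 60, 65, 70, 72, 80, 85, 88, 90, 95]})
fit = smf.ols("score ~ hours", data=df).fit()

pred = fit.get_prediction(pd.DataFrame({"hours": [20]}))
print(pred.summary_frame(alpha=0.05))
# obs_ci_lower/upper:  95% interval for Joe's individual score (slide 5 formula)
# mean_ci_lower/upper: 95% interval for the mean score at 20 hours (slide 6 formula)
```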
Slide 8: Diagnostics / Misspecification
For estimation and testing to be valid, we need:
- y = b0 + b1 x1 + b2 x2 + … + bk xk + e makes sense
- Errors (e_i) are independent of each other and of the independent variables
- Homoskedasticity: the error variance σ_e² is a constant, independent of the independent variables, i.e., not something like Var(e_i) = σ² x_i² (heteroskedasticity)
Violations render our inferences invalid and misleading!
Slide 9: Common Problems
- Misspecification: omitted variable bias; nonlinear rather than linear relationship; levels, logs, or percent changes?
- Data problems: skewed variables and outliers; multicollinearity; sample selection (non-random data); missing data
- Problems with residuals (error terms): non-independent errors; heteroskedasticity
Slide 10: Omitted Variable Bias
Question 3 from Sample Exam B:

  wage = 9.05 + 1.39 union
        (1.65)  (0.66)

  wage = 9.56 + 1.42 union + 3.87 ability
        (1.49)  (0.56)       (1.56)

  wage = -3.03 + 0.60 union + 0.25 revenue
        (0.70)   (0.45)       (0.08)

- H. Farber thinks the average union wage differs from the average nonunion wage because unionized employers are more selective and hire individuals with higher ability.
- M. Friedman thinks the average union wage differs from the average nonunion wage because unionized employers have different levels of revenue per employee.
Slide 11: Checking the Assumptions
How to check the validity of the assumptions?
- Cynicism, realism, and theory
- Robustness checks: check different specifications, but don't just choose the best one!
- Automated variable selection methods, e.g., stepwise regression (Siegel, p. 547)
- Misspecification and other tests
- Examine diagnostic plots
Slide 12: Diagnostic Plots
[Residuals-vs-fitted plot] Increasing spread might indicate heteroskedasticity. Try transformations or weighted least squares.
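As a sketch of how such a plot is produced (simulated data; matplotlib and statsmodels are assumptions, not the course software), a fan shape in the residuals-vs-fitted plot signals exactly this problem:

```python
# Residual-vs-fitted diagnostic plot; the noise scale here grows with x,
# so the residuals fan out: the heteroskedasticity pattern described above.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 200)
y = 2 + 3 * x + rng.normal(0, x)          # error spread rises with x
fit = sm.OLS(y, sm.add_constant(x)).fit()

plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```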
Slide 13: Diagnostic Plots
[Diagnostic plot] "Tilt" from outliers might indicate skewness. Try a log transformation.
Slide 14: Problematic Outliers
Stock Performance and CEO Golf Handicaps (New York Times, 5-31-98)

Without 7 "outliers":
  Number of obs = 44     R-squared = 0.1718
  stockrating |   Coef.   Std. Err.     t    P>|t|
  ------------+-----------------------------------
  handicap    |  -1.711      .580    -2.95   0.005
  _cons       |  73.234     8.992     8.14   0.000

With the 7 "outliers":
  Number of obs = 51     R-squared = 0.0017
  stockrating |   Coef.   Std. Err.     t    P>|t|
  ------------+-----------------------------------
  handicap    |   -.173      .593    -0.29   0.771
  _cons       |  55.137     9.790     5.63   0.000
Slide 15: Are They Really Outliers??
Stock Performance and CEO Golf Handicaps (New York Times, 5-31-98)
[Diagnostic plot] The diagnostic plot is OK. BE CAREFUL!
Slide 16: Diagnostic Plots
[Diagnostic plot] Curvature might indicate nonlinearity. Try a quadratic specification.
Slide 17: Diagnostic Plots
[Diagnostic plot] A good diagnostic plot: it lacks obvious indications of other problems.
Slide 18: Adding a Squared (Quadratic) Term
Job performance regressed on salary (in $1,000s) (Egg Data):

  Source   |     SS     df     MS       Number of obs =    576
  ---------+--------------------       F(2, 573)     = 122.42
  Model    |  255.61     2  127.80     Prob > F      = 0.0000
  Residual |  598.22   573   1.044     R-squared     = 0.2994
  ---------+--------------------       Adj R-squared = 0.2969
  Total    |  853.83   575   1.485     Root MSE      = 1.0218

  job performance |    Coef.     Std. Err.     t     P>|t|
  ----------------+----------------------------------------
  salary          |   .0980844   .0260215    3.77    0.000
  salary squared  |  -.000337    .0001905   -1.77    0.077
  _cons           | -1.720966    .8720358   -1.97    0.049

Salary squared = salary² (= salary^2 in Excel)
Slide 19: Quadratic Regression
Quadratic regression (nonlinear):
  Job perf = -1.72 + 0.098 salary - 0.00034 salary²
Slide 20: Quadratic Regression
  Job perf = -1.72 + 0.098 salary - 0.00034 salary²
The effect of salary will eventually turn negative. But where? The fitted quadratic peaks at

  salary* = -(linear coeff.) / (2 × quadratic coeff.) = 0.098 / (2 × 0.00034) ≈ 144

i.e., predicted performance begins to decline at a salary of roughly $144,000.
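As a sketch in Python (the data are simulated to mimic the Egg Data coefficients; np.polyfit returns powers from highest to lowest):

```python
# Fit a quadratic and find where the salary effect flips sign.
import numpy as np

rng = np.random.default_rng(0)
salary = rng.uniform(20, 160, 576)        # hypothetical salaries, $1,000s
perf = -1.72 + 0.098 * salary - 0.00034 * salary**2 + rng.normal(0, 1, 576)

b2, b1, b0 = np.polyfit(salary, perf, 2)  # coefficients, highest power first
print("turning point:", -b1 / (2 * b2))   # ~144, i.e., about $144,000
```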
Slide 21: Another Specification Possibility
- If data are very skewed, you can try a log specification.
- Logs can replace levels for independent and/or dependent variables.
- Note that the interpretation of the coefficients will change.
- Re-familiarize yourself with Siegel, pp. 68-69.
Slide 22: Quick Note on Logs
- a is the natural logarithm of x if 2.71828^a = x, or e^a = x.
- The natural logarithm is abbreviated "ln": ln(x) = a.
- In Excel, use the LN function. We call this the "log," but don't use Excel's LOG function (which defaults to base 10)!
- Usefulness: it spreads out small values and narrows large values, which can reduce skewness.
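A quick numeric sketch of that compression effect (numpy's log is the natural log, like Excel's LN):

```python
# ln stretches small values apart and pulls large values together.
import numpy as np

for x in [0.5, 1, 10, 100, 1000]:
    print(x, np.log(x))   # natural log: the a with e**a == x
```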
Slide 23: Earnings Distribution
[Histogram] Weekly earnings from the March 2002 CPS, n = 15,000. Skewed to the right.
Slide 24: Residuals from Levels Regression
[Histogram] Residuals from a regression of weekly earnings on demographic characteristics. Skewed to the right, so use of the t distribution is suspect.
Slide 25: Log Earnings Distribution
[Histogram] Natural logarithm of weekly earnings from the March 2002 CPS, i.e., =ln(weekly earnings). Not perfectly symmetrical, but better.
Slide 26: Residuals from Log Regression
[Histogram] Residuals from a regression of log weekly earnings on demographic characteristics. Almost symmetrical, so use of the t distribution is probably OK.
Slide 27: Hypothesis Tests
We've been doing hypothesis tests for single coefficients:
  H0: β_i = 0; reject if |t| > t_{α/2, n-k-1}
  HA: β_i ≠ 0
What about testing more than one coefficient at the same time? E.g., we might want to see whether an entire group of 10 dummy variables for 10 industries should be in the model. Joint tests can be conducted using partial F tests.
Slide 28: Partial F Tests
  H0: β1 = β2 = β3 = … = βC = 0
  HA: at least one β_i ≠ 0
How to test this? Consider two regressions:
- One estimated as if H0 is true, i.e., with β1 = β2 = … = βC = 0 imposed: a "restricted" (or constrained) model.
- Plus a "full" (or unconstrained) model, in which the computer can estimate what it wants for each coefficient.
Slide 29: Partial F Tests
Statistically, we need to distinguish between:
- the full regression being "no better" than the restricted regression, versus
- the full regression being "significantly better" than the restricted regression.
To do this, look at the variance of the prediction errors. If it declines significantly, reject H0. From ANOVA, we know the ratio of two variances has an F distribution, so use an F test.
Slide 30: Partial F Tests

  $F = \dfrac{\left(SS_{residual}^{restricted} - SS_{residual}^{full}\right)/\,C}{SS_{residual}^{full}\,/\,(n-k-1)}$

where SS_residual = sum of squared residuals and C = number of constraints. The partial F statistic has (C, n-k-1) degrees of freedom. Reject H0 if F > F_{α, C, n-k-1}.
Slide 31: Coal Mining Example (Again)

Regression Statistics
  R Squared         0.955
  Adj. R Squared    0.949
  Standard Error  108.052
  Obs.                 47

ANOVA
              df       SS            MS            F       Significance
  Regression   6   9975694.933   1662615.822   142.406        0.000
  Residual    40    467007.875     11675.197
  Total       46  10442702.809

             Coeff.    Std. Error   t stat   p value   Lower 95%   Upper 95%
  Intercept  -168.510    258.819    -0.651    0.519     -691.603     354.583
  hours         1.244      0.186     6.565    0.000        0.001       0.002
  tons          0.048      0.403     0.119    0.906       -0.001       0.001
  unemp        19.618      5.660     3.466    0.001        8.178      31.058
  WWII        159.851     78.218     2.044    0.048        1.766     317.935
  Act1952      -9.839    100.045    -0.098    0.922     -212.038     192.360
  Act1969    -203.010    111.535    -1.820    0.076     -428.431      22.411
Slide 32: Minitab Output

  Predictor      Coef     StDev      T      P
  Constant     -168.5     258.8    -0.65  0.519
  hours        1.2235     0.186     6.56  0.000
  tons         0.0478     0.403     0.12  0.906
  unemp        19.618     5.660     3.47  0.001
  WWII         159.85     78.22     2.04  0.048
  Act1952        -9.8     100.0    -0.10  0.922
  Act1969      -203.0     111.5    -1.82  0.076

  S = 108.1   R-Sq = 95.5%   R-Sq(adj) = 94.9%

  Analysis of Variance
  Source      DF        SS       MS       F      P
  Regression   6   9975695  1662616  142.41  0.000
  Error       40    467008    11675
  Total       46  10442703
Slide 33: Is the Overall Model Significant?
  H0: β1 = β2 = β3 = … = β6 = 0
  HA: at least one β_i ≠ 0
Note: for testing the overall model, C = k, i.e., all coefficients are tested together.
From the previous slides, we have SS_residual for the "full" (unconstrained) model: SS_residual = 467,007.875.
But what about the restricted (H0 true) regression? Estimate a constant-only regression.
Slide 34: Constant-Only Model

Regression Statistics
  R Squared             0
  Adj. R Squared        0
  Standard Error  476.461
  Obs.                 47

ANOVA
              df       SS            MS           F    Significance
  Regression   0         0            .           .         .
  Residual    46  10442702.809   227015.278
  Total       46  10442702.809

             Coeff.   Std. Error   t stat   p value   Lower 95%   Upper 95%
  Intercept  671.937     69.499     9.668    0.000      532.042     811.830
Slide 35: Partial F Tests
  H0: β1 = β2 = β3 = … = β6 = 0
  HA: at least one β_i ≠ 0

  F = [(10,442,702.809 - 467,007.875) / 6] / [467,007.875 / 40] = 142.406

Reject H0 if F > F_{α, C, n-k-1} = F_{0.05, 6, 40} = 2.34. Since 142.406 > 2.34, reject H0: yes, the overall model is significant.
Slide 36: Select F Distribution 5% Critical Values

                        Numerator Degrees of Freedom
  Denominator d.f.      1      2      3      4      5      6    …
         1            161    199    216    225    230    234
         2           18.5   19.0   19.2   19.2   19.3   19.3
         3           10.1   9.55   9.28   9.12   9.01   8.94
         8           5.32   4.46   4.07   3.84   3.69   3.58
        10           4.96   4.10   3.71   3.48   3.33   3.22
        11           4.84   3.98   3.59   3.36   3.20   3.09
        12           4.75   3.89   3.49   3.26   3.11   3.00
        18           4.41   3.55   3.16   2.93   2.77   2.66
        40           4.08   3.23   2.84   2.61   2.45   2.34
      1000           3.85   3.00   2.61   2.38   2.22   2.11
Slide 37: A Small Shortcut
[Same full-model regression output as slide 31.]
Notice that SS_total in the full regression (10,442,702.809) equals SS_residual for the constant-only model. So to test the overall model, you don't need to run a constant-only regression: read the restricted SS_residual straight off the full model's ANOVA table as SS_total.
Slide 38: An Even Better Shortcut
[Same full-model regression output as slide 31.]
In fact, the F test in the ANOVA table (F = 142.406, Significance 0.000) is exactly the test for the overall model being significant. Recall Unit 8.
Slide 39: Testing Any Subset
[Same full-model regression output as slide 31.]
The partial F test can be used to test any subset of variables. For example:
  H0: β_WWII = β_Act1952 = β_Act1969 = 0
  HA: at least one β_i ≠ 0
Slide 40: Restricted Model
Restricted regression with β_WWII = β_Act1952 = β_Act1969 = 0 imposed (Obs. 47):

ANOVA
              df       SS            MS            F       Significance
  Regression   3   9837344.763   3279114.920   232.923        0.000
  Residual    43    605358.049     14078.094
  Total       46  10442702.809

             Coeff.   Std. Error   t stat   p value
  Intercept  147.821    166.406     0.888    0.379
  hours       0.0015     0.0001    20.522    0.000
  tons       -0.0008     0.0003    -2.536    0.015
  unemp       7.298      4.386      1.664    0.103
Slide 41: Partial F Tests
  H0: β_WWII = β_Act1952 = β_Act1969 = 0
  HA: at least one β_i ≠ 0

  F = [(605,358.049 - 467,007.875) / 3] / [467,007.875 / 40] = 3.950

Reject H0 if F > F_{α, C, n-k-1} = F_{0.05, 3, 40} = 2.84. Since 3.95 > 2.84, reject H0: yes, the subset of three coefficients is jointly significant.
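A small Python sketch of this machinery (scipy assumed); it reproduces both the overall test on slide 35 and the subset test above from the SS_residual values alone:

```python
# Partial F test from slide 30: F = [(SS_r - SS_f)/C] / [SS_f/(n-k-1)].
from scipy import stats

def partial_f(ss_res_restricted, ss_res_full, c, df_full):
    f = ((ss_res_restricted - ss_res_full) / c) / (ss_res_full / df_full)
    return f, stats.f.sf(f, c, df_full)            # statistic and p-value

# Overall model (slide 35): restricted = constant-only SS_residual
print(partial_f(10442702.809, 467007.875, 6, 40))  # F ~ 142.4
# Subset test (slide 41): WWII = Act1952 = Act1969 = 0
print(partial_f(605358.049, 467007.875, 3, 40))    # F ~ 3.95, p ~ 0.015
```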
Slide 42: Regression and Two-Way ANOVA

  Blocks   Treatment A   Treatment B   Treatment C
     1          10             9             8
     2          12             6             5
     3          18            15            14
     4          20            18            18
     5           8             7             8

"Stack" the data using dummy variables:

  A  B  C  B2  B3  B4  B5   Value
  1  0  0   0   0   0   0     10
  1  0  0   1   0   0   0     12
  1  0  0   0   1   0   0     18
  1  0  0   0   0   1   0     20
  1  0  0   0   0   0   1      8
  0  1  0   0   0   0   0      9
  0  1  0   1   0   0   0      6
  0  1  0   0   1   0   0     15
  0  1  0   0   0   1   0     18
  0  1  0   0   0   0   1      7
  0  0  1   0   0   0   0      8
  …
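A sketch of the same stacking in pandas (a hypothetical workflow, not from the course): get_dummies builds the dummy columns shown above, dropping treatment A and block 1 as the omitted categories.

```python
# Reshape the 5x3 block/treatment table into a stacked regression data set.
import pandas as pd

wide = pd.DataFrame({"A": [10, 12, 18, 20, 8],
                     "B": [9, 6, 15, 18, 7],
                     "C": [8, 5, 14, 18, 8]},
                    index=pd.Index([1, 2, 3, 4, 5], name="block"))

long = wide.stack().rename("value").reset_index()   # one row per observation
long.columns = ["block", "treatment", "value"]

X = pd.get_dummies(long, columns=["treatment", "block"], drop_first=True)
print(X.head())   # value plus treatment_B, treatment_C, block_2 ... block_5
```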
Slide 43: Recall Two-Way Results
ANOVA: Two-Factor Without Replication

  Source of Variation      SS      df     MS        F      P-value   F crit
  Blocks                312.267     4   78.067   38.711     0.000     3.84
  Treatment              26.533     2   13.267    6.579     0.020     4.46
  Error                  16.133     8    2.017
  Total                 354.933    14
Slide 44: Regression and Two-Way ANOVA

  Source   |    SS      df     MS       Number of obs =     15
  ---------+--------------------       F(6, 8)       =  28.00
  Model    | 338.800     6  56.467     Prob > F      = 0.0001
  Residual |  16.133     8   2.017     R-squared     = 0.9545
  ---------+--------------------       Adj R-squared = 0.9205
  Total    | 354.933    14  25.352     Root MSE      = 1.4201

  treatment |   Coef.   Std. Err.     t    P>|t|   [95% Conf. Int]
  ----------+----------------------------------------------------
  b         |  -2.600      .898    -2.89   0.020   -4.671   -.529
  c         |  -3.000      .898    -3.34   0.010   -5.071   -.929
  b2        |  -1.333     1.160    -1.15   0.283   -4.007   1.340
  b3        |   6.667     1.160     5.75   0.000    3.993   9.340
  b4        |   9.667     1.160     8.34   0.000    6.993  12.340
  b5        |  -1.333     1.160    -1.15   0.283   -4.007   1.340
  _cons     |  10.867      .970    11.20   0.000    8.630  13.104
Slide 45: Regression and Two-Way ANOVA

Regression excerpt for the full model:
  Model    | 338.800     6  56.467
  Residual |  16.133     8   2.017
  Total    | 354.933    14  25.352

Regression excerpt with b2 = b3 = … = 0 (blocks dropped):
  Model    |  26.533     2  13.267
  Residual | 328.400    12  27.367
  Total    | 354.933    14  25.352

Regression excerpt with b = c = 0 (treatments dropped):
  Model    | 312.267     4  78.067
  Residual |  42.667    10   4.267
  Total    | 354.933    14  25.352

Use these SS_residual values to do partial F tests and you will get exactly the same answers as the two-way ANOVA tests.
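For instance, a few lines of plain Python arithmetic confirm that the partial F statistics built from these SS_residual values match the ANOVA table on slide 43:

```python
# Partial F tests from the SS_residual excerpts on slide 45.
ss_full, df_full = 16.133, 8

f_treat = ((42.667 - ss_full) / 2) / (ss_full / df_full)    # b = c = 0
f_block = ((328.400 - ss_full) / 4) / (ss_full / df_full)   # b2 = ... = b5 = 0
print(round(f_treat, 3), round(f_block, 3))                 # 6.579 and 38.711
```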
Slide 46: Select F Distribution 5% Critical Values

                        Numerator Degrees of Freedom
  Denominator d.f.      1      2      3      4      5      6      9    …
         1            161    199    216    225    230    234    241
         2           18.5   19.0   19.2   19.2   19.3   19.3   19.4
         3           10.1   9.55   9.28   9.12   9.01   8.94   8.81
         8           5.32   4.46   4.07   3.84   3.69   3.58   3.39
        10           4.96   4.10   3.71   3.48   3.33   3.22   3.02
        11           4.84   3.98   3.59   3.36   3.20   3.09   2.90
        12           4.75   3.89   3.49   3.26   3.11   3.00   2.80
        18           4.41   3.55   3.16   2.93   2.77   2.66   2.46
        40           4.08   3.23   2.84   2.61   2.45   2.34   2.12
      1000           3.85   3.00   2.61   2.38   2.22   2.11   1.89
         ∞           3.84   3.00   2.60   2.37   2.21   2.10   1.88
Slide 47: 3 Seconds of Calculus
[Derivation slide; the content was not preserved in the transcript.]
Slide 48: Regression Coefficients
- y = b0 + b1 x (linear form): a 1-unit change in x changes y by b1.
- log(y) = b0 + b1 x (semi-log form): a 1-unit change in x changes y by b1 × 100 percent.
- log(y) = b0 + b1 log(x) (double-log form): a 1-percent change in x changes y by b1 percent.
Slide 49: Log Regression Coefficients
- wage = 9.05 + 1.39 union: the predicted wage is $1.39 higher for unionized workers (on average).
- log(wage) = 2.20 + 0.15 union (semi-elasticity): the predicted wage is approximately 15% higher for unionized workers (on average).
- log(wage) = 1.61 + 0.30 log(profits) (elasticity): a one percent increase in profits increases the predicted wage by approximately 0.3 percent.
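The 15% reading is an approximation that works well for small coefficients; the exact effect implied by a semi-log coefficient b is exp(b) - 1, a one-line check:

```python
# Exact percentage effect implied by the semi-log union coefficient of 0.15.
import numpy as np
print(np.expm1(0.15))   # ~0.162, i.e., about 16%, close to the 15% approximation
```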
Slide 50: Multicollinearity
Auto repair records, weight, and engine size:

  Number of obs =     69
  F(2, 66)      =   6.84
  Prob > F      = 0.0020
  R-squared     = 0.1718
  Adj R-squared = 0.1467
  Root MSE      = .91445

  repair  |   Coef.   Std. Err.     t    P>|t|
  --------+-----------------------------------
  weight  | -.00017    .00038    -0.41   0.685
  engine  | -.00313    .00328    -0.96   0.342
  _cons   | 4.50161    .61987     7.26   0.000
Slide 51: Multicollinearity
- Two (or more) independent variables are so highly correlated that a multiple regression can't disentangle the unique contributions of each.
- Symptoms: large standard errors and lack of statistical significance for individual coefficients, but joint significance.
- Identifying multicollinearity: some say the rule of thumb |r| > 0.70 (or 0.80), but it is better to look at the results.
- Multicollinearity is OK for prediction, bad for assessing theory.
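A sketch of how you might check this in Python: the correlation matrix implements the |r| rule of thumb above, and the variance inflation factor (VIF) is a common extra diagnostic not covered in the slides. The data are simulated to mimic the weight/engine example:

```python
# Simulate two nearly collinear regressors and inspect |r| and VIFs.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
weight = rng.normal(3019, 500, 69)
engine = 0.06 * weight + rng.normal(0, 10, 69)   # engine size tracks weight
df = pd.DataFrame({"weight": weight, "engine": engine})

print(df.corr())                 # |r| well above the 0.70 rule of thumb
X = sm.add_constant(df)
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```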
Slide 52: Prediction With Multicollinearity
Prediction at the mean (weight = 3019 and engine = 197):

  Model for prediction   Predicted Repair   Lower 95% (Mean)   Upper 95% (Mean)
  Multiple regression         3.411              3.191              3.631
  Weight only                 3.412              3.193              3.632
  Engine only                 3.410              3.192              3.629
Slide 53: Dummy Dependent Variables
  y = b0 + b1 x1 + … + bk xk + e, where y is a {0,1} indicator variable.
Examples:
- Do you intend to quit? (yes/no)
- Did the worker receive training? (yes/no)
- Do you think the President is doing a good job? (yes/no)
- Was there a strike? (yes/no)
- Did the company go bankrupt? (yes/no)
Slide 54: Linear Probability Model
- Mathematically / computationally, you can estimate the regression as usual (the monkeys won't know the difference).
- This is called a "linear probability model": the right-hand side is linear, and it estimates probabilities:
    P(y = 1) = b0 + b1 x1 + … + bk xk
- b1 = 0.15 (for example) means that a one-unit change in x1 increases the probability that y = 1 by 0.15 (fifteen percentage points).
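A minimal sketch of a linear probability model in statsmodels (the data and variable names are invented; the robust standard errors are my addition, anticipating the heteroskedasticity problem on the next slide):

```python
# OLS on a 0/1 outcome: the linear probability model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 500
df = pd.DataFrame({"union": rng.integers(0, 2, n),
                   "tenure": rng.uniform(0, 30, n)})
p_true = 0.2 + 0.15 * df["union"]         # true P(y=1) used in the simulation
df["quit"] = rng.binomial(1, p_true)

lpm = smf.ols("quit ~ union + tenure", data=df).fit(cov_type="HC1")
print(lpm.params)   # 'union' ~ 0.15: raises P(quit=1) by ~15 percentage points
```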
Slide 55: Linear Probability Model
Excel won't know the difference, but perhaps it should. Linear probability model problems:
- The error variance is σ_e² = P(y=1)[1 - P(y=1)], but P(y=1) = b0 + b1 x1 + … + bk xk, so σ_e² is not constant: the errors are heteroskedastic by construction.
- Predicted probabilities are not bounded by 0 and 1.
- R² is not an accurate measure of predictive ability; you can use a pseudo-R² measure instead, such as the percent correctly predicted.
Slide 56: Logit Model & Probit Model
The solution to these problems is to use nonlinear functional forms that bound P(y=1) between 0 and 1.

Logit model (logistic regression):
  $P(y=1) = \dfrac{e^{b_0 + b_1 x_1 + \dots + b_k x_k}}{1 + e^{b_0 + b_1 x_1 + \dots + b_k x_k}}$

Probit model:
  $P(y=1) = \Phi(b_0 + b_1 x_1 + \dots + b_k x_k)$

where Φ is the normal cumulative distribution function. Recall, ln(x) = a when e^a = x.
Slide 57: Logit Model & Probit Model
- These models are nonlinear, so you need a statistical package to do the calculations.
- You can do individual tests (z-tests, not t-tests) and joint statistical testing as with other regressions; confidence intervals, too.
- You need to convert coefficients to marginal effects for interpretation.
- You should be aware of these models, though in many cases a linear probability model works just fine.
Slide 58: Example
Dependent variable: 1 if you know of the FMLA, 0 otherwise.

  Probit estimates                 Number of obs =   1189
                                   LR chi2(14)   = 232.39
                                   Prob > chi2   = 0.0000
  Log likelihood = -707.94377      Pseudo R2     = 0.1410

  FMLAknow |   Coef.   Std. Err.     z    P>|z|   [95% Conf. Int]
  ---------+-----------------------------------------------------
  union    |   .238      .101      2.35   0.019     .039     .436
  age      |  -.002      .018     -0.13   0.897    -.038     .033
  agesq    |   .135      .219      0.62   0.536    -.293     .564
  nonwhite |  -.571      .098     -5.80   0.000    -.764    -.378
  income   |  1.465      .393      3.73   0.000     .696    2.235
  incomesq | -5.854     2.853     -2.05   0.040   -11.45    -.262
  [other controls omitted]
  _cons    | -1.188      .328     -3.62   0.000   -1.831    -.545
Slide 59: Marginal Effects
For numerical interpretation / prediction, you need to convert the coefficients to marginal effects. Example, the logit model:

  $\ln\!\left(\dfrac{P(y=1)}{1 - P(y=1)}\right) = b_0 + b_1 x_1 + \dots + b_k x_k$

So b1 gives the effect on the log odds, ln[P/(1-P)], not on P(y=1). Probit is similar. You can rearrange to find the effect on P(y=1); this is usually done at the sample means.
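A sketch of the whole conversion in statsmodels (simulated stand-in data, since the FMLA file isn't available here); get_margeff(at="mean") produces the dF/dx column shown on the next slide:

```python
# Probit coefficients vs. marginal effects at the sample means.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 1189
df = pd.DataFrame({"union": rng.integers(0, 2, n),
                   "income": rng.uniform(0, 1, n)})
latent = -1.2 + 0.24 * df["union"] + 1.5 * df["income"] + rng.normal(size=n)
df["fmla_know"] = (latent > 0).astype(int)    # probit data-generating process

probit = smf.probit("fmla_know ~ union + income", data=df).fit()
print(probit.params)                            # index-scale coefficients
print(probit.get_margeff(at="mean").summary())  # dF/dx: effects on P(y=1)
```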
Slide 60: Marginal Effects
For numerical interpretation / prediction, convert the coefficients to marginal effects:

  Probit estimates                 Number of obs =   1189
                                   LR chi2(14)   = 232.39
                                   Prob > chi2   = 0.0000
  Log likelihood = -707.94377      Pseudo R2     = 0.1410

  FMLAknow |   dF/dx   Std. Err.     z    P>|z|   [95% Conf. Int]
  ---------+-----------------------------------------------------
  union    |   .095      .040      2.35   0.019     .017     .173
  age      |  -.001      .007     -0.13   0.897    -.015     .013
  agesq    |   .054      .087      0.62   0.536    -.117     .225
  nonwhite |  -.222      .036     -5.80   0.000    -.293    -.151
  income   |   .585      .157      3.73   0.000     .278     .891
  incomesq | -2.335     1.138     -2.05   0.040   -4.566    -.105
  [other controls omitted]
Slide 61: But the Linear Probability Model Is OK, Too

                     Probit Coeff.   Probit Marginal   Regression (LPM)
  Union                 0.238            0.095             0.084
                       (0.101)          (0.040)           (0.035)
  Nonwhite             -0.571           -0.222            -0.192
                       (0.098)          (0.037)           (0.033)
  Income                1.465            0.585             0.442
                       (0.393)          (0.157)           (0.091)
  Income Squared       -5.854           -2.335            -1.354
                       (2.853)          (1.138)           (0.316)

So regression is usually OK, but you should still be familiar with logit and probit methods.