1 The Power of Regression

Previous research literature claim: Foreign-owned manufacturing plants have greater levels of strike activity than domestic plants. In Canada, strike rates of 25.5% versus 20.3%.

Budd's claim: Foreign-owned plants are larger and located in strike-prone industries. Need multivariate regression analysis!

2 The Power of Regression

Dependent Variable: Strike Incidence

                                  (1)        (2)        (3)
U.S. Corporate Parent           0.230**    0.201*
  (Canadian Parent omitted)     (0.117)    (0.119)    (0.132)
Number of Employees (1000s)                   **      0.094**
                                           (0.019)    (0.020)
Industry Effects?                  No                   Yes
Sample Size                      2,170

* Statistically significant at the 0.10 level; ** at the 0.05 level (two-tailed tests).

3 Important Regression Topics

Prediction: various confidence and prediction intervals
Diagnostics: are the assumptions for estimation & testing fulfilled?
Specifications: quadratic terms? logarithmic dependent variables?
Additional hypothesis tests: partial F tests
Dummy dependent variables: probit and logit models

4 Confidence Intervals

The true population [whatever] is within the following interval (1 − α)% of the time:

Estimate ± t_{α/2} × Standard Error

Just need: the estimate, its standard error, and the shape / distribution (including degrees of freedom).

5 Prediction Interval for a New Observation at x_p

1. Point estimate: the predicted value ŷ_p from the estimated regression equation
2. Standard error: the standard error for a new observation at x_p
3. Shape: t distribution with n − k − 1 d.f.
4. So the prediction interval for a new observation is ŷ_p ± t_{α/2} × (standard error for a new observation) (Siegel)

6 Prediction Interval for the Mean of Observations at x_p

1. Point estimate: the predicted value ŷ_p
2. Standard error: the standard error for the mean at x_p (smaller than the standard error for a single new observation)
3. Shape: t distribution with n − k − 1 d.f.
4. So the interval for the mean of observations at x_p is ŷ_p ± t_{α/2} × (standard error for the mean) (Siegel, p. 483)
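Both intervals are easy to get from a statistics package. A minimal sketch in Python with statsmodels, using invented hours-of-study and exam-score data in the spirit of the next slide (the variable names and numbers are illustrative, not the slide's actual data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
hours = rng.uniform(5, 25, size=10)               # hypothetical hours of study
score = 40 + 2.2 * hours + rng.normal(0, 8, 10)   # hypothetical exam scores

model = sm.OLS(score, sm.add_constant(hours)).fit()

# Joe studies 20 hours; has_constant='add' keeps the intercept column
x_p = sm.add_constant(np.array([20.0]), has_constant='add')
pred = model.get_prediction(x_p)
print(pred.conf_int(obs=True))    # prediction interval for a new observation
print(pred.conf_int(obs=False))   # confidence interval for the mean at x_p
```

The obs=True interval is wider because it adds the error variance of a single observation to the uncertainty in the estimated line.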

7 Earlier Example

Hours of Study (x) and Exam Score (y) example, x̄ = 18.80

Regression Statistics: Multiple R 0.770; R Squared 0.594; Adj. R Squared 0.543; Obs. 10
ANOVA table: df, SS, MS, F, Significance (Regression, Residual, Total)
Coefficient table: Coeff., Std. Error, t stat, p value, Lower 95%, Upper 95% (Intercept, hours)

1. Find the 95% CI for Joe's exam score (studies for 20 hours)
2. Find the 95% CI for the mean score for those who studied for 20 hours

8 Diagnostics / Misspecification

For estimation & testing to be valid…
y = b0 + b1x1 + b2x2 + … + bkxk + e makes sense
Errors (e_i) are independent of each other and of the independent variables
Homoskedasticity: the error variance is independent of the independent variables; σ_e² is a constant, Var(e_i) = σ_e² for all x_i (i.e., no heteroskedasticity)
Violations render our inferences invalid and misleading!
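The slides do not prescribe a particular test, but one standard way to probe the homoskedasticity assumption is the Breusch-Pagan test. A sketch on invented data whose error spread grows with x:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 2 + 3 * x + rng.normal(0, 1 + 0.5 * x)   # error variance grows with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {lm_pval:.4f}")  # small p => heteroskedasticity
```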

9 Common Problems

Misspecification: omitted variable bias; nonlinear rather than linear relationship; levels, logs, or percent changes?
Data problems: skewed variables and outliers; multicollinearity; sample selection (non-random data); missing data
Problems with residuals (error terms): non-independent errors; heteroskedasticity

10 Omitted Variable Bias

Question 3 from Sample Exam B (standard errors in parentheses):

wage = b0 + b1 union                  (1.65) (0.66)
wage = b0 + b1 union + b2 ability     (1.49) (0.56) (1.56)
wage = b0 + b1 union + b2 revenue     (0.70) (0.45) (0.08)

H. Farber thinks the average union wage is different from the average nonunion wage because unionized employers are more selective and hire individuals with higher ability. M. Friedman thinks the average union wage is different from the average nonunion wage because unionized employers have different levels of revenue per employee.

11 Checking the Assumptions

How to check the validity of the assumptions?
Cynicism, realism, and theory
Robustness checks: check different specifications, but don't just choose the best one!
Automated variable selection methods, e.g., stepwise regression (Siegel, p. 547)
Misspecification and other tests
Examine diagnostic plots

12 Diagnostic Plots

Increasing spread might indicate heteroskedasticity. Try transformations or weighted least squares.

13 Diagnostic Plots

"Tilt" from outliers might indicate skewness. Try a log transformation.

14 Problematic Outliers

Stock Performance and CEO Golf Handicaps (New York Times)

Without 7 "outliers": Number of obs = 44; Stata output for stockrating on handicap (Coef., Std. Err., t, P>|t| for handicap and _cons), with R-squared
With the 7 "outliers": Number of obs = 51; same regression and statistics

15 Are They Really Outliers??

Stock Performance and CEO Golf Handicaps (New York Times)
Diagnostic plot is OK. BE CAREFUL!

16 Diagnostic Plots

Curvature might indicate nonlinearity. Try a quadratic specification.

17 Diagnostic Plots

A good diagnostic plot: lacks obvious indications of other problems.

18 Adding a Squared (Quadratic) Term

Job performance regressed on salary (in $1,000s) (Egg Data)

Stata output: Number of obs, F(2, 573), Prob > F, R-squared, Adj R-squared, Root MSE; coefficient rows for salary, salary squared, and _cons

Salary Squared = Salary² [= salary^2 in Excel]

19 Quadratic Regression

Quadratic regression (nonlinear):
Job perf = b0 + b1·salary − b2·salary squared

20 Quadratic Regression

Job perf = b0 + b1·salary − b2·salary²
The effect of salary will eventually turn negative. But where?
The maximum is at salary = −(linear coeff.) / (2 × quadratic coeff.)
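A minimal sketch of this specification with statsmodels, on invented data (the Egg Data values are not reproduced here), including the turning-point calculation:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({'salary': rng.uniform(20, 120, 300)})   # in $1,000s
df['jobperf'] = (2 + 0.16 * df['salary'] - 0.001 * df['salary'] ** 2
                 + rng.normal(0, 0.5, len(df)))            # hump-shaped

fit = smf.ols('jobperf ~ salary + I(salary ** 2)', data=df).fit()
b1 = fit.params['salary']
b2 = fit.params['I(salary ** 2)']
print(f"effect of salary turns negative at salary = {-b1 / (2 * b2):.1f}")
```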

21 Another Specification Possibility

If data are very skewed, can try a log specification
Can use logs instead of levels for independent and/or dependent variables
Note that the interpretation of the coefficients will change
Re-familiarize yourself with Siegel, pp.

22 Quick Note on Logs

a is the natural logarithm of x if e^a = x
The natural logarithm is abbreviated "ln": ln(x) = a
In Excel, use the LN function. We call this the "log" but don't use Excel's LOG function!
Usefulness: spreads out small values and narrows large values, which can reduce skewness.
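A tiny sketch of that usefulness claim on invented right-skewed data (lognormal, so its log is exactly normal):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
earnings = rng.lognormal(mean=6.5, sigma=0.8, size=15000)   # right-skewed
print(f"skewness of levels: {skew(earnings):.2f}")          # strongly positive
print(f"skewness of logs:   {skew(np.log(earnings)):.2f}")  # near zero
```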

23 Earnings Distribution

Weekly earnings from the March 2002 CPS, n = 15,000
Skewed to the right

24 Residuals from Levels Regression

Residuals from a regression of weekly earnings on demographic characteristics
Skewed to the right: use of the t distribution is suspect

25 Log Earnings Distribution

Natural logarithm of weekly earnings from the March 2002 CPS, i.e., = ln(weekly earnings)
Not perfectly symmetrical, but better

26 Residuals from Log Regression

Residuals from a regression of log weekly earnings on demographic characteristics
Almost symmetrical: use of the t distribution is probably OK

27 Hypothesis Tests

We've been doing hypothesis tests for single coefficients:
H0: β = 0; HA: β ≠ 0; reject if |t| > t_{α/2, n−k−1}
What about testing more than one coefficient at the same time?
e.g., want to see if an entire group of 10 dummy variables for 10 industries should be in the model
Joint tests can be conducted using partial F tests

28 Partial F Tests

H0: β1 = β2 = β3 = … = βC = 0
HA: at least one βi ≠ 0
How to test this? Consider two regressions:
One as if H0 is true, i.e., β1 = β2 = β3 = … = βC = 0; this is a "restricted" (or constrained) model
Plus a "full" (or unconstrained) model in which the computer can estimate what it wants for each coefficient

29 Partial F Tests

Statistically, need to distinguish between:
Full regression "no better" than the restricted regression, versus
Full regression "significantly better" than the restricted regression
To do this, look at the variance of the prediction errors
If this declines significantly, then reject H0
From ANOVA, we know the ratio of two variances has an F distribution
So use an F test

30 Partial F Tests

F = [(SS residual, restricted − SS residual, full) / C] / [SS residual, full / (n − k − 1)]

SS residual = Sum of Squares Residual; C = # of constraints
The partial F statistic has C, n−k−1 degrees of freedom
Reject H0 if F > F_{α, C, n−k−1}
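A sketch of the computation with statsmodels on invented data (dimensions chosen to echo the coal mining example: n = 47, k = 6, C = 3); statsmodels' compare_f_test implements exactly this statistic, so it is printed as a cross-check:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import f as f_dist

rng = np.random.default_rng(4)
n = 47
X_full = sm.add_constant(rng.normal(size=(n, 6)))              # 6 regressors
y = X_full @ np.array([3, 1, .5, .2, .4, .3, .1]) + rng.normal(size=n)

full = sm.OLS(y, X_full).fit()
restricted = sm.OLS(y, X_full[:, :4]).fit()   # last 3 coefficients forced to 0

C = 3                                         # number of constraints
F = ((restricted.ssr - full.ssr) / C) / (full.ssr / full.df_resid)
print(F, f_dist.sf(F, C, full.df_resid))      # statistic and p-value
print(full.compare_f_test(restricted))        # same F, p, and df
```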

31 Coal Mining Example (Again)

Regression Statistics: R Squared 0.955; Adj. R Squared 0.949; Obs. 47
ANOVA table: df, SS, MS, F, Significance (Regression, Residual, Total)
Coefficient table: Coeff., Std. Error, t stat, p value, Lower 95%, Upper 95% for Intercept, hours, tons, unemp, WWII, Act1952, Act1969

32 Minitab Output

Predictor table: Coef, StDev, T, P for Constant, hours, tons, unemp, WWII, Act1952, Act1969
S, R-Sq = 95.5%, R-Sq(adj) = 94.9%
Analysis of Variance: Source, DF, SS, MS, F, P (Regression, Error, Total)

33 Is the Overall Model Significant?

H0: β1 = β2 = β3 = … = β6 = 0
HA: at least one βi ≠ 0
Note: for testing the overall model, C = k, i.e., testing all coefficients together
From the previous slides, we have SS residual for the "full" (or unconstrained) model: SS residual = 467,…
But what about for the restricted (H0 true) regression? Estimate a constant-only regression

34 Constant-Only Model

Regression Statistics: R Squared 0; Adj. R Squared 0; Obs. 47
ANOVA: Regression df = 0, SS = 0, MS = 0; Residual and Total rows reported
Coefficient table: Intercept only

35 Partial F Tests

H0: β1 = β2 = β3 = … = β6 = 0
HA: at least one βi ≠ 0
Reject H0 if F > F_{α, C, n−k−1} = F_{0.05, 6, 40} = 2.34
The computed F statistic exceeds 2.34, so reject H0. Yes, the overall model is significant.

36 Select F Distribution 5% Critical Values

(Table of 5% critical values by numerator and denominator degrees of freedom)

37 A Small Shortcut

Regression Statistics: R Squared 0.955; Adj. R Squared 0.949; Obs. 47 (the coal mining regression from slide 31)
For the constant-only model, SS residual = 10,442,… This equals the Total sum of squares in the full regression's ANOVA table, so to test the overall model you don't need to run a constant-only model.

38 An Even Better Shortcut

Regression Statistics: R Squared 0.955; Adj. R Squared 0.949; Obs. 47 (same output)
In fact, the ANOVA table F test is exactly the test for the overall model being significant (recall Unit 8).

39 Testing Any Subset

(Coal mining regression output as above: Intercept, hours, tons, unemp, WWII, Act1952, Act1969)
The partial F test can be used to test any subset of variables. For example:
H0: βWWII = βAct1952 = βAct1969 = 0
HA: at least one βi ≠ 0

40 Restricted Model

Restricted regression with βWWII = βAct1952 = βAct1969 = 0
Regression Statistics: R Squared 0.955; Adj. R Squared 0.949; Obs. 47
Coefficient table: Coeff., Std. Error, t stat, p value for Intercept, hours, tons, unemp

41 Partial F Tests

H0: βWWII = βAct1952 = βAct1969 = 0
HA: at least one βi ≠ 0
Reject H0 if F > F_{α, C, n−k−1} = F_{0.05, 3, 40} = 2.84
F = 3.950 > 2.84, so reject H0. Yes, the subset of three coefficients is jointly significant.

42 Regression and Two-Way ANOVA

Treatments A, B, C; Blocks 1–5
"Stack" the data using dummy variables
Stacked layout: columns A, B, C, B2, B3, B4, B5, Value (one row per block-treatment cell)
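A sketch of the stacking step with pandas and statsmodels on invented block-by-treatment values; C() dummy coding drops treatment A and block 1 as reference categories, matching the slide's omitted dummies:

```python
import pandas as pd
import statsmodels.formula.api as smf

# hypothetical 5 blocks x 3 treatments table of responses
wide = pd.DataFrame({'A': [10, 12, 11, 13, 12],
                     'B': [14, 15, 13, 16, 15],
                     'C': [9, 11, 10, 12, 11]},
                    index=[1, 2, 3, 4, 5])

stacked = wide.stack().reset_index()          # one row per block-treatment cell
stacked.columns = ['block', 'treatment', 'value']

fit = smf.ols('value ~ C(treatment) + C(block)', data=stacked).fit()
print(fit.summary())   # partial F tests on these dummies reproduce two-way ANOVA
```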

43 Recall Two-Way Results

ANOVA: Two-Factor Without Replication
Source of Variation table: SS, df, MS, F, P-value, F crit for Blocks, Treatment, Error, Total

44 Regression and Two-Way ANOVA

Stata output: Source (Model, Residual, Total) with SS, df, MS; Number of obs, F(6, 8), Prob > F, R-squared, Adj R-squared, Root MSE
Coefficient rows: b, c, b2, b3, b4, b5, _cons (Coef., Std. Err., t, P>|t|, 95% Conf. Int.)

45 Regression and Two-Way ANOVA

Regression excerpt for the full model: Source | SS, df, MS (Model, Residual, Total)
Regression excerpt with βb2 = βb3 = … = 0 (block dummies dropped)
Regression excerpt with βb = βc = 0 (treatment dummies dropped)
Use these SS residual values to do partial F tests and you will get exactly the same answers as the two-way ANOVA tests.

46 Select F Distribution 5% Critical Values

(Table of 5% critical values by numerator and denominator degrees of freedom)

47 3 Seconds of Calculus

48 Regression Coefficients

y = b0 + b1x (linear form): a 1 unit change in x changes y by b1
log(y) = b0 + b1x (semi-log form): a 1 unit change in x changes y by b1 × 100 percent
log(y) = b0 + b1 log(x) (double-log form): a 1 percent change in x changes y by b1 percent

49 Log Regression Coefficients

wage = b0 + 1.39 union: the predicted wage is $1.39 higher for unionized workers (on average)
log(wage) = b0 + 0.15 union (semi-elasticity): the predicted wage is approximately 15% higher for unionized workers (on average)
log(wage) = b0 + 0.3 log(profits) (elasticity): a one percent increase in profits increases the predicted wage by approximately 0.3 percent
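One caution worth making explicit: the 15% reading is an approximation. For a dummy variable in a semi-log regression, the exact percent effect is (e^b − 1) × 100, which coincides with 100·b only when b is small. A two-line check:

```python
import math

b = 0.15                          # semi-log union coefficient from the slide
print(100 * b)                    # approximate effect: 15.0 percent
print((math.exp(b) - 1) * 100)    # exact effect: about 16.2 percent
```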

50 Multicollinearity

Auto repair records, weight, and engine size
Stata output: Number of obs = 69; F(2, 66) = 6.84; Prob > F, R-squared, Adj R-squared, Root MSE
Coefficient rows: weight, engine, _cons (Coef., Std. Err., t, P>|t|)

51 Multicollinearity

Two (or more) independent variables are so highly correlated that a multiple regression can't disentangle the unique contributions of each
Large standard errors and lack of statistical significance for individual coefficients, but joint significance
Identifying multicollinearity: some say "rule of thumb |r| > 0.70" (or 0.80), but it is better to look at the results
OK for prediction; bad for assessing theory
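The slides identify multicollinearity from pairwise correlations and its symptoms; a common complementary diagnostic, not shown on the slides, is the variance inflation factor. A sketch on invented collinear regressors in the spirit of the weight/engine example:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
weight = rng.normal(3000, 500, 69)
engine = 0.06 * weight + rng.normal(0, 10, 69)   # nearly a function of weight

X = sm.add_constant(np.column_stack([weight, engine]))
for i, name in [(1, 'weight'), (2, 'engine')]:
    print(name, variance_inflation_factor(X, i))  # VIF >> 10 signals trouble
```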

52 Prediction With Multicollinearity

Prediction at the mean (weight = 3019 and engine = 197)
Table: Model for prediction | Predicted Repair | Lower 95% Limit (Mean) | Upper 95% Limit (Mean)
Rows: Multiple Regression; Weight Only; Engine Only

53 Dummy Dependent Variables

Dummy dependent variables: y = b0 + b1x1 + … + bkxk + e, where y is a {0,1} indicator variable
Examples:
Do you intend to quit? yes/no
Did the worker receive training? yes/no
Do you think the President is doing a good job? yes/no
Was there a strike? yes/no
Did the company go bankrupt? yes/no

54 Linear Probability Model

Mathematically / computationally, can estimate the regression as usual (the monkeys won't know the difference)
This is called a "linear probability model": the right-hand side is linear, and it is estimating probabilities:
P(y = 1) = b0 + b1x1 + … + bkxk
b1 = 0.15 (for example) means that a one unit change in x1 increases the probability that y = 1 by 0.15 (fifteen percentage points)

55 Linear Probability Model

Excel won't know the difference, but perhaps it should. Linear probability model problems:
σ_e² = P(y=1) × [1 − P(y=1)], but P(y=1) = b0 + b1x1 + … + bkxk, so σ_e² is not constant (heteroskedasticity)
Predicted probabilities are not bounded by 0 and 1
R² is not an accurate measure of predictive ability; can use a pseudo-R² measure, such as percent correctly predicted
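A sketch of these problems on invented data: fit the LPM by ordinary OLS (robust standard errors are one standard response to the built-in heteroskedasticity) and count the out-of-bounds predicted probabilities:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(size=500)
p_true = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))        # true probabilities
y = rng.binomial(1, p_true)                        # 0/1 outcome

lpm = sm.OLS(y, sm.add_constant(x)).fit(cov_type='HC1')  # robust SEs
fitted = lpm.fittedvalues
print(((fitted < 0) | (fitted > 1)).mean())        # share outside [0, 1]
```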

56 Logit Model & Probit Model

The solution to these problems is to use nonlinear functional forms that bound P(y=1) between 0 and 1
Logit model (logistic regression): P(y=1) = e^(b0 + b1x1 + … + bkxk) / (1 + e^(b0 + b1x1 + … + bkxk))
Probit model: P(y=1) = Φ(b0 + b1x1 + … + bkxk), where Φ is the normal cumulative distribution function
Recall, ln(x) = a when e^a = x

57 Logit Model & Probit Model

Nonlinear, so need a statistical package to do the calculations
Can do individual (z-tests, not t-tests) and joint statistical testing as with other regressions; also confidence intervals
Need to convert coefficients to marginal effects for interpretation
Should be aware of these models, though in many cases a linear probability model works just fine
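A sketch of fitting both models with statsmodels, on the same kind of invented data as the previous snippet:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 2.0 * x))))

X = sm.add_constant(x)
logit = sm.Logit(y, X).fit(disp=False)
probit = sm.Probit(y, X).fit(disp=False)
print(logit.params)    # z-tests, CIs, etc. via logit.summary()
print(probit.params)   # logit and probit coefficients are on different scales
```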

58 Example

Dep. var.: 1 if you know of the FMLA, 0 otherwise
Probit estimates: Number of obs = 1189; LR chi2(14); Prob > chi2; Log likelihood; Pseudo R2
Coefficient rows: union, age, agesq, nonwhite, income, incomesq, [other controls omitted], _cons (Coef., Std. Err., z, P>|z|, 95% Conf. Int.)

59 Marginal Effects

For numerical interpretation / prediction, need to convert coefficients to marginal effects
Example: logit model: ln[P(y=1) / (1 − P(y=1))] = b0 + b1x1 + … + bkxk
So b1 gives the effect on the log odds, ln[P/(1−P)], not on P(y=1); probit is similar
Can re-arrange to find the effect on P(y=1); usually do this at the sample means
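statsmodels can do the at-the-means conversion directly; a self-contained sketch on invented data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 2.0 * x))))

probit = sm.Probit(y, sm.add_constant(x)).fit(disp=False)
margeff = probit.get_margeff(at='mean')   # dP(y=1)/dx at the sample means
print(margeff.summary())
```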

60 Marginal Effects

For numerical interpretation / prediction, need to convert coefficients to marginal effects
Probit estimates: Number of obs = 1189; LR chi2(14); Prob > chi2; Log likelihood; Pseudo R2
Marginal-effects rows (dF/dx, Std. Err., z, P>|z|, 95% Conf. Int.): union, age, agesq, nonwhite, income, incomesq, [other controls omitted]

61 But the Linear Probability Model Is OK, Too

                    Probit Coeff.   Probit Marginal   Regression
Union               0.238 (0.101)      (0.040)          (0.035)
Nonwhite                  (0.098)      (0.037)          (0.033)
Income                    (0.393)      (0.157)          (0.091)
Income Squared            (2.853)      (1.138)          (0.316)

(Standard errors in parentheses.)
So regression is usually OK, but you should still be familiar with logit and probit methods.