Part 7: Multiple Regression Analysis
Regression and Forecasting Models
Professor William Greene
Stern School of Business
IOMS Department / Department of Economics
Model Assumptions

y_i = β0 + β1·x_i1 + β2·x_i2 + β3·x_i3 + … + βK·x_iK + ε_i

β0 + β1·x_i1 + β2·x_i2 + β3·x_i3 + … + βK·x_iK is the 'regression function.'
It contains the 'information' about y_i in x_i1, …, x_iK.
It is unobserved, because β0, β1, …, βK are not known for certain.
ε_i is the 'disturbance': the unobserved random component.
The observed y_i is the sum of the two unobserved parts.
Regression Model: Assumptions About ε_i

Random variable:
(1) The regression function is the mean of y_i for particular values of x_i1, …, x_iK; ε_i is the deviation of y_i from that regression function.
(2) ε_i has mean zero.
(3) ε_i has variance σ².

'Random' noise:
(4) ε_i is unrelated to any of the values x_i1, …, x_iK (no covariance); it is "random noise."
(5) ε_i is unrelated to any other disturbance ε_j (not "autocorrelated").
(6) ε_i is normally distributed, being the sum of many small influences.
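The assumptions above can be illustrated with a small simulation. This is only a sketch: the coefficients, regressor ranges, and σ below are hypothetical values chosen for illustration, not numbers from the lecture's data.

```python
import random

random.seed(42)

# Hypothetical coefficients and noise level (not from the lecture's data)
b0, b1, b2 = 1.0, 0.5, -0.3
sigma = 2.0
n = 10_000

ys, eps = [], []
for _ in range(n):
    x1 = random.uniform(0, 10)             # regressors can be any values
    x2 = random.uniform(0, 10)
    e = random.gauss(0.0, sigma)           # assumptions (2), (3), (6): N(0, sigma^2)
    eps.append(e)
    ys.append(b0 + b1 * x1 + b2 * x2 + e)  # observed y = regression function + disturbance

mean_eps = sum(eps) / n
print(f"mean of disturbances: {mean_eps:.3f}")  # near 0, per assumption (2)
```

Because the disturbances are drawn independently of x1 and x2, assumptions (4) and (5) hold by construction in this simulation.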
Regression model for the U.S. gasoline market: dependent variable y and regressors x1, x2, x3, x4, x5.
Least Squares
An Elaborate Multiple Loglinear Regression Model
The next several slides highlight parts of the same fitted model:
- Specified equation
- Minimized sum of squared residuals
- Least squares coefficients
- N = 52, K = 5
- Standard errors
An Elaborate Multiple Loglinear Regression Model
Confidence intervals: b_k ± t* × SE(b_k)
logIncome: 2.013 ± t*(.1457) = [ to ]
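The interval arithmetic can be sketched directly. The slide pairs 2.013 with a standard error of .1457 for logIncome; the sketch below assumes 2.013 is the estimate b and uses the rule-of-thumb multiplier t* = 2.0 as a stand-in for the exact table value.

```python
# Confidence interval b_k ± t* × SE(b_k).
# Assumption: 2.013 is the logIncome estimate and .1457 its standard error
# (as paired on the slide); t* = 2.0 approximates the exact t table value.
b = 2.013
se = 0.1457
t_star = 2.0

lower = b - t_star * se
upper = b + t_star * se
print(f"approx. 95% CI for logIncome: [{lower:.4f}, {upper:.4f}]")
```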
An Elaborate Multiple Loglinear Regression Model
- t statistics for testing individual slopes = 0
- P values for the individual tests
- Standard error of the regression, s_e
- R²
We used McDonald's Per Capita.

Movie Madness Data (n = 2198)
CRIME is the left-out GENRE. AUSTRIA is the left-out country. Australia and the UK were left out for other reasons (an algebraic problem that arises with only 8 countries).
Use the individual t statistics. t > +2 or t < -2 suggests the variable is "significant." The t for LogPCMacs = [see output]; this is large.
Partial Effect
Hypothesis: If we include the signature effect, size does not explain the sale prices of Monet paintings.
Test: Compute the multiple regression; then test H0: β1 = 0.
α level for the test = 0.05, as usual.
Rejection region: large values of b1 (the coefficient).
Test based on t = b1 / standard error.

Regression Analysis: ln(US$) versus ln(SurfaceArea), Signed
The regression equation is ln(US$) = b0 + b1·ln(SurfaceArea) + b2·Signed (coefficients in the output)
Predictors: Constant, ln(SurfaceArea), Signed (Coef, SE Coef, T, P in the output)
S = [ ]   R-Sq = 46.2%   R-Sq(adj) = 46.0%

Reject H0. Degrees of freedom for the t statistic: N - 3 = N - number of predictors - 1.
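The slope test is just a ratio. A minimal sketch, using hypothetical placeholder values for the coefficient and its standard error since the slide's numeric output is not reproduced here:

```python
def t_stat(b: float, se: float) -> float:
    """t statistic for H0: beta = 0 (df = N - number of predictors - 1)."""
    return b / se

# Hypothetical values for the ln(SurfaceArea) slope in the Monet regression
t = t_stat(b=1.35, se=0.09)
reject = abs(t) > 2.0          # rough two-sided 5% rule for large df
print(f"t = {t:.2f}, reject H0: {reject}")
```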
Model Fit
How well does the model fit the data? R² measures fit: the larger, the better.
- Time series: expect .9 or better.
- Cross sections: it depends.
  Social science data: .1 is good.
  Industry or market data: .5 is routine.
Two Views of R²

Pretty Good Fit: R² = .722
Regression of fuel bill on number of rooms
Testing "The Regression"
Degrees of freedom for the F statistic are K and N - K - 1.
A Formal Test of the Regression Model
Is there a significant "relationship"? Equivalently, is R² > 0, statistically (not just numerically)?
Testing: compute the F statistic, then determine whether F is large using the appropriate table.
n1 = number of predictors
n2 = sample size - number of predictors - 1
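The overall F statistic can be computed from R² alone. A sketch using the gasoline model's reported figures (R² = .96, K = 5, N = 52), under the assumption that those rounded values are exact:

```python
def overall_f(r2: float, k: int, n: int) -> float:
    """F statistic for H0: all slopes = 0, with K and N - K - 1 df."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# Gasoline model figures from the slides (rounded): R^2 = .96, K = 5, N = 52
f = overall_f(0.96, k=5, n=52)
print(f"F(5, 46) = {f:.1f}")
```

An F this large is far beyond any conventional critical value, which is why the slides call the gasoline regression "significant."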
An Elaborate Multiple Loglinear Regression Model
- R²
- Overall F test for the model
- P value for the overall F test
Cost "Function" Regression
The regression is "significant": F is huge. Which variables are significant? Which variables are not significant?
The F Test for the Model
Determine the appropriate critical value from the table. Is the F computed from the model larger than the theoretical F from the table?
Yes: conclude that the relationship is significant.
No: conclude that R² = 0 (statistically).
Compare the Sample F to the Critical F
F = [value] for More Movie Madness. The critical value from the table is [value]. Reject the hypothesis of no relationship.
An Equivalent Approach
What is the "P value"? We observed an F of [value] (or whatever it is). If there really were no relationship, how likely would we be to observe an F this large (or larger)? That probability depends on N and K, and it is reported with the regression results as the P value.
The F Test for More Movie Madness
S = [ ]   R-Sq = 57.0%   R-Sq(adj) = 56.6%

Analysis of Variance
Source          DF   SS   MS   F   P
Regression
Residual Error
Total
What About a Group of Variables?
Is Genre significant? There are 12 genre variables. Some are "significant" (fantasy, mystery, horror); some are not. Can we conclude the group as a whole is? Maybe. We need a test.
Application: Part of a Regression Model
The regression model includes variables x1, x2, …; I am sure of these variables. It may also include variables z1, z2, …; I am not sure of these.
Model: y = β0 + β1·x1 + β2·x2 + δ1·z1 + δ2·z2 + ε
Hypothesis: δ1 = 0 and δ2 = 0.
Strategy: Start with the model including x1 and x2 and compute R². Then fit the model that also includes z1 and z2. Rejection region: R² increases a lot.
Theory for the Test
A larger model has a higher R² than a smaller one. ("Larger" means the model contains all the variables in the smaller one, plus some additional ones.)
Compute this statistic with a calculator.
Test Statistic
F = [(R²_large - R²_small) / J] / [(1 - R²_large) / (N - K_large - 1)],
where J is the number of added variables and K_large is the number of predictors in the larger model. The degrees of freedom are J and N - K_large - 1.
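A sketch of the test statistic for adding a group of variables. The R² values, J, N, and K used here are hypothetical placeholders for illustration, not the lecture's numbers:

```python
def f_added_vars(r2_full: float, r2_small: float, j: int, n: int, k_full: int) -> float:
    """F statistic for H0: the J added coefficients are all zero.
    Degrees of freedom: J (numerator) and N - K_full - 1 (denominator)."""
    return ((r2_full - r2_small) / j) / ((1.0 - r2_full) / (n - k_full - 1))

# Hypothetical illustration: adding J = 2 variables raises R^2 from .50 to .60
# in a sample of N = 53 with K_full = 5 regressors in the larger model.
f = f_added_vars(r2_full=0.60, r2_small=0.50, j=2, n=53, k_full=5)
print(f"F(2, 47) = {f:.3f}")
```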
Gasoline Market
Gasoline Market
Regression Analysis: logG versus logIncome, logPG
The regression equation is logG = b0 + b1·logIncome + b2·logPG (coefficients in the output).
Predictors: Constant, logIncome, logPG (Coef, SE Coef, T, P in the output)
S = [ ]   R-Sq = 93.6%   R-Sq(adj) = 93.4%
Analysis of Variance: Regression, Residual Error, Total (DF, SS, MS, F, P in the output)
R² = [Regression SS] / [Total SS] = .936
Gasoline Market
Regression Analysis: logG versus logIncome, logPG, logPNC, logPUC, logPPT
The regression equation is logG = b0 + b1·logIncome + b2·logPG + b3·logPNC + b4·logPUC + b5·logPPT (coefficients in the output).
Predictors: Constant, logIncome, logPG, logPNC, logPUC, logPPT (Coef, SE Coef, T, P in the output)
S = [ ]   R-Sq = 96.0%   R-Sq(adj) = 95.6%
Analysis of Variance: Regression, Residual Error, Total (DF, SS, MS, F, P in the output)
Now, R² = [Regression SS] / [Total SS] = .960. Previously, R² = .936.
Improvement in R²
Inverse cumulative distribution function: F distribution with 3 DF in the numerator and 46 DF in the denominator; P(X <= x) = 0.95 at x = [critical value].
The null hypothesis is rejected. Notice that none of the three individual variables is "significant" on its own, but the three of them together are.
Is Genre Significant?
Calc -> Probability Distributions -> F…
The critical value shown by Minitab is 1.76.
With the 12 genre indicator variables: R-squared = 57.0%. Without them: R-squared = 55.4%.
The F statistic is [value]; F is greater than the critical value. Reject the hypothesis that all the genre coefficients are zero.
Application
Health satisfaction depends on many factors: age, income, children, education, marital status. Do these factors figure differently in a model for women than in one for men?
Investigation: multiple regression.
Null hypothesis: the regressions are the same.
Rejection region: estimated regressions that are very different.
Equal Regressions
Setting: two groups of observations (men/women, countries, two different periods, firms, etc.).
Regression model: y = β0 + β1·x1 + β2·x2 + … + ε
Hypothesis: the same model applies to both groups.
Rejection region: large values of F.
Procedure: Equal Regressions
There are N1 observations in group 1 and N2 in group 2. There are K variables plus the constant term in the model.
This test requires you to compute three regressions and retain the sum of squared residuals from each:
SS1 = sum of squared residuals from the N1 observations in group 1
SS2 = sum of squared residuals from the N2 observations in group 2
SSALL = sum of squared residuals from the NALL = N1 + N2 observations when the two groups are pooled
F = [(SSALL - SS1 - SS2) / (K + 1)] / [(SS1 + SS2) / (NALL - 2K - 2)]
The hypothesis of equal regressions is rejected if F is larger than the critical value from the F table (K + 1 numerator and NALL - 2K - 2 denominator degrees of freedom).
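The three-regression procedure reduces to one formula. A sketch with made-up sums of squares chosen only for illustration (these are not the health-satisfaction results):

```python
def chow_f(ss1: float, ss2: float, ss_all: float, k: int, n_all: int) -> float:
    """F statistic for equal regressions across two groups.
    Numerator df = K + 1 (K slopes plus the constant);
    denominator df = N_ALL - 2K - 2."""
    num = (ss_all - ss1 - ss2) / (k + 1)
    den = (ss1 + ss2) / (n_all - 2 * k - 2)
    return num / den

# Hypothetical sums of squared residuals (illustrative only)
f = chow_f(ss1=400.0, ss2=500.0, ss_all=960.0, k=5, n_all=1012)
print(f"F(6, 1000) = {f:.2f}")
```

Pooling always raises the residual sum of squares (SSALL >= SS1 + SS2); the test asks whether it rises by more than chance would allow.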
Health Satisfaction Models: Men vs. Women
German survey data over 7 years, 1984 to 1991 (with a gap): 27,326 observations on Health Satisfaction and several covariates.
[Output: coefficient, standard error, t, P value, and mean of X for Constant, AGE, EDUC, HHNINC, HHKIDS, MARRIED, estimated for women (NW = 13,083), for men (NM = 14,243), and for both pooled (NALL = 27,326).]
Computing the F Statistic
[Output for Women, Men, and All: mean and standard deviation of HEALTH, number of observations, number of parameters, degrees of freedom, residual sum of squares, standard error of e, R-squared, and the model F test, with P value .000 in each case.]
A Huge Theorem
R² always goes up when you add variables to your model. Always.
The Adjusted R-Squared
Adjusted R² penalizes your model for obtaining its fit with many variables:
Adjusted R² = 1 - [(N - 1) / (N - K - 1)] × (1 - R²)
Adjusted R² is denoted with its own symbol; it is not the mean of anything, and it is not a square. This is just a name.
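The adjustment is a one-line computation. Below it is checked against the gasoline model's reported figures (N = 52, K = 5, R² = .960, adjusted R² = 95.6%), assuming those rounded values are exact:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - [(N - 1) / (N - K - 1)] * (1 - R^2)."""
    return 1.0 - ((n - 1) / (n - k - 1)) * (1.0 - r2)

# Gasoline model (slides): N = 52, K = 5, R^2 = .960 -> adjusted R^2 ~= .956
adj = adjusted_r2(0.96, n=52, k=5)
print(f"adjusted R^2 = {adj:.3f}")
```

Note the penalty: the larger K is relative to N, the more (1 - R²) is inflated before subtraction, so adding weak variables can lower adjusted R² even though R² itself rises.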
An Elaborate Multiple Loglinear Regression Model: Adjusted R²
Adjusted R² for More Movie Madness
S = [ ]   R-Sq = 57.0%   R-Sq(adj) = 56.6%
Analysis of Variance: Regression, Residual Error, Total (DF, SS, MS, F, P in the output)
If N is very large, R² and adjusted R² will not differ by very much. N = 2,198 is quite large for this purpose.