Statistics for Business and Economics Chapter 11 Multiple Regression and Model Building
Learning Objectives
1. Explain the Linear Multiple Regression Model
2. Describe Inference About Individual Parameters
3. Test Overall Significance
4. Explain Estimation and Prediction
5. Describe Various Types of Models
6. Describe Model Building
7. Explain Residual Analysis
8. Describe Regression Pitfalls
Types of Regression Models [Diagram: regression models are Simple (1 explanatory variable) or Multiple (2+ explanatory variables); each type is either Linear or Non-Linear]
Models With Two or More Quantitative Variables
Multiple Regression Model
General form: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$
k independent variables
$x_1, x_2, \dots, x_k$ may be functions of variables, e.g. $x_2 = (x_1)^2$
Regression Modeling Steps
1. Hypothesize deterministic component
2. Estimate unknown model parameters
3. Specify probability distribution of random error term; estimate standard deviation of error
4. Evaluate model
5. Use model for prediction and estimation
Probability Distribution of Random Error
Assumptions for Probability Distribution of ε
1. Mean is 0
2. Constant variance, $\sigma^2$
3. Normally distributed
4. Errors are independent
Linear Multiple Regression Model
Types of Regression Models [Diagram, by explanatory variable: 1 Quantitative Variable (1st Order Model, 2nd Order Model, 3rd Order Model); 2 or More Quantitative Variables (1st Order Model, Interaction Model, 2nd Order Model); 1 Qualitative Variable (Dummy Variable Model)]
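First-Order Multiple Regression Model
Relationship between 1 dependent and 2 or more independent variables is a linear function:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$
Here y is the dependent (response) variable, $x_1, \dots, x_k$ are the independent (explanatory) variables, $\beta_0$ is the population y-intercept, $\beta_1, \dots, \beta_k$ are the population slopes, and $\varepsilon$ is the random error.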
First-Order Model With 2 Independent Variables
Relationship between 1 dependent and 2 independent variables is a linear function
Model: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
Assumes no interaction between $x_1$ and $x_2$: the effect of $x_1$ on E(y) is the same regardless of $x_2$ values
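Population Multiple Regression Model, bivariate model: $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$ [Figure: response plane over the $(x_1, x_2)$ plane, with the observed $y_i$ plotted at $(x_{1i}, x_{2i})$]
Sample Multiple Regression Model, bivariate model: $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \hat{\varepsilon}_i$ [Figure: estimated response plane over the $(x_1, x_2)$ plane, with the observed $y_i$ plotted at $(x_{1i}, x_{2i})$]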
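No Interaction
Effect (slope) of $x_1$ on E(y) does not depend on $x_2$ value
$E(y) = 1 + 2x_1 + 3x_2$
$x_2 = 0$: $E(y) = 1 + 2x_1 + 3(0) = 1 + 2x_1$
$x_2 = 1$: $E(y) = 1 + 2x_1 + 3(1) = 4 + 2x_1$
$x_2 = 2$: $E(y) = 1 + 2x_1 + 3(2) = 7 + 2x_1$
$x_2 = 3$: $E(y) = 1 + 2x_1 + 3(3) = 10 + 2x_1$
[Figure: E(y) against $x_1$, four parallel lines, one per $x_2$ value]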
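As a quick numerical check of the no-interaction property, the sketch below evaluates the slide's example surface E(y) = 1 + 2x1 + 3x2 at several x2 values. The function names are illustrative and not part of the slides.

```python
def expected_y(x1, x2):
    """E(y) = 1 + 2*x1 + 3*x2, the first-order example surface."""
    return 1 + 2 * x1 + 3 * x2


def slope_in_x1(x2):
    """Change in E(y) for a one-unit increase in x1 at a fixed x2."""
    return expected_y(1, x2) - expected_y(0, x2)


# Only the intercept moves with x2; the slope in x1 stays the same.
for x2 in (0, 1, 2, 3):
    print(f"x2 = {x2}: intercept = {expected_y(0, x2)}, slope = {slope_in_x1(x2)}")
```

Parallel lines in the plot correspond to the identical slope printed on every row.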
Parameter Estimation
First-Order Model Worksheet: one row per case i, with columns $y_i$, $x_{1i}$, $x_{2i}$. Run regression with y, $x_1$, $x_2$.
Multiple Linear Regression Equations Too complicated by hand! Ouch!
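Software does the work that is too tedious by hand. A minimal sketch of what it computes is shown below: the least squares estimates solve the normal equations (X'X)b = X'y. The function name and the data are hypothetical (the data are not the ad-response data), and real statistical packages use more numerically stable methods than plain elimination.

```python
def fit_least_squares(rows, y):
    """Return [b0, b1, ..., bk] minimizing SSE for y = b0 + b1*x1 + ... + bk*xk."""
    X = [[1.0] + [float(v) for v in row] for row in rows]  # intercept column
    p = len(X[0])
    # Augmented matrix of the normal equations: [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
y = [1, 3, 4, 6, 8, 12]
coefficients = fit_least_squares(rows, y)
print([round(b, 6) for b in coefficients])
```

Because the hypothetical y values contain no error, the fit recovers the generating coefficients; with real data the estimates would differ from the population values.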
Interpretation of Estimated Coefficients
1. Slope ($\hat{\beta}_k$): estimated y changes by $\hat{\beta}_k$ for each 1-unit increase in $x_k$, holding all other variables constant. Example: if $\hat{\beta}_1 = 2$, then sales (y) is expected to increase by 2 for each 1-unit increase in advertising ($x_1$), given the number of sales reps ($x_2$).
2. Y-intercept ($\hat{\beta}_0$): average value of y when all $x_k = 0$.
1st Order Model Example
You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.) and newspaper circulation (000) on the number of ad responses (00). Estimate the unknown parameters. You've collected the following data: [Data table with columns Resp (y), Size ($x_1$), Circ ($x_2$)]
Parameter Estimation Computer Output [Parameter Estimates table with columns Variable, DF, Parameter Estimate, Standard Error, T for H0: Param=0, Prob>|T|, and rows INTERCEP ($\hat{\beta}_0$), ADSIZE ($\hat{\beta}_1$), CIRC ($\hat{\beta}_2$)]
Interpretation of Coefficients Solution
1. Slope ($\hat{\beta}_1$): number of responses to ad is expected to increase by .2049 hundred (20.49) for each 1 sq. in. increase in ad size, holding circulation constant.
2. Slope ($\hat{\beta}_2$): number of responses to ad is expected to increase by .2805 hundred (28.05) for each 1 unit (1,000) increase in circulation, holding ad size constant.
Estimation of $\sigma^2$
Estimation of $\sigma^2$
For a model with k independent variables:
$s^2 = \frac{SSE}{n - (k + 1)}$, $s = \sqrt{s^2}$
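The usual estimator divides SSE by the error degrees of freedom, n − (k + 1), and s is its square root. A minimal sketch follows; the function name and the observed and fitted values are hypothetical, not taken from the ad-response example.

```python
import math


def error_variance(y, y_hat, k):
    """Return (SSE, s2, s) for a model with k independent variables."""
    n = len(y)
    df = n - (k + 1)  # error degrees of freedom
    if df <= 0:
        raise ValueError("need more than k + 1 observations")
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    s2 = sse / df
    return sse, s2, math.sqrt(s2)


# Hypothetical observed and fitted values for a model with k = 2.
y = [2.0, 5.0, 3.0, 6.0, 4.0, 7.0]
y_hat = [2.5, 4.5, 3.0, 6.0, 4.5, 6.5]
sse, s2, s = error_variance(y, y_hat, k=2)
print(f"SSE = {sse:.4f}, s^2 = {s2:.4f}, s = {s:.4f}")
```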
Calculating $s^2$ and s Example
You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.), $x_1$, and newspaper circulation (000), $x_2$, on the number of ad responses (00), y. Find SSE, $s^2$, and s.
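Analysis of Variance Computer Output [Analysis of Variance table with columns Source, DF, SS, MS, F, P and rows Regression, Residual Error, Total. SSE is the SS entry of the Residual Error row; $s^2$ is the MS entry of the same row]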
Evaluating the Model
Evaluating Multiple Regression Model Steps
1. Examine variation measures
2. Test parameter significance: individual coefficients, overall model
3. Do residual analysis
Variation Measures
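Multiple Coefficient of Determination
$R^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}$
Proportion of variation in y 'explained' by all x variables taken together
Never decreases when a new x variable is added to the model: only the y values determine $SS_{yy}$, and SSE cannot increase when a term is added
This is a disadvantage when comparing models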
Adjusted Multiple Coefficient of Determination
$R_a^2 = 1 - \left[\frac{n - 1}{n - (k + 1)}\right](1 - R^2)$
Takes into account n and number of parameters
Similar interpretation to $R^2$
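A short sketch of both measures, using the standard definitions R² = 1 − SSE/SS_yy and R_a² = 1 − [(n − 1)/(n − (k + 1))](1 − R²). The function names and the numbers are hypothetical, not the ad-response results.

```python
def total_sum_of_squares(y):
    """SS_yy: squared deviations of y about its mean."""
    y_bar = sum(y) / len(y)
    return sum((yi - y_bar) ** 2 for yi in y)


def r_squared(sse, ss_yy):
    """Proportion of the variation in y explained by the model."""
    return 1.0 - sse / ss_yy


def adjusted_r_squared(r2, n, k):
    """R^2 penalized for the number of model terms k given sample size n."""
    return 1.0 - (n - 1) / (n - (k + 1)) * (1.0 - r2)


# Hypothetical responses and a hypothetical SSE from a model with k = 2.
y = [2.0, 5.0, 3.0, 6.0, 4.0, 7.0]
ss_yy = total_sum_of_squares(y)
r2 = r_squared(sse=1.0, ss_yy=ss_yy)
r2_adj = adjusted_r_squared(r2, n=len(y), k=2)
print(f"SS_yy = {ss_yy:.2f}, R^2 = {r2:.4f}, adjusted R^2 = {r2_adj:.4f}")
```

The adjusted value is never larger than R², and the gap widens as k grows relative to n.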
Estimation of $R^2$ and $R_a^2$ Example
You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.), $x_1$, and newspaper circulation (000), $x_2$, on the number of ad responses (00), y. Find $R^2$ and $R_a^2$.
Excel Computer Output Solution [Excel regression output with $R^2$ and $R_a^2$ marked]
Testing Parameters
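Inference for an Individual β Parameter
Confidence interval: $\hat{\beta}_i \pm t_{\alpha/2}\, s_{\hat{\beta}_i}$
Hypothesis test: $H_0$: $\beta_i = 0$ against $H_a$: $\beta_i \neq 0$ (or $< 0$, or $> 0$)
Test statistic: $t = \hat{\beta}_i / s_{\hat{\beta}_i}$
df = n − (k + 1)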
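The sketch below applies the usual interval (estimate ± t critical value × standard error) and test statistic (estimate ÷ standard error). The estimate and standard error are hypothetical; the critical value 3.182 is the t-table entry for an upper-tail area of .025 with 3 degrees of freedom, supplied by hand rather than computed.

```python
def t_statistic(estimate, std_error):
    """Test statistic for H0: beta_i = 0."""
    return estimate / std_error


def confidence_interval(estimate, std_error, t_crit):
    """Interval estimate +/- t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


# Hypothetical coefficient estimate and standard error (n = 6, k = 2, df = 3).
b1, se_b1 = 0.50, 0.10
t_crit = 3.182  # t table: upper-tail area .025, 3 df
t = t_statistic(b1, se_b1)
lower, upper = confidence_interval(b1, se_b1, t_crit)
print(f"t = {t:.2f}, 95% CI = ({lower:.4f}, {upper:.4f})")
print("Reject H0 (two-tailed)" if abs(t) > t_crit else "Do not reject H0")
```

An interval that excludes 0 leads to the same decision as the two-tailed test at the matching α.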
Confidence Interval Example
You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.), $x_1$, and newspaper circulation (000), $x_2$, on the number of ad responses (00), y. Find a 95% confidence interval for $\beta_1$.
Excel Computer Output Solution
Confidence Interval Solution
Hypothesis Test Example
You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.), $x_1$, and newspaper circulation (000), $x_2$, on the number of ad responses (00), y. Test the hypothesis that the mean ad response increases as circulation increases (ad size constant). Use $\alpha = .05$.
Excel Computer Output Solution
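Hypothesis Test Solution
$H_0$: $\beta_2 = 0$
$H_a$: $\beta_2 > 0$
$\alpha = .05$
df = n − (k + 1) = 3
Critical Value(s): [Figure: t distribution with the Reject $H_0$ region in the upper tail]
Test Statistic: $t = \hat{\beta}_2 / s_{\hat{\beta}_2}$, read from the computer output
Decision: Reject at $\alpha = .05$
Conclusion: There is evidence the mean ad response increases as circulation increases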
Excel Computer Output Solution P–Value
Testing Overall Significance
Shows if there is a linear relationship between all x variables together and y
Hypotheses
$H_0$: $\beta_1 = \beta_2 = \dots = \beta_k = 0$ (no linear relationship)
$H_a$: At least one coefficient is not 0 (at least one x variable affects y)
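Testing Overall Significance
Test Statistic: $F = \frac{MS(Model)}{MS(Error)} = \frac{(SS_{yy} - SSE)/k}{SSE/[n - (k + 1)]} = \frac{R^2/k}{(1 - R^2)/[n - (k + 1)]}$
Degrees of Freedom: $\nu_1 = k$, $\nu_2 = n - (k + 1)$
k = number of independent variables
n = sample size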
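The global F statistic can be computed either from the sums of squares or from R²; the two forms are algebraically the same. The sketch below checks that on hypothetical numbers (not the ad-response output); the function names are illustrative.

```python
def global_f_from_ss(ss_yy, sse, n, k):
    """F = MS(Model) / MS(Error) with k and n - (k + 1) degrees of freedom."""
    ms_model = (ss_yy - sse) / k
    ms_error = sse / (n - (k + 1))
    return ms_model / ms_error


def global_f_from_r2(r2, n, k):
    """Same statistic written in terms of R^2."""
    return (r2 / k) / ((1.0 - r2) / (n - (k + 1)))


# Hypothetical values: SS_yy = 17.5, SSE = 1.0, n = 6 cases, k = 2 variables.
ss_yy, sse, n, k = 17.5, 1.0, 6, 2
f_ss = global_f_from_ss(ss_yy, sse, n, k)
f_r2 = global_f_from_r2(1.0 - sse / ss_yy, n, k)
print(f"F from sums of squares = {f_ss:.2f}, F from R^2 = {f_r2:.2f}")
print(f"degrees of freedom: {k} and {n - (k + 1)}")
```

The computed F is then compared with the F-table critical value for the stated degrees of freedom, or the p-value in the output is compared with α.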
Testing Overall Significance Example
You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.), $x_1$, and newspaper circulation (000), $x_2$, on the number of ad responses (00), y. Conduct the global F-test of model usefulness. Use $\alpha = .05$.
Testing Overall Significance Computer Output [Analysis of Variance table with columns Source, DF, Sum of Squares, Mean Square, F Value, Prob>F and rows Model (DF = k), Error (DF = n − (k + 1)), C Total. F Value = MS(Model) / MS(Error)]
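Testing Overall Significance Solution
$H_0$: $\beta_1 = \beta_2 = 0$
$H_a$: At least 1 not zero
$\alpha = .05$
$\nu_1 = k = 2$, $\nu_2 = n - (k + 1) = 3$
Critical Value(s): [Figure: F distribution with the rejection region in the upper tail]
Test Statistic: F = MS(Model) / MS(Error), read from the computer output
Decision: Reject at $\alpha = .05$
Conclusion: There is evidence at least 1 of the coefficients is not zero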
Testing Overall Significance Computer Output Solution [Analysis of Variance table with columns Source, DF, Sum of Squares, Mean Square, F Value, Prob>F and rows Model, Error, C Total. F Value = MS(Model) / MS(Error); Prob>F is the p-value of the global F-test]
Interaction Models
Interaction Model With 2 Independent Variables Hypothesizes interaction between pairs of x variables —Response to one x variable varies at different levels of another x variable Contains two-way cross product terms Can be combined with other models —Example: dummy-variable model
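Effect of Interaction
Given: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
Without interaction term, effect of $x_1$ on y is measured by $\beta_1$
With interaction term, effect of $x_1$ on y is measured by $\beta_1 + \beta_3 x_2$
Effect increases as $x_2$ increases (when $\beta_3 > 0$)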
Interaction Model Relationships
Effect (slope) of $x_1$ on E(y) depends on $x_2$ value
$E(y) = 1 + 2x_1 + 3x_2 + 4x_1 x_2$
$x_2 = 0$: $E(y) = 1 + 2x_1 + 3(0) + 4x_1(0) = 1 + 2x_1$
$x_2 = 1$: $E(y) = 1 + 2x_1 + 3(1) + 4x_1(1) = 4 + 6x_1$
[Figure: E(y) against $x_1$, two non-parallel lines, one per $x_2$ value]
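A small sketch of the slide's example surface E(y) = 1 + 2x1 + 3x2 + 4x1x2 shows the slope in x1 changing with x2, together with the worksheet step of multiplying x1 by x2 to create the cross-product column. The function names and the worksheet rows are illustrative assumptions.

```python
def expected_y(x1, x2):
    """E(y) = 1 + 2*x1 + 3*x2 + 4*x1*x2, the interaction example surface."""
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2


def slope_in_x1(x2):
    """Slope of E(y) in x1 at a fixed x2: beta1 + beta3 * x2 = 2 + 4*x2."""
    return 2 + 4 * x2


def add_interaction_column(rows):
    """Append the cross-product x1*x2 to each (x1, x2) worksheet row."""
    return [(x1, x2, x1 * x2) for x1, x2 in rows]


for x2 in (0, 1):
    print(f"x2 = {x2}: intercept = {expected_y(0, x2)}, slope = {slope_in_x1(x2)}")

# Hypothetical worksheet rows (x1, x2) with the added x1*x2 column.
worksheet = add_interaction_column([(1, 2), (3, 1), (5, 7)])
print(worksheet)
```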
Interaction Model Worksheet: one row per case i, with columns $y_i$, $x_{1i}$, $x_{2i}$, $x_{1i} x_{2i}$. Multiply $x_1$ by $x_2$ to get $x_1 x_2$. Run regression with y, $x_1$, $x_2$, $x_1 x_2$.
Interaction Example
You work in advertising for the New York Times. You want to find the effect of ad size (sq. in.), $x_1$, and newspaper circulation (000), $x_2$, on the number of ad responses (00), y. Conduct a test for interaction. Use $\alpha = .05$.
Excel Computer Output Solution: the global F-test (F statistic and its p-value in the output) indicates at least one parameter is not zero
Excel Computer Output Solution
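Interaction Test Solution
$H_0$: $\beta_3 = 0$
$H_a$: $\beta_3 \neq 0$
$\alpha = .05$
df = n − (k + 1) = n − 4
Critical Value(s): [Figure: t distribution with Reject $H_0$ regions in both tails]
Test Statistic: $t = \hat{\beta}_3 / s_{\hat{\beta}_3}$, read from the computer output
Decision: Do not reject at $\alpha = .05$
Conclusion: There is no evidence of interaction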
Second–Order Models
Second-Order Model With 1 Independent Variable
Relationship between 1 dependent and 1 independent variable is a quadratic function
Useful 1st model if non-linear relationship suspected
Model: $E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$, where $\beta_1 x$ is the linear effect and $\beta_2 x^2$ is the curvilinear effect
Second-Order Model Relationships [Figure: four panels of y against $x_1$ showing curves with $\beta_2 > 0$ and curves with $\beta_2 < 0$]
Second-Order Model Worksheet: one row per case i, with columns $y_i$, $x_i$, $x_i^2$. Create $x^2$ column. Run regression with y, x, $x^2$.
2nd Order Model Example
The data show the number of weeks employed and the number of errors made per day for a sample of assembly line workers. Find a 2nd order model, conduct the global F-test, and test if $\beta_2 \neq 0$. Use $\alpha = .05$ for all tests. [Data table with columns Errors (y), Weeks (x)]
Excel Computer Output Solution
Overall Model Test Solution: the global F-test (F statistic and its p-value in the output) indicates at least one parameter is not zero
$\beta_2$ Parameter Test Solution: the t-test for $\beta_2$ (t statistic and its p-value in the output) indicates a curvilinear relationship exists
Second-Order Model With 2 Independent Variables
Relationship between 1 dependent and 2 independent variables is a quadratic function
Useful 1st model if non-linear relationship suspected
Model: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$
Second-Order Model Relationships [Figure: three response surfaces of y over the $(x_1, x_2)$ plane, labeled $\beta_4 + \beta_5 > 0$, $\beta_4 + \beta_5 < 0$, and $\beta_3^2 > 4\beta_4\beta_5$]
Second-Order Model Worksheet: one row per case i, with columns $y_i$, $x_{1i}$, $x_{2i}$, $x_{1i} x_{2i}$, $x_{1i}^2$, $x_{2i}^2$. Multiply $x_1$ by $x_2$ to get $x_1 x_2$; then create $x_1^2$, $x_2^2$. Run regression with y, $x_1$, $x_2$, $x_1 x_2$, $x_1^2$, $x_2^2$.
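The worksheet step can be sketched as a small helper that expands each (x1, x2) case into the five second-order predictor columns. The function names and the sample cases are hypothetical.

```python
def second_order_row(x1, x2):
    """Predictor columns for one case: x1, x2, x1*x2, x1^2, x2^2."""
    return [x1, x2, x1 * x2, x1 ** 2, x2 ** 2]


def second_order_worksheet(cases):
    """Expand every (x1, x2) case into its second-order predictor row."""
    return [second_order_row(x1, x2) for x1, x2 in cases]


# Hypothetical cases; the expanded rows are what the regression is run on.
cases = [(2, 3), (1, 4), (0, 5)]
for row in second_order_worksheet(cases):
    print(row)
```

The regression is still linear in the β parameters; only the predictor columns are non-linear functions of x1 and x2.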
Models With One Qualitative Independent Variable
Dummy-Variable Model Involves categorical x variable with 2 levels —e.g., male-female; college-no college Variable levels coded 0 and 1 Number of dummy variables is 1 less than number of levels of variable May be combined with quantitative variable (1 st order or 2 nd order model)
Dummy-Variable Model Worksheet: one row per case i, with columns $y_i$, $x_{1i}$, $x_{2i}$. $x_2$ levels: 0 = Group 1; 1 = Group 2. Run regression with y, $x_1$, $x_2$.
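Interpreting Dummy-Variable Model Equation
Given: $E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, where y = starting salary of college graduates, $x_1$ = GPA, and $x_2$ = 0 if Male, 1 if Female
Male ($x_2 = 0$): $E(y) = \beta_0 + \beta_1 x_1 + \beta_2(0) = \beta_0 + \beta_1 x_1$
Female ($x_2 = 1$): $E(y) = \beta_0 + \beta_1 x_1 + \beta_2(1) = (\beta_0 + \beta_2) + \beta_1 x_1$
Same slopes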
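To see the intercept shift numerically, the sketch below codes the qualitative variable as 0/1 and evaluates a first-order dummy-variable model. The coefficient values are hypothetical and chosen only for illustration; they are not estimates from the slides.

```python
def encode_sex(label):
    """Code the two-level qualitative variable: 0 if Male, 1 if Female."""
    return {"male": 0, "female": 1}[label.lower()]


def expected_salary(gpa, sex, b0, b1, b2):
    """E(y) = b0 + b1*GPA + b2*x2, where x2 is the 0/1 dummy variable."""
    return b0 + b1 * gpa + b2 * encode_sex(sex)


# Hypothetical coefficients (illustration only).
b0, b1, b2 = 20.0, 5.0, 2.0
for gpa in (2.0, 3.5):
    male = expected_salary(gpa, "Male", b0, b1, b2)
    female = expected_salary(gpa, "Female", b0, b1, b2)
    print(f"GPA {gpa}: male = {male}, female = {female}, gap = {female - male}")
```

The gap equals b2 at every GPA, which is what "same slopes" means: the two lines differ only in intercept.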
Dummy-Variable Model Example
Computer output: [fitted equation], with $x_2$ = 0 if Male, 1 if Female
Male ($x_2 = 0$) and Female ($x_2 = 1$): [fitted lines], same slopes
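Dummy-Variable Model Relationships [Figure: y against $x_1$ with two parallel fitted lines (same slopes): Male with intercept $\hat{\beta}_0$ and Female with intercept $\hat{\beta}_0 + \hat{\beta}_2$]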
Nested Models
Comparing Nested Models
The reduced model contains a subset of the terms in the complete (full) model
Tests the contribution of a set of x variables to the relationship with y
Null hypothesis $H_0$: $\beta_{g+1} = \dots = \beta_k = 0$, i.e. the variables in the set do not significantly improve the model when all other variables are included
Used in selecting x variables or models
Part of most computer programs
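The slide does not show the test statistic, so the sketch below assumes the standard partial F statistic for nested models: the drop in SSE from the reduced model (g terms) to the complete model (k terms), per added term, divided by the complete model's mean square error. All numbers are hypothetical.

```python
def nested_f(sse_reduced, sse_complete, n, k, g):
    """Partial F statistic for H0: beta_{g+1} = ... = beta_k = 0.

    Degrees of freedom are (k - g) and n - (k + 1).
    """
    if sse_reduced < sse_complete:
        raise ValueError("the reduced model cannot have the smaller SSE")
    numerator = (sse_reduced - sse_complete) / (k - g)
    denominator = sse_complete / (n - (k + 1))
    return numerator / denominator


# Hypothetical fits: reduced model with g = 2 terms, complete model with k = 4.
f_value = nested_f(sse_reduced=30.0, sse_complete=20.0, n=25, k=4, g=2)
print(f"F = {f_value:.2f} with 2 and 20 degrees of freedom")
```

A large F suggests that at least one of the added terms contributes to the model; the cutoff comes from an F table or the p-value in software output.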
Selecting Variables in Model Building
A butterfly flaps its wings in Japan, which causes it to rain in Nebraska. -- Anonymous
Use Theory Only! Use Computer Search!
Model Building with Computer Searches
Rule: Use as few x variables as possible
Stepwise Regression: computer selects the x variable most highly correlated with y, then continues to add or remove variables depending on SSE
Best subset approach: computer examines all possible sets
Residual Analysis
Residual Analysis
Graphical analysis of residuals
Plot estimated errors versus $x_i$ values. Estimated errors, the differences between actual $y_i$ and predicted $y_i$, are called residuals
Plot histogram or stem-and-leaf display of residuals
Purposes
Examine functional form (linear vs. non-linear model)
Evaluate violations of assumptions
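A minimal sketch of computing the residuals that these plots use. Dividing each residual by s is a simple standardization assumed here for illustration; the "Student" residuals reported by statistical software also adjust for each case's leverage, so they will not match exactly. The data and names are hypothetical.

```python
import math


def residuals(y, y_hat):
    """Estimated errors: actual y minus predicted y for each case."""
    return [yi - fi for yi, fi in zip(y, y_hat)]


def standardized_residuals(res, k):
    """Residuals divided by s, where s^2 = SSE / (n - (k + 1))."""
    n = len(res)
    s = math.sqrt(sum(e ** 2 for e in res) / (n - (k + 1)))
    return [e / s for e in res]


# Hypothetical observed and fitted values from a model with k = 2.
y = [2.0, 5.0, 3.0, 6.0, 4.0, 7.0]
y_hat = [2.5, 4.5, 3.0, 6.0, 4.5, 6.5]
res = residuals(y, y_hat)
z = standardized_residuals(res, k=2)
for i, (e, zi) in enumerate(zip(res, z), start=1):
    print(f"case {i}: residual = {e:+.2f}, standardized = {zi:+.3f}")
```

These values would then be plotted against each x variable, or in a histogram, to look for the patterns described above.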
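Residual Plot for Functional Form [Figure: two plots of residuals $\hat{e}$ against x. Left panel: Add $x^2$ Term. Right panel: Correct Specification]
Residual Plot for Equal Variance [Figure: two plots of residuals $\hat{e}$ against x. Left panel: Unequal Variance, fan-shaped. Right panel: Correct Specification] Standardized residuals are typically used.
Residual Plot for Independence [Figure: two plots of residuals $\hat{e}$ against x. Left panel: Not Independent. Right panel: Correct Specification] Plots reflect the sequence in which the data were collected.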
Residual Analysis Computer Output [Listing with columns Obs, Dep Var SALES, Predict Value, Residual, Student Residual, followed by a character plot of the standardized (student) residuals]
Regression Pitfalls
Parameter Estimability: number of different x-values must be at least one more than order of model
Multicollinearity: two or more x-variables in the model are correlated
Extrapolation: predicting y-values outside sampled range
Correlated Errors
Multicollinearity High correlation between x variables Coefficients measure combined effect Leads to unstable coefficients depending on x variables in model Always exists – matter of degree Example: using both age and height as explanatory variables in same model
Detecting Multicollinearity
Significant correlations between pairs of x variables, especially when they are stronger than the correlations with the y variable
Non-significant t-tests for most of the individual parameters, even though the overall model test is significant
Estimated parameters have the wrong sign
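The first check can be sketched by computing pairwise correlations: compare the correlation between two x variables with each one's correlation with y. The sketch uses the Pearson correlation coefficient, and the age, height, and response values are hypothetical, echoing the age and height example.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)


# Hypothetical data: two explanatory variables and a response.
age = [6, 8, 10, 12, 14]
height = [115, 128, 138, 150, 163]
response = [20, 31, 29, 45, 41]

r_x1x2 = pearson_r(age, height)
print(f"r(age, height)      = {r_x1x2:.3f}")
print(f"r(age, response)    = {pearson_r(age, response):.3f}")
print(f"r(height, response) = {pearson_r(height, response):.3f}")
```

Pairwise correlations are only a first screen; multicollinearity involving three or more x variables can go unnoticed in them.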
Solutions to Multicollinearity Eliminate one or more of the correlated x variables Avoid inference on individual parameters Do not extrapolate
Extrapolation [Figure: y against x, with interpolation inside the sampled range of x and extrapolation on either side of it]
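A simple guard against extrapolation is to check each new x value against the range seen in the sample before predicting. The sketch below is a hypothetical helper; with several x variables, a point can lie inside every individual range and still fall outside the region the sample jointly covers, so this check is necessary rather than sufficient.

```python
def outside_sampled_range(x_new, sampled_rows):
    """Return indices of x variables whose new value lies outside the sample range."""
    flagged = []
    for j, value in enumerate(x_new):
        column = [row[j] for row in sampled_rows]
        if not (min(column) <= value <= max(column)):
            flagged.append(j)
    return flagged


# Hypothetical sampled (x1, x2) rows and two candidate prediction points.
sampled = [(1, 2), (8, 8), (3, 1), (5, 7)]
print(outside_sampled_range((4, 5), sampled))   # inside both ranges
print(outside_sampled_range((12, 5), sampled))  # x1 beyond the sampled range
```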
Conclusion
1. Explained the Linear Multiple Regression Model
2. Described Inference About Individual Parameters
3. Tested Overall Significance
4. Explained Estimation and Prediction
5. Described Various Types of Models
6. Described Model Building
7. Explained Residual Analysis
8. Described Regression Pitfalls