1 Multiple Regression

2 Model There are many explanatory variables (independent variables) x1, x2, …, xp that are linearly related to the response variable (dependent variable) Y. Y ~ N(μ, σ), where μ = β0 + β1x1 + β2x2 + … + βpxp. Interpretation of βi: when xi changes by one unit, the mean of Y changes by βi units, given that all other x-variables are held constant.

3 The model We have p independent variables that are related to the dependent variable: Y = β0 + β1x1 + β2x2 + … + βpxp + ε. Here Y is the dependent variable, x1, …, xp are the independent variables, β0, …, βp are the parameters, and ε is the random error term.

4 Multiple regression demonstration Linear regression model with one independent variable x: Y = β0 + β1x + ε, with E(Y) = β0 + β1x. In the multiple linear regression model there are several independent variables, e.g. Y = β0 + β1x1 + β2x2 + ε, with E(Y) = β0 + β1x1 + β2x2. The regression line becomes a plane.

5 Assumptions in regression 1. There is a multiple linear relation between the x-variables and the Y-variable. 2. The observations are independent of each other. 3. The variance around the hyperplane is the same for all combinations of x-values. 4. The variance around the hyperplane can be modeled with a normal distribution.

6 Estimation of parameters and evaluation of the model. Work order: –Estimate the parameters with some computer program (SPSS, SAS, R, Minitab, …). –Check whether the model assumptions are fulfilled by studying the residuals. –Check how well the model fits the data. –If the model assumptions are fulfilled and the model is acceptable to use, then the parameter estimates can be interpreted and the model can be used for predictions.

7 Example –“Toulon Theatres” advertises in newspapers and on television. –Question: we want to understand which kind of advertisement is the better investment. –During some randomly chosen weeks we observe how much is spent on advertising (TVAdv, NewsAdv) and the revenue (Revenue). Model: Revenue = β0 + β1·TVAdv + β2·NewsAdv + ε

8 Data material

Revenue  TVAdv  NewsAdv
96       5,0    1,5
90       2,0    2,0
95       4,0    1,5
92       2,5    2,5
95       3,0    3,3
94       3,5    2,3
94       2,5    4,2
94       3,0    2,5

The numbers are in 1000 Euro.

9 Minitab print-out

Regression Analysis: Revenue versus TVAdv; NewsAdv

The regression equation is
Revenue = 83,2 + 2,29 TVAdv + 1,30 NewsAdv

Predictor     Coef  SE Coef      T      P
Constant    83,230    1,574  52,88  0,000
TVAdv       2,2902   0,3041   7,53  0,001
NewsAdv     1,3010   0,3207   4,06  0,010

S = 0,6426   R-Sq = 91,9%   R-Sq(adj) = 88,7%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       2  23,435  11,718  28,38  0,002
Residual Error   5   2,065   0,413
Total            7  25,500
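The least-squares fit can be reproduced outside Minitab. A minimal sketch in Python with NumPy (not part of the original slides), using the eight observations from the data slide; the two cells that are garbled in the transcript are read as 2,0 and 2,5:

```python
import numpy as np

# The eight observations from the data slide (all in 1000 Euro)
revenue = np.array([96.0, 90.0, 95.0, 92.0, 95.0, 94.0, 94.0, 94.0])
tv_adv = np.array([5.0, 2.0, 4.0, 2.5, 3.0, 3.5, 2.5, 3.0])
news_adv = np.array([1.5, 2.0, 1.5, 2.5, 3.3, 2.3, 4.2, 2.5])

# Design matrix: a column of ones for the intercept, then the x-variables
X = np.column_stack([np.ones(len(revenue)), tv_adv, news_adv])

# Ordinary least squares estimates b0, b1, b2
b, residuals, rank, _ = np.linalg.lstsq(X, revenue, rcond=None)
print(b)  # roughly [83.23, 2.29, 1.30], matching the Minitab output
```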

10 Minitab print-out

11 Assumptions ok? Residual plots are used to check the assumptions: –Normal distribution. –Constant variance. –Independence (if the observations are in time order).

12 Coefficient of determination, R². R² = SSR/SST = 23,435/25,500 = 91,9% (R-Sq in the print-out below).

Regression Analysis: Revenue versus TVAdv; NewsAdv

The regression equation is
Revenue = 83,2 + 2,29 TVAdv + 1,30 NewsAdv

Predictor     Coef  SE Coef      T      P
Constant    83,230    1,574  52,88  0,000
TVAdv       2,2902   0,3041   7,53  0,001
NewsAdv     1,3010   0,3207   4,06  0,010

S = 0,6426   R-Sq = 91,9%   R-Sq(adj) = 88,7%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       2  23,435  11,718  28,38  0,002
Residual Error   5   2,065   0,413
Total            7  25,500
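The R-Sq value comes straight from the ANOVA table: the regression sum of squares divided by the total sum of squares. A quick check in Python:

```python
# R-squared from the ANOVA table: share of total variation explained
ssr = 23.435  # regression sum of squares
sst = 25.500  # total sum of squares
r_sq = ssr / sst
print(round(100 * r_sq, 1))  # 91.9, i.e. R-Sq = 91,9%
```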

13 Adjusted coefficient of determination, adj R². Adjusted R² is used to compare the coefficient of determination between models with different numbers of x-variables. It can be shown that R² always increases if we add more x-variables, but adj R² decreases if the new x-variable is only weakly related to Y.

14

Regression Analysis: Revenue versus TVAdv; NewsAdv

The regression equation is
Revenue = 83,2 + 2,29 TVAdv + 1,30 NewsAdv

Predictor     Coef  SE Coef      T      P
Constant    83,230    1,574  52,88  0,000
TVAdv       2,2902   0,3041   7,53  0,001
NewsAdv     1,3010   0,3207   4,06  0,010

S = 0,6426   R-Sq = 91,9%   R-Sq(adj) = 88,7%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       2  23,435  11,718  28,38  0,002
Residual Error   5   2,065   0,413
Total            7  25,500

15 Hypothesis. We start with the question: is there at least one x-variable that is linearly related to Y? Hypothesis: H0: β1 = β2 = … = βp = 0 (no x-variable is linearly related to Y) vs H1: at least one βi is not zero (at least one x-variable is related to Y).

16 The test statistic is called F and is F-distributed with p and n−p−1 degrees of freedom. F_obs = MSR/MSE, where MSR = SSR/p and MSE = SSE/(n−p−1).

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       2  23,435  11,718  28,38  0,002
Residual Error   5   2,065   0,413
Total            7  25,500

Here p = 2 and n−p−1 = 5. The P-value is 0,002. Conclusion?
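The F statistic can be recomputed from the sums of squares in the ANOVA table; a quick Python check:

```python
# F statistic from the ANOVA table: F = MSR / MSE
p, n = 2, 8
ssr, sse = 23.435, 2.065
msr = ssr / p            # 11.7175
mse = sse / (n - p - 1)  # 0.413
f_obs = msr / mse
print(round(f_obs, 2))   # about 28.37; Minitab shows 28.38 (more decimals kept internally)
```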

17 We can reject H0. We have enough evidence to state that at least one x-variable is linearly related to Y. But which x-variable (or variables)? We need to look at the P-value for each regression coefficient. H0: βi = 0 (x-variable no. i is not related to Y) vs H1: βi ≠ 0 (x-variable no. i is related to Y).
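Each T value in the print-out is just the coefficient estimate divided by its standard error; a quick check against the Minitab output:

```python
# t statistic for each coefficient: t = estimate / standard error
t_tv = 2.2902 / 0.3041    # TVAdv
t_news = 1.3010 / 0.3207  # NewsAdv
print(round(t_tv, 2), round(t_news, 2))  # 7.53 4.06, as in the print-out
```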

18 Minitab print-out

Regression Analysis: Revenue versus TVAdv; NewsAdv

The regression equation is
Revenue = 83,2 + 2,29 TVAdv + 1,30 NewsAdv

Predictor     Coef  SE Coef      T      P
Constant    83,230    1,574  52,88  0,000
TVAdv       2,2902   0,3041   7,53  0,001
NewsAdv     1,3010   0,3207   4,06  0,010

S = 0,6426   R-Sq = 91,9%   R-Sq(adj) = 88,7%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       2  23,435  11,718  28,38  0,002
Residual Error   5   2,065   0,413
Total            7  25,500

19 Interpretation of the parameter estimates. b0 = 83,2 is the intercept. This is the expected revenue if no money is spent on advertising. We have no observations where all x-variables are zero, so this interpretation is an extrapolation; we need to be careful when we do extrapolations. b1 = 2,29: for each 1000 EUR we spend on television advertising, the revenue increases by 2290 EUR, given that the other x-variables are held constant.

20 b2 = 1,30: for each 1000 EUR we spend on newspaper advertising, the revenue increases by 1301 EUR, given that the other x-variables are held constant. We can use the model to make predictions about the revenue if we spend money on television and newspaper advertising.
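A prediction is just the fitted equation evaluated at chosen spending levels. In the sketch below the spends 3,5 and 1,8 (thousand EUR) are made-up illustration values, not from the slides:

```python
# Predicted mean revenue (1000 Euro) from the fitted Toulon equation
def predict_revenue(tv_adv, news_adv):
    return 83.23 + 2.2902 * tv_adv + 1.3010 * news_adv

# Hypothetical spending plan: 3.5 on TV, 1.8 on newspapers (1000 EUR each)
y_hat = predict_revenue(3.5, 1.8)
print(round(y_hat, 2))  # 93.59, i.e. about 93 590 Euro
```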

Multiple Regression with nominal variables

Multiple Regression with nominal x-variables. Example: “Johansson Filtration”. The company wants a model for predicting repair time (for quotations). Y = repair time, x1 = time since last repair, x2 = type of repair (mechanical or electrical) — a nominal variable.

Nominal x-variables. A regression with one nominal (x2) and one interval (x1) variable gives two parallel lines in the (x1, y)-plane: the line for x2 = 0 has intercept b0, and the line for x2 = 1 has intercept b0 + b2. A regression with one nominal (x3) and two interval (x1, x2) variables works the same way: b3 shifts the plane.

Qualitative x-variables (cont.). A regression with two dummy variables (x2 and x3) and one interval variable (x1) gives three parallel lines: x2 = 0 and x3 = 0 with intercept b0; x2 = 1 and x3 = 0 with intercept b0 + b2; x2 = 0 and x3 = 1 with intercept b0 + b3. A nominal variable with k categories is represented with k−1 dummy variables:

Category  x2  x3
El         0   0
Mech       1   0
Both       0   1
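The k−1 dummy coding in the table above can be sketched as a small Python function; the tuple ordering (x2, x3) follows the table, with "El" as the reference category:

```python
# A nominal variable with k categories needs k-1 dummy variables.
# Here "El" is the reference category (both dummies zero).
def dummies(category):
    return {"El": (0, 0), "Mech": (1, 0), "Both": (0, 1)}[category]

x2, x3 = dummies("Both")
print(x2, x3)  # 0 1
```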

Example ”Johansson”

Regression Analysis: Time versus Months; Type

The regression equation is
Time = 0,93 + 0,388 Months + 1,26 Type

Predictor    Coef  SE Coef     T      P
Constant   0,9305   0,4670  1,99  0,087
Months     0,3876   0,0626  6,20  0,000
Type       1,2627   0,3141  4,02  0,005

S = 0,4590   R-Sq = 85,9%   R-Sq(adj) = 81,9%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       2   9,0009  4,5005  21,36  0,001
Residual Error   7   1,4751  0,2107
Total            9  10,4760

Example ”Johansson”. It is perhaps doubtful whether the assumptions are fulfilled…

Example ”Johansson”

Both β1 and β2 are significantly different from zero: both x-variables help explain the Y-variable. β0 is not significantly different from zero. High coefficient of determination.

Regression Analysis: Time versus Months; Type

The regression equation is
Time = 0,93 + 0,388 Months + 1,26 Type

Predictor    Coef  SE Coef     T      P
Constant   0,9305   0,4670  1,99  0,087
Months     0,3876   0,0626  6,20  0,000
Type       1,2627   0,3141  4,02  0,005

S = 0,4590   R-Sq = 85,9%   R-Sq(adj) = 81,9%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       2   9,0009  4,5005  21,36  0,001
Residual Error   7   1,4751  0,2107
Total            9  10,4760

Interpretation of the parameter estimates. b0 = 0,93: expected repair time in hours (56 min) for a mechanical repair of a newly repaired facility. This interpretation is an extrapolation, and the parameter β0 is not significantly different from zero. b1 = 0,39: for each month without service, the mean repair time increases by 0,39 hours (23 min), for both kinds of repairs. b2 = 1,26: if the repair is electrical, the mean repair time increases by 1,26 hours (1 hour 16 min), irrespective of the time since the last repair.

Regression Analysis: Time versus Months; Type
The regression equation is
Time = 0,93 + 0,388 Months + 1,26 Type
Predictor    Coef  SE Coef     T      P
Constant   0,9305   0,4670  1,99  0,087
Months     0,3876   0,0626  6,20  0,000
Type       1,2627   0,3141  4,02  0,005
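As an illustration, the fitted Johansson equation can be evaluated in Python. The input "4 months" is a hypothetical value, and 0,3876 is assumed here as the unrounded coefficient behind the 0,388 in the equation:

```python
# Predicted repair time (hours) from the fitted Johansson equation.
# Type: 0 = mechanical, 1 = electrical.
def predict_time(months, repair_type):
    return 0.9305 + 0.3876 * months + 1.2627 * repair_type

# Hypothetical job: electrical repair, 4 months since last service
print(round(predict_time(4, 1), 2))  # 3.74 hours
print(round(predict_time(4, 0), 2))  # 2.48 hours for the mechanical case
```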

Example ”Johansson”

Problems in Multiple Regression

Multicollinearity problem. Ideal: each x-variable in the multiple regression model contributes unique information about the Y-variable; all x-variables are uncorrelated with each other. “Worst case” (maximum multicollinearity): the regression model cannot be estimated, because the x-variables are perfectly correlated with each other. Often in practice: the situation is somewhere between the ideal and the worst case — the x-variables are correlated, but not perfectly correlated.

Multicollinearity, illustration of the “worst case”: all observations lie on a line in the (x1, x2)-plane (perfect correlation), so we cannot estimate a plane — we do not have enough information about the slopes. Compare with OLS with one x-variable: if all points are at the same x-coordinate, we get no information about the slope and cannot estimate the regression line.
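The “worst case” can be demonstrated numerically: with perfectly correlated x-variables the design matrix loses a rank, so no unique least-squares plane exists. A small NumPy sketch with made-up numbers:

```python
import numpy as np

# Worst case: x2 is an exact multiple of x1, so the design matrix
# loses a rank and the slopes of the plane cannot be determined.
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2.0 * x1  # perfect correlation with x1
X = np.column_stack([np.ones(4), x1, x2])
print(np.linalg.matrix_rank(X))  # 2, not 3: no unique least-squares solution
```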

Causes of multicollinearity. Model problem: two x-variables measure almost the same thing. Example 1: x1 = length in cm and x2 = length in inches. Example 2: x1 = household income and x2 = income of the household member with the highest salary. Data-gathering problem: bad luck when collecting data (or a bad experimental design) results in a large correlation between the x-variables.

Consequences of multicollinearity. The estimators of the regression parameters get large variances (no significance in the T-tests, no low P-values). The size and sign of the estimates do not fit the theory. No robustness: the estimates of the regression parameters change a lot after minor changes in the observations, or if an observation is removed or added. In some cases the F-test result is significant, but none of the T-tests are.

How to discover multicollinearity. Correlation matrix of the x-variables. Example: Y = house living space (Boyta), x1 = disposable income (Disp.Inkomst), x2 = family size (Storlek).

Correlations: Disp.Inkomst; Storlek
Pearson correlation of Disp.Inkomst and Storlek = 0,978
P-Value = 0,000

There is a high correlation between the two x-variables. Variance inflation factor (VIF): rule of thumb, VIF > 10 means problems with multicollinearity.
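With exactly two x-variables the VIF has a closed form, VIF = 1/(1 − r²), where r is their pairwise correlation; a quick check with the printed (rounded) correlation:

```python
# VIF for two x-variables from their pairwise correlation
r = 0.978  # Pearson correlation from the Minitab output
vif = 1.0 / (1.0 - r**2)
print(round(vif, 1))  # about 23, matching Minitab's 22.75 up to rounding of r
```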

Example

Regression Analysis: Boyta versus Disp.Inkomst; Storlek

The regression equation is
Boyta = -11,5 + 0,568 Disp.Inkomst + 3,4 Storlek

Predictor        Coef  SE Coef      T      P     VIF
Constant       -11,52    70,45  -0,16  0,878
Disp.Inkomst   0,5681   0,5504   1,03  0,360  22,750
Storlek          3,43    17,41   0,20  0,853  22,750

S = 16,0924   R-Sq = 89,5%   R-Sq(adj) = 84,3%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       2  8849,8  4424,9  17,09  0,011
Residual Error   4  1035,9   259,0
Total            6  9885,7

The F-test is significant, but none of the T-tests! VIF = 22,750 > 10!

Countermeasures against multicollinearity. Model problem: remove one x-variable at a time until the problem disappears. Data-collection problem: try to collect more observations in a new region of the x-variable space.

If time permits: an example with a dummy variable

State finances in war and peace. We want to examine whether public purchases of premium bonds (x) are related to the national income (Y). Data: yearly observations of the variables in Canada from 1933 to 1949.

Observations: yearly values of y, x, and the dummy D for 1933–1949. (The table of numbers is not legible in this transcript.)

Dummy variable. D is a dummy variable: D = 1 if Canada is at war, D = 0 if Canada is at peace.
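The dummy can be built directly from the year; a small sketch assuming the war years are 1939–1945 (the slides do not state the range, so this is a historical assumption):

```python
# War dummy for Canada: 1 during the assumed war years 1939-1945, else 0
def war_dummy(year):
    return 1 if 1939 <= year <= 1945 else 0

print(war_dummy(1941), war_dummy(1948))  # 1 0
```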

Regression Analysis: y versus x (without the dummy)

The regression equation is
y = 1,57 + 0,759 x

Predictor    Coef  SE Coef     T      P
Constant   1,5698   0,6337  2,48  0,026
x           0,759   0,0830  9,14  0,000

S = 1,15623   R-Sq = 84,8%   R-Sq(adj) = 83,8%

Analysis of Variance
Source          DF      SS      MS      F      P
Regression       1  111,71  111,71  83,56  0,000
Residual Error  15   20,05    1,34
Total           16  131,76

Residuals vs. Estimated values

Regression Analysis: y versus x; D (with the dummy)

The regression equation is
y = 1,29 + 0,681 x + 2,30 D

Predictor    Coef  SE Coef      T      P
Constant   1,2897     0,…   …,16  0,000
x           0,681     0,…   …,99  0,000
D          2,3044     0,…   …,06  0,000

S = 0,2094   R-Sq = 99,5%   R-Sq(adj) = 99,5%

Analysis of Variance
Source          DF       SS      MS     F      P
Regression       2  131,145  65,573  …,92  0,000
Residual Error  14    0,614   0,044
Total           16  131,759

Residuals vs. Estimated values