
1 Statistics for Business and Economics Module 2: Regression and time series analysis Spring 2010 Lecture 6: Multiple Regression Model Building Priyantha Wijayatunga, Department of Statistics, Umeå University priyantha.wijayatunga@stat.umu.se These materials are adapted from copyrighted lecture slides (© 2009 W.H. Freeman and Company) from the homepage of the book The Practice of Business Statistics Using Data for Decisions, Second Edition, by Moore, McCabe, Duckworth and Alwan.

2 Multiple Linear Regression Model Building (reference: Chapter 11.3 of the book)
• Model building
• Models for curved relationships
• Models with categorical explanatory variables
• Variable selection methods

3 Model Building
• Often we have many explanatory variables. A model using just a few of the variables often predicts about as well as a model using all of them.
• We may also find that the reciprocal of a variable is a better choice than the variable itself, or that including the square of a variable improves prediction.
• How can we find a good model? This is the model-building problem.
• Regression modelling can cover a variety of mathematical models:
  – linear relationships
  – non-linear relationships
  – nominal independent variables
• It provides efficient methods for model building.

4 Earlier:
• Regression: X, Y interval data
• Correlation: X, Y interval or ordinal data
Now:
• Regression: Y interval data; X interval or nominal data
• The relation between the independent and dependent variables can be non-linear

5 Prices of Homes. Homes for sale in zip code 47904; the response variable is Price.

6 Price and square feet. Plot of Price versus SqFt: the relationship is approximately linear, but curves up somewhat for the higher-priced homes. Note: we excluded 7 homes with Price > $150,000 and SqFt > 1800.

7 Regression of price on square feet: the fitted model is

predicted Price = 45,298 + 34.32 SqFt

The coefficient for SqFt is statistically significant (P < 0.0001). Each additional square foot of area raises the selling price by $34.32 on average. 37.3% of the variation in Price is explained by a linear relationship with SqFt.
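As a concrete illustration (not part of the original slides), a minimal sketch of this fit in Python with statsmodels; the file name and the column names Price and SqFt are assumptions.

```python
# Minimal sketch of the slide-7 fit, assuming a data frame with columns
# "Price" and "SqFt"; the CSV file name is hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

homes = pd.read_csv("homes_47904.csv")           # hypothetical file
fit = smf.ols("Price ~ SqFt", data=homes).fit()  # ordinary least squares

print(fit.params)    # slide reports intercept 45,298 and slope 34.32
print(fit.rsquared)  # slide reports 37.3% of variation explained
print(fit.pvalues)   # t-test P-values for each coefficient
```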

8 Models for curved relationships
• The scatterplot suggests the relationship between square feet and price may be slightly curved.
• One simple kind of curved relationship is a quadratic function.
• The model is: y = β₀ + β₁x + β₂x² + ε

9 Quadratic regression of price on square feet: the fitted model is

predicted Price = 81,273 − 30.14 SqFt + 0.0271 SqFt²

where SqFt² is the square of SqFt. The coefficient for SqFt² is not statistically significant (P = 0.41). 38.6% of the variation in Price is explained by this model. We conclude that adding SqFt² to our model is not helpful.
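Continuing the sketch above, the I() wrapper in a statsmodels formula adds the squared term without building a new column by hand; again an illustration, not the software used in the course.

```python
import statsmodels.formula.api as smf

# Quadratic fit from slide 9; I(SqFt**2) creates the squared term.
quad = smf.ols("Price ~ SqFt + I(SqFt**2)", data=homes).fit()
print(quad.pvalues)   # slide reports P = 0.41 for the squared term
print(quad.rsquared)  # slide reports 38.6% explained
```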

10 Polynomial models
• pth-order polynomial model of one predictor variable
• 2nd-order polynomial model of two predictor variables
• 2nd-order polynomial model of two predictor variables with interaction

11 Polynomial Models with One Predictor Variable
Multiple linear regression model: y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε
Polynomial model in one variable: y = β₀ + β₁x + β₂x² + … + βₚxᵖ + ε

12 Polynomial Models with One Predictor Variable
First-order model (p = 1): y = β₀ + β₁x + ε
Second-order model (p = 2): y = β₀ + β₁x + β₂x² + ε (the curve opens upward when β₂ > 0 and downward when β₂ < 0)

13 Polynomial Models with One Predictor Variable
Third-order model (p = 3): y = β₀ + β₁x + β₂x² + β₃x³ + ε (curves shown for β₃ > 0 and β₃ < 0)

14 Nominal Independent Variables
• In many real-life situations one or more independent variables are nominal.
• Nominal variables are included in a regression model via indicator variables.
• An indicator variable (I), or dummy variable, takes one of two values, zero or one:
  I = 1 if the first of two conditions is met, and I = 0 if the second is met.

15 Water Filter Service
A water and sewage company provides service on water filter systems. To estimate the cost of a service job, a model for how long a repair takes is analysed. The following data were collected.

Service  Months since last repair  Kind of repair  Time for repair (hours)
1        2                         electronic      2.9
2        6                         mechanical      3.0
3        8                         electronic      4.8
4        3                         mechanical      1.8
5        2                         electronic      2.9
6        7                         electronic      4.9
7        9                         mechanical      4.2
8        8                         mechanical      4.8
9        4                         electronic      4.4
10       6                         electronic      4.5

16 Our model
Y = time for repair (hours)
X₁ = months since last repair
I = 1 if a mechanical repair, 0 if an electronic repair

Model: Y = β₀ + β₁X₁ + β₂I + ε

Note: for a nominal variable with m categories we need m − 1 indicator variables.
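A sketch of fitting this model to the ten service calls from slide 15, with the indicator coded by hand; the column names are my own.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Data from slide 15; mech is the indicator I (1 = mechanical repair).
repairs = pd.DataFrame({
    "months": [2, 6, 8, 3, 2, 7, 9, 8, 4, 6],
    "mech":   [0, 1, 0, 1, 0, 0, 1, 1, 0, 0],
    "hours":  [2.9, 3.0, 4.8, 1.8, 2.9, 4.9, 4.2, 4.8, 4.4, 4.5],
})

fit = smf.ols("hours ~ months + mech", data=repairs).fit()
print(fit.summary())  # the mech coefficient estimates beta_2
```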

17 Now assume that, instead of "kind of repair", we had a variable for different regions, say A, B, and C.
I₂ = 1 if region A, 0 otherwise
I₃ = 1 if region B, 0 otherwise
The category "region C" is defined by I₂ = 0 and I₃ = 0; region C is called the omitted category.

18 Our model: Y = β₀ + β₁X₁ + β₂I₂ + β₃I₃ + ε

19 Interpreting Regression Coefficients
In multiple regression, a coefficient is interpreted by holding the other variables constant. The interpretations of β₀ and β₁ are as usual.
When X₁ is held constant, Y for region A is on average β₂ hours more than Y for region C.
When X₁ is held constant, Y for region B is on average β₃ hours more than Y for region C.

20 t-tests for the parameters
In our example, one can test
H₀: β₂ = 0 against H₁: β₂ ≠ 0
Research hypothesis: the service time in region A is different from the service time in region C.
Or, for another hypothesis test,
H₀: β₃ = 0 against H₁: β₃ > 0
Research hypothesis: the service time in region B is longer than the service time in region C.

21 Example
The price of a used car (in thousands) is believed to be related to the number of miles on the odometer (in thousands) and the color of the car, where the most popular colors are white and silver. To build the model, a random sample of 100 used cars sold at auction during the last month was selected.

Price  Odometer  Color
14.6   37.4      white
14.1   44.8      white
14.0   45.8      other
15.6   30.9      other
15.6   31.7      silver
14.7   34.0      silver
...    ...       ...

22 I₂ = 1 if color is white, 0 otherwise
I₃ = 1 if color is silver, 0 otherwise
The other colors are defined by I₂ = 0 and I₃ = 0; "other colors" is called the omitted category.

23 Now our data look like this. Our model is Y = β₀ + β₁(Odometer) + β₂I₂ + β₃I₃ + ε.

Price  Odometer  I₂   I₃
14.6   37.4      1.0  0.0
14.1   44.8      1.0  0.0
14.0   45.8      0.0  0.0
15.6   30.9      0.0  0.0
15.6   31.7      0.0  1.0
14.7   34.0      0.0  1.0
...    ...       ...  ...
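In practice the indicator columns need not be typed in by hand; a sketch with pandas, using the rows shown above.

```python
import pandas as pd

cars = pd.DataFrame({
    "Price":    [14.6, 14.1, 14.0, 15.6, 15.6, 14.7],
    "Odometer": [37.4, 44.8, 45.8, 30.9, 31.7, 34.0],
    "Color":    ["white", "white", "other", "other", "silver", "silver"],
})

# One 0/1 column per color; keep m - 1 = 2 of them so "other" is omitted.
dummies = pd.get_dummies(cars["Color"], prefix="I").astype(int)
cars = cars.join(dummies[["I_white", "I_silver"]])
print(cars)
```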


25 Models with categorical explanatory variables
• The plot of Price vs. Bedrooms appears to show a curved relationship.
• Create a categorical variable Bed3 from "number of bedrooms": Bed3 = 1 if the home has three or more bedrooms and Bed3 = 0 if it does not.
• Bed3 is called an indicator variable.
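Constructing Bed3 is a one-liner; a sketch, continuing with the homes frame from the earlier example and assuming a Bedrooms column.

```python
# Bed3 = 1 for three or more bedrooms, 0 otherwise ("Bedrooms" is assumed).
homes["Bed3"] = (homes["Bedrooms"] >= 3).astype(int)
```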

26 Price and number of bedrooms: the fitted model is predicted Price = 75,700 + 15,146 Bed3. The coefficient for Bed3 is statistically significant (P = 0.0068). 19% of the variation is explained by the model. This suggests that Bed3 may be a useful explanatory variable.

27 FINAL MODEL: price, square feet, and bathrooms. The fitted model is

predicted Price = 59,268 + 16.78 SqFt + 13,161 B2 + 16,859 Bh

where B2 and Bh are indicator variables for an extra full bath and an extra half bath, respectively. 57.7% of the variation in Price is explained by this model. All coefficients for the explanatory variables are statistically significant.

28 Variable Selection Methods
• Sometimes the effect of one explanatory variable depends upon the value of another explanatory variable. We account for this situation in a regression model by including interaction terms.
• Modern regression software offers variable selection methods that examine, for example, the R² values for all possible multiple regression models.
• The software then presents us with the models having the highest R² for each number of explanatory variables.
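The slides do not name a particular program; a brute-force sketch of the all-subsets idea, using the home-price variables from this lecture as candidate predictors (real software does this far more efficiently).

```python
from itertools import combinations
import statsmodels.formula.api as smf

# For each model size k, keep the subset of predictors with the highest R^2.
candidates = ["SqFt", "Bed3", "B2", "Bh"]
best = {}
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        r2 = smf.ols("Price ~ " + " + ".join(subset), data=homes).fit().rsquared
        if r2 > best.get(k, (-1.0, None))[0]:
            best[k] = (r2, subset)

for k, (r2, subset) in sorted(best.items()):
    print(f"best {k}-variable model: R^2 = {r2:.3f} using {subset}")
```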

29 Polynomial Models with Two Predictors
First-order model: y = β₀ + β₁x₁ + β₂x₂ + ε
The effect of one predictor variable on y is independent of the effect of the other predictor variable on y: for X₂ = 1, 2, 3 the lines
[β₀ + β₂(1)] + β₁x₁, [β₀ + β₂(2)] + β₁x₁, [β₀ + β₂(3)] + β₁x₁
are parallel.
First-order model, two predictors and interaction: y = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε
The two variables interact to affect the value of y: for X₂ = 1, 2, 3 the lines
[β₀ + β₂(1)] + [β₁ + β₃(1)]x₁, [β₀ + β₂(2)] + [β₁ + β₃(2)]x₁, [β₀ + β₂(3)] + [β₁ + β₃(3)]x₁
have different slopes.
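In formula notation the interaction term is a one-character change; a sketch, where df, x1, x2 and y are hypothetical names.

```python
import statsmodels.formula.api as smf

# "x1 * x2" expands to x1 + x2 + x1:x2, adding the cross-product term b3*x1*x2.
with_interaction = smf.ols("y ~ x1 * x2", data=df).fit()
print(with_interaction.params)
```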

30 Polynomial Models with Two Predictors
Second-order model: y = β₀ + β₁x₁ + β₂x₂ + β₃x₁² + β₄x₂² + ε
Second-order model with interaction: y = β₀ + β₁x₁ + β₂x₂ + β₃x₁² + β₄x₂² + β₅x₁x₂ + ε
(response curves plotted for X₂ = 1, 2, 3)

31 Interaction
• The interaction effect is the effect of the interaction of two variables on the dependent variable. It is an extra effect that arises through the combination of two independent variables.
Example: X₁ = price of a product, X₂ = cost of advertising the product, Y = sales.

32 First-order model with interaction: y = β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂ + ε
Second-order model with interaction: y = β₀ + β₁x₁ + β₂x₂ + β₃x₁² + β₄x₂² + β₅x₁x₂ + ε

33 Example
Suppose that an analyst working for a fast-food restaurant chain has been asked to construct a regression model that will help to identify new locations that are likely to be profitable. The analyst believes that annual gross sales (Y) are related to the macro-economic variable "mean annual household income in the neighbourhood" (X₁) and the demographic variable "mean age of children in the neighbourhood" (X₂) in a second-order model with interaction between them. To estimate the model, 25 areas were selected at random; a portion of the data follows.

Revenue (Y) in $1000  Income (X₁) in $1000  Age (X₂) in years  Income²  Age²    Age × Income
1128                  23.5                  10.5               552.25   110.25  246.75
1005                  17.6                  7.2                309.76   51.84   126.72
...                   ...                   ...                ...      ...     ...
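A sketch of the analyst's model in formula notation, assuming a data frame sites with columns Revenue, Income and Age for the 25 areas; the squared and cross-product columns need not be precomputed.

```python
import statsmodels.formula.api as smf

# Second-order model with interaction: squares via I(), interaction via ":".
model = smf.ols(
    "Revenue ~ Income + Age + I(Income**2) + I(Age**2) + Income:Age",
    data=sites,
).fit()
print(model.summary())
```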


35 Regression Diagnostics
The conditions required for the model assessment to apply must be checked.
• Is the error variable normally distributed? Draw a histogram of the residuals.
• Is the error variance constant? Plot the residuals versus ŷ.
• Are the errors independent? Plot the residuals versus the time periods.
• Can we identify outliers?
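A sketch of the three plots for any fitted statsmodels result called fit, with matplotlib assumed available.

```python
import matplotlib.pyplot as plt

resid, fitted = fit.resid, fit.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(resid, bins=15)        # normality: roughly bell-shaped?
axes[0].set_title("Histogram of residuals")
axes[1].scatter(fitted, resid)      # constant variance: even spread?
axes[1].set_title("Residuals vs fitted")
axes[2].plot(resid.values)          # independence: no pattern over time?
axes[2].set_title("Residuals in time order")
plt.show()
```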

36 Multicollinearity (collinearity or intercorrelation)
Two or more independent variables are highly correlated, for example X₁ = share price, X₂ = interest rate, X₃ = inflation in a model for Y = return. The F-test of the analysis of variance is not affected, but the individual coefficient estimates and their t-tests become unreliable.
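The slide does not prescribe a diagnostic; one common check (my addition, not from the slides) is the variance inflation factor, where values well above 10 are usually read as a warning sign. The data frame and column names below are assumptions.

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each column of the design matrix, including the constant.
X = sm.add_constant(df[["share_price", "interest", "inflation"]])
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
```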

37 Estimated model

38 Remedying Violations of Required Conditions of Regression
• Non-normality or heteroscedasticity can be remedied using transformations of the y variable.
• The transformations can improve the linear relationship between the dependent variable and the independent variables.
• Many computer software systems allow us to make the transformations easily.

39 Transformation: Y* = log Y

40 Reducing Non-normality by Transformations: a brief list of transformations
1. y* = log y (for y > 0). Use when σ² increases with y, or when the error distribution is positively skewed.
2. y* = y². Use when σ² is proportional to E(y), or when the error distribution is negatively skewed.
3. y* = √y (for y > 0). Use when σ² is proportional to E(y).
4. y* = 1/y. Use when σ² increases significantly when y increases beyond some critical value.
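A sketch of transformation 1: in a statsmodels formula the response can be transformed in place, so the model is refit with y* = log y (the frame df and variable names are hypothetical).

```python
import numpy as np
import statsmodels.formula.api as smf

# Refit with the log-transformed response; np.log is evaluated in the formula.
log_fit = smf.ols("np.log(y) ~ x1 + x2", data=df).fit()
print(log_fit.summary())
```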

