Stat 112 Notes 9 Today: –Multicollinearity (Chapter 4.6) –Multiple regression and causal inference.



Assessing Quality of Prediction (Chapter 3.5.3) R squared is a measure of the fit of the regression to the sample data. It is not generally considered an adequate measure of the regression's ability to predict the responses for new observations. One method of assessing this predictive ability is data splitting. We split the data into two groups – a training sample and a holdout sample (also called a validation sample). We fit the regression model to the training sample and then assess the quality of the model's predictions on the holdout sample.
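Data splitting can be sketched as follows. This is a minimal illustration using simulated data (the variable names and true coefficients are made up for the example, not taken from the course data sets), with ordinary least squares fit via numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for a real data set (hypothetical, for illustration only)
n = 200
x = rng.normal(size=(n, 2))
y = 3 + 2 * x[:, 0] - x[:, 1] + rng.normal(scale=1.0, size=n)

# Split the data into a training sample and a holdout (validation) sample
idx = rng.permutation(n)
train, hold = idx[:150], idx[150:]

# Fit least squares on the training sample only
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# Assess prediction quality on the holdout sample via root mean squared error
rmse_hold = np.sqrt(np.mean((y[hold] - X[hold] @ beta) ** 2))
print(round(rmse_hold, 2))
```

Because the holdout observations played no role in fitting, the holdout RMSE is an honest estimate of prediction error, unlike R squared computed on the fitting sample.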

Measuring Quality of Predictions

Multicollinearity DATA: A real estate agent wants to develop a model to predict the selling price of a home. The agent takes a random sample of 100 homes that were recently sold and records the selling price (y), the number of bedrooms (x1), the house size in square feet (x2) and the lot size in square feet (x3). The data are in houseprice.JMP.

Note: These results illustrate that the F test is more powerful than individual t tests for testing whether a group of slopes in multiple regression are all zero.

Multicollinearity Multicollinearity: Explanatory variables are highly correlated with each other. It is often hard to determine their individual regression coefficients. There is very little information in the data set to find out what would happen if we fix house size and change lot size.

Since house size and lot size are highly correlated, for fixed house size, lot size does not vary much. The standard error for estimating the coefficient of lot size is therefore large, and consequently the coefficient may not be significant; similarly for the coefficient of house size. So, while it appears that at least one of the coefficients is significant (see the ANOVA F test), you cannot tell which one is the useful one.

Consequences of Multicollinearity Standard errors of regression coefficients are large. As a result t statistics for testing the population regression coefficients are small. Regression coefficient estimates are unstable. Signs of coefficients may be opposite of what is intuitively reasonable (e.g., negative sign on lot size). Dropping or adding one variable in the regression causes large change in estimates of coefficients of other variables.
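The inflated standard errors can be seen directly in a small simulation (hypothetical data, not the house price data set): the same model is fit once with nearly collinear predictors and once with independent predictors, and the slope standard errors are compared using the usual OLS formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

def coef_se(x1, x2, y):
    # OLS coefficient standard errors from s^2 * (X'X)^{-1}
    X = np.column_stack([np.ones(n), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - 3)
    cov = s2 * np.linalg.inv(X.T @ X)
    return np.sqrt(np.diag(cov))

# Nearly collinear predictors (correlation around 0.99)
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
y = 1 + x1 + x2 + rng.normal(size=n)
se_collinear = coef_se(x1, x2, y)

# Independent predictors, same model otherwise
x2b = rng.normal(size=n)
yb = 1 + x1 + x2b + rng.normal(size=n)
se_indep = coef_se(x1, x2b, yb)

# Slope standard errors balloon under collinearity
print(se_collinear[1], se_indep[1])
```

With this degree of correlation the slope standard errors are several times larger in the collinear fit, even though the true coefficients and the error variance are identical in the two settings.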

Detecting Multicollinearity 1. Pairwise correlations between explanatory variables are high. 2. Large overall F-statistic for testing usefulness of predictors, but small t statistics. 3. Variance inflation factors (VIFs) are large.

Using VIFs To obtain VIFs, after Fit Model, go to Parameter Estimates, right click, click Columns and click VIFs. Detecting multicollinearity with VIFs: – Any individual VIF greater than 10 indicates multicollinearity.
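Outside JMP, VIFs can be computed from their definition, VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing explanatory variable j on all the other explanatory variables. A sketch with simulated house-price-style variables (the names and the strength of correlation are assumptions for illustration):

```python
import numpy as np

def vifs(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 is from regressing
    column j of X on all other columns (plus an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        tss = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2 = 1 - resid @ resid / tss
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(2)
house = rng.normal(size=300)
lot = house + 0.2 * rng.normal(size=300)   # lot size strongly tied to house size
beds = rng.normal(size=300)                # bedrooms unrelated to the others
v = vifs(np.column_stack([house, lot, beds]))
print([round(x, 1) for x in v])
```

Here the VIFs for house size and lot size exceed the rule-of-thumb cutoff of 10, while the VIF for bedrooms stays near 1, matching the diagnosis from the pairwise correlations.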

Problems caused by multicollinearity If interest is in predicting y, as long as pattern of multicollinearity continues for those observations where forecasts are desired (e.g., house size and lot size are either both high, both medium or both small), multicollinearity is not particularly problematic. If interest is in obtaining individual regression coefficients, there is no good solution in face of multicollinearity. If interest is in predicting y for observations where pattern of multicollinearity is different than that in sample (e.g., large house size, small lot size), no good solution (this would be extrapolation).

Dealing with Multicollinearity Suffer: If prediction within the range of the data is the only goal, not the interpretation of the coefficients, then leave the multicollinearity alone. Omit a variable. Multicollinearity can be reduced by removing one of the highly correlated variables. However, if one wants to estimate the partial slope of one variable holding fixed the other variables, omitting a variable is not an option, as it changes the interpretation of the slope.

California Test Score Data The California Standardized Testing and Reporting (STAR) data set californiastar.JMP contains data on test performance, school characteristics and student demographic backgrounds. Average Test Score is the average of the reading and math scores for a standardized test administered to 5th grade students. One interesting question: What would be the causal effect of decreasing the student-teacher ratio by one student per teacher?

Multiple Regression and Causal Inference Goal: Figure out what the causal effect on average test score would be of decreasing student-teacher ratio and keeping everything else in the world fixed. Lurking variable: A variable that is associated with both average test score and student-teacher ratio. In order to figure out whether a drop in student-teacher ratio causes higher test scores, we want to compare mean test scores among schools with different student-teacher ratios but the same values of the lurking variables, i.e. we want to hold the values of the lurking variables fixed. If we include all of the lurking variables in the multiple regression model, the coefficient on student-teacher ratio represents the change in the mean of test scores that is caused by a one unit increase in student-teacher ratio.

Omitted Variables Bias Schools with many English learners tend to have worse resources. The multiple regression that shows how mean test score changes when student-teacher ratio changes but percent of English learners is held fixed gives a better idea of the causal effect of the student-teacher ratio than the simple linear regression that does not hold percent of English learners fixed. Omitted variables bias: bias in estimating the causal effect of a variable from omitting a lurking variable from the multiple regression. Omitted variables bias from omitting percentage of English learners = (simple regression coefficient) - (multiple regression coefficient) = -2.38 - (-1.10) = -1.28.
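The mechanism behind omitted variables bias can be demonstrated by simulation. The numbers below (true coefficients, means, and noise levels) are invented for illustration and are not the STAR estimates; the point is only that the simple-regression slope minus the multiple-regression slope recovers the bias from omitting the lurking variable:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Lurking variable (percent English learners) correlated with student-teacher ratio
ratio = rng.normal(20, 2, size=n)                   # student-teacher ratio
pct_el = 2 * ratio + rng.normal(scale=5, size=n)    # percent English learners
score = 700 - 1.1 * ratio - 0.65 * pct_el + rng.normal(scale=5, size=n)

def slope_on_ratio(X):
    beta, *_ = np.linalg.lstsq(X, score, rcond=None)
    return beta[1]

ones = np.ones(n)
b_simple = slope_on_ratio(np.column_stack([ones, ratio]))              # omits pct_el
b_multiple = slope_on_ratio(np.column_stack([ones, ratio, pct_el]))    # holds it fixed

# Omitted variables bias = simple-regression slope minus multiple-regression slope
print(round(b_simple - b_multiple, 2))
```

The multiple-regression slope lands near the true causal coefficient on the ratio, while the simple-regression slope also absorbs the effect of the omitted English-learner variable, producing a negative bias.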

Key Warning About Using Multiple Regression for Causal Inference Even if we have included many lurking variables in the multiple regression, we may have failed to include one or not have enough data to include one. There will then be omitted variables bias. The best way to study causal effects is to do a randomized experiment.

Path Diagram (figure): Student-Teacher Ratio and the lurking variables (Percent English Learners, CalWorks %, Other Lurking Variables) linked to Average Test Score.