
Selecting Variables and Avoiding Pitfalls (Chapters 6 and 7)

Let's start with the pitfalls. What do you think they are?

Young children who sleep with the light on are much more likely to develop myopia later in life. This result, from a study at the University of Pennsylvania Medical Center, was published in the May 13, 1999, issue of Nature. However, a later study at Ohio State University found no link between infants sleeping with the light on and developing myopia; it did find a strong link between parental myopia and the development of child myopia, and it noted that myopic parents were more likely to leave a light on in their children's bedroom. What's going on here?

Remember: correlation does not imply causation! A statistically significant relationship between a response y and a predictor x does not necessarily imply a cause-and-effect relationship.

Caution: Lack of variability or small n. The number of levels of a quantitative variable must be at least one more than the order of the polynomial in x that you want to fit. To fit a straight line, you need at least two different x values; how many do you need to fit a curve? (For a quadratic, at least three.) The sample size n must also be large enough that the degrees of freedom for estimating σ², n − (k + 1), exceed 0.

Caution: Interpreting the magnitude of a coefficient βᵢ as the importance of xᵢ. With complex models, not all βs have a practical interpretation, and unless the coefficients are standardized we cannot compare β values across predictors measured on different scales. To standardize in Minitab: Stat > Regression > Storage > Standardized Coefficients.
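Outside Minitab, standardization is easy to sketch. A minimal Python example, assuming pandas and statsmodels are available (the file and variable names are hypothetical): convert the response and predictors to z-scores and refit, so the coefficients are in common standard-deviation units and can be compared.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mydata.csv")                # hypothetical data set
cols = ["y", "x1", "x2"]                      # hypothetical variable names

# Convert each variable to z-scores so the coefficients share a scale
z = (df[cols] - df[cols].mean()) / df[cols].std()

X = sm.add_constant(z[["x1", "x2"]])
fit = sm.OLS(z["y"], X).fit()
print(fit.params)   # standardized coefficients, comparable in magnitude
```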

Caution: Multicollinearity. Multicollinearity arises when two or more independent variables are moderately to highly correlated with each other. The best regression models are those in which each predictor correlates highly with the dependent (outcome) variable but correlates, at most, only minimally with the other predictors.

How do I know if multicollinearity is present?
- Examine the correlation matrix: Stat > Basic Statistics > Correlation (select all variables of interest).
- Look for non-significant t tests on the individual β parameters even though the overall F test is significant.
- Look for β estimates whose signs are the opposite of what you expected.
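A quick screen outside Minitab, sketched in Python with pandas (file and column names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("mydata.csv")      # hypothetical data set
preds = ["x1", "x2", "x3"]          # hypothetical predictor names

# Pairwise correlations among the predictors; a large off-diagonal |r|
# (say, above 0.8) is a warning sign of multicollinearity
print(df[preds].corr().round(2))
```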

VIF. The variance inflation factor (VIF) measures how much the variance of an estimated regression coefficient is inflated when the predictors are correlated (multicollinear). VIF = 1 indicates no correlation with the other predictors; VIF > 1 indicates some. When a VIF exceeds 10, the regression coefficients are poorly estimated. In Minitab's Regression window, click the Options button and select Variance Inflation Factors under Display. From a VIF value you can calculate the R² that relates one independent variable to the remaining independent variables (p. 349).
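As a sketch of the computation (assuming statsmodels; the column names are hypothetical): each VIFⱼ comes from regressing xⱼ on the other predictors, and the corresponding R² is recovered as Rⱼ² = 1 − 1/VIFⱼ.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("mydata.csv")                  # hypothetical data set
X = sm.add_constant(df[["x1", "x2", "x3"]])     # hypothetical predictors

# VIF for each predictor (column 0 is the constant, so skip it)
for j, name in enumerate(X.columns[1:], start=1):
    vif = variance_inflation_factor(X.values, j)
    r2 = 1 - 1 / vif        # R² of x_j regressed on the other predictors
    print(f"{name}: VIF = {vif:.2f}, R² = {r2:.3f}")
```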

Caution 2: Violating the assumptions. What are the assumptions about ε?
- The mean value of ε for any given set of values of x₁, x₂, …, xₖ is E(ε) = 0.
- ε has a normal probability distribution with mean 0 and variance σ².
- The random errors are independent.
If the data violate these assumptions, the derived inferences are suspect, so the methodology must be modified. We will dive into this in Chapter 8!
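A quick first look at these assumptions is to inspect the residuals of a fitted model. A sketch, assuming statsmodels and SciPy (data and variable names hypothetical); Chapter 8 treats this properly:

```python
import pandas as pd
import statsmodels.api as sm
from scipy import stats

df = pd.read_csv("mydata.csv")                 # hypothetical data set
X = sm.add_constant(df[["x1", "x2"]])
fit = sm.OLS(df["y"], X).fit()

resid = fit.resid
print("mean of residuals:", resid.mean())      # should be near 0
print("Shapiro-Wilk normality test:", stats.shapiro(resid))
sm.qqplot(resid, line="45")                    # visual check of normality
```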

Selecting variables. Last class and journal: we compared complex models to reduced models, testing a portion of the complete model's parameters with a nested-model F test. Today: methods for selecting which independent variables to include when many are possible.

Paring down. We start with a comprehensive model that includes all conceivable, testable influences on the phenomenon under investigation, and we want to end up with the simplest model possible. In addition to the literature, theory, and plots of the data, stepwise regression can help in selecting variables. Parsimony: the smaller the number of βs, the better. (Simpler models are easier to understand and appreciate, and therefore have a "beauty" that their more complicated counterparts often lack.)

What we've done: the R² criterion and the adjusted R² (MSE) criterion.
- Explore different sets of variables in the regression equation, but always keep common sense and the literature in mind!
- Add enough variables that R² is sufficiently large.
- You can also look at the adjusted R², which penalizes the fit as the number of βs increases.
- Select the simplest model with the highest R² (or R²adj).
- You can also search for the model with the minimum MSE (included in the ANOVA output).
Remember: parsimony.
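A sketch of that comparison in Python with statsmodels (variable names hypothetical): fit a few candidate models and line up R², adjusted R², and MSE.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mydata.csv")                 # hypothetical data set
candidates = [["x1"], ["x1", "x2"], ["x1", "x2", "x3"]]

for cols in candidates:
    X = sm.add_constant(df[cols])
    fit = sm.OLS(df["y"], X).fit()
    print(cols,
          f"R² = {fit.rsquared:.3f}",
          f"R²adj = {fit.rsquared_adj:.3f}",
          f"MSE = {fit.mse_resid:.2f}")
```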

Another option: stepwise regression. (1) You identify an initial model with many potential variables. (2) The software repeatedly alters the model from the previous step by adding or removing a predictor variable according to the "stepping criteria." (3) The search terminates when no further step is possible under the stepping criteria, or when a specified maximum number of steps has been reached.

Today’s Data Set: Pulse Each student in a class recorded his or her height, weight, gender, smoking preference, usual activity level, and resting pulse. They all flipped coins, and those whose coins came up heads ran in place for one minute. Afterward, the entire class recorded their pulses once more. We now want to find the best predictors for the second pulse rate.

Stepwise regression starts with no predictors. Each available predictor is evaluated for how much it would increase R² if added to the model. The one that would increase R² the most is added, provided it meets the statistical criterion for entry. This procedure repeats until no remaining predictor is eligible for entry.
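A minimal sketch of that forward-stepping loop (assuming statsmodels; the file name, column names, and the entry criterion α = 0.15 are assumptions):

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("pulse.csv")             # hypothetical file for the Pulse data
y = df["Pulse2"]
remaining = ["Pulse1", "Ran", "Smokes", "Sex", "Height", "Weight"]
selected = []

ALPHA_ENTER = 0.15                        # assumed entry criterion
while remaining:
    # p-value of each candidate when added to the current model; the
    # smallest p-value corresponds to the largest increase in R²
    pvals = {}
    for cand in remaining:
        X = sm.add_constant(df[selected + [cand]])
        pvals[cand] = sm.OLS(y, X).fit().pvalues[cand]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= ALPHA_ENTER:        # no predictor is eligible for entry
        break
    selected.append(best)
    remaining.remove(best)

print("Selected predictors:", selected)
```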

Open the Pulse data set. 1) Choose Stat > Regression > Stepwise. 2) In Response, enter Pulse2. 3) In Predictors, enter Pulse1 Ran-Weight. 4) Click OK. Which variables were selected?

Backward elimination fits a model with terms for all potential variables, then repeatedly drops the variable with the smallest t statistic. 1) Choose Stat > Regression > Stepwise. 2) In Response, enter Pulse2. 3) In Predictors, enter Pulse1 Ran-Weight. 4) Click Methods. 5) Select Backward elimination. 6) Click OK (twice). In light of today's cautions, what could be a disadvantage of this procedure?
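The same procedure as a sketch (statsmodels again; the removal criterion α = 0.15 is an assumption): start with everything and repeatedly drop the least significant predictor until all survivors pass.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("pulse.csv")               # hypothetical file for the Pulse data
y = df["Pulse2"]
selected = ["Pulse1", "Ran", "Smokes", "Sex", "Height", "Weight"]

ALPHA_REMOVE = 0.15                         # assumed removal criterion
while selected:
    X = sm.add_constant(df[selected])
    pvals = sm.OLS(y, X).fit().pvalues.drop("const")   # ignore the intercept
    worst = pvals.idxmax()                  # least significant predictor
    if pvals[worst] <= ALPHA_REMOVE:        # everything left is significant
        break
    selected.remove(worst)

print("Remaining predictors:", selected)
```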

Best Subsets Regression: Stat > Regression > Best Subsets.
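Best subsets simply fits every combination of the candidate predictors. A brute-force sketch, fine for a handful of candidates (names hypothetical):

```python
from itertools import combinations

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("pulse.csv")               # hypothetical file
y = df["Pulse2"]
candidates = ["Pulse1", "Ran", "Smokes", "Sex", "Height", "Weight"]

results = []
for k in range(1, len(candidates) + 1):
    for cols in combinations(candidates, k):
        fit = sm.OLS(y, sm.add_constant(df[list(cols)])).fit()
        results.append((fit.rsquared_adj, cols))

# The best few subsets by adjusted R²
for r2adj, cols in sorted(results, reverse=True)[:5]:
    print(f"R²adj = {r2adj:.3f}", cols)
```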

The Cp criterion is used to compare the full model to a model with a subset of the predictors. Look for models where Mallows' Cp is small and close to p, where p is the number of parameters in the model, including the constant. A small Cp indicates that the model is relatively precise (has small variance) in estimating the true regression coefficients and predicting future responses; models with considerable lack of fit and bias have Cp values much larger than p.
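Assuming the standard definition of the statistic, Cp = SSEp/MSEfull − (n − 2p), where SSEp is the error sum of squares of the subset model and MSEfull comes from the model with all candidate predictors. As a sketch:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("pulse.csv")                  # hypothetical file
y = df["Pulse2"]
full_cols = ["Pulse1", "Ran", "Smokes", "Sex", "Height", "Weight"]
sub_cols = ["Pulse1", "Ran"]                   # a candidate subset

full = sm.OLS(y, sm.add_constant(df[full_cols])).fit()
sub = sm.OLS(y, sm.add_constant(df[sub_cols])).fit()

n = len(y)
p = len(sub_cols) + 1                          # parameters, constant included
cp = sub.ssr / full.mse_resid - (n - 2 * p)    # Mallows' Cp
print(f"Cp = {cp:.2f} (compare to p = {p})")
```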

PRESS criterion. The PREdiction Sum of Squares is built from deleted predictions: for each observation i, the model is refit with the i-th data point deleted, and the observed yᵢ is compared with the resulting prediction. Small differences between observed and predicted y values indicate that the model is predicting well; PRESS is the sum of these squared differences, so smaller is better. Minitab: Stat > Regression > Regression > Options (select PRESS).
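There is a standard shortcut that avoids n separate refits: the deleted residual equals eᵢ/(1 − hᵢᵢ), where hᵢᵢ is the leverage of observation i. A sketch using that identity (statsmodels; file and column names hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("pulse.csv")                    # hypothetical file
X = sm.add_constant(df[["Pulse1", "Ran"]])
fit = sm.OLS(df["Pulse2"], X).fit()

h = fit.get_influence().hat_matrix_diag          # leverages h_ii
press = np.sum((fit.resid / (1 - h)) ** 2)       # sum of squared deleted residuals
print(f"PRESS = {press:.1f}")
```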

Cautions. These are screening methods, not your final decision-maker; in particular, they do not automatically account for interactions or non-linear relationships.