Regression Model Building Setting: Possibly a large set of predictor variables (including interactions). Goal: Fit a parsimonious model that explains variation in Y with a small set of predictors.

Presentation transcript:

Regression Model Building Setting: Possibly a large set of predictor variables (including interactions). Goal: Fit a parsimonious model that explains variation in Y with a small set of predictors. Automated procedures and all possible regressions: –Backward Elimination (top-down approach) –Forward Selection (bottom-up approach) –Stepwise Regression (combines forward/backward) –Cp statistic: summarizes each possible model, so the "best" model can be selected based on the statistic

Backward Elimination Select a significance level to stay in the model (e.g. SLS = 0.20; generally .05 is too low, causing too many variables to be removed). Fit the full model with all possible predictors. Consider the predictor with the lowest t-statistic (highest P-value). –If P > SLS, remove the predictor and fit the model without this variable (the model must be re-fit here because the partial regression coefficients change) –If P ≤ SLS, stop and keep the current model Continue until all remaining predictors have P-values at or below SLS
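The loop above can be sketched in a few lines. The `pvalues` oracle below is a hypothetical stand-in for refitting the model and reading off partial-test P-values (in practice it would wrap a regression routine such as an OLS fitter); only the elimination logic itself is the point.

```python
# Sketch of backward elimination with SLS = 0.20.
# `pvalues(model)` is a hypothetical oracle: given a tuple of predictor
# names, it "refits" the model and returns each predictor's P-value.

def backward_eliminate(pvalues, predictors, sls=0.20):
    model = list(predictors)
    while model:
        p = pvalues(tuple(model))               # re-fit: partial coefficients change
        worst = max(model, key=lambda v: p[v])  # lowest |t| = highest P-value
        if p[worst] > sls:
            model.remove(worst)                 # drop it, re-fit on the next pass
        else:
            break                               # all remaining P-values <= SLS
    return model

# Contrived oracle for illustration (constant P-values, not a real fit):
def toy_pvalues(model):
    table = {"x1": 0.01, "x2": 0.05, "x3": 0.60}
    return {v: table[v] for v in model}
```

With the toy oracle, `backward_eliminate(toy_pvalues, ["x1", "x2", "x3"])` removes `x3` (P = 0.60 > 0.20) and keeps `["x1", "x2"]`.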

Forward Selection Choose a significance level to enter the model (e.g. SLE = 0.20; generally .05 is too low, causing too few variables to be entered). Fit all simple regression models. Consider the predictor with the highest t-statistic (lowest P-value). –If P ≤ SLE, keep this variable and fit all two-variable models that include this predictor –If P > SLE, stop and keep the previous model Continue until no new predictors have P ≤ SLE
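The same idea can be sketched bottom-up. As before, `pvalues` is a hypothetical refitting oracle, not a real library call; it returns the P-value each candidate would have if added to the current model.

```python
# Sketch of forward selection with SLE = 0.20.

def forward_select(pvalues, candidates, sle=0.20):
    model, remaining = [], list(candidates)
    while remaining:
        # P-value of each candidate when added to the current model
        score = {v: pvalues(tuple(model + [v]))[v] for v in remaining}
        best = min(remaining, key=lambda v: score[v])  # highest t = lowest P
        if score[best] <= sle:
            model.append(best)
            remaining.remove(best)
        else:
            break                                      # no candidate meets SLE
    return model

# Contrived oracle for illustration (constant P-values, not a real fit):
def toy_pvalues(model):
    table = {"x1": 0.01, "x2": 0.05, "x3": 0.60}
    return {v: table[v] for v in model}
```

Here `forward_select(toy_pvalues, ["x1", "x2", "x3"])` enters `x1`, then `x2`, and stops because `x3` fails SLE, yielding `["x1", "x2"]`.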

Stepwise Regression Select SLS and SLE (SLE < SLS). Starts like Forward Selection (bottom-up process). New variables must have P ≤ SLE to enter. Re-tests all "old" variables that have already been entered; they must have P ≤ SLS to stay in the model. Continues until no new variables can be entered and no old variables need to be removed.
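Combining the two procedures gives a sketch like the following; again `pvalues` is a hypothetical refitting oracle. Real implementations also guard against a variable cycling in and out, which this minimal version does not.

```python
# Sketch of stepwise regression with SLE < SLS (e.g. 0.15 < 0.20).

def stepwise(pvalues, candidates, sle=0.15, sls=0.20):
    model, remaining = [], list(candidates)
    changed = True
    while changed:
        changed = False
        # Forward step: best new candidate must meet SLE to enter.
        if remaining:
            score = {v: pvalues(tuple(model + [v]))[v] for v in remaining}
            best = min(remaining, key=lambda v: score[v])
            if score[best] <= sle:
                model.append(best)
                remaining.remove(best)
                changed = True
        # Backward step: re-test old variables; each must meet SLS to stay.
        if model:
            p = pvalues(tuple(model))
            worst = max(model, key=lambda v: p[v])
            if p[worst] > sls:
                model.remove(worst)
                remaining.append(worst)
                changed = True
    return model

# Contrived oracle for illustration (constant P-values, not a real fit):
def toy_pvalues(model):
    table = {"x1": 0.01, "x2": 0.05, "x3": 0.60}
    return {v: table[v] for v in model}
```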

All Possible Regressions - Cp Fits every possible model. If there are K potential predictor variables, there are 2^K - 1 models. Label the Mean Square Error for the model containing all K predictors as MSE_K. For each model, compute SSE and Cp, where p is the number of parameters (including intercept) in the model. Select the model with the fewest predictors that has Cp ≤ p.
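The slide leaves the Cp formula implicit; the form used below is the standard Mallows' Cp, Cp = SSE_p / MSE_K - (n - 2p), stated here as an assumption consistent with the definitions above.

```python
# Mallows' Cp for a candidate model with p parameters (incl. intercept),
# given SSE for that model and MSE_K from the full K-predictor model.
def cp_statistic(sse_p, mse_full, n, p):
    return sse_p / mse_full - (n - 2 * p)

# Number of possible models over K candidate predictors (nonempty subsets).
def n_models(k):
    return 2 ** k - 1
```

For example, with n = 20, MSE_K = 2.0, and a 3-parameter model with SSE = 40.0, Cp = 40/2 - (20 - 6) = 6.0; since 6.0 > p = 3, that model would not satisfy the Cp ≤ p criterion.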

Regression Diagnostics Model assumptions: –Regression function correctly specified (e.g. linear) –Conditional distribution of Y is normal –Conditional distribution of Y has constant standard deviation –Observations on Y are statistically independent Residual plots can be used to check the assumptions: –Histogram (stem-and-leaf plot) should be mound-shaped (normal) –Plot of residuals versus each predictor should be a random cloud: a U shape (or inverted U) indicates a nonlinear relation; a funnel shape indicates non-constant variance –Plot of residuals versus time order (time series data) should be a random cloud; if a pattern appears, the observations are not independent

Detecting Influential Observations –Studentized Residuals: residuals divided by their estimated standard errors (like t-statistics). Observations with values larger than 3 in absolute value are considered outliers. –Leverage Values (Hat Diag): a measure of how far an observation is from the others in terms of the levels of the independent variables (not the dependent variable). Observations with values larger than 2(k+1)/n are considered potentially highly influential, where k is the number of predictors and n is the sample size. –DFFITS: a measure of how much an observation has affected its fitted value from the regression model. Values larger than 2*sqrt((k+1)/n) in absolute value are considered highly influential. Use standardized DFFITS in SPSS.

Detecting Influential Observations –DFBETAS: a measure of how much an observation has affected the estimate of a regression coefficient (there is one DFBETA for each regression coefficient, including the intercept). Values larger than 2/sqrt(n) in absolute value are considered highly influential. –Cook's D: a measure of the aggregate impact of each observation on the group of regression coefficients, as well as the group of fitted values. Values larger than 4/n are considered highly influential. –COVRATIO: a measure of the impact of each observation on the variances (and standard errors) of the regression coefficients and their covariances. Values outside the interval 1 ± 3(k+1)/n are considered highly influential.
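All the rule-of-thumb cutoffs from the two slides above depend only on n and k, so they can be collected in one small helper, a convenience sketch rather than any library's API:

```python
from math import sqrt

# Rule-of-thumb influence cutoffs from the slides;
# n = sample size, k = number of predictors.
def influence_cutoffs(n, k):
    half_width = 3 * (k + 1) / n
    return {
        "studentized_residual": 3.0,          # flag |r_i| > 3 as outliers
        "leverage": 2 * (k + 1) / n,          # hat-diagonal cutoff
        "dffits": 2 * sqrt((k + 1) / n),      # |DFFITS| cutoff
        "dfbetas": 2 / sqrt(n),               # |DFBETAS| cutoff, per coefficient
        "cooks_d": 4 / n,                     # Cook's D cutoff
        "covratio": (1 - half_width, 1 + half_width),  # acceptable interval
    }
```

For n = 100 and k = 4 this gives a leverage cutoff of 0.10, DFBETAS cutoff of 0.20, Cook's D cutoff of 0.04, and a COVRATIO interval of (0.85, 1.15).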

Obtaining Influence Statistics and Studentized Residuals in SPSS 1. Choose ANALYZE, REGRESSION, LINEAR, and input the Dependent variable and set of Independent variables from your model of interest (possibly having been chosen via an automated model selection method). 2. Under STATISTICS, select Collinearity Diagnostics, Casewise Diagnostics and All Cases, then CONTINUE. 3. Under PLOTS, select Y: *SRESID and X: *ZPRED. Also choose HISTOGRAM. These give a plot of studentized residuals versus standardized predicted values, and a histogram of standardized residuals (residual/sqrt(MSE)). Select CONTINUE. 4. Under SAVE, select Studentized Residuals, Cook's, Leverage Values, Covariance Ratio, Standardized DFBETAS, Standardized DFFITS. Select CONTINUE. The results will be added to your original data worksheet.

Variance Inflation Factors Variance Inflation Factor (VIF): a measure of how highly correlated each independent variable is with the other predictors in the model; used to identify multicollinearity. Values larger than 10 for a predictor imply large inflation of the standard errors of the regression coefficients due to that variable being in the model. Inflated standard errors lead to small t-statistics for partial regression coefficients and wider confidence intervals.
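The slide gives the interpretation but not the formula; the standard definition, stated here as background, is VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j on the other predictors:

```python
# VIF for predictor j, given R_j^2 from regressing X_j on the
# remaining predictors. VIF = 1 corresponds to no collinearity.
def vif(r_squared_j):
    return 1.0 / (1.0 - r_squared_j)
```

Note that the slide's trouble threshold of VIF > 10 corresponds to R_j^2 > 0.9, i.e. 90% of a predictor's variation is explained by the other predictors.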

Nonlinearity: Polynomial Regression When the relation between Y and X is not linear, polynomial models can be fit that approximate the relationship within a particular range of X. General form of model: E(Y) = β0 + β1X + β2X^2 + ... + βpX^p. Second order model (most widely used case, allows one "bend"): E(Y) = β0 + β1X + β2X^2. Must be very careful not to extrapolate beyond the observed X levels.
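A second-order fit can be sketched without any numerical library by solving the 3×3 normal equations for the design matrix [1, x, x^2] directly; the helper below is a minimal illustration, not production code.

```python
# Least-squares fit of E(Y) = b0 + b1*x + b2*x^2 via the normal equations.
def fit_quadratic(xs, ys):
    # Accumulate X'X (3x3) and X'y (3) for design rows [1, x, x^2].
    XtX = [[0.0] * 3 for _ in range(3)]
    Xty = [0.0] * 3
    for x, y in zip(xs, ys):
        row = [1.0, x, x * x]
        for i in range(3):
            Xty[i] += row[i] * y
            for j in range(3):
                XtX[i][j] += row[i] * row[j]
    # Solve the 3x3 system by Gaussian elimination with partial pivoting.
    A = [XtX[i] + [Xty[i]] for i in range(3)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= f * A[col][c]
    beta = [0.0] * 3
    for i in (2, 1, 0):  # back substitution
        beta[i] = (A[i][3] - sum(A[i][j] * beta[j] for j in range(i + 1, 3))) / A[i][i]
    return beta  # [b0, b1, b2]
```

On data generated exactly from y = 1 + 2x + 3x^2 the fit recovers the coefficients up to floating-point error.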

Generalized Linear Models (GLM) General class of linear models made up of 3 components: random, systematic, and link function. –Random component: identifies the dependent variable (Y) and its probability distribution –Systematic component: identifies the set of explanatory variables (X1, ..., Xk) –Link function: identifies a function of the mean that is a linear function of the explanatory variables

Random Component Conditionally normally distributed response with constant standard deviation: the regression models we have fit so far. Binary outcomes (success or failure): the random component has a Binomial distribution and the model is called Logistic Regression. Count data (number of events in a fixed area and/or length of time): the random component has a Poisson distribution and the model is called Poisson Regression. Continuous data with a skewed distribution and variation that increases with the mean can be modeled with a Gamma distribution.

Common Link Functions –Identity link (form used in normal and gamma regression models): g(μ) = μ –Log link (used when μ cannot be negative, as when data are Poisson counts): g(μ) = log(μ) –Logit link (used when μ is bounded between 0 and 1, as when data are binary): g(μ) = log(μ/(1-μ))
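The three links above, and the inverses that map the linear predictor η = g(μ) back to the mean, can be written out directly (function names here are our own, not from any library):

```python
from math import exp, log

def identity_link(mu):      # normal and gamma regression models
    return mu

def log_link(mu):           # Poisson regression: mu must be positive
    return log(mu)

def log_link_inv(eta):      # inverse of the log link
    return exp(eta)

def logit_link(mu):         # logistic regression: 0 < mu < 1
    return log(mu / (1 - mu))

def logit_link_inv(eta):    # inverse of the logit link (logistic function)
    return 1 / (1 + exp(-eta))
```

Each inverse undoes its link, e.g. `logit_link_inv(logit_link(0.3))` returns 0.3 up to floating-point error.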

Exponential Regression Models Often when modeling growth of a population, the relationship between population and time is exponential: E(Y) = α e^(βX). Taking the logarithm of each side leads to the linear relation: log(Y) ≈ log(α) + βX. Procedure: Fit a simple regression relating log(Y) to X, obtaining intercept b0 and slope b1. Then transform back: the estimate of α is e^(b0) and the estimate of β is b1.
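The procedure above can be sketched with the ordinary simple-regression formulas applied to log(Y); this is a minimal sketch assuming all Y values are positive.

```python
from math import exp, log

# Fit Y = a * exp(b * x) by simple linear regression of log(Y) on X,
# then transform the intercept back to the original scale.
def fit_exponential(xs, ys):
    n = len(xs)
    lys = [log(y) for y in ys]     # requires Y > 0
    xbar = sum(xs) / n
    lbar = sum(lys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (ly - lbar) for x, ly in zip(xs, lys))
    b = sxy / sxx                  # slope on the log scale: estimate of beta
    a = exp(lbar - b * xbar)       # back-transformed intercept: estimate of alpha
    return a, b
```

On data generated exactly from Y = 2 e^(0.5 X) the fit recovers a = 2 and b = 0.5 up to floating-point error.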