Experimental Statistics - Week 12. Chapter 12: Multiple Regression; Chapter 13: Variable Selection and Model Checking.


1 Experimental Statistics - week 12 Chapter 12: Multiple Regression Chapter 13: Variable Selection Model Checking

2 Data (page 628): weight loss in a chemical compound as a function of exposure time and humidity. Y = weight loss (wtloss); X1 = exposure time (exptime); X2 = relative humidity (humidity). [Data table of Y, X1, X2 values not reproduced in the transcript.]

3 Chemical Weight Loss – MLR Output
The REG Procedure. Dependent variable: wtloss. Number of observations read: 12; used: 12.
Analysis of Variance: the overall model F test is significant (Pr > F < .0001).
Parameter estimates: Intercept, exptime (Pr > |t| < .0001), and humidity.
[Sums of squares, F value, Root MSE, R-Square, Adj R-Sq, Dependent Mean, Coeff Var, and the parameter estimate values were not recoverable from the transcript.]

4 Examining Contributions of Individual X Variables
Use the t-test for the X variable in question; this tests the effect of that particular independent variable while all other independent variables are held constant.
[Parameter estimates table repeated from the previous slide; exptime is significant with Pr > |t| < .0001.]
Note: in this equation, weight loss is positively related to exposure time and negatively related to humidity.
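As a reminder of the mechanics (standard regression theory; the symbols below do not appear on the slide): for the j-th coefficient the test statistic is

t_j = (estimate of β_j) / (standard error of β_j),

which under H0: β_j = 0 follows a t distribution with n - k - 1 degrees of freedom, where n is the number of observations and k the number of independent variables (here n = 12 and k = 2, giving 9 degrees of freedom).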

5 Residual Analysis in Multiple Regression
Examine the residuals to help determine whether:
- the assumptions are met
- the regression model is appropriate
Residual plots:
- each independent variable in the final model vs. residuals
- predicted Y vs. residuals
- run order vs. residuals

6 PROC REG;
  MODEL wtloss = exptime humidity;
  OUTPUT OUT=new R=resid2 P=predict2;
RUN;
PROC GPLOT;
  TITLE 'Plot of Residuals - MLR Model';
  PLOT resid2*exptime;
  PLOT resid2*humidity;
  PLOT resid2*predict2;
RUN;

7 Infant Length Data (Probability and Statistics for Engineers and Scientists – Walpole, Myers, Myers, and Ye, page 433)
Data set: 9 infants (2-3 months of age).
Dependent variable (Y): current infant length (cm).
Independent variables: X1 = age (days); X2 = length at birth (cm); X3 = weight at birth (kg); X4 = chest size at birth (cm).
Goal: obtain an estimating equation relating the length of an infant to all or a subset of these independent variables.

DATA infant;
  INPUT id y x1 x2 x3 x4;
DATALINES;
  (data values not reproduced in the transcript)
;
PROC CORR;
  VAR y x1 x2 x3 x4;
RUN;
PROC REG;
  MODEL y = x1 x2 x3 x4;
  OUTPUT OUT=new R=resid;
RUN;

8 SAS PROC CORR Output
Pearson correlation coefficients, N = 9; Prob > |r| under H0: Rho = 0, among y, x1, x2, x3, x4. [Correlation values not recoverable from the transcript.]
Note: x1, x2, and x3 are significantly correlated with y, while x4 is not. Recall that this means a simple linear regression of y on x1, x2, or x3 alone would be significant.
Standard SAS PROC REG output for all 4 independent variables (X1, X2, X3, X4); dependent variable: Y. [ANOVA table, fit statistics, and parameter estimates not recoverable from the transcript.]
Note: even though the overall p-value is small (.0003), there is much confusion concerning the contributions of the individual X variables; this is probably due to collinearity.

9 Setting: We have a dependent variable Y and several candidate independent variables. Question: Should we use all of them?

10 Why do we run multiple regression?
1. To obtain estimates of the individual coefficients in a model (+ or -, etc.)
2. To screen variables and determine which have a significant effect on the model
3. To arrive at the most effective (and efficient) prediction model

11 The problem: collinearity among the independent variables
-- high correlation between 2 independent variables
-- one independent variable nearly a linear combination of the other independent variables
-- etc.
Example: x1 = total income, x2 = bonus, x3 = monthly income.
Note: x1 = 12*x3 + x2, an exact linear relationship -- a singularity -- so SAS cannot use all 3.

12 Effects of Collinearity
-- parameter estimates are highly variable and unreliable
-- parameter estimates may even have the opposite sign from what is reasonable
-- the overall F test may be significant even though none of the individual t-tests is significant
Variable selection techniques: techniques for "being careful" about which variables are put into the model.
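A standard SAS diagnostic for collinearity is the variance inflation factor; a minimal sketch using the infant data (the VIF and COLLIN model options are part of PROC REG, though the slides do not show this step):

PROC REG DATA=infant;
  MODEL y = x1 x2 x3 x4 / VIF COLLIN;  /* VIF above roughly 10 is a common warning sign of collinearity */
RUN;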

13 Variable Selection Procedures
- Forward selection
- Backward elimination
- Stepwise
- Best subset

14 Forward Selection
Step 1: choose the Xj that gives the highest R² (i.e., has the highest correlation with Y) -- call it X1.
Step 2: choose another Xj to go along with X1 by finding the one that maximizes R².
Note: this new R² will be at least as large as the one in Step 1.
Problem: has the new variable increased R² enough to be "useful"?
Solution: examine the significance level (p) of the new variable -- keep the variable if p < SLENTRY (I used SLENTRY = .15 in the example).
The procedure continues until no new variable satisfies the entry criterion.

15 FORWARD SELECTION RESULTS FROM SAS
Stepwise procedure for dependent variable Y.
Step 1: variable X1 entered. [R-square, C(p), ANOVA table, parameter estimates, and bounds on condition number not recoverable from the transcript.]
Note: the F values in this output are the squares of the usual t values in SAS.
Step 2: variable X3 entered. [Numeric results likewise not recoverable.]
All variables left in the model are significant at the stated level. No other variable met the significance level for entry into the model.
Summary of the stepwise procedure for dependent variable Y: Step 1 entered X1; Step 2 entered X3.
This is the end of the SAS forward-selection output. The final regression equation is [not reproduced in the transcript]. We can see from the model that an increase in age or in weight at birth predicts a longer current length.
NOTICE: SAS picked 2 independent variables and then stopped.
The code used:
PROC REG;
  MODEL y = x1 x2 x3 x4 / SELECTION=FORWARD SLENTRY=.15;
RUN;
The next pages show SAS output from standard PROC REG; each set of output on the following pages is from a separate run of PROC REG.

16 Standard SAS PROC REG printout for 3 features -- to show why the stepwise procedure stopped with 2 features.
Model with X1, X3, and X4 (dependent variable: Y). [ANOVA table, fit statistics, and parameter estimates not recoverable from the transcript.] Note: the p-value for X4 is too large.
Model with X1, X3, and X2 (dependent variable: Y). [Numeric results likewise not recoverable.] Note: X2 really messes up the p-values, and the p-value for X2 is too large.

17 Standard SAS PROC REG output for X1 alone and for X1 & X3.
Model with X1 (dependent variable: Y). [ANOVA table, fit statistics, and parameter estimates not recoverable from the transcript.]
Model with X1 and X3 (dependent variable: Y). [Numeric results likewise not recoverable.]

18 Plots for residual analysis for the final model (x1, x3): residuals vs. ID and residuals vs. predicted values. [Plots not reproduced in the transcript.]

19 Backward Elimination
- Begin with all independent variables in the model.
- Find the independent variable that is "least useful" in predicting the dependent variable (i.e., smallest partial R², F (or t), etc.); delete this variable if its p-value exceeds SLSTAY.
- Continue the process until no further variables are deleted.
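In SAS, backward elimination is a one-line option; a minimal sketch on the infant data (the SLSTAY value here is for illustration, not from the slides):

PROC REG DATA=infant;
  MODEL y = x1 x2 x3 x4 / SELECTION=BACKWARD SLSTAY=.10;  /* repeatedly drop the weakest variable while its p-value exceeds .10 */
RUN;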

20 Stepwise Selection
- Add independent variables one at a time as in forward selection (enter if p < SLENTRY).
- At each stage, perform backward elimination to see whether any variables should be removed (remove if p > SLSTAY).
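A minimal sketch of the corresponding SAS call (SLENTRY = .15 matches the forward-selection example; the SLSTAY value is an assumption for illustration):

PROC REG DATA=infant;
  MODEL y = x1 x2 x3 x4 / SELECTION=STEPWISE SLENTRY=.15 SLSTAY=.15;
RUN;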

21 Best Subset Regression
Examine criteria for all acceptable subsets of each "size", i.e., number of independent variables.
Criteria: R², adjusted R², Cp.
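PROC REG can tabulate these criteria for every subset size; a sketch (SELECTION=RSQUARE with the ADJRSQ and CP options is standard syntax; BEST= limits the listing):

PROC REG DATA=infant;
  MODEL y = x1 x2 x3 x4 / SELECTION=RSQUARE ADJRSQ CP BEST=4;  /* R-square, adjusted R-square, and Cp for the best 4 models of each size */
RUN;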

22 Adjusted R²
-- adjusts for the number of independent variables
-- penalizes excessive use of independent variables
-- useful for comparing competing models with differing numbers of independent variables
The Cp statistic plays a similar role.
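For reference, the standard formulas (not displayed on the slide), with n observations and k independent variables:

R²_adj = 1 - (1 - R²)(n - 1)/(n - k - 1),

and, for a candidate model with p parameters (including the intercept),

Cp = SSE_p / MSE_full - (n - 2p),

where MSE_full is the mean squared error from the model containing all candidate variables; values of Cp near p indicate a well-fitting subset.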

23 Multiple Regression – Analysis Suggestions
1. Examine pairwise correlations among variables.
2. Examine pairwise scatterplots among variables.

24 SPSS output from the INFANT data set. [Image not reproduced in the transcript.]

25 SPSS output from the CAR data set. [Image not reproduced in the transcript.]

26 Multiple Regression – Analysis Suggestions
1. Examine pairwise correlations among variables.
2. Examine pairwise scatterplots among variables to:
   - identify nonlinearity
   - identify unequal-variance problems
   - identify possible outliers
3. Try transformations of variables for:
   - correcting nonlinearity
   - stabilizing the variances
   - inducing normality of residuals
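A minimal sketch of suggestions 1 and 2 in SAS (the slides show SPSS output; PROC SGSCATTER is a standard SAS alternative assumed here):

PROC CORR DATA=infant;
  VAR y x1 x2 x3 x4;        /* pairwise correlations */
RUN;
PROC SGSCATTER DATA=infant;
  MATRIX y x1 x2 x3 x4;     /* pairwise scatterplot matrix */
RUN;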

27 Examples of nonlinear data "shapes" and linearizing transformations. [Figure not reproduced in the transcript.]

28 Exponential Transformation (Log-Linear)
Original model: Y = β0·e^(β1·X)·ε, curving upward when β1 > 0 and decaying when β1 < 0.
Transformed into: ln Y = ln β0 + β1·X + ln ε, which is linear in X.

29 Multiplicative Model (Log-Log)
Original model: Y = β0·X^(β1)·ε.
Transformed into: log Y = log β0 + β1·log X + log ε, which is linear in log X.

30 Square Root Transformation (shapes differ for β1 > 0 and β1 < 0)
Regressing on the square root of X, i.e., Y = β0 + β1·√X + ε, linearizes data with this shape.

31 Note:
- transforming Y using the log or square root transformation can help with unequal-variance problems
- these transformations may also help induce normality of the residuals

32 Highway mpg (hmpg) vs. horsepower (hp) for the CAR data under four transformations: hmpg vs. hp; hmpg vs. sqrt(hp); log(hmpg) vs. hp; log(hmpg) vs. log(hp). [Plots not reproduced in the transcript.]
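A sketch of SAS code that would produce these four views (the data set name car and the variable names hmpg and hp are assumed from the slides):

DATA car2;
  SET car;                  /* assumed to contain hmpg and hp */
  sqrthp  = sqrt(hp);
  loghmpg = log(hmpg);
  loghp   = log(hp);
RUN;
PROC GPLOT DATA=car2;
  PLOT hmpg*hp;             /* original scale */
  PLOT hmpg*sqrthp;         /* sqrt transformation of hp */
  PLOT loghmpg*hp;          /* log(hmpg) vs hp */
  PLOT loghmpg*loghp;       /* log-log */
RUN;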