Topic 16: Multicollinearity and Polynomial Regression

Outline
–Multicollinearity
–Polynomial regression

An example (KNNL p256)
The model regresses body fat on skinfold, thigh, and midarm (revisited in the Pairwise Correlations slides below)
The P-value for the ANOVA F-test is <.0001
The P-values for the individual regression coefficients are all far above our standard significance level of 0.05
What is the explanation? Multicollinearity!!!

Multicollinearity
–A numerical analysis problem: the matrix X΄X is close to singular and is therefore difficult to invert accurately
–A statistical problem: there is too much correlation among the explanatory variables, so it is difficult to determine the individual regression coefficients
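
These two faces of the problem are linked. As a hypothetical illustration (not from the slides; the matrix values are made up), a short PROC IML sketch with two nearly collinear columns shows X΄X drifting toward singularity:

Proc iml;
  * intercept column plus two nearly collinear columns (made-up numbers);
  X = {1 1 1.01,
       1 2 2.02,
       1 3 2.98,
       1 4 4.01};
  XpX = X`*X;
  print (det(XpX));  * determinant near zero: XpX is hard to invert accurately;
quit;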

Multicollinearity
Solve the statistical problem and the numerical problem will also be solved
–We want to refine a model that currently has redundancy in the explanatory variables
–Do this regardless of whether X΄X can be inverted without difficulty

Multicollinearity
Extreme cases can help us understand the problems caused by multicollinearity
–Assume the columns of the X matrix are uncorrelated
Then the Type I and Type II SS will be the same
The contribution of each explanatory variable to the model is the same whether or not the other explanatory variables are in the model

Multicollinearity
–Suppose some linear combination of the explanatory variables is a constant
Example: X1 = X2 → X1 - X2 = 0
Example: 3X1 - X2 = 5
Example: SAT total = SATV + SATM
The Type II SS for the X's involved will all be zero

An example: Part I

Data a1;
  infile '../data/csdata.dat';
  input id gpa hsm hss hse satm satv genderm1;
Data a1; set a1;
  hs = (hsm+hse+hss)/3;  * hs is an exact linear combination of hsm, hss, and hse;
Proc reg data=a1;
  model gpa = hsm hss hse hs;
run;

Output
Something is wrong: the model df is 3, but there are 4 X's
[Analysis of Variance table: Source, DF, Sum of Squares, Mean Square, F Value, Pr > F; Model has DF 3 and Pr > F < .0001]

Output
NOTE: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.

Output
NOTE: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.
hs = 0.33333*hsm + 0.33333*hss + 0.33333*hse

Output
[Parameter Estimates table: Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|, Type I SS, Type II SS; hsm, hss, and hse have DF flagged B, and hs has DF 0]
In this extreme case, SAS does not consider hs in the model

Extent of multicollinearity
This example had one explanatory variable equal to a linear combination of the other explanatory variables
This is the most extreme case of multicollinearity and is detected by statistical software because (X΄X) does not have an inverse
We are usually concerned with less extreme cases

An example: Part II

*add a little noise to break up the perfect linear association;
Data a1; set a1;
  hs1 = hs + normal(612)*.05;  * normal(612) draws a standard normal using seed 612;
Proc reg data=a1;
  model gpa = hsm hss hse hs1;
run;

Output
Model seems to be good here
[Analysis of Variance table: Source, DF, Sum of Squares, Mean Square, F Value, Pr > F; Model has Pr > F < .0001]

Output
[Parameter Estimates table: Intercept, hsm, hss, hse, and hs1, each with Parameter Estimate, Standard Error, t Value, Pr > |t|, Type I SS, and Type II SS]
None of the predictors is significant. Look at the differences between the Type I and Type II sums of squares. What about the signs of the coefficients?

Effects of multicollinearity
Regression coefficients are not well estimated and may be meaningless, and the same holds for their standard errors
Type I SS and Type II SS will differ
R² and predicted values are usually ok
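
Not in the original slides: to print both SS decompositions side by side, the MODEL statement in PROC REG accepts the SS1 and SS2 options. A minimal sketch, assuming the a1 data set with hs1 from Part II above:

Proc reg data=a1;
  model gpa = hsm hss hse hs1 / ss1 ss2;  * request Type I and Type II sums of squares;
run;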

Pairwise Correlations
Pairwise correlations can be used to check for “pairwise” collinearity. Recall KNNL p256:

proc reg data=a1 corr;
  model fat = skinfold thigh midarm;
  model midarm = skinfold thigh;
run;

Pairwise Correlations
The first model prints Cor(skinfold, thigh), Cor(skinfold, midarm), and Cor(thigh, midarm)
The second model shows that Cor(midarm, skinfold+thigh) is extremely high!!! midarm is nearly a linear function of the other two predictors, even though its pairwise correlations with each of them do not reveal this
See KNNL p284 for the change in the coefficient values of skinfold and thigh depending on which variables are in the model
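
Beyond the slides: variance inflation factors catch exactly this kind of many-variable redundancy that pairwise correlations can miss. PROC REG prints them via the VIF (and TOL) options on the MODEL statement; a minimal sketch for the same body fat model:

proc reg data=a1;
  model fat = skinfold thigh midarm / vif tol;  * variance inflation factors and tolerances;
run;

A common rule of thumb treats a VIF much larger than 10 as a sign of serious multicollinearity.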

Polynomial regression
We can fit a quadratic, cubic, etc. relationship by defining squares, cubes, etc., of a single X in a data step and using them as additional explanatory variables
We can do this with more than one explanatory variable if needed
Issue: when we do this we generally create a multicollinearity problem, as the sketch below illustrates
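
A quick hypothetical illustration (a made-up grid, not from the slides): a variable and its square are typically very highly correlated:

Data powers;
  do x = 1 to 10;
    x2 = x*x;  * square term;
    output;
  end;
run;
Proc corr data=powers;
  var x x2;  * for this 1-to-10 grid, r(x, x2) is about 0.97;
run;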

KNNL Example p300
Response variable is the life (in cycles) of a power cell
Explanatory variables are
–Charge rate (3 levels)
–Temperature (3 levels)
This is a designed experiment

Input and check the data

Data a1;
  infile '../data/ch08ta01.txt';
  input cycles chrate temp;
run;
Proc print data=a1;
run;

Output
[PROC PRINT listing with columns Obs, cycles, chrate, temp]

Create new variables and run the regression

Data a1; set a1;
  chrate2=chrate*chrate;  * quadratic term;
  temp2=temp*temp;        * quadratic term;
  ct=chrate*temp;         * cross-product term;
Proc reg data=a1;
  model cycles = chrate temp chrate2 temp2 ct;
run;

Output
[Analysis of Variance table: Source, DF, Sum of Squares, Mean Square, F Value, Pr > F]

Output
[Parameter Estimates table: Intercept, chrate, temp, chrate2, temp2, and ct, each with Parameter Estimate, Standard Error, t Value, Pr > |t|]

Conclusion
Overall F significant, individual t's not significant → multicollinearity problem
Look at the correlations (proc corr; a sketch follows)
There are some very high correlations, notably r(chrate, chrate2) and r(temp, temp2)
A variable and its powers are strongly correlated
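
A minimal sketch of that correlation check, using the a1 data set with the squared and cross-product terms created above:

Proc corr data=a1;
  var chrate temp chrate2 temp2 ct;  * correlation matrix of the five predictors;
run;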

A remedy
We can remove the correlation between explanatory variables and their powers by centering
Centering means that you subtract off the mean before squaring, etc.
KNNL rescaled by standardizing (subtract the mean and divide by the standard deviation), but subtracting the mean is the key step here; a small illustration follows
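
Continuing the hypothetical powers grid from the polynomial regression slide: centering before squaring removes the correlation entirely when the grid is symmetric about its mean:

Data powers2; set powers;
  cx = x - 5.5;  * center x at its mean (the mean of 1, ..., 10 is 5.5);
  cx2 = cx*cx;
run;
Proc corr data=powers2;
  var cx cx2;  * r(cx, cx2) = 0 here because the centered grid is symmetric;
run;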

A remedy
Use Proc Standard to center the explanatory variables
Recompute the squares, cubes, etc., using the centered variables
Rerun the regression analysis

Proc standard

Data a2; set a1;
  schrate=chrate; stemp=temp;
  keep cycles schrate stemp;
Proc standard data=a2 out=a3 mean=0 std=1;
  var schrate stemp;  * standardize to mean 0 and standard deviation 1;
Proc print data=a3;
run;

Output
[PROC PRINT listing with columns Obs, cycles, schrate, stemp]

Recompute squares and cross product

Data a3; set a3;
  schrate2=schrate*schrate;
  stemp2=stemp*stemp;
  sct=schrate*stemp;

Rerun regression

Proc reg data=a3;
  model cycles = schrate stemp schrate2 stemp2 sct;
run;

Output
[Analysis of Variance table: Source, DF, Sum of Squares, Mean Square, F Value, Pr > F]
Exact same ANOVA table as before

Output
[Parameter Estimates table: Intercept, schrate, stemp, schrate2, stemp2, and sct, each with Parameter Estimate, Standard Error, t Value, Pr > |t|]

Conclusion
Overall F significant
Individual t's significant for the linear terms schrate and stemp
It appears a first-order (linear) model will suffice
Could do a formal general linear test of the second-order terms to confirm this; a sketch follows
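
A minimal sketch of that general linear test, using the TEST statement in PROC REG on the centered fit (the label quad is arbitrary):

Proc reg data=a3;
  model cycles = schrate stemp schrate2 stemp2 sct;
  quad: test schrate2=0, stemp2=0, sct=0;  * F test that all second-order terms are zero;
run;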

Last slide
We went over KNNL 7.6 and 8.1
We used the program Topic16.sas to generate the output for today