Regression: (2) Multiple Linear Regression and Path Analysis
Hal Whitehead
BIOL4062/5062
Multiple Linear Regression and Path Analysis
Multiple linear regression:
–assumptions
–parameter estimation
–hypothesis tests
–selecting independent variables
–collinearity
–polynomial regression
Path analysis
Regression
One dependent variable: Y
Independent variables: X(1), X(2), X(3), ...
Purposes of Regression
1. Relationship between Y and the X's
2. Quantitative prediction of Y
3. Relationship between Y and X controlling for C
4. Which of the X's are most important?
5. Best mathematical model
6. Compare regression relationships: Y1 on X, Y2 on X
7. Assess interactive effects of the X's
Simple regression: one X
Multiple regression: two or more X's
Y = β0 + β1·X(1) + β2·X(2) + β3·X(3) + … + βk·X(k) + E
Multiple linear regression: assumptions (1)
–For any specific combination of X's, Y is a (univariate) random variable with a certain probability distribution having finite mean and variance (Existence)
–Y values are statistically independent of one another (Independence)
–The mean value of Y given the X's is a linear function of the X's (Linearity)
Multiple linear regression: assumptions (2)
–The variance of Y is the same for any fixed combination of X's (Homoscedasticity)
–For any fixed combination of X's, Y has a normal distribution (Normality)
–There are no measurement errors in the X's (X's measured without error)
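A minimal sketch of how the normality and homoscedasticity assumptions might be checked on the residuals of a fitted model, using simulated (hypothetical) data; statsmodels and scipy are assumed available:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import het_breuschpagan

# Hypothetical data: 50 cases, 3 independent variables
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = 2.0 + X @ np.array([1.5, -0.8, 0.3]) + rng.normal(scale=0.5, size=50)

fit = sm.OLS(y, sm.add_constant(X)).fit()

# Normality: Shapiro-Wilk test on the residuals
_, p_norm = shapiro(fit.resid)

# Homoscedasticity: Breusch-Pagan test (regresses squared residuals
# on the X's; a small p-value suggests unequal variance)
_, p_bp, _, _ = het_breuschpagan(fit.resid, fit.model.exog)

print(f"Shapiro-Wilk p = {p_norm:.3f}, Breusch-Pagan p = {p_bp:.3f}")
```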
Multiple linear regression: parameter estimation
Y = β0 + β1·X(1) + β2·X(2) + β3·X(3) + … + βk·X(k) + E
–Estimate the β's in multiple regression using least squares
–Sizes of the coefficients are not good indicators of the importance of the X variables
–Number of data points in multiple regression: at least one more than the number of X's, and preferably 5 times the number of X's
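A minimal sketch of least-squares estimation of the β's on simulated data (all names and values here are hypothetical; numpy's lstsq would give the same estimates):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: n = 40 cases for k = 3 X's (more than 5 times k),
# generated from known coefficients so the estimates can be checked
rng = np.random.default_rng(0)
n, k = 40, 3
X = rng.normal(size=(n, k))
beta = np.array([2.0, 1.5, -0.8, 0.3])          # true β0, β1, β2, β3
y = beta[0] + X @ beta[1:] + rng.normal(scale=0.5, size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()       # least squares
print(fit.params)                               # estimates of β0..β3
print(fit.bse)                                  # their standard errors
```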
Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)
Multiple regression of Y [Log(CNS)] on:
X            β      SE(β)
Log(Mass)   -0.49   (0.70)
Log(Fat)    -0.07   (0.10)
Log(Muscle)  1.03   (0.54)
Log(Heart)   0.42   (0.22)
Log(Bone)   -0.07   (0.30)
N = 39
Multiple linear regression: hypothesis tests
Usually test:
H0: Y = β0 + β1·X(1) + β2·X(2) + … + βj·X(j) + E
H1: Y = β0 + β1·X(1) + β2·X(2) + … + βj·X(j) + … + βk·X(k) + E
F-test with k-j and n-k-1 degrees of freedom ("partial F-test")
H0: variables X(j+1), …, X(k) do not help explain variability in Y
Multiple linear regression: hypothesis tests
e.g. Test significance of the overall multiple regression:
H0: Y = β0 + E
H1: Y = β0 + β1·X(1) + β2·X(2) + … + βk·X(k) + E
Test significance of:
–adding an independent variable
–deleting an independent variable
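A minimal sketch of a partial F-test comparing nested models, using statsmodels' anova_lm on hypothetical data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical data frame: response y, predictors x1, x2, x3
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(60, 3)), columns=["x1", "x2", "x3"])
df["y"] = 1.0 + 2.0 * df.x1 + 0.5 * df.x2 + rng.normal(size=60)

reduced = smf.ols("y ~ x1", data=df).fit()           # H0 model (j = 1)
full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()    # H1 model (k = 3)

# Partial F-test with k-j = 2 and n-k-1 = 56 degrees of freedom;
# H0: x2 and x3 do not help explain variability in y
print(anova_lm(reduced, full))
```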
Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)
Multiple regression of Y [Log(CNS)] on:
X            β      SE(β)   P
Log(Mass)   -0.49   (0.70)  0.49
Log(Fat)    -0.07   (0.10)  0.52
Log(Muscle)  1.03   (0.54)  0.07
Log(Heart)   0.42   (0.22)  0.06
Log(Bone)   -0.07   (0.30)  0.83
Each P tests whether removal of the variable reduces fit
Multiple linear regression: selecting independent variables
Reasons for selecting a subset of the independent variables (X's):
–cost (financial and other)
–simplicity
–improved prediction
–improved explanation
Multiple linear regression: selecting independent variables
Partial F-test:
–predetermined forward selection
–forward selection based upon improvement in fit
–backward selection based upon improvement in fit
–stepwise (backward/forward)
Mallows' C(p)
AIC
Multiple linear regression: selecting independent variables
Partial F-test:
–predetermined forward selection, e.g. in the order Mass, Bone, Heart, Muscle, Fat
–forward selection based upon improvement in fit
–backward selection based upon improvement in fit
–stepwise (backward/forward)
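A minimal sketch of forward selection based upon improvement in fit, judged here by AIC rather than a partial F-test (variable names echo the example below, but the data are simulated):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the brain-size data
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(50, 4)),
                  columns=["mass", "fat", "muscle", "heart"])
df["y"] = 1.0 + 1.2 * df.muscle + 0.4 * df.heart + rng.normal(size=50)

selected, remaining = [], ["mass", "fat", "muscle", "heart"]
best_aic = smf.ols("y ~ 1", data=df).fit().aic        # constant-only model
while remaining:
    # AIC of each one-variable extension of the current model
    trials = {x: smf.ols("y ~ " + " + ".join(selected + [x]),
                         data=df).fit().aic
              for x in remaining}
    best_x = min(trials, key=trials.get)
    if trials[best_x] >= best_aic:                    # no improvement: stop
        break
    best_aic = trials[best_x]
    selected.append(best_x)
    remaining.remove(best_x)

print(selected, round(best_aic, 1))
```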
Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)
Complete model (r² = 0.97)
Forward stepwise (α-to-enter = 0.15; α-to-remove = 0.15):
–1. Constant (r² = 0.00)
–2. Constant + Muscle (r² = 0.97)
–3. Constant + Muscle + Heart (r² = 0.97)
–4. Constant + Muscle + Heart + Mass (r² = 0.97)
Final model: …×Mass + 1.24×Muscle + …×Heart
Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)
Complete model (r² = 0.97)
Backward stepwise (α-to-enter = 0.15; α-to-remove = 0.15):
–1. All (r² = 0.97)
–2. Remove Bone (r² = 0.97)
–3. Remove Fat (r² = 0.97)
Final model: …×Mass + 1.24×Muscle + …×Heart
Comparing models
Mallows' C(p):
–C(p) = (k-p)·F(p) + (2p-k+1)
–k parameters in the full model; p parameters in the restricted model
–F(p) is the F value comparing the fit of the restricted model with that of the full model
–Lowest C(p) indicates the best model
Akaike Information Criterion (AIC):
–AIC = n·log(σ̂²) + 2p
–Lowest AIC indicates the best model
–Can compare models not nested within one another
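A minimal sketch of computing C(p) from the formula above, together with AIC, for a restricted model against a full model (hypothetical data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x1", "x2", "x3"])
df["y"] = 1.0 + 2.0 * df.x1 + rng.normal(size=50)

full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()     # k = 3 X's
restricted = smf.ols("y ~ x1", data=df).fit()         # p = 1 X
k, p = 3, 1

# Partial F comparing the restricted model with the full model
F = ((restricted.ssr - full.ssr) / (k - p)) / (full.ssr / full.df_resid)
Cp = (k - p) * F + (2 * p - k + 1)
print(f"C(p) = {Cp:.2f}, AIC = {restricted.aic:.1f}")
```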
Collinearity
If two (or more) X's are linearly related, they are collinear and the regression problem is indeterminate, e.g.:
–X(3) = 5·X(2) + 16, or
–X(2) = 4·X(1) + 16·X(4)
If the X's are nearly linearly related (near collinearity), coefficients and tests become very inaccurate
What to do about collinearity?
–Centering (mean = 0)
–Scaling (SD = 1)
–Regression on the first few principal components
–Ridge regression
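A minimal sketch of diagnosing near collinearity with variance inflation factors after centering and scaling; VIFs are not on the slide but are a standard companion to these remedies (data simulated):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated X's in which x3 is nearly a linear function of x2
rng = np.random.default_rng(5)
x1 = rng.normal(size=80)
x2 = rng.normal(size=80)
x3 = 5 * x2 + rng.normal(scale=0.1, size=80)
X = np.column_stack([x1, x2, x3])

# Centre (mean = 0) and scale (SD = 1) each column
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# VIF for each X; values far above ~10 flag near collinearity
vifs = [variance_inflation_factor(Xs, i) for i in range(Xs.shape[1])]
print(np.round(vifs, 1))
```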
Curvilinear (Polynomial) Regression
Y = β0 + β1·X + β2·X² + β3·X³ + … + βk·X^k + E
–Used to fit fairly complex curves to data
–β's estimated using least squares
–Use sequential partial F-tests, or AIC, to find how many terms to use (k > 3 is rare in biology)
–Better to transform the data and use simple linear regression, when possible
[Figure: the same data fitted with first-order (linear), second-order (quadratic), and third-order (cubic) polynomial regressions of Y on X. From Sokal and Rohlf]
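A minimal sketch of fitting polynomials of increasing order and comparing them by the AIC formula given earlier (simulated data; numpy only):

```python
import numpy as np

# Simulated data with mild curvature
rng = np.random.default_rng(6)
x = np.linspace(0, 10, 40)
y = 2 + 0.5 * x - 0.08 * x**2 + rng.normal(scale=0.3, size=40)

n = len(y)
for k in (1, 2, 3):                       # polynomial order
    coeffs = np.polyfit(x, y, k)          # least-squares fit of order k
    resid = y - np.polyval(coeffs, x)
    sigma2 = np.sum(resid**2) / n         # estimated error variance
    aic = n * np.log(sigma2) + 2 * (k + 1)
    print(f"order {k}: AIC = {aic:.1f}")
```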
Path Analysis
Models with causal structure
–Represented by a path diagram
–All variables quantitative
–All path relationships assumed linear (transformations may help)
[Path diagram linking variables A, B, C, D, E]
Path Analysis
–All paths are one-way: A => C or C => A, not both
–No loops
–Some variables may not be directly observed: residual variables (U)
–Some variables not observed but known to exist: latent variables (D)
[Path diagram with variables A, B, C, D, E and residual variable U]
Path Analysis
Path coefficients and other statistics are calculated using multiple regressions
Variables are:
–centered (mean = 0), so no constants in the regressions
–often standardized (SD = 1), so path coefficients usually lie between -1 and +1
Paths with coefficients not significantly different from zero may be eliminated
[Path diagram with variables A, B, C, D, E and residual variable U]
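A minimal sketch of computing path coefficients from multiple regressions on standardized variables, assuming a hypothetical diagram A → C ← B, C → E:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data consistent with the assumed diagram
rng = np.random.default_rng(7)
n = 100
A = rng.normal(size=n)
B = rng.normal(size=n)
C = 0.6 * A + 0.3 * B + rng.normal(scale=0.7, size=n)
E = 0.5 * C + rng.normal(scale=0.8, size=n)

def standardize(v):                     # centre (mean 0), scale (SD 1)
    return (v - v.mean()) / v.std()

A, B, C, E = map(standardize, (A, B, C, E))

# Path coefficients into C: regress C on its direct causes A and B
# (no constant term needed because all variables are centred)
paths_C = sm.OLS(C, np.column_stack([A, B])).fit().params
# Path coefficient into E: regress E on its direct cause C
paths_E = sm.OLS(E, C).fit().params
print(paths_C, paths_E)
```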
Path Analysis: an example Isaak and Hubert “Production of stream habitat gradients by montane watersheds: hypothesis tests based on spatially explicit path analyses” Can. J. Fish. Aquat. Sci.
[Path diagram from Isaak and Hubert: dashed lines (- - -) show predicted negative interactions; solid lines show predicted positive interactions]