Interactions
Interaction: Does the relationship between two variables depend on a third variable?
  Does the relationship of age to BP depend on gender?
  Does a certain BP-lowering drug work as well in blacks as in non-blacks?
  Does the relationship between education and income differ by region of the country?
Sometimes called “effect modification”
Model for FEV Example
Y = b0 + b1X1 + b2X2
  X1 = smoking status (1 = smoker, 0 = non-smoker)
  X2 = age
Smokers: FEV = b0 + b1 + b2 age
Non-smokers: FEV = b0 + b2 age
FEV (smokers) – FEV (non-smokers) = b1
Assumes the slope for age is the same for smokers and non-smokers
[Figure: FEV versus AGE. Two parallel lines, non-smokers and smokers, each with slope b2 and separated vertically by b1.]
Modeling Interaction for FEV Example
Y = b0 + b1X1 + b2X2 + b3X3
  X1 = smoking status (1 = smoker, 0 = non-smoker)
  X2 = age
  X3 = age x smoking status
Smokers: FEV = b0 + b1 + (b2 + b3) age
Non-smokers: FEV = b0 + b2 age
FEV (smokers) – FEV (non-smokers) = b1 + b3 age
Ho: b3 = 0
[Figure: FEV versus AGE. Non-smokers line with slope b2, smokers line with slope b2 + b3; the vertical gap between the lines at a given age is b1 + b3 age.]
Note: A difference in slopes implies that the smoker/non-smoker difference depends on age (and vice versa).
DATA fev;
  INFILE DATALINES;
  INPUT age smk fev;
  agesmk = age*smk;
  DATALINES;
28 1 4.0
30 1 3.9
30 1 3.7
31 1 3.6
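/* Separate regressions of FEV on age within each smoking group */
PROC REG;
  WHERE smk=0;
  MODEL fev = age;
  PLOT fev*age;
  TITLE 'Non-smokers';
RUN;

PROC REG;
  WHERE smk=1;
  MODEL fev = age;
  PLOT fev*age;
  TITLE 'Smokers';
RUN;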
SMOKERS
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              5.50002          0.36163     15.21     <.0001
age          1             -0.05508          0.00885     -6.22     <.0001

NON-SMOKERS
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              5.24764          0.38050     13.79     <.0001
age          1             -0.03911          0.00887     -4.41     0.0007

Slope of age for smokers = -0.05508
Slope of age for non-smokers = -0.03911
Is the difference between these slopes statistically significant?
PROC REG;
  MODEL fev = age smk agesmk;
RUN;

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1              5.24764          0.37846     13.87     <.0001
age          1             -0.03911          0.00882     -4.43     0.0002
smk          1              0.25238          0.52482      0.48     0.6346
agesmk       1             -0.01597          0.01253     -1.27     0.2138

Interpretation:
  B(agesmk) = -0.01597 is the difference in slopes between smokers and non-smokers
  B(age) = -0.03911 is the slope for non-smokers (smk=0)

Separate-group fits, for comparison:
SMOKERS
Intercept    1              5.50002          0.36163     15.21     <.0001
age          1             -0.05508          0.00885     -6.22     <.0001
NON-SMOKERS
Intercept    1              5.24764          0.38050     13.79     <.0001
age          1             -0.03911          0.00887     -4.41     0.0007
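As a worked check, the interaction model reproduces the two separate fits exactly. The sketch below only rearranges the printed estimates, writing b0, b1, b2, b3 for the intercept, smk, age, and agesmk coefficients as in the interaction model slide.

```latex
\[
\text{Smokers' intercept: } b_0 + b_1 = 5.24764 + 0.25238 = 5.50002
\]
\[
\text{Smokers' slope: } b_2 + b_3 = -0.03911 + (-0.01597) = -0.05508
\]
\[
H_0\colon b_3 = 0, \qquad t = \frac{-0.01597}{0.01253} \approx -1.27, \qquad p = 0.2138
\]
```

With p = 0.2138, these data do not give evidence at the 0.05 level that the age slopes differ between smokers and non-smokers.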
Polynomial Regression: Adding Quadratic Term
Y = b0 + b1X + b2X^2
Can be used if a linear relationship does not hold
  Example: alcohol intake and mortality
  Example: cholesterol and mortality
Add a quadratic (squared) term
Can test the hypothesis that the quadratic term is needed
  Ho: b2 = 0
  Ha: b2 ≠ 0
Linear Regression Does not Fit Well
Adding Quadratic Term
PLOT mvo2kg*ffbw predicted.*ffbw / OVERLAY;
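A minimal sketch of how the squared term and the overlay plot might be produced together. The output data set name physfit2 is hypothetical, and physfit is assumed to contain mvo2kg and ffbw. PLOT is PROC REG's traditional graphics statement, so it may require ODS Graphics to be turned off in newer SAS releases.

```sas
/* physfit2 is a hypothetical name; physfit is assumed to hold mvo2kg and ffbw */
DATA physfit2;
  SET physfit;
  ffbw2 = ffbw * ffbw;   /* quadratic (squared) term */
RUN;

PROC REG DATA = physfit2;
  MODEL mvo2kg = ffbw ffbw2;
  /* observed values and fitted values against ffbw on one set of axes */
  PLOT mvo2kg*ffbw predicted.*ffbw / OVERLAY;
RUN;
QUIT;
```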
PROC REG DATA = physfit;
  MODEL mvo2kg = ffbw;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1            22211         22211      3.33   0.0724
Error             69           460225    6669.93228
Corrected Total   70           482436

Root MSE          81.66965    R-Square   0.0460
Dependent Mean   455.26761    Adj R-Sq   0.0322
Coeff Var         17.93882

Variable    DF    Estimate         SE   t Value   Pr > |t|
Intercept    1   382.51711   41.02856      9.32     <.0001
ffbw         1     0.17710    0.09705      1.82     0.0724
PROC REG DATA = physfit;
  MODEL mvo2kg = ffbw ffbw2;

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2           113179         56589     10.42   0.0001
Error             68           369257    5430.25411
Corrected Total   70           482436

Root MSE          73.69026    R-Square   0.2346
Dependent Mean   455.26761    Adj R-Sq   0.2121
Coeff Var         16.18614

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            980.95393        150.82611      6.50     <.0001
ffbw         1             -2.68220          0.70406     -3.81     0.0003
ffbw2        1              0.00322       0.00078761      4.09     0.0001

ffbw2 = ffbw * ffbw (computed in a DATA step)
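Two worked checks on the quadratic fit, using only the printed output (small discrepancies are rounding). First, the t test for ffbw2 matches an extra-sum-of-squares F test comparing the linear and quadratic models. Second, the turning point of the fitted curve follows from the coefficients.

```latex
\[
F = \frac{(SS_{\text{Model, quadratic}} - SS_{\text{Model, linear}})/1}{MS_{\text{Error, quadratic}}}
  = \frac{113179 - 22211}{5430.25} \approx 16.75
  \approx t^2 = (4.09)^2 \approx 16.73
\]
\[
X^{*} = -\frac{b_1}{2\,b_2} = \frac{2.68220}{2 \times 0.00322} \approx 416
\]
```

Because b2 > 0, the fitted curve has a minimum: predicted mvo2kg falls as ffbw rises toward roughly 416 and rises thereafter. The coefficient 0.00322 is rounded in the output, so the turning point is approximate.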
Model Selection
You measure many predictors; how do you decide which to include in your model?
Depends on the reason for fitting the model
  Prediction?
  Examining specific effects?
Statistical criteria exist, but they should not be used in place of scientific criteria
  Best used in an exploratory context
Statistical principles to use
Forward, backward, and stepwise selection
  Compare p-values of terms; add/remove based on α = 0.05 or 0.10
R² methods
  Look for models with the highest R²
Other methods exist
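A minimal sketch of how the entry and removal thresholds can be set explicitly in PROC REG, assuming the physfit data set and predictors used in the selection examples in these slides. SLENTRY= is the significance level a term needs to enter and SLSTAY= the level it needs to stay. Setting both avoids relying on the procedure's defaults, which differ by selection method.

```sas
PROC REG DATA = physfit;
  /* stepwise selection: a term enters if p < 0.10 and is removed if p > 0.10 */
  MODEL mvo2kg = male age hgt wgt ffbw rhr
        / SELECTION = STEPWISE SLENTRY = 0.10 SLSTAY = 0.10;
RUN;
QUIT;
```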
Possible Uses for Statistical Criteria
Outcome: measure of teenage drinking
  Many possible predictors
  Questionnaire on relationships, friends, family, church support, etc.
Outcome: echocardiographically determined hypertrophy of the heart
  Many possible ECG predictors
  Computer measurements from ECG
Backward selection procedure
Removes the worst variable, then the second worst, etc.

PROC REG DATA = physfit;
  MODEL mvo2kg = male age hgt wgt ffbw rhr / selection=backward;
RUN;

Final model:
Variable    Parameter Estimate   Standard Error   Type II SS   F Value   Pr > F
Intercept            574.86126         56.50900       167151    103.49   <.0001
male                  88.90825         12.02381        88312     54.68   <.0001
age                   -6.85862          3.80692   5242.56660      3.25   0.0762
wgt                   -6.00865          1.02203        55827     34.56   <.0001
ffbw                   0.75073          0.12729        56184     34.79   <.0001
rhr                   -0.79442          0.41916   5801.82822      3.59   0.0625
Forward selection procedure
Starts with the best single variable, adds the next best, etc.

PROC REG DATA = physfit;
  MODEL mvo2kg = male age hgt wgt ffbw rhr / selection=forward;
RUN;

In this example, it ends up including all terms except height
Exactly the same model as the one picked by backward selection
“MAXR” method
Selects several models based on maximal R²

PROC REG DATA = physfit;
  MODEL mvo2kg = male age hgt wgt ffbw rhr / selection=maxr;
RUN;

Gives the “best” models with 1, 2, 3, ... terms
You choose the best overall among the “best”
Final models by MAXR method
Two general principles to use
Parsimony - less is more
Common sense
  Don’t use social security number to predict height!
Cautionary Note
  Models with several selected variables usually predict new data less well than their fit to the original data suggests.
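One reason for the caution: R² never decreases when a variable is added, so it flatters larger models. The Adj R-Sq value that PROC REG prints penalizes the number of predictors p. The worked sketch below reproduces the printed values for the linear and quadratic ffbw models, taking n = 71 from the corrected total DF of 70.

```latex
\[
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
\]
\[
\text{Linear } (p = 1):\quad 1 - (1 - 0.0460)\,\frac{70}{69} \approx 0.0322
\]
\[
\text{Quadratic } (p = 2):\quad 1 - (1 - 0.2346)\,\frac{70}{68} \approx 0.2121
\]
```

Adjusted R² corrects only for the number of terms in the final model, not for the search over candidate models, so even it can overstate how well a selected model will predict.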