© Department of Statistics 2012. STATS 330, Lecture 19.

Slide 1: Stats 330: Lecture 19
Slide 2: Plan of the day
In today's lecture we look at some general strategies for choosing models with many continuous and categorical explanatory variables, and discuss an example.
Slide 3: General principle
For a problem with both continuous and categorical explanatory variables, the most general model fits a separate regression for each possible combination of the factor levels. That is, we allow the categorical variables to interact with each other and with the continuous variables.
Slide 4: Illustration
Two factors A and B, two continuous explanatory variables X and Z.
The general model is y ~ A*B*X + A*B*Z.
Suppose A has a levels and B has b levels, so there are ab factor-level combinations.
Each combination has a separate regression with 3 parameters:
- constant term
- coefficient of X
- coefficient of Z
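As a concrete sketch of this parameter count, the model can be fitted to a small simulated data set (all names here are hypothetical, not from the lecture):

```r
# Hypothetical illustration: two factors A and B, two continuous variables X and Z
set.seed(1)
df <- data.frame(
  A = factor(rep(c("a1", "a2"), each = 20)),
  B = factor(rep(c("b1", "b2"), times = 20)),
  X = rnorm(40),
  Z = rnorm(40)
)
df$y <- rnorm(40)            # noise-only response; the point is the model structure
fit <- lm(y ~ A*B*X + A*B*Z, data = df)
# a = b = 2, so ab = 4 combinations, each with 3 parameters: 4 x 3 = 12 in total
length(coef(fit))            # 12
```

The 12 coefficients split exactly as the next two slides describe: 4 intercept terms (Intercept, A, B, A:B), 4 X-slope terms (X, A:X, B:X, A:B:X) and 4 Z-slope terms.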
Slide 5: Illustration (cont.)
There are ab constant terms; we can arrange them in an a x b table.
The table can be split into main effects and interactions, as in a two-way ANOVA.
These appear in the output as Intercept, A, B and A:B.
Slide 6: Illustration (cont.)
There are ab X-coefficients, which can also be arranged in an a x b table.
Again the table splits into main effects and interactions, as in a two-way ANOVA; these appear in the output as X, A:X, B:X and A:B:X. The same applies to Z.
If all of the A:X, B:X and A:B:X interactions are zero, the coefficient of X is the same for all ab regressions.
Slide 7: Model selection
In these situations the number of possible models is large, so we need variable-selection techniques:
- anova
- stepwise selection
Don't include a high-order interaction unless you also include the lower-order interactions it contains.
Slide 8: Caution
Sometimes we don't have enough data to fit a separate regression to each factor-level combination (each combination needs at least one more data point than the number of continuous variables).
In this case we drop the higher-order interactions, forcing the corresponding coefficients to take common values across combinations.
Slide 9: Example: risk factors for low birthweight
These data were collected at Baystate Medical Center, Springfield, Massachusetts, during 1986, as part of a study to identify risk factors for low-birthweight babies. The response variable was birthweight, and data were collected on a variety of continuous and categorical explanatory variables.
Slide 10: Variables
- age: mother's age in years (continuous)
- lwt: mother's weight in pounds (continuous)
- race: mother's race ('1' = white, '2' = black, '3' = other), factor. Must be a factor!! (it is coded numerically)
- smoke: smoking during pregnancy (1 = smoked, 0 = didn't smoke), factor
- ht: history of hypertension (0 = No, 1 = Yes), factor
- ui: presence of uterine irritability (0 = No, 1 = Yes), factor
- bwt: birth weight in grams (continuous, response)
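These variables correspond to the birthwt data frame in R's MASS package (the same Baystate study); a sketch of building the births.df data frame named on the later slides, assuming that correspondence holds:

```r
library(MASS)                       # provides birthwt, the Baystate birthweight data
births.df <- birthwt[, c("age", "lwt", "race", "smoke", "ht", "ui", "bwt")]
# race, smoke, ht and ui are stored as numbers, so they must be declared factors
births.df$race  <- factor(births.df$race)
births.df$smoke <- factor(births.df$smoke)
births.df$ht    <- factor(births.df$ht)
births.df$ui    <- factor(births.df$ui)
str(births.df)                      # 189 observations
```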
Slide 11: Preliminary plots
[Plots of bwt against the explanatory variables; figures not reproduced]
Slide 12: Plotting conclusions
The plots show some relationships between bwt and the covariates:
- a slight relationship with lwt
- small effects due to the categorical variables
On to fitting models...
Slide 13: Factor-level combinations
There are 2 continuous explanatory variables and 4 categorical explanatory variables: race (3 levels), smoke (2 levels), ht (2 levels) and ui (2 levels).
There are 3 x 2 x 2 x 2 = 24 factor-level combinations, so 24 regressions in all!
Slide 14: Models
The most general model fits a separate regression surface to each of the 24 combinations.
Assuming planes are appropriate, this means 24 x 3 = 72 parameters. There are 189 observations, so this is rather a lot of parameters (usually we want at least 5 observations per parameter).
In fact not all factor-level combinations have enough data to fit a plane (each needs at least 3 points).
The model fitting separate planes to each combination is

bwt ~ age*race*smoke*ht*ui + lwt*race*smoke*ht*ui
Slide 15: Fitting
We can fit the model and use the anova function to reduce the number of variables:

> births.lm <- lm(bwt ~ age*race*smoke*ui*ht + lwt*race*smoke*ui*ht,
                  data = births.df)
> anova(births.lm)

We can also use the stepwise function with the forward option:

> null.lm <- lm(bwt ~ 1, data = births.df)
> step(null.lm, formula(births.lm), direction = "forward")
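A self-contained version of these commands, assuming births.df is built from the MASS birthwt data (an assumption; the lecture's data set may differ in detail):

```r
library(MASS)
births.df <- within(birthwt, {
  race <- factor(race); smoke <- factor(smoke)
  ht   <- factor(ht);   ui    <- factor(ui)
})
# full model: separate planes (where estimable) for each factor-level combination
births.lm <- lm(bwt ~ age*race*smoke*ui*ht + lwt*race*smoke*ui*ht,
                data = births.df)
anova(births.lm)                     # sequential F-tests for each term
# forward stepwise search from the null model, with the full model as the scope
null.lm <- lm(bwt ~ 1, data = births.df)
step.lm <- step(null.lm, formula(births.lm), direction = "forward",
                trace = 0)          # trace = 0 suppresses the step-by-step log
formula(step.lm)
```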
Slide 16: Results: anova
Analysis of Variance Table (Df, Sum Sq, Mean Sq, F and p-value columns lost in this extract; significance codes retained)

age
race        **
smoke       ***
ui          ***
ht          *
lwt         **
age:race
age:smoke   *
race:smoke
age:ui
race:ui
smoke:ui
age:ht      *
race:ht
smoke:ht
race:lwt
smoke:lwt
Slide 17: Results: anova (cont.)
Analysis of Variance Table, continued

ui:lwt
ht:lwt
age:race:smoke
age:race:ui
age:smoke:ui
race:smoke:ui
age:race:ht
age:smoke:ht
race:smoke:lwt
race:ui:lwt
smoke:ui:lwt
race:ht:lwt
age:race:smoke:ui   *
race:smoke:ui:lwt
Residuals
Slide 18: Results: stepwise (forward/both)
Final step (AIC and RSS values lost in this extract):

Step: AIC = ...
bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke

            Df  Sum of Sq  RSS  AIC
race:smoke
ui:lwt
smoke:ht
ht:lwt
age
smoke:lwt
race:ht
race:lwt
ui
Slide 19: Comparisons
Three models to compare:
- the full model
- the model indicated by anova (model 2):
  bwt ~ age + ui + race + smoke + ht + lwt + age:ht + age:smoke
- the model chosen by stepwise (model 3):
  bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke
Slide 20: Model comparison
Table comparing adjusted R2, R2, number of parameters and AIC for the full model, model 2, model 3 and the additive model (numeric values lost in this extract).
AIC is obtained with, e.g., extractAIC(model3.lm).
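The comparison statistics on this slide can be computed as follows; this sketch again assumes births.df comes from the MASS birthwt data, with the model 2 and model 3 formulas taken from the previous slide:

```r
library(MASS)
births.df <- within(birthwt, {
  race <- factor(race); smoke <- factor(smoke)
  ht   <- factor(ht);   ui    <- factor(ui)
})
full.lm   <- lm(bwt ~ age*race*smoke*ui*ht + lwt*race*smoke*ui*ht,
                data = births.df)
model2.lm <- lm(bwt ~ age + ui + race + smoke + ht + lwt + age:ht + age:smoke,
                data = births.df)
model3.lm <- lm(bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke,
                data = births.df)
# extractAIC returns c(equivalent df, AIC); summary()$adj.r.squared gives adj R2
sapply(list(full = full.lm, model2 = model2.lm, model3 = model3.lm),
       function(m) c(params = extractAIC(m)[1],
                     AIC    = extractAIC(m)[2],
                     adjR2  = summary(m)$adj.r.squared))
```

Model 3 has 10 parameters (intercept, ui, two race contrasts, smoke, ht, lwt, ht:lwt and two race:smoke contrasts), matching the coefficient table on slide 22.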
Slide 21: Deleting?
Point 133 seems influential: big covariance ratio, big hat matrix diagonal (HMD).
Refitting without point 133 makes model 3 the best, so we will go with model 3.
We could also use a purely additive model (i.e. parallel planes), but its adjusted R2 and AIC are slightly worse.
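A sketch of the influence diagnostics used here (covariance ratio and hat matrix diagonals), again assuming births.df is the MASS birthwt data; the row number 133 below is a row position used for illustration and may not match the lecture's indexing:

```r
library(MASS)
births.df <- within(birthwt, {
  race <- factor(race); smoke <- factor(smoke)
  ht   <- factor(ht);   ui    <- factor(ui)
})
model3.lm <- lm(bwt ~ ui + race + smoke + ht + lwt + ht:lwt + race:smoke,
                data = births.df)
# covratio() and hatvalues() give the "Cov ratio" and hat-matrix-diagonal measures
covr <- covratio(model3.lm)
hmd  <- hatvalues(model3.lm)
summary(influence.measures(model3.lm))   # prints the cases flagged as influential
# refit without the influential case and compare
model3.drop <- lm(formula(model3.lm), data = births.df[-133, ])
extractAIC(model3.drop)
```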
Slide 22: Summary of model 3
Coefficients (Estimate, Std. Error, t and p-value columns lost in this extract; significance codes retained):

(Intercept)   ***
ui            ***
race2         **
race3         ***
smoke         ***
ht            **
lwt
ht1:lwt       *
race2:smoke
race3:smoke   *
Slide 23: Interpretation (cont.)
Other things being equal:
- uterine irritability is associated with lower birthweight
- smoking is associated with lower birthweight, but differently for different races
- hypertension is associated with lower birthweight
- race is associated with birthweight: black lower than white, "other" lower than white
- higher mother's weight is associated with higher birthweight in the hypertension group
- smoking lowers birthweight more for race 1 (white)
These effects are significant but small compared with the residual variability.
Slide 24: Interpretation of interactions
Smoking effect by race (cell values lost in this extract):

              White   Black   Other
Smoke: No
Smoke: Yes
Slide 25: Diagnostics for model 2
Check for high-influence points, etc.
Point 133 stands out!