University of Warwick, Department of Sociology, 2014/15 SO 201: SSAASS (Surveys and Statistics) (Richard Lampard) Logistic Regression III/ (Hierarchical) Log-Linear Models II (Week 9)
… the story so far… In the session last Wednesday we identified a number of factors that affected the odds of being an owner-occupier (in the Coventry area in the mid-1980s!). Living with a partner, age, highest Goldthorpe class, and household income all improved the fit of the model significantly.
Changes in model fit [Table columns: Change in model chi-sq.; d.f.; p; Deviance (-2LL). Rows, adding in turn: Partner; Age; Class; Income]
The ‘changing’ effect of a partner [Table: Exp(B) (odds ratio) for having a partner, by model: Partner; Age; Age + Class; Age + Class + Income] 6.35
But I don’t remember those odds ratios! When one adds the variables in a series of blocks, the sample used throughout excludes the cases with missing values for class and income. So the ‘starting point’ effect of having a partner is lower (rather than 14.4). But some of the effect of a partner is still ‘explained’ by age and by income (and some was ‘suppressed’ before class was added!).
Is there an interaction between the effects of living with a partner and age? Looking at the hard copy handout, the B for such an interaction is statistically significant (p < 0.001). Note that the three effects, i.e. of age, of living with a partner, and of their interaction, need to be considered together.
What do the Bs for main effects mean when an interaction is included? The B for age (p < 0.001) means that, where someone lives without a partner, there is a significant age effect. But if we change the point of reference (reference category) to living with a partner, the ‘revised’ B for age (p = 0.143) means that for someone in this situation the (log) odds of owner occupation do not increase significantly with rising age…
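To see why the B for age changes with the reference category, it may help to work through the arithmetic. The sketch below uses made-up coefficients (they are NOT the values on the handout, and the names are purely illustrative): with an interaction in the model, the age slope for people living with a partner is the age main effect plus the interaction B.

```python
# Hypothetical coefficients for illustration only (not the handout's values).
B_CONST = -2.0
B_AGE = 0.05             # age slope when partner = 0 (the reference category)
B_PARTNER = 2.5          # partner effect when age = 0
B_AGE_X_PARTNER = -0.04  # interaction: how the age slope differs with a partner


def log_odds(age, partner):
    """Log odds of owner occupation under the hypothetical model."""
    return (B_CONST + B_AGE * age + B_PARTNER * partner
            + B_AGE_X_PARTNER * age * partner)


def age_slope(partner):
    """Change in the log odds per extra year of age, for a given partner status."""
    return log_odds(1, partner) - log_odds(0, partner)


print("Age slope without a partner:", round(age_slope(0), 4))
print("Age slope with a partner:   ", round(age_slope(1), 4))
```

Re-running the model with ‘living with a partner’ as the reference category would simply report the second slope as the main effect of age; the fitted model itself is unchanged.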
Is a linear model OK? Except for the model that includes only living with a partner, the Hosmer and Lemeshow test gives no indication that it is problematic to assume that the explanatory variables have a linear impact on the log odds of owner occupation. [Table: Hosmer and Lemeshow Test. Columns: Chi-square; df; Sig.]
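For those curious about what the test is doing, here is a minimal sketch of the Hosmer and Lemeshow statistic. It assumes equal-sized groups formed by ordering cases on their fitted probabilities; statistical packages may handle ties and group boundaries differently, so results need not match package output exactly. The data are invented for illustration.

```python
def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, df) for a simple Hosmer-Lemeshow test.

    y: observed outcomes coded 0/1; p: fitted probabilities from the model.
    Assumes no group has a mean fitted probability of exactly 0 or 1.
    """
    pairs = sorted(zip(p, y))  # order cases by fitted probability
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        n_g = len(chunk)
        observed = sum(yi for _, yi in chunk)  # owner-occupiers observed
        expected = sum(pi for pi, _ in chunk)  # owner-occupiers predicted
        mean_p = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1 - mean_p))
    return stat, groups - 2  # compared with chi-square on (groups - 2) d.f.


# Invented data: 40 cases in four bands of fitted probability.
p_hat = [0.2] * 10 + [0.4] * 10 + [0.6] * 10 + [0.8] * 10
y_obs = ([1] * 5 + [0] * 5) * 4  # half are owner-occupiers in every band

stat, df = hosmer_lemeshow(y_obs, p_hat, groups=4)
print("Hosmer-Lemeshow chi-square:", round(stat, 2), "on", df, "d.f.")
```

A large statistic (small p-value) signals a gap between observed and predicted numbers, i.e. a poorly fitting model; a non-significant result is what we hope for.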
Looking at the hard copy handout… We can see that quite a lot of the evidence of a class effect (as indicated by the Wald statistic) disappears when income is included … and the effects for the bottom three classes compared to the top class (i.e. the relevant odds ratios/Exp(B)’s) get smaller (closer to 1!) too…
The income effect! While this is clearly significant overall (p < 0.001), the Bs and Exp(B)s are difficult to interpret (and mostly relate to non-significant comparisons). This is because the reference category (i.e. the first category, indicating the lowest income range) is small and has an effect which doesn’t fit in with the broader trend.
Using a different reference category As the hard copy handout shows, if we use the tenth income category as the point of reference (this can be achieved by changing ‘1’ to ‘10’ in the relevant command pasted to a syntax window), then the categories of income lower than this largely have lower (log) odds of owner occupation, and higher incomes largely have higher (log) odds of owner occupation!
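Changing the reference category does not change the fitted model, only the comparisons that are reported. As a rough sketch with invented Bs for five income categories (not the handout's values, and far fewer categories than the real variable), each B relative to a new reference category is just the old B minus the old B of the new reference category:

```python
import math

# Hypothetical Bs relative to category 1 (illustration only).
b_vs_first = {1: 0.0, 2: 1.2, 3: 1.5, 4: 1.9, 5: 2.4}


def rebase(bs, new_ref):
    """Re-express each B relative to a different reference category."""
    return {cat: b - bs[new_ref] for cat, b in bs.items()}


b_vs_third = rebase(b_vs_first, 3)
odds_ratios = {cat: math.exp(b) for cat, b in b_vs_third.items()}

for cat in sorted(b_vs_third):
    print(cat, round(b_vs_third[cat], 2), round(odds_ratios[cat], 2))
```

With the new reference category in the middle of the range, lower categories get negative Bs (odds ratios below 1) and higher categories positive Bs (odds ratios above 1). The standard errors, and hence the significance of each comparison, do depend on the reference category and have to come from re-running the model.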
Collapsing the income variable But a lot of the income comparisons are still not statistically significant, in part because some of the categories are quite small… Using fewer categories would lose information, but would make presenting the model easier… What happens if we collapse income into four broad ranges? (Up to 50, 50-99, 100-159, and 160+ pounds per week)
Not much… The Wald statistic only drops by 10.0 (i.e. – 34.4) for a reduction of 14 degrees of freedom. And, more importantly, the gain in model fit by adding (i.e. including) the ‘whole’ income variable is only 13.6 for 14 degrees of freedom (p = 0.477).
Similarly… If class is collapsed into I-IV, V-VI and VII, the improvement in fit through adding the ‘whole’ class variable is only 0.3 for 4 d.f. (p = 0.992). If age is collapsed into ranges of 20-29, 30-39, and 50-60, the improvement in fit through adding the ‘whole’ age variable is only 1.3 for 1 d.f. (p = 0.263).
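The p-values quoted for these improvements in fit come from comparing the change in deviance with a chi-square distribution on the corresponding degrees of freedom. A minimal sketch of that calculation, using only the Python standard library and the standard closed-form expressions for integer degrees of freedom, is given below. Because the chi-square changes above are quoted to one decimal place, the recomputed p-values can differ slightly from the quoted ones.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer d.f."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even d.f.: exp(-x/2) * sum of (x/2)^k / k! for k = 0 .. df/2 - 1
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd d.f.: erfc term plus a finite sum over odd factorial-like products
    total = 0.0
    term = math.sqrt(2.0 * x / math.pi)
    for k in range(1, (df - 1) // 2 + 1):
        if k > 1:
            term *= x / (2 * k - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


# Improvements in fit quoted above: (change in chi-square, d.f.)
for label, change, df in [("income", 13.6, 14), ("class", 0.3, 4), ("age", 1.3, 1)]:
    print(label, "p is roughly", round(chi2_sf(change, df), 3))
```

In each case the p-value is well above 0.05, so the detailed version of the variable does not fit significantly better than the collapsed version.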
What about fitting a log-linear model to the five variables? Now that we have five categorical variables, we could use a log-linear model to establish which are related to which, and to what degree of complexity… As the hard copy handout shows, this indicates the preferred, ‘best’ model is: [AC] [PC] [TC] [CI] [TPI] [PAI] [TA] A= Age; C= Class; P=Partner; I=Income; T=Tenure
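Because the model is hierarchical, each bracketed term implies all the lower-order terms contained within it (e.g. [TPI] implies TP, TI, PI, T, P and I). The sketch below, with illustrative function and variable names, expands the bracket notation above into the full list of terms, which makes it easier to check which effects involving tenure are in the model.

```python
from itertools import combinations


def implied_terms(generators):
    """Expand hierarchical model generators into every term they imply."""
    terms = set()
    for gen in generators:
        for size in range(1, len(gen) + 1):
            for combo in combinations(gen, size):
                terms.add(frozenset(combo))  # order of letters is irrelevant
    return terms


# A = Age; C = Class; P = Partner; I = Income; T = Tenure
best_model = ["AC", "PC", "TC", "CI", "TPI", "PAI", "TA"]
terms = implied_terms(best_model)

# Terms involving tenure and at least one other variable
involving_tenure = sorted(
    "".join(sorted(t)) for t in terms if "T" in t and len(t) > 1
)
print(involving_tenure)
print("Tenure x Partner x Income in model:", frozenset("TPI") in terms)
print("Tenure x Partner x Age in model:   ", frozenset("TPA") in terms)
```

A three-way term involving tenure, such as [TPI], corresponds to an interaction in the logistic regression for tenure: the effect of one explanatory variable on tenure varies according to the other.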
Uh-oh! This suggests that there may be more of a case for an interaction between the effects of living with a partner and income than between the effects of living with a partner and age… Adding this second interaction indeed improves the logistic regression model fit significantly by 23.0 for 3 d.f. (p < 0.001)
Nevertheless… While the hard copy handout suggests that including the second interaction renders the other statistically non-significant, it still looks substantively plausible (and detail has been lost from the age variable…) The interaction between the income effect and the living with a partner effect is most easily interpreted with reference to the relevant three-way cross-tabulation (see the hard copy handout).