© Department of Statistics 2012 STATS 330 Lecture 30: Slide 1 Stats 330: Lecture 30.

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 2 Plan of the day In today’s lecture we continue our discussion of contingency tables. Topics –Simpson’s Paradox –Collapsing tables –Adequacy of the chi-square approximation See also Tutorial 10 (can download)

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 3 Example: Florida murders If we “collapse” the table over victim, we get the 2x 2 table of defendant by death penalty: Odds of being black for “no DP group” are 149/141 Odds of being black for “DP group” is 17/19 OR = (149/141)/(17/19) = (149*19)/(141*17) = 1.181 The odds of being black are 18% higher for the no DP group than for the DP group i.e. blacks favoured Death = N Death = Y Defendant = black 14917 Defendant = white 14119

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 4 Example: Florida murders The tables conditional on victim are Death = N Death = Y Defendant = black 976 Defendant = white 90 Death = N Death = Y Defendant = black 5211 Defendant = white 13219 Victim=black, OR =0.79 (add 0.5 to each cell) Victim=white, OR =0.67 Now odds of being black are greater in the DP group!

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 5 Association Reversal In our regression lectures, we saw that regression coefficients could change sign when new variables were added to the model. i.e. the coefficient of x1 in the model y ~ x1 can have the opposite sign to the coefficient of x1 in the model y ~ x1 + x2 –See e.g. Lecture 5, Slides 12 and 13.

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 6 The reason The coefficient of x1 in the model y~x1 is the slope of the fitted line in the scatter plot, summarising the marginal relationship between y and x1 The coefficient of x1 in the model y~x1+x2 is the slope of the line in the coplot, which summarises the relationship between y and x1, conditional on x2

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 7 Marginal and conditional association In the first case, we are looking at association between y and x1 In the second case, we are looking at the association between y and x1, with x2 held fixed (conditional on x2) We can do the same sort of thing in contingency tables

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 8 Association in contingency tables In 2 x 2 tables, we measure association using the odds ratio. If we have 3 factors A, B and C, each at 2 levels, we can look at the marginal odds ratio of A and B, using the AB table, ignoring C. Or, the 2 conditional odds ratios, one the AB table corresponding to C=1, and the other corresponding to C=2.

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 9 When will these be the same? In ordinary regression, the coefficient of x1 in the model y~x1 will be the same as the coefficient of x1 in the model y~x1+x2 if and only if x1 and x2 are uncorrelated. In contingency tables, the marginal and conditional population OR’s for the AB tables will be the same if –A and C are independent, given B, or, –B and C are independent, given A. Thus, we can collapse the table over C if –The ABC interaction is zero, and either the AC interaction is zero, or the BC interaction is zero. If this is not the case, association reversal ( Simpson’s paradox) may occur. The more associated A and/or B is with C, the more likely it is to happen.

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 10 Florida data: Very strong association between DP and race of victim: if victim is black, small chance of DP Very strong association between race of victim and race of defendant Since blacks murder blacks, (and whites murder whites), not so many DPs for black defendants overall

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 11 When can we collapse? Thus, we can collapse the table over C if –The ABC interaction is zero, and Either the AC interaction is zero, or The BC interaction is zero. If these conditions are not met, then the marginal and conditional tables may have different (even reversed) degrees of association

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 12 Florida data Recall that there are significant interactions between Race of victim and race of defendant Race of victim and death penalty Thus, can’t collapse over race of victim Call: glm(formula = counts ~ defendant * dp * victim, family = poisson, data = murder.df) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 5.27300 0.07161 73.633 < 2e-16 *** defendantw -2.32856 0.24033 -9.689 < 2e-16 *** dpy -2.70805 0.28645 -9.454 < 2e-16 *** victimw -0.61904 0.12105 -5.114 3.15e-07 *** defendantw:dpy -0.23639 1.06521 -0.222 0.82438 defendantw:victimw 3.25433 0.26657 12.208 < 2e-16 *** dpy:victimw 1.18958 0.36750 3.237 0.00121 ** defendantw:dpy:victimw -0.16131 1.10322 -0.146 0.88375

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 14 Example: Berkeley admissions Admission data from UC Berkeley grad school: OR’s are (Odds admission male/odds admission female) Admitted YesNo Gender Depart- mentMaleFemaleMaleFemale A5128931319 B353172078 C120202205391 D138131279244 E5394138299 F2224351317 A0.35 B0.80 C1.13 D0.92 E1.22 F0.83

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 15 Collapse over departments Admitted GenderYesNo Male11981493 Female5571278 OR = 1.84 (odds admission male/ odds admission female) Seems to be strong evidence of bias against women Reason: females apply for programs that have low admission rates, males for programs that have high admission rates, so strong dependence between dept and the other 2 factors

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 16 Anova Table Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 23 2650.10 Dept 5 159.52 18 2490.57 1.251e-32 Gender 1 162.87 17 2327.70 2.665e-37 Admit 1 230.03 16 2097.67 5.879e-52 Dept:Gender 5 1220.61 11 877.06 1.006e-261 Dept:Admit 5 855.32 6 21.74 1.242e-182 Gender:Admit 1 1.53 5 20.20 0.22 Dept:Gender:Admit 5 20.20 0 7.239e-14 1.144e-03

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 17 Thus 3 factor interaction is significant, so –Dept is not conditionally independent of admission, given gender –Dept is not conditionally independent of gender, given admission Collapse over Dept at your peril!

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 18 Moral: Collapsing tables Dangerous to collapse tables when factor being collapsed over (eg victim’s race, Berkeley department) is strongly associated with the other factors Marginal (collapsed) tables may give a very misleading picture, need to look at conditional tables

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 19 How good is chi-square? The bigger the cell counts, the better the chisquare approximation to the deviance The smaller the number of cells, the better the approximation But approximation is often OK even if some cells counts are small: our next example illustrates this.

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 20 The Dayton Study The data in this case study were gathered by researchers investigating teenage substance abuse in a community near Dayton, Ohio in 1992. The researchers asked 2276 high school students if they had ever used alcohol, cigarettes or marijuana.

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 22 Making the data frame: > counts = c(911,3,44,2,538,43,456,279) > ACM.df = data.frame(counts, expand.grid(A=c("Yes","No"), C=c("Yes","No"), M=c("Yes","No"))) > ACM.df counts A C M 1 911 Yes Yes Yes 2 3 No Yes Yes 3 44 Yes No Yes 4 2 No No Yes 5 538 Yes Yes No 6 43 No Yes No 7 456 Yes No No 8 279 No No No

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 23 Fitting models ham.glm = glm(counts~A*C*M, family=poisson, data=ACM.df) > anova(ACM.glm, test="Chisq") Analysis of Deviance Table Df Deviance Resid. Df Resid. Dev P(>|Chi|) NULL 7 2851.46 A 1 1281.71 6 1569.75 1.064e-280 C 1 227.81 5 1341.93 1.787e-51 M 1 55.91 4 1286.02 7.575e-14 A:C 1 442.19 3 843.83 3.607e-98 A:M 1 346.46 2 497.37 2.504e-77 C:M 1 497.00 1 0.37 4.283e-110 A:C:M 1 0.37 0 2.509e-14 0.54

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 24 Try the indicated homogeneous association model > ham.glm = glm(counts~A*C+A*M+C*M, family=poisson, data=ACM.df) > summary(ham.glm) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 6.81387 0.03313 205.699 < 2e-16 *** ANo -5.52827 0.45221 -12.225 < 2e-16 *** CNo -3.01575 0.15162 -19.891 < 2e-16 *** MNo -0.52486 0.05428 -9.669 < 2e-16 *** ANo:CNo 2.05453 0.17406 11.803 < 2e-16 *** ANo:MNo 2.98601 0.46468 6.426 1.31e-10 *** CNo:MNo 2.84789 0.16384 17.382 < 2e-16 *** Null deviance: 2851.46098 on 7 degrees of freedom Residual deviance: 0.37399 on 1 degrees of freedom

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 25 Summary The model counts~ A*C + C*M + A*M seems to fit well This is the homogeneous association model, can write counts~ (C+A+M)^2 Or counts~ A*C*M – A:C:M with all possible 2-way interactions but no 3-way interaction

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 26 Is Chi-square adequate? Consider the 3-dimensional table: it had two small counts We fitted a model to these data having all 2 factor interactions but no 3-factor interactions. We can estimate the Poisson means using predict > means = predict(ham.glm, type="response") > means 1 2 3 4 5 6 7 8 910.383170 3.616830 44.616830 1.383170 538.616830 42.383170 455.383170 279.616830

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 27 Is Chisquare adequate (2) Assuming these estimated means are the true ones, we can simulate a table by generating 8 new table entries. Thus, for example, if cell 1 has assumed mean 910.383, we generate an entry for cell 1 by selecting a value at random from a Poisson distribution with mean 910.38 > rpois(1, 910.38) [1] 969 Repeat for every cell, get a simulated table Calculate the deviance of the model for the simulated table Repeat for 10000 tables, record deviance each time.

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 29 Code means = predict(ham.glm, type="response") Nsim=10000 # generate 10000 tables data<-matrix(rpois(Nsim*8, means), 8, Nsim) result<-numeric(Nsim) for(i in 1:Nsim)result[i]<-deviance(glm(data[,i]~ A*C + C*M + A*M, family=poisson, data=ACM.df)) # draw a picture of the results hist(result, nclass=30, freq=F, xlab="deviance",main="Histogam of 10000 simulated deviances") # add chisquare 1 density xx<-seq(0,14,length=100) lines(xx,dchisq(xx,1), lwd=2, col="red")

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 1 Stats 330: Lecture 30.

Similar presentations

Presentation on theme: "© Department of Statistics 2012 STATS 330 Lecture 30: Slide 1 Stats 330: Lecture 30."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

© Department of Statistics 2012 STATS 330 Lecture 30: Slide 1 Stats 330: Lecture 30.

Similar presentations

Presentation on theme: "© Department of Statistics 2012 STATS 330 Lecture 30: Slide 1 Stats 330: Lecture 30."— Presentation transcript:

Similar presentations

About project

Feedback