F73DB3 CATEGORICAL DATA ANALYSIS Workbook
Contents page: Preface; Aims; Summary; Content/structure/syllabus plus other information; Background – computing (R)
Examples: Single classifications (1–13); Two-way classifications (14–27); Three-way classifications (28–32)
Example 1 Eye colours – observed frequency for each of four colours A, B, C, D.
Example 2 Prussian cavalry deaths (a) Numbers killed in each unit in each year – frequency table: number killed (0, 1, …, 5) v frequency observed, with total.
Example 2 Prussian cavalry deaths (b) Numbers killed in each unit in each year – raw data …
Example 2 Prussian cavalry deaths (c) Total numbers killed each year, 1875–1894.
Example 4 Political views – categories running from (very L) through (centre) to (very R), plus Don't Know, with total.
Example 7 Vehicle repair visits – frequency observed for each number of visits (0, 1, …, 6), with total.
Example 15 Patients in clinical trial
                 Drug  Placebo  Total
Side-effects       15        4     19
No side-effects    35       46     81
Total              50       50    100
§1 INTRODUCTION
Data are counts/frequencies (not measurements)
Categories (explanatory variable); distribution in the cells (response)
Frequency distribution
Single classifications; two-way classifications
Illustration 1.1
                     B: Cause of death
A: Smoking status    Cancer   Other
Smoker                   30      20
Not smoker               15      35
Data may arise as: Bernoulli/binomial data (2 outcomes); multinomial data (more than 2 outcomes); Poisson data [+ negative binomial data – the version with range x = 0, 1, 2, …]
§2 POISSON PROCESS AND ASSOCIATED DISTRIBUTIONS
2.1 Bernoulli trials and related distributions
Number of successes – binomial distribution
[Time before kth success – negative binomial distribution; time to first success – geometric distribution]
Conditional distribution of success times
2.2 Poisson process and related distributions (events marked along a time axis)
Poisson process with rate λ: the number of events in a time interval of length t, N_t, has a Poisson distribution with mean λt.
Poisson process with rate λ: the inter-event time, T, has an exponential distribution with parameter λ (mean 1/λ).
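These two facts fit together, and a quick simulation makes the link concrete. A minimal sketch (all names and values here are illustrative): build arrival times from i.i.d. exponential gaps and check that the count in (0, t] behaves like Pn(λt).
set.seed(1)
lambda = 2; t = 5
counts = replicate(10000, {
  arrivals = cumsum(rexp(100, rate = lambda))  # 100 gaps is ample to cover (0, 5]
  sum(arrivals <= t)                           # N_t: events in (0, t]
})
mean(counts); var(counts)                      # both should be close to lambda*t = 10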
Conditional distribution of number of events: given n events in time (0, t), how many in time (0, s) (s < t)?
Answer: N_s | N_t = n ~ B(n, s/t)
Splitting into subprocesses (events on a time axis allocated to subprocesses)
Realisation of a Poisson process (plot: cumulative number of events against time)
X ~ Pn(λ), Y ~ Pn(μ), X, Y independent: then we know X + Y ~ Pn(λ + μ).
Given X + Y = n, what is the distribution of X?
Answer: X | X + Y = n ~ B(n, p) where p = λ/(λ + μ)
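This conditional result is easy to check empirically. A sketch (rates 3 and 7 are illustrative): simulate many (X, Y) pairs, keep those with X + Y = n, and compare the conditional sample with the B(n, λ/(λ + μ)) pmf.
set.seed(2)
x = rpois(100000, 3); y = rpois(100000, 7)
n = 10
cond.x = x[x + y == n]        # X restricted to outcomes with X + Y = n
mean(cond.x)                  # should be near n * 3/(3 + 7) = 3
dbinom(0:n, n, 3/10)          # theoretical conditional pmf for comparison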
2.3 Inference for the Poisson distribution
N_i, i = 1, 2, …, r, i.i.d. Pn(λ); N = ΣN_i
CI for λ: the MLE is λ̂ = N/r (the sample mean), and an approximate 95% CI is λ̂ ± 1.96 √(λ̂/r).
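A sketch of this in R (the counts are illustrative, not workbook data): the normal-approximation interval above, alongside the exact interval that poisson.test returns.
counts = c(3, 0, 2, 1, 4, 2, 1, 3)           # illustrative data
r = length(counts); N = sum(counts)
lambda.hat = N / r
lambda.hat + c(-1, 1) * 1.96 * sqrt(lambda.hat / r)   # approximate 95% CI
poisson.test(N, T = r)$conf.int               # exact interval, for comparison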
2.4 Dispersion and LR tests for Poisson data
Homogeneity hypothesis H0: the N_i's are i.i.d. Pn(λ) (for some unknown λ)
Dispersion statistic: D = Σ (N_i − M)²/M, where M is the sample mean (approximately χ² on r − 1 df under H0)
Likelihood ratio statistic: Y² = 2 Σ N_i log(N_i/M) – form for calculation – see p18 ◄◄
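Both statistics are a few lines in R. A sketch with illustrative data (any cell with N_i = 0 contributes 0 to Y², handled below by subsetting):
n = c(3, 0, 2, 1, 4, 2, 1, 3)                 # illustrative counts
M = mean(n)
D = sum((n - M)^2) / M                        # dispersion statistic
Y2 = 2 * sum(n[n > 0] * log(n[n > 0] / M))    # LR statistic (0*log 0 taken as 0)
pchisq(c(D, Y2), df = length(n) - 1, lower.tail = FALSE)   # approximate P-values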
§3 SINGLE CLASSIFICATIONS
Binary classifications: (a) N_1, N_2 independent Poisson, with N_i ~ Pn(λ_i), or (b) fixed sample size, N_1 + N_2 = n, with N_1 ~ B(n, p_1) where p_1 = λ_1/(λ_1 + λ_2)
Qualitative categories: (a) N_1, N_2, …, N_r independent Poisson, with N_i ~ Pn(λ_i), or (b) fixed sample size n, with joint multinomial distribution Mn(n; p)
Testing goodness of fit: H0: p_i = π_i, i = 1, 2, …, r
Test statistic: X² = Σ (n_i − nπ_i)²/(nπ_i) – this is the (Pearson) chi-square statistic
The statistic often appears as X² = Σ (O − E)²/E, with O the observed and E the expected frequencies.
An alternative statistic is the LR statistic Y² = 2 Σ O log(O/E).
Sparse data/small expected frequencies: ensure m_i ≥ 1 for all cells, and m_i ≥ 5 for at least about 80% of the cells; if not, combine adjacent cells sensibly.
Goodness-of-fit tests for frequency distributions – a very well-known application of the statistic (see Illustration 3.4, pp 22–23).
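A sketch of such a test in R (the observed counts and null probabilities here are illustrative, not from the workbook examples): chisq.test gives X² directly, and Y² is computed from the same observed and expected frequencies.
obs = c(30, 40, 20, 10)
p0  = c(0.25, 0.45, 0.20, 0.10)               # hypothesised cell probabilities
chisq.test(obs, p = p0)                       # Pearson X^2
E = sum(obs) * p0
Y2 = 2 * sum(obs * log(obs / E))              # LR statistic Y^2
pchisq(Y2, df = length(obs) - 1, lower.tail = FALSE)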
Residuals (standardised): r_i = (n_i − m_i)/√(m_i(1 − m_i/n))
Simpler version: r_i = (n_i − m_i)/√m_i
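Continuing the sketch above (obs and p0 as before), both versions can be read straight off a chisq.test fit:
gof = chisq.test(obs, p = p0)
gof$residuals                                 # (O - E)/sqrt(E): the simpler version
gof$stdres                                    # standardised residuals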
MAJOR ILLUSTRATION 1 Publish and be modelled – data: number of authors classified by number of papers per author; a model is fitted to this frequency distribution.
MAJOR ILLUSTRATION 2 Birds in hedges – hedge type i = A, B, C, D, E, F, G; hedge length (m) l_i; number of pairs n_i. Model: N_i ~ Pn(θ_i l_i)
§4 TWO-WAY CLASSIFICATIONS
Example 14 Numbers of mice bearing tumours in treated and control groups
             Treated  Control  Total
Tumours            4        5      9
No tumours         …        …      …
Total              …        …      …
Example 15 Patients in clinical trial
                 Drug  Placebo  Total
Side-effects       15        4     19
No side-effects    35       46     81
Total              50       50    100
Patients in clinical trial – take 2
                 Drug  Placebo  Total
Side-effects       15       15     30
No side-effects    35       35     70
Total              50       50    100
4.1 Factors and responses
F × R tables; R × F, R × R (F × F?)
Qualitative, ordered, quantitative
Analysis the same – interpretation may be different
A two-way table is often called a "contingency table" (especially in the R × R case).
Notation (2 × 2 case, easily extended):
             Exposed  Not exposed  Total
Disease          n11          n12    n1●
No disease       n21          n22    n2●
Total            n●1          n●2    n●● = n
Three possibilities: one overall sample, each subject classified according to 2 attributes – this is R × R; retrospective study; prospective study (use of treated and control groups; drug and placebo, etc.)
4.2 Distribution theory and tests for r × s tables
(a) R × R case: (a1) N_ij ~ Pn(μ_ij), independent; or, with fixed table total, (a2) condition on n = Σ n_ij: N | n ~ Mn(n; p), where N = {N_ij}, p = {p_ij}.
(b) F × R case: condition on the observed marginal totals n_●j = Σ_i n_ij for the s categories of F (i.e. condition on n and the n_●j) ⇒ s independent multinomials.
Usual hypotheses:
(a1) N_ij ~ Pn(μ_ij), independent. H0: variables/responses are independent: μ_ij = μ_i● μ_●j / μ●●.
(a2) Multinomial data (table total fixed). H0: variables/responses are independent: P(row i and column j) = P(row i) P(column j).
(b) Condition on n and the n_●j (fixed column totals): N_ij ~ Bi(n_●j, p_ij), j = 1, 2, …, s; independent. H0: response is homogeneous (p_ij = p_i for all j), i.e. the response has the same distribution for all levels of the factor.
Tests of H0 – the χ² (Pearson) statistic: X² = Σ (n_ij − m_ij)²/m_ij, where m_ij = n_i● n_●j / n as before.
OR: test based on the LR statistic Y². Illustration: tonsils data – see p27.
In R – Pearson/X²: read the data in using "matrix", then use "chisq.test". LR Y²: calculate it directly (or get it from the results of fitting a "log-linear model" – see later).
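A sketch of both routes on the tonsils data (entered as in Example 16 later in the workbook): Y² is computed directly from the observed and expected tables that chisq.test stores.
tonsils = matrix(c(19, 497, 29, 560, 24, 269), nrow = 2)
out = chisq.test(tonsils)
out$statistic                                 # Pearson X^2
Y2 = 2 * sum(out$observed * log(out$observed / out$expected))
pchisq(Y2, df = out$parameter, lower.tail = FALSE)   # P-value for Y^2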
4.3 The 2 × 2 table
Statistical tests: (a) using Pearson's χ²
                 Drug  Placebo  Total
Side-effects       15        4     19
No side-effects    35       46     81
Total              50       50    100
X² = Σ (n_ij − m_ij)²/m_ij, where m_ij = n_i● n_●j / n (approximately χ² on 1 df).
Yates (continuity) correction: subtract 0.5 from |O – E| before squaring it.
Performing the test in R:
n.pat = matrix(c(15, 35, 4, 46), 2, 2)
chisq.test(n.pat)
(b) Using the deviance/LR statistic Y²; (c) comparing binomial probabilities; (d) Fisher's exact test.
Fisher's exact test: with all margins fixed, let N be the number of placebo patients with side-effects (observed value 4):
                 Drug  Placebo  Total
Side-effects       15   N (=4)     19
No side-effects    35       46     81
Total              50       50    100
Under a random allocation, one-sided P-value = P(N ≤ 4), computed from the hypergeometric distribution given the fixed margins.
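In R the exact test is built in. A sketch on the patients table (with the column layout used above, the one-sided alternative "greater" on the drug cell corresponds to P(N ≤ 4) for the placebo cell):
n.pat = matrix(c(15, 35, 4, 46), 2, 2)
fisher.test(n.pat)                            # two-sided exact test
fisher.test(n.pat, alternative = "greater")   # one-sided: matches P(N <= 4)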
4.4 Log odds, combining and collapsing tables, interactions
In the 2 × 2 table, the H0: independence condition is equivalent to μ11 μ22 = μ12 μ21.
Let λ = log(μ11 μ22 / μ12 μ21). Then we have H0: λ = 0. λ is the "log odds ratio".
The "λ = 0" hypothesis is often called the "no association" hypothesis.
The odds ratio is μ11 μ22 / (μ12 μ21). The sample equivalent is n11 n22 / (n12 n21).
The odds ratio (or log odds ratio) provides a measure of association for the factors in the table: no association ⇔ odds ratio = 1 ⇔ log odds ratio = 0.
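A sketch computing the sample (log) odds ratio for the patients table, with the usual large-sample standard error √(1/n11 + 1/n12 + 1/n21 + 1/n22) (a standard result, not stated on the slide):
n11 = 15; n12 = 4; n21 = 35; n22 = 46
or.hat = (n11 * n22) / (n12 * n21)            # sample odds ratio
lam.hat = log(or.hat)                         # sample log odds ratio
se = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
lam.hat + c(-1, 1) * 1.96 * se                # approximate 95% CI for lambda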
Don't combine heterogeneous tables!
Interaction: an interaction exists between two factors when the effect of one factor is different at different levels of the other factor.
§5 INTRODUCTION TO GENERALISED LINEAR MODELS (GLMs)
Normal linear model: Y|x ~ Normal, with E[Y|x] = α + βx, or E[Y|x] = β0 + β1 x1 + β2 x2 + … + βr xr = β′x, i.e. E[Y|x] = μ(x) = β′x.
We are explaining μ(x) using a linear predictor (a linear function of the explanatory data).
Generalised linear model: now we set g(μ(x)) = β′x for some function g. We explain g(μ(x)) using a linear function of the explanatory data, where g is called the link function.
e.g. modelling a Poisson mean we use a log link: g(λ) = log λ. We use a linear predictor to explain log λ rather than λ itself: the model is Y|x ~ Pn with mean λ_x, with log λ_x = α + βx or log λ_x = β′x. This is a log-linear model.
An example is a trend model, in which we use log λ_i = α + βi. Another example is a cyclic model, in which we use log λ_i = β0 + β1 cos θ_i + β2 sin θ_i.
§6 MODELS FOR SINGLE CLASSIFICATIONS
6.1 Single classifications – trend models
Data: numbers in r categories. Model: N_i, i = 1, 2, …, r, independent Pn(λ_i)
Basic case: H0: the λ_i are equal v H1: the λ_i follow a trend. Let X_j be the category of observation j; under H0, P(X_j = i) = 1/r. The test is based on this conditional distribution – see Illustration 6.1.
A more general model: N_i independent Pn(λ_i) with log λ_i = α + βi – a log-linear model.
It is a linear regression model for log λ_i and a non-linear regression model for λ_i. It is a generalised linear model. Here the link between the parameter we are estimating and the linear predictor is the log function – it is a "log link".
Fitting in R – Example 13: stressful events data
> n = c(15, 11, …, 1, 4)                  # response vector
> r = length(n)
> i = 1:r                                 # explanatory vector
> stress = glm(n ~ i, family = poisson)   # model
> summary(stress)
Call:
glm(formula = n ~ i, family = poisson)    # the model being fitted
Deviance Residuals:
   Min     1Q  Median     3Q    Max       # summary information on the residuals
Coefficients:                             # information on the fitted parameters
            Estimate Std. Error z value Pr(>|z|)
(Intercept)        …          …       …  < 2e-16 ***
i                  …          …       …    …e-07 ***
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: … on 17 degrees of freedom
Residual deviance: … on 16 degrees of freedom     # deviances (Y² statistics)
AIC: …
Number of Fisher Scoring iterations: 4
Fitted mean is λ̂_i = exp(α̂ + β̂i); e.g. for date 6, i = 6 and the fitted mean is exp(α̂ + 6β̂) = 9.980.
Fitted model (plot of observed counts with the fitted trend curve)
Test of H0: no trend. The null fit – all fitted values equal (to the observed mean): Y² = … (~ χ² on 17 df). The trend model – fitted values exp(α̂ + β̂i): Y² = … (~ χ² on 16 df). Crude 95% CI for the slope is β̂ ± 2(0.0168), i.e. β̂ ± 0.034.
The lower the value of the residual deviance, the better in general is the fit of the model.
Basic residuals: n_i − λ̂_i, usually standardised as (n_i − λ̂_i)/√λ̂_i.
6.2 Taking into account a deterministic denominator – using an "offset" for the "exposure"
Model: N_x ~ Pn(λ_x) where E[N_x] = λ_x = E_x bθ^x, so that log λ_x = log E_x + c + dx (with c = log b, d = log θ). See the Gompertz model example (p 40, data in Example 26).
We include a term "offset(log(E))" in the formula for the linear predictor; in R:
model = glm(n.deaths ~ age + offset(log(exposure)), family = poisson)
The fitted value is the estimate of the expected response per unit of exposure (i.e. per unit of the offset E).
§7 LOGISTIC REGRESSION – for modelling proportions
We have a binary response for each item and a quantitative explanatory variable. For example: dependence of the proportion of insects killed in a chamber on the concentration of a chemical present – we want to predict the proportion killed from the concentration.
Further examples: dependence of the proportion of women who smoke on age; of metal bars on test which fail on the pressure applied; of policies which give rise to claims on the sum insured.
Model: # successes at value x_i of the explanatory variable: N_i ~ bi(n_i, π_i)
We use a glm – we do not predict π_i directly; we predict a function of π_i called the logit of π_i. The logit function is given by logit(π) = log(π/(1 − π)). It is the "log odds".
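A quick sketch of the logit and its inverse in R (qlogis and plogis are the built-in versions):
logit = function(p) log(p / (1 - p))
logit(0.75)          # log odds: log(3) = 1.0986...
qlogis(0.75)         # the same, via the built-in
plogis(1.0986)       # inverse logit recovers approximately 0.75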
See Illustration 7.1, p 43: plots of proportion v dose and logit(proportion) v dose.
This leads to the "logistic regression" model: N_i ~ bi(n_i, π_i) with logit(π_i) = a + b x_i [c.f. the log-linear model N_i ~ Poisson(λ_i) with log λ_i = a + b x_i].
We are using a logit link: we use a linear predictor to explain logit(π) rather than π itself.
The method based on the use of this model is called logistic regression.
Data:
explanatory      # successes   group   observed
variable value                  size   proportion
x_1              n_11          n_1     n_11/n_1
x_2              n_21          n_2     n_21/n_2
…
x_s              n_s1          n_s     n_s1/n_s
In R we declare the proportion of successes as the response and include the group sizes as a set of weights:
drug.mod1 = glm(propdead ~ dose, weights = groupsize, family = binomial)
The explanatory vector is dose; note the family declaration.
The RHS of the model can be extended if required to include additional explanatory variables and factors, e.g.
mod3 = glm(mat3 ~ age + socialclass + gender, family = binomial)
drug.mod1 – see output p44. Coefficients very highly significant (***). Null deviance 298 on 9 df; residual deviance 17.2 on 8 df. But … the residual v fitted plot and the fitted v observed proportions plot suggest the fit can be improved.
Model with a quadratic term (dose^2) – see the sketch below.
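A sketch of the quadratic extension (the model name drug.mod2 is illustrative; drug.mod1 is as fitted above):
drug.mod2 = glm(propdead ~ dose + I(dose^2),
                weights = groupsize, family = binomial)
anova(drug.mod1, drug.mod2, test = "Chisq")   # does the dose^2 term improve the fit?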
§8 MODELS FOR TWO-WAY AND THREE-WAY CLASSIFICATIONS
8.1 Log-linear models for two-way classifications
N_ij ~ Pn(μ_ij), i = 1, 2, …, r; j = 1, 2, …, s
H0: variables are independent: μ_ij = μ_i● μ_●j / μ●●
Taking logs: log μ_ij = log μ_i● + log μ_●j − log μ●●, i.e. an overall effect plus a row effect plus a column effect.
We "explain" log μ_ij in terms of additive effects: log μ_ij = μ + α_i + β_j. The fitted values are the expected frequencies, and the fitting process gives us the value of Y² = −2 log λ.
Fitting a log-linear model: N_ij ~ Pn(μ_ij), independent, with log μ_ij = μ + α_i + β_j. Declare the response vector (the cell frequencies) and the row/column codes as factors, then use
> name = glm(…)
Tonsils data (Example 16):
n.tonsils = c(19, 497, 29, 560, 24, 269)
rc = factor(c(1, 2, 1, 2, 1, 2))
cc = factor(c(1, 1, 2, 2, 3, 3))
tonsils.mod1 = glm(n.tonsils ~ rc + cc, family = poisson)
Call:
glm(formula = n.tonsils ~ rc + cc, family = poisson)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)        …          …       …  < 2e-16 ***
rc2                …          …       …  < 2e-16 ***
cc2                …          …       …        … *
cc3                …          …       …    …e-14 ***
---
Null deviance: … on 5 degrees of freedom
Residual deviance: … on 2 degrees of freedom      # Y² = −2 log λ
The fit of the "independent attributes" model is not good.
Patients data (Example 15):
> n.patients = c(15, 4, 35, 46)
> rc = factor(c(1, 1, 2, 2))
> cc = factor(c(1, 2, 1, 2))
> pat.mod1 = glm(n.patients ~ rc + cc, family = poisson)
Call:
glm(formula = n.patients ~ rc + cc, family = poisson)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.251e+00          …       …  < 2e-16 ***
rc2                 …          …       …    …e-08 ***
cc2                 …          …       …        …
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: … on 3 degrees of freedom
Residual deviance: … on 1 degree of freedom
AIC: …
Fitted coefficients:
> coef(pat.mod1)
(Intercept) ≈ 2.251, rc2 ≈ 1.450, cc2 ≈ 0 (of order e-10)
Fitted values:
> fitted(pat.mod1)
9.5  9.5  40.5  40.5
Estimates: the linear predictor for cells 1,1 and 1,2 is μ̂, giving exp(2.251) = 9.5; for cells 2,1 and 2,2 it is μ̂ + α̂2 = 2.251 + 1.450 = 3.701, giving exp(3.701) = 40.5.
Residual deviance: … on 1 degree of freedom – this is Y² for testing the model, i.e. for testing H0: response is homogeneous / the column distributions are the same / no association between response and treatment group. The lower the value of the residual deviance, the better in general is the fit of the model. Here the fit of the additive model is very poor (we have of course already concluded that there is an association – P-value about 1%).
8.2 Two-way classifications – taking into account a deterministic denominator
See the grouse data (Illustration 8.3, p50; data in Example 25).
Model: N_ij ~ Pn(λ_ij) where E[N_ij] = λ_ij = E_ij exp(μ + α_i + β_j), so log E[N_ij/E_ij] = μ + α_i + β_j, i.e. log λ_ij = log E_ij + μ + α_i + β_j.
We include a term "offset(log(E))" in the formula for the linear predictor. The fitted value is the estimate of the expected response per unit of exposure (i.e. per unit of the offset E). A sketch follows below.
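A sketch with illustrative names (n the counts, rf and cf the row and column factors, E the exposures, model name grouse.mod; none of these are fixed by the workbook):
grouse.mod = glm(n ~ rf + cf + offset(log(E)), family = poisson)
fitted(grouse.mod) / E        # estimated expected count per unit of exposure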
8.3 Log-linear models for three-way classifications
Each subject is classified according to 3 factors/variables with r, s, t levels respectively.
N_ijk ~ Pn(μ_ijk) with log μ_ijk = μ + α_i + β_j + γ_k + (αβ)_ij + (αγ)_ik + (βγ)_jk + (αβγ)_ijk – the saturated model, with rst parameters.
Recall "interaction": the model with two factors and an interaction (no longer additive) is log μ_ij = μ + α_i + β_j + (αβ)_ij.
Hierarchic log-linear models – a range of possible models/dependencies. Interpretation!
From 1. Complete independence:
model formula: A + B + C
link: log μ_ijk = μ + α_i + β_j + γ_k
notation: [A][B][C]
df: rst − r − s − t + 2
… through 2. One interaction (B and C, say):
model formula: A + B*C
link: log μ_ijk = μ + α_i + β_j + γ_k + (βγ)_jk
notation: [A][BC]
df: rst − r − st + 1
… to 5. All possible interactions:
model formula: A*B*C
notation: [ABC]
df: 0
Model selection: by backward elimination or forward selection through the hierarchy of models containing all 3 variables.
saturated: [ABC]
all three 2-way interactions: [AB][AC][BC]
two 2-way interactions: [AB][AC], [AB][BC], [AC][BC]
one 2-way interaction: [AB][C], [A][BC], [AC][B]
independence: [A][B][C]
Our models can include: mean (intercept) + factor effects + 2-way interactions + 3-way interaction.
Illustration 8.4 Models for lizards data (Example 29):
liz = array(c(32, 86, 11, 35, 61, 73, 41, 70), dim = c(2, 2, 2))
n.liz = as.vector(liz)
s = factor(c(1, 1, 1, 1, 2, 2, 2, 2))   # species
d = factor(c(1, 1, 2, 2, 1, 1, 2, 2))   # diameter of perch
h = factor(c(1, 2, 1, 2, 1, 2, 1, 2))   # height of perch
Forward selection – candidate models:
liz.mod1 = glm(n.liz ~ s + d + h, family = poisson)    # deviance … on 4 df
liz.mod2 = glm(n.liz ~ s*d + h, family = poisson)      # † … on 3 df
liz.mod3 = glm(n.liz ~ s + d*h, family = poisson)
liz.mod4 = glm(n.liz ~ s*h + d, family = poisson)
liz.mod5 = glm(n.liz ~ s*d + s*h, family = poisson)    # † 2.03 on 2 df
liz.mod6 = glm(n.liz ~ s*d + d*h, family = poisson)
At each stage the model marked † is retained; liz.mod5, with deviance 2.03 on 2 df, fits well.
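Since liz.mod2 is nested in liz.mod5, the drop in deviance between them is itself a Y² statistic. A sketch of the comparison in R:
anova(liz.mod2, liz.mod5, test = "Chisq")   # does adding s*h improve the fit?
anova(liz.mod5, test = "Chisq")             # sequential contribution of each term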
> summary(liz.mod5)
Call:
glm(formula = n.liz ~ s * d + s * h, family = poisson)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)        …          …       …  < 2e-16 ***
s2                 …          …       …        … **
d2                 …          …       …    …e-08 ***
h2                 …          …       …    …e-09 ***
s2:d2              …          …       …        … ***
s2:h2              …          …       …        … **
Null deviance: … on 7 degrees of freedom
Residual deviance: 2.03 on 2 degrees of freedom
FIN
MAJOR ILLUSTRATION 1 – number of authors classified by number of papers per author; fitted model.
MAJOR ILLUSTRATION 2 – hedge type i = A, B, C, D, E, F, G; hedge length (m) l_i; number of pairs n_i. Model: N_i ~ Pn(θ_i l_i)
Cyclic models
Model: N_i independent Pn(λ_i) with log λ_i = β0 + β1 cos θ_i + β2 sin θ_i. Explanatory variable: the category/month i has been transformed into an angle θ_i = 2πi/r.
It is another example of a non-linear regression model for Poisson responses. It is a generalised linear model.
Fitting in R:
> n = c(40, 34, …, 33, 38)                # response vector
> r = length(n)
> i = 1:r
> th = 2*pi*i/r                           # explanatory vector
> leuk = glm(n ~ cos(th) + sin(th), family = poisson)   # model
Fitted mean is λ̂_i = exp(β̂0 + β̂1 cos θ_i + β̂2 sin θ_i).
Fitted model (plot of observed counts with the fitted cyclic curve)
F73DB3 CDA – data from class:
                Male  Female
Cinema often      22      21
Not often         20      12
P(often|male) = 22/42 = 0.524; P(often|female) = 21/33 = 0.636. Significant difference (on these numbers)? Is there an association between gender and cinema attendance?
Null hypothesis H0: no association between gender and cinema attendance; alternative: not H0. Under H0 we expect 42 × 43/75 = 24.08 in cell 1,1, etc.
> matcinema = matrix(c(22, 20, 21, 12), 2, 2)
> chisq.test(matcinema)
Pearson's Chi-squared test with Yates' continuity correction
data: matcinema
X-squared = 0.5522, df = 1, p-value = 0.4574
> chisq.test(matcinema)$expected
      [,1]  [,2]
[1,] 24.08 18.92        # the null hypothesis can stand:
[2,] 17.92 14.08        # no association between gender and cinema attendance
More students, same proportions:
                Male  Female
Cinema often     110     105
Not often        100      60
P(often|male) = 110/210 = 0.524; P(often|female) = 105/165 = 0.636. Significant difference (on these numbers)?
> matcinema2 = matrix(c(110, 100, 105, 60), 2, 2)
> chisq.test(matcinema2)
Pearson's Chi-squared test with Yates' continuity correction
data: matcinema2
X-squared = 4.3361, df = 1, p-value = 0.0373
> chisq.test(matcinema2)$expected
      [,1]  [,2]
[1,] 120.4  94.6        # the null hypothesis is rejected:
[2,]  89.6  70.4        # there IS an association between gender and cinema attendance
FIN