F73DB3 CATEGORICAL DATA ANALYSIS Workbook Contents page Preface Aims Summary Content/structure/syllabus plus other information Background – computing (R)

2 Examples Single classifications (1-13) Two-way classifications (14-27) Three-way classifications (28-32) hwu

3 Example 1 Eye colours
Colour                A    B    C    D
Frequency observed   89   66   60   85

4 Example 2 Prussian cavalry deaths
(a) Numbers killed in each unit in each year: frequency table
Number killed          0    1    2    3    4   ≥5   Total
Frequency observed   144   91   32   11    2    0     280

5 hwu Example 2 Prussian cavalry deaths (b) Numbers killed in each unit in each year – raw data 0 0 1 0 0 2 0 0 0 0....................... 0 0 0 2 0 1 0 1 2 0 1........................0 ….. 3 0 0 1 0 0 2 1 0 0 1 0 0 1 0 0 1 1 2 0 1 0 1 1

6 Example 2 Prussian cavalry deaths
(c) Total numbers killed each year
Year     1875  '76  '77  '78  '79  '80  '81  '82  '83  '84
Killed      3    5    7    9   10   18    6   14   11    9
Year      '85  '86  '87  '88  '89  '90  '91  '92  '93  '94
Killed      5   11   15    6   11   17   12   15    8    4

7 Example 4 Political views
View          1    2    3    4    5    6    7   Don't know   Total
Frequency    46  179  196  559  232  150   35           93    1490
(1 = very L, 4 = centre, 7 = very R)

8 Example 7 Vehicle repair visits
Number of visits       0    1    2    3    4    5   ≥6   Total
Frequency observed   295  190   53    5    5    2    0     550

9 Example 15 Patients in clinical trial
                  Drug   Placebo   Total
Side-effects        15         4      19
No side-effects     35        46      81
Total               50        50     100

10 §1 INTRODUCTION Data are counts/frequencies (not measurements) Categories (explanatory variable) Distribution in the cells (response) Frequency distribution Single classifications Two-way classifications hwu

11 Illustration 1.1
                       B: Cause of death
A: Smoking status      Cancer   Other
Smoker                     30      20
Not smoker                 15      35

12 Data may arise as Bernoulli/binomial data (2 outcomes) Multinomial data (more than 2 outcomes) Poisson data [+ Negative binomial data – the version with range x = 0,1,2, …] hwu


14 hwu 2.1 Bernoulli trials and related distributions Number of successes – binomial distribution [Time before k th success – negative binomial distribution Time to first success – geometric distribution] Conditional distribution of success times

15 2.2 Poisson process and related distributions
[Figure: events marked along a time axis]

16 Poisson process with rate λ
Number of events in a time interval of length t, N_t, has a Poisson distribution with mean λt

17 Poisson process with rate λ
Inter-event time, T, has an exponential distribution with parameter λ (mean 1/λ)

19 Conditional distribution of number of events
Given n events in time (0,t), how many in time (0,s) (s < t)?
Answer: N_s | N_t = n ~ B(n, s/t)

20 Splitting into subprocesses
[Figure: events along a time axis, split into subprocesses]

21 Realisation of a Poisson process
[Figure: number of events plotted against time]

23 X ~ Pn(λ), Y ~ Pn(μ), X, Y independent; then we know X + Y ~ Pn(λ + μ)
Given X + Y = n, what is the distribution of X?
Answer: X | X + Y = n ~ B(n, p) where p = λ/(λ + μ)
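The conditional result can be checked numerically from the definition of conditional probability. The sketch below uses hypothetical values for the two Poisson means and for n (they are not from the workbook); the same check applies to N_s | N_t = n with p = s/t.

```r
# Hypothetical parameter values, chosen only for illustration
lambda <- 2.5   # mean of X
mu     <- 4.0   # mean of Y
n      <- 6     # conditioning value of X + Y
k      <- 0:n

# P(X = k | X + Y = n) from the definition of conditional probability,
# using independence of X and Y and X + Y ~ Pn(lambda + mu)
cond <- dpois(k, lambda) * dpois(n - k, mu) / dpois(n, lambda + mu)

# Binomial probabilities with p = lambda / (lambda + mu)
binom <- dbinom(k, n, lambda / (lambda + mu))

print(cbind(k, cond, binom))
```

The two probability columns should agree to rounding error for any choice of the hypothetical values.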

24 hwu 2.3 Inference for the Poisson distribution N i, i = 1, 2, …, r, i. i. d. Pn(λ), N=ΣN i

25 CI for λ
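As a rough sketch of such an interval, the code below applies one common large-sample form, λ̂ ± z √(λ̂/r) with λ̂ = N/r, to the frequency table of Example 2(a). Whether this is the exact form the workbook uses is an assumption; the variable names are illustrative.

```r
# Frequency table from Example 2(a): number killed per unit per year
# (the ">= 5" category has frequency 0 and is omitted)
killed <- 0:4
freq   <- c(144, 91, 32, 11, 2)

r <- sum(freq)            # number of observations (unit-years)
N <- sum(killed * freq)   # total number of deaths
lambda.hat <- N / r       # maximum likelihood estimate of lambda

# Approximate 95% interval (assumed normal approximation)
se <- sqrt(lambda.hat / r)
ci <- lambda.hat + c(-1, 1) * qnorm(0.975) * se
print(c(estimate = lambda.hat, lower = ci[1], upper = ci[2]))
```

Here r = 280 and N = 196, so λ̂ = 0.7 with standard error 0.05.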

26 2.4 Dispersion and LR tests for Poisson data
Homogeneity hypothesis H_0: the N_i s are i.i.d. Pn(λ) (for some unknown λ)
Dispersion statistic (M = sample mean)

27 Likelihood ratio statistic: form for calculation, see p18
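A minimal sketch of both homogeneity tests, applied to the yearly totals of Example 2(c). It assumes the usual forms D = Σ(N_i − M)²/M and Y² = 2 Σ N_i log(N_i/M), each referred to χ² on r − 1 degrees of freedom; the workbook's own calculation form (p18) may be arranged differently, and the choice of this data set for the illustration is also an assumption.

```r
# Yearly totals from Example 2(c), 1875 to 1894
n <- c(3, 5, 7, 9, 10, 18, 6, 14, 11, 9, 5, 11, 15, 6, 11, 17, 12, 15, 8, 4)
r <- length(n)
M <- mean(n)

# Dispersion statistic: sum of squared deviations divided by the sample mean
D <- sum((n - M)^2) / M

# Likelihood ratio statistic; a zero count contributes 0 to the sum
Y2 <- 2 * sum(ifelse(n > 0, n * log(n / M), 0))

# Both statistics are referred to chi-square on r - 1 degrees of freedom
p.D  <- pchisq(D,  df = r - 1, lower.tail = FALSE)
p.Y2 <- pchisq(Y2, df = r - 1, lower.tail = FALSE)
print(c(D = D, Y2 = Y2, p.D = p.D, p.Y2 = p.Y2))
```

The dispersion statistic is (r − 1) times the ratio of sample variance to sample mean, which is why it is sensitive to over-dispersion relative to the Poisson model.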

28 §3 SINGLE CLASSIFICATIONS
Binary classifications
(a) N_1, N_2 independent Poisson, with N_i ~ Pn(λ_i)
or
(b) fixed sample size, N_1 + N_2 = n, with N_1 ~ B(n, p_1) where p_1 = λ_1/(λ_1 + λ_2)

29 hwu Qualitative categories (a) N 1, N 2, …, N r independent Poisson, with N i ~ Pn(λ i ) or (b) fixed sample size n, with joint multinomial distribution Mn(n;p)

30 Testing goodness of fit
H_0: p_i = π_i, i = 1, 2, …, r
Test statistic: X² = Σ (n_i − m_i)²/m_i with m_i = nπ_i
This is the (Pearson) chi-square statistic

31 The statistic often appears as X² = Σ (O − E)²/E

33 An alternative statistic is the LR statistic Y² = 2 Σ O log(O/E)
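The two goodness-of-fit statistics can be compared on the eye-colour frequencies of Example 1. The null hypothesis used here (all four colours equally likely) is an assumption made only for illustration; the workbook may test a different set of π_i.

```r
# Observed frequencies from Example 1 (eye colours A to D)
n <- c(A = 89, B = 66, C = 60, D = 85)

# Assumed null hypothesis for illustration: equal probabilities
pi0 <- rep(1 / 4, 4)
m   <- sum(n) * pi0               # expected frequencies under H0

X2 <- sum((n - m)^2 / m)          # Pearson chi-square statistic
Y2 <- 2 * sum(n * log(n / m))     # likelihood ratio statistic

df <- length(n) - 1
print(c(X2 = X2, Y2 = Y2,
        p.X2 = pchisq(X2, df, lower.tail = FALSE),
        p.Y2 = pchisq(Y2, df, lower.tail = FALSE)))

# Built-in version of the Pearson test, for comparison
builtin <- chisq.test(n, p = pi0)
```

With all expected frequencies equal to 75, the two statistics are typically close; they diverge more when some expected frequencies are small.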

34 Sparse data/small expected frequencies
ensure m_i ≥ 1 for all cells, and m_i ≥ 5 for at least about 80% of the cells
if not, combine adjacent cells sensibly

35 hwu Goodness-of-fit tests for frequency distributions - very well-known application of the statistic (see Illustration 3.4 p 22/23)

36 hwu Residuals (standardised)

37 hwu Residuals (standardised) simpler version

38 MAJOR ILLUSTRATION 1 Publish and be modelled
Number of papers per author     1    2    3    4    5    6    7    8    9   10   11
Number of authors            1062  263  120   50   22    7    6    2    0    1    1
Model

39 MAJOR ILLUSTRATION 2 Birds in hedges
Hedge type i                A      B      C      D      E      F      G
Hedge length (m) l_i     2320   2460   2455   2805   2335   2645   2099
Number of pairs n_i        14     16     14     26     15     40     71
Model: N_i ~ Pn(λ_i l_i)

40 §4 TWO-WAY CLASSIFICATIONS
Example 14 Numbers of mice bearing tumours in treated and control groups
              Treated   Control   Total
Tumours             4         5       9
No tumours         12        74      86
Total              16        79      95

41 Example 15 Patients in clinical trial
                  Drug   Placebo   Total
Side-effects        15         4      19
No side-effects     35        46      81
Total               50        50     100

42 Patients in clinical trial – take 2
                  Drug   Placebo   Total
Side-effects        15        15      30
No side-effects     35        35      70
Total               50        50     100

43 4.1 Factors and responses F × R tables R × F, R × R (F × F ?) Qualitative, ordered, quantitative Analysis the same - interpretation may be different hwu

44 A two-way table is often called a “contingency table” (especially in the R × R case).

45 Notation (2 × 2 case, easily extended)
             Exposed   Not exposed   Total
Disease      n_11      n_12          n_1●
No disease   n_21      n_22          n_2●
Total        n_●1      n_●2          n_●● = n

46 hwu Three possibilities One overall sample, each subject classified according to 2 attributes - this is R × R Retrospective study Prospective study (use of treated and control groups; drug and placebo etc)

47 4.2 Distribution theory and tests for r × s tables
(a) R × R case
(a1) N_ij ~ Pn(μ_ij), independent
or, with fixed table total
(a2) Condition on n = Σ n_ij: N|n ~ Mn(n; p), where N = {N_ij}, p = {p_ij}

48 (b) F × R case
Condition on the observed marginal totals n_●j = Σ_i n_ij for the s categories of F (condition on n and n_●1)
⇒ s independent multinomials

49 Usual hypotheses
(a1) N_ij ~ Pn(μ_ij), independent
H_0: variables/responses are independent: μ_ij = μ_i● μ_●j / μ_●● = k μ_i●
(a2) Multinomial data (table total fixed)
H_0: variables/responses are independent: P(row i and column j) = P(row i) P(column j)

50 (b) Condition on n and n_●j (fixed column totals)
N_ij ~ Bi(n_●j, p_ij), j = 1, 2, …, s; independent
H_0: response is homogeneous (p_ij = p_i for all j), i.e. response has the same distribution for all levels of the factor

51 Tests of H_0
The χ² (Pearson) statistic: X² = Σ_i Σ_j (n_ij − m_ij)²/m_ij
where m_ij = n_i● n_●j / n as before

53 OR: test based on the LR statistic Y²
Illustration: tonsils data (see p27)
In R
Pearson/X²: read data in using “matrix” then use “chisq.test”
LR Y²: calculate it directly (or get it from the results of fitting a “log-linear model”; see later)
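The two routes described above can be sketched for the tonsils data of Example 16. The 2 × 3 layout is taken from the row and column codes used later in the deck; the direct formula Y² = 2 Σ n_ij log(n_ij/m_ij) is the standard LR form and is assumed to match the workbook's.

```r
# Tonsils data (Example 16) as a 2 x 3 matrix, filled column by column
n.tonsils <- matrix(c(19, 497, 29, 560, 24, 269), nrow = 2)

# Pearson X^2 from the built-in test
pearson <- chisq.test(n.tonsils)
m <- pearson$expected            # fitted values: row total x column total / n

# LR statistic Y^2, calculated directly from observed and fitted values
Y2 <- 2 * sum(n.tonsils * log(n.tonsils / m))
df <- (nrow(n.tonsils) - 1) * (ncol(n.tonsils) - 1)

print(c(X2 = unname(pearson$statistic), Y2 = Y2, df = df))
```

Y² computed this way should equal the residual deviance of the independence log-linear model fitted to the same counts.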

54 4.3 The 2 × 2 table
Statistical tests
(a) Using Pearson’s χ²
                  Drug   Placebo   Total
Side-effects        15         4      19
No side-effects     35        46      81
Total               50        50     100

55 where m_ij = n_i● n_●j / n

56 Yates (continuity) correction
Subtract 0.5 from |O – E| before squaring it
Performing the test in R
n.pat=matrix(c(15,35,4,46),2,2)
chisq.test(n.pat)
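To see what the correction does, the statistic can be computed by hand and compared with chisq.test. This is a sketch using the patients table; the helper names E, X2.plain and X2.yates are illustrative.

```r
n.pat <- matrix(c(15, 35, 4, 46), 2, 2)   # patients data, as entered above

# Expected frequencies under H0: row total x column total / n
E <- outer(rowSums(n.pat), colSums(n.pat)) / sum(n.pat)

X2.plain <- sum((n.pat - E)^2 / E)              # uncorrected Pearson statistic
X2.yates <- sum((abs(n.pat - E) - 0.5)^2 / E)   # 0.5 subtracted from |O - E| first

print(c(plain = X2.plain, yates = X2.yates))

# chisq.test applies the correction by default for a 2 x 2 table
corrected   <- chisq.test(n.pat)
uncorrected <- chisq.test(n.pat, correct = FALSE)
```

The corrected statistic is always the smaller of the two, so the correction makes the test more conservative.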

57 hwu (b) Using deviance/LR statistic Y 2 (c) Comparing binomial probabilities (d) Fisher’s exact test

58                Drug   Placebo   Total
Side-effects        15     N = 4      19
No side-effects     35        46      81
Total               50        50     100

59 Under a random allocation
one-sided P-value = P(N ≤ 4) = 0.0047
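The one-sided P-value can be reproduced in two ways. This sketch assumes N counts how many of the 19 side-effect cases fall in the placebo group, so that N is hypergeometric under random allocation.

```r
n.pat <- matrix(c(15, 35, 4, 46), 2, 2)

# Hypergeometric tail: 50 placebo patients, 50 drug patients, 19 cases drawn
p.hyper <- phyper(4, m = 50, n = 50, k = 19)     # P(N <= 4)

# Same tail via Fisher's exact test. "greater" refers to cell [1,1],
# and N <= 4 is the same event as cell [1,1] >= 15
p.fisher <- fisher.test(n.pat, alternative = "greater")$p.value

print(c(hypergeometric = p.hyper, fisher = p.fisher))
```

Both routes condition on all four marginal totals, which is what makes the test exact.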

60 4.4 Log odds, combining and collapsing tables, interactions
In the 2 × 2 table, the H_0: independence condition is equivalent to μ_11 μ_22 = μ_12 μ_21
Let λ = log(μ_11 μ_22 / (μ_12 μ_21))
Then we have H_0: λ = 0
λ is the “log odds ratio”

61 hwu The “λ = 0” hypothesis is often called the “no association” hypothesis.

62 The odds ratio is μ_11 μ_22 / (μ_12 μ_21)
Sample equivalent is n_11 n_22 / (n_12 n_21)

63 The odds ratio (or log odds ratio) provides a measure of association for the factors in the table.
no association ⇔ odds ratio = 1 ⇔ log odds ratio = 0
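As a small worked example, the sample odds ratio for the patients table of Example 15 is the cross-product ratio of the cell counts. The variable names are illustrative.

```r
n.pat <- matrix(c(15, 35, 4, 46), 2, 2)   # patients data (Example 15)

# Sample odds ratio: n11 * n22 / (n12 * n21)
or.hat     <- (n.pat[1, 1] * n.pat[2, 2]) / (n.pat[1, 2] * n.pat[2, 1])
log.or.hat <- log(or.hat)

print(c(odds.ratio = or.hat, log.odds.ratio = log.or.hat))
```

Here the ratio is 690/140, well above 1, in line with the association found by the tests on this table.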

64 hwu Don’t combine heterogeneous tables!

65 hwu Interaction An interaction exists between two factors when the effect of one factor is different at different levels of another factor.

68 §5 INTRODUCTION TO GENERALISED LINEAR MODELS (GLMs)
Normal linear model
Y|x ~ N with E[Y|x] = α + βx
or E[Y|x] = β_0 + β_1 x_1 + β_2 x_2 + … + β_r x_r = βᵀx
i.e. E[Y|x] = μ(x) = βᵀx

69 We are explaining μ(x) using a linear predictor (a linear function of the explanatory data)
Generalised linear model
Now we set g(μ(x)) = βᵀx for some function g
We explain g(μ(x)) using a linear function of the explanatory data, where g is called the link function

70 e.g. modelling a Poisson mean λ we use a log link g(λ) = log λ
We use a linear predictor to explain log λ rather than λ itself: the model is Y|x ~ Pn with mean λ_x, with log λ_x = α + βx or log λ_x = βᵀx
This is a log-linear model

71 An example is a trend model in which we use log λ_i = α + βi
Another example is a cyclic model in which we use log λ_i = β_0 + β_1 cos θ_i + β_2 sin θ_i

72 hwu §6 MODELS FOR SINGLE CLASSIFICATIONS 6.1 Single classifications - trend models Data: numbers in r categories Model: N i, i = 1, 2, …, r, independent Pn(λ i )

73 hwu Basic case H 0 : λ i ’s equal v H 1 : λ i ’s follow a trend Let X j be category of observation j P(X j = i) = 1/r Test based on see Illustration 6.1

74 A more general model
N_i independent Pn(λ_i) with log λ_i = α + βi
Log-linear model

75 It is a linear regression model for log λ_i and a non-linear regression model for λ_i. It is a generalised linear model. Here the link between the parameter we are estimating and the linear predictor is the log function: it is a “log link”.

76 Fitting in R
Example 13: stressful events data
> n = c(15, 11, …, 1, 4)
> r = length(n)
> i = 1:r

77 > n = c(15, 11, …, 1, 4)                  # response vector
> r = length(n)
> i = 1:r                                 # explanatory vector
> stress = glm(n ~ i, family = poisson)   # model

78 > summary(stress)
Call:
glm(formula = n ~ i, family = poisson)             (model being fitted)
Deviance Residuals:                                (summary information on the residuals)
    Min       1Q   Median       3Q      Max
-1.9886  -0.9631   0.1737   0.5131   2.0362
Coefficients:                                      (information on the fitted parameters)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.80316    0.14816  18.920  < 2e-16 ***
i           -0.08377    0.01680  -4.986 6.15e-07 ***

79 Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 50.843 on 17 degrees of freedom
Residual deviance: 24.570 on 16 degrees of freedom      (deviances: Y² statistics)
AIC: 95.825
Number of Fisher Scoring iterations: 4

80 Fitted mean is exp(2.80316 − 0.08377i)
e.g. for date 6, i = 6 and fitted mean is exp(2.30054) = 9.980
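The arithmetic can be checked directly from the printed coefficients. This sketch uses the rounded estimates from the summary output rather than a fitted model object, since the full data vector is not shown; fitted.mean is an illustrative helper name.

```r
# Coefficients as printed in the summary output (rounded values)
a <- 2.80316    # intercept
b <- -0.08377   # slope for i

# Fitted mean under the log link: exp(linear predictor)
fitted.mean <- function(i) exp(a + b * i)

print(a + b * 6)        # linear predictor at i = 6
print(fitted.mean(6))   # fitted mean at i = 6
```

Because the slope is negative, the fitted means decrease geometrically with i, by a factor exp(b) per step.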

81 hwu Fitted model

82 Test of H_0: no trend
The null fit: all fitted values equal (to the observed mean); Y² = 50.84 (~ χ² on 17 df)
The trend model: fitted values exp(2.80316 − 0.08377i); Y² = 24.57 (~ χ² on 16 df)
Crude 95% CI for slope is −0.084 ± 2(0.0168), i.e. −0.084 ± 0.034
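One way to turn these two deviances into a test for trend is to refer their difference to χ² on the difference in degrees of freedom. The sketch below does this from the printed values; treating the drop in deviance this way is the standard nested-model comparison and is assumed to be what the workbook intends.

```r
# Deviances and degrees of freedom as printed in the summary output
null.dev  <- 50.843; null.df  <- 17    # null fit (no trend)
trend.dev <- 24.570; trend.df <- 16    # trend model

# Drop in deviance when the trend term is added
drop.dev <- null.dev - trend.dev
drop.df  <- null.df - trend.df
p.trend  <- pchisq(drop.dev, drop.df, lower.tail = FALSE)
print(c(drop = drop.dev, df = drop.df, p = p.trend))

# Crude 95% interval for the slope: estimate +/- 2 standard errors
slope <- -0.08377; se <- 0.01680
ci <- slope + c(-2, 2) * se
print(ci)
```

The interval excludes zero, which agrees with the large drop in deviance.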

83 hwu The lower the value of the residual deviance, the better in general is the fit of the model.

84 hwu Basic residuals

85 6.2 Taking into account a deterministic denominator – using an “offset” for the “exposure”
Model: N_x ~ Pn(λ_x) where E[N_x] = λ_x = E_x b θ^x
log λ_x = log E_x + c + dx (with c = log b, d = log θ)
See the Gompertz model example (p 40, data in Example 26)

86 We include a term “offset(logE)” in the formula for the linear predictor; in R:
model = glm(n.deaths ~ age + offset(log(exposure)), family = poisson)
Fitted value is the estimate of the expected response per unit of exposure (i.e. per unit of the offset E)
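A minimal sketch of the offset syntax, using hypothetical data invented purely for illustration (the ages, exposures and death counts below are not from Example 26). Note that R's fitted() reports expected counts with the exposure included; the per-unit rate is obtained by dividing by the exposure or, equivalently, from the coefficients alone.

```r
# Hypothetical data, invented only to illustrate the syntax
age      <- c(50, 55, 60, 65, 70, 75)
exposure <- c(1200, 1100, 950, 800, 600, 400)   # e.g. person-years at risk
n.deaths <- c(6, 9, 12, 16, 19, 20)

model <- glm(n.deaths ~ age + offset(log(exposure)), family = poisson)

# Expected counts divided by exposure give the rate per unit of exposure
rate.from.fitted <- fitted(model) / exposure

# The same rate from the coefficients: exp(c + d * age), offset left out
rate.from.coef <- exp(coef(model)[1] + coef(model)[2] * age)

print(cbind(age, rate.from.fitted, rate.from.coef))
```

The offset enters the linear predictor with its coefficient fixed at 1, so no parameter is estimated for it.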

87 hwu §7 LOGISTIC REGRESSION for modelling proportions we have a binary response for each item and a quantitative explanatory variable for example: dependence of the proportion of insects killed in a chamber on the concentration of a chemical present – we want to predict the proportion killed from the concentration

88 for example: dependence of the proportion of
• women who smoke, on age
• metal bars on test which fail, on pressure applied
• policies which give rise to claims, on sum insured
Model: # successes at value x_i of explanatory variable: N_i ~ bi(n_i, π_i)

89 We use a glm: we do not predict π_i directly; we predict a function of π_i called the logit of π_i.
The logit function is given by: logit(π) = log(π/(1 − π))
It is the “log odds”.

90 See Illustration 7.1 p 43: proportion v dose

91 logit(proportion) v dose

92 This leads to the “logistic regression” model logit(π_i) = a + b x_i
[c.f. log-linear model N_i ~ Poisson(λ_i) with log λ_i = a + b x_i]

93 We are using a logit link
We use a linear predictor to explain logit(π) rather than π itself

94 hwu The method based on the use of this model is called logistic regression

95 Data:
explanatory       # successes   group size   observed
variable value                               proportion
x_1               n_11          n_1          n_11/n_1
x_2               n_21          n_2          n_21/n_2
…….
x_s               n_s1          n_s          n_s1/n_s

96 In R we declare the proportion of successes as the response and include the group sizes as a set of weights
drug.mod1 = glm(propdead ~ dose, weights = groupsize, family = binomial)
explanatory vector is dose
note the family declaration

97 RHS of model can be extended if required to include additional explanatory variables and factors, e.g.
mod3 = glm(mat3 ~ age + socialclass + gender, family = binomial)
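The weighted-proportion form above and the two-column matrix form (successes, failures) give the same fit. The sketch below checks this on hypothetical dose-response data invented for illustration; the names dead, mod.w and mod.m are not from the workbook.

```r
# Hypothetical dose-response data, invented for illustration only
dose      <- c(1, 2, 3, 4, 5, 6)
groupsize <- c(40, 40, 40, 40, 40, 40)
dead      <- c(2, 6, 13, 22, 31, 37)
propdead  <- dead / groupsize

# Proportion as response, group sizes as weights
mod.w <- glm(propdead ~ dose, weights = groupsize, family = binomial)

# Equivalent form: two-column matrix of (successes, failures)
mod.m <- glm(cbind(dead, groupsize - dead) ~ dose, family = binomial)

print(rbind(weights = coef(mod.w), matrix = coef(mod.m)))

# Fitted proportion at dose 3.5, via the inverse logit
p.hat <- plogis(coef(mod.w)[1] + coef(mod.w)[2] * 3.5)
print(unname(p.hat))
```

Omitting family = binomial would silently fit a normal linear model instead, so the family declaration matters in both forms.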

98 hwu drug.mod – see output p44 Coefficients very highly significant (***) Null deviance 298 on 9df Residual deviance 17.2 on 8df But … residual v fitted plot and … fitted v observed proportions plot

101 model with a quadratic term (dose^2)

102 §8 MODELS FOR TWO-WAY AND THREE-WAY CLASSIFICATIONS
8.1 Log-linear models for two-way classifications
N_ij ~ Pn(μ_ij), i = 1, 2, …, r; j = 1, 2, …, s
H_0: variables are independent: μ_ij = μ_i● μ_●j / μ_●●

103 ⇒ log μ_ij = log μ_i● + log μ_●j − log μ_●●
(log μ_i●: row effect; log μ_●j: column effect; log μ_●●: overall effect)

104 We “explain” log μ_ij in terms of additive effects: log μ_ij = μ + α_i + β_j
Fitted values are the expected frequencies
Fitting process gives us the value of Y² = −2 log λ

105 Fitting a log-linear model
N_ij ~ Pn(μ_ij), independent, with log μ_ij = μ + α_i + β_j
Declare the response vector (the cell frequencies) and the row/column codes as factors, then use
> name = glm(…)

106 Tonsils data (Example 16)
n.tonsils = c(19,497,29,560,24,269)
rc = factor(c(1,2,1,2,1,2))
cc = factor(c(1,1,2,2,3,3))
tonsils.mod1 = glm(n.tonsils ~ rc + cc, family=poisson)

107 Call:
glm(formula = n.tonsils2 ~ rc + cc, family = poisson)
Deviance Residuals:
       1         2         3         4         5         6
-1.54915   0.34153  -0.24416   0.05645   2.11018  -0.53736
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.27998    0.12287  26.696  < 2e-16 ***
rc2          2.91326    0.12094  24.087  < 2e-16 ***
cc2          0.13232    0.06030   2.195   0.0282 *
cc3         -0.56593    0.07315  -7.737 1.02e-14 ***
---
Null deviance: 1487.217 on 5 degrees of freedom
Residual deviance: 7.321 on 2 degrees of freedom      (this is Y² = −2 log λ)

108 hwu The fit of the “independent attributes” model is not good

109 Patients data (Example 15)
> n.patients = c(15, 4, 35, 46)
> rc = factor(c(1, 1, 2, 2))
> cc = factor(c(1, 2, 1, 2))
> pat.mod1 = glm(n.patients ~ rc + cc, family = poisson)

110 Call:
glm(formula = n.patients ~ rc + cc, family = poisson)
Deviance Residuals:
      1        2        3        4
 1.6440  -2.0199  -0.8850   0.8457
Coefficients:
             Estimate Std. Error  z value Pr(>|z|)
(Intercept) 2.251e+00  2.502e-01    8.996  < 2e-16 ***
rc2         1.450e+00  2.549e-01    5.689 1.28e-08 ***
cc2         2.184e-10  2.000e-01 1.09e-09        1
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 49.6661 on 3 degrees of freedom
Residual deviance: 8.2812 on 1 degrees of freedom
AIC: 33.172

111 fitted coefficients: coef(pat.mod1)
 (Intercept)          rc2          cc2
2.251292e+00 1.450010e+00 2.183513e-10
fitted values: fitted(pat.mod1)
   1    2    3    4
 9.5  9.5 40.5 40.5

112 Estimates are (Intercept) 2.251292, rc2 1.450010, cc2 ≈ 0
Predictors for cells 1,1 and 1,2 are 2.251292: exp(2.251292) = 9.5
Predictors for cells 2,1 and 2,2 are 2.251292 + 1.450010 = 3.701302: exp(3.701302) = 40.5
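The step from coefficients to fitted values can be reproduced by refitting the model and building the linear predictor for each cell by hand. This is a sketch; b and eta are illustrative names.

```r
# Patients data (Example 15), cells in the order (1,1), (1,2), (2,1), (2,2)
n.patients <- c(15, 4, 35, 46)
rc <- factor(c(1, 1, 2, 2))
cc <- factor(c(1, 2, 1, 2))
pat.mod1 <- glm(n.patients ~ rc + cc, family = poisson)

b <- coef(pat.mod1)   # (Intercept), rc2, cc2

# Linear predictor for each cell: add rc2 in row 2 and cc2 in column 2
eta <- c(b[1], b[1] + b[3], b[1] + b[2], b[1] + b[2] + b[3])

print(unname(exp(eta)))            # fitted values built from the coefficients
print(unname(fitted(pat.mod1)))    # fitted values reported by glm
print(deviance(pat.mod1))          # residual deviance (Y^2)
```

The fitted values are the usual independence estimates, row total × column total / n, which is why both columns share the same value within each row here.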

113 Residual deviance: 8.2812 on 1 degree of freedom. This is Y² for testing the model, i.e. for testing H_0: response is homogeneous / column distributions are the same / no association between response and treatment group.
The lower the value of the residual deviance, the better in general is the fit of the model.
Here the fit of the additive model is very poor (we have of course already concluded that there is an association: P-value about 1%).

114 8.2 Two-way classifications - taking into account a deterministic denominator
See the grouse data (Illustration 8.3 p50, data in Example 25)
Model: N_ij ~ Pn(λ_ij) where E[N_ij] = λ_ij = E_ij exp(μ + α_i + β_j)
log E[N_ij / E_ij] = μ + α_i + β_j
i.e. log λ_ij = log E_ij + μ + α_i + β_j

115 hwu We include a term “offset(logE)” in the formula for the linear predictor Fitted value is the estimate of the expected response per unit of exposure (i.e. per unit of the offset E)

116 8.3 Log-linear models for three-way classifications
Each subject classified according to 3 factors/variables with r, s, t levels respectively
N_ijk ~ Pn(μ_ijk) with log μ_ijk = μ + α_i + β_j + γ_k + (αβ)_ij + (αγ)_ik + (βγ)_jk + (αβγ)_ijk
r × s × t parameters

117 Model with two factors and an interaction (no longer additive) is log μ_ij = μ + α_i + β_j + (αβ)_ij
Recall “interaction”

118 8.4 Hierarchic log-linear models
Range of possible models/dependencies
From
1 Complete independence
model formula: A + B + C
link: log μ_ijk = μ + α_i + β_j + γ_k
notation: [A][B][C]
df: rst – r – s – t + 2
Interpretation!

119 …. through
2 One interaction (B and C say)
model formula: A + B*C
link: log μ_ijk = μ + α_i + β_j + γ_k + (βγ)_jk
notation: [A][BC]
df: rst – r – st + 1

120 hwu …. to 5 All possible interactions model formula: A*B*C notation: [ABC] df: 0

121 hwu Model selection: by backward elimination or forward selection through the hierarchy of models containing all 3 variables

122 saturated:     [ABC]
                [AB][AC][BC]
                [AB][AC]    [AB][BC]    [AC][BC]
                [AB][C]     [A][BC]     [AC][B]
independence:   [A][B][C]

123 hwu Our models can include mean (intercept) + factor effects + 2-way interactions + 3-way interaction

124 Illustration 8.4 Models for lizards data (Example 29)
liz = array(c(32, 86, 11, 35, 61, 73, 41, 70), dim = c(2, 2, 2))
n.liz = as.vector(liz)
s = factor(c(1, 1, 1, 1, 2, 2, 2, 2))    # species
d = factor(c(1, 1, 2, 2, 1, 1, 2, 2))    # diameter of perch
h = factor(c(1, 2, 1, 2, 1, 2, 1, 2))    # height of perch

125 Forward selection
liz.mod1 = glm(n.liz ~ s + d + h, family = poisson)
liz.mod2 = glm(n.liz ~ s*d + h, family = poisson)
liz.mod3 = glm(n.liz ~ s + d*h, family = poisson)
liz.mod4 = glm(n.liz ~ s*h + d, family = poisson)
liz.mod5 = glm(n.liz ~ s*d + s*h, family = poisson)
liz.mod6 = glm(n.liz ~ s*d + d*h, family = poisson)

126 Forward selection
liz.mod1 = glm(n.liz ~ s + d + h, family = poisson)      # 25.04 on 4 df
liz.mod2 = glm(n.liz ~ s*d + h, family = poisson)        # † 12.43 on 3 df
liz.mod5 = glm(n.liz ~ s*d + s*h, family = poisson)
liz.mod6 = glm(n.liz ~ s*d + d*h, family = poisson)

127 Forward selection
liz.mod1 = glm(n.liz ~ s + d + h, family = poisson)
liz.mod2 = glm(n.liz ~ s*d + h, family = poisson)        # †
liz.mod5 = glm(n.liz ~ s*d + s*h, family = poisson)      # † 2.03 on 2 df
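The selected path can be re-run and summarised as an analysis of deviance, where each step is judged by the drop in residual deviance against χ² on the drop in degrees of freedom. This sketch refits only the three models on the chosen path; using anova() for the comparison is one convenient route and is assumed rather than taken from the workbook.

```r
# Lizards data (Example 29), entered as in Illustration 8.4
liz   <- array(c(32, 86, 11, 35, 61, 73, 41, 70), dim = c(2, 2, 2))
n.liz <- as.vector(liz)
s <- factor(c(1, 1, 1, 1, 2, 2, 2, 2))   # species
d <- factor(c(1, 1, 2, 2, 1, 1, 2, 2))   # diameter of perch
h <- factor(c(1, 2, 1, 2, 1, 2, 1, 2))   # height of perch

liz.mod1 <- glm(n.liz ~ s + d + h,   family = poisson)   # [S][D][H]
liz.mod2 <- glm(n.liz ~ s*d + h,     family = poisson)   # [SD][H]
liz.mod5 <- glm(n.liz ~ s*d + s*h,   family = poisson)   # [SD][SH]

# Each row after the first tests the interaction added at that step
steps <- anova(liz.mod1, liz.mod2, liz.mod5, test = "Chisq")
print(steps)
```

For a 2 × 2 × 2 table the df formulas above give 4 for complete independence and 3 for a single interaction, matching the residual degrees of freedom of the first two fits.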

128 > summary(liz.mod5)
Call:
glm(formula = n.liz ~ s * d + s * h, family = poisson)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   3.4320     0.1601  21.436  < 2e-16 ***
s2            0.5895     0.1970   2.992 0.002769 **
d2           -0.9420     0.1738  -5.420 5.97e-08 ***
h2            1.0346     0.1775   5.827 5.63e-09 ***
s2:d2         0.7537     0.2161   3.488 0.000486 ***
s2:h2        -0.6967     0.2198  -3.170 0.001526 **
Null deviance: 98.5830 on 7 degrees of freedom
Residual deviance: 2.0256 on 2 degrees of freedom

131 FIN

138 Cyclic models

139 Model: N_i independent Pn(λ_i) with log λ_i = β_0 + β_1 cos θ_i + β_2 sin θ_i
Explanatory variable: the category/month i has been transformed into an angle θ_i

140 hwu It is another example of a non-linear regression model for Poisson responses. It is a generalised linear model.

141 Fitting in R
> n = c(40, 34, …, 33, 38)                              # response vector
> r = length(n)
> i = 1:r
> th = 2*pi*i/r                                         # explanatory vector
> leuk = glm(n ~ cos(th) + sin(th), family = poisson)   # model

142 hwu Fitted mean is

143 hwu Fitted model

144 F73DB3 CDA Data from class
               Male   Female
Cinema often     22       21
Not often        20       12

146            Male   Female   Total
Cinema often     22       21      43
Not often        20       12      32
Total            42       33      75
P(often|male) = 22/42 = 0.524
P(often|female) = 21/33 = 0.636
significant difference (on these numbers)?
is there an association between gender and cinema attendance?

147 Null hypothesis H_0: no association between gender and cinema attendance
Alternative: not H_0
Under H_0 we expect 42 × 43/75 = 24.08 in cell 1,1 etc.

149 > matcinema=matrix(c(22,20,21,12),2,2)
> chisq.test(matcinema)
Pearson's Chi-squared test with Yates' continuity correction
data: matcinema
X-squared = 0.5522, df = 1, p-value = 0.4574
> chisq.test(matcinema)$expected
      [,1]  [,2]
[1,] 24.08 18.92
[2,] 17.92 14.08
null hypothesis can stand: no association between gender and cinema attendance

150 More students, same proportions
               Male   Female   Total
Cinema often    110      105     215
Not often       100       60     160
Total           210      165     375
P(often|male) = 110/210 = 0.524
P(often|female) = 105/165 = 0.636
significant difference (on these numbers)?

153 > matcinema2=matrix(c(110,100,105,60),2,2)
> chisq.test(matcinema2)
Pearson's Chi-squared test with Yates' continuity correction
data: matcinema2
X-squared = 4.3361, df = 1, p-value = 0.03731
> chisq.test(matcinema2)$expected
      [,1] [,2]
[1,] 120.4 94.6
[2,]  89.6 70.4
null hypothesis is rejected: there IS an association between gender and cinema attendance
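The change of verdict comes purely from sample size. Without the continuity correction, multiplying every count by 5 multiplies the Pearson statistic by exactly 5, as the sketch below shows; with Yates' correction (as in the output above) the scaling is only approximate.

```r
matcinema  <- matrix(c(22, 20, 21, 12), 2, 2)
matcinema2 <- 5 * matcinema     # same proportions, five times as many students

# Uncorrected Pearson statistics for the two tables
x2.small <- chisq.test(matcinema,  correct = FALSE)$statistic
x2.large <- chisq.test(matcinema2, correct = FALSE)$statistic

print(c(small = unname(x2.small), large = unname(x2.large),
        ratio = unname(x2.large / x2.small)))
```

The observed proportions are identical in both tables; only the amount of evidence behind them differs.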

154 hwu FIN

