F73DB3 CATEGORICAL DATA ANALYSIS Workbook Contents page Preface Aims Summary Content/structure/syllabus plus other information Background – computing (R)



Examples
Single classifications (1-13)
Two-way classifications (14-27)
Three-way classifications (28-32)

Example 1 Eye colours Colour A B C D Frequency observed

Example 2 Prussian cavalry deaths
(a) Numbers killed in each unit in each year - frequency table
Number killed ≥ 5 Total
Frequency observed

Example 2 Prussian cavalry deaths
(b) Numbers killed in each unit in each year - raw data …

Example 2 Prussian cavalry deaths
(c) Total numbers killed each year
1875 ’76 ’77 ’78 ’79 ’80 ’81 ’82 ’83 ’84 ’85 ’86 ’87 ’88 ’89 ’90 ’91 ’92 ’93 ‘

Example 4 Political views
(very L) (centre) (very R) Don’t Know Total

Example 7 Vehicle repair visits
Number of visits ≥ 6 Total
Frequency observed

Example 15 Patients in clinical trial

                  Drug  Placebo  Total
Side-effects        15        4     19
No side-effects     35       46     81
Total               50       50    100

§1 INTRODUCTION
Data are counts/frequencies (not measurements)
Categories (explanatory variable)
Distribution in the cells (response)
Frequency distribution
Single classifications
Two-way classifications

Illustration 1.1

                                B: Cause of death
                                Cancer   Other
A: Smoking status  Smoker           30      20
                   Not smoker       15      35

Data may arise as
Bernoulli/binomial data (2 outcomes)
Multinomial data (more than 2 outcomes)
Poisson data
[+ Negative binomial data - the version with range x = 0, 1, 2, …]

§2 POISSON PROCESS AND ASSOCIATED DISTRIBUTIONS

2.1 Bernoulli trials and related distributions
Number of successes: binomial distribution
[Time before kth success: negative binomial distribution
Time to first success: geometric distribution]
Conditional distribution of success times

2.2 Poisson process and related distributions
[Figure: events marked along a time axis]

Poisson process with rate λ
Number of events in a time interval of length t, $N_t$, has a Poisson distribution with mean $\lambda t$

Poisson process with rate λ
Inter-event time, T, has an exponential distribution with parameter λ (mean 1/λ)

Conditional distribution of number of events
Given n events in time (0, t), how many in time (0, s) (s < t)?
Answer: $N_s \mid N_t = n \sim B(n, s/t)$
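
The binomial answer can be checked numerically. The sketch below uses arbitrary illustrative values for the rate, the two interval lengths and n (all of them assumptions, not workbook data) and relies only on the independence of the counts in (0, s) and (s, t).

```r
# Illustrative values only (not from the workbook)
lambda <- 2.5   # rate of the process
t.end  <- 4     # length of the full interval (0, t)
s.end  <- 1.5   # length of the sub-interval (0, s), with s < t
n      <- 6     # number of events observed in (0, t)

k <- 0:n
# P(N_s = k | N_t = n), using independence of N_s and N_t - N_s
cond <- dpois(k, lambda * s.end) * dpois(n - k, lambda * (t.end - s.end)) /
  dpois(n, lambda * t.end)
# Binomial(n, s/t) probabilities for comparison
binom <- dbinom(k, size = n, prob = s.end / t.end)

print(round(cbind(k, conditional = cond, binomial = binom), 4))
```

The two columns agree whatever value of lambda is chosen, which reflects the fact that the conditional distribution does not depend on the rate.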

Splitting into subprocesses
[Figure: a Poisson process split into subprocesses along a time axis]

Realisation of a Poisson process
[Figure: number of events against time]

$X \sim Pn(\lambda)$, $Y \sim Pn(\mu)$, with X, Y independent; then we know $X + Y \sim Pn(\lambda + \mu)$
Given X + Y = n, what is the distribution of X?
Answer: $X \mid X + Y = n \sim B(n, p)$ where $p = \lambda/(\lambda + \mu)$
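
A short derivation of this result, writing λ and μ for the two Poisson means (the symbol names are an assumption, since the original symbols did not survive in this transcript):

```latex
\begin{aligned}
P(X = k \mid X + Y = n)
 &= \frac{P(X = k)\,P(Y = n - k)}{P(X + Y = n)} \\
 &= \frac{e^{-\lambda}\lambda^{k}/k! \;\cdot\; e^{-\mu}\mu^{n-k}/(n-k)!}
         {e^{-(\lambda+\mu)}(\lambda+\mu)^{n}/n!} \\
 &= \binom{n}{k}\, p^{k}(1-p)^{n-k},
 \qquad p = \frac{\lambda}{\lambda+\mu}, \quad k = 0, 1, \ldots, n .
\end{aligned}
```

The first line uses the independence of X and Y; the exponential factors cancel in the second.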

2.3 Inference for the Poisson distribution
$N_i$, i = 1, 2, …, r, i.i.d. Pn(λ), $N = \sum N_i$

CI for λ.

2.4 Dispersion and LR tests for Poisson data
Homogeneity hypothesis $H_0$: the $N_i$s are i.i.d. Pn(λ) (for some unknown λ)
Dispersion statistic (M = sample mean)

Likelihood ratio statistic
form for calculation - see p18
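
The formulas for the two statistics did not survive in this transcript. The sketch below assumes the usual forms, D = Σ(N_i - M)²/M for the dispersion statistic and Y² = 2 Σ N_i log(N_i/M) for the LR statistic, each referred to a chi-square distribution on r - 1 degrees of freedom. The counts are hypothetical.

```r
# Hypothetical counts (not workbook data)
n <- c(3, 5, 2, 6, 4)
r <- length(n)
M <- mean(n)                      # sample mean, the estimate of lambda under H0

# Dispersion statistic (assumed form): sum of squared deviations over the mean
D <- sum((n - M)^2) / M

# LR statistic (assumed form), taking 0 * log(0) as 0
Y2 <- 2 * sum(ifelse(n > 0, n * log(n / M), 0))

# Approximate P-values from chi-square on r - 1 df
p.D  <- pchisq(D,  df = r - 1, lower.tail = FALSE)
p.Y2 <- pchisq(Y2, df = r - 1, lower.tail = FALSE)

print(c(D = D, Y2 = Y2, p.D = p.D, p.Y2 = p.Y2))
```

A large value of either statistic suggests more variation between the counts than a common Poisson mean would produce.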

§3 SINGLE CLASSIFICATIONS
Binary classifications
(a) $N_1$, $N_2$ independent Poisson, with $N_i \sim Pn(\lambda_i)$
or
(b) fixed sample size, $N_1 + N_2 = n$, with $N_1 \sim B(n, p_1)$ where $p_1 = \lambda_1/(\lambda_1 + \lambda_2)$

Qualitative categories
(a) $N_1, N_2, \ldots, N_r$ independent Poisson, with $N_i \sim Pn(\lambda_i)$
or
(b) fixed sample size n, with joint multinomial distribution Mn(n; p)

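Testing goodness of fit
$H_0$: $p_i = \pi_i$, i = 1, 2, …, r
The (Pearson) chi-square statistic is $X^2 = \sum_i (n_i - m_i)^2 / m_i$, where $m_i = n\pi_i$ is the expected frequency in cell i

The statistic often appears as $X^2 = \sum (O - E)^2 / E$

An alternative statistic is the LR statistic $Y^2 = 2 \sum_i n_i \log(n_i / m_i)$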
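
A minimal sketch of both goodness-of-fit statistics in R, computed by hand and compared with `chisq.test`. The observed counts and hypothesised probabilities are hypothetical, and the LR form used is the standard 2 Σ n_i log(n_i/m_i).

```r
# Hypothetical data (not from the workbook)
obs <- c(18, 30, 22, 10)              # observed frequencies
pi0 <- c(0.25, 0.35, 0.25, 0.15)      # hypothesised cell probabilities

m  <- sum(obs) * pi0                  # expected frequencies under H0
X2 <- sum((obs - m)^2 / m)            # Pearson statistic
Y2 <- 2 * sum(obs * log(obs / m))     # LR statistic (all obs > 0 here)
df <- length(obs) - 1

print(c(X2 = X2, Y2 = Y2,
        p.X2 = pchisq(X2, df, lower.tail = FALSE),
        p.Y2 = pchisq(Y2, df, lower.tail = FALSE)))

# The built-in test reproduces the Pearson statistic
gof <- chisq.test(obs, p = pi0)
print(gof)
```

The two statistics are usually close when the expected frequencies are not small.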

Sparse data/small expected frequencies
ensure $m_i \ge 1$ for all cells, and $m_i \ge 5$ for at least about 80% of the cells
if not, combine adjacent cells sensibly

Goodness-of-fit tests for frequency distributions - very well-known application of the statistic (see Illustration 3.4 p 22/23)

Residuals (standardised)

Residuals (standardised) simpler version

MAJOR ILLUSTRATION 1 Publish and be modelled
Number of papers per author
Number of authors
Model

MAJOR ILLUSTRATION 2 Birds in hedges
Hedge type i: A B C D E F G
Hedge length (m) $l_i$
Number of pairs $n_i$
Model: $N_i \sim Pn(\lambda_i l_i)$

§4 TWO-WAY CLASSIFICATIONS

Example 14 Numbers of mice bearing tumours in treated and control groups

              Treated  Control  Total
Tumours             4        5      9
No tumours
Total

Example 15 Patients in clinical trial

                  Drug  Placebo  Total
Side-effects        15        4     19
No side-effects     35       46     81
Total               50       50    100

Patients in clinical trial - take 2

                  Drug  Placebo  Total
Side-effects        15       15     30
No side-effects     35       35     70
Total               50       50    100

4.1 Factors and responses
F × R tables
R × F, R × R (F × F ?)
Qualitative, ordered, quantitative
Analysis the same - interpretation may be different

A two-way table is often called a “contingency table” (especially in the R × R case).

Notation (2 × 2 case, easily extended)

             Exposed   Not exposed   Total
Disease      n_11      n_12          n_1●
No disease   n_21      n_22          n_2●
Total        n_●1      n_●2          n_●● = n

Three possibilities
One overall sample, each subject classified according to 2 attributes - this is R × R
Retrospective study
Prospective study (use of treated and control groups; drug and placebo etc)

4.2 Distribution theory and tests for r × s tables
(a) R × R case
(a1) $N_{ij} \sim Pn(\mu_{ij})$, independent
or, with fixed table total
(a2) Condition on $n = \sum n_{ij}$: $N \mid n \sim Mn(n; p)$ where $N = \{N_{ij}\}$, $p = \{p_{ij}\}$.

(b) F × R case
Condition on the observed marginal totals $n_{\bullet j} = \sum_i n_{ij}$ for the s categories of F (⇒ condition on n and $n_{\bullet 1}$)
⇒ s independent multinomials

Usual hypotheses
(a1) $N_{ij} \sim Pn(\mu_{ij})$, independent
$H_0$: variables/responses are independent: $\mu_{ij} = \mu_{i\bullet}\mu_{\bullet j}/\mu_{\bullet\bullet} = k_j\,\mu_{i\bullet}$
(a2) Multinomial data (table total fixed)
$H_0$: variables/responses are independent: P(row i and column j) = P(row i) P(column j)

(b) Condition on n and $n_{\bullet j}$ (fixed column totals)
$N_{ij} \sim Bi(n_{\bullet j}, p_{ij})$, j = 1, 2, …, s; independent
$H_0$: response is homogeneous ($p_{ij} = p_i$ for all j), i.e. response has the same distribution for all levels of the factor

Tests of $H_0$
The $\chi^2$ (Pearson) statistic: $X^2 = \sum_i \sum_j (n_{ij} - m_{ij})^2 / m_{ij}$, where $m_{ij} = n_{i\bullet} n_{\bullet j}/n$ as before

OR: test based on the LR statistic $Y^2$
Illustration: tonsils data - see p27
In R
Pearson/$X^2$: read data in using “matrix” then use “chisq.test”
LR $Y^2$: calculate it directly (or get it from the results of fitting a “log-linear model” - see later)

4.3 The 2 × 2 table
Statistical tests
(a) Using Pearson’s $\chi^2$

                  Drug  Placebo  Total
Side-effects        15        4     19
No side-effects     35       46     81
Total               50       50    100

$X^2 = \sum_i \sum_j (n_{ij} - m_{ij})^2 / m_{ij}$ where $m_{ij} = n_{i\bullet} n_{\bullet j}/n$

Yates (continuity) correction
Subtract 0.5 from |O - E| before squaring it
Performing the test in R
> n.pat = matrix(c(15, 35, 4, 46), 2, 2)
> chisq.test(n.pat)
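
To make the effect of the correction concrete, the sketch below computes the Pearson statistic for the patients table by hand, with and without subtracting 0.5, and compares each with `chisq.test`. Only the object names beyond `n.pat` are introduced here for illustration.

```r
n.pat <- matrix(c(15, 35, 4, 46), 2, 2)   # columns: drug, placebo

# Expected frequencies m_ij = (row total x column total) / n
E <- outer(rowSums(n.pat), colSums(n.pat)) / sum(n.pat)

X2.plain <- sum((n.pat - E)^2 / E)               # uncorrected Pearson statistic
X2.yates <- sum((abs(n.pat - E) - 0.5)^2 / E)    # Yates-corrected statistic

test.yates <- chisq.test(n.pat)                  # correction is the default for 2 x 2
test.plain <- chisq.test(n.pat, correct = FALSE)

print(E)
print(c(plain = X2.plain, yates = X2.yates))
print(c(p.plain = test.plain$p.value, p.yates = test.yates$p.value))
```

The corrected statistic is always the smaller of the two, so the corrected test is the more conservative.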

(b) Using deviance/LR statistic $Y^2$
(c) Comparing binomial probabilities
(d) Fisher’s exact test

                  Drug  Placebo  Total
Side-effects        15    N = 4     19
No side-effects     35       46     81
Total               50       50    100

Under a random allocation
one-sided P-value = $P(N \le 4)$
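
A sketch of how this one-sided P-value could be obtained in R, assuming N denotes the number of placebo patients among the 19 who report side-effects. Under random allocation N is hypergeometric, so `phyper` gives the tail probability directly, and `fisher.test` with a one-sided alternative should return the same value.

```r
# 100 patients: 50 on placebo, 50 on drug; 19 report side-effects in total.
# N = number of the 19 who are on placebo (observed value 4).
p.hyper <- phyper(4, m = 50, n = 50, k = 19)   # P(N <= 4)

# Same calculation through Fisher's exact test on the 2 x 2 table:
# "greater" tests for an excess of side-effects in the drug column.
n.pat <- matrix(c(15, 35, 4, 46), 2, 2)
p.fisher <- fisher.test(n.pat, alternative = "greater")$p.value

print(c(phyper = p.hyper, fisher = p.fisher))
```

Here P(N ≤ 4) for the placebo cell is the same event as 15 or more side-effects in the drug cell, which is why the two calls agree.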

4.4 Log odds, combining and collapsing tables, interactions
In the 2 × 2 table, the $H_0$: independence condition is equivalent to $\mu_{11}\mu_{22} = \mu_{12}\mu_{21}$
Let $\lambda = \log(\mu_{11}\mu_{22}/(\mu_{12}\mu_{21}))$
Then we have $H_0$: λ = 0
λ is the “log odds ratio”

The “λ = 0” hypothesis is often called the “no association” hypothesis.

The odds ratio is $\mu_{11}\mu_{22}/(\mu_{12}\mu_{21})$
Sample equivalent is $n_{11}n_{22}/(n_{12}n_{21})$

The odds ratio (or log odds ratio) provides a measure of association for the factors in the table.
no association ⇔ odds ratio = 1 ⇔ log odds ratio = 0
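
As a small worked example, the sample odds ratio and log odds ratio for the patients table (Example 15) can be computed as follows. The dimension names are added here only for readability.

```r
n.pat <- matrix(c(15, 35, 4, 46), 2, 2,
                dimnames = list(c("Side-effects", "No side-effects"),
                                c("Drug", "Placebo")))

# Sample odds ratio n11 * n22 / (n12 * n21) and its logarithm
or.hat     <- (n.pat[1, 1] * n.pat[2, 2]) / (n.pat[1, 2] * n.pat[2, 1])
log.or.hat <- log(or.hat)

print(c(odds.ratio = or.hat, log.odds.ratio = log.or.hat))
```

The odds of side-effects are 15/35 on the drug and 4/46 on the placebo, so the ratio 690/140 is well above the value 1 that corresponds to no association.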

Don’t combine heterogeneous tables!

Interaction
An interaction exists between two factors when the effect of one factor is different at different levels of another factor.

§5 INTRODUCTION TO GENERALISED LINEAR MODELS (GLMs)
Normal linear model
Y|x ~ N with $E[Y|x] = \alpha + \beta x$
or $E[Y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_r x_r = \beta' x$
i.e. $E[Y|x] = \mu(x) = \beta' x$

We are explaining μ(x) using a linear predictor (a linear function of the explanatory data)
Generalised linear model
Now we set $g(\mu(x)) = \beta' x$ for some function g
We explain g(μ(x)) using a linear function of the explanatory data, where g is called the link function

e.g. modelling a Poisson mean we use a log link $g(\lambda) = \log\lambda$
We use a linear predictor to explain log λ rather than λ itself: the model is
Y|x ~ Pn with mean $\lambda_x$, with $\log\lambda_x = \alpha + \beta x$ or $\log\lambda_x = \beta' x$
This is a log-linear model

An example is a trend model in which we use $\log\lambda_i = \alpha + \beta i$
Another example is a cyclic model in which we use $\log\lambda_i = \beta_0 + \beta_1\cos\theta_i + \beta_2\sin\theta_i$

§6 MODELS FOR SINGLE CLASSIFICATIONS
6.1 Single classifications - trend models
Data: numbers in r categories
Model: $N_i$, i = 1, 2, …, r, independent $Pn(\lambda_i)$

Basic case
$H_0$: $\lambda_i$’s equal v $H_1$: $\lambda_i$’s follow a trend
Let $X_j$ be category of observation j
$P(X_j = i) = 1/r$
Test based on: see Illustration 6.1

A more general model
$N_i$ independent $Pn(\lambda_i)$ with $\log\lambda_i = \alpha + \beta i$
Log-linear model

It is a linear regression model for $\log\lambda_i$ and a non-linear regression model for $\lambda_i$. It is a generalised linear model. Here the link between the parameter we are estimating and the linear predictor is the log function - it is a “log link”.

Fitting in R
Example 13: stressful events data
> n = c(15, 11, …, 1, 4)                    # response vector
> r = length(n)
> i = 1:r                                   # explanatory vector
> stress = glm(n ~ i, family = poisson)     # model

> summary(stress)
Call: glm(formula = n ~ i, family = poisson)      [model being fitted]
Deviance Residuals: Min 1Q Median 3Q Max      [summary information on the residuals]
Coefficients: Estimate Std. Error z value Pr(>|z|)      [information on the fitted parameters]
(Intercept) < 2e-16 ***
i e-07 ***

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: on 17 degrees of freedom
Residual deviance: on 16 degrees of freedom      [deviances ($Y^2$ statistics)]
AIC:
Number of Fisher Scoring iterations: 4

Fitted mean is $\exp(\hat\alpha + \hat\beta i)$
e.g. for date 6, i = 6 and fitted mean is 9.980

[Figure: fitted model]

Test of $H_0$: no trend
The null fit: all fitted values equal (to the observed mean); $Y^2$ ~ $\chi^2$ on 17 df
The trend model: fitted values $\exp(\hat\alpha + \hat\beta i)$; $Y^2$ ~ $\chi^2$ on 16 df
Crude 95% CI for slope is estimate ± 2(0.0168), i.e. estimate ± 0.034

The lower the value of the residual deviance, the better in general is the fit of the model.
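
The comparison of the null fit with the trend model can be sketched as below. The counts are hypothetical stand-ins, since the Example 13 data are only partly shown in this transcript; the object names `null.fit` and `trend.fit` are also introduced here for illustration.

```r
# Hypothetical counts in 12 ordered categories (not the Example 13 data)
n <- c(14, 12, 13, 9, 10, 8, 9, 6, 7, 5, 6, 4)
i <- seq_along(n)

null.fit  <- glm(n ~ 1, family = poisson)   # no trend: common mean
trend.fit <- glm(n ~ i, family = poisson)   # log-linear trend

# Residual deviances (Y^2 statistics) and the drop in deviance for the trend term
print(anova(null.fit, trend.fit, test = "Chisq"))

# Fitted mean at i = 6, from fitted() and directly from the coefficients
b <- coef(trend.fit)
print(c(fitted = unname(fitted(trend.fit)[6]),
        by.hand = unname(exp(b[1] + b[2] * 6))))

# Crude 95% CI for the slope: estimate +/- 2 standard errors
est <- coef(summary(trend.fit))["i", c("Estimate", "Std. Error")]
print(est["Estimate"] + c(-2, 2) * est["Std. Error"])
```

The null model's fitted values all equal the observed mean, and the drop in deviance between the two fits is referred to chi-square on 1 degree of freedom.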

hwu Basic residuals

6.2 Taking into account a deterministic denominator - using an “offset” for the “exposure”
Model: $N_x \sim Pn(\lambda_x)$ where $E[N_x] = \lambda_x = E_x b\theta^x$
$\log\lambda_x = \log E_x + c + dx$ (with c = log b and d = log θ)
See the Gompertz model example (p 40, data in Example 26)

We include a term “offset(logE)” in the formula for the linear predictor: in R
> model = glm(n.deaths ~ age + offset(log(exposure)), family = poisson)
Fitted value is the estimate of the expected response per unit of exposure (i.e. per unit of the offset E)
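
A minimal sketch of an offset fit on simulated data. The ages, exposures and true parameter values are all invented for illustration and are not the Example 26 data. Note that in R `fitted()` returns the expected count including the exposure, so the rate per unit of exposure is obtained by dividing by the exposure.

```r
set.seed(1)
# Simulated data (hypothetical, for illustration only)
age      <- 40:59
exposure <- round(runif(length(age), 500, 1500))     # exposed to risk at each age
n.deaths <- rpois(length(age), exposure * exp(-7 + 0.09 * age))

# Log-linear model with log(exposure) as an offset (coefficient fixed at 1)
model <- glm(n.deaths ~ age + offset(log(exposure)), family = poisson)
print(coef(model))             # estimates of c and d in log(lambda_x) = log(E_x) + c + d x

# Expected counts and estimated rates per unit of exposure
expected.counts <- fitted(model)
rate.hat        <- expected.counts / exposure
print(head(cbind(age, exposure, n.deaths, expected.counts, rate.hat)))
```

Exponentiating the fitted intercept and slope gives estimates of b and θ in the Gompertz form.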

§7 LOGISTIC REGRESSION
for modelling proportions
we have a binary response for each item and a quantitative explanatory variable
for example: dependence of the proportion of insects killed in a chamber on the concentration of a chemical present - we want to predict the proportion killed from the concentration

for example: dependence of the proportion of
women who smoke - on age
metal bars on test which fail - on pressure applied
policies which give rise to claims - on sum insured
Model: # successes at value $x_i$ of explanatory variable: $N_i \sim bi(n_i, \pi_i)$

We use a glm - we do not predict $\pi_i$ directly; we predict a function of $\pi_i$ called the logit of $\pi_i$.
The logit function is given by: $\mathrm{logit}(\pi) = \log\left(\frac{\pi}{1 - \pi}\right)$
It is the “log odds”.

See Illustration 7.1 p 43
[Figure: proportion v dose]

[Figure: logit(proportion) v dose]

This leads to the “logistic regression” model $\mathrm{logit}(\pi_i) = a + bx_i$
[c.f. log linear model $N_i$ ~ Poisson($\lambda_i$) with $\log\lambda_i = a + bx_i$]

We are using a logit link
We use a linear predictor to explain logit(π) rather than π itself

The method based on the use of this model is called logistic regression

Data:
explanatory      # successes   group   observed
variable value                 size    proportion
x_1              n_11          n_1     n_11/n_1
x_2              n_21          n_2     n_21/n_2
…
x_s              n_s1          n_s     n_s1/n_s

In R we declare the proportion of successes as the response and include the group sizes as a set of weights
> drug.mod1 = glm(propdead ~ dose, weights = groupsize, family = binomial)
explanatory vector is dose
note the family declaration
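
The sketch below fits this kind of model to hypothetical dose-response data (the doses, group sizes and numbers dead are invented, not the Illustration 7.1 data). It also shows the equivalent two-column form of the response, which gives the same fit, and a small `logit` helper defined here for illustration.

```r
# Hypothetical grouped data (not from the workbook)
dose      <- c(1, 2, 3, 4, 5, 6)
groupsize <- rep(40, 6)
ndead     <- c(3, 8, 15, 24, 31, 37)
propdead  <- ndead / groupsize

# Proportion as response, group sizes as weights
drug.mod1 <- glm(propdead ~ dose, weights = groupsize, family = binomial)

# Equivalent specification: two-column matrix of (successes, failures)
drug.mod1b <- glm(cbind(ndead, groupsize - ndead) ~ dose, family = binomial)
print(rbind(weights.form = coef(drug.mod1), cbind.form = coef(drug.mod1b)))

# The logit (log odds) transformation and the fitted line on that scale
logit <- function(p) log(p / (1 - p))
print(cbind(dose, observed.logit = logit(propdead),
            fitted.logit = predict(drug.mod1, type = "link")))

# Predicted proportion at a new dose
print(predict(drug.mod1, newdata = data.frame(dose = 3.5), type = "response"))
```

Plotting the observed logits against dose is a quick check of whether a straight line on the logit scale is reasonable.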

RHS of model can be extended if required to include additional explanatory variables and factors, e.g.
> mod3 = glm(mat3 ~ age + socialclass + gender, family = binomial)

drug.mod - see output p44
Coefficients very highly significant (***)
Null deviance 298 on 9df
Residual deviance 17.2 on 8df
But … residual v fitted plot and … fitted v observed proportions plot

model with a quadratic term (dose^2)

§8 MODELS FOR TWO-WAY AND THREE-WAY CLASSIFICATIONS
8.1 Log-linear models for two-way classifications
$N_{ij} \sim Pn(\mu_{ij})$, i = 1, 2, …, r; j = 1, 2, …, s
$H_0$: variables are independent: $\mu_{ij} = \mu_{i\bullet}\mu_{\bullet j}/\mu_{\bullet\bullet}$

⇒ $\log\mu_{ij} = \log\mu_{i\bullet} + \log\mu_{\bullet j} - \log\mu_{\bullet\bullet}$
where $\log\mu_{i\bullet}$ is the row effect, $\log\mu_{\bullet j}$ the column effect and $\log\mu_{\bullet\bullet}$ the overall effect

We “explain” $\log\mu_{ij}$ in terms of additive effects: $\log\mu_{ij} = \mu + \alpha_i + \beta_j$
Fitted values are the expected frequencies
Fitting process gives us the value of $Y^2 = -2\log\lambda$

Fitting a log-linear model
$N_{ij} \sim Pn(\mu_{ij})$, independent, with $\log\mu_{ij} = \mu + \alpha_i + \beta_j$
Declare the response vector (the cell frequencies) and the row/column codes as factors
then use > name = glm(…)

Tonsils data (Example 16)
> n.tonsils = c(19, 497, 29, 560, 24, 269)
> rc = factor(c(1, 2, 1, 2, 1, 2))
> cc = factor(c(1, 1, 2, 2, 3, 3))
> tonsils.mod1 = glm(n.tonsils ~ rc + cc, family = poisson)

Call: glm(formula = n.tonsils ~ rc + cc, family = poisson)
Deviance Residuals:
Coefficients: Estimate Std. Error z value Pr(>|z|)
(Intercept) < 2e-16 ***
rc2 < 2e-16 ***
cc2 *
cc3 e-14 ***
Null deviance: on 5 degrees of freedom
Residual deviance: on 2 degrees of freedom      [$Y^2 = -2\log\lambda$]

The fit of the “independent attributes” model is not good

Patients data (Example 15)
> n.patients = c(15, 4, 35, 46)
> rc = factor(c(1, 1, 2, 2))
> cc = factor(c(1, 2, 1, 2))
> pat.mod1 = glm(n.patients ~ rc + cc, family = poisson)

Call: glm(formula = n.patients ~ rc + cc, family = poisson)
Deviance Residuals:
Coefficients: Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.251e+00 < 2e-16 ***
rc2 e-08 ***
cc2
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: on 3 degrees of freedom
Residual deviance: on 1 degrees of freedom
AIC:

fitted coefficients: coef(pat.mod1)
(Intercept) rc2 cc2
fitted values: fitted(pat.mod1)

Estimates are
Predictors for cells 1,1 and 1,2 are: exp(2.251) = 9.5
Predictors for cells 2,1 and 2,2 are: exp(intercept + rc2 coefficient) = 40.5

Residual deviance: on 1 degree of freedom
⇒ $Y^2$ for testing the model, i.e. for testing $H_0$: response is homogeneous / column distributions are the same / no association between response and treatment group
The lower the value of the residual deviance, the better in general is the fit of the model.
Here the fit of the additive model is very poor (we have of course already concluded that there is an association - P-value about 1%).
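
The additive fit for the patients data can be reproduced as follows. The sketch also computes the LR statistic directly from the fitted values, to show that it is the residual deviance reported by `glm`; the names `m` and `Y2` are introduced here for illustration.

```r
n.patients <- c(15, 4, 35, 46)
rc <- factor(c(1, 1, 2, 2))     # row: side-effects / no side-effects
cc <- factor(c(1, 2, 1, 2))     # column: drug / placebo
pat.mod1 <- glm(n.patients ~ rc + cc, family = poisson)

# Fitted values are the expected frequencies under independence
m <- fitted(pat.mod1)
print(round(m, 2))

# LR statistic computed directly, compared with the residual deviance
Y2 <- 2 * sum(n.patients * log(n.patients / m))
print(c(Y2 = Y2, deviance = deviance(pat.mod1), df = df.residual(pat.mod1)))

# P-value for the test of no association
print(pchisq(deviance(pat.mod1), df.residual(pat.mod1), lower.tail = FALSE))
```

Because the two column totals are equal, the fitted column effect is essentially zero and the fitted values are the same in both columns.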

8.2 Two-way classifications - taking into account a deterministic denominator
See the grouse data (Illustration 8.3 p50, data in Example 25)
Model: $N_{ij} \sim Pn(\lambda_{ij})$ where $E[N_{ij}] = \lambda_{ij} = E_{ij}\exp(\mu + \alpha_i + \beta_j)$
$\log E[N_{ij}/E_{ij}] = \mu + \alpha_i + \beta_j$
i.e. $\log\lambda_{ij} = \log E_{ij} + \mu + \alpha_i + \beta_j$

We include a term “offset(logE)” in the formula for the linear predictor
Fitted value is the estimate of the expected response per unit of exposure (i.e. per unit of the offset E)

8.3 Log-linear models for three-way classifications
Each subject classified according to 3 factors/variables with r, s, t levels respectively
$N_{ijk} \sim Pn(\mu_{ijk})$ with $\log\mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk}$
r × s × t parameters

Model with two factors and an interaction (no longer additive) is $\log\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$
Recall “interaction”

Hierarchic log-linear models
Range of possible models/dependencies
From
1 Complete independence
model formula: A + B + C
link: $\log\mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k$
notation: [A][B][C]
df: rst - r - s - t + 2
Interpretation!

…. through
2 One interaction (B and C say)
model formula: A + B*C
link: $\log\mu_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\beta\gamma)_{jk}$
notation: [A][BC]
df: rst - r - st + 1

…. to
5 All possible interactions
model formula: A*B*C
notation: [ABC]
df: 0

Model selection: by backward elimination or forward selection through the hierarchy of models containing all 3 variables

saturated
[ABC]
[AB][AC][BC]
[AB][AC]   [AB][BC]   [AC][BC]
[AB][C]   [A][BC]   [AC][B]
[A][B][C]
independence

Our models can include
mean (intercept)
+ factor effects
+ 2-way interactions
+ 3-way interaction

Illustration 8.4 Models for lizards data (Example 29)
> liz = array(c(32, 86, 11, 35, 61, 73, 41, 70), dim = c(2, 2, 2))
> n.liz = as.vector(liz)
> s = factor(c(1, 1, 1, 1, 2, 2, 2, 2))   # species
> d = factor(c(1, 1, 2, 2, 1, 1, 2, 2))   # diameter of perch
> h = factor(c(1, 2, 1, 2, 1, 2, 1, 2))   # height of perch

Forward selection († marks the model chosen at each step)
> liz.mod1 = glm(n.liz ~ s + d + h, family = poisson)     # residual deviance on 4 df
> liz.mod2 = glm(n.liz ~ s*d + h, family = poisson)       # † residual deviance on 3 df
> liz.mod3 = glm(n.liz ~ s + d*h, family = poisson)
> liz.mod4 = glm(n.liz ~ s*h + d, family = poisson)
> liz.mod5 = glm(n.liz ~ s*d + s*h, family = poisson)     # † residual deviance 2.03 on 2 df
> liz.mod6 = glm(n.liz ~ s*d + d*h, family = poisson)

> summary(liz.mod5)
Call: glm(formula = n.liz ~ s * d + s * h, family = poisson)
Coefficients: Estimate Std. Error z value Pr(>|z|)
(Intercept) < 2e-16 ***
s2 **
d2 e-08 ***
h2 e-09 ***
s2:d2 ***
s2:h2 **
Null deviance: on 7 degrees of freedom
Residual deviance: on 2 degrees of freedom
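
The forward-selection path for the lizards data can be rerun with the sketch below, which refits the three models on the selected path and tabulates their residual deviances and degrees of freedom. The list name `fits` is introduced here for illustration.

```r
liz <- array(c(32, 86, 11, 35, 61, 73, 41, 70), dim = c(2, 2, 2))
n.liz <- as.vector(liz)
s <- factor(c(1, 1, 1, 1, 2, 2, 2, 2))   # species
d <- factor(c(1, 1, 2, 2, 1, 1, 2, 2))   # diameter of perch
h <- factor(c(1, 2, 1, 2, 1, 2, 1, 2))   # height of perch

liz.mod1 <- glm(n.liz ~ s + d + h,     family = poisson)   # [S][D][H]
liz.mod2 <- glm(n.liz ~ s * d + h,     family = poisson)   # [SD][H]
liz.mod5 <- glm(n.liz ~ s * d + s * h, family = poisson)   # [SD][SH]

# Residual deviance and df for each model on the path
fits <- list(mod1 = liz.mod1, mod2 = liz.mod2, mod5 = liz.mod5)
print(sapply(fits, function(m) c(deviance = deviance(m), df = df.residual(m))))

# Sequential drops in deviance, each referred to chi-square on 1 df
print(anova(liz.mod1, liz.mod2, liz.mod5, test = "Chisq"))
```

Model [SD][SH] says that perch diameter and perch height are independent within each species, and its small residual deviance on 2 df indicates an adequate fit.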


FIN


Cyclic models

Model
$N_i$ independent $Pn(\lambda_i)$ with $\log\lambda_i = \beta_0 + \beta_1\cos\theta_i + \beta_2\sin\theta_i$
Explanatory variable: the category/month i has been transformed into an angle $\theta_i$

It is another example of a non-linear regression model for Poisson responses. It is a generalised linear model.

Fitting in R
> n = c(40, 34, …, 33, 38)                              # response vector
> r = length(n)
> i = 1:r
> th = 2*pi*i/r                                         # explanatory vector
> leuk = glm(n ~ cos(th) + sin(th), family = poisson)   # model
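
A runnable version of the cyclic fit, using hypothetical monthly counts in place of the data elided above. The object name `cyc` and the amplitude and peak-angle summaries are additions for illustration.

```r
# Hypothetical monthly counts (not the workbook data)
n  <- c(40, 34, 30, 27, 25, 24, 26, 29, 33, 38, 41, 42)
r  <- length(n)
i  <- 1:r
th <- 2 * pi * i / r                      # month converted to an angle

cyc <- glm(n ~ cos(th) + sin(th), family = poisson)
b <- coef(cyc)
print(b)

# Fitted means exp(b0 + b1 cos(theta) + b2 sin(theta))
print(round(cbind(month = i, observed = n, fitted = fitted(cyc)), 2))

# Amplitude and peak angle of the fitted cycle on the log scale
amplitude  <- sqrt(b[2]^2 + b[3]^2)
peak.angle <- atan2(b[3], b[2]) %% (2 * pi)
print(c(amplitude = unname(amplitude), peak.angle = unname(peak.angle)))
```

The cosine and sine terms together describe a single annual cycle whose timing is estimated from the data.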

hwu Fitted mean is

hwu Fitted model

F73DB3 CDA Data from class

               Male  Female
Cinema often     22      21
Not often        20      12

P(often|male) = 22/42 = 0.524
P(often|female) = 21/33 = 0.636
significant difference (on these numbers)?
is there an association between gender and cinema attendance?

Null hypothesis $H_0$: no association between gender and cinema attendance
Alternative: not $H_0$
Under $H_0$ we expect 42 × 43/75 = 24.08 in cell 1,1 etc.

> matcinema = matrix(c(22, 20, 21, 12), 2, 2)
> chisq.test(matcinema)
Pearson's Chi-squared test with Yates' continuity correction
data: matcinema
X-squared = …, df = 1, p-value = …
> chisq.test(matcinema)$expected
null hypothesis can stand
no association between gender and cinema attendance

more students, same proportions

               Male  Female
Cinema often    110     105
Not often       100      60

P(often|male) = 110/210 = 0.524
P(often|female) = 105/165 = 0.636
significant difference (on these numbers)?

> matcinema2 = matrix(c(110, 100, 105, 60), 2, 2)
> chisq.test(matcinema2)
Pearson's Chi-squared test with Yates' continuity correction
data: matcinema2
X-squared = …, df = 1, p-value = …
> chisq.test(matcinema2)$expected
null hypothesis is rejected
there IS an association between gender and cinema attendance
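
Both cinema tables can be run side by side with the sketch below, which shows the same proportions giving different conclusions at different sample sizes. The helper `prop.often` and the names `t1` and `t2` are introduced here for illustration.

```r
matcinema  <- matrix(c(22, 20, 21, 12), 2, 2)      # class data: columns male, female
matcinema2 <- matrix(c(110, 100, 105, 60), 2, 2)   # five times as many students

# Proportion attending often, by gender (same in both tables)
prop.often <- function(tab) tab[1, ] / colSums(tab)
print(rbind(class = prop.often(matcinema), larger = prop.often(matcinema2)))

t1 <- chisq.test(matcinema)     # Yates' correction applied by default
t2 <- chisq.test(matcinema2)

print(t1$expected)              # expected frequencies under H0 for the class data
print(c(X2.class = unname(t1$statistic), p.class = t1$p.value,
        X2.larger = unname(t2$statistic), p.larger = t2$p.value))
```

With 75 students the P-value is well above 5%, while with 375 students and identical proportions it falls below 5%.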

FIN