
1 The Analysis of Categorical Data

2 Categorical variables When both predictor and response variables are categorical (presence or absence, color, etc.), the data in such a study represent counts, or frequencies, of observations in each category

3 Analysis
Data: a single categorical predictor variable. Analysis: organized as two-way contingency tables, and tested with a chi-square or G-test.
Data: multiple predictor variables (or complex models). Analysis: organized as multi-way contingency tables, and analyzed using either log-linear models or classification trees.

4 Two way Contingency Tables Analysis of contingency tables is done correctly only on the raw counts, not on the percentages, proportions, or relative frequencies of the data

5 Wildebeest carcasses from the Serengeti (Sinclair and Arcese 1995)

6 Sex, cause of death, and bone marrow type
Sex (males / females)
Cause of death (predation / other)
Bone marrow type:
1. Solid white fatty (SWF; healthy animal)
2. Opaque gelatinous (OG)
3. Translucent gelatinous (TG)

7 Data
Sex     Marrow   Death by predation
Male    SWF      Yes
Male    OG       Yes
Male    TG       Yes
…       …        …

8 Brief format
SEX      MARROW   DEATH   COUNT
FEMALE   SWF      PRED    26
MALE     SWF      PRED    14
FEMALE   OG       PRED    32
MALE     OG       PRED    43
FEMALE   TG       PRED    8
MALE     TG       PRED    10
FEMALE   SWF      NPRED   6
MALE     SWF      NPRED   7
FEMALE   OG       NPRED   26
MALE     OG       NPRED   12
FEMALE   TG       NPRED   16
MALE     TG       NPRED   26

9 Contingency table: Sex * Death Crosstabulation
         Dead
Sex      NPRED   PRED   Total
FEMALE   48      66     114
MALE     45      67     112
Total    93      133    226

10 Contingency table: Sex * Marrow Crosstabulation
         Marrow
Sex      OG    SWF   TG   Total
FEMALE   58    32    24   114
MALE     55    21    36   112
Total    113   53    60   226

11 Contingency table: Death * Marrow Crosstabulation
        Marrow
Death   OG    SWF   TG   Total
NPRED   38    13    42   93
PRED    75    40    18   133
Total   113   53    60   226
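The three crosstabulations can be obtained by summing the brief-format counts over the variable that is left out. The sketch below is one way to do that in plain Python; the names `ROWS`, `FIELDS`, and `crosstab` are illustrative and do not come from the presentation.

```python
from collections import defaultdict

# Counts copied from the "Brief format" slide: (sex, marrow, death, count).
ROWS = [
    ("FEMALE", "SWF", "PRED", 26), ("MALE", "SWF", "PRED", 14),
    ("FEMALE", "OG", "PRED", 32), ("MALE", "OG", "PRED", 43),
    ("FEMALE", "TG", "PRED", 8), ("MALE", "TG", "PRED", 10),
    ("FEMALE", "SWF", "NPRED", 6), ("MALE", "SWF", "NPRED", 7),
    ("FEMALE", "OG", "NPRED", 26), ("MALE", "OG", "NPRED", 12),
    ("FEMALE", "TG", "NPRED", 16), ("MALE", "TG", "NPRED", 26),
]
FIELDS = {"sex": 0, "marrow": 1, "death": 2}


def crosstab(row_var, col_var):
    """Sum the counts over every variable except row_var and col_var."""
    table = defaultdict(int)
    for row in ROWS:
        key = (row[FIELDS[row_var]], row[FIELDS[col_var]])
        table[key] += row[3]
    return dict(table)


sex_by_death = crosstab("sex", "death")
sex_by_marrow = crosstab("sex", "marrow")
death_by_marrow = crosstab("death", "marrow")

for name, table in [("Sex * Death", sex_by_death),
                    ("Sex * Marrow", sex_by_marrow),
                    ("Death * Marrow", death_by_marrow)]:
    print(name, table)
```

Only raw counts are summed here, in line with the earlier point that contingency tables are analyzed on counts rather than on proportions.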

12 Are the variables independent? We want to know, for example, whether males are more likely to die by predation than females. Specifying the null hypothesis: the predictor and response variables are not associated with each other. The two variables are independent of each other, and the observed degree of association is no stronger than we would expect by chance or random sampling

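13 Calculating the expected values The expected value is the total number of observations (N) times the probability of an individual being both male and killed by predation: $E_{\text{male, pred}} = N \times P(\text{male} \cap \text{predation})$

14 The probability of two independent events If the two events are independent, $P(\text{male} \cap \text{predation}) = P(\text{male}) \times P(\text{predation})$. Because we have no information other than the data, we estimate each of the probabilities on the right-hand side of the equation from the marginal totals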

15 Contingency table: Sex * Death expected values
         Dead
Sex      NPRED    PRED     Total   P
FEMALE   46.91    67.09    114     0.5044
MALE     46.09    65.91    112     0.4956
Total    93       133
P        0.4115   0.5885   N=226
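The expected values in this table follow from the marginal totals alone: each cell is N times the estimated row probability times the estimated column probability, which simplifies to (row total × column total) / N. A minimal sketch, assuming the observed Sex * Death counts from the earlier crosstabulation (the function name `expected_counts` is illustrative):

```python
# Observed Sex * Death counts from the crosstabulation slide.
observed = {
    ("FEMALE", "NPRED"): 48, ("FEMALE", "PRED"): 66,
    ("MALE", "NPRED"): 45, ("MALE", "PRED"): 67,
}


def expected_counts(obs):
    """Expected cell counts under independence of rows and columns."""
    n = sum(obs.values())
    row_tot, col_tot = {}, {}
    for (r, c), v in obs.items():
        row_tot[r] = row_tot.get(r, 0) + v
        col_tot[c] = col_tot.get(c, 0) + v
    # E_ij = N * P(row i) * P(column j) = row_total * column_total / N
    return {(r, c): row_tot[r] * col_tot[c] / n
            for r in row_tot for c in col_tot}


expected = expected_counts(observed)
for cell, value in expected.items():
    print(cell, round(value, 2))
```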

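17 Testing the hypothesis: Pearson’s Chi-square test $X^2 = \sum \frac{(O - E)^2}{E} = 0.0866$, P = 0.7685. With Yates' continuity correction, $X^2 = \sum \frac{(|O - E| - 0.5)^2}{E} = 0.0253$, P = 0.8736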

18 The degrees of freedom $df = (\text{rows} - 1)(\text{columns} - 1) = (2 - 1)(2 - 1) = 1$

19 Calculating the P-value We find the probability of obtaining a value of $\chi^2$ as large as or larger than 0.0866 relative to a $\chi^2$ distribution with 1 degree of freedom: P = 0.769
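The test statistic, degrees of freedom, and P-value for the Sex * Death table can be recomputed from the observed counts. The sketch below uses only the standard library; for 1 degree of freedom the upper tail of the chi-square distribution equals `erfc(sqrt(x / 2))`. The second value reported on the Pearson slide (0.0253) is reproduced here by applying Yates' continuity correction, which is an inference from the numbers rather than something the transcript states. Function names are illustrative.

```python
from math import erfc, sqrt

observed = [[48, 66], [45, 67]]  # rows: FEMALE, MALE; columns: NPRED, PRED


def expected_table(obs):
    """Expected counts under independence: row total * column total / N."""
    n = sum(map(sum, obs))
    rows = [sum(r) for r in obs]
    cols = [sum(c) for c in zip(*obs)]
    return [[r * c / n for c in cols] for r in rows]


def pearson_x2(obs, yates=False):
    """Pearson chi-square statistic, optionally with Yates' correction."""
    exp = expected_table(obs)
    adj = 0.5 if yates else 0.0
    return sum((abs(o - e) - adj) ** 2 / e
               for orow, erow in zip(obs, exp)
               for o, e in zip(orow, erow))


def p_value_1df(x2):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return erfc(sqrt(x2 / 2.0))


x2 = pearson_x2(observed)
x2_yates = pearson_x2(observed, yates=True)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print("X2 =", round(x2, 4), "df =", df, "P =", round(p_value_1df(x2), 4))
print("X2 (Yates) =", round(x2_yates, 4), "P =", round(p_value_1df(x2_yates), 4))
```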

21 An alternative The likelihood ratio test: It compares observed values with the distribution of expected values based on the multinomial probability distribution: $G = 2 \sum O \ln(O / E) = 0.0866$
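The likelihood ratio (G) statistic for the same Sex * Death table can be sketched as follows, using the usual form G = 2 Σ O ln(O / E) with expected values taken from the marginal totals. The function name `g_statistic` is illustrative.

```python
from math import erfc, log, sqrt

observed = [[48, 66], [45, 67]]  # rows: FEMALE, MALE; columns: NPRED, PRED


def g_statistic(obs):
    """Likelihood ratio statistic G = 2 * sum(O * ln(O / E))."""
    n = sum(map(sum, obs))
    rows = [sum(r) for r in obs]
    cols = [sum(c) for c in zip(*obs)]
    total = 0.0
    for i, orow in enumerate(obs):
        for j, o in enumerate(orow):
            e = rows[i] * cols[j] / n
            if o > 0:  # a zero cell contributes nothing to the sum
                total += o * log(o / e)
    return 2.0 * total


g = g_statistic(observed)
p = erfc(sqrt(g / 2.0))  # upper tail of chi-square with 1 df
print("G =", round(g, 4), "P =", round(p, 4))
```

For this table G and Pearson's statistic agree to four decimals because the observed counts sit very close to their expected values.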

22 Two way contingency tables Sex * Death Crosstabulation: Sex * Marrow Crosstabulation: Marrow * Death Crosstabulation:

23 Which test to choose?
Model   Rows/Columns      Sample size   Test
I       Not fixed         small         G-test, with corrections
II      Fixed/not fixed   small         G-test, with corrections
I       Not fixed         large         G-test, Chi-square test
II      Fixed/not fixed   large         G-test, Chi-square test
III     Fixed                           Fisher exact test

24 Log-linear models Multi-way Contingency Tables

25 Multiple two-way tables
Females   Marrow
Death     OG   SWF   TG   Total
PRED      32   26    8    66
NPRED     26   6     16   48
Total     58   32    24   114

Males     Marrow
Death     OG   SWF   TG   Total
PRED      43   14    10   67
NPRED     12   7     26   45
Total     55   21    36   112

26 Log-linear models They treat the cell frequencies as counts distributed as a Poisson random variable. The expected cell frequencies are modeled against the variables using the log link and a Poisson error term. They are fit, and their parameters estimated, using maximum likelihood techniques

27 Log-linear models Do not distinguish response and predictor variables: all the variables are considered equally as response variables

28 However, a logit model with categorical variables can be analyzed as a log-linear model

29 Two way tables For a two way table (I by J) we can fit two log-linear models. The first is a saturated (full) model: $\log f_{ij} = \text{constant} + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}$
$f_{ij}$ is the expected frequency in cell ij
$\lambda_i^X$ is the effect of category i of variable X
$\lambda_j^Y$ is the effect of category j of variable Y
$\lambda_{ij}^{XY}$ is the effect of any interaction between X and Y
This model fits the observed frequencies perfectly

30 Note The effect does not imply any causality, just the influence of a variable or interaction between variables on the log of the expected number of observations in a cell

31 Two way tables The second log-linear model represents independence of the two variables (X and Y) and is a reduced model: $\log f_{ij} = \text{constant} + \lambda_i^X + \lambda_j^Y$. The interpretation of this model is that the log of the expected frequency in any cell is a function of the mean of the log of all the expected frequencies plus the effect of variable X and the effect of variable Y. This is an additive linear model with no interactions between the two variables

32 Interpretation The parameters of the log-linear models are the effects of a particular category of each variable on the expected frequencies: i.e. a larger $\lambda$ means that the expected frequencies will be larger for that category. These parameters are also deviations from the mean of all log expected frequencies
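To make the interpretation concrete, the sketch below decomposes the log cell frequencies of the Sex * Death table into a constant (the mean of all log frequencies), row and column effects (deviations from that mean), and an interaction remainder. It assumes the common sum-to-zero parameterization, which the slides do not state explicitly; the function name `loglinear_terms` is illustrative.

```python
from math import log

observed = [[48, 66], [45, 67]]  # rows: FEMALE, MALE; columns: NPRED, PRED


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def loglinear_terms(table):
    """Split log cell values into constant, row, column, interaction terms."""
    logs = [[log(v) for v in row] for row in table]
    constant = mean(v for row in logs for v in row)
    lam_x = [mean(row) - constant for row in logs]
    lam_y = [mean(col) - constant for col in zip(*logs)]
    lam_xy = [[logs[i][j] - constant - lam_x[i] - lam_y[j]
               for j in range(len(lam_y))]
              for i in range(len(lam_x))]
    return constant, lam_x, lam_y, lam_xy


n = sum(map(sum, observed))
rows = [sum(r) for r in observed]
cols = [sum(c) for c in zip(*observed)]
expected = [[r * c / n for c in cols] for r in rows]

# Reduced (independence) model: fitted values are the expected counts,
# so the interaction terms vanish.
const_r, lx_r, ly_r, lxy_r = loglinear_terms(expected)
# Saturated model: fitted values equal the observed counts,
# so the interaction terms absorb whatever independence cannot explain.
const_s, lx_s, ly_s, lxy_s = loglinear_terms(observed)

print("independence interaction terms:", lxy_r)
print("saturated interaction terms:", lxy_s)
```

The interaction terms of the saturated model are small here, which is consistent with the non-significant test of Sex * Death independence.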

33 Null hypothesis of independence The $H_0$ is that the sampling or experimental units come from a population of units in which the two variables (rows and columns) are independent of each other in terms of the cell frequencies. It is also a test that $\lambda_{ij}^{XY} = 0$: there is NO interaction between the two variables

34 Test We can test this $H_0$ by comparing the fit of the model without this term to the saturated model that includes this term. We determine the fit of each model by calculating the expected frequencies under each model, comparing the observed and expected frequencies, and calculating the log-likelihood of each model

35 Test We then compare the fit of the two models with the likelihood ratio test statistic $\Delta$. However, the sampling distribution of this ratio ($\Delta$) is not well known, so instead we calculate the $G^2$ statistic: $G^2 = -2 \log \Delta$. $G^2$ follows a $\chi^2$ distribution for reasonable sample sizes and can be generalized to $G^2 = -2(\text{log-likelihood of reduced model} - \text{log-likelihood of full model})$

36 Degrees of freedom The calculated $G^2$ is compared to a $\chi^2$ distribution with (I-1)(J-1) df. This df, (I-1)(J-1), is the difference between the df for the full model (IJ-1) and the df for the reduced model [(I-1)+(J-1)]
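The comparison of the reduced and saturated models can be sketched directly from Poisson log-likelihoods, again for the Sex * Death table. The saturated model reproduces the observed counts, the reduced model uses the independence expected values, and $G^2$ is minus twice the difference in log-likelihood. This is a minimal illustration under those assumptions; `poisson_loglik` is an illustrative name.

```python
from math import lgamma, log

observed = [[48, 66], [45, 67]]  # rows: FEMALE, MALE; columns: NPRED, PRED


def poisson_loglik(obs, fitted):
    """Log-likelihood of independent Poisson counts with the given means."""
    return sum(o * log(f) - f - lgamma(o + 1)
               for orow, frow in zip(obs, fitted)
               for o, f in zip(orow, frow))


n = sum(map(sum, observed))
rows = [sum(r) for r in observed]
cols = [sum(c) for c in zip(*observed)]
reduced_fit = [[r * c / n for c in cols] for r in rows]

ll_full = poisson_loglik(observed, observed)        # saturated model
ll_reduced = poisson_loglik(observed, reduced_fit)  # independence model
g2 = -2.0 * (ll_reduced - ll_full)

I, J = len(observed), len(observed[0])
df = (I * J - 1) - ((I - 1) + (J - 1))  # equals (I - 1) * (J - 1)
print("G2 =", round(g2, 4), "df =", df)
```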

37 Akaike information criterion (Hirotugu Akaike)

38 The full model

39 Complete table
Model   Terms                  $G^2$   df   P       AIC
1       D+S+M                  42.76   7    0.001   28.76
2       D*S                    42.68   6    0.001   30.68
3       D*M                    13.24   5    0.021   3.24
4       S*M                    37.98   5    0.001   27.98
5       D*S+D*M                13.16   4    0.01    5.16
6       D*S+S*M                37.89   4    0.001   29.89
7       D*M+S*M                8.46    3    0.037   2.46
8       D*S+D*M+S*M            7.19    2    0.027   3.19
9       Saturated full model   0       0
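The fits in this table can be approximated without a statistics package by iterative proportional fitting (IPF), which rescales a starting table until it matches the observed margins named in each model. The sketch below is one possible implementation, not the method used in the presentation. It assumes that a model listed as, for example, D*S also contains the main effect of the remaining variable (consistent with the tabulated df), and that the AIC column equals $G^2 - 2\,df$ (inferred from the tabulated values). All names are illustrative.

```python
from itertools import combinations
from math import log

VARS = "DSM"  # D = death, S = sex, M = marrow
LEVELS = {"D": 2, "S": 2, "M": 3}
# Observed counts keyed by (death, sex, marrow), from the brief-format slide.
COUNTS = {
    ("PRED", "FEMALE", "SWF"): 26, ("PRED", "MALE", "SWF"): 14,
    ("PRED", "FEMALE", "OG"): 32, ("PRED", "MALE", "OG"): 43,
    ("PRED", "FEMALE", "TG"): 8, ("PRED", "MALE", "TG"): 10,
    ("NPRED", "FEMALE", "SWF"): 6, ("NPRED", "MALE", "SWF"): 7,
    ("NPRED", "FEMALE", "OG"): 26, ("NPRED", "MALE", "OG"): 12,
    ("NPRED", "FEMALE", "TG"): 16, ("NPRED", "MALE", "TG"): 26,
}
# Margins fitted by each model (assumed reading of the model labels).
MODELS = {
    1: ["D", "S", "M"], 2: ["DS", "M"], 3: ["DM", "S"], 4: ["SM", "D"],
    5: ["DS", "DM"], 6: ["DS", "SM"], 7: ["DM", "SM"],
    8: ["DS", "DM", "SM"],
}


def margin(table, term):
    """Sum a table over all variables not named in term."""
    idx = [VARS.index(v) for v in term]
    out = {}
    for cell, value in table.items():
        key = tuple(cell[i] for i in idx)
        out[key] = out.get(key, 0.0) + value
    return out


def fit(generators, cycles=100):
    """Fitted cell counts by iterative proportional fitting."""
    fitted = dict.fromkeys(COUNTS, 1.0)
    for _ in range(cycles):
        for term in generators:
            idx = [VARS.index(v) for v in term]
            target, current = margin(COUNTS, term), margin(fitted, term)
            for cell in fitted:
                key = tuple(cell[i] for i in idx)
                fitted[cell] *= target[key] / current[key]
    return fitted


def g2(fitted):
    """Likelihood ratio statistic against the observed counts."""
    return 2.0 * sum(n * log(n / fitted[c]) for c, n in COUNTS.items())


def residual_df(generators):
    """Number of cells minus the number of free parameters."""
    terms = {sub for g in generators for r in range(1, len(g) + 1)
             for sub in combinations(g, r)}
    n_params = 1  # the constant
    for term in terms:
        p = 1
        for v in term:
            p *= LEVELS[v] - 1
        n_params += p
    return len(COUNTS) - n_params


results = {}
for number, generators in MODELS.items():
    stat, df = g2(fit(generators)), residual_df(generators)
    results[number] = (stat, df, stat - 2 * df)  # third entry: assumed AIC
    print(number, "+".join(generators), round(stat, 2), df,
          round(stat - 2 * df, 2))
```

Models 1 to 7 have closed-form fitted values, so IPF settles after a single cycle for them; only model 8 (all two-way interactions, no three-way term) genuinely needs the iteration.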

40 Two way interactions (marginal independence)
Term    Models compared   $G^2$     Difference from reference   d.f.        P
D+S+M   (reference)       42.76
D*S     1 vs 2            42.6759   42.76-42.68 = 0.084         7-6 = 1     0.769
D*M     1 vs 3            13.24     42.76-13.24 = 29.520        7-5 = 2     <0.001
S*M     1 vs 4            37.98     42.76-37.98 = 4.778         7-5 = 2     0.092

41 Three way interaction Death*Sex*Marrow: models compared 8 vs 9, $G^2$ = 7.19, df = 2, P = 0.027

42 Conditional independence
Term   Models compared   $G^2$   df   P
D*S    7 vs 8            1.28    1    0.259
D*M    6 vs 8            30.71   2    0.001
S*M    5 vs 8            5.97    2    0.051
Death and marrow have a partial association
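Each row in the marginal and conditional independence tables is a comparison of two nested models: the difference in $G^2$ is referred to a $\chi^2$ distribution whose df is the difference in residual df. The sketch below repeats those comparisons from the tabulated (rounded) fits, so results can differ from the slides in the last decimal. The helper `chi2_sf` is a hand-written upper-tail function for integer df, included only to keep the example free of third-party packages; all names are illustrative.

```python
from math import erfc, exp, gamma, sqrt

# (G2, df) copied from the "Complete table" slide, keyed by model number.
FITS = {1: (42.76, 7), 2: (42.68, 6), 3: (13.24, 5), 4: (37.98, 5),
        5: (13.16, 4), 6: (37.89, 4), 7: (8.46, 3), 8: (7.19, 2)}


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    m = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= m / i
            total += term
        return exp(-m) * total
    total = erfc(sqrt(m))
    for i in range(1, (df - 1) // 2 + 1):
        total += exp(-m) * m ** (i - 0.5) / gamma(i + 0.5)
    return total


def compare(reduced, full):
    """Difference in G2 and df between two nested models, with its P-value."""
    g_red, df_red = FITS[reduced]
    g_full, df_full = FITS[full]
    delta_g, delta_df = g_red - g_full, df_red - df_full
    return delta_g, delta_df, chi2_sf(delta_g, delta_df)


comparisons = [
    ("D*S marginal", 1, 2), ("D*M marginal", 1, 3), ("S*M marginal", 1, 4),
    ("D*S conditional", 7, 8), ("D*M conditional", 6, 8),
    ("S*M conditional", 5, 8),
]
for label, reduced, full in comparisons:
    delta_g, delta_df, p = compare(reduced, full)
    print(label, round(delta_g, 2), delta_df, round(p, 3))
```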

43 Conditional independence
Females   Marrow
Death     OG   SWF   TG   Total
PRED      32   26    8    66
NPRED     26   6     16   48
Total     58   32    24   114

Males     Marrow
Death     OG   SWF   TG   Total
PRED      43   14    10   67
NPRED     12   7     26   45
Total     55   21    36   112

44
Comparison   Males   95 % CI       Females   95 % CI
OG vs TG     0.107   0.041-0.283   0.406     0.150-1.097
SWF vs TG    0.192   0.060-0.616   0.115     0.034-0.395
SWF vs OG    0.558   0.184-1.693   3.521     1.261-9.836
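The transcript does not say how these values were computed. They are reproduced by treating each entry as an odds ratio of non-predation to predation death between two marrow types within one sex, with a Wald 95% confidence interval on the log scale. The sketch below makes that assumption explicit; `TABLES` and `odds_ratio` are illustrative names.

```python
from math import exp, log, sqrt

# Death-by-marrow counts within each sex, from the stratified tables.
TABLES = {
    "MALE": {"PRED": {"OG": 43, "SWF": 14, "TG": 10},
             "NPRED": {"OG": 12, "SWF": 7, "TG": 26}},
    "FEMALE": {"PRED": {"OG": 32, "SWF": 26, "TG": 8},
               "NPRED": {"OG": 26, "SWF": 6, "TG": 16}},
}


def odds_ratio(sex, marrow_a, marrow_b, z=1.96):
    """Odds ratio (NPRED vs PRED) of marrow_a relative to marrow_b."""
    t = TABLES[sex]
    a, b = t["NPRED"][marrow_a], t["PRED"][marrow_a]
    c, d = t["NPRED"][marrow_b], t["PRED"][marrow_b]
    ratio = (a / b) / (c / d)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # Wald SE of log(odds ratio)
    return ratio, exp(log(ratio) - z * se), exp(log(ratio) + z * se)


for sex in ("MALE", "FEMALE"):
    for pair in (("OG", "TG"), ("SWF", "TG"), ("OG", "SWF")):
        ratio, low, high = odds_ratio(sex, *pair)
        print(sex, pair, round(ratio, 3), round(low, 3), round(high, 3))
```

Under this reading, the row labelled "SWF vs OG" matches the call with `("OG", "SWF")`, that is, the odds for OG divided by the odds for SWF.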

45 Complete independence: models compared 1 vs 8, $G^2$ = 35.57, df = 5, P < 0.001

46 Warning Always fit a saturated model first, containing all the variables of interest and all the interactions involving the (potential) nuisance variables. Only delete from the model the interactions that involve the variables of interest.

