The Analysis of Categorical Data

Categorical variables When both predictor and response variables are categorical (e.g., presence or absence, color), the data represent counts, or frequencies, of observations in each category

Analysis
Data: a single categorical predictor variable. Analysis: organized as a two-way contingency table and tested with a chi-square or G-test.
Data: multiple predictor variables (or complex models). Analysis: organized as a multi-way contingency table and analyzed using either log-linear models or classification trees.

Two way Contingency Tables Analysis of contingency tables is done correctly only on the raw counts, not on the percentages, proportions, or relative frequencies of the data

Wildebeest carcasses from the Serengeti (Sinclair and Arcese 1995)

Sex, cause of death, and bone marrow type
Sex: male / female
Cause of death: predation / other
Bone marrow type:
1. Solid white fatty (SWF; healthy animal)
2. Opaque gelatinous (OG)
3. Translucent gelatinous (TG)

Data
Sex     Marrow   Death by predation
Male    SWF      Yes
Male    OG       Yes
Male    TG       Yes
...     ...      ...

Brief format
SEX      MARROW   DEATH    COUNT
FEMALE   SWF      PRED     26
MALE     SWF      PRED     14
FEMALE   OG       PRED     32
MALE     OG       PRED     43
FEMALE   TG       PRED     8
MALE     TG       PRED     10
FEMALE   SWF      NPRED    6
MALE     SWF      NPRED    7
FEMALE   OG       NPRED    26
MALE     OG       NPRED    12
FEMALE   TG       NPRED    16
MALE     TG       NPRED    26
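The same brief-format counts can be entered directly in software and collapsed into any of the two-way tables. A minimal sketch in Python (pandas assumed; the data frame and column names are illustrative, not from the original slides):

import pandas as pd

# Wildebeest carcass counts in brief format (one row per combination of categories)
data = pd.DataFrame({
    "sex":    ["FEMALE", "MALE"] * 6,
    "marrow": ["SWF", "SWF", "OG", "OG", "TG", "TG"] * 2,
    "death":  ["PRED"] * 6 + ["NPRED"] * 6,
    "count":  [26, 14, 32, 43, 8, 10, 6, 7, 26, 12, 16, 26],
})

# Collapse to the two-way Sex x Death table of raw counts
sex_by_death = data.pivot_table(index="sex", columns="death",
                                values="count", aggfunc="sum")
print(sex_by_death)   # FEMALE: NPRED 48, PRED 66; MALE: NPRED 45, PRED 67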

Contingency table Sex * Death Crosstabulation
          Death
Sex        NPRED   PRED   Total
FEMALE        48     66     114
MALE          45     67     112
Total         93    133     226

Contingency table Sex * Marrow Crosstabulation
          Marrow
Sex         OG   SWF    TG   Total
FEMALE      58    32    24     114
MALE        55    21    36     112
Total      113    53    60     226

Contingency table Death * Marrow Crosstabulation
          Marrow
Death       OG   SWF    TG   Total
NPRED       38    13    42      93
PRED        75    40    18     133
Total      113    53    60     226

Are the variables independent? We want to know, for example, whether males are more likely to die by predation than females. Specifying the null hypothesis: the predictor and response variables are not associated with each other; the two variables are independent, and the observed degree of association is no stronger than we would expect by chance or random sampling

Calculating the expected values Under the null hypothesis, the expected count in a cell (for example, males dead by predation) is the total number of observations (N) times the probability of an observation being both male and dead by predation

The probability of two independent events For independent events, P(A and B) = P(A) x P(B). Because we have no information other than the data, we estimate each of the right-hand terms of this equation from the marginal totals
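Combining the last two slides, the expected count for row i and column j under independence can be written as follows (a standard identity made explicit here; n_{i+} and n_{+j} denote the row and column totals):

E_{ij} = N \,\hat{P}(\text{row } i)\,\hat{P}(\text{column } j)
       = N \cdot \frac{n_{i+}}{N} \cdot \frac{n_{+j}}{N}
       = \frac{n_{i+}\, n_{+j}}{N}

For example, for females dead by predation: E = (114 x 133) / 226 is about 67.1, which appears in the next slide's table.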

Contingency table Sex * Death expected values (N = 226)
          Death
Sex        NPRED    PRED     P
FEMALE     46.91    67.09    0.504
MALE       46.09    65.91    0.496
P          0.412    0.588

Testing the hypothesis: Pearson's chi-square test
χ² = Σ (observed - expected)² / expected
χ² = 0.087, P = 0.769 (with the continuity correction: χ² = 0.025, P = 0.8736)
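A short sketch of this test in Python with scipy (the observed array holds the Sex * Death counts from the crosstabulation above; variable names are illustrative):

import numpy as np
from scipy.stats import chi2_contingency

# Observed Sex x Death counts: rows FEMALE, MALE; columns NPRED, PRED
observed = np.array([[48, 66],
                     [45, 67]])

# Pearson chi-square without the continuity correction
chi2_stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2_stat, p, dof)   # roughly 0.087, 0.769, 1
print(expected)            # matches the expected-value table above

# With Yates' continuity correction (the second P reported on the slide)
chi2_c, p_c, _, _ = chi2_contingency(observed, correction=True)
print(chi2_c, p_c)         # roughly 0.025, 0.874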

The degrees of freedom = (number of rows - 1) x (number of columns - 1) = (2 - 1)(2 - 1) = 1

Calculating the P-value We find the probability of obtaining a value of χ² as large as or larger than the observed 0.087, relative to a χ² distribution with 1 degree of freedom: P = 0.769

An alternative The likelihood ratio (G) test: it compares the observed values with the expected values based on the multinomial probability distribution:
G² = 2 Σ observed x ln(observed / expected)
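In scipy the G statistic can be requested from the same function through its lambda_ argument (a sketch; observed counts as before):

import numpy as np
from scipy.stats import chi2_contingency

# Sex x Death counts: rows FEMALE, MALE; columns NPRED, PRED
observed = np.array([[48, 66],
                     [45, 67]])

# lambda_="log-likelihood" gives the likelihood ratio (G) statistic
g, p, dof, _ = chi2_contingency(observed, correction=False,
                                lambda_="log-likelihood")
print(g, p, dof)   # very close to the Pearson result for these data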

Two way contingency tables
Sex * Death Crosstabulation
Sex * Marrow Crosstabulation
Marrow * Death Crosstabulation

Which test to choose?
Model    Rows / Columns                      Sample size   Test
I, II    Not fixed / fixed for one margin    small         G-test, with corrections
I, II    Not fixed / fixed for one margin    large         G-test, chi-square test
III      Fixed (both margins)                              Fisher exact test
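For the Model III case, scipy provides the Fisher exact test for 2 x 2 tables. A sketch, run here on the Sex * Death counts purely for illustration (those margins were not actually fixed by design):

from scipy.stats import fisher_exact

# 2 x 2 table: rows FEMALE, MALE; columns NPRED, PRED
table = [[48, 66],
         [45, 67]]

odds_ratio, p = fisher_exact(table, alternative="two-sided")
print(odds_ratio, p)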

Log-linear models Multi-way Contingency Tables

Multiple two-way tables
Females           Marrow
Death       OG   SWF    TG   Total
PRED        32    26     8      66
NPRED       26     6    16      48
Total       58    32    24     114

Males             Marrow
Death       OG   SWF    TG   Total
PRED        43    14    10      67
NPRED       12     7    26      45
Total       55    21    36     112

Log-linear models
They treat the cell frequencies as counts distributed as a Poisson random variable
The expected cell frequencies are modeled against the variables using the log-link and Poisson error term
They are fit and parameters estimated using maximum likelihood techniques
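A sketch of fitting such models in Python with statsmodels (Poisson family, log link; the data frame and variable names are illustrative and mirror the brief-format table above):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Cell counts in brief format
cells = pd.DataFrame({
    "sex":    ["FEMALE", "MALE"] * 6,
    "marrow": ["SWF", "SWF", "OG", "OG", "TG", "TG"] * 2,
    "death":  ["PRED"] * 6 + ["NPRED"] * 6,
    "count":  [26, 14, 32, 43, 8, 10, 6, 7, 26, 12, 16, 26],
})

# Saturated log-linear model: all main effects and interactions
saturated = smf.glm("count ~ death * sex * marrow", data=cells,
                    family=sm.families.Poisson()).fit()

# Reduced model: complete independence (D + S + M)
independence = smf.glm("count ~ death + sex + marrow", data=cells,
                       family=sm.families.Poisson()).fit()

print(saturated.deviance)      # essentially 0 for the saturated model
print(independence.deviance)   # G-squared for D + S + M (42.76 on the slides)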

Log-linear models Do not distinguish response and predictor variables: all the variables are considered equally as response variables

However A logit model with categorical variables can be analyzed as a log-linear model

Two way tables For a two-way table (I by J) we can fit two log-linear models. The first is the saturated (full) model:
log f_ij = constant + λ_i^X + λ_j^Y + λ_ij^XY
where f_ij is the expected frequency in cell ij, λ_i^X is the effect of category i of variable X, λ_j^Y is the effect of category j of variable Y, and λ_ij^XY is the effect of any interaction between X and Y. This model fits the observed frequencies perfectly

Note The effect does not imply any causality, just the influence of a variable or interaction between variables on the log of the expected number of observations in a cell

Two way tables The second log-linear model represents independence of the two variables (X and Y) and is a reduced model:
log f_ij = constant + λ_i^X + λ_j^Y
The interpretation of this model is that the log of the expected frequency in any cell is the mean of the logs of all the expected frequencies plus the effect of variable X and the effect of variable Y. This is an additive linear model with no interaction between the two variables

Interpretation The parameters of the log-linear model are the effects of a particular category of each variable on the expected frequencies: a larger λ means that the expected frequencies are larger for that category. These parameters are deviations from the mean of the log expected frequencies

Null hypothesis of independence The H0 is that the sampling or experimental units come from a population of units in which the two variables (rows and columns) are independent of each other in terms of the cell frequencies. It is also a test that λ_ij^XY = 0: there is NO interaction between the two variables

Test We can test this H0 by comparing the fit of the model without this term to the saturated model that includes it. We determine the fit of each model by calculating the expected frequencies under that model, comparing the observed and expected frequencies, and calculating the log-likelihood of each model

Test We then compare the fit of the two models with the likelihood ratio test statistic Λ. However, the sampling distribution of this ratio (Λ) is not well known, so instead we calculate the statistic G² = -2 log Λ. G² follows a χ² distribution for reasonable sample sizes and can be written as
G² = -2 (log-likelihood of the reduced model - log-likelihood of the full model)

Degrees of freedom The calculated G² is compared to a χ² distribution with (I-1)(J-1) df. This df, (I-1)(J-1), is the difference between the df for the full model (IJ - 1) and the df for the reduced model [(I-1) + (J-1)]
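A sketch of this comparison in Python, using two nested Poisson GLMs for the two-way Sex * Death table (names are illustrative; scipy converts G² and its df into a P-value):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Sex x Death cell counts from the crosstabulation above
cells = pd.DataFrame({
    "sex":   ["FEMALE", "FEMALE", "MALE", "MALE"],
    "death": ["NPRED", "PRED", "NPRED", "PRED"],
    "count": [48, 66, 45, 67],
})

full = smf.glm("count ~ sex * death", data=cells,
               family=sm.families.Poisson()).fit()
reduced = smf.glm("count ~ sex + death", data=cells,
                  family=sm.families.Poisson()).fit()

# G-squared = -2 (log-likelihood reduced - log-likelihood full)
g2 = -2 * (reduced.llf - full.llf)
df = int(full.df_model - reduced.df_model)   # (I-1)(J-1) = 1 here
p = chi2.sf(g2, df)
print(g2, df, p)   # should agree closely with the Pearson test above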

Akaike information criterion (AIC): AIC = -2 log-likelihood + 2 x (number of fitted parameters); smaller values indicate a better balance of fit and complexity (Hirotugu Akaike)

The full model
log f_ijk = constant + λ_i^D + λ_j^S + λ_k^M + λ_ij^DS + λ_ik^DM + λ_jk^SM + λ_ijk^DSM
(all main effects, all two-way interactions, and the three-way interaction of Death, Sex, and Marrow)

Complete table
Model                          G²       df    P    AIC
1  D+S+M                      42.76      7
2  D*S                                   6
3  D*M                                   5
4  S*M                                   5
5  D*S+D*M                               4
6  D*S+S*M                               4
7  D*M+S*M                               3
8  D*S+D*M+S*M                 7.19      2
9  Saturated (full) model      0         0
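A sketch of how a table like this can be reproduced in Python with statsmodels. Each model's G² is its deviance relative to the saturated model; following the usual hierarchical convention, a row such as D*S is interpreted here as that interaction plus all main effects (an assumption about the slide's notation):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

cells = pd.DataFrame({
    "S": ["FEMALE", "MALE"] * 6,
    "M": ["SWF", "SWF", "OG", "OG", "TG", "TG"] * 2,
    "D": ["PRED"] * 6 + ["NPRED"] * 6,
    "count": [26, 14, 32, 43, 8, 10, 6, 7, 26, 12, 16, 26],
})

models = {
    "1 D+S+M":        "count ~ D + S + M",
    "2 D*S":          "count ~ D*S + M",
    "3 D*M":          "count ~ D*M + S",
    "4 S*M":          "count ~ S*M + D",
    "5 D*S+D*M":      "count ~ D*S + D*M",
    "6 D*S+S*M":      "count ~ D*S + S*M",
    "7 D*M+S*M":      "count ~ D*M + S*M",
    "8 D*S+D*M+S*M":  "count ~ D*S + D*M + S*M",
    "9 saturated":    "count ~ D*S*M",
}

# Model 1 should give G2 close to 42.76 and model 8 close to 7.19, as on the slides
for name, formula in models.items():
    fit = smf.glm(formula, data=cells, family=sm.families.Poisson()).fit()
    g2 = fit.deviance                 # G-squared against the saturated model
    df = int(fit.df_resid)            # residual degrees of freedom
    p = chi2.sf(g2, df) if df > 0 else float("nan")
    print(f"{name:16s} G2={g2:6.2f} df={df} P={p:.3f} AIC={fit.aic:.2f}")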

Two way interactions (marginal independence)
Term     Models compared    G²       d.f.   P
D+S+M    (reference)        42.76
D*S      1 vs 2                      1
D*M      1 vs 3                      2      <0.001
S*M      1 vs 4                      2

Three way interaction Death*Sex*Marrow Models compared: 8 vs 9, G² = 7.19, df = 2, P = 0.027

Conditional independence
Term    Models compared    G²    df    P
D*S     7 vs 8                   1
D*M     6 vs 8                   2
S*M     5 vs 8                   2
Death and marrow have a partial association
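For example, the conditional test of D*M in this table (model 6 vs model 8) is the change in deviance when the D*M term is dropped from the model with all two-way interactions. A sketch, reusing the brief-format data (names illustrative):

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

cells = pd.DataFrame({
    "S": ["FEMALE", "MALE"] * 6,
    "M": ["SWF", "SWF", "OG", "OG", "TG", "TG"] * 2,
    "D": ["PRED"] * 6 + ["NPRED"] * 6,
    "count": [26, 14, 32, 43, 8, 10, 6, 7, 26, 12, 16, 26],
})

def loglin(formula):
    return smf.glm(formula, data=cells, family=sm.families.Poisson()).fit()

m6 = loglin("count ~ D*S + S*M")          # model 6: D*M term omitted
m8 = loglin("count ~ D*S + D*M + S*M")    # model 8: all two-way terms

g2 = m6.deviance - m8.deviance            # change in G2 from dropping D*M
df = int(m6.df_resid - m8.df_resid)       # 2 degrees of freedom here
print(g2, df, chi2.sf(g2, df))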

Conditional independence
Females           Marrow
Death       OG   SWF    TG   Total
PRED        32    26     8      66
NPRED       26     6    16      48
Total       58    32    24     114

Males             Marrow
Death       OG   SWF    TG   Total
PRED        43    14    10      67
NPRED       12     7    26      45
Total       55    21    36     112

Pairwise comparisons of the marrow types (OG vs TG, SWF vs TG, SWF vs OG) with 95% confidence intervals, shown separately for males and females

Complete independence Models compared: 1 vs 8, G² = 35.57, df = 5, P < 0.001

Warning Always fit a saturated model first, containing all the variables of interest and all the interactions involving the (potential) nuisance variables. Only delete from the model the interactions that involve the variables of interest.