BIOL 582 Lecture Set 18: Analysis of frequency and categorical data. Part III: Tests of Independence (cont.), Odds Ratios, Loglinear Models, Logistic Models



Tests of independence

Sokal and Rohlf (2011) describe and recommend the following:

Model   Frequency totals           Recommended test
I       Not fixed                  G-test for independence*
II      Fixed for one criterion    G-test for independence*
III     Fixed for both criteria    Fisher's Exact Test

We have considered examples for Models I and II. Model III lends itself well to Fisher's Exact Test, but this test, or any other test of independence, can be done on any model type. The important thing to remember is that all of these tests tend to have inflated type I error rates and are less robust with small samples. Correction factors are often used as a result.

Fisher's Exact Test is often used with smaller sample sizes, but it should really only be used if both criteria of the table (row and column totals) are fixed.

Tests of independence

Sokal and Rohlf (2011) recommend Fisher's Exact Test for Model III, where the frequency totals are fixed for both criteria. A good example for illustrating the utility of this test is Dr. Bristol's clairvoyance. This is stolen right from Wikipedia:

Fisher is said to have devised the test following a comment from Dr. Muriel Bristol, who claimed to be able to detect whether the tea or the milk was added first to her cup…. So in Fisher's original example, one criterion of classification could be whether milk or tea was put in the cup first; the other could be whether Dr Bristol thinks that the milk or tea was put in first. We want to know whether these two classifications are associated, that is, whether Dr Bristol really can tell whether milk or tea was poured in first. Most uses of the Fisher test involve, like this example, a 2 × 2 contingency table. The p-value from the test is computed as if the margins of the table are fixed, i.e. as if, in the tea-tasting example, Dr Bristol knows the number of cups with each treatment (milk or tea first) and will therefore provide guesses with the correct number in each category. As pointed out by Fisher, this leads under a null hypothesis of independence to a hypergeometric distribution of the numbers in the cells of the table.

Tests of independence

Example Model III (Box 17.7, Sokal and Rohlf 2011)

Acacia trees are cleared in an area of Central America, except for 28 lucky bushes, which are fumigated. 15 are species A; 13 are species B. 16 ant colonies are released into the experimental zone (each colony can infect exactly one tree). Thus, 12 trees will be uninfected.

Since the numbers of trees of each species are fixed, and the numbers of infected and uninfected trees are fixed, this is Model III.

H0: Tree infection is independent of species.

Here is the result:

Acacia species   free   invaded   Total
Species A           2        13      15
Species B          10         3      13
Total              12        16      28

What is the probability of this result if the null hypothesis is true? (I.e., what is the probability it could happen by chance?)

Tests of independence

Example Model III (Box 17.7, Sokal and Rohlf 2011)

First, realize that one must find the probabilities of all outcomes that are at least as rare as the observed case, which is this:

Acacia species   free   invaded   Total
Species A           2        13      15
Species B          10         3      13
Total              12        16      28

And this:

Acacia species   free   invaded   Total
Species A           1        14      15
Species B          11         2      13
Total              12        16      28

And this:

Acacia species   free   invaded   Total
Species A           0        15      15
Species B          12         1      13
Total              12        16      28

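Tests of independence

Example Model III (Box 17.7, Sokal and Rohlf 2011)

With the cells and margins of the table labeled as

Acacia species   free    invaded   Total
Species A        a       b         a + b
Species B        c       d         c + d
Total            a + c   b + d     n

the probability of each event, using the hypergeometric distribution, is

$$P = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}$$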

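Tests of independence

Example Model III (Box 17.7, Sokal and Rohlf 2011)

Which for each event is:

Acacia species   free   invaded   Total
Species A           2        13      15
Species B          10         3      13
Total              12        16      28

$$P = \frac{15!\,13!\,12!\,16!}{28!\,2!\,13!\,10!\,3!}$$

Acacia species   free   invaded   Total
Species A           1        14      15
Species B          11         2      13
Total              12        16      28

$$P = \frac{15!\,13!\,12!\,16!}{28!\,1!\,14!\,11!\,2!}$$

Acacia species   free   invaded   Total
Species A           0        15      15
Species B          12         1      13
Total              12        16      28

$$P = \frac{15!\,13!\,12!\,16!}{28!\,0!\,15!\,12!\,1!}$$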
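The three probabilities can be evaluated in R. This is a minimal sketch rather than the lecture's own code: the object names (`acacia`, `p.each`, `p.tail`) are made up for illustration, and the column labels follow the text (16 invaded trees, 12 free trees).

```r
# Observed acacia table; rows are species, columns are infection status.
acacia <- matrix(c(2, 10, 13, 3), nrow = 2,
                 dimnames = list(species = c("A", "B"),
                                 status = c("free", "invaded")))

# x = number of species-A trees among the 12 free trees.
# The observed table has x = 2; the two rarer tables have x = 1 and x = 0.
x <- 2:0

# Hypergeometric probability of each table, by formula and with dhyper()
p.by.hand <- choose(15, x) * choose(13, 12 - x) / choose(28, 12)
p.each <- dhyper(x, m = 15, n = 13, k = 12)

# One-tailed probability of a result at least this extreme (roughly 0.001)
p.tail <- sum(p.each)

# The built-in test answers the same one-tailed question
p.fisher <- fisher.test(acacia, alternative = "less")$p.value
```

The hand calculation and `fisher.test()` agree because the one-tailed p-value is just the sum of the hypergeometric probabilities in that tail.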

R x C Tests of independence

An R x C Test of Independence means that one factor is described by rows and one factor is described by columns. An R x C test is a test of a two-way table, even if there are more than two rows and more than two columns. The format does not change. One has these three options:

1. Calculate expected cell values (frequencies) as r*c/n, where r and c are the marginal totals of the row and column where a cell exists, and calculate a Chi-square statistic.
2. Calculate all f log f values (natural logarithms) for every cell and margin total, plus n log n. Then calculate

$$G = 2\left[\sum_{i,j} f_{ij}\ln f_{ij} - \sum_{i} r_i \ln r_i - \sum_{j} c_j \ln c_j + n \ln n\right]$$

3. Calculate the exact probability of the table outcome given a known distribution of outcomes (when factor sums [margins] are fixed). This is the Fisher's exact test we just saw.

These are all slightly different twists on the same theme. We might try a more complicated two-way table in R, but the concept is the same.
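The first two options can be sketched in a few lines of R. The counts are the inoculated mouse data used later in this lecture (13 dead and 44 alive with antiserum, 25 dead and 29 alive without); the object names and the helper `flnf()` are illustrative assumptions, not part of the original slides.

```r
# Two-way table of counts: rows are treatments, columns are outcomes
mouse <- matrix(c(13, 25, 44, 29), nrow = 2,
                dimnames = list(treatment = c("BA", "B"),
                                outcome = c("Dead", "Alive")))
n  <- sum(mouse)
r  <- rowSums(mouse)   # row totals
cc <- colSums(mouse)   # column totals

# Option 1: expected frequencies r*c/n and the Chi-square statistic
expected <- outer(r, cc) / n
chi.sq <- sum((mouse - expected)^2 / expected)

# Option 2: G from the f ln f terms (assumes no cell or margin is zero)
flnf <- function(f) sum(f * log(f))
G <- 2 * (flnf(mouse) - flnf(r) - flnf(cc) + flnf(n))

# Both statistics are referred to a Chi-square distribution
dof <- (nrow(mouse) - 1) * (ncol(mouse) - 1)
p.chi <- pchisq(chi.sq, dof, lower.tail = FALSE)
p.G   <- pchisq(G, dof, lower.tail = FALSE)
```

Neither statistic here includes a continuity or Williams correction; those would be applied on top of these unadjusted values.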

Three-way and multi-way Table Tests of independence

A three-way table is much more complicated. We will not go into its complicatedness. Consult other sources if you must understand all the details.

Here is an example three-way table: counts are classified by Population, by Sex (F, M), and as Blind or Not Blind, with totals for each population and for the table as a whole.

Three-way and multi-way Table Tests of independence

It might be obvious that testing such a table (Population by Sex by Blind/Not Blind) would be an iterative process.

Three-way and multi-way Table Tests of independence

Multi-way tables are made easier to analyze with loglinear models. For example, a two-way table can be expressed by the following model:

$$\ln \hat{f}_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$$

where
$\mu$ is the mean of the logarithms of the expected frequencies,
$\alpha_i$ is the effect of category i of factor A,
$\beta_j$ is the effect of category j of factor B, and
$(\alpha\beta)_{ij}$ is the dependence of category i of factor A on category j of factor B.

Unlike factorial ANOVA, only the interaction is important: it expresses the dependence of the factors on one another, so the null hypothesis of independence assumes it is 0. A G test is (thus) a likelihood ratio test between this "full" model and a "reduced" model that lacks the interaction.
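One way to see the link between the G test and the loglinear model is to fit both models as Poisson regressions on the cell counts. The sketch below uses the inoculated mouse counts from later in this lecture; the data frame layout and object names are assumptions made for illustration.

```r
# One row per cell of the two-way table
mouse <- data.frame(
  treat = factor(c("BA", "BA", "B", "B")),
  life  = factor(c("Dead", "Alive", "Dead", "Alive")),
  count = c(13, 44, 25, 29)
)

# "Full" model with the interaction, "reduced" model without it
full    <- glm(count ~ treat * life, family = poisson, data = mouse)
reduced <- glm(count ~ treat + life, family = poisson, data = mouse)

# Likelihood ratio statistic: the drop in deviance when the interaction is added
G <- deviance(reduced) - deviance(full)
p.value <- pchisq(G, df = 1, lower.tail = FALSE)

# The same comparison as an analysis-of-deviance table
anova(reduced, full, test = "Chisq")
```

Because the full model is saturated (its deviance is essentially zero), the deviance of the reduced model is itself the unadjusted G statistic for independence.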

Three-way and multi-way Table Tests of independence

A three-way table can be expressed by the following model:

$$\ln \hat{f}_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk}$$

We will leave it as sufficient that model LRTs can be performed with different interactions removed to test specific dependencies, given other dependencies. Sometimes the three-way interaction is removed because it is deemed cumbersome. Check out how complicated these analyses can become at Quick R.
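As a rough illustration of removing one interaction at a time, the sketch below fits a three-way loglinear model to `UCBAdmissions`, a three-way table that ships with R. That dataset is an assumption chosen only because it is readily available; it is not the blindness table from the lecture, and the object names are hypothetical.

```r
# Built-in 2 x 2 x 6 table, reshaped to one row per cell
ucb <- as.data.frame(UCBAdmissions)   # columns: Admit, Gender, Dept, Freq

# Saturated model: all main effects, two-way and three-way interactions
saturated <- glm(Freq ~ Admit * Gender * Dept, family = poisson, data = ucb)

# Reduced model: the three-way interaction is removed
no.three <- update(saturated, . ~ . - Admit:Gender:Dept)

# LRT for the three-way interaction, given all two-way interactions
lrt <- anova(no.three, saturated, test = "Chisq")
G.three  <- deviance(no.three) - deviance(saturated)
df.three <- df.residual(no.three) - df.residual(saturated)
p.three  <- pchisq(G.three, df.three, lower.tail = FALSE)
```

The same pattern, dropping one term with `update()` and comparing deviances, can be repeated for each two-way interaction, which is why the analysis becomes iterative.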

Binary Data and Proportions

Now we turn our attention to response data that can have one of two outcomes:

Expressed/unexpressed
Dead/alive
Female/Male
Success/failure
Present/absent

Often these types of responses are expressed as proportions, because results can then be generalized to smaller or larger groups. For example, the inoculated mouse data can be summarized as

Treatment              Dead   Alive   Total
Bacteria + Antiserum     13      44      57
Bacteria only            25      29      54
Total                    38      73     111

Treatment              Dead        Alive       Σ
Bacteria + Antiserum   p = 0.228   q = 0.772   1
Bacteria only          p = 0.463   q = 0.537   1

Binary Data and Proportions

Proportions can be expressed as odds ratios. With p the proportion dead and q the proportion alive in each treatment (1 = Bacteria + Antiserum, 2 = Bacteria only),

$$\omega = \frac{q_1/p_1}{q_2/p_2}$$

For this example, $\omega = (44/13)/(29/25) \approx 2.92$, which means that the odds of mice surviving with antiserum are approximately three times as high as without.

Using logits for tests of independence

Odds ratios can be conveniently decomposed when expressed as a logarithm:

$$\ln \omega = \ln\left(\frac{q_1}{p_1}\right) - \ln\left(\frac{q_2}{p_2}\right) = \mathrm{logit}_1 - \mathrm{logit}_2$$

The utility of doing this is not readily apparent. An odds ratio is convenient; a difference in logits is less so. So why do this?

Logits are quantities that are approximately normally distributed and can be used with linear models.
When proportions are equal, logits are 0; when q (success or preferred) is larger than p, logits are positive; when q is smaller, logits are negative.
Expected values of logits from linear models can be back-transformed to get expected proportions (probabilities) of the two levels of the response.
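A small sketch of these ideas, using the survival proportions from the mouse data (44 of 57 alive with antiserum, 29 of 54 alive without). The vector names are illustrative assumptions; `qlogis()` and `plogis()` are base R's logit and inverse-logit functions.

```r
# Proportion alive (q) and dead (p) in each treatment
q <- c(BA = 44 / 57, B = 29 / 54)
p <- 1 - q

# Logit = log odds of survival; identical to qlogis(q)
logit <- log(q / p)

# A difference in logits is the log odds ratio
log.odds.ratio <- logit["BA"] - logit["B"]
odds.ratio <- exp(log.odds.ratio)   # about 2.9

# Back-transforming the logits recovers the proportions
q.back <- plogis(logit)
```

Both logits are positive here because survival is more likely than death in both groups; the antiserum group simply has the larger logit.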

Binary Data and Proportions

Let's re-examine the mouse data, using R. A generalized linear model applies to a response that can be related to a linear model through a transformation. The response, "life", cannot be used with a linear model directly, but the logit for this response can.

> treat<-factor(c(rep("BA",57),rep("B",54)))
> life<-factor(c(rep("Dead",13),rep("Alive",44),rep("Dead",25),rep("Alive",29)))
>
> # Use Generalized linear model
>
> glm.mouse<-glm(life~treat,family=binomial(link='logit'))
> treat
  [1] BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA
 [23] BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA
 [45] BA BA BA BA BA BA BA BA BA BA BA BA BA B  B  B  B  B  B  B  B  B
 [67] B  B  B  B  B  B  B  B  B  B  B  B  B  B  B  B  B  B  B  B  B  B
 [89] B  B  B  B  B  B  B  B  B  B  B  B  B  B  B  B  B  B  B  B  B  B
[111] B
Levels: B BA
> life
  [1] Dead  Dead  Dead  Dead  Dead  Dead  Dead  Dead  Dead  Dead  Dead
 [12] Dead  Dead  Alive Alive Alive Alive Alive Alive Alive Alive Alive
 [23] Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive
 [34] Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive
 [45] Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive
 [56] Alive Alive Dead  Dead  Dead  Dead  Dead  Dead  Dead  Dead  Dead
 [67] Dead  Dead  Dead  Dead  Dead  Dead  Dead  Dead  Dead  Dead  Dead
 [78] Dead  Dead  Dead  Dead  Dead  Alive Alive Alive Alive Alive Alive
 [89] Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive
[100] Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive Alive
[111] Alive
Levels: Alive Dead

Binary Data and Proportions

Let's re-examine the mouse data, using R

> summary(glm.mouse)

Call:
glm(formula = life ~ treat, family = binomial(link = "logit"))

Deviance Residuals:
    Min       1Q   Median       3Q      Max

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)
treatBA                                          *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:        on 110  degrees of freedom
Residual deviance:        on 109  degrees of freedom
AIC:

Number of Fisher Scoring iterations: 4

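Binary Data and Proportions

Let's re-examine the mouse data, using R

> predict(glm.mouse)
> # Predictions are logits (log odds of "Dead", the second level of life)
> # Observation 1 is a BA mouse; observation 111 is a B mouse
> logit1<-predict(glm.mouse)[1];logit2<-predict(glm.mouse)[111]
>
> log.odds.ratio<-logit2-logit1
> odds.ratio<-exp(log.odds.ratio)
> odds.ratio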

Binary Data and Proportions

Let's re-examine the mouse data, using R. Recall the unadjusted G from the G test of independence; the likelihood ratio test (LRT) between the null model and the treatment model is the same statistic.

> glm.null<-glm(life~1,family=binomial(link='logit'))
>
> AIC(glm.null,glm.mouse)
          df AIC
glm.null
glm.mouse
>
> logLik(glm.mouse)
'log Lik.' (df=2)
> logLik(glm.null)
'log Lik.' (df=1)
> LRT<-2*(logLik(glm.mouse)-logLik(glm.null)); p.value<-1-pchisq(LRT[1],1)
> LRT
> p.value
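The same test can be run on the grouped counts instead of one row per mouse. This is a sketch under the assumption that the counts are 13 dead and 44 alive with antiserum and 25 dead and 29 alive without, as in the vectors above; the data frame and object names are hypothetical.

```r
# One row per treatment, with counts of each outcome
counts <- data.frame(treat = factor(c("BA", "B")),
                     dead  = c(13, 25),
                     alive = c(44, 29))

# Grouped binomial fit: the response is a two-column matrix (events, non-events)
glm.grouped <- glm(cbind(dead, alive) ~ treat, family = binomial, data = counts)
glm.grouped.null <- update(glm.grouped, . ~ 1)

# Analysis of deviance: the drop in deviance is the likelihood ratio statistic
lrt.table   <- anova(glm.grouped.null, glm.grouped, test = "Chisq")
LRT.grouped <- lrt.table$Deviance[2]
p.grouped   <- lrt.table[2, "Pr(>Chi)"]
```

The log-likelihoods of grouped and ungrouped fits differ by a constant, but that constant cancels in the difference, so the LRT and its p-value should match the ungrouped calculation.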

Logistic Regression

Modeling logits with linear models is typically called logistic regression.
More specifically, it is a generalized linear model with a logit link function.
Logits can also be modeled with continuous variables.
Logistic models can be compared with AIC.
Logistic models can be subjected to stepwise procedures.
There are various methods for parameter estimation, but ordinary least-squares estimation is not one of them. Usually parameters are estimated with maximum likelihood or restricted maximum likelihood.
One method that is popular in ecological niche modeling (species presence or absence based on environmental variables) is maximum entropy, which allows for over-fitted models.
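A brief sketch of a logistic regression with a continuous predictor, compared against a null model by AIC. The `mtcars` dataset (built into R) is an assumption used only because it is always available; it is not part of this lecture, and the choice of `am` (transmission type, 0 or 1) as the response and `wt` (car weight) as the predictor is purely illustrative.

```r
# Binary response (am: 0 or 1) modeled by a continuous predictor (wt)
fit.null <- glm(am ~ 1,  family = binomial(link = "logit"), data = mtcars)
fit.wt   <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

# Compare the two logistic models with AIC (smaller is better)
aic.table <- AIC(fit.null, fit.wt)

# Slope on the logit scale: change in log odds per unit of wt
slope <- coef(fit.wt)["wt"]

# Expected probabilities for new values of the predictor
new.wt <- data.frame(wt = c(2, 3, 4))
prob <- predict(fit.wt, newdata = new.wt, type = "response")
```

Using `type = "response"` in `predict()` applies the inverse logit, so the predictions are probabilities rather than logits.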

Generalized Linear Model

We might come back to the generalized linear model later. For now, this is all you need to know.

If a transformation g of a variable allows it to be modeled with a linear model,

$$g(\mu) = X\beta$$

then the generalized linear model is one that is linked to the original variable by the inverse of the function:

$$\mu = g^{-1}(X\beta)$$

This seems subtle, but it is more complex than it seems. The reason is that instead of ascertaining the likelihood of the model from the error it produces, the analysis has to ask how the parameters of the model should be adjusted to produce the expected error. Thus the name "maximum likelihood": it is the optimization of a function to estimate the parameters that maximize the likelihood of producing that error. Simply put, maximum likelihood procedures iteratively reweight and update the parameter estimates until no better solution can be found.
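The iterative reweighting can be made concrete with a hand-rolled version of iteratively reweighted least squares for the mouse data. This is a minimal sketch for illustration, not a replacement for `glm()`; the variable names, starting values, and convergence tolerance are all assumptions.

```r
# Mouse data as a 0/1 response (1 = Dead) and a design matrix
y <- c(rep(1, 13), rep(0, 44), rep(1, 25), rep(0, 29))
treatBA <- c(rep(1, 57), rep(0, 54))
X <- cbind(Intercept = 1, treatBA = treatBA)

beta <- c(0, 0)                      # starting values (arbitrary)
for (iter in 1:25) {
  eta <- drop(X %*% beta)            # linear predictor (logits)
  mu  <- plogis(eta)                 # inverse link: expected probabilities
  w   <- mu * (1 - mu)               # working weights
  z   <- eta + (y - mu) / w          # working response
  # Weighted least-squares update of the parameters
  beta.new <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
  converged <- max(abs(beta.new - beta)) < 1e-10
  beta <- beta.new
  if (converged) break
}
beta
```

Each pass solves a weighted least-squares problem whose weights depend on the current estimates, which is why the procedure has to be repeated until the estimates stop changing. The result should agree with `coef(glm(y ~ treatBA, family = binomial))`.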