New Ways of Looking at Binary Data Fitting in R Yoon G Kim, Colloquium Talk.


New Ways of Looking at Binary Data Fitting in R Yoon G Kim, Colloquium Talk

2 Appetizer
Can we "stabilize" this?

3 After taking LOG …

> y1 <- rep(c(100,200), times=10)
> y2 <- rep(c(10,20), times=10)
> x <- c(1:20)
> data <- cbind(x, y1, y2)
> data[1:3,]
     x  y1 y2
[1,]  1 100 10
[2,]  2 200 20
[3,]  3 100 10
> par(mfrow=c(1,2))
> plot(y1~x, type="l", ylim=c(0,250), col="blue", ylab="")
> lines(y2~x, type="l", col="red")
> plot(log(y1)~x, type="l", ylim=c(0,6), col="blue", ylab="")
> lines(log(y2)~x, type="l", col="red")

Log transformed

4 (figure)

5 Outline
Exploring the options available when the assumptions of classical linear models are untenable. In this talk: what can we do when the observations are not continuous and the residuals are neither normally nor identically distributed?

6 Classical Linear Models
Defined by three assumptions:
(1) the response variable is continuous;
(2) the residuals (ε) are normally distributed; and
(3) the residuals are independently (3a) and identically (3b) distributed.
Today, we will consider a range of options available when assumptions (1), (2) and/or (3b) do not hold.
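These assumptions can be checked directly in R. A minimal sketch on simulated data (the variable names and numbers here are illustrative, not from the talk):

```r
# Fit a classical linear model and inspect the residuals for
# assumptions (2) normality and (3b) constant variance.
set.seed(42)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50)        # continuous response with normal errors
fit <- lm(y ~ x)
shapiro.test(residuals(fit))         # formal test of residual normality (2)
plot(fitted(fit), residuals(fit))    # a funnel shape would indicate heteroscedasticity (3b)
coef(fit)                            # slope estimate should be near the true 0.5
```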

7 Non-continuous response variable
Many situations exist. The response variable could be:
(1) a count (number of individuals in a population, number of species in a community);
(2) a proportion (proportion "cured" after treatment, proportion of threatened species);
(3) a categorical variable (breeding/non-breeding, different phenotypes);
(4) a strictly positive value (especially time to success, or time to failure);
and so forth.

8 Added difficulties
These types of non-continuous variables also tend to deviate from the assumptions of normality (assumption 2) and homoscedasticity (assumption 3b):
(1) a count variable often follows a Poisson distribution (where the variance increases linearly with the mean);
(2) a proportion often follows a binomial distribution (where the variance reaches a maximum at intermediate values and a minimum at either end: 0% or 100%).

9 Added difficulties (continued)
(3) a categorical variable tends to follow a binomial distribution (when the variable has only two levels) or a multinomial distribution (when the variable has more than two levels);
(4) time to success/failure can follow an exponential distribution or an inverse Gaussian distribution (the latter having a variance that increases much more quickly than the mean).
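The mean-variance relationships described above can be checked empirically in R; a small sketch with illustrative parameter values:

```r
# For a Poisson variable the variance grows with (and roughly equals) the mean;
# for binomial proportions the variance p(1-p)/n peaks at p = 0.5 and
# vanishes at either end (0% or 100%).
set.seed(1)
counts <- rpois(100000, lambda = 4)
c(mean = mean(counts), var = var(counts))   # both close to lambda = 4

p <- c(0.05, 0.5, 0.95)
n <- 20
p * (1 - p) / n                             # largest at the middle value p = 0.5
```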

10 Fortunately
Many of these situations can be unified under a central framework, since all these distributions (and a few more) belong to the exponential family of distributions. A member of this family has a probability density function (if y is continuous) or probability mass function (if y is discrete) of the canonical form

f(y; θ, φ) = exp{ [yθ − b(θ)] / a(φ) + c(y, φ) }

where θ is the canonical (location) parameter and φ is the dispersion parameter. The mean and variance follow directly: E(Y) = b′(θ) and Var(Y) = b″(θ) a(φ).

11 The Normal distribution
Probability density function:

f(y) = (1 / √(2πσ²)) exp{ −(y − μ)² / (2σ²) }

which can be rewritten in canonical form as

f(y) = exp{ (yμ − μ²/2) / σ² − [ y²/(2σ²) + ½ log(2πσ²) ] }

so the canonical (location) parameter is θ = μ and the dispersion parameter is φ = σ².

12 The Poisson distribution
Probability mass function:

P(Y = y) = e^{−λ} λ^y / y!

which in canonical form is

P(Y = y) = exp{ y log λ − λ − log(y!) }

so the canonical (location) parameter is θ = log λ and the dispersion parameter is φ = 1 (with a(φ) = 1).

13 The Binomial distribution
Probability mass function:

P(Y = y) = C(n, y) p^y (1 − p)^{n−y}

which in canonical form is

P(Y = y) = exp{ y log[p / (1 − p)] + n log(1 − p) + log C(n, y) }

so the canonical (location) parameter is θ = log[p / (1 − p)] (the logit) and the dispersion parameter is φ = 1.

14 Why is that remotely useful?
1) A single algorithm (maximum likelihood) will cope with all these situations.
2) Different types of variance can be accommodated:
When Var is constant → Normal (Gaussian)
When Var increases linearly with the mean → Poisson
When Var has a hump-backed shape → Binomial
When Var increases as the square of the mean → Gamma (meaning the coefficient of variation remains constant)
When Var increases as the cube of the mean → inverse Gaussian
3) Most types of data are thus effectively covered.

15 (figure)

16 Non-independent observations
Two ways to cope with non-independent observations:
When the design is balanced ("equal sample sizes"), we can use factors to partition our observations into different "groups" and analyse them as an ANOVA or ANCOVA, whether the factors are "crossed" or "nested".
When the design is unbalanced ("uneven sample sizes"), mixed-effects models are called for.

17 How does it work?
1) You need to specify the family of distribution to use.
2) You need to specify the link function, which connects the linear predictor to the mean of the response.
For each type of variable, the "natural" link function to use is indicated by the canonical parameter:

Family              Canonical link
Normal              Identity
Poisson             Log
Binomial            Logit
Gamma               Inverse
Inverse Gaussian    Inverse square
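The link and its inverse can be exercised directly with R's built-in quantile/distribution pairs; a minimal sketch (the value 0.25 is arbitrary):

```r
# qlogis/plogis are the logit link and its inverse; qnorm/pnorm give the probit pair.
p <- 0.25
eta <- qlogis(p)                           # logit link: log(p / (1 - p))
stopifnot(isTRUE(all.equal(plogis(eta), p)))   # inverse logit recovers p
stopifnot(isTRUE(all.equal(pnorm(qnorm(p)), p)))  # probit and its inverse

# A non-canonical link is requested explicitly in glm(), e.g.:
# glm(y ~ x, family = binomial(link = "probit"))
```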

18 Binary variable
The response variable contains only 0's and 1's. The probability that a place is "occupied" is p, and we write

logit(p) = log[p / (1 − p)] = β0 + β1 x

The objective is to determine how the explanatory variable influences p. The family to use is binomial and the canonical link is logit. Example: the response is occupation of territories and the explanatory variable is the resource availability in each territory (Crawley, M.J. (2007) The R Book).

> occupy <- read.table("D:\\STAT999\\RBook\\occupation.txt", header=T)
> dim(occupy)
[1] 150   2
> occupy[1:3,]
  resources occupied
> attach(occupy)
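Readers without Crawley's occupation.txt can reproduce the structure of the example with simulated data; a sketch in which the coefficients (−2 and 0.005) and the resource range are illustrative, not from the talk:

```r
# Simulate 150 territories: occupancy probability rises with resources
# on the logit scale, then fit the same binomial GLM as in the example.
set.seed(1)
n <- 150
resources <- runif(n, 0, 1000)
p <- plogis(-2 + 0.005 * resources)       # true occupancy probability
occupied <- rbinom(n, 1, p)
sim.model <- glm(occupied ~ resources, family = binomial)
coef(sim.model)                            # estimates near the true values
```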

19 Binary variable

> table(occupied)
occupied
> modell <- glm(occupied ~ resources, family=binomial)
> plot(resources, occupied, type="n")
> rug(jitter(resources[occupied==0]))
> rug(jitter(resources[occupied==1]), side=3)
> xv <- 0:1000
> yv <- predict(modell, list(resources=xv), type="response")

(By default the link for the binomial family is the logit.)

20 (figure)

21

cutr <- cut(resources, 5)
tapply(occupied, cutr, sum)
(13.2,209]  (209,405]  (405,600]  (600,796]  (796,992]
table(cutr)
cutr
(13.2,209]  (209,405]  (405,600]  (600,796]  (796,992]
probs <- tapply(occupied, cutr, sum)/table(cutr)
probs
(13.2,209]  (209,405]  (405,600]  (600,796]  (796,992]
attr(,"class")
[1] "table"
probs <- as.vector(probs)
resmeans <- tapply(resources, cutr, mean)
resmeans <- as.vector(resmeans)
points(resmeans, probs, pch=16, cex=2)
se <- sqrt(probs*(1-probs)/table(cutr))
up <- probs + as.vector(se)
down <- probs - as.vector(se)
for (i in 1:5) {
  lines(c(resmeans[i], resmeans[i]), c(up[i], down[i]))
}

22 (figure)

23 Various link functions

> grid_x <- seq(10, 990, by=0.5)
> modell_p <- predict(modell, newdata=data.frame(resources=grid_x), type="response")
> modelp <- glm(occupied ~ resources, family=binomial(link=probit))
> modelp_p <- predict(modelp, newdata=data.frame(resources=grid_x), type="response")
> modelcl <- glm(occupied ~ resources, family=binomial(link=cloglog))
> modelcl_p <- predict(modelcl, newdata=data.frame(resources=grid_x), type="response")
> modelca <- glm(occupied ~ resources, family=binomial(link=cauchit))
> modelca_p <- predict(modelca, newdata=data.frame(resources=grid_x), type="response")

24 To draw …

> newdata <- data.frame(grid_x, modell_p, modelp_p, modelcl_p, modelca_p)
> library(lattice)
> print(xyplot(modell_p + modelp_p + modelcl_p + modelca_p ~ grid_x,
+        data=newdata, type="l", xlab="resources",
+        ylab="p", lwd=1.5, lty=c(1,2,3,4), col=c(1:4),
+        panel = function(x, y, ...) {
+          panel.xyplot(x, y, ...)
+          panel.text(resmeans, probs, "x", cex=1.5, ...)
+        }))
> # legend() and points() are base-graphics functions and will not draw onto a
> # lattice plot; use a lattice key, or add them inside the panel function:
> legend("topleft", legend=c("logit","probit","cloglog","cauchit"),
+        lty=c(1:4), col=c(1:4), lwd=1.5)
> points(resmeans, probs, pch=16, cex=2)
> for (i in 1:5) {
+   lines(c(resmeans[i], resmeans[i]), c(up[i], down[i]))
+ }

25 (figure)

26 Binary variable

> summary(modell)

Call:
glm(formula = occupied ~ resources, family = binomial)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...  ...e-08 ***
resources        ...        ...     ...  ...e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: ... on 149 degrees of freedom
Residual deviance: ... on 148 degrees of freedom
AIC: ...

Number of Fisher Scoring iterations: 6

(The residual deviance, also called the G-statistic, is only valid if the response variable is indeed binomial.)

27 Binary variable

> (dp <- sum(residuals(modell, type="pearson")^2)/modell$df.res)

This dispersion parameter (φ) must be calculated: the sum of squared Pearson residuals divided by the residual degrees of freedom. Here it suggests that the variance is 0.85 times the mean. In statistical terms, there is no overdispersion. In biological terms, it suggests that the counts are independent of each other and are not aggregated (i.e. clumped). Typically, overdispersed count data follow a negative binomial distribution, which is not part of the exponential family of distributions. It won't be covered here, but it can be approximated as a quasi-binomial (family="quasibinomial"). If you need it in your future work, you can also try glm.nb (in the MASS package).
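The same dispersion calculation can be verified on simulated data; a sketch (simulation parameters are illustrative) where the model is correctly specified, so the estimate should land near 1:

```r
# For a well-specified binomial GLM, the Pearson dispersion
# sum(pearson residuals^2) / df.residual should be close to 1.
set.seed(7)
x <- runif(200)
y <- rbinom(200, 1, plogis(-1 + 3 * x))
m <- glm(y ~ x, family = binomial)
dp <- sum(residuals(m, type = "pearson")^2) / m$df.residual
dp   # close to 1: no overdispersion
```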

28 Binary variable
The summary table can be adjusted with the dispersion parameter:

> summary(modell, dispersion=dp)

Call:
glm(formula = occupied ~ resources, family = binomial)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...  ...e-09 ***
resources        ...        ...     ...  ...e-11 ***
---
(Dispersion parameter for binomial family taken to be ...)

    Null deviance: ... on 149 degrees of freedom
Residual deviance: ... on 148 degrees of freedom
AIC: ...

Number of Fisher Scoring iterations: 6

These values can now be taken at face value. How good is the model? 1 − (Residual Deviance / Null Deviance), expressed as a percentage.

29

> summary(modell)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...  ...e-08 ***
resources        ...        ...     ...  ...e-10 ***
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: ... on 149 degrees of freedom
Residual deviance: ... on 148 degrees of freedom
AIC: ...

> summary(modelp)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...  ...e-10 ***
resources        ...        ...     ...  ...e-12 ***
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: ... on 149 degrees of freedom
Residual deviance: ... on 148 degrees of freedom
AIC: ...

30

> summary(modelcl)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...  ...e-09 ***
resources        ...        ...     ...  ...e-10 ***
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: ... on 149 degrees of freedom
Residual deviance: ... on 148 degrees of freedom
AIC: ...

> summary(modelca)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...      ... ***
resources        ...        ...     ...      ... ***
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: ... on 149 degrees of freedom
Residual deviance: ... on 148 degrees of freedom
AIC: ...

31 Bootstrapping

> modell <- glm(occupied ~ resources, family=binomial)
> bcoef <- matrix(0, 1000, 2)
> for (i in 1:1000){
+   indices <- sample(1:150, replace=T)
+   x <- resources[indices]
+   y <- occupied[indices]
+   modell <- glm(y ~ x, family=binomial)   # note: overwrites the full-data fit
+   bcoef[i,] <- modell$coef
+ }
> par(mfrow=c(1,2))
> plot(density(bcoef[,2]), xlab="Coefficient of x", main="")
> abline(v=quantile(bcoef[,2], c(0.025,0.975)), lty=2, col=4)
> plot(density(bcoef[,1]), xlab="Intercept", main="")
> abline(v=quantile(bcoef[,1], c(0.025,0.975)), lty=2, col=4)

32 (figure)

33 Jackknifing

> jcoef <- matrix(0, 150, 2)
> for (i in 1:150) {
+   modelj <- glm(occupied[-i] ~ resources[-i], family=binomial)
+   jcoef[i,] <- modelj$coef
+ }
> par(mfrow=c(1,2))
> plot(density(jcoef[,2]), xlab="Coefficient of x", main="")
> abline(v=quantile(jcoef[,2], c(0.025,0.975)), lty=2, col=4)
> plot(density(jcoef[,1]), xlab="Intercept", main="")
> abline(v=quantile(jcoef[,1], c(0.025,0.975)), lty=2, col=4)

34 (figure)

35 C.I.'s

> library(boot)
> reg.boot <- function(regdat, index){
+   x <- resources[index]
+   y <- occupied[index]
+   modell <- glm(y ~ x, family=binomial)
+   coef(modell)
+ }
> reg.model <- boot(occupy, reg.boot, R=10000)
> boot.ci(reg.model, index=2)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 10000 bootstrap replicates

Intervals:
Level      Normal             Basic
95%   ( ..., ... )       ( ..., ... )

Level     Percentile          BCa
95%   ( ..., ... )       ( ..., ... )
Calculations and Intervals on Original Scale

36 > jack.after.boot(reg.model,index=2)

37 The 108th observation?

> occupy[105:110,]
  resources occupied
> plot(resources, occupied)
> text(resources[108], occupied[108], "Here", cex=1.5, col="blue", pos=3)

OR

> fat.arrow <- function(size.x=0.5, size.y=0.5, ar.col="red"){
+   size.x <- size.x*(par("usr")[2]-par("usr")[1])*0.1
+   size.y <- size.y*(par("usr")[4]-par("usr")[3])*0.1
+   pos <- locator(1)
+   xc <- c(0, 1, 0.5, 0.5, -0.5, -0.5, -1, 0)
+   yc <- c(0, 1, 1, 6, 6, 1, 1, 0)
+   polygon(pos$x + size.x*xc, pos$y + size.y*yc, col=ar.col)
+ }
> fat.arrow()

38 (figure)

Yoon G Kim, Thank You!