R Programming/Binomial Models (Shinichiro Suna)
Binomial Models
In a binomial model, the outcome is binary and is explained by a set of explanatory variables.
Contents
1 Logit model
  1.1 Fake data simulations
  1.2 Maximum likelihood estimation
  1.3 Bayesian estimation
2 Probit model
  2.1 Fake data simulations
  2.2 Maximum likelihood estimation
  2.3 Bayesian estimation
1. Logit model (Logistic Regression Analysis)
The logit model (logistic regression) uses the logistic function. With several explanatory variables, the probability of the outcome is
F(x) = 1 / (1 + exp(-(B0 + B1*X1 + B2*X2 + ...)))
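As a quick sanity check, base R's plogis() implements exactly this logistic function, so the link can be inspected directly (a minimal illustration, not part of the original example):
# plogis() is the logistic CDF: plogis(q) = 1 / (1 + exp(-q))
curve(plogis(x), from = -6, to = 6, ylab = "F(x)", main = "Logistic function")
plogis(0)   # 0.5: a linear predictor of zero maps to a probability of one half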
1.1. Fake data simulations
# Generating fake data: one explanatory variable with a true slope of 1
x <- 1 + rnorm(1000, 1)
xbeta <- x * 1
proba <- exp(xbeta) / (1 + exp(xbeta))   # inverse logit
y <- ifelse(runif(1000, 0, 1) < proba, 1, 0)
table(y)
df <- data.frame(y, x)
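The runif() comparison above is one way to draw the binary outcomes; an equivalent and arguably more direct approach, shown here as a small aside, is rbinom():
# Equivalent draw: rbinom() samples Bernoulli outcomes with the given probabilities
y2 <- rbinom(1000, size = 1, prob = proba)
table(y2)   # should look similar to table(y)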
1.2. Maximum likelihood estimation
The standard way to estimate a logit model is the glm() function (Fitting Generalized Linear Models) with family binomial and link logit.
1.2. Maximum likelihood estimation
# Fitting Generalized Linear Models
res <- glm(y ~ x, family = binomial(link = logit), data = df)
names(res)
summary(res)            # results
confint(res)            # confidence intervals (profile likelihood)
exp(res$coefficients)   # odds ratios
exp(confint(res))       # confidence intervals for the odds ratios
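Once the model is fitted, predicted probabilities can be recovered on the response scale with the standard predict() method for glm objects (a short sketch, assuming the objects above):
# Predicted probabilities: inverse logit of the linear predictor
phat <- predict(res, type = "response")
head(phat)
mean(phat)   # equals mean(df$y): with an intercept, logit models reproduce the sample mean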
1.2. Maximum likelihood estimation
> summary(res)   # prints the model call, deviance residuals, the coefficient table (estimates, standard errors, z values, p-values with significance codes), the null and residual deviance, the AIC, and the number of Fisher scoring iterations; the numeric output is omitted here
1.3. Bayesian estimation
# Data generating process
x <- 1 + rnorm(1000, 1)
xbeta <- x * 1
proba <- exp(xbeta) / (1 + exp(xbeta))
y <- ifelse(runif(1000, 0, 1) < proba, 1, 0)
table(y)
# Markov chain Monte Carlo for logistic regression
library(MCMCpack)
res <- MCMClogit(y ~ x)
summary(res)
plot(res)
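With MCMClogit()'s default diffuse prior, the posterior means should land close to the glm() maximum likelihood estimates; a small comparison sketch, assuming the objects from the code above:
# Compare posterior means with the maximum likelihood estimates
ml <- glm(y ~ x, family = binomial(link = logit))
cbind(posterior_mean = summary(res)$statistics[, "Mean"], mle = coef(ml))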
1.3. Bayesian estimation
> summary(res)   # reports the MCMC setup (iterations 1001:11000, thinning interval 1, one chain) followed by posterior means, standard deviations, naive and time-series standard errors, and the 2.5% to 97.5% quantiles for (Intercept) and x
2. Probit model
The probit model is a type of regression where the dependent variable can take only two values; the name comes from probability + unit.
2. Probit model
The probit model uses the cumulative distribution function of the standard normal distribution:
F(x) = pnorm(B0 + B1*X1 + B2*X2 + ...)
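pnorm() is base R's standard normal CDF, so the probit link can be inspected directly (a minimal illustration):
# pnorm() is the standard normal CDF used as the probit link
pnorm(0)      # 0.5
pnorm(1.96)   # roughly 0.975
curve(pnorm(x), from = -4, to = 4, ylab = "F(x)", main = "Probit link")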
2.1. Fake data simulations
# Generating fake data
x1 <- 1 + rnorm(1000)
x2 <- x1 + rnorm(1000)
xbeta <- x1 + x2
proba <- pnorm(xbeta)   # probit link: standard normal CDF
y <- ifelse(runif(1000, 0, 1) < proba, 1, 0)
mydat <- data.frame(y, x1, x2)
table(y)
2.2. Maximum likelihood estimation
# Fitting Generalized Linear Models
res <- glm(y ~ x1 + x2, family = binomial(link = probit), data = mydat)
names(res)
summary(res)
confint(res)   # confidence intervals (profile likelihood)
# Note: unlike logit coefficients, exponentiated probit coefficients are not odds ratios
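Because probit coefficients have no odds-ratio interpretation, a common summary is the average marginal effect: the sample mean of the normal density evaluated at the linear predictor, times each slope. A sketch assuming the fitted res above:
# Average marginal effects: mean of dnorm(linear predictor) times each slope coefficient
xb <- predict(res, type = "link")
ame <- mean(dnorm(xb)) * coef(res)[-1]   # drop the intercept
ame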
2.2. Maximum likelihood estimation
> summary(res)   # prints the model call, the coefficient table for (Intercept), x1, and x2 with standard errors, z tests, and significance codes, the null and residual deviance, the AIC, and the number of Fisher scoring iterations
2.2. Maximum likelihood estimation
# Alternative: the probit() function from the sampleSelection package
library("sampleSelection")
res <- probit(y ~ x1 + x2, data = mydat)
summary(res)
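Run on the same data, the two estimators maximise the same likelihood, so their coefficients should agree up to numerical tolerance; a quick check, assuming mydat and the probit fit above:
# glm() with a probit link and sampleSelection::probit() fit the same model
res_glm <- glm(y ~ x1 + x2, family = binomial(link = probit), data = mydat)
cbind(glm = coef(res_glm), sampleSelection = coef(res))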
2.2. Maximum likelihood estimation
> summary(res)   # reports the Newton-Raphson maximisation (return code 1: gradient close to zero), the log-likelihood, the counts of 'negative' and 'positive' observations, the coefficient table with t tests, and an overall chi-squared significance test with 2 degrees of freedom
2.3. Bayesian estimation
# Markov chain Monte Carlo for probit regression
library("MCMCpack")
post <- MCMCprobit(y ~ x1 + x2, data = mydat)
summary(post)
plot(post)
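MCMCpack returns coda mcmc objects, so coda's standard convergence diagnostics apply directly (a short sketch):
# Standard coda diagnostics for the posterior sample
library(coda)
effectiveSize(post)   # effective sample size per parameter
geweke.diag(post)     # Geweke z-scores for convergence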
2.3. Bayesian estimation
> summary(post)   # reports the MCMC setup (iterations 1001:11000, thinning interval 1, one chain) followed by posterior means, standard deviations, naive and time-series standard errors, and quantiles for (Intercept), x1, and x2