
Elements of Statistical Inference

Theme of the workshop (and book): Analyzing HMs using both classical and Bayesian methods. “Dual inference paradigm”

Topics covered here:
– Classical inference (likelihood, frequentist)
– Bayesian inference (posterior distribution)
– Implementation in R (both MLE and MCMC)
– Case Study: logistic regression (not a HM)
– Case Study: Occupancy model (a HM)

Inference for statistical models

Parametric inference: explicit probability assumptions about the data. Inference proceeds assuming the model is truth (not an approximation to truth, but actual truth).

Two flavors:
– Classical inference
– Bayesian inference

Bayesian vs. Classical/Frequentist

Classical inference
– Likelihood estimation (“method of maximum likelihood”)
– Frequentists use a relative frequency interpretation, in which procedures are evaluated w.r.t. repeated realizations of the data. Probability is used to characterize how well procedures do, but not uncertainty about model parameters.

Bayesian inference
– Posterior inference: requires specification of a prior distribution
– Bayesians make probability statements directly about model parameters, conditional on the single data set that you have

Notation

Classical inference

You probably know this, but we review the basic ideas, and we show some technical elements in R to demystify what is being done in unmarked.

What is the likelihood?

Example 1: Two independent binomial counts

# 2 binomial observations
y <- rbinom(2, size = 10, p = 0.5)

# The joint distribution function. As a function of y it gives
# the probability of any two values of y1 and y2
jointdis <- function(data, K, p){
  prod(dbinom(data, size = K, p = p))
}

(jointdis(y, K = 10, p = 0.5))   # also is the likelihood of p = 0.5 for the
                                 # given data, but it is NOT a probability for p

# Evaluate the likelihood for a grid of values of "p"
p.grid <- seq(0.01, 0.99, length.out = 200)
likelihood <- rep(NA, 200)
for(i in 1:200){
  likelihood[i] <- jointdis(y, K = 10, p = p.grid[i])
}

# Plot the likelihood
plot(p.grid, likelihood, xlab = "p", ylab = "likelihood")
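A brief follow-up (not from the original slides): the maximizer of the gridded likelihood can be read off directly, and it agrees (up to grid resolution) with the closed-form binomial MLE, total successes divided by total trials.

p.grid[which.max(likelihood)]   # grid-based MLE of p
sum(y) / (2 * 10)               # closed-form MLE: total successes / total trials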

Numerical maximization of the likelihood

Numerical maximization of the likelihood was a HUGE change in applied statistics; its importance cannot be overstated.
– Don’t need formulas (explicit estimators)
– Variances are evaluated numerically
– Can do “marginal likelihood” by integrating random effects out
– Don’t need a statistician to do things for you

Properties of MLEs

Other elements of classical inference

Parametric bootstrapping
1. Obtain MLEs for a model
2. Simulate data using those MLEs
3. Obtain MLEs for the simulated data
4. Repeat many times
5. Use the distribution of the MLEs of the simulated data as an empirical estimate of the sampling distribution

Example 2: logistic regression

Ordinary logistic regression in R

# Simulate data
# Create a covariate called vegHt
nSites <- 100
set.seed(2014)                 # so that we all get the same values of vegHt
vegHt <- runif(nSites, 1, 3)   # uniform from 1 to 3

# Suppose that occupancy probability increases with vegHt.
# The relationship is described by an intercept of -3 and
# a slope parameter of 2 on the logit scale.
# plogis is the inverse-logit (constrains us back to the [0,1] scale)
psi <- plogis(-3 + 2*vegHt)

# Now we go to 100 sites and observe presence or absence.
# Actually, let's just simulate the data
z <- rbinom(nSites, 1, psi)

General strategy for likelihood estimation: Express the negative log-likelihood as an R function and then use the standard function optim (or nlm) to minimize it:

# Definition of negative log-likelihood.
negLogLike <- function(beta, y, x) {
  beta0 <- beta[1]
  beta1 <- beta[2]
  psi <- plogis(beta0 + beta1*x)         # inverse-logit
  likelihood <- psi^y * (1-psi)^(1-y)    # same as:
  # likelihood <- dbinom(y, 1, psi)
  return(-sum(log(likelihood)))
}

# Look at (negative) log-likelihood for 2 parameter sets
negLogLike(c(0,0), y=z, x=vegHt)
negLogLike(c(-3,2), y=z, x=vegHt)   # Lower is better!

# Let's minimize it formally by function minimisation
starting.values <- c(beta0=0, beta1=0)
opt.out <- optim(starting.values, negLogLike, y=z, x=vegHt, hessian=TRUE)
(mles <- opt.out$par)   # MLEs are pretty close to the truth of -3 and 2

# Alternative 1: Brute-force grid search for MLEs
mat <- as.matrix(expand.grid(seq(-10,10,0.1), seq(-10,10,0.1)))
# above: Can vary resolution
nll <- array(NA, dim = nrow(mat))
for (i in 1:nrow(mat)){
  nll[i] <- negLogLike(mat[i,], y = z, x = vegHt)
}
which(nll == min(nll))
mat[which(nll == min(nll)),]

# Produce a likelihood surface, shown in Fig.
library(raster)
r <- rasterFromXYZ(data.frame(x = mat[,1], y = mat[,2], z = nll))
mapPalette <- colorRampPalette(rev(c("grey", "yellow", "red")))
plot(r, col = mapPalette(100), main = "Negative log-likelihood",
     xlab = "Intercept (beta0)", ylab = "Slope (beta1)")
contour(r, add = TRUE, levels = seq(50, 2000, 100))

# Alternative 2: Use canned R function glm as a shortcut
(fm <- glm(z ~ vegHt, family = binomial)$coef)

# Add 3 sets of MLEs into plot
# 1. Add MLE from function minimisation
points(mles[1], mles[2], pch = 1, lwd = 2)
abline(mles[2], 0)                      # Put a line through the Slope value
lines(c(mles[1], mles[1]), c(-10, 10))

# 2. Add MLE from grid search
points(mat[which(nll == min(nll)), 1], mat[which(nll == min(nll)), 2],
       pch = 1, lwd = 2)

# 3. Add MLE from glm function
points(fm[1], fm[2], pch = 1, lwd = 2)
# Note they are essentially all the same

Asymptotic variance/SE

The hessian=TRUE option in the call to optim produces the Hessian matrix in the returned list opt.out, and so we can obtain the asymptotic standard errors (ASE) for the two parameters by doing this:

Vc <- solve(opt.out$hessian)   # Get variance-covariance matrix
ASE <- sqrt(diag(Vc))          # Extract asymptotic SEs
print(ASE)

Summary

# Make a table with estimates, SEs, and 95% CI
mle.table <- data.frame(Est = mles,
                        ASE = sqrt(diag(solve(opt.out$hessian))))
mle.table$lower <- mle.table$Est - 1.96*mle.table$ASE
mle.table$upper <- mle.table$Est + 1.96*mle.table$ASE
mle.table

# Plot the actual and estimated response curves
plot(vegHt, z, xlab = "Vegetation height", ylab = "Occurrence probability")
plot(function(x) plogis(-3 + 2*x), 1.1, 3, add = TRUE, lwd = 2)   # truth
plot(function(x) plogis(mles[1] + mles[2]*x), 1.1, 3, add = TRUE,
     lwd = 2, col = "blue")
legend(1.1, 0.9, c("Actual", "Estimate"), col = c("black", "blue"),
       lty = 1, lwd = 2)

Work session
– Different ways of obtaining MLEs: grid search, optim(), glm()
– Get the asymptotic SE (ASE)
– Plot a fitted response curve
– Bootstrap

Bootstrapping

nboot <- 1000   # Obtain 1000 bootstrap samples
boot.out <- matrix(NA, nrow = nboot, ncol = 3)
dimnames(boot.out) <- list(NULL, c("beta0", "beta1", "psi.bar"))

for(i in 1:nboot){
  # Simulate data
  psi <- plogis(mles[1] + mles[2] * vegHt)
  z <- rbinom(nSites, 1, psi)
  # Fit model
  tmp <- optim(mles, negLogLike, y = z, x = vegHt, hessian = TRUE)$par
  psi.mean <- plogis(tmp[1] + tmp[2] * mean(vegHt))
  boot.out[i, ] <- c(tmp, psi.mean)
}

Bootstrapping

SE.boot <- sqrt(apply(boot.out, 2, var))   # Get bootstrap SEs
names(SE.boot) <- c("beta0", "beta1", "psi.bar")

# 95% bootstrapped confidence intervals
apply(boot.out, 2, quantile, c(0.025, 0.975))

# Bootstrap SEs
SE.boot

# Compare these with the ASEs for the regression parameters
mle.table

Part II: Hierarchical Models

Modeling species occurrence: Occupancy models

Modeling species abundance from counts: The N-mixture model

Likelihood inference for hierarchical models

Example: Occupancy model

– Observation model: $y_i \sim \mbox{Binomial}(J,\, p \cdot z_i)$
– State model: $z_i \sim \mbox{Bernoulli}(\psi_i)$, with $\mbox{logit}(\psi_i) = \beta_0 + \beta_1 x_i$
– What is the marginal likelihood for $y$?
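One way to write the answer, consistent with the R implementation on the following slides: summing over the two possible values of the latent $z_i$ gives

$\Pr(y_i) = \mbox{Binomial}(y_i \mid J, p)\,\psi_i + I(y_i = 0)\,(1 - \psi_i), \qquad L(\beta_0, \beta_1, p) = \prod_i \Pr(y_i),$

where $I(\cdot)$ is the indicator function, matching the ifelse(y==0, 1, 0) term in the code.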

Example: Occupancy model

Doing it in R

nSites <- 100
vegHt <- runif(nSites, 1, 3)   # uniform from 1 to 3
psi <- plogis(-3 + 2*vegHt)

# Now we simulate true presence/absence for 100 sites
z <- rbinom(nSites, 1, psi)

## Now generate observations
p <- 0.6
J <- 3                         # sample each site 3 times
y <- rbinom(nSites, J, p*z)

# This is the negative log-likelihood.
negLogLikeocc <- function(beta, y, x, J) {
  beta0 <- beta[1]
  beta1 <- beta[2]
  p <- plogis(beta[3])
  psi <- plogis(beta0 + beta1*x)
  marg.likelihood <- dbinom(y, J, p)*psi + ifelse(y==0, 1, 0)*(1-psi)
  return(-sum(log(marg.likelihood)))
}

starting.values <- c(beta0=0, beta1=0, logitp=0)
opt.out <- optim(starting.values, negLogLikeocc, y=y, x=vegHt, J=J,
                 hessian=TRUE)
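A brief usage sketch (not from the original slides; the name mles.occ is ours): inspect the fitted parameters and back-transform the detection parameter, which is estimated on the logit scale.

(mles.occ <- opt.out$par)              # beta0, beta1, logit(p)
plogis(mles.occ["logitp"])             # detection probability p
sqrt(diag(solve(opt.out$hessian)))     # asymptotic SEs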

N-mixture model

Continuous case: numerical integration

Snowshoe hare data, see Royle and Dorazio (2008, chapter 6)

# Frequencies of individuals captured 0, 1, ..., J = 14 times:
nx <- c(14, 34, 16, 10, 4, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0)
nind <- sum(nx)
J <- 14

Mhlik <- function(parms){
  mu <- parms[1]
  sigma <- exp(parms[2])
  il <- rep(NA, J+1)
  # Marginal probability of each capture frequency k = 0, ..., J,
  # integrating the logit-normal individual effect out numerically
  for(k in 0:J){
    il[k+1] <- integrate(
      function(x){ dbinom(k, J, plogis(x)) * dnorm(x, mu, sigma) },
      lower = -Inf, upper = Inf)$value
  }
  -1*(sum(nx*log(il)))
}

tmp <- nlm(Mhlik, c(-1, -1), hessian = TRUE)
sqrt(diag(solve(tmp$hessian)))   # asymptotic SEs
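A brief usage note (assumed, not from the original slides): nlm() returns the MLEs in tmp$estimate; here the parameters are mu and log(sigma) of the logit-normal heterogeneity distribution.

tmp$estimate            # MLEs of mu and log(sigma)
exp(tmp$estimate[2])    # sigma on its natural scale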

Part III: Bayesian inference

Bayes’ rule
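For reference, Bayes’ rule for a parameter $\theta$ and data $y$: the posterior is the likelihood times the prior, divided by the marginal distribution of the data,

$p(\theta \mid y) = \dfrac{p(y \mid \theta)\, p(\theta)}{p(y)}, \qquad p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta.$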

Bayes’ rule, continued

Bayesian inference

The Posterior distribution

Computing the posterior distribution
1. Do the math. Recognize the mathematical form of the posterior as a standard named distribution that we can compute moments of.
2. Monte Carlo approximation: draw samples from the posterior distribution and quantify posterior features by summarizing the samples. Markov chain Monte Carlo (MCMC).
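As a concrete instance of option 1 for the binomial example used throughout (two independent Binomial(K, p) counts $y_1, y_2$ and a Beta(a, b) prior on p), conjugacy gives the posterior in closed form:

$p \mid y_1, y_2 \sim \mbox{Beta}(a + y_1 + y_2,\; b + 2K - y_1 - y_2).$

With the uniform Beta(1,1) prior used in the MCMC illustration below, this provides an exact answer against which the Metropolis output can be checked.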

Computing the posterior distribution

Computing the posterior distribution: MCMC

How to do Bayesian analysis: MCMC

The Metropolis Algorithm

Illustration

Illustration of MCMC using Metropolis Algorithm

# 2 binomial observations
y <- rbinom(2, size = 10, p = 0.5)

# The joint distribution function. As a function of the data it gives
# the probability of any two values of data = c(y1, y2)
jointdis <- function(data, K, p){
  prod(dbinom(data, size = K, p = p))
}

(jointdis(y, K = 10, p = 0.5))   # also happens to be the likelihood of the value
                                 # p = 0.5 for the given data, but it is NOT a
                                 # probability for p

# Evaluate the likelihood for a grid of values of "p"
p.grid <- seq(0.1, 0.9, length.out = 200)
likelihood <- rep(NA, 200)
for(i in 1:200){
  likelihood[i] <- jointdis(y, K = 10, p = p.grid[i])
}

# Plot the likelihood
plot(p.grid, likelihood, xlab = "p", ylab = "likelihood")

Illustration of MCMC using Metropolis Algorithm
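The Metropolis loop below evaluates an (unnormalized) posterior via a function posterior(); a minimal sketch of such a function, assuming the jointdis() likelihood defined above and a Beta(a, b) prior density, is:

# Unnormalized posterior: joint binomial likelihood times the Beta(a, b)
# prior density. (A minimal sketch; assumes jointdis() as defined above.)
posterior <- function(y, K, p, a, b){
  jointdis(y, K = K, p = p) * dbeta(p, a, b)
}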

## Do MCMC iterations using the Metropolis algorithm
## Assume a uniform prior, which is Beta(1,1)
mcmc.iters <- 10000   # number of MCMC iterations (any reasonably large value)
out <- rep(NA, mcmc.iters)

# starting value
p <- 0.2

for(i in 1:mcmc.iters){
  # Use a uniform candidate generator. This is not efficient.
  p.cand <- runif(1, 0, 1)
  r <- posterior(y, K=10, p=p.cand, a=1, b=1) / posterior(y, K=10, p=p, a=1, b=1)
  # Generate a uniform r.v. and compare with "r"; this imposes the
  # correct probability of acceptance
  if(runif(1) < r)   # This is how you “do something” with probability r
    p <- p.cand
  out[i] <- p
}
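A brief follow-up sketch (not from the original slides): summarize the retained samples to characterize the posterior of p.

# Summarize the posterior samples of p
hist(out, freq = FALSE, xlab = "p", main = "Posterior of p")
mean(out)                          # posterior mean
quantile(out, c(0.025, 0.975))     # 95% credible interval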

Likelihood vs. posterior

The posterior of a function of a model parameter
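A minimal sketch (assumed, not from the slides): the posterior of any function of a parameter is obtained simply by applying that function to the posterior samples, for example the odds p/(1-p) from the Metropolis output above.

# Posterior of a derived quantity: transform the posterior samples directly
odds <- out / (1 - out)            # posterior samples of the odds p/(1-p)
mean(odds)                         # posterior mean of the odds
quantile(odds, c(0.025, 0.975))    # 95% credible interval for the odds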

Remarks on the Metropolis Algorithm

Heuristic: This algorithm has us simulate candidate values somehow, even arbitrarily, and then accept them with probability min(1, r), so that candidates with higher posterior density are always accepted and candidates with lower posterior density are accepted only sometimes. The long-run frequency of “accepted” values is that of the target posterior density!

Note: If the prior is constant, this MCMC calculation is based on repeated evaluations of the likelihood only. So, if you write a function to do MLE, you can also do MCMC.
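To make the last point concrete, here is a minimal sketch (assumed, not part of the original slides) that reuses the negLogLike() function from the logistic regression example, together with the simulated z and vegHt, as the log-posterior under a flat prior; the names niter and chain are ours.

# Metropolis sampling for the logistic regression, reusing negLogLike().
# Under a flat prior the log-posterior equals -negLogLike() up to a constant.
niter <- 5000
chain <- matrix(NA, niter, 2, dimnames = list(NULL, c("beta0", "beta1")))
beta <- c(0, 0)                                  # starting values
logpost <- -negLogLike(beta, y = z, x = vegHt)
for(i in 1:niter){
  beta.cand <- rnorm(2, beta, 0.1)               # random-walk candidate
  logpost.cand <- -negLogLike(beta.cand, y = z, x = vegHt)
  if(runif(1) < exp(logpost.cand - logpost)){    # Metropolis acceptance
    beta <- beta.cand
    logpost <- logpost.cand
  }
  chain[i, ] <- beta
}
apply(chain, 2, mean)   # posterior means, comparable to the MLEs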

Summary thoughts on Bayesian/classical inference

Idealized Structure of Workshop/Book
– Introduction to a class of models
– Likelihood analysis of the models in unmarked, stressing a consistent work flow and the ease of doing standard things like prediction and model selection
– Bayesian analysis in BUGS
– Illustration of a type of model that can’t be done (easily, or in unmarked) using likelihood methods