Bayesian Statistics Lecture 8: Likelihood Methods in Forest Ecology. October 9th–20th, 2006.


“Real knowledge is to know the extent of one’s ignorance” – Confucius

How do we measure our knowledge (ignorance)?
– Scientific point of view: knowledge is acceptable if it explains a body of natural phenomena (a scientific model).
– Statistical point of view: knowledge is uncertain, but we can use it if we can measure its uncertainty.
The question is how to measure uncertainty and make use of available knowledge.

Limitations of likelihoodist & frequentist approaches
– Parsimony is often an insufficient criterion for inference, particularly if our objective is forecasting.
– Model selection uncertainty is the big elephant in the room.
– Since parameters do not have probability distributions, error propagation in models cannot be interpreted in a probabilistic manner.
– They cannot deal with multiple sources of error and complex error structures in an efficient way.
– New data require new analyses.

Standard statistics revisited: Complex Variance Structures

Inference addresses three basic questions:
1. What do I believe now that I have these data? [Credibility or confidence]
2. What should I do now that I have these data? [Decision]
3. How should I interpret these data as evidence of one hypothesis vs. other competing hypotheses? [Evidence]

[Diagram: Body of knowledge → Scientific Model → Scientific Hypothesis → Statistical Model → Statistical Hypothesis, confronted with DATA]

An example
– Body of knowledge = fruit production in trees
– Scientific explanation = physiology, life history
– Scientific hypothesis: y_i = DBH^b
– Statistical model = Poisson distribution
– Statistical hypothesis: b = value → Pred(y)
– DATA

The Frequentist Take (b = 0.4)
– Belief: only with reference to an infinite series of trials
– Decision: accept or reject that b = 0
– Evidence: none
(Body of knowledge = fruit production in trees; scientific explanation = physiology; scientific hypothesis: log y_i = b log(DBH); statistical model = Poisson distribution; statistical hypothesis: b ≠ 0)

The Likelihoodist Take (b = 0.4)
– Belief: none; only relevant to the data at hand
– Decision: only with reference to alternate models
– Evidence: likelihood ratio test or AIC
(Body of knowledge = fruit production in trees; scientific explanation = physiology; scientific hypothesis: log y_i = b log(DBH); statistical model = Poisson distribution; statistical hypothesis: b ≠ 0)

The Bayesian Take (b = 0.4)
– Belief: credible intervals
– Decision: parameter in or out of a distribution
– Evidence: none
(Body of knowledge = fruit production in trees; scientific explanation = physiology; scientific hypothesis: log y_i = b log(DBH); statistical model = Poisson distribution; statistical hypothesis: b ≠ 0)

Parallels and differences in Bayesian & Frequentist statistics Bayesian and frequentist approaches use the data to derive a parameter estimate and a measure of uncertainty around the parameter that can be interpreted using probability. In Bayesian inference, parameters are treated as random variables that have a distribution. If we know their distribution, we can assess the probability that they will take on a particular value (posterior ratios or credible intervals).

Evidence vs Probability “As a matter of principle, the infrequency with which, in particular circumstances, decisive evidence is obtained, should not be confused with the force, or cogency, of such evidence.” Fisher 1959

Frequentist vs Bayesian
Frequentist:
– Probability = objective relative frequencies
– Parameters are fixed, unknown constants, so we cannot write e.g. P(θ = 0.5 | D)
– Estimators should be good when averaged across many trials
Bayesian:
– Probability = degrees of belief (uncertainty)
– Can write P(anything | D)
– Estimators should be good for the available data
Source: “All of Statistics”, Larry Wasserman

Frequentism
– Probability is only defined as a long-term average in an infinite sequence of trials (that typically never happen!).
– The p-value is the probability of an outcome at least that extreme, given a specified null hypothesis.
– Null hypotheses are often strawmen set up to be rejected.
– Improperly used, p-values are poor tools for statistical inference.
– We are interested in parameter estimation rather than p-values per se.

Frequentist statistics violates the likelihood principle “The use of p-values implies that a hypothesis that may be true can be rejected because it has not predicted observable results that have not actually occurred.” Jeffreys, 1961

Some rules of probability assuming independence [Venn diagram of events A and B]
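The rules themselves did not survive transcription; for two independent events A and B, the standard rules the slide presumably listed are:

```latex
P(A \cap B) = P(A)\,P(B), \qquad
P(A \mid B) = P(A), \qquad
P(A \cup B) = P(A) + P(B) - P(A)\,P(B)
```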

Bayes Theorem
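The equation on this slide was lost in transcription; in the notation used later in the lecture (θ for the parameters, y for the data), Bayes’ theorem reads:

```latex
p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}
```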

For a set of mutually exclusive hypotheses…
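The formula this slide leads into is presumably the discrete form of Bayes’ theorem, with the law of total probability in the denominator:

```latex
P(H_i \mid D) = \frac{P(D \mid H_i)\,P(H_i)}{\sum_j P(D \mid H_j)\,P(H_j)}
```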

Bolker

An example from medical testing

[2 × 2 table: test result (+) against disease status (ill / not ill)]
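A minimal worked version of the medical-testing example: the prevalence, sensitivity, and false-positive rate below are illustrative assumptions, not numbers from the lecture.

```python
# Bayes' theorem for a diagnostic test: P(ill | test+).
# All three input probabilities are assumed for illustration.
prevalence = 0.01       # P(ill)
sensitivity = 0.95      # P(test+ | ill)
false_pos = 0.05        # P(test+ | not ill)

# Denominator: total probability of a positive test.
p_test_pos = sensitivity * prevalence + false_pos * (1 - prevalence)
p_ill_given_pos = sensitivity * prevalence / p_test_pos
print(round(p_ill_given_pos, 3))  # 0.161
```

Even with a fairly accurate test, a positive result implies only a ~16% chance of illness when the disease is rare, which is the usual point of this example.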

Bayes Theorem Rarely known Hard to integrate function MCMC methods

Joint and marginal distributions: probability that two pigeon species (S and R) occupy an island (Diamond 1975)

            S               S^c             Marginal
R           2               9               Pr{R} = 11/32
R^c         18              3               Pr{R^c} = 21/32
Marginal    Pr{S} = 20/32   Pr{S^c} = 12/32  N = 32

Event: R given S, Prob{R|S} = 2/20; S given R, Prob{S|R} = 2/11
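The conditional probabilities in the table follow directly from the joint counts; a quick check in Python:

```python
# Joint counts from the Diamond (1975) pigeon table:
# rows = species R present/absent, columns = species S present/absent.
n_RS, n_RSc = 2, 9       # R & S, R & not-S
n_RcS, n_RcSc = 18, 3    # not-R & S, not-R & not-S
N = n_RS + n_RSc + n_RcS + n_RcSc   # 32 islands in total

# Conditioning = restricting to the relevant row or column.
p_R_given_S = n_RS / (n_RS + n_RcS)   # Prob{R|S} = 2/20
p_S_given_R = n_RS / (n_RS + n_RSc)   # Prob{S|R} = 2/11
print(p_R_given_S, p_S_given_R)
```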

Conjugacy In Bayesian probability theory, a conjugate prior is a family of prior probability distributions with the property that the posterior probability distribution also belongs to that family. A conjugate prior is an algebraic convenience: otherwise a difficult numerical integration may be necessary.
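The beta–binomial pair is the textbook example of conjugacy (my illustration, not one worked in the lecture): with a Beta(a, b) prior on a success probability, the posterior after the data is again a Beta, so no integration is needed.

```python
# Beta-binomial conjugacy: prior Beta(a, b) on a success probability;
# after observing k successes in n trials, the posterior is
# Beta(a + k, b + n - k). All numbers below are assumed.
a, b = 2.0, 2.0          # prior pseudo-counts
n, k = 10, 7             # data: 7 successes in 10 trials

post_a, post_b = a + k, b + n - k
post_mean = post_a / (post_a + post_b)
print(post_a, post_b, round(post_mean, 3))  # 9.0 5.0 0.643
```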

Jointly distributed random variables We have to normalize this to turn it into a probability

Hierarchical Bayes
Ecological models tend to be high-dimensional and include many sources of stochasticity. These sources of “noise” often don’t comply with the assumptions of traditional statistics:
– Independence (spatial or temporal)
– Balance among groups
– Distributional assumptions
HB can deal with these problems by partitioning a complex problem into a series of univariate distributions that we can solve, typically using sophisticated computational methods.

Hierarchical Bayes

Clark et al. 2004

Hierarchical Bayes Marginal distribution of a parameter averaged over all other parameters and hyperparameters:
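Reconstructed, the marginal distribution the slide refers to (for parameter θ₁, integrating out the other parameters and hyperparameters θ₂, …, θₙ) is:

```latex
p(\theta_1 \mid y) = \int \cdots \int p(\theta_1, \theta_2, \ldots, \theta_n \mid y)\; d\theta_2 \cdots d\theta_n
```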

Hierarchical Bayes: Advantages
1. Complex models can be constructed from simple, conditional relationships. We don’t need an integrated specification of the problem, only the conditional components. We are drawing boxes and arrows.
2. We relax the traditional requirement for independent data. Conditional independence is enough. We typically take up the relationships that cause correlation at a lower process stage. We can accommodate multiple data types within a single analysis, even treating model output as ‘data’. More on this later.
3. Sampling-based approaches (MCMC) can do the integration for us (the thing we avoided in advantage 1).

Why Hierarchical Bayes?
A useful approach for understanding ecological processes because:
– It incorporates uncertainty using a probabilistic framework
– Model parameters are random variables; the output is a probability distribution (the posterior distribution)
– Complex models are partitioned into a hierarchical structure
– It performs well for high-dimensional models (i.e., many parameters) with little data

Bayes’ Rule
p(θ | y) = p(θ) p(y | θ) / p(y)
where θ is the set of model parameters and y is the observed data; p(θ) is the prior distribution, p(y | θ) the likelihood, p(θ | y) the posterior distribution, and p(y) the normalizing density.
– The posterior distribution is affected by the data only through the likelihood function.
– If the prior distribution is non-informative, then the data dominate the outcome.
– p(y) = ∫ p(θ) p(y | θ) dθ (the marginal distribution of y, or prior predictive distribution)

How do we do this? Baby steps: rejection sampling. Suppose we have a target distribution p(x) that we want to sample from.

Bound the target distribution with a proposal function f(x) so that C·f(x) ≥ p(x) for all x. Calculate the ratio a = p(x) / (C·f(x)).

With probability a, accept this value of X as a random draw from p(x). With probability 1 − a, reject this value of X and repeat the procedure. To do this, draw a random variate z from the uniform density; if z < a, accept X.

Build an empirical distribution of accepted draws, which approximates the target distribution.
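The three steps above can be sketched in a few lines of Python. The target here is an assumed unnormalized Beta(2, 5) density on [0, 1] with a Uniform(0, 1) proposal, chosen only for illustration; C is any constant with C·f(x) ≥ p(x) everywhere.

```python
import random

def p(x):
    # Unnormalized target density, proportional to Beta(2, 5) on [0, 1].
    return x * (1 - x) ** 4

C = 0.09  # max of p on [0, 1] is ~0.082, so C * f(x) = C bounds it

def rejection_sample(n, seed=0):
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        x = rng.random()            # draw from the proposal f = U(0, 1)
        a = p(x) / (C * 1.0)        # acceptance ratio a = p(x) / (C f(x))
        if rng.random() < a:        # accept with probability a
            samples.append(x)
    return samples

draws = rejection_sample(5000)
print(sum(draws) / len(draws))      # sample mean, close to 2/7 ~ 0.286
```

The accepted draws form the empirical approximation to the target; here their mean should sit near the Beta(2, 5) mean of 2/7.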

MCMC Methods
– Markov process: a random process whose next step depends only on the prior realization (lag of 1).
– The joint probability distribution p(θ | y), which is the posterior distribution of the parameters, is generally impossible to integrate in closed form.
– So… use a simulation approach based on conditional probability.
– The goal is to sample from this joint distribution of all parameters in the model, given the data (the target distribution), in order to estimate the parameters, but…
– …we don’t know what the target distribution looks like, so we have to make a proposal distribution.

Monte Carlo principle Given a very large set X and a distribution p(x) over it, we draw an i.i.d. set of N samples. We can then approximate the distribution (and expectations under it) using these samples.
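The principle in one line of code: replace an expectation under p(x) by an average over N i.i.d. samples. The uniform target and the function x² are assumed here purely for illustration.

```python
import random

# Approximate E[x^2] under p = Uniform(0, 1) by a sample average.
# The exact value is 1/3, so the estimate should land close to it.
rng = random.Random(42)
N = 100_000
samples = [rng.random() for _ in range(N)]
estimate = sum(x * x for x in samples) / N
print(estimate)  # close to 1/3
```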

Markov Chain Monte Carlo (MCMC) Recall again the set X and the distribution p(x) we wish to sample from. Suppose that it is hard to sample from p(x) directly, but that it is possible to “walk around” in X using only local state transitions. Insight: we can use a “random walk” to help us draw random samples from p(x).

MCMC Methods
Metropolis–Hastings algorithms are a way to construct a Markov chain in such a way that its equilibrium (or stationary) distribution is the target distribution.
– The proposal is some kind of bounding distribution that completely contains the target distribution.
– Acceptance–rejection methods are used to decide whether a proposed value is accepted or rejected as being drawn from the target distribution.
– Jumping rules determine when and how the chain moves on to new proposal values.

MCMC The basic rule is that the ratio of successful jump probabilities is proportional to the ratio of posterior probabilities. This means that over the long term, we stay in areas with high probability and the long-term occupancy of the chain matches the posterior distribution of interest.

MCMC Methods
– Eventually, through many proposals that are updated iteratively (based on jumping rules), the Markov chain will converge to the target distribution, at which point it has reached equilibrium (or stationarity).
– This is achieved after the so-called “burn-in” (“the chain converged”).
– Simulations (proposals) made prior to reaching stationarity (i.e., during burn-in) are not used in estimating the target.
– Burning questions: when have you achieved stationarity, and how do you know? (There are some diagnostics, but no objective answer, because the target distribution is not known.)

More burning questions
How can you pick a proposal distribution when you don’t know what the target distribution is? (This is what M–H figured out!)
– The series of proposals depends on a ratio involving the target distribution, which itself cancels out in the ratio.
– So you don’t need to know the target distribution in order to make a set of proposals that will eventually converge to the target.
– This is (vaguely) analogous in K–L information theory to not having to “know the truth” in order to estimate the difference between two models in their distance from the truth (the truth drops out in the comparison).
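A minimal random-walk Metropolis sketch (my illustration, not the lecture’s code) makes the cancellation concrete: the target enters only through the ratio target(proposal)/target(x), so any unknown normalizing constant drops out. The target here is an assumed unnormalized standard normal density.

```python
import math
import random

def target(x):
    # Unnormalized target: proportional to N(0, 1).
    # The missing constant 1/sqrt(2*pi) cancels in the acceptance ratio.
    return math.exp(-0.5 * x * x)

def metropolis(n_steps, step=1.0, seed=1):
    rng = random.Random(seed)
    x = 0.0
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)   # symmetric random-walk proposal
        a = target(proposal) / target(x)      # constants cancel here
        if rng.random() < a:                  # accept with prob min(1, a)
            x = proposal
        chain.append(x)
    return chain

chain = metropolis(20_000)
post_burn = chain[10_000:]                    # discard burn-in (first half)
mean = sum(post_burn) / len(post_burn)
print(mean)                                   # close to 0, the target mean
```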

Posterior Distributions
– Assuming the chain converged, you obtain an estimate for each parameter of its marginal distribution, p(θ1 | θ2, θ3, …, θn, y).
– That is, the distribution of θ1, averaged over the distributions for all other parameters in the model and given the data.
– This marginal distribution is the posterior distribution that represents the probability distribution of this parameter, given the data and the other parameters in the model.
– These posterior distributions of the parameters are the basis of inference.

Assessing Convergence
– Run multiple chains (chains are independent).
– Use many iterations (>2000); the first half are burn-in.
– “Thin” the chain (take every xth value; depends on auto-correlation).
– Compare the traces of the chains. [Figure: traces of Chain 1, Chain 2, Chain 3, including a “not converged” example]

Assessing Convergence
– Estimate R̂ (the Gelman–Rubin statistic): the square root of the ratio of the pooled variance estimate (combining within-chain and between-chain variance) to the within-chain variance.
– At stationarity, and as n → ∞, R̂ → 1.
– Above ~1.1, the chains may not have converged, and a greater number of simulations is recommended.
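The R̂ computation can be sketched in plain Python using the standard Gelman–Rubin formula; the three well-mixed toy chains below are my own illustration, chosen so R̂ should come out near 1.

```python
import random

def gelman_rubin(chains):
    """Gelman-Rubin R-hat for m chains, each of length n."""
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance B and mean within-chain variance W.
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    # Pooled variance estimate, then R-hat.
    var_hat = (n - 1) / n * W + B / n
    return (var_hat / W) ** 0.5

# Three independent, well-mixed toy chains -> R-hat close to 1.
rng = random.Random(0)
chains = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(3)]
print(gelman_rubin(chains))  # close to 1
```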