Bayesian statistics 2 More on priors plus model choice.


Bayes formula
Prior: f(θ) – needs to be specified!
Posterior: f(θ|D) = f(D|θ) f(θ) / f(D) – we can get this if we can calculate f(D).
Marginal data density: f(D) = ∫ f(D|θ) f(θ) dθ – the normalization, and the reason we do MCMC sampling rather than calculate everything analytically.
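For a single parameter the normalization can in fact be done numerically, which makes the role of f(D) concrete. A minimal sketch, assuming made-up binomial data (7 detections in 10 trials) and a flat prior, where the exact answers are known:

```python
import math

# Hypothetical data: k = 7 detections in n = 10 trials, flat prior on the
# rate theta. Grid-approximate the marginal data density f(D) and posterior.
def grid_posterior(k=7, n=10, m=2001):
    d = 1.0 / (m - 1)                              # grid spacing
    thetas = [i * d for i in range(m)]
    prior = [1.0] * m                              # flat prior f(theta) = 1
    lik = [math.comb(n, k) * t**k * (1 - t)**(n - k) for t in thetas]
    unnorm = [p * l for p, l in zip(prior, lik)]   # f(D|theta) f(theta)
    f_D = sum(unnorm) * d                          # the normalization f(D)
    post = [u / f_D for u in unnorm]               # posterior f(theta|D)
    post_mean = sum(t * p for t, p in zip(thetas, post)) * d
    return f_D, post_mean

f_D, post_mean = grid_posterior()
# With a flat prior the posterior is Beta(8, 4), so the exact mean is 8/12
# and the exact f(D) is 1/11; the grid sums reproduce both closely.
```

This works in one dimension; with many parameters the grid blows up, which is exactly why MCMC sampling is used instead.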

Priors
Priors are there to sum up what you know before seeing the data. Sometimes you may have sharp knowledge (mass > 0), sometimes "soft" knowledge. (Ex: It seems reasonable to me that the average elephant mass is between 4 and 10 tons, but I'm not 100% certain.) Ideally, we would like to have an informed opinion about the prior probability of any given parameter interval (unrealistic). In practice, you choose a form for the probability density, and a choice of its parameters (hyper-parameters) that gives you reasonable expectations or quantiles. For instance, choose hyper-parameters that give you a reasonable 95% credibility interval (an interval in which the parameter lies with 95% probability). The properties of the prior can be extracted from yourself, from (other) experts in the field from which the data comes, or from general knowledge of the data.

Typical prior distributions
Normal distribution: for a parameter that can take any real value, but where you have a notion of the interval it ought to be in. (Alt: t-distribution.)
Lognormal distribution: for a strictly positive parameter, where you have a notion of the interval it ought to be in. (Alt: gamma and inverse-gamma.)
Uniform distribution from a to b: when you really don't have a clue what the parameter value can be, except that it's definitely not outside a given interval.
Beta distribution: strictly between 0 and 1 (a rate), but you have a notion of where the value is most likely to be found. (Generalization of U(0,1).)
Dirichlet distribution: same as the beta, except instead of two rates (p and 1−p) you get a set of rates that sums to one. (For the multinomial case.)
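One practical way to pin down hyper-parameters for the first two choices is to solve them from a stated central 95% interval. A small sketch, using the assumed elephant-mass interval of 4 to 10 tons as that interval:

```python
import math

Z95 = 1.959964  # 97.5% standard-normal quantile

def normal_from_interval(lo, hi):
    # Solve mu, sigma so that (lo, hi) is the central 95% interval
    # of a normal distribution.
    mu = (lo + hi) / 2
    sigma = (hi - lo) / (2 * Z95)
    return mu, sigma

def lognormal_from_interval(lo, hi):
    # Same idea on the log scale, for a strictly positive parameter.
    return normal_from_interval(math.log(lo), math.log(hi))

# Elephant-mass example: average mass between 4 and 10 tons.
mu, sigma = normal_from_interval(4, 10)       # mu = 7.0, sigma about 1.53
lmu, lsig = lognormal_from_interval(4, 10)
```

The same trick works for the beta distribution, but there the quantiles have no closed form, so the hyper-parameters must be found numerically.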

The beta distribution
Since rates pop up ever so often, it's good to have a clear understanding of the beta distribution (the standard prior form for rates).
1. It has non-zero density only inside the interval (0,1), but can be rescaled to run between any fixed limits.
2. It has two parameters, a and b, which allow you to choose any 95% credibility interval inside (0,1).
3. a = b = 1 describes the uniform distribution, U(0,1).
4. a = 1 with b large describes a very low rate (almost 0).
5. b = 1 with a large describes a very high rate (almost 1).
6. The higher a+b is, the smaller the variance (the more precise the distribution).
7. The mean value is a/(a+b). (The mode is (a−1)/(a+b−2).)
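Points 6 and 7 can be checked directly from the standard beta-distribution formulas; a quick sketch:

```python
def beta_mean(a, b):
    return a / (a + b)

def beta_mode(a, b):
    return (a - 1) / (a + b - 2)   # interior mode; requires a > 1 and b > 1

def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

# Same mean (0.2) but growing a+b: the prior tightens, as in point 6.
v_loose, v_tight = beta_var(2, 8), beta_var(20, 80)
```

So Beta(2, 8) and Beta(20, 80) encode the same best guess for the rate, but the second expresses far more prior confidence in it.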

Prior-data conflict
Sometimes our choice of prior can clash with reality. Ex: I may have been given the impression that the detection rate for elephants in an area is somewhere between 30% and 40%, when it's really closer to 80%.
Can be detected by:
1. Observing that the posterior mean is somewhere near the limit of our prior 95% credibility interval, or even outside it.
2. Seeing that a wider prior, or one with a different form, gives different results (robustness).
3. Bayesian conflict measures (hard).
General advice when making a prior is to think to yourself: "what if I'm wrong?" Avoid priors with hard limits when there's even the slightest possibility that reality is found outside those limits.
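Check 1 can be automated when the model is conjugate. A sketch with assumed numbers matching the elephant example: a Beta(35, 65) prior encoding "the detection rate is around 30–40%" meeting data generated by a rate near 80%:

```python
import math

a, b = 35, 65            # prior Beta(35, 65): mean 0.35, fairly sharp
k, n = 80, 100           # assumed data: 80 detections in 100 trials

prior_mean = a / (a + b)
prior_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
# Rough normal approximation to the prior 95% credibility interval.
lo, hi = prior_mean - 1.96 * prior_sd, prior_mean + 1.96 * prior_sd

post_mean = (a + k) / (a + b + n)   # conjugate beta-binomial update
conflict = not (lo <= post_mean <= hi)
```

Here the posterior mean (0.575) lands far outside the prior interval of roughly (0.26, 0.44), so the check flags a prior-data conflict. Note also that the sharp prior has dragged the posterior mean well below the observed 80%: the cost of a confidently wrong prior.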

Non-informative priors
If there does not seem to be a well-founded way of specifying a prior for a given parameter, you can either...
i. use a so-called non-informative prior, or
ii. find a model where all the parameters actually mean something!
Non-informative priors:
- Flat between 0 and 1 for probabilities/rates (proper).
- Flat between −∞ and +∞ for real-valued parameters (improper).
- A flat distribution on the logarithm for a scale parameter (improper).
Pros:
- Often, no one will criticise your prior.
- You don't have to spend time thinking about what is and is not a reasonable parameter value.
Cons:
- Non-informative priors are usually not proper distributions, which makes their interpretation problematic.
- An improper prior may lead to an improper posterior.
- Standard Bayesian model choice is impossible.

Hypothesis testing / model comparison – Bayesian model probabilities
Bayesian inference can be performed on models as well as on model parameters, and it's again Bayes' theorem that is being used! In a sense, we just introduce a new discrete variable, M, describing the model:
P(M|D) = f(D|M) P(M) / f(D), where f(D|M) = ∫ f(D|θ,M) f(θ|M) dθ.
The data can thus be used for inference on M. We can also define the Bayes factor, the change in the relative model probabilities due to the data:
B12 = f(D|M1) / f(D|M2).
But the marginal data density f(D|M) may not be analytically available!
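For the conjugate beta-binomial case the integral is available in closed form, so model probabilities and Bayes factors can be computed exactly. A sketch with assumed data (the 7-of-10 example again), comparing a flat prior against one concentrated near a 10% rate:

```python
import math

def log_marginal(k, n, a, b):
    # log f(D|M) for binomial data under a Beta(a, b) prior: the integral
    # over theta reduces to a ratio of beta functions in the conjugate case.
    def logB(x, y):
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return log_choose + logB(a + k, b + n - k) - logB(a, b)

# M1: flat Beta(1, 1) prior.  M2: Beta(2, 18), concentrated near 0.1.
bf = math.exp(log_marginal(7, 10, 1, 1) - log_marginal(7, 10, 2, 18))
# bf is the Bayes factor B12; here it comes out large, so the data
# strongly favour the flat-prior model over the "rate near 10%" model.
```

Working on the log scale with lgamma avoids overflow for realistic sample sizes; with equal prior model probabilities, the posterior odds equal the Bayes factor.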

Bayesian model probabilities – dealing with the marginal data density
The marginal data density f(D|M) may not be analytically available! Possible solutions:
1. Numeric integration.
2. Sample-based methods (harmonic mean, importance sampling with an adapted proposal distribution).
3. Reversible jumps – MCMC jumping between models. The probability of each model is estimated from the time the chain spends inside it.
- Pros: don't waste time on improbable models.
- Cons: difficult to make, difficult to avoid mistakes, difficult to detect mistakes, difficult to make efficient.
4. Latent variable for model choice: sharp but finitely wide priors for parameters that should be fixed according to a null hypothesis ("reversible jumps light"). M is then equivalent to a choice of hyper-parameters, which define the priors on θ. (We're making discrete inference on the hyper-parameters.)
- Pros: much easier to make than reversible jumps. It can be more realistic to test θ ≈ 10 than to test θ = 10.
- Cons: over-pragmatic? Restricted (how to deal with non-nested models?). Beware of technical issues (is the support the same?). Can be fairly inefficient when the prior becomes very sharp.
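The sample-based idea can be illustrated in its very simplest form: average the likelihood over draws from the prior (the harmonic-mean estimator mentioned above instead averages over posterior draws, which is more efficient but notoriously unstable). A sketch with the assumed 7-of-10 binomial data, where the exact answer under a flat prior is 1/11:

```python
import math
import random

def marginal_mc(k, n, draws=200_000, seed=1):
    # Prior Monte Carlo: f(D) is approximated by the average of f(D|theta)
    # over draws from the prior, here the flat U(0,1) prior on the rate.
    rng = random.Random(seed)
    choose = math.comb(n, k)
    total = 0.0
    for _ in range(draws):
        theta = rng.random()                      # draw from the prior
        total += choose * theta**k * (1 - theta)**(n - k)
    return total / draws

est = marginal_mc(7, 10)   # should land very close to 1/11
```

This naive version degrades badly when the likelihood is concentrated relative to the prior (most draws contribute nearly nothing), which is what motivates the adapted proposal distributions and reversible-jump machinery on this slide.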

Model comparison – pragmatic alternatives
Parallel to the classic approach: if θ = θ0 is to be tested, check whether θ0 is inside the posterior 95% credibility interval.
- Pros: easy to check.
- Cons: a throwback to classical model testing that doesn't make much sense in a Bayesian setting. Restricted use.
DIC: parallel to the classic AIC. You want the model with the smallest value. It balances model fit (average log-likelihood) against model complexity, p_D.
- Pros: easy to calculate once you have MCMC samples. Implemented in WinBUGS, and has become a standard alternative to model probabilities because of this. Takes values even when the prior is non-informative.
- Cons: difficult to understand its meaning. What does it represent? What is a big or small difference in DIC? It counts latent variables in p_D if they are in the sampling, but not if they are handled analytically – same model, different results! This has consequences for the occupancy model!
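To make the ingredients concrete: DIC = D-bar + p_D, where D-bar is the posterior mean deviance and p_D = D-bar − D(θ-bar). A sketch for the running binomial example, using exact Beta(8, 4) posterior draws as a stand-in for MCMC output (the binomial constant is dropped from the deviance, which cancels in DIC comparisons between models):

```python
import math
import random

def deviance(theta, k=7, n=10):
    # D(theta) = -2 log f(D|theta) for binomial data, constant dropped.
    return -2 * (k * math.log(theta) + (n - k) * math.log(1 - theta))

def dic_from_samples(samples, k=7, n=10):
    dbar = sum(deviance(t, k, n) for t in samples) / len(samples)
    theta_bar = sum(samples) / len(samples)
    p_d = dbar - deviance(theta_bar, k, n)   # effective number of parameters
    return dbar + p_d, p_d

# Stand-in for MCMC output: exact posterior draws, Beta(8, 4) for
# 7 successes in 10 trials under a flat prior.
rng = random.Random(42)
samples = [rng.betavariate(8, 4) for _ in range(50_000)]
dic, p_d = dic_from_samples(samples)
# p_d comes out a bit below 1, as expected for a single well-identified
# parameter under a flat prior.
```

The latent-variable caveat above shows up directly here: if the sampler carried extra latent variables, they would inflate D-bar's spread and hence p_D, even though the marginalized model is the same.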