Likelihood function and Bayes Theorem

Likelihood function and Bayes Theorem. In the simplest case, Bayes' theorem states P(B|A) = P(A|B) P(B) / P(A). The likelihood function arises when we view the conditional probability as a function of its second argument (what we are conditioning on) rather than of its first argument - e.g. the function b -> P(A|B=b). We define the likelihood function as an equivalence class of such conditional probabilities, L(b|A) = c P(A|B=b), where c is any positive constant. It is the ratio of likelihoods that matters: L(b1|A) / L(b2|A).
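
To illustrate the equivalence-class point, here is a minimal Python sketch with made-up conditional probabilities, showing that the arbitrary constant c cancels in a likelihood ratio:

```python
# Likelihood ratio for two candidate parameter values b1, b2 given data A.
# The equivalence-class constant c cancels, so only the ratio is meaningful.

def likelihood(p_A_given_b, c=1.0):
    """L(b|A) = c * P(A|B=b) for any positive constant c."""
    return c * p_A_given_b

# Hypothetical conditional probabilities P(A | B = b1) and P(A | B = b2)
p_A_b1, p_A_b2 = 0.20, 0.05

for c in (1.0, 10.0, 123.4):           # any positive constant
    ratio = likelihood(p_A_b1, c) / likelihood(p_A_b2, c)
    print(c, ratio)                    # ratio is 4.0 regardless of c
```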

For a probability density function f(x; c) with parameter c, the likelihood function is L(c|x) = f(x; c): viewed as a function of x with c fixed it is a pdf, but viewed as a function of c with x fixed it is a likelihood. The likelihood is not a pdf. Example - coin toss - with p = P(H), P(HHH) = p^3, so P(HHH | p = 0.5) = 1/8 = L(p = 0.5 | HHH). This does not say that the probability the coin is fair, given HHH, is 1/8. You can instead view this as having a whole collection of coins; if you believe it is close to a “fair” collection, then P(p is “near” 0.5) is close to 1. That belief would inform the prior distribution you choose.
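
A short sketch of this coin-toss likelihood, evaluating L(p|HHH) = p^3 on a grid of p values (the grid itself is just an illustrative choice):

```python
import numpy as np

# Likelihood of observing HHH as a function of the bias p = P(H).
p_grid = np.linspace(0.1, 0.9, 9)
likelihood = p_grid ** 3          # L(p | HHH) = P(HHH | p) = p^3

for p, L in zip(p_grid, likelihood):
    print(f"p = {p:.1f}   L(p|HHH) = {L:.4f}")

# L(0.5|HHH) = 0.125 = 1/8, but that is not the probability the coin is fair;
# only ratios such as L(0.9|HHH) / L(0.5|HHH) carry meaning on their own.
print(likelihood[8] / likelihood[4])   # ≈ 5.83
```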

If we view this as the likelihood of the data given some hypothesis H_i, Bayes' theorem becomes P(H_i | data) = P(H_i) / sum_j [L(data | H_j) / L(data | H_i)] P(H_j). The ratio in the bottom is the odds (likelihood) ratio: if it is near 1 for all hypotheses, then the posterior is essentially the same as the prior and we have learned nothing. The best case is when it is near 1 for one hypothesis and small for all the others.
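
A minimal sketch of this discrete-hypothesis update, using made-up priors and likelihoods for three hypotheses:

```python
import numpy as np

# Hypothetical priors P(H_j) and likelihoods L(data | H_j) for three hypotheses.
prior = np.array([0.5, 0.3, 0.2])
like  = np.array([0.02, 0.10, 0.01])

# Posterior via Bayes' theorem over a discrete set of hypotheses.
posterior = like * prior / np.sum(like * prior)
print(posterior)          # H_2 gains mass because its likelihood ratios are favorable

# Equivalent form from the text: divide through by L(data | H_i).
i = 1
odds_ratios = like / like[i]                    # L(data|H_j) / L(data|H_i)
print(prior[i] / np.sum(odds_ratios * prior))   # same value as posterior[i]
```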

Bayesian squirrel - there are 2 large areas, and the squirrel buries all of its food at location 1 with probability p1 and all of it at location 2 with probability p2 (p1 + p2 = 1). Let s_i = P(find food in location i | search location i and the squirrel did bury its food there). Assume the squirrel searches the location with the highest value of s_i p_i. Question: if the squirrel searches location 1 and doesn't find food there, should it switch to searching location 2 the next day?

If p1’ is the posterior probability that the food is at location 1 after an unsuccessful search of location 1, then p1’ = p1 (1 - s1) / [p1 (1 - s1) + p2]. Use this to update p1 and p2 each day, choose the location with the highest p_i s_i to search on the next day, and repeat. The table in the book covers the case of an unsuccessful search; if the squirrel does find food in the location it searched, a similar procedure updates the p_i for the next day, but in that case, since the squirrel found the food there, the posterior is p1 = 1.
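
A minimal simulation sketch of this search-and-update cycle, with assumed values for the priors and detection probabilities (the numbers are illustrative, not those from the book's table):

```python
# Bayesian squirrel: update belief about where the food is buried after each
# unsuccessful search, then search the location with the highest s_i * p_i.

p = [0.6, 0.4]        # prior probabilities food is at location 1, 2 (assumed)
s = [0.8, 0.8]        # detection probabilities s_1, s_2 (assumed)

for day in range(1, 6):
    # Decision rule: search the location with the largest s_i * p_i.
    i = 0 if s[0] * p[0] >= s[1] * p[1] else 1
    print(f"day {day}: search location {i + 1}, p = {p[0]:.3f}, {p[1]:.3f}")

    # Assume the search is unsuccessful; Bayes update of p_i.
    miss = 1.0 - s[i]                     # P(no food found | food is at i)
    denom = p[i] * miss + (1.0 - p[i])    # the other location contributes p_j * 1
    p[i] = p[i] * miss / denom
    p[1 - i] = 1.0 - p[i]
```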

The Fisher lament example is meant to show that there are cases where, if we have prior knowledge but do not take a Bayesian view, we get non-intuitive results - e.g. the standard frequentist estimate would put all the probability mass at 0 or 1 no matter what we observe. When there is a discrete number of hypotheses the two approaches are essentially the same (though often there is a continuous parameter, so this does not apply), but there remains the problem of specifying priors when there are no observations.

Binomial case and conjugate priors (infested tree nuts). If we sample S nuts and i of them are infested, with probability p that any given nut is infested, the likelihood has a binomial form. Finding the posterior then involves integrating this likelihood against some prior pdf for p, and if we choose that prior to be a Beta distribution (so it is defined on [0,1]), the text shows that the posterior is also a Beta distribution with updated parameters. This is called a conjugate prior: the posterior belongs to the same family of distributions as the prior.
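
A sketch of this conjugate update for a Beta(a, b) prior and binomial data: observing i infested nuts out of S gives a Beta(a + i, b + S - i) posterior. The prior parameters and counts below are assumed for illustration.

```python
from scipy import stats

# Assumed prior and data: Beta(a, b) prior on p, i infested nuts out of S sampled.
a, b = 2.0, 8.0          # prior Beta parameters (illustrative)
S, i = 20, 7             # sample size and number infested (illustrative)

# Conjugacy: Beta prior + binomial likelihood -> Beta posterior.
a_post, b_post = a + i, b + S - i
posterior = stats.beta(a_post, b_post)

print(f"posterior: Beta({a_post}, {b_post})")
print(f"posterior mean of p: {posterior.mean():.3f}")   # (a + i) / (a + b + S)
```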

Once you have a posterior, you can find a Bayesian confidence (credible) interval for a parameter in a distribution - e.g. you can quantify how confident you are that the “true” parameter for a model falls in some range - just as you would with any distribution. The influence of the prior distribution can be readily overwhelmed by new data (illustrated in Fig 9.2), and the shape of the posterior may not be affected greatly by the shape of the prior (Fig 9.3). Both figures illustrate that new data have great impact.
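
Continuing the Beta posterior sketch above, a 95% Bayesian credible interval can be read directly off the posterior distribution (the 95% level is just the usual convention):

```python
from scipy import stats

# Posterior from the conjugate update above (assumed parameters).
posterior = stats.beta(9.0, 21.0)

# Central 95% credible interval for p: the "true" p lies in this range
# with posterior probability 0.95.
lo, hi = posterior.ppf(0.025), posterior.ppf(0.975)
print(f"95% credible interval for p: ({lo:.3f}, {hi:.3f})")
```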

The generalization of Bayes to continuous densities is that we have some density f(y|θ), where y and θ are vectors of data and parameters, with θ sampled from a prior π(θ|λ) whose parameters λ are called hyperparameters. If λ is known, then Bayesian updating is π(θ|y, λ) = f(y|θ) π(θ|λ) / ∫ f(y|θ) π(θ|λ) dθ. If λ is not known, then updating depends upon a distribution h(λ), the hyperprior.
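
Written out, the two updating rules described here are (a standard presentation of the hierarchy, with λ denoting the hyperparameters and h(λ) the hyperprior):

```latex
% Known hyperparameters lambda:
\pi(\theta \mid y, \lambda)
  = \frac{f(y \mid \theta)\,\pi(\theta \mid \lambda)}
         {\int f(y \mid \theta)\,\pi(\theta \mid \lambda)\, d\theta}

% Unknown hyperparameters, integrated against the hyperprior h(lambda):
\pi(\theta \mid y)
  = \frac{\int f(y \mid \theta)\,\pi(\theta \mid \lambda)\, h(\lambda)\, d\lambda}
         {\int\!\!\int f(y \mid \theta)\,\pi(\theta \mid \lambda)\, h(\lambda)\, d\lambda\, d\theta}
```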

The hyperparameters λ in this setting might specify how the parameters vary in space or time between observations that have some underlying stochasticity. One possible approach is to estimate λ, for example by choosing it to maximize the marginal distribution of the data as a function of λ, that is, by choosing λ to maximize m(y|λ) = ∫ f(y|θ) π(θ|λ) dθ. This gives an estimate λ̂ and an estimated posterior π(θ|y, λ̂). This is called an empirical Bayes approach.
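
A sketch of this empirical Bayes idea in the Beta-binomial setting used earlier: choose the Beta hyperparameters (a, b) to maximize the marginal (beta-binomial) likelihood of several observed counts, then plug the estimates back in to get an estimated posterior for each sample. All data values here are made up for illustration.

```python
import numpy as np
from scipy import optimize, special, stats

# Made-up data: infested counts i_k out of S_k nuts in several independent samples.
i_obs = np.array([3, 7, 2, 5, 9])
S_obs = np.array([20, 25, 15, 20, 30])

def neg_log_marginal(log_ab):
    """Negative log of the beta-binomial marginal likelihood m(y | a, b)."""
    a, b = np.exp(log_ab)                 # optimize on the log scale to keep a, b > 0
    ll = (special.betaln(a + i_obs, b + S_obs - i_obs) - special.betaln(a, b)
          + special.gammaln(S_obs + 1) - special.gammaln(i_obs + 1)
          - special.gammaln(S_obs - i_obs + 1))
    return -np.sum(ll)

# Maximize the marginal likelihood over the hyperparameters (a, b).
res = optimize.minimize(neg_log_marginal, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x)
print(f"empirical Bayes estimates: a = {a_hat:.2f}, b = {b_hat:.2f}")

# Estimated posterior for p in each sample, using the plugged-in hyperparameters.
for i, S in zip(i_obs, S_obs):
    post = stats.beta(a_hat + i, b_hat + S - i)
    print(f"{i}/{S}: posterior mean {post.mean():.3f}")
```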