Bayesian approach to the binomial distribution with a discrete prior
The beta distribution
Bayesian approach to the binomial distribution with a continuous prior
Implementation of the beta distribution
Controversy in frequentist vs. Bayesian approach to inference
In the binomial distribution, we assume that p (hereafter ∏) is constant and we calculate the probability of each number of successes… For a fair coin, p = ∏ = 0.5; I toss the coin 50 times; k = the number of heads:

P(k heads) = (n choose k) * p^k * (1-p)^(n-k) = [n! / (k! * (n-k)!)] * p^k * (1-p)^(n-k)

As the number of coin flips increases, this distribution approaches the normal distribution. The sum of dbinom(0:50, 50, 0.5) is 1, because when I flip a coin 50 times, I will get between 0 and 50 heads, so sum(probability(allEvents)) = 1. P(numHeads | ∏) for every possible number of heads in 50 tosses is a likelihood distribution…
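A quick sketch of this in R (my own illustration, not from the original slides):

# probability of exactly 25 heads in 50 tosses of a fair coin
dbinom(25, size = 50, prob = 0.5)

# the probabilities of all possible outcomes (0 to 50 heads) sum to 1
sum(dbinom(0:50, size = 50, prob = 0.5))    # 1

# the likelihood distribution P(numHeads | pi = 0.5) across all outcomes
numHeads <- 0:50
plot(numHeads, dbinom(numHeads, size = 50, prob = 0.5),
     type = "h", ylab = "probability", main = "Binomial(50, 0.5)")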
Let’s turn the problem around. Given some set of data which we call Y1, Y2, Y3, …, YN, for example ("HTTHTTH") where H is heads and T is tails (we’ll just use "Y" for short to describe our data), how do we calculate P(∏ | Y)? This is our posterior probability distribution given some string of data… We know:

P(∏ | Y) ~ p(Y | ∏) * p(∏)
(posterior ~ likelihood * prior)

Our priors and likelihoods can be continuous or discrete…. We’ll see in a bit that we can use the binomial as the likelihood and the beta distribution as the prior. But let’s consider a discrete prior first….
So let’s say that we are unsure about the true value of ∏. We are 1/3 sure that it is 0.3, 1/3 sure that it is 0.5, and 1/3 sure that it is 0.7. There are 3 possible states in our Bayesian universe. Whatever data we observe, ∏ can only ever have the values 0.3, 0.5 and 0.7.

             prior     Y = "H"       Y = "T"       marginal
∏1 = 0.3     1/3       1/3 * 0.3     1/3 * 0.7     1/3
∏2 = 0.5     1/3       1/3 * 0.5     1/3 * 0.5     1/3
∏3 = 0.7     1/3       1/3 * 0.7     1/3 * 0.3     1/3

Marginal probabilities: p(H) = 1/3 * (0.3 + 0.5 + 0.7) = 0.5 and p(T) = 1/3 * (0.7 + 0.5 + 0.3) = 0.5
If we observe a “Head” (using the joint and marginal probabilities from the table above):

P(∏1 | “H”) = (1/3) * 0.3 / 0.5 = 0.2
P(∏2 | “H”) = (1/3) * 0.5 / 0.5 = 0.3333
P(∏3 | “H”) = (1/3) * 0.7 / 0.5 = 0.4667

We become more sure that the “true” probability is 0.7 and less sure that it is 0.3.
If we observe a “Tail”:

P(∏1 | “T”) = (1/3) * 0.7 / 0.5 = 0.4667
P(∏2 | “T”) = (1/3) * 0.5 / 0.5 = 0.3333
P(∏3 | “T”) = (1/3) * 0.3 / 0.5 = 0.2

We become more sure that the “true” probability is 0.3 and less sure that it is 0.7.
In R…. for a Head
In R…. for a tail
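A sketch of how these single-observation updates might look in R (this is my own illustration, not the course’s exact script):

# discrete prior over the three allowed values of pi
piVals <- c(0.3, 0.5, 0.7)
prior  <- c(1/3, 1/3, 1/3)

# likelihood of each observation under each value of pi
likelihoodHead <- piVals        # P("H" | pi)
likelihoodTail <- 1 - piVals    # P("T" | pi)

# posterior after seeing one head
postHead <- prior * likelihoodHead / sum(prior * likelihoodHead)
postHead    # 0.2000 0.3333 0.4667

# posterior after seeing one tail
postTail <- prior * likelihoodTail / sum(prior * likelihoodTail)
postTail    # 0.4667 0.3333 0.2000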
Observing a “Head” and then a “Tail”. Starting from the posterior after the head (0.2, 0.3333, 0.4667) as the new prior, the marginals become p(H) = 0.2 * 0.3 + 0.3333 * 0.5 + 0.4667 * 0.7 = 0.5533 and p(T) = 0.2 * 0.7 + 0.3333 * 0.5 + 0.4667 * 0.3 = 0.4467, so:

P(∏1 | “HT”) = 0.2 * 0.7 / 0.4467 = 0.313
P(∏2 | “HT”) = 0.3333 * 0.5 / 0.4467 = 0.373
P(∏3 | “HT”) = 0.4667 * 0.3 / 0.4467 = 0.313

Notice that we don’t return to the uniform prior. We are more certain that p(Heads) = 0.5 and less certain that the coin is in either of the other states…
In R for (“HT”)…
Which is the same for (“TH”) although we get there in a different way…
Updating one observation at a time for p(head) = 0.6: https://github.com/afodor/afodor.github.io/blob/master/classes/stats2015/discretPriors.txt
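The linked script is the course’s version; a minimal sketch of the same idea (one flip at a time, simulating a coin with p(head) = 0.6) might look like this:

piVals <- c(0.3, 0.5, 0.7)
posterior <- c(1/3, 1/3, 1/3)    # start from the uniform discrete prior

for (i in 1:1000) {
    isHead <- runif(1) < 0.6                        # simulate a flip with p(head) = 0.6
    likelihood <- if (isHead) piVals else 1 - piVals
    posterior <- posterior * likelihood
    posterior <- posterior / sum(posterior)         # renormalize after each flip
}

posterior
# different runs can end up putting nearly all of the mass on pi = 0.5 or on pi = 0.7,
# since the true value 0.6 is not among the allowed states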
The requirement that ∏ can only ever have values of 0.3, 0.5 and 0.7 is not really appropriate for our model…
These instabilities, together with chance runs at the beginning, lead to different results each time we run the model… Clearly a continuous prior is more appropriate.
Bayesian approach to the binomial distribution with a discrete prior
The beta distribution
Bayesian approach to the binomial distribution with a continuous prior
Implementation of the beta distribution
Controversy in frequentist vs. Bayesian approach to inference
We can use the continuous beta distribution to describe my beliefs about all possible values of ∏ p(∏ | Y) can be given by the beta distribution! http://en.wikipedia.org/wiki/Beta_distribution When used to model the results of the binomial distribution, α is related to the number of successes and β is related to the number of failures….
As usual in R, we have dbeta, pbeta, qbeta and rbeta… We can think of α (shape1 in R) as (number of observed successes + 1) and β (shape2 in R) as (number of observed failures + 1) (proof of that coming up!)
The rule is to add 1 to the number of successes and failures. So we use α and β as the shape constants and the beta distribution gives us the probability density of ∏. In each plot (i.e. for each set of values of α and β), we are holding the results of the experiment constant and varying the possible values of ∏ from 0 to 1.

[Plots: the probability density of the coin generating a head given 10 heads and 40 tails, and given 25 heads and 25 tails]
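A sketch of how those two curves could be drawn in R (my code, following the add-one rule above):

piGrid <- seq(0, 1, 0.01)

# 10 heads and 40 tails: shape1 = 10 + 1, shape2 = 40 + 1
plot(piGrid, dbeta(piGrid, shape1 = 11, shape2 = 41), type = "l",
     xlab = "pi", ylab = "density", main = "10 heads, 40 tails")

# 25 heads and 25 tails: shape1 = 25 + 1, shape2 = 25 + 1
plot(piGrid, dbeta(piGrid, shape1 = 26, shape2 = 26), type = "l",
     xlab = "pi", ylab = "density", main = "25 heads, 25 tails")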
An uninformed prior: my beliefs before I see any data (the uniform distribution!), and then after seeing one head and one tail.

[Plots: 0 heads, 0 tails; 1 head, 1 tail]
If I integrate the beta distribution from 0 to 1, the result is 1. Conceptually, for a given result, the sum of the probabilities of all the possible values of ∏ is 1. The beta function guarantees an integral of 1 over ∏ = {0,1}.
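We can check this numerically in R (a quick sanity check of my own, using one of the examples above):

# the beta density integrates to 1 over pi = [0, 1] for any shape parameters
integrate(function(p) dbeta(p, shape1 = 11, shape2 = 41), lower = 0, upper = 1)
# prints 1 (to within numerical error)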
Bayesian approach to the binomial distribution with a discrete prior
The beta distribution
Bayesian approach to the binomial distribution with a continuous prior
Implementation of the beta distribution
Controversy in frequentist vs. Bayesian approach to inference
Bayes’ law – incorporating new data. We have a prior belief about some distribution. Say we don’t think there is ESP, based on experiments with 18 people (nine people guessed right; nine people guessed wrong). Our prior probability distribution g(∏) = beta(10,10). We have a new set of data (we call it Ynew): 14 people chose right, 11 chose wrong. We want to update our model. For all ∏ along the range 0 to 1, we define p(∏) as the probability density given by the beta prior. Then:

p(∏, Ynew) = p(∏) * p(Ynew | ∏)
p(Ynew, ∏) = p(∏ | Ynew) * p(Ynew)

so

p(∏ | Ynew) = p(∏) * p(Ynew | ∏) / p(Ynew)

If we can calculate this along ∏ = {0,1}, then p(∏ | Ynew) will describe a new distribution, which is our updated belief about all values of ∏ between 0 and 1 given the new data.
For all ∏ along the range 0 to 1:

p(∏ | Ynew) = p(∏) * p(Ynew | ∏) / p(Ynew)

Here p(∏) is the prior probability: what we believe about the probability of each value of ∏ before we see the new data. p(Ynew | ∏) is the “likelihood probability”; in this case, it comes from the binomial. p(∏ | Ynew) is the “posterior” probability: our belief about the probability of each value of ∏ after we see the new data.

What about p(Ynew)? This is the probability of observing our data summed (integrated) across all values of ∏. That is:

p(Ynew) = ∫₀¹ p(∏) * p(Ynew | ∏) d∏
p(∏ | Ynew) = p(∏) * p(Ynew | ∏) / p(Ynew)

We can set any prior distribution we want, but there are good reasons to choose a prior that is beta distributed. a = 10; b = 10 – the “shape” parameters based on our old data…. We choose beta(10,10) as our prior:

beta(10,10) = [Γ(20) / (Γ(10) * Γ(10))] * ∏^9 * (1-∏)^9
p(∏ | Ynew) = p(∏) * p(Ynew | ∏) / p(Ynew)

aold = bold = 10 (our first set of data, where 9 subjects guessed right and 9 guessed wrong, plus 1 for each shape parameter). anew = 14; bnew = 11 – our new data, where 14 guessed right and 11 guessed wrong. We want to calculate our posterior distribution given our new data, p(∏ | Ynew):

(beta prior)           p(∏) = beta(aold, bold) = [Γ(aold+bold) / (Γ(aold) * Γ(bold))] * ∏^(aold-1) * (1-∏)^(bold-1)

(binomial likelihood)  p(Ynew | ∏) = [(anew+bnew)! / (anew! * bnew!)] * ∏^anew * (1-∏)^bnew

(prior * likelihood)   p(∏) * p(Ynew | ∏) = constants * ∏^(aold+anew-1) * (1-∏)^(bold+bnew-1)

(Bayes’ law)           p(∏ | Ynew) = p(∏) * p(Ynew | ∏) / ∫₀¹ p(∏) * p(Ynew | ∏) d∏
p(∏ | Ynew) = p(∏) * p(Ynew | ∏) / ∫₀¹ p(∏) * p(Ynew | ∏) d∏

= constants * ∏^(aold+anew-1) * (1-∏)^(bold+bnew-1) / ∫₀¹ constants * ∏^(aold+anew-1) * (1-∏)^(bold+bnew-1) d∏

= ∏^(aold+anew-1) * (1-∏)^(bold+bnew-1) / ∫₀¹ ∏^(aold+anew-1) * (1-∏)^(bold+bnew-1) d∏

(the constants don’t depend on ∏, so they cancel out of the numerator and the denominator)

Let k' = Γ(aold+anew+bold+bnew) / (Γ(aold+anew) * Γ(bold+bnew)), the normalizing constant of beta(aold+anew, bold+bnew). Multiplying the numerator and the denominator by k':

p(∏ | Ynew) = k' * ∏^(aold+anew-1) * (1-∏)^(bold+bnew-1) / ∫₀¹ k' * ∏^(aold+anew-1) * (1-∏)^(bold+bnew-1) d∏

But this integral is the integral of the beta(aold+anew, bold+bnew) density over ∏ = {0,1}, which is 1.
So we have this rather startling result…

p(∏ | Ynew) = k' * ∏^(aold+anew-1) * (1-∏)^(bold+bnew-1) = beta(aold+anew, bold+bnew)

To update our models, we just add the new successes to aold and the new failures to bold and call dbeta… We have more data, so the variance is smaller. There were a few more successes, so the curve has shifted to the right.
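A sketch of that update in R, using the ESP numbers above (my own illustration, not the course’s code):

piGrid <- seq(0, 1, 0.01)

# prior: beta(10, 10) from the first 9 right / 9 wrong guesses
plot(piGrid, dbeta(piGrid, 10, 10), type = "l", lty = 2,
     xlab = "pi", ylab = "density", ylim = c(0, 6))

# posterior: just add the 14 new successes and 11 new failures
lines(piGrid, dbeta(piGrid, 10 + 14, 10 + 11))
# the posterior is narrower (more data) and shifted slightly right (a few more successes)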
The beta distribution is the conjugate prior of the binomial distribution. Multiplying a beta prior by a binomial likelihood yields a beta posterior: p(∏ | Ynew) = p(∏) * p(Ynew | ∏) / p(Ynew)
If you have no data and no beliefs, you probably want a uniform prior… Remember the uniform distribution? We have no expectations; the prior probability density is always 1 (this is beta(1,1)).
We can watch our Bayesian framework “learn” the distribution. Consider a 3:1 Mendelian phenotype experiment (with perfect data). Pretty sweet!
Our updating R code gets much simpler… https://github.com/afodor/afodor.github.io/blob/master/classes/stats2015/bayesianUpdater.txt
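The linked file has the course’s version; a minimal sketch of the same idea, applied to the 3:1 Mendelian example above, might look like this (variable names are mine):

# start from the uniform prior, which is beta(1, 1), and add perfect 3:1 data
a <- 1
b <- 1
piGrid <- seq(0, 1, 0.01)

for (i in 1:10) {
    a <- a + 3    # three dominant-phenotype offspring per batch
    b <- b + 1    # one recessive-phenotype offspring per batch
    plot(piGrid, dbeta(piGrid, a, b), type = "l",
         xlab = "pi", ylab = "density",
         main = paste(a - 1, "successes,", b - 1, "failures"))
}
# with each batch the posterior narrows around pi = 0.75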
By the law of large numbers, as we get more data, the width of our beta distribution decreases
The application of Bayes’ law always follows the same form. For the loaded die example:

P(Dloaded | 3 sixes) = P(3 sixes | Dloaded) * P(Dloaded) / P(3 sixes)

The posterior is our belief, after seeing the data, that we have a loaded die. The prior, P(Dloaded), is our original belief that we had a loaded die. P(3 sixes | Dloaded) is the likelihood function. The denominator P(3 sixes) is the “integral”: summing over all possible models, P(3 sixes) = P(3 sixes | fair die) * P(fair die) + P(3 sixes | loaded die) * P(loaded die).

The continuous version has exactly the same structure:

p(∏ | Ynew) = p(∏) * p(Ynew | ∏) / p(Ynew)

p(∏) is the prior probability: what we believe about the probability of each value of ∏ before we see the new data. p(Ynew | ∏) is the “likelihood probability”; in this case, it comes from the binomial. p(Ynew) is the integral summing over all values of ∏. p(∏ | Ynew) is the “posterior” probability: our belief about the probability of each value of ∏ after we see the new data.
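As a worked sketch (the numbers here are my assumptions; the slide doesn’t give them), suppose the loaded die rolls a six half the time and our prior belief that the die is loaded is 50%:

pLoaded <- 0.5                        # assumed prior P(loaded die)
pSixIfLoaded <- 0.5                   # assumed P(six | loaded die)
pSixIfFair   <- 1/6

likLoaded <- pSixIfLoaded^3           # P(3 sixes | loaded die)
likFair   <- pSixIfFair^3             # P(3 sixes | fair die)

# the "integral": summing over all possible models
pThreeSixes <- likLoaded * pLoaded + likFair * (1 - pLoaded)

# Bayes' law: posterior = likelihood * prior / marginal
likLoaded * pLoaded / pThreeSixes     # ~0.96: we now strongly suspect a loaded die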
Bayesian approach to the binomial distribution with a discrete prior
The beta distribution
Bayesian approach to the binomial distribution with a continuous prior
Implementation of the beta distribution
Controversy in frequentist vs. Bayesian approach to inference
We can port the code for the beta and gamma functions from Numerical Recipes…
We start with the gamma function…. (or actually lngamma ())
This is straightforward to port… Our results agree with R’s to within rounding error…
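The course’s port isn’t reproduced here; as a sketch, the Lanczos approximation that Numerical Recipes’ gammln() uses can also be written directly in R and checked against the built-in lgamma():

# a rough port of the Lanczos approximation used by Numerical Recipes' gammln()
lnGamma <- function(xx) {
    cof <- c(76.18009172947146, -86.50532032941677, 24.01409824083091,
             -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5)
    x <- xx
    y <- xx
    tmp <- x + 5.5
    tmp <- tmp - (x + 0.5) * log(tmp)
    ser <- 1.000000000190015
    for (j in 1:6) {
        y <- y + 1
        ser <- ser + cof[j] / y
    }
    -tmp + log(2.5066282746310005 * ser / x)
}

# compare against R's built-in lgamma()
lnGamma(4.5)
lgamma(4.5)    # the two agree to roughly ten decimal places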
Likewise, you can port over the beta distribution (which the book implements as the incomplete beta function, betai). So you can easily have access to these distributions in the programming language of your choice.
Bayesian approach to the binomial distribution with a discrete prior
The beta distribution
Bayesian approach to the binomial distribution with a continuous prior
Implementation of the beta distribution
Controversy in frequentist vs. Bayesian approach to inference
http://www.nytimes.com/2011/01/11/science/11esp.html?_r=1&scp=1&sq=esp&st=cse
This is p(527 | coin is fair) / max(p(527 | coin is loaded)). p(people have ESP) / p(people don't) ≈ 4:1 (you would see positive results like these by chance about 25% of the time). This is our first hint of a Bayesian approach to inference. My guess is that other factors (not correcting for multiple tests, not running a two-sided test, not reporting negative results, etc.) mattered more for “ESP” than a “Bayesian” vs. “classical” analysis, but that article gives a sense of some of the arguments.
Coming up: Bayesian vs. frequentist approaches to hypothesis testing for the binomial distribution; numerical approximation in the Bayesian universe; the Poisson distribution and RNA-seq.