1 Bayesian Methods with Monte Carlo Markov Chains I Henry Horng-Shing Lu Institute of Statistics National Chiao Tung University hslu@stat.nctu.edu.tw http://tigpbp.iis.sinica.edu.tw/courses.htm
2 Part 1 Introduction to Bayesian Methods
3 Bayes’ Theorem
Conditional probability: P(A|B) = P(A∩B) / P(B).
One derivation: P(A|B) = P(B|A) P(A) / P(B).
Alternative derivation (by the law of total probability): P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|Aᶜ) P(Aᶜ)].
http://en.wikipedia.org/wiki/Bayes'_theorem
4 False Positive and Negative
Medical diagnosis; Type I and II errors in hypothesis testing in statistical inference:

                                          Actual Status
Diagnosis Test Result      Disease (H1)                          Normal (H0)
Positive (Reject H0)       True Positive (Power, 1-β)            False Positive (Type I Error, α)
Negative (Accept H0)       False Negative (Type II Error, β)     True Negative (Confidence Level, 1-α)

http://en.wikipedia.org/wiki/False_positive
5 Bayesian Inference (1) False positives in a medical test Test accuracy by conditional probabilities: P(Test Positive|Disease) = P(R|H1) = 1-β = 0.99 P(Test Negative|Normal) = P(A|H0) = 1-α = 0.95. Prior probabilities: P(Disease) = P(H1) = 0.001 P(Normal) = P(H0) = 0.999.
6 Bayesian Inference (2)
Posterior probabilities by Bayes’ theorem:
True Positive Probability = P(Disease|Test Positive) = P(H1|R)
= P(R|H1) P(H1) / [P(R|H1) P(H1) + P(R|H0) P(H0)]
= (0.99)(0.001) / [(0.99)(0.001) + (0.05)(0.999)] ≈ 0.019.
False Positive Probability = P(Normal|Test Positive) = P(H0|R) = 1 − 0.019 = 0.981.
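The computation above can be sketched in a few lines. The function names are illustrative, not from the slides; the numbers are the slide's sensitivity (1-β), specificity (1-α), and prior.

```python
# Posterior probability of disease given a positive test, by Bayes' theorem.
def posterior_positive(sens, spec, prior):
    """P(Disease | Test Positive): sens = 1 - beta, spec = 1 - alpha."""
    p_pos = sens * prior + (1 - spec) * (1 - prior)  # total P(Test Positive)
    return sens * prior / p_pos

p = posterior_positive(sens=0.99, spec=0.95, prior=0.001)
print(round(p, 3))  # 0.019
```

Despite the accurate test, the rare prior keeps the posterior near 2%.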
7 Bayesian Inference (3)
Equal prior probabilities: P(Disease) = P(H1) = P(Normal) = P(H0) = 0.5.
Posterior probabilities by Bayes’ theorem:
True Positive Probability = P(Disease|Test Positive) = P(H1|R)
= (1-β)(0.5) / [(1-β)(0.5) + α(0.5)] = 0.99 / (0.99 + 0.05) ≈ 0.952.
With equal priors the posterior is proportional to the likelihood P(R|H1) = 1-β: the priors cancel and the test's accuracy alone drives the conclusion.
http://en.wikipedia.org/wiki/Bayesian_inference
8 Bayesian Inference (4)
In the courtroom: P(Evidence of DNA Match | Guilty) = 1 and P(Evidence of DNA Match | Innocent) = 10⁻⁶.
Based on the evidence other than the DNA match, P(Guilty) = 0.3 and P(Innocent) = 0.7.
By Bayes’ theorem,
P(Guilty | Evidence of DNA Match)
= (1)(0.3) / [(1)(0.3) + (10⁻⁶)(0.7)] = 0.99999766667.
9 Naive Bayes Classifier
The Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes’ theorem with strong (naive) independence assumptions.
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
10 Naive Bayes Probabilistic Model (1)
The probability model for a classifier is a conditional model P(C|F1, …, Fn), where C is a dependent class variable and F1, …, Fn are feature variables. By Bayes’ theorem,
P(C|F1, …, Fn) = P(C) P(F1, …, Fn|C) / P(F1, …, Fn).
11 Naive Bayes Probabilistic Model (2)
Use repeated applications of the definition of conditional probability:
P(C, F1, …, Fn) = P(C) P(F1, …, Fn|C)
= P(C) P(F1|C) P(F2, …, Fn|C, F1)
= P(C) P(F1|C) P(F2|C, F1) P(F3, …, Fn|C, F1, F2)
and so forth. Assume that each Fi is conditionally independent of every other Fj (i ≠ j) given C; this means that P(Fi|C, Fj) = P(Fi|C). So P(C, F1, …, Fn) can be expressed as
P(C, F1, …, Fn) = P(C) ∏ᵢ P(Fi|C).
12 Naive Bayes Probabilistic Model (3)
So P(C|F1, …, Fn) can be expressed as
P(C|F1, …, Fn) = (1/Z) P(C) ∏ᵢ P(Fi|C),
where Z = P(F1, …, Fn) is constant if the values of the feature variables are known.
Constructing a classifier from the probability model:
classify(f1, …, fn) = argmax_c P(C = c) ∏ᵢ P(Fi = fi | C = c).
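The argmax rule above can be sketched directly. The class priors and conditional word probabilities below are illustrative toy values, not from the slides; logs are used so many small probabilities do not underflow.

```python
import math

# Minimal naive Bayes classifier: argmax_c P(c) * prod_i P(f_i | c).
priors = {"spam": 0.4, "ham": 0.6}
cond = {  # P(word | class), hypothetical values
    "spam": {"viagra": 0.30, "meeting": 0.01, "offer": 0.20},
    "ham":  {"viagra": 0.001, "meeting": 0.10, "offer": 0.02},
}

def classify(features):
    scores = {}
    for c in priors:
        # Sum of log-probabilities is the log of the product in the slide.
        scores[c] = math.log(priors[c]) + sum(math.log(cond[c][f]) for f in features)
    return max(scores, key=scores.get)

print(classify(["viagra", "offer"]))  # spam
print(classify(["meeting"]))          # ham
```

Real implementations also smooth the conditional probabilities so unseen words do not zero out a class.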
13 Bayesian Spam Filtering (1)
Bayesian spam filtering, a form of e-mail filtering, is the process of using a Naive Bayes classifier to identify spam email.
References:
http://en.wikipedia.org/wiki/Spam_%28e-mail%29
http://en.wikipedia.org/wiki/Bayesian_spam_filtering
http://www.gfi.com/whitepapers/why-bayesian-filtering.pdf
14 Bayesian Spam Filtering (2)
Probabilistic model:
P(spam | {words}) = P({words} | spam) P(spam) / P({words}),
where {words} means {certain words in spam emails}. Particular words have particular probabilities of occurring in spam emails and in legitimate emails. For instance, most email users will frequently encounter the word “Viagra” in spam emails, but will seldom see it in other emails.
15 Bayesian Spam Filtering (3)
Before mails can be filtered using this method, the user needs to generate a database of words and tokens (such as the $ sign, IP addresses and domains, and so on) collected from a sample of spam mails and valid mails. Once the database is generated, each word in an email contributes to the email's spam probability; this contribution is called the posterior probability and is computed using Bayes’ theorem. The email's spam probability is then computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter marks the email as spam.
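The slides only say that per-word probabilities are combined and compared with a threshold; one common combining rule (Graham-style, assumed here, not specified by the slides) is shown as a sketch:

```python
# Combine per-word spam probabilities into one score and threshold it.
def spam_score(word_probs):
    p_spam, p_ham = 1.0, 1.0
    for p in word_probs:
        p_spam *= p          # product of P(spam | word)
        p_ham *= (1.0 - p)   # product of P(ham | word)
    return p_spam / (p_spam + p_ham)

probs = [0.99, 0.90, 0.30]   # hypothetical P(spam | word) values from a database
score = spam_score(probs)
print(score > 0.95)  # True: mark the email as spam
```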
16 Bayesian Network (1)
A Bayesian network is a compact representation of probability distributions via conditional independence. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms.
http://en.wikipedia.org/wiki/Bayesian_network
http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
http://www.cs.huji.ac.il/~nirf/Nips01-Tutorial/index.html
17 Bayesian Network (2)
Conditional independencies and a graphical language capture the structure of many real-world distributions. The graph structure provides much insight into the domain and allows “knowledge discovery”: Data + Prior Information → Learner.
[Figure: the Cloudy → Sprinkler/Rain → Wet Grass network, with the table P(W | S, R):
 S=T, R=T: P(W=T) = 0.99;  S=T, R=F: P(W=T) = 0.9;  S=F, R=T: P(W=T) = 0.9;  S=F, R=F: P(W=T) = 0.0]
18 Bayesian Network (3)
Qualitative part: a directed acyclic graph (DAG); nodes are random variables, edges are direct influence.
Quantitative part: a set of conditional probability distributions.
Together they define a unique distribution in a factored form.
[Figure: the same Cloudy/Sprinkler/Rain/Wet Grass network with its table P(W | S, R).]
19 Inference
Posterior probabilities: the probability of any event given any evidence.
Most likely explanation: the scenario that explains the evidence.
Rational decision making: maximize expected utility; value of information.
Effect of intervention.
[Figure: the Burglary/Earthquake → Alarm → Call network, with Radio reporting the earthquake.]
20 Example 1 (1)
[Figure: DAG with edges Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet Grass, Rain → Wet Grass.]
21 Example 1 (2) By the chain rule of probability, the joint probability of all the nodes in the graph above is P(C, S, R, W) = P(C) * P(S|C) * P(R|C, S) * P(W|C, S, R). By using conditional independence relationships, we can rewrite this as P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S, R) where we were allowed to simplify the third term because R is independent of S given its parent C, and the last term because W is independent of C given its parents S and R.
22 Example 1 (3)
Bayes’ theorem gives the posterior probability of each explanation of the wet grass:
P(S = 1 | W = 1) = P(S = 1, W = 1) / P(W = 1),
P(R = 1 | W = 1) = P(R = 1, W = 1) / P(W = 1),
where P(W = 1) is a normalizing constant, equal to the probability (likelihood) of the data.
23 Example 1 (4)
Numerically, P(S = 1 | W = 1) ≈ 0.430 and P(R = 1 | W = 1) ≈ 0.708. So we see that it is more likely that the grass is wet because it is raining: the likelihood ratio is 0.708/0.430 = 1.647.
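These posteriors can be checked by enumerating the joint distribution. The CPT for Wet Grass follows the figure on the earlier network slides; the Cloudy, Sprinkler, and Rain tables are the standard values from Murphy's tutorial linked above, assumed here since the slides do not restate them.

```python
from itertools import product

# Exact inference by enumeration in the sprinkler network.
P_C = {True: 0.5, False: 0.5}        # P(C=T)
P_S = {True: 0.1, False: 0.5}        # P(S=T | C)
P_R = {True: 0.8, False: 0.2}        # P(R=T | C)
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}   # P(W=T | S, R)

def joint(c, s, r, w):
    p = P_C[c]
    p *= P_S[c] if s else 1 - P_S[c]
    p *= P_R[c] if r else 1 - P_R[c]
    pw = P_W[(s, r)]
    return p * (pw if w else 1 - pw)

p_w = sum(joint(c, s, r, True) for c, s, r in product([True, False], repeat=3))
p_s_w = sum(joint(c, True, r, True) for c, r in product([True, False], repeat=2))
p_r_w = sum(joint(c, s, True, True) for c, s in product([True, False], repeat=2))
print(round(p_s_w / p_w, 3), round(p_r_w / p_w, 3))  # 0.43 0.708
```

Enumeration is exponential in the number of variables; the point of the factored form is that each joint term is a product of small local tables.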
24 Part 2 MLE vs. Bayesian Methods
25 Maximum Likelihood Estimates (MLEs) vs. Bayesian Methods
Binomial experiments:
http://www.math.tau.ac.il/~nin/Courses/ML04/ml2.ppt
More explanations and examples:
http://www.dina.dk/phd/s/s6/learning2.pdf
26 MLE (1)
Binomial experiments: suppose we toss a coin N times, and the random variable Xi records the outcome (head or tail) of toss i. We denote by θ the (unknown) probability P(Head).
Estimation task: given a sequence of toss samples x1, x2, …, xN, we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ.
27 MLE (2)
The number of heads k we see has a binomial distribution, and thus
P(k heads in N tosses) = C(N, k) θᵏ (1 − θ)^(N−k).
Clearly, the MLE of θ is θ̂ = k/N, which is also equal to the MME (method of moments estimate) of θ.
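As a quick sketch, the MLE is just the sample proportion of heads; the toss sequence below is hypothetical:

```python
# MLE for the binomial parameter: the sample proportion of heads.
tosses = "HTHHTHTHHH"               # hypothetical sequence of N = 10 tosses
k, N = tosses.count("H"), len(tosses)
theta_mle = k / N
print(theta_mle)  # 0.7
```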
28 MLE (3)
Suppose we observe the sequence H, H. The MLE estimate is P(H) = 1, P(T) = 0. Should we really believe that tails are impossible at this stage? Such an estimate can have a disastrous effect: if we assume that P(T) = 0, then we are willing to act as though this outcome is impossible.
29 Bayesian Reasoning
In Bayesian reasoning we represent our uncertainty about the unknown parameter θ by a probability distribution. This probability distribution can be viewed as a subjective probability: a personal judgment of uncertainty.
30 Bayesian Inference
P(θ): the prior distribution over the values of θ.
P(x1, …, xN | θ): the likelihood of the binomial experiment given a known value θ.
Given x1, …, xN, we can compute the posterior distribution of θ:
P(θ | x1, …, xN) = P(x1, …, xN | θ) P(θ) / P(x1, …, xN).
The marginal likelihood is
P(x1, …, xN) = ∫ P(x1, …, xN | θ) P(θ) dθ.
http://www.dina.dk/phd/s/s6/learning2.pdf
31 Binomial Example (1)
In the binomial experiment, the unknown parameter is θ = P(H).
Simplest prior: P(θ) = 1 for 0 < θ < 1 (uniform prior).
Likelihood: P(x1, …, xN | θ) = θᵏ (1 − θ)^(N−k), where k is the number of heads in the sequence.
Marginal likelihood: P(x1, …, xN) = ∫₀¹ θᵏ (1 − θ)^(N−k) dθ.
32 Binomial Example (2)
Using integration by parts, we have
∫₀¹ θᵏ (1 − θ)^(N−k) dθ = [(N − k)/(k + 1)] ∫₀¹ θ^(k+1) (1 − θ)^(N−k−1) dθ.
Multiplying both sides by C(N, k) and noting that C(N, k)(N − k)/(k + 1) = C(N, k + 1), we get
C(N, k) ∫₀¹ θᵏ (1 − θ)^(N−k) dθ = C(N, k + 1) ∫₀¹ θ^(k+1) (1 − θ)^(N−k−1) dθ.
33 Binomial Example (3)
The recursion terminates when k = N:
C(N, N) ∫₀¹ θᴺ dθ = 1/(N + 1).
Thus, for every k,
P(x1, …, xN) = ∫₀¹ θᵏ (1 − θ)^(N−k) dθ = 1/[(N + 1) C(N, k)] = k! (N − k)! / (N + 1)!.
We conclude that the posterior is
P(θ | x1, …, xN) = [(N + 1)! / (k! (N − k)!)] θᵏ (1 − θ)^(N−k), i.e., Beta(k + 1, N − k + 1).
34 Binomial Example (4)
How do we predict (estimate θ) using the posterior? We can think of this as computing the probability of the next element in the sequence:
P(X_{N+1} = H | x1, …, xN) = ∫₀¹ P(X_{N+1} = H | θ) P(θ | x1, …, xN) dθ.
Assumption: if we know θ, the probability of X_{N+1} is independent of x1, …, xN.
35 Binomial Example (5)
Thus, we conclude that
P(X_{N+1} = H | x1, …, xN) = ∫₀¹ θ P(θ | x1, …, xN) dθ = (k + 1)/(N + 2)
(Laplace’s rule of succession).
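A one-line sketch of this posterior predictive makes the contrast with the MLE concrete: after observing H, H the Bayesian prediction is 3/4, not 1, so tails is no longer judged impossible.

```python
# Posterior predictive for the binomial with a uniform prior:
# P(X_{N+1} = H | k heads in N tosses) = (k + 1) / (N + 2).
def predict_heads(k, N):
    return (k + 1) / (N + 2)

print(predict_heads(2, 2))  # 0.75, vs. the MLE's 1.0 for the sequence H, H
```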
36 Beta Prior (1)
The uniform prior distribution is a particular case of the Beta distribution. Its general form is
P(θ) = [Γ(α1 + α0) / (Γ(α1) Γ(α0))] θ^(α1−1) (1 − θ)^(α0−1),
where s = α1 + α0; we write it as Beta(α1, α0). The expected value of the parameter is
E[θ] = α1 / s.
The uniform is Beta(1, 1).
37 Beta Prior (2)
There are important theoretical reasons for using the Beta prior distribution. One of them also has important practical consequences: it is the conjugate distribution of binomial sampling. If the prior is Beta(α1, α0) and we have observed some data with N1 and N0 cases for the two possible values of the variable, then the posterior is also Beta, with parameters (α1 + N1, α0 + N0). The expected value for the posterior distribution is
E[θ | data] = (α1 + N1) / (s + N), where N = N1 + N0.
38 Beta Prior (3)
The values α1/s and α0/s represent the prior probabilities for the values of the variable, based on our past experience. The value s = α1 + α0 is called the equivalent sample size; it measures the importance of our past experience. Larger values of s give the prior probabilities more importance.
39 Beta Prior (4)
When s → 0, the posterior expectation (α1 + N1)/(s + N) tends to N1/N, and we recover the maximum likelihood estimate.
40 Multinomial Experiments
Now assume that we have a variable X taking values in a finite set {a1, …, an}, and a series of independent observations of this distribution, (x1, x2, …, xm); we want to estimate the values θi = P(ai), i = 1, …, n.
Let Ni be the number of cases in the sample in which we have obtained the value ai (i = 1, …, n).
The MLE of θi is θ̂i = Ni / m.
The problems with small samples are completely analogous to the binomial case.
41 Dirichlet Prior (1)
We can also follow the Bayesian approach, but the prior distribution over (θ1, …, θn) is now the Dirichlet distribution, a generalization of the Beta distribution to more than 2 cases. The density of D(α1, …, αn) is
P(θ1, …, θn) = [Γ(s) / (Γ(α1) ⋯ Γ(αn))] θ1^(α1−1) ⋯ θn^(αn−1),
where s = α1 + ⋯ + αn is the equivalent sample size.
42 Dirichlet Prior (2)
The expected vector is (α1/s, …, αn/s). A greater value of s makes this distribution more concentrated around the mean vector.
43 Dirichlet Posterior
If we have a set of data with counts (N1, …, Nn), then the posterior distribution is also Dirichlet, with parameters (α1 + N1, …, αn + Nn).
The Bayesian estimates of the probabilities are
θ̂i = (αi + Ni) / (s + m), where m = N1 + ⋯ + Nn.
44 Multinomial Example
Imagine that we have an urn with balls of different colors, red (R), blue (B), and green (G), in unknown quantities. Assume that we pick balls with replacement and obtain the sequence (B, B, R, R, B). If we assume a Dirichlet prior distribution with parameters D(1, 1, 1), then the estimated frequencies for red, blue, and green are
((1 + 2)/8, (1 + 3)/8, (1 + 0)/8) = (3/8, 4/8, 1/8).
Observe that green has a positive probability, even though it never appears in the sequence.
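The urn calculation above can be sketched as:

```python
from collections import Counter

# Dirichlet posterior estimates: prior D(1,1,1), sequence (B, B, R, R, B).
alpha = {"R": 1, "B": 1, "G": 1}
counts = Counter(["B", "B", "R", "R", "B"])
s, m = sum(alpha.values()), sum(counts.values())

# theta_i = (alpha_i + N_i) / (s + m); Counter returns 0 for unseen colors.
estimates = {c: (alpha[c] + counts[c]) / (s + m) for c in alpha}
print(estimates)  # {'R': 0.375, 'B': 0.5, 'G': 0.125}
```

The prior pseudo-count of 1 per color is what keeps green's estimate away from zero.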
45 Part 3 An Example in Genetics
46 Example 1 in Genetics (1)
Two linked loci with alleles A and a, and B and b.
A, B: dominant; a, b: recessive.
A double heterozygote AaBb will produce gametes of four types: AB, Ab, aB, ab.
F (Female): parental gametes with frequency (1 − r′)/2 each, recombinant gametes with frequency r′/2 each (r′ = female recombination fraction).
M (Male): parental gametes with frequency (1 − r)/2 each, recombinant gametes with frequency r/2 each (r = male recombination fraction).
47 Example 1 in Genetics (2)
r and r′ are the recombination rates for male and female. Suppose the parental origin of these heterozygotes is from the mating AB/ab × AB/ab. The problem is to estimate r and r′ from the offspring of selfed heterozygotes.
Fisher, R. A. and Balmukand, B. (1928). The estimation of linkage from the offspring of selfed heterozygotes. Journal of Genetics, 20, 79–92.
http://en.wikipedia.org/wiki/Genetics
http://www2.isye.gatech.edu/~brani/isyebayes/bank/handout12.pdf
48 Example 1 in Genetics (3)

                                          MALE
FEMALE        | AB (1-r)/2         | ab (1-r)/2         | aB r/2         | Ab r/2
AB (1-r')/2   | AABB (1-r)(1-r')/4 | aABb (1-r)(1-r')/4 | aABB r(1-r')/4 | AABb r(1-r')/4
ab (1-r')/2   | AaBb (1-r)(1-r')/4 | aabb (1-r)(1-r')/4 | aaBb r(1-r')/4 | Aabb r(1-r')/4
aB r'/2       | AaBB (1-r)r'/4     | aabB (1-r)r'/4     | aaBB r r'/4    | AabB r r'/4
Ab r'/2       | AABb (1-r)r'/4     | aAbb (1-r)r'/4     | aABb r r'/4    | AAbb r r'/4
49 Example 1 in Genetics (4)
Four distinct phenotypes: A*B*, A*b*, a*B*, and a*b*.
A*: the dominant phenotype from (Aa, AA, aA); a*: the recessive phenotype from aa.
B*: the dominant phenotype from (Bb, BB, bB); b*: the recessive phenotype from bb.
A*B*: 9 gametic combinations. A*b*: 3 gametic combinations. a*B*: 3 gametic combinations. a*b*: 1 gametic combination. Total: 16 combinations.
50 Example 1 in Genetics (5)
Collecting the cells of the mating table by phenotype, the four phenotype probabilities in terms of r and r′ are
P(A*B*) = [2 + (1 − r)(1 − r′)]/4,
P(A*b*) = P(a*B*) = [1 − (1 − r)(1 − r′)]/4,
P(a*b*) = (1 − r)(1 − r′)/4.
51 Example 1 in Genetics (6)
Hence, a random sample of n from the offspring of selfed heterozygotes will follow a multinomial distribution:
P(y1, y2, y3, y4) = [n! / (y1! y2! y3! y4!)] [(2 + (1−r)(1−r′))/4]^y1 [(1 − (1−r)(1−r′))/4]^(y2+y3) [(1−r)(1−r′)/4]^y4,
where y1 + y2 + y3 + y4 = n.
52 Bayesian for Example 1 in Genetics (1)
To simplify computation, we let φ = (1 − r)(1 − r′). The random sample of n from the offspring of selfed heterozygotes then follows a multinomial distribution with cell probabilities
((2 + φ)/4, (1 − φ)/4, (1 − φ)/4, φ/4).
53 Bayesian for Example 1 in Genetics (2)
Assume a Dirichlet prior distribution with parameters (α1, α2, α3, α4) to estimate the probabilities of A*B*, A*b*, a*B*, and a*b*. Recall that
A*B*: 9 gametic combinations; A*b*: 3; a*B*: 3; a*b*: 1.
We therefore consider the prior D(9, 3, 3, 1).
54 Bayesian for Example 1 in Genetics (3)
Suppose that we observe the data y = (y1, y2, y3, y4) = (125, 18, 20, 24). The posterior distribution is then also Dirichlet, with parameters D(134, 21, 23, 25). The Bayesian estimates of the probabilities are
(134/203, 21/203, 23/203, 25/203) = (0.660, 0.103, 0.113, 0.123).
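The conjugate update above is a one-liner per cell:

```python
# Dirichlet posterior for the four phenotype classes in Example 1.
alpha = [9, 3, 3, 1]                 # prior D(9, 3, 3, 1)
y = [125, 18, 20, 24]                # observed counts

post = [a + n for a, n in zip(alpha, y)]   # posterior D(134, 21, 23, 25)
total = sum(post)
est = [round(p / total, 3) for p in post]  # posterior mean of each probability
print(post, est)  # [134, 21, 23, 25] [0.66, 0.103, 0.113, 0.123]
```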
55 Bayesian for Example 1 in Genetics (4)
Consider the original model: the random sample of n also follows a multinomial distribution, with cell probabilities written in terms of r and r′. We will assume Beta prior distributions for r and r′.
56 Bayesian for Example 1 in Genetics (5)
The posterior distribution becomes
P(r, r′ | y) ∝ P(y | r, r′) P(r) P(r′).
The integration in the denominator,
∫₀¹ ∫₀¹ P(y | r, r′) P(r) P(r′) dr dr′,
does not have a closed form.
57 Bayesian for Example 1 in Genetics (6)
How can we solve this problem? With the Monte Carlo Markov Chain (MCMC) method! And what values are appropriate for the Beta prior parameters?
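As a preview of Part 4, the intractable posterior can be sampled with a random-walk Metropolis sketch. The slides leave the Beta prior parameters open, so flat Beta(1, 1) priors on both r and r′ are assumed here, making the target proportional to the multinomial likelihood alone.

```python
import math
import random

y = [125, 18, 20, 24]  # observed phenotype counts (A*B*, A*b*, a*B*, a*b*)

def log_post(r, rp):
    """Log posterior of (r, r') up to a constant, under flat priors."""
    if not (0.0 < r < 1.0 and 0.0 < rp < 1.0):
        return -math.inf
    phi = (1.0 - r) * (1.0 - rp)
    probs = [(2 + phi) / 4, (1 - phi) / 4, (1 - phi) / 4, phi / 4]
    return sum(n * math.log(p) for n, p in zip(y, probs))

random.seed(0)
r, rp, lp = 0.5, 0.5, log_post(0.5, 0.5)
samples = []
for _ in range(20000):
    r_new = r + random.gauss(0.0, 0.05)    # symmetric Gaussian proposal
    rp_new = rp + random.gauss(0.0, 0.05)
    lp_new = log_post(r_new, rp_new)
    if random.random() < math.exp(min(0.0, lp_new - lp)):  # Metropolis accept
        r, rp, lp = r_new, rp_new, lp_new
    samples.append((r, rp))

burned = samples[5000:]                    # discard burn-in
mean_r = sum(s[0] for s in burned) / len(burned)
print(round(mean_r, 2))                    # posterior mean of r
```

Note that the data only constrain the product φ = (1 − r)(1 − r′), so the marginal posteriors of r and r′ individually are wide.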
58 Part 4 Monte Carlo Methods
59 Monte Carlo Methods (1)
Consider the game of solitaire: what’s the chance of winning with a properly shuffled deck?
[Figure: four sampled hands, of which one wins; the empirical chance of winning is 1 in 4.]
http://en.wikipedia.org/wiki/Monte_Carlo_method
http://nlp.stanford.edu/local/talks/mcmc_2004_07_01.ppt
60 Monte Carlo Methods (2) Hard to compute analytically because winning or losing depends on a complex procedure of reorganizing cards Insight: why not just play a few hands, and see empirically how many do in fact win? More generally, can approximate a probability density function using only samples from that density
61 Monte Carlo Methods (3)
Given a very large set X and a distribution f(x) over it, we draw a set of N i.i.d. random samples x⁽¹⁾, …, x⁽ᴺ⁾ from f. We can then approximate the distribution using these samples:
f(x) ≈ (1/N) Σᵢ 1(x⁽ⁱ⁾ = x).
62 Monte Carlo Methods (4)
We can also use these samples to compute expectations:
E[g(X)] ≈ (1/N) Σᵢ g(x⁽ⁱ⁾),
and even use them to find a maximum:
argmaxₓ f(x) ≈ the x⁽ⁱ⁾ with the largest f(x⁽ⁱ⁾).
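The sample-average approximation of an expectation can be sketched with a case whose answer is known in closed form: for X ~ Uniform(0, 1) and g(x) = x², the true value of E[g(X)] is 1/3.

```python
import random

# Approximate E[g(X)] by the average of g over i.i.d. samples.
random.seed(1)
N = 100_000
samples = [random.random() for _ in range(N)]   # X ~ Uniform(0, 1)
estimate = sum(x * x for x in samples) / N      # g(x) = x**2
print(round(estimate, 2))  # close to 1/3
```

The error of such an estimate shrinks like 1/sqrt(N), independent of dimension.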
63 Monte Carlo Example
Given a quantity defined as an integral or expectation, find its value.
Solution: use the Monte Carlo method to approximate it by simulation.
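The specific quantity this slide asks for did not survive extraction, so as a stand-in illustration, here is the classic Monte Carlo estimate of π: sample points uniformly in the unit square and count the fraction landing inside the quarter circle.

```python
import random

# Estimate pi: P(point in quarter circle) = pi/4 for uniform points in [0,1]^2.
random.seed(2)
N = 100_000
inside = sum(random.random() ** 2 + random.random() ** 2 <= 1.0
             for _ in range(N))
pi_hat = 4 * inside / N
print(round(pi_hat, 2))  # close to pi
```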
64 Exercises
Write your own programs similar to the examples presented in this talk.
Write programs for the examples mentioned at the reference web pages.
Write programs for other examples that you know.