Chapter 12 Probabilistic Reasoning and Bayesian Belief Networks
Chapter 12 Contents

- Probabilistic Reasoning
- Joint Probability Distributions
- Bayes' Theorem
- Simple Bayesian Concept Learning
- Bayesian Belief Networks
- The Noisy-∨ Function
- Bayes' Optimal Classifier
- The Naïve Bayes Classifier
- Collaborative Filtering
Probabilistic Reasoning

- Probabilities are expressed in a notation similar to that of predicates in FOPC:
  - P(S) = 0.5
  - P(T) = 1
  - P(¬(A ∧ B) ∨ C) = 0.2
- A probability of 1 means the event is certain; 0 means it is certainly not the case.
Conditional Probability

- Conditional probability refers to the probability of one thing given that we already know another to be true:

  P(B | A) = P(A ∧ B) / P(A)

- This states the probability of B, given A.
Joint Probability Distributions

- A joint probability distribution represents the combined probabilities of two or more variables, e.g.:

          B       ¬B
  A       0.11    0.63
  ¬A      0.09    0.17

- This table shows, for example, that P(A ∧ B) = 0.11 and P(¬A ∧ B) = 0.09.
- Using this, we can calculate P(A):

  P(A) = P(A ∧ B) + P(A ∧ ¬B) = 0.11 + 0.63 = 0.74
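The marginalisation step above can be sketched in Python. The table values are those from the slide; representing the table as a dictionary keyed by truth values is just one convenient encoding:

```python
# Joint distribution over Boolean variables A and B, using the
# table values from the slide, keyed by the (a, b) truth values.
joint = {
    (True, True): 0.11,    # P(A ∧ B)
    (True, False): 0.63,   # P(A ∧ ¬B)
    (False, True): 0.09,   # P(¬A ∧ B)
    (False, False): 0.17,  # P(¬A ∧ ¬B)
}

def marginal_A(joint):
    """P(A) = P(A ∧ B) + P(A ∧ ¬B): sum out the other variable."""
    return sum(p for (a, b), p in joint.items() if a)

print(round(marginal_A(joint), 2))  # 0.74
```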
Bayes' Theorem

- Bayes' theorem lets us calculate a conditional probability:

  P(B | A) = P(A | B) P(B) / P(A)

- P(B) is the prior probability of B.
- P(B | A) is the posterior probability of B, given A.
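As a minimal sketch, the theorem is a one-line computation. The numbers below are made up purely for illustration:

```python
def bayes(p_a_given_b, p_b, p_a):
    """Posterior P(B | A) = P(A | B) * P(B) / P(A)."""
    return p_a_given_b * p_b / p_a

# Illustrative, made-up numbers: prior P(B) = 0.1,
# likelihood P(A | B) = 0.8, evidence P(A) = 0.2.
posterior = bayes(p_a_given_b=0.8, p_b=0.1, p_a=0.2)
print(round(posterior, 2))  # 0.4
```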
Simple Bayesian Concept Learning (1)

- P(H | E) represents the probability that some hypothesis, H, is true, given evidence E.
- Suppose we have a set of hypotheses H1 … Hn. For each Hi:

  P(Hi | E) = P(E | Hi) P(Hi) / P(E)

- Hence, given a piece of evidence, a learner can determine which is the most likely explanation by finding the hypothesis that has the highest posterior probability.
Simple Bayesian Concept Learning (2)

- In fact, this can be simplified. Since P(E) is independent of Hi, it has the same value for each hypothesis.
- Hence, it can be ignored, and we can instead find the hypothesis with the highest value of:

  P(E | Hi) P(Hi)

- We can simplify this further if all the hypotheses are equally likely, in which case we simply seek the hypothesis with the highest value of P(E | Hi). This is the likelihood of E given Hi.
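Both selection rules above reduce to an argmax. A sketch with entirely hypothetical priors and likelihoods for three hypotheses:

```python
# Hypothetical priors P(Hi) and likelihoods P(E|Hi) for three hypotheses.
hypotheses = {
    "H1": {"prior": 0.3, "likelihood": 0.2},
    "H2": {"prior": 0.5, "likelihood": 0.4},
    "H3": {"prior": 0.2, "likelihood": 0.9},
}

# Highest P(E|Hi) * P(Hi); P(E) is common to all hypotheses, so dropped.
map_h = max(hypotheses,
            key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])

# If priors are assumed equal, maximise the likelihood P(E|Hi) alone.
ml_h = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"])

print(map_h, ml_h)  # H2 H3
```

Note the two rules can disagree: H3 explains the evidence best, but H2 wins once its larger prior is taken into account.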
Bayesian Belief Networks (1)

- A belief network shows the dependencies between a group of variables.
- Two variables A and B are independent if the likelihood that A will occur has nothing to do with whether B occurs.
- In the example network, C and D are dependent on A; D and E are dependent on B.
- The Bayesian belief network has a probability associated with each link. E.g., P(C | A) = 0.2, P(C | ¬A) = 0.4.
Bayesian Belief Networks (2)

- A complete set of probabilities for this belief network might be:
  - P(A) = 0.1
  - P(B) = 0.7
  - P(C | A) = 0.2
  - P(C | ¬A) = 0.4
  - P(D | A ∧ B) = 0.5
  - P(D | A ∧ ¬B) = 0.4
  - P(D | ¬A ∧ B) = 0.2
  - P(D | ¬A ∧ ¬B) =
  - P(E | B) = 0.2
  - P(E | ¬B) = 0.1
Bayesian Belief Networks (3)

- We can now calculate conditional probabilities. By the chain rule:

  P(A ∧ B ∧ C ∧ D ∧ E) = P(E | A ∧ B ∧ C ∧ D) P(D | A ∧ B ∧ C) P(C | A ∧ B) P(B | A) P(A)

- In fact, we can simplify this, since there are no dependencies between certain pairs of variables (between E and A, for example). Hence:

  P(A ∧ B ∧ C ∧ D ∧ E) = P(E | B) P(D | A ∧ B) P(C | A) P(B) P(A)
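A sketch of the simplified calculation, using the probabilities listed on the previous slide (the unstated value of P(D | ¬A ∧ ¬B) is not needed for this particular query):

```python
# Probabilities from the previous slide.
P_A, P_B = 0.1, 0.7
P_C_given = {True: 0.2, False: 0.4}           # keyed by the value of A
P_D_given = {(True, True): 0.5, (True, False): 0.4,
             (False, True): 0.2}              # keyed by (A, B)
P_E_given = {True: 0.2, False: 0.1}           # keyed by the value of B

# P(A ∧ B ∧ C ∧ D ∧ E) = P(E|B) P(D|A∧B) P(C|A) P(B) P(A)
p = P_E_given[True] * P_D_given[(True, True)] * P_C_given[True] * P_B * P_A
print(p)
```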
Bayes' Optimal Classifier

- A system that uses Bayes' theory to classify data.
- We have a piece of data y, and are seeking the correct hypothesis from H1 … H5, each of which assigns a classification to y.
- The probability that y should be classified as cj is:

  P(cj | x1, …, xn) = Σ (i = 1 to m) P(cj | Hi) P(Hi | x1, …, xn)

- x1 to xn are the training data, and m is the number of hypotheses.
- This method provides the best possible classification for a piece of data.
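The sum above is a posterior-weighted vote among the hypotheses. A sketch with made-up posteriors for three hypotheses classifying y as "+" or "-":

```python
# Made-up posteriors P(Hi | x1..xn) and each hypothesis's class
# probabilities P(cj | Hi) for classes "+" and "-".
posterior_h = {"H1": 0.4, "H2": 0.3, "H3": 0.3}
class_given_h = {
    "H1": {"+": 1.0, "-": 0.0},
    "H2": {"+": 0.0, "-": 1.0},
    "H3": {"+": 0.0, "-": 1.0},
}

def optimal_classification(posterior_h, class_given_h):
    # P(cj | x1..xn) = sum over i of P(cj | Hi) * P(Hi | x1..xn)
    totals = {}
    for h, p_h in posterior_h.items():
        for c, p_c in class_given_h[h].items():
            totals[c] = totals.get(c, 0.0) + p_c * p_h
    return max(totals, key=totals.get)

print(optimal_classification(posterior_h, class_given_h))  # -
```

Note that although H1 is the single most probable hypothesis and would classify y as "+", the combined weight of the other hypotheses makes "-" the optimal classification.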
The Naïve Bayes Classifier (1)

- A vector of data is assigned a single classification:

  P(ci | d1, …, dn)

- The classification with the highest posterior probability is chosen.
- The hypothesis with the highest posterior probability is the maximum a posteriori, or MAP, hypothesis. In this case, we are looking for the MAP classification.
- Bayes' theorem is used to find the posterior probability:

  P(ci | d1, …, dn) = P(d1, …, dn | ci) P(ci) / P(d1, …, dn)
The Naïve Bayes Classifier (2)

- Since P(d1, …, dn) is a constant, independent of ci, we can eliminate it, and simply aim to find the classification ci for which the following is maximised:

  P(d1, …, dn | ci) P(ci)

- We now assume that all the attributes d1, …, dn are independent, so P(d1, …, dn | ci) can be rewritten as:

  P(d1 | ci) × P(d2 | ci) × … × P(dn | ci)

- The classification for which this product, multiplied by P(ci), is highest is chosen to classify the data.
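The whole classifier therefore reduces to an argmax over products. A sketch with toy, made-up probabilities for two classes and three attribute values:

```python
from math import prod

# Toy, made-up training-derived values: class priors P(ci) and, for
# each class, the conditionals P(dj | ci) of the observed attributes.
priors = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": [0.8, 0.7, 0.1],
    "ham":  [0.2, 0.3, 0.6],
}

def naive_bayes_classify(priors, cond):
    # Choose the ci maximising P(ci) * product over j of P(dj | ci).
    return max(priors, key=lambda c: priors[c] * prod(cond[c]))

print(naive_bayes_classify(priors, cond))  # spam
```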
Collaborative Filtering

- A method that uses Bayesian reasoning to suggest items that a person might be interested in, based on their known interests.
- For example, if we know that Anne and Bob both like A, B and C, and that Anne likes D, then we guess that Bob would also like D.
- These probabilities can be calculated using decision trees.
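The intuition can be sketched with a simple overlap rule. This is an illustration of the idea only, not the decision-tree calculation the slide refers to; the names and likes are made up:

```python
# Illustration only: recommend to a user the items liked by other
# users whose known likes overlap with theirs.
likes = {
    "Anne": {"A", "B", "C", "D"},
    "Bob":  {"A", "B", "C"},
}

def recommend(user, likes):
    """Suggest items liked by users who share at least one like with `user`."""
    suggestions = set()
    for other, items in likes.items():
        if other != user and likes[user] & items:
            suggestions |= items - likes[user]
    return suggestions

print(recommend("Bob", likes))  # {'D'}
```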