Naive Bayes Classifier
COMP3710 Artificial Intelligence
Computing Science, Thompson Rivers University
Course Outline
Part I – Introduction to Artificial Intelligence
Part II – Classical Artificial Intelligence
Part III – Machine Learning
  Introduction to Machine Learning
  Neural Networks
  Probabilistic Reasoning and Bayesian Belief Networks
  Artificial Life: Learning through Emergent Behavior
Part IV – Advanced Topics
Learning Outcomes
...
Finding a maximum a posteriori (MAP) hypothesis
Use of the naive Bayes classifier
Use of the Bayes optimal classifier
...
Unit Outline
1. Maximum a posteriori (MAP)
2. Naive Bayes classifier
3. Bayes optimal classifier
References
...
Textbook
Bayesian Learning – http://129.252.11.88/talks/bayesianlearning/
1. Maximum a Posteriori (MAP)
View learning as Bayesian updating of a probability distribution over the hypothesis space.
H is the hypothesis variable, with values h1, h2, ...
E.g., given a positive lab test result, does the patient have cancer (h1) or not have cancer (h2)?
Another example: my car won't start. Is the starter bad? Is the fuel pump bad?
We just need to know whether the chance of having cancer is higher than the chance of not having cancer, and whether the chance of a bad starter is higher than the chance of a bad fuel pump.
We do not need to compute the exact probabilities P(cancer | lab_test), P(¬cancer | lab_test), P(bad_starter | wont_start), P(bad_fuel_pump | wont_start).
Generally we want the most probable hypothesis given the training data.
H is the hypothesis variable, with values h1, h2, ...
Generally we want the most probable hypothesis given the training data D.
The maximum a posteriori (MAP) hypothesis is
  hMAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h)
If we assume that all hypotheses have the same prior probability, we can simplify even further and choose the maximum likelihood (ML) hypothesis
  hML = argmax_{h ∈ H} P(D | h)
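Written out as a short derivation (standard Bayes-rule steps; P(D) can be dropped because it does not depend on h):

```latex
\begin{align*}
h_{\mathrm{MAP}} &= \arg\max_{h \in H} P(h \mid D)
  = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)}
  = \arg\max_{h \in H} P(D \mid h)\,P(h) \\
h_{\mathrm{ML}} &= \arg\max_{h \in H} P(D \mid h)
  \qquad \text{(when all hypotheses have equal priors)}
\end{align*}
```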
P(cancer | +) = ?
A patient takes a lab test and the result comes back positive.
The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present.
Furthermore, 0.008 of the entire population have this cancer.
[Q] If a new patient comes in with a positive test result, what is the chance that he has the cancer, i.e., P(cancer | +)?
[Q] So, among the two hypotheses (cancer, ¬cancer) given +, which is more probable?
[Q] If a new patient comes in with a positive test result, what is the chance that he has cancer?
So, among the two hypotheses (cancer, ¬cancer) given +:
  P(+ | cancer) P(cancer) = 0.98 × 0.008 = 0.0078
  P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298
Hence hMAP = ¬cancer, and P(cancer | +) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21.
Actually we don't have to compute P(+) to decide hMAP.
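A minimal Python sketch of this comparison (variable names are mine, not from the slides):

```python
# MAP decision for the lab-test example: compare P(+ | h) P(h) for both hypotheses.
p_cancer = 0.008              # prior P(cancer)
p_pos_given_cancer = 0.98     # P(+ | cancer)
p_neg_given_no_cancer = 0.97  # P(- | no cancer)

# Unnormalized posteriors P(+ | h) * P(h)
score_cancer = p_pos_given_cancer * p_cancer                    # 0.00784
score_no_cancer = (1 - p_neg_given_no_cancer) * (1 - p_cancer)  # 0.02976

h_map = "cancer" if score_cancer > score_no_cancer else "no cancer"
print(h_map)  # -> no cancer

# Normalizing gives the actual posterior, without ever modelling P(+) separately:
print(score_cancer / (score_cancer + score_no_cancer))  # ~0.21
```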
2. Naive Bayes Classifier
When to use:
  There is a large set of training examples.
  The attributes that describe instances are conditionally independent given the classification.
It has been used in many applications, such as diagnosis and the classification of text documents.
[Q] Why do we not use the k-nearest neighbor algorithm?
  A large set of training examples means more distance computations.
  It is not always possible to compute distances, especially when attributes are ordinal or categorical.
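As a concrete illustration, here is a minimal from-scratch sketch of a naive Bayes classifier for categorical attributes (my own illustrative code, not from the course materials; it uses simple relative-frequency estimates with no smoothing):

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, class_label) pairs."""
    class_counts = Counter(label for _, label in examples)
    # cond_counts[class][attribute_index][value] = count
    cond_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, label in examples:
        for j, value in enumerate(attrs):
            cond_counts[label][j][value] += 1
    return class_counts, cond_counts

def classify(attrs, class_counts, cond_counts):
    """Return the class c maximizing P(c) * prod_j P(d_j | c), estimated by counting."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total                          # prior P(c)
        for j, value in enumerate(attrs):
            score *= cond_counts[c][j][value] / n_c  # P(d_j | c); unseen value -> 0
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Tiny made-up data set, just to show the calling convention.
data = [((2, 3, 2), "A"), ((2, 2, 2), "A"), ((4, 1, 3), "B"), ((3, 3, 4), "C")]
model = train_naive_bayes(data)
print(classify((2, 3, 2), *model))  # -> A
```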
[Q] E.g., given (2, 3, 4) in the table below, what is the classification?
P(A) = ?  P(B) = ?  P(C) = ?
[Training table of 15 examples x1–x15 with attributes d1, d2, d3 and class labels A, B, C]
[Q] E.g., given (2, 3, 4) in the table below, what is the classification?
P(A) = 8/15  P(B) = 4/15  P(C) = 3/15
[Training table of 15 examples x1–x15 with attributes d1, d2, d3 and class labels A, B, C]
[Q] E.g., given d = (2, 3, 4) in the table, what is the classification?
A vector of data is classified with a single classification.
We want P(ci | d1, ..., dn), where d = (d1, ..., dn) and each dj is an attribute value.
The classification with the highest posterior probability is chosen; in this case, we are looking for the MAP classification.
Since P(d1, ..., dn) is a constant, independent of ci, we can eliminate it and simply aim to find the classification ci for which the following is maximized:
  cMAP = argmax_ci P(d1, ..., dn | ci) P(ci)
We now assume that all the attributes d1, ..., dn are conditionally independent given the class, so that cMAP can be rewritten as
  cMAP = argmax_ci P(ci) × P(d1 | ci) × ... × P(dn | ci)
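The same derivation written out in LaTeX (standard steps, reconstructing the chain the slide describes):

```latex
\begin{align*}
c_{\mathrm{MAP}} &= \arg\max_{c_i} P(c_i \mid d_1, \ldots, d_n)
  = \arg\max_{c_i} \frac{P(d_1, \ldots, d_n \mid c_i)\,P(c_i)}{P(d_1, \ldots, d_n)} \\
 &= \arg\max_{c_i} P(d_1, \ldots, d_n \mid c_i)\,P(c_i)
  \quad \text{(the denominator does not depend on } c_i\text{)} \\
 &= \arg\max_{c_i} P(c_i) \prod_{j=1}^{n} P(d_j \mid c_i)
  \quad \text{(conditional independence of the } d_j \text{ given } c_i\text{)}
\end{align*}
```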
From the training data,
  P(A) = 8/15; P(B) = 4/15; P(C) = 3/15
[Q] For example, for x1 = (2, 3, 2):
  P(A) × P(d1=2 | A) × P(d2=3 | A) × P(d3=2 | A) = 8/15 × 5/8 × 2/8 × 2/8
  P(B) × P(d1=2 | B) × P(d2=3 | B) × P(d3=2 | B) = 4/15 × 1/4 × 1/4 × 0/4
  P(C) × P(d1=2 | C) × P(d2=3 | C) × P(d3=2 | C) = 3/15 × 1/3 × 2/3 × 0/3
  cMAP for x1 = A
[Q] For example, for y = (2, 3, 4)?
[Q] For example, for y = (4, 3, 2)?
[Training table of 15 examples x1–x15 with attributes d1, d2, d3 and class labels A, B, C]
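A quick numeric check of the x1 = (2, 3, 2) case, using only the class priors and conditional fractions read off the slide (the full training table is not reproduced):

```python
from functools import reduce

# Prior followed by the three conditional probabilities for x1 = (2, 3, 2),
# taken directly from the fractions on the slide.
scores = {
    "A": [8/15, 5/8, 2/8, 2/8],
    "B": [4/15, 1/4, 1/4, 0/4],
    "C": [3/15, 1/3, 2/3, 0/3],
}

products = {c: reduce(lambda a, b: a * b, factors) for c, factors in scores.items()}
print(products)                         # A: ~0.0208, B and C: 0.0
print(max(products, key=products.get))  # -> A
```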
3. Bayes' Optimal Classifier
Given a new instance y, what is the most probable classification?
[Q] For example, suppose there are three possible hypotheses h1, h2, h3 given a training set X, and two classes: +, –.
Given a new instance y, what is the most probable classification of y, + or –?
The probability that the new item of data y should be classified with classification cj is defined by the following:
  P(cj | X) = Σ_{hi ∈ H} P(cj | hi) P(hi | X)
P(cj | hi) means that hi says y is classified to cj with this probability.
The optimal classification for y is the classification cj for which P(cj | X) is the highest.
In our case, c1 = +, c2 = –.
The hypothesis h1 classifies y as +, while h2 and h3 classify y as –.
Hence P(+ | X) = P(h1 | X) and P(– | X) = P(h2 | X) + P(h3 | X), and the combined posterior weight of h2 and h3 is larger.
The optimal classification for y is –.
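A small Python sketch of the Bayes optimal combination rule; the hypothesis posteriors used here are made-up placeholders (the slide's actual values are not shown), chosen so that h1 is the single most probable hypothesis yet the optimal classification is still –:

```python
# Bayes optimal classification: weight each hypothesis's vote by its posterior.
# Hypothetical posteriors P(h_i | X); note that h1 alone is the MAP hypothesis.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# P(c | h_i): each hypothesis classifies y deterministically.
votes = {"h1": {"+": 1.0, "-": 0.0},
         "h2": {"+": 0.0, "-": 1.0},
         "h3": {"+": 0.0, "-": 1.0}}

# P(c | X) = sum_i P(c | h_i) P(h_i | X)
p_class = {c: sum(votes[h][c] * posterior[h] for h in posterior) for c in ("+", "-")}
print(p_class)                        # {'+': 0.4, '-': 0.6}
print(max(p_class, key=p_class.get))  # -> "-"
```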
Another Example
Suppose there are five kinds of bags of candies:
  h1: 100% cherry candies
  h2: 75% cherry candies + 25% lime candies
  h3: 50% cherry candies + 50% lime candies
  h4: 25% cherry candies + 75% lime candies
  h5: 100% lime candies
The prior probabilities of the bag types are 10%: h1, 20%: h2, 40%: h3, 20%: h4, and 10%: h5.
Then we observe 10 candies drawn from some bag, all lime (d = l, l, ..., l):
[Q] MAP hypothesis – What kind of bag is it?
[Q] Bayes optimal classification – What flavor will the next candy be?
  P(l | d) = P(l | h1) P(h1 | d) + P(l | h2) P(h2 | d) + P(l | h3) P(h3 | d) + P(l | h4) P(h4 | d) + P(l | h5) P(h5 | d)
  P(c | d) = …
  P(hi | d) = α P(d | hi) P(hi) = α P(l, l, l, l, l, l, l, l, l, l | hi) P(hi) = α P(l | hi)^10 P(hi)
  e.g., P(h4 | d) = α P(l | h4)^10 P(h4) = α × 0.75^10 × 0.2
Posterior probability of hypotheses, e.g., after observing two lime candies d1, d2:
  P(h4 | d1, d2) = P(d1, d2 | h4) P(h4) / P(d1, d2)
  P(d1, d2) = Σi P(d1, d2 | hi) P(hi) = 0^2 × 0.1 + 0.25^2 × 0.2 + 0.5^2 × 0.4 + 0.75^2 × 0.2 + 1^2 × 0.1 = 0.325
  P(h4 | d1, d2) = 0.75 × 0.75 × 0.2 / 0.325 ≈ 0.35
In general, P(hi | d) = α P(d | hi) P(hi).
Prediction probability: P(l | d), the probability that the next candy is lime, given the observed candies d.
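A short Python sketch of the whole candy example, computing the hypothesis posteriors after the 10 observed lime candies and the Bayes optimal prediction for the next candy (my own illustrative code):

```python
# Five bag hypotheses: prior P(h_i) and P(lime | h_i), as given in the example.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def posteriors_after(n_limes):
    """P(h_i | d) after observing n_limes lime candies in a row."""
    unnormalized = [p * (l ** n_limes) for p, l in zip(priors, p_lime)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

post = posteriors_after(10)
print([round(p, 4) for p in post])                   # h5 dominates after 10 limes
print("MAP bag: h%d" % (post.index(max(post)) + 1))  # -> h5

# Bayes optimal prediction: P(next = lime | d) = sum_i P(lime | h_i) P(h_i | d)
p_next_lime = sum(l * p for l, p in zip(p_lime, post))
print(round(p_next_lime, 4))                         # close to 1 after 10 limes
```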