Bayesian Learning
Evgueni Smirnov
Overview
- Bayes Theorem
- Maximum A Posteriori Hypothesis
- Naïve Bayes Classifier
- Learning Text Classifiers
Thomas Bayes (1702-1761)
His theory of probability was set out in 1764. His conclusions were accepted by Laplace in 1781, rediscovered by Condorcet, and remained unchallenged until Boole questioned them.
Bayes Theorem
Goal: to determine the posterior probability P(h|D) of hypothesis h given the data D from:
- Prior probability of h, P(h): reflects any background knowledge we have about the chance that h is a correct hypothesis (before having observed the data).
- Prior probability of D, P(D): reflects the probability that training data D will be observed given no knowledge about which hypothesis h holds.
- Conditional probability of observation D, P(D|h): denotes the probability of observing data D in some world in which hypothesis h holds.
Bayes Theorem
Posterior probability of h, P(h|D): represents the probability that h holds given the observed training data D. It reflects our confidence that h holds after we have seen the training data D, and it is the quantity that data-mining researchers are interested in.
Bayes Theorem allows us to compute P(h|D):
P(h|D) = P(D|h) P(h) / P(D)
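As a quick illustration, here is a minimal Python sketch of the formula above; the function name and the probability values in the example call are illustrative placeholders, not taken from the slides.

def posterior(prior_h, likelihood_d_given_h, prior_d):
    # P(h|D) = P(D|h) * P(h) / P(D)
    return likelihood_d_given_h * prior_h / prior_d

# Hypothetical numbers: P(h) = 0.3, P(D|h) = 0.8, P(D) = 0.5
print(posterior(0.3, 0.8, 0.5))  # 0.48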
Maximum a Posteriori Hypothesis (MAP)
In many learning scenarios, the learner considers a set of hypotheses H and is interested in finding the most probable hypothesis h ∈ H given the observed data D. Any such hypothesis is called a maximum a posteriori (MAP) hypothesis:
h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h)
(the term P(D) is dropped because it does not depend on h).
Example
Consider a cancer test with two outcomes: positive [+] and negative [-]. The test returns a correct positive result in 98% of the cases in which the disease is actually present, and a correct negative result in 97% of the cases in which the disease is not present. Furthermore, 0.008 of all people have this cancer.
P(cancer) = 0.008         P(¬cancer) = 0.992
P([+] | cancer) = 0.98    P([-] | cancer) = 0.02
P([+] | ¬cancer) = 0.03   P([-] | ¬cancer) = 0.97
A patient got a positive test [+]. The maximum a posteriori hypothesis is found by comparing:
P([+] | cancer) P(cancer) = 0.98 × 0.008 = 0.0078
P([+] | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298
h_MAP = ¬cancer
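The same calculation can be reproduced in a few lines of Python; the dictionary names are illustrative. P([+]) is omitted because it is identical for both hypotheses and does not affect the argmax.

# Prior probabilities and P([+] | h) for both hypotheses
priors = {"cancer": 0.008, "not cancer": 0.992}
p_pos_given = {"cancer": 0.98, "not cancer": 0.03}

# Compare P([+] | h) * P(h) for each hypothesis h
scores = {h: p_pos_given[h] * priors[h] for h in priors}
h_map = max(scores, key=scores.get)

print(scores)  # {'cancer': 0.00784, 'not cancer': 0.02976}
print(h_map)   # not cancer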
Naïve Bayes Classifier
Let each instance x of a training set D be described by a conjunction of n attribute values a_1, a_2, ..., a_n, and let V be a finite set of possible classes (concepts). The naïve Bayes classifier assigns to x the most probable class:
v_NB = argmax_{v_j ∈ V} P(v_j) ∏_i P(a_i | v_j)
The naïve Bayes assumption is that the attributes are conditionally independent given the class, i.e. P(a_1, ..., a_n | v_j) = ∏_i P(a_i | v_j).
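A minimal sketch of this decision rule in Python, assuming the class priors and the conditional probabilities P(a_i = value | v) have already been estimated; the data structures and names below are illustrative assumptions, not the slides' notation.

from math import prod

def naive_bayes_classify(instance, priors, cond):
    # instance: dict mapping attribute -> value
    # priors:   dict mapping class v -> P(v)
    # cond:     dict mapping (attribute, value, class) -> P(attribute = value | class)
    scores = {
        v: priors[v] * prod(cond[(a, val, v)] for a, val in instance.items())
        for v in priors
    }
    # Return the class with the highest score (the argmax over V)
    return max(scores, key=scores.get)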
Example
Consider the weather data, where we have to classify a new instance described by its Outlook, Temperature, Humidity and Windy values. The task is to predict the value (yes or no) of the concept PlayTennis. We apply the naïve Bayes rule:
v_NB = argmax_{v ∈ {yes, no}} P(v) P(Outlook|v) P(Temperature|v) P(Humidity|v) P(Windy|v)
Example: Estimating Probabilities
Outlook:   P(sunny|yes) = 2/9      P(sunny|no) = 3/5
           P(overcast|yes) = 4/9   P(overcast|no) = 0
           P(rain|yes) = 3/9       P(rain|no) = 2/5
Temp:      P(hot|yes) = 2/9        P(hot|no) = 2/5
           P(mild|yes) = 4/9       P(mild|no) = 2/5
           P(cool|yes) = 3/9       P(cool|no) = 1/5
Humidity:  P(high|yes) = 3/9       P(high|no) = 4/5
           P(normal|yes) = 6/9     P(normal|no) = 1/5
Windy:     P(true|yes) = 3/9       P(true|no) = 3/5
           P(false|yes) = 6/9      P(false|no) = 2/5
Priors:    P(yes) = 9/14           P(no) = 5/14
Example
Multiplying the class prior by the corresponding conditional probabilities for the instance gives a larger value for no than for yes. Thus, the naïve Bayes classifier assigns the value no to PlayTennis!
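A sketch of the calculation in Python, assuming the test instance is the classic <Outlook = sunny, Temperature = cool, Humidity = high, Windy = true> (the specific instance did not survive extraction) and plugging in the estimates from the previous slide:

# P(yes) * P(sunny|yes) * P(cool|yes) * P(high|yes) * P(true|yes)
p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9
# P(no) * P(sunny|no) * P(cool|no) * P(high|no) * P(true|no)
p_no = 5/14 * 3/5 * 1/5 * 4/5 * 3/5

print(round(p_yes, 4), round(p_no, 4))  # 0.0053 0.0206
print("no" if p_no > p_yes else "yes")  # no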
Estimating Probabilities
To estimate the probability P(A=v|C) of an attribute value A = v for a given class C we use:
- Relative frequency: n_c / n, where n_c is the number of training instances that belong to the class C and have value v for the attribute A, and n is the number of training instances of the class C;
- m-estimate of accuracy: (n_c + m·p) / (n + m), where n_c and n are as above, p is the prior probability of the value, P(A=v), and m is the weight given to p.
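The two estimators above are simple enough to state directly in Python. The example call uses P(Outlook = overcast | no) from the weather data, with a hypothetical uniform prior p = 1/3 (Outlook has three values) and weight m = 3; these parameter choices are assumptions for illustration.

def relative_frequency(n_c, n):
    # n_c: instances of class C with A = v; n: instances of class C
    return n_c / n

def m_estimate(n_c, n, p, m):
    # Smooths the relative frequency towards the prior p with weight m
    return (n_c + m * p) / (n + m)

# P(overcast|no): the relative frequency is 0/5 = 0, which would zero out any
# product it appears in; the m-estimate keeps the estimate strictly positive.
print(relative_frequency(0, 5))      # 0.0
print(m_estimate(0, 5, p=1/3, m=3))  # 0.125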
Learning to Classify Text
Each document is represented by a vector of word attributes w_k; the value of attribute w_k is the frequency with which the word occurs in the text. To estimate the probability P(w_k | v) we use:
P(w_k | v) = (n_k + 1) / (n + |Vocabulary|)
where n is the total number of word positions in all the documents (instances) whose target value is v, n_k is the number of times the word w_k is found in these n word positions, and |Vocabulary| is the total number of distinct words found in the training data.
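A small Python sketch of this estimate, using a hypothetical toy corpus; the documents, class labels, and function name are made up for illustration.

from collections import Counter

docs = [("spam", "win money now"), ("spam", "win a prize"), ("ham", "meeting now")]

def word_probs(docs, target_class):
    # All word positions in documents of the target class
    words_in_class = [w for c, text in docs if c == target_class for w in text.split()]
    n = len(words_in_class)                                # total word positions for this class
    counts = Counter(words_in_class)                       # n_k for each word w_k
    vocab = {w for _, text in docs for w in text.split()}  # distinct words in the training data
    return {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}

print(word_probs(docs, "spam")["win"])  # (2 + 1) / (6 + 6) = 0.25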
Summary
- Bayesian methods provide the basis for probabilistic learning methods that use knowledge about the prior probabilities of hypotheses and about the probability of observing the data given a hypothesis.
- Bayesian methods can be used to determine the most probable hypothesis given the data.
- The naïve Bayes classifier is useful in many practical applications.