Bayesian and Markov Test


1 Bayesian and Markov Test

2 Naïve Bayesian Classifier
Assumptions: the attributes are independent given the class, and the classes are mutually exclusive and exhaustive. The classifier is called "naïve" because of the independence assumption. It has been empirically proven to be useful and scales very well.

3 Bayesian Test Suppose there are classes C = {A_i, i = 1, …, m} with sample space D = ∪_i A_i. Let A ∈ C. Then we have P(A|D) = P(D|A)P(A) / Σ_i P(D|A_i)P(A_i).

4 Bayes Theorem
P(h|D) = P(D|h)P(h) / P(D), where P(h) is the prior probability of hypothesis h, P(D) is the prior probability of the training data D, P(h|D) is the probability of h given the training data D, and P(D|h) is the probability of D given h.

5 Choosing the most probable hypothesis h_0
For given data D we would like to find the most probable a posteriori hypothesis h_0 such that h_0 = arg max_h P(h|D) = arg max_h P(D|h)P(h) / P(D). If P(h) is constant, then h_0 = arg max_h P(D|h). In general we may maximize the numerator, which is the joint probability P(D,h).
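As a minimal sketch of this rule (with made-up priors and likelihoods, not values from the slides), the MAP hypothesis can be picked by maximizing the joint P(D,h):

```python
# Hypothetical priors P(h) and likelihoods P(D|h) for one observed data set D.
priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihoods = {"h1": 0.10, "h2": 0.40, "h3": 0.35}

# P(D) is the same for every h, so maximizing the joint P(D,h) = P(D|h)P(h)
# gives the same h_0 as maximizing the posterior P(h|D).
joint = {h: likelihoods[h] * priors[h] for h in priors}
evidence = sum(joint.values())                        # P(D), needed only for the posteriors
posteriors = {h: joint[h] / evidence for h in joint}

h_0 = max(joint, key=joint.get)
print(posteriors)  # {'h1': 0.208..., 'h2': 0.5, 'h3': 0.291...}
print(h_0)         # 'h2'
```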

6 Useful properties 1. Conditional probability: P(A∩B) = P(A|B)P(B) = P(B|A)P(A). 2. P(A∪B) = P(A) + P(B) − P(A∩B). 3. If A_1, A_2, …, A_n are mutually exclusive, then P(B) = Σ_i P(B|A_i)P(A_i), where Σ_i P(A_i) = 1.

7 In most ML problems we have a vector x = (x_1, x_2, …, x_d) and an instance space X = ∪_i A_i, in which the A_i's are classes. Problem: find the most probable class A_k such that x ∈ A_k. Answer: we define hypotheses h_i that x ∈ A_i for each i. Then we need to find the k maximizing P(A_k|x) = P(x|A_k)P(A_k) / Σ_i P(x|A_i)P(A_i). The numerator is P(A_k, x) = P(x_1, x_2, …, x_d, A_k). We assume that the features x_i are conditionally independent given the class.

8 By the chain rule of conditional probability we have
P(x_1,x_2,…,x_d,A_k) = P(x_1|x_2,…,x_d,A_k) P(x_2,…,x_d,A_k) = P(x_1|x_2,…,x_d,A_k) P(x_2|x_3,…,x_d,A_k) P(x_3,…,x_d,A_k) = … = P(x_1|x_2,…,x_d,A_k) P(x_2|x_3,…,x_d,A_k) P(x_3|x_4,…,x_d,A_k) … P(x_d|A_k) P(A_k). The conditional independence assumption gives P(x_i|x_{i+1},…,x_d,A_k) = P(x_i|A_k), and so we have P(x_1,x_2,…,x_d,A_k) = P(x_1|A_k) P(x_2|A_k) … P(x_d|A_k) P(A_k).

9 Therefore we have P(A_k|x) = (P(A_k)/P(x)) Π_i P(x_i|A_k), and we choose the k that maximizes this quantity. P(x) is constant once the data vector x is known, so it can be ignored in the comparison.
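A minimal sketch of this decision rule, with illustrative (made-up) class priors and per-feature conditional tables; the product is computed in log space to avoid underflow, and P(x) is dropped since it is common to all classes:

```python
import math

# Illustrative (made-up) class priors P(A_k) and per-feature conditional
# probabilities P(x_i = value | A_k) for two discrete features.
priors = {"A1": 0.6, "A2": 0.4}
cond = {
    "A1": [{"yes": 0.8, "no": 0.2}, {"hot": 0.3, "cold": 0.7}],
    "A2": [{"yes": 0.4, "no": 0.6}, {"hot": 0.6, "cold": 0.4}],
}

def log_score(cls, x):
    # Work in log space so that multiplying many small P(x_i|A_k) does not underflow.
    s = math.log(priors[cls])
    for i, value in enumerate(x):
        s += math.log(cond[cls][i][value])
    return s

x = ("yes", "cold")
best = max(priors, key=lambda cls: log_score(cls, x))
print(best)  # the class maximizing P(A_k) * prod_i P(x_i|A_k); P(x) cancels out
```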

10 Gaussian classifier Sometimes an attribute x is continuous and normally distributed. Suppose class C has a Gaussian distribution with mean μ_c and variance σ_c^2. If we have a test value v, then the conditional probability is P(x=v|C) = (1/√(2πσ_c^2)) exp(−(v−μ_c)^2 / (2σ_c^2)).
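A short sketch of this density as a function, using illustrative class parameters μ_c and σ_c^2 (in practice they would be estimated from the training examples of class C):

```python
import math

def gaussian_likelihood(v, mu_c, sigma2_c):
    """P(x = v | C) when attribute x in class C is modelled as N(mu_c, sigma2_c)."""
    return math.exp(-(v - mu_c) ** 2 / (2.0 * sigma2_c)) / math.sqrt(2.0 * math.pi * sigma2_c)

# Illustrative class parameters, not values from the slides.
print(gaussian_likelihood(v=5.0, mu_c=4.2, sigma2_c=1.5))
```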

11 Multinomial classifier
First we consider drawing n balls from balls of k different colors, with probability p_i of drawing the i-th color, where p_1 + p_2 + … + p_k = 1. Then the probability of drawing exactly x_i balls of color i, with x_1 + … + x_k = n, is n!/(x_1! x_2! … x_k!) · p_1^{x_1} … p_k^{x_k}. Example: n = 10, p_1 = p_2 = 0.3, p_3 = 0.4.
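A small sketch of this multinomial probability; only n = 10 and the p_i come from the slide's example, the split of the 10 draws into counts (3, 3, 4) is an assumed outcome:

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of drawing exactly counts[i] balls of color i in n = sum(counts) draws."""
    n = sum(counts)
    coef = factorial(n) // prod(factorial(x) for x in counts)   # n! / (x_1! ... x_k!)
    return coef * prod(p ** x for p, x in zip(probs, counts))

# n = 10, p_1 = p_2 = 0.3, p_3 = 0.4, with an assumed outcome of (3, 3, 4) draws per color.
print(multinomial_pmf(counts=[3, 3, 4], probs=[0.3, 0.3, 0.4]))
```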

12 Multinomial naïve Bayesian classifier
With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial ( p_1 , … , p_k ), where p_i is the probability that event i occurs (or K such multinomials in the multiclass case). A feature vector x = ( x_1 , … , x_n ) is then a histogram, with x_i counting the number of times event i was observed in a particular instance. This is the event model typically used for document classification, with events representing the occurrence of a word in a single document (see bag of words assumption).

13 The likelihood of observing a histogram x given class C_k is p(x|C_k) = ((Σ_i x_i)! / Π_i x_i!) Π_i p_{ki}^{x_i}. The multinomial naive Bayes classifier becomes a linear classifier when expressed in log-space: up to a term that is the same for every class, log p(C_k|x) is log p(C_k) + Σ_i x_i log p_{ki}, which is linear in the counts x_i.
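A minimal sketch of this log-space score; the vocabulary, word probabilities p_{ki} and class priors are illustrative placeholders, and the multinomial coefficient is omitted because it is the same for every class:

```python
import math

# Hypothetical 3-word vocabulary: class priors P(C_k) and per-class word
# probabilities p_{ki} (each row sums to 1), assumed to be already estimated.
class_priors = {"spam": 0.4, "ham": 0.6}
word_probs = {
    "spam": [0.5, 0.3, 0.2],
    "ham":  [0.2, 0.3, 0.5],
}

def log_posterior_score(cls, x):
    # log p(C_k) + sum_i x_i * log p_{ki}; the multinomial coefficient and p(x)
    # are the same for every class, so they are dropped from the comparison.
    return math.log(class_priors[cls]) + sum(
        xi * math.log(p) for xi, p in zip(x, word_probs[cls])
    )

x = [4, 1, 0]  # histogram of word counts for one document
print(max(class_priors, key=lambda c: log_posterior_score(c, x)))
```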

14 If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero. This is problematic because it will wipe out all information in the other probabilities when they are multiplied. Therefore, it is often desirable to incorporate a small-sample correction, called pseudocount, in all probability estimates such that no probability is ever set to be exactly zero. This way of regularizing naive Bayes is called Laplace smoothing when the pseudocount is one, and Lidstone smoothing in the general case.
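A small sketch of this correction for the per-class word probabilities, with made-up counts; alpha = 1 gives Laplace smoothing and other alpha values give Lidstone smoothing:

```python
def smoothed_probs(word_counts, alpha=1.0):
    """Lidstone-smoothed estimates of P(word i | class); alpha = 1 is Laplace smoothing."""
    total = sum(word_counts)
    V = len(word_counts)
    return [(c + alpha) / (total + alpha * V) for c in word_counts]

# Made-up counts for a 3-word vocabulary in one class; the zero count never becomes 0.
print(smoothed_probs([10, 0, 5]))             # Laplace: the zero count becomes 1/18
print(smoothed_probs([10, 0, 5], alpha=0.1))  # Lidstone with alpha = 0.1
```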

15 Bernoulli naïve Bayesian
In the multivariate Bernoulli event model, features are independent booleans (binary variables) describing inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term occurrence features are used rather than term frequencies. If x_i is a boolean expressing the occurrence or absence of the i-th term from the vocabulary, then the likelihood of a document given a class C_k is p(x|C_k) = Π_i p_{ki}^{x_i} (1 − p_{ki})^{(1 − x_i)}, where p_{ki} is the probability of class C_k generating term i.

16 Text classification using the Bernoulli Bayesian classifier
Let b be the feature vector for the k-th document D_k; the i-th element of b, written b_i, is either 0 or 1, representing the absence or presence of word w_i in the k-th document. Let P(w_i |C) be the probability of word w_i occurring in a document of class C; the probability of w_i not occurring in a document of this class is then (1 − P(w_i |C)). If we make the naive Bayes assumption, that the probability of each word occurring in the document is independent of the occurrences of the other words, then we can write the document likelihood P(D_k |C) in terms of the individual word likelihoods P(w_i |C): P(D_k|C) ≈ P(b|C) = Π_{i=1}^{|V|} [b_i P(w_i|C) + (1 − b_i)(1 − P(w_i|C))].

17 This product goes over all words in the vocabulary
This product goes over all words in the vocabulary. If word w_i is present, then b_i = 1 and the required probability is P(w_i |C); if word w_i is not present, then b_i = 0 and the required probability is 1 − P(w_i |C). We can imagine this as a model for generating document feature vectors of class C, in which the document feature vector is modelled as a collection of |V| weighted coin tosses, the i-th having a probability of success equal to P(w_i |C).
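A minimal sketch of this product over the vocabulary, using an assumed 4-word vocabulary and illustrative values of P(w_i|C):

```python
from math import prod

# Assumed 4-word vocabulary with illustrative values of P(w_i|C).
word_given_class = [0.8, 0.1, 0.5, 0.3]   # P(w_i | C) for i = 1, ..., |V|
b = [1, 0, 1, 0]                          # b_i: presence/absence of word w_i in document D_k

# P(D_k|C) ~ P(b|C) = prod_i [ b_i P(w_i|C) + (1 - b_i)(1 - P(w_i|C)) ]
likelihood = prod(bi * p + (1 - bi) * (1 - p) for bi, p in zip(b, word_given_class))
print(likelihood)  # 0.8 * 0.9 * 0.5 * 0.7 = 0.252
```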

18 Comparison between the multinomial and Bernoulli models in text classification (Hiroshi Shimodaira)
1. Underlying model of text: Bernoulli: a document can be thought of as being generated from a multidimensional Bernoulli distribution: the probability of a word being present can be thought of as a (weighted) coin flip with probability P(w_i |C). Multinomial: a document is formed by drawing words from a multinomial distribution: you can think of obtaining the next word in the document by rolling a (weighted) |V|-sided die with probabilities P(w_i |C).

19 2. Document representation:
Bernoulli: binary vector, elements indicating presence or absence of a word. Multinomial: integer vector, elements indicating frequency of occurrence of a word. 3. Multiple occurrences of words: Bernoulli: ignored. Multinomial: taken into account.

20 4. Behaviour with document length:
Bernoulli: best for short documents. Multinomial: longer documents are OK. 5. Behaviour with “the”: Bernoulli: since “the” is present in almost every document, P(“the” |C) ~ 1.0. Multinomial: since probabilities are based on relative frequencies of word occurrence in a class, P(“the” |C) ~ 0.05.

21 Hidden Markov Bayesian classifier (Wikipedia)
A hidden Markov chain can be considered a submodel of the Bayesian classifier for dynamic sequential data. The diagram on the slide shows the general architecture of an instantiated HMM. Each oval shape represents a random variable that can adopt any of a number of values. The random variable x(t) is the hidden state at time t (with the model from the diagram, x(t) ∈ { x1, x2, x3 }). The random variable y(t) is the observation at time t (with y(t) ∈ { y1, y2, y3, y4 }). The arrows in the diagram (often called a trellis diagram) denote conditional dependencies.

22 From the diagram, it is clear that the conditional probability distribution of the hidden variable x(t) at time t, given the values of the hidden variable x at all times, depends only on the value of the hidden variable x(t − 1); the values at time t − 2 and before have no influence. This is called the Markov property. Similarly, the value of the observed variable y(t) only depends on the value of the hidden variable x(t) (both at time t). Hidden Markov models are especially known for their application in temporal pattern recognition such as speech, handwriting and gesture recognition, part-of-speech tagging, musical score following, partial discharges and bioinformatics.
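As a sketch of how these conditional dependencies are used in practice, the forward algorithm below computes P(y(1), …, y(T)) while only ever conditioning x(t) on x(t−1) and y(t) on x(t); the transition, emission and initial probabilities are illustrative, not taken from the slide's diagram:

```python
states = ["x1", "x2"]
observations = ["y1", "y2", "y3"]          # observed sequence
init = {"x1": 0.6, "x2": 0.4}              # P(x(1))
trans = {                                  # P(x(t) | x(t-1)) -- Markov property
    "x1": {"x1": 0.7, "x2": 0.3},
    "x2": {"x1": 0.4, "x2": 0.6},
}
emit = {                                   # P(y(t) | x(t)) -- observation depends only on x(t)
    "x1": {"y1": 0.5, "y2": 0.4, "y3": 0.1},
    "x2": {"y1": 0.1, "y2": 0.3, "y3": 0.6},
}

# alpha[s] = P(y(1), ..., y(t), x(t) = s)
alpha = {s: init[s] * emit[s][observations[0]] for s in states}
for y in observations[1:]:
    alpha = {
        s: emit[s][y] * sum(alpha[r] * trans[r][s] for r in states)
        for s in states
    }
print(sum(alpha.values()))  # P(y1, y2, y3) under the model
```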

23 Thank you for your attention
