Bayesian and Markov Test

Naïve Bayesian Classifier Assumptions: The attributes are independent given the class; the classifier is called "naïve" because of this assumption. The classes are mutually exclusive and exhaustive. Despite these simplifications, the classifier is empirically proven to be useful and scales very well.

Bayesian Test Suppose there are C = {A_i, i = 1, …, m} classes with sample space D = ∪_i A_i. Let A ∈ C. Then we have
P(A|D) = P(D|A) P(A) / Σ_i P(D|A_i) P(A_i).

Bayes Theorem P(h|D) = P(D|h) P(h) / P(D), where P(h) is the prior probability of hypothesis h; P(D) is the prior probability of the training data D; P(h|D) is the probability of h given the training data D; and P(D|h) is the probability of D given h.
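As a concrete illustration (not from the slides), here is a minimal Python sketch that applies the theorem to two hypothetical hypotheses; all numbers are made up, and the point is only to show how P(h|D) is obtained from P(D|h), P(h), and P(D).

```python
# Hypothetical example: two hypotheses, "spam" and "not_spam".
# The numbers below are invented purely for illustration.

p_h = {"spam": 0.2, "not_spam": 0.8}          # prior P(h)
p_D_given_h = {"spam": 0.7, "not_spam": 0.1}  # likelihood P(D|h) of observing data D

# P(D) by the law of total probability: sum_h P(D|h) P(h)
p_D = sum(p_D_given_h[h] * p_h[h] for h in p_h)

# Posterior P(h|D) = P(D|h) P(h) / P(D) for each hypothesis
posterior = {h: p_D_given_h[h] * p_h[h] / p_D for h in p_h}

print(p_D)        # 0.7*0.2 + 0.1*0.8 = 0.22
print(posterior)  # {'spam': ~0.636, 'not_spam': ~0.364}; posteriors sum to 1
```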

Choosing the most probable hypothesis h_0 For given data D we would like to find the maximum a posteriori (MAP) hypothesis, i.e.
h_0 = arg max_h P(h|D) = arg max_h P(D|h) P(h) / P(D).
If P(h) is the same for all hypotheses, then h_0 = arg max_h P(D|h) (the maximum likelihood hypothesis). Since P(D) does not depend on h, in general we may simply maximize the numerator, which is the joint probability P(D, h).
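A hedged sketch of this argmax, assuming a small finite hypothesis space with invented priors and likelihoods; it also shows that the maximum likelihood choice can differ from the MAP choice when the prior is not uniform.

```python
# Hypothetical priors P(h) and likelihoods P(D|h); values are illustrative only.
priors      = {"h1": 0.80, "h2": 0.15, "h3": 0.05}
likelihoods = {"h1": 0.2,  "h2": 0.9,  "h3": 0.5}

# MAP: maximize the numerator P(D|h) P(h); P(D) is the same for every h,
# so it can be dropped from the argmax.
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

# ML: if P(h) is uniform, the argmax depends on P(D|h) alone.
h_ml = max(likelihoods, key=likelihoods.get)

print(h_map)  # 'h1' -- the strong prior (0.8*0.2 = 0.16) outweighs h2 (0.9*0.15 = 0.135)
print(h_ml)   # 'h2' -- the hypothesis with the largest likelihood
```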

Useful properties
1. Product rule: P(A∩B) = P(A|B) P(B) = P(B|A) P(A).
2. P(A∪B) = P(A) + P(B) − P(A∩B).
3. Law of total probability: if A_1, A_2, …, A_n are mutually exclusive and exhaustive, then P(B) = Σ_i P(B|A_i) P(A_i), with Σ_i P(A_i) = 1.
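The short check below builds a tiny joint distribution over two binary variables (the numbers are arbitrary) and verifies the three properties numerically.

```python
# Hypothetical joint distribution P(A, B) over two binary variables; sums to 1.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def p_A(a):            # marginal P(A = a)
    return sum(p for (ai, bi), p in joint.items() if ai == a)

def p_B(b):            # marginal P(B = b)
    return sum(p for (ai, bi), p in joint.items() if bi == b)

def p_A_given_B(a, b): # conditional P(A = a | B = b)
    return joint[(a, b)] / p_B(b)

def p_B_given_A(b, a): # conditional P(B = b | A = a)
    return joint[(a, b)] / p_A(a)

# 1. Product rule: P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)
assert abs(joint[(1, 1)] - p_A_given_B(1, 1) * p_B(1)) < 1e-12
assert abs(joint[(1, 1)] - p_B_given_A(1, 1) * p_A(1)) < 1e-12

# 2. P(A=1 ∪ B=1) = P(A=1) + P(B=1) - P(A=1 ∩ B=1)
p_union = sum(p for (ai, bi), p in joint.items() if ai == 1 or bi == 1)
assert abs(p_union - (p_A(1) + p_B(1) - joint[(1, 1)])) < 1e-12

# 3. Total probability over the mutually exclusive, exhaustive events A=0 and A=1
assert abs(p_B(1) - sum(p_B_given_A(1, a) * p_A(a) for a in (0, 1))) < 1e-12

print("all three identities hold on this example")
```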

In most ML problems we have a feature vector x = (x_1, x_2, …, x_d) and an instance space X = ∪_i A_i, in which the A_i are classes. Problem: find the most probable class A_k such that x ∈ A_k. Answer: we define the hypothesis h_i that x ∈ A_i for each i. Then we need to find the k maximizing
P(A_k|x) = P(x|A_k) P(A_k) / Σ_i P(x|A_i) P(A_i).
The numerator is the joint probability P(A_k, x) = P(x_1, x_2, …, x_d, A_k). We assume that the features x_i are conditionally independent given the class.

By the chain rule of conditional probability we have
P(x_1, x_2, …, x_d, A_k) = P(x_1|x_2, …, x_d, A_k) P(x_2, …, x_d, A_k)
= P(x_1|x_2, …, x_d, A_k) P(x_2|x_3, …, x_d, A_k) P(x_3, …, x_d, A_k)
= … = P(x_1|x_2, …, x_d, A_k) P(x_2|x_3, …, x_d, A_k) P(x_3|x_4, …, x_d, A_k) … P(x_d|A_k) P(A_k).
The conditional independence assumption gives P(x_i|x_{i+1}, …, x_d, A_k) = P(x_i|A_k), and so we have
P(x_1, x_2, …, x_d, A_k) = P(x_1|A_k) P(x_2|A_k) … P(x_d|A_k) P(A_k).

Therefore we have P(A_k|x) = (P(A_k)/P(x)) Π_i P(x_i|A_k), and we find the k that maximizes this quantity. P(x) is constant once the data vector x is known, so it can be ignored in the maximization.
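A minimal sketch of the resulting decision rule, assuming made-up class priors P(A_k) and per-feature conditional probabilities P(x_i|A_k); P(x) is dropped because it is the same for every class, and logs are used for numerical stability.

```python
import math

# Hypothetical model for two classes and two categorical features;
# all probabilities below are invented for illustration.
priors = {"A1": 0.6, "A2": 0.4}                       # P(A_k)
cond = {                                              # P(x_i = value | A_k)
    "A1": [{"red": 0.7, "blue": 0.3}, {"small": 0.2, "large": 0.8}],
    "A2": [{"red": 0.4, "blue": 0.6}, {"small": 0.9, "large": 0.1}],
}

def classify(x):
    """Return the class maximizing P(A_k) * prod_i P(x_i | A_k), computed in log-space."""
    scores = {}
    for k in priors:
        score = math.log(priors[k])
        for i, value in enumerate(x):
            score += math.log(cond[k][i][value])
        scores[k] = score
    return max(scores, key=scores.get)

print(classify(["red", "large"]))   # 'A1' (0.6*0.7*0.8 = 0.336 beats 0.4*0.4*0.1 = 0.016)
print(classify(["blue", "small"]))  # 'A2'
```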

Gaussian classifier Sometimes an attribute x is continuous and normally distributed. Suppose that within class C it has a Gaussian distribution with mean μ_C and variance σ_C². For a test value v, the class-conditional probability density is
P(x = v|C) = (1/√(2π σ_C²)) exp(−(v − μ_C)² / (2σ_C²)).
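A short sketch of this class-conditional density; the class means and variances below are invented purely for illustration.

```python
import math

def gaussian_likelihood(v, mean, var):
    """p(x = v | C) for a class C whose attribute is normally distributed N(mean, var)."""
    return math.exp(-(v - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical classes: e.g. heights (cm) for two made-up classes (mean, variance).
classes = {"C1": (170.0, 25.0), "C2": (160.0, 16.0)}

v = 166.0
for name, (mu, var) in classes.items():
    # These densities play the role of P(x = v | C) in the naive Bayes product.
    print(name, gaussian_likelihood(v, mu, var))
```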

Multinomial classifier First consider drawing n balls, where each ball has one of k colors and p_i is the probability of drawing color i, with p_1 + p_2 + … + p_k = 1. Then the probability of drawing exactly x_i balls of color i, with x_1 + … + x_k = n, is
n!/(x_1! x_2! … x_k!) · p_1^{x_1} p_2^{x_2} … p_k^{x_k}.
Example: n = 10, p_1 = p_2 = 0.3, p_3 = 0.4.
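A sketch of this probability evaluated on the slide's example parameters (n = 10, p = (0.3, 0.3, 0.4)); the particular count vector (3, 3, 4) is our own choice for illustration.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(X_1 = x_1, ..., X_k = x_k) = n!/(x_1!...x_k!) * p_1^{x_1} ... p_k^{x_k}."""
    n = sum(counts)
    coeff = factorial(n) // prod(factorial(x) for x in counts)
    return coeff * prod(p ** x for p, x in zip(probs, counts))

# Slide's example parameters; the count vector (3, 3, 4) sums to n = 10.
probs = (0.3, 0.3, 0.4)
counts = (3, 3, 4)
print(multinomial_pmf(counts, probs))  # 4200 * 0.3^3 * 0.3^3 * 0.4^4 ≈ 0.0784
```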

Multinomial naïve Bayesian classifier With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial distribution (p_1, …, p_k), where p_i is the probability that event i occurs (or K such multinomials in the multiclass case). A feature vector x = (x_1, …, x_n) is then a histogram, with x_i counting the number of times event i was observed in a particular instance. This is the event model typically used for document classification, with events representing the occurrence of a word in a single document (see the bag-of-words assumption).

The likelihood of observing a histogram x given class C_k is
p(x|C_k) = ((Σ_i x_i)! / Π_i x_i!) Π_i p_{ki}^{x_i},
where p_{ki} is the probability that event i occurs under class C_k. The multinomial naive Bayes classifier becomes a linear classifier when expressed in log-space:
log p(C_k|x) ∝ log P(C_k) + Σ_i x_i log p_{ki}.
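A hedged sketch of this log-space form with invented word probabilities; the multinomial coefficient is omitted because it does not depend on the class and therefore cancels in the argmax.

```python
import math

# Hypothetical vocabulary of 4 "words" and two classes with made-up parameters.
log_prior = {"sports": math.log(0.5), "politics": math.log(0.5)}   # log P(C_k)
word_prob = {                                                       # p_{ki}
    "sports":   [0.4, 0.3, 0.2, 0.1],
    "politics": [0.1, 0.2, 0.3, 0.4],
}

def score(x_counts, k):
    """Log-linear multinomial NB score: log P(C_k) + sum_i x_i * log p_{ki}."""
    return log_prior[k] + sum(x * math.log(p) for x, p in zip(x_counts, word_prob[k]))

x = [3, 1, 0, 0]   # histogram of word counts in a hypothetical document
best = max(log_prior, key=lambda k: score(x, k))
print(best)        # 'sports'
```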

If a given class and feature value never occur together in the training data, then the frequency-based probability estimate will be zero. This is problematic because it will wipe out all information in the other probabilities when they are multiplied. Therefore, it is often desirable to incorporate a small-sample correction, called a pseudocount, into all probability estimates so that no probability is ever set to exactly zero. This way of regularizing naive Bayes is called Laplace smoothing when the pseudocount is one, and Lidstone smoothing in the general case.
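A minimal sketch of the pseudocount correction with hypothetical counts; alpha = 1 gives Laplace smoothing and other values of alpha give Lidstone smoothing.

```python
def smoothed_prob(count, total, vocab_size, alpha=1.0):
    """Estimate P(word | class) with pseudocount alpha (Laplace when alpha = 1)."""
    return (count + alpha) / (total + alpha * vocab_size)

# Hypothetical counts: a word seen 0 times in a class with 100 word occurrences
# and a vocabulary of 50 distinct words.
print(smoothed_prob(0, 100, 50))        # 1/150 ≈ 0.0067 instead of exactly 0
print(smoothed_prob(0, 100, 50, 0.1))   # Lidstone smoothing with alpha = 0.1
print(smoothed_prob(30, 100, 50))       # (30+1)/(100+50) ≈ 0.207
```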

Bernoulli naïve Bayesian In the multivariate Bernoulli event model, features are independent Booleans (binary variables) describing inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term-occurrence features are used rather than term frequencies. If x_i is a Boolean expressing the presence or absence of the i-th term of the vocabulary, then the likelihood of a document given a class C_k is
p(x|C_k) = Π_i p_{ki}^{x_i} (1 − p_{ki})^{(1 − x_i)},
where p_{ki} is the probability of class C_k generating the term w_i.

Text classification using the Bernoulli Bayesian classifier Let b be the feature vector for the k-th document D_k; then the i-th element of b, written b_i, is either 0 or 1, representing the absence or presence of word w_i in the k-th document. Let P(w_i|C) be the probability of word w_i occurring in a document of class C; the probability of w_i not occurring in a document of this class is then (1 − P(w_i|C)). If we make the naive Bayes assumption, that the probability of each word occurring in the document is independent of the occurrences of the other words, then we can write the document likelihood P(D_k|C) in terms of the individual word likelihoods P(w_i|C):
P(D_k|C) ≈ P(b|C) = Π_{i=1}^{|V|} [ b_i P(w_i|C) + (1 − b_i)(1 − P(w_i|C)) ].

This product goes over all words in the vocabulary. If word w_i is present, then b_i = 1 and the required probability is P(w_i|C); if word w_i is not present, then b_i = 0 and the required probability is 1 − P(w_i|C). We can imagine this as a model for generating document feature vectors of class C, in which the document feature vector is modelled as a collection of |V| weighted coin tosses, the i-th having a probability of success equal to P(w_i|C).
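A hedged sketch of this product over the vocabulary, using a made-up four-word vocabulary and invented probabilities P(w_i|C) for a single class.

```python
# Hypothetical vocabulary and per-word occurrence probabilities P(w_i | C).
vocab = ["goal", "election", "match", "vote"]
p_word_given_C = {"sports": [0.8, 0.1, 0.7, 0.05]}   # invented values

def bernoulli_likelihood(b, cls):
    """P(D_k | C) ≈ prod_i [ b_i P(w_i|C) + (1 - b_i)(1 - P(w_i|C)) ]."""
    likelihood = 1.0
    for b_i, p in zip(b, p_word_given_C[cls]):
        likelihood *= b_i * p + (1 - b_i) * (1 - p)
    return likelihood

def to_binary_vector(document_words):
    """b_i = 1 if word w_i is present in the document, else 0."""
    return [1 if w in document_words else 0 for w in vocab]

doc = {"goal", "match", "referee"}          # words outside the vocabulary are ignored
b = to_binary_vector(doc)                   # [1, 0, 1, 0]
print(bernoulli_likelihood(b, "sports"))    # 0.8 * 0.9 * 0.7 * 0.95 ≈ 0.479
```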

Comparison between the multinomial and Bernoulli models in text classification (Hiroshi Shimodaira) 1. Underlying model of text: Bernoulli: a document can be thought of as being generated from a multidimensional Bernoulli distribution: the probability of a word being present can be thought of as a (weighted) coin flip with probability P(w_i|C). Multinomial: a document is formed by drawing words from a multinomial distribution: you can think of obtaining the next word in the document by rolling a (weighted) |V|-sided die with probabilities P(w_i|C).

2. Document representation: Bernoulli: binary vector, elements indicating presence or absence of a word. Multinomial: integer vector, elements indicating frequency of occurrence of a word. 3. Multiple occurrences of words: Bernoulli: ignored. Multinomial: taken into account.

4. Behaviour with document length: Bernoulli: best for short documents. Multinomial: longer documents are OK. 5. Behaviour with “the”: Bernoulli: since “the” is present in almost every document, P(“the” |C) ~ 1.0. Multinomial: since probabilities are based on relative frequencies of word occurrence in a class, P(“the” |C) ~ 0.05.

Hidden Markov Bayesian classifier (Wikipedia) A hidden Markov chain can be considered a submodel of the Bayesian classifier for dynamic, sequential data. In the general architecture of an instantiated HMM (usually drawn as a trellis diagram), each node represents a random variable that can adopt any of a number of values. The random variable x(t) is the hidden state at time t (for example, x(t) ∈ {x1, x2, x3}). The random variable y(t) is the observation at time t (with y(t) ∈ {y1, y2, y3, y4}). The arrows in the diagram denote conditional dependencies.

From this structure it is clear that the conditional probability distribution of the hidden variable x(t) at time t, given the values of the hidden variable x at all times, depends only on the value of the hidden variable x(t − 1); the values at time t − 2 and before have no influence. This is called the Markov property. Similarly, the value of the observed variable y(t) depends only on the value of the hidden variable x(t) (both at time t). Hidden Markov models are especially known for their applications in temporal pattern recognition such as speech, handwriting, and gesture recognition, part-of-speech tagging, musical score following, partial discharges, and bioinformatics.
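To make the two conditional-independence statements concrete, here is a hedged sketch with invented initial, transition, and emission tables; it computes the joint probability of a hidden state sequence and an observation sequence using exactly the factorization implied by the Markov property.

```python
# Hypothetical 2-state HMM with 2 possible observations; all numbers are invented.
initial = {"Rainy": 0.6, "Sunny": 0.4}                     # P(x(1))
transition = {                                             # P(x(t) | x(t-1))
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emission = {                                               # P(y(t) | x(t))
    "Rainy": {"walk": 0.1, "umbrella": 0.9},
    "Sunny": {"walk": 0.8, "umbrella": 0.2},
}

def joint_probability(hidden, observed):
    """P(x(1..T), y(1..T)) = P(x1) P(y1|x1) * prod_t P(x_t|x_{t-1}) P(y_t|x_t)."""
    p = initial[hidden[0]] * emission[hidden[0]][observed[0]]
    for t in range(1, len(hidden)):
        p *= transition[hidden[t - 1]][hidden[t]]   # Markov property: depends on x(t-1) only
        p *= emission[hidden[t]][observed[t]]       # y(t) depends on x(t) only
    return p

print(joint_probability(["Rainy", "Rainy", "Sunny"], ["umbrella", "umbrella", "walk"]))
```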

Thank you for your attention.