
1 Naïve Bayes Classification. Debapriyo Majumdar, Data Mining – Fall 2014, Indian Statistical Institute Kolkata, August 14, 2014

2 Bayes' Theorem
• Thomas Bayes (1701–1761)
• Simple form of Bayes' theorem, for two random variables C and X:
  $P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}$
  where $P(C \mid X)$ is the posterior probability, $P(X \mid C)$ is the likelihood, $P(C)$ is the class prior probability, and $P(X)$ is the predictor prior probability (the evidence).
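A quick numeric illustration of the theorem, as a sketch; the event definitions and probability values below are made up for illustration, not taken from the slides.

    # A quick numeric illustration of Bayes' theorem; the values are made up.
    # C = "the email is spam", X = "the email contains the word 'offer'".
    p_c = 0.2              # class prior P(C)
    p_x_given_c = 0.6      # likelihood P(X | C)
    p_x_given_not_c = 0.1  # P(X | not C)

    p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)  # evidence P(X), by total probability
    p_c_given_x = p_x_given_c * p_c / p_x                  # posterior P(C | X), by Bayes' theorem
    print(round(p_c_given_x, 3))                           # 0.6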

3 Probability Model
• Probability model: for a target class variable C that depends on features X_1, …, X_n:
  $P(C \mid X_1, \dots, X_n) = \frac{P(C)\, P(X_1, \dots, X_n \mid C)}{P(X_1, \dots, X_n)}$
• The values of the features are given, so the denominator is effectively constant
• Goal: calculating probabilities for the possible values of C
• We are therefore interested in the numerator: $P(C)\, P(X_1, \dots, X_n \mid C)$

4 Probability Model
• Up to the constant denominator, the conditional probability is equivalent to the joint probability:
  $P(C \mid X_1, \dots, X_n) \propto P(C, X_1, \dots, X_n)$
• Applying the chain rule for joint probability:
  $P(C, X_1, \dots, X_n) = P(C)\, P(X_1 \mid C)\, P(X_2 \mid C, X_1) \cdots P(X_n \mid C, X_1, \dots, X_{n-1})$

5 Strong Independence Assumption (Naïve)
• Assume the features X_1, …, X_n are conditionally independent given C
  – Given C, the occurrence of X_i does not influence the occurrence of X_j, for i ≠ j:
    $P(X_i \mid C, X_j) = P(X_i \mid C)$
• Similarly, $P(X_i \mid C, X_j, X_k) = P(X_i \mid C)$, and so on
• Hence:
  $P(C, X_1, \dots, X_n) = P(C) \prod_{i=1}^{n} P(X_i \mid C)$

6 Naïve Bayes Probability Model
• Class posterior probability:
  $P(C \mid X_1, \dots, X_n) = \frac{1}{Z}\, P(C) \prod_{i=1}^{n} P(X_i \mid C)$
  where $Z = P(X_1, \dots, X_n)$ is constant, since the values of the features are known

7 Classifier based on Naïve Bayes
• Decision rule: pick the hypothesis (value c of C) that has the highest probability
  – Maximum A-Posteriori (MAP) decision rule:
    $\hat{c} = \arg\max_{c} P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c)$
• $P(C = c)$ is approximated from relative frequencies in the training set
• $P(X_i = x_i \mid C = c)$ is approximated from frequencies in the training set
• The values $x_1, \dots, x_n$ of the features are known for the new observation
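A minimal sketch of the MAP decision rule in Python, assuming the prior and conditional probabilities have already been estimated; the classes, features, and probability values here are hypothetical placeholders, not from the slides.

    # Minimal sketch of the MAP decision rule. The classes, features, and
    # probability values below are hypothetical placeholders, not from the slides.
    priors = {"c1": 0.6, "c2": 0.4}          # P(C = c)
    cond = {                                 # P(X_i = x | C = c)
        "c1": {"x1": 0.3, "x2": 0.5},
        "c2": {"x1": 0.4, "x2": 0.1},
    }

    def map_decision(observed_features, priors, cond):
        # Score each class by P(c) * prod_i P(x_i | c) and return the argmax.
        best_class, best_score = None, -1.0
        for c, p_c in priors.items():
            score = p_c
            for x in observed_features:
                score *= cond[c].get(x, 0.0)
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    print(map_decision(["x1", "x2"], priors, cond))  # c1: 0.09 vs c2: 0.016 -> "c1"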

8 Text Classification with Naïve Bayes
• An example application of Naïve Bayes
• Reference: The IR Book (Introduction to Information Retrieval) by Manning, Raghavan and Schütze, Chapter 6

9 The Text Classification Problem
• Set of class labels / tags / categories: C
• Training set: a set of documents with labels, i.e. pairs <d, c> ∈ D × C
  – Example: a document together with its class label
• Set of all terms (the vocabulary): V
• Given d ∈ D', a set of new documents, the goal is to find a class c(d) for each document
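One possible way to represent such a training set in Python, as a sketch; the particular documents and labels mirror the example used later in the deck (my reading of that slide), and the representation itself is just one reasonable choice.

    # One way to represent the training data: a list of (document, class) pairs.
    # These particular documents and labels mirror the example later in the deck
    # (my reading of that slide), and are only illustrative here.
    training_set = [
        ("Indian Delhi", "India"),
        ("Indian Taj Mahal", "India"),
        ("Indian Goa", "India"),
        ("London Indian Embassy", "UK"),
    ]
    classes = {c for _, c in training_set}                        # the set C of class labels
    vocabulary = {t for d, _ in training_set for t in d.split()}  # the set V of all terms
    print(classes, vocabulary)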

10 Multinomial Naïve Bayes Model
• Probability of a document d being in class c:
  $P(c \mid d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k \mid c)$
  where P(t_k | c) is the probability of term t_k occurring in a document of class c, and n_d is the number of term positions in d
• Intuitively:
  – P(t_k | c) ~ how much evidence t_k contributes to the class c
  – P(c) ~ prior probability of a document being labeled with class c

11 Multinomial Naïve Bayes Model
• The expression multiplies many small probabilities
  – May result in floating point underflow
  – Add the logarithms of the probabilities instead of multiplying the original probability values:
    $c_{MAP} = \arg\max_{c} \Big[ \log P(c) + \sum_{1 \le k \le n_d} \log P(t_k \mid c) \Big]$
  – Each log P(t_k | c) acts as a term weight of t_k in class c
• To do: estimating these probabilities
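A short sketch of log-space scoring for one class, assuming estimates of P(c) and P(t_k | c) are already available; the concrete probability values below are placeholders, not values from the slides.

    import math

    # Sketch: score a document for one class in log space to avoid underflow.
    # The prior and conditional probabilities are hypothetical placeholders.
    def log_score(doc_tokens, prior, cond_prob):
        score = math.log(prior)              # log P(c)
        for t in doc_tokens:
            score += math.log(cond_prob[t])  # log P(t_k | c), the term weight of t_k in c
        return score

    cond_prob = {"indian": 0.5, "embassy": 0.1, "london": 0.1}
    print(log_score(["indian", "embassy", "london", "indian"], prior=0.75, cond_prob=cond_prob))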

12 Maximum Likelihood Estimate
• Based on relative frequencies
• Class prior probability:
  $\hat{P}(c) = \frac{N_c}{N}$ = (# of documents labeled as class c in the training data) / (total # of documents in the training data)
• Estimate P(t|c) as the relative frequency of t occurring in documents labeled as class c:
  $\hat{P}(t \mid c) = \frac{T_{ct}}{\sum_{t'} T_{ct'}}$ = (total # of occurrences of t in documents d ∈ c) / (total # of occurrences of all terms in documents d ∈ c)

13 Handling Rare Events
• What if a term t did not occur in any document belonging to class c in the training data?
  – Quite common, since terms are sparse in documents
• Problem: P(t|c) becomes zero, so the whole expression becomes zero
• Use add-one (Laplace) smoothing:
  $\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\big(\sum_{t'} T_{ct'}\big) + |V|}$
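A minimal sketch of estimating the multinomial parameters (priors and smoothed conditionals) from (document, class) pairs; the function name and data format are illustrative assumptions, not from the slides.

    from collections import Counter, defaultdict

    # Sketch: estimate multinomial Naive Bayes parameters with add-one (Laplace)
    # smoothing. training_set is assumed to be a list of (document_text, class) pairs.
    def train_multinomial_nb(training_set):
        doc_counts = Counter(c for _, c in training_set)   # N_c: documents per class
        term_counts = defaultdict(Counter)                 # T_ct: term counts per class
        vocabulary = set()
        for text, c in training_set:
            for t in text.lower().split():
                term_counts[c][t] += 1
                vocabulary.add(t)
        priors = {c: doc_counts[c] / len(training_set) for c in doc_counts}  # P(c)
        cond_prob = {}
        for c in doc_counts:
            total = sum(term_counts[c].values())
            cond_prob[c] = {t: (term_counts[c][t] + 1) / (total + len(vocabulary))
                            for t in vocabulary}           # smoothed P(t|c)
        return priors, cond_prob, vocabulary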

14 Example
• Training set:
  – "Indian Delhi" (class: India)
  – "Indian Taj Mahal" (class: India)
  – "Indian Goa" (class: India)
  – "London Indian Embassy" (class: UK)
• Classify: "Indian Embassy London Indian"
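A self-contained worked sketch of the multinomial computation for this example, assuming the reading of the slide given above (three documents labeled India, one labeled UK, test document "Indian Embassy London Indian"); the predicted class depends on that reading.

    import math
    from collections import Counter

    # Worked multinomial sketch for this example. The class labels and the test
    # document are my reading of the slide layout; the prediction depends on it.
    docs = {
        "India": ["Indian Delhi", "Indian Taj Mahal", "Indian Goa"],
        "UK": ["London Indian Embassy"],
    }
    test_doc = "Indian Embassy London Indian".lower().split()

    vocab = {t for texts in docs.values() for d in texts for t in d.lower().split()}
    n_docs = sum(len(texts) for texts in docs.values())

    scores = {}
    for c, texts in docs.items():
        term_counts = Counter(t for d in texts for t in d.lower().split())
        total_terms = sum(term_counts.values())
        score = math.log(len(texts) / n_docs)              # log P(c)
        for t in test_doc:
            # add-one smoothed estimate of P(t | c)
            score += math.log((term_counts[t] + 1) / (total_terms + len(vocab)))
        scores[c] = score
    print(scores)  # the class with the higher (less negative) score is predicted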

15 Bernoulli Naïve Bayes Model
• Binary indicator of occurrence instead of frequency
• Estimate P(t|c) as the fraction of documents in c containing the term t
• Models absence of terms explicitly:
  $P(c \mid d) \propto P(c) \prod_{i=1}^{|V|} \big[ X_i\, P(t_i \mid c) + (1 - X_i)\,(1 - P(t_i \mid c)) \big]$
  where X_i = 1 if t_i is present in d and 0 otherwise; the (1 − P(t_i | c)) factors model the absence of terms
• Question: what is the difference between a multinomial model with frequencies truncated to 1, and Bernoulli Naïve Bayes?

16 Example
• Same training documents and test document as before, now with the Bernoulli model:
  – "Indian Delhi" (class: India)
  – "Indian Taj Mahal" (class: India)
  – "Indian Goa" (class: India)
  – "London Indian Embassy" (class: UK)
• Classify: "Indian Embassy London Indian"
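The corresponding sketch with the Bernoulli model, under the same assumed reading of the training documents and test document; note that every vocabulary term contributes, whether it is present or absent.

    import math

    # Bernoulli sketch for the same example, under the same assumed reading of the
    # slide. Every vocabulary term contributes, whether present in the document or not.
    docs = {
        "India": ["Indian Delhi", "Indian Taj Mahal", "Indian Goa"],
        "UK": ["London Indian Embassy"],
    }
    test_terms = set("Indian Embassy London Indian".lower().split())

    vocab = {t for texts in docs.values() for d in texts for t in d.lower().split()}
    n_docs = sum(len(texts) for texts in docs.values())

    scores = {}
    for c, texts in docs.items():
        doc_sets = [set(d.lower().split()) for d in texts]
        score = math.log(len(texts) / n_docs)                   # log P(c)
        for t in vocab:
            df = sum(1 for s in doc_sets if t in s)             # docs of class c containing t
            p = (df + 1) / (len(texts) + 2)                     # smoothed P(t | c), document fraction
            score += math.log(p if t in test_terms else 1 - p)  # presence vs absence of t
        scores[c] = score
    print(scores)  # the class with the higher score is predicted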

17 Naïve Bayes as a Generative Model
• The probability model (multinomial):
  $P(c \mid d) \propto P(c) \prod_{1 \le i \le n_d} P(X_i = t_i \mid c)$
• Example of generation: class UK generates X_1 = London, X_2 = India, X_3 = Embassy
• Multinomial model: X_i is the random variable for position i in the document
  – Takes values from the terms of the vocabulary
• Positional independence assumption
• Bag of words model: only the terms as they occur in d are considered; other terms are excluded

18 Naïve Bayes as a Generative Model
• The probability model (Bernoulli):
  $P(c \mid d) \propto P(c) \prod_{t_i \in V} P(U_i = e_i \mid c)$, where e_i = 1 if t_i occurs in d and 0 otherwise
• P(U_i = 1 | c) is the probability that term t_i will occur in any position in a document of class c
• All terms in the vocabulary participate
• Example of generation: class UK generates U_London = 1, U_India = 1, U_Embassy = 1, U_TajMahal = 0, U_Delhi = 0, U_Goa = 0

19 Multinomial vs Bernoulli
                       Multinomial                  Bernoulli
Event model            Generation of token          Generation of document
Multiple occurrences   Matter                       Do not matter
Length of documents    Better for larger documents  Better for shorter documents
# Features             Can handle more              Works best with fewer

20 On Naïve Bayes
• Text classification
  – Spam filtering (email classification) [Sahami et al. 1998]
  – Adapted by many commercial spam filters: SpamAssassin, SpamBayes, CRM114, …
• Simple, but the conditional independence assumption is very strong (naïve)
  – Naïve Bayes may not estimate the probabilities accurately in many cases, but it often ends up classifying correctly anyway
  – It is difficult to understand the dependencies between features in real problems

