1
Naive Bayes for Document Classification
Illustrative Example
2
Document Classification
Given a document, find its class (e.g. headlines, sports, economics, fashion, ...). We assume the document is a "bag of words": d ~ { t_1, t_2, t_3, ..., t_{n_d} }, where n_d is the number of tokens in d. Using Naive Bayes with a multinomial distribution, we choose the most probable class:

c_{MAP} = \arg\max_{c} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k \mid c)
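As a minimal sketch of the bag-of-words assumption (the tokenizer and the example string are illustrative, not from the slides):

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Lowercase, split on whitespace, and count occurrences of each term.
    Word order is discarded; only the counts survive."""
    return Counter(document.lower().split())

print(bag_of_words("Chinese Beijing Chinese"))
# Counter({'chinese': 2, 'beijing': 1})
```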
3
Binomial Distribution
n independent trials (each a Bernoulli trial), each of which results in success with probability p. The binomial distribution gives the probability of any particular split of the n trials between the two categories (success and failure). E.g. you flip a coin 10 times with P(Heads) = 0.6; what is the probability of getting 8 H and 2 T?

P(k) = \binom{n}{k} p^k (1 - p)^{n - k}

with k being the number of successes (or, to see the similarity with the multinomial, consider the first category being selected k times and the second n - k times).
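A quick check of the coin example, assuming nothing beyond the Python standard library:

```python
from math import comb

n, k, p = 10, 8, 0.6   # 10 flips, 8 heads, P(Heads) = 0.6
prob = comb(n, k) * p**k * (1 - p)**(n - k)
print(f"P(8 heads, 2 tails) = {prob:.4f}")   # about 0.1209
```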
4
Multinomial Distribution
A generalization of the binomial distribution: n independent trials, each of which results in one of k outcomes. The multinomial distribution gives the probability of any particular combination of counts over the k categories. E.g. you have balls in three colours in a bin (3 balls of each colour, so p_R = p_G = p_B = 1/3), from which you draw n = 9 balls with replacement. What is the probability of getting 8 Red, 1 Green, 0 Blue?

P(x_1, x_2, x_3) = \frac{n!}{x_1!\, x_2!\, x_3!}\, p_1^{x_1} p_2^{x_2} p_3^{x_3}
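The ball-drawing example worked out the same way (counts and probabilities taken from the slide):

```python
from math import factorial

counts = [8, 1, 0]            # red, green, blue
probs = [1/3, 1/3, 1/3]       # 3 balls of each colour
n = sum(counts)               # 9 draws with replacement

coeff = factorial(n)
for x in counts:
    coeff //= factorial(x)    # multinomial coefficient n! / (x1! x2! x3!)

prob = coeff
for x, p in zip(counts, probs):
    prob *= p**x
print(f"P(8R, 1G, 0B) = {prob:.6f}")   # about 0.000457
```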
5
Naive Bayes w/ Multinomial Model
from McCallum and Nigam, 1998 (Advanced)
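The body of this slide is a figure that did not survive extraction. For reference, the multinomial event model of McCallum and Nigam is usually written as follows (notation adapted: N_{td} is the count of term t in document d, and n_d is the document length):

P(d \mid c) = P(n_d)\, n_d! \prod_{t \in V} \frac{P(t \mid c)^{N_{td}}}{N_{td}!}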
6
Naive Bayes w/ Multivariate Bernoulli Model
from McCallum and Nigam, 1998 (Advanced)
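The slide body is again a missing figure. In the same paper, the multivariate Bernoulli event model treats each vocabulary term as an independent binary occurrence (B_{td} = 1 if t appears in d, 0 otherwise):

P(d \mid c) = \prod_{t \in V} \left[ B_{td}\, P(t \mid c) + (1 - B_{td})(1 - P(t \mid c)) \right]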
7
Smoothing
For each term t, we need to estimate P(t|c):

\hat{P}(t \mid c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}

T_{ct} is the count of term t in all documents of class c.
8
Smoothing
Because an estimate will be 0 if a term does not appear with a class in the training data, we need smoothing. Laplace (add-one) smoothing:

\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} T_{ct'} + |V|}

|V| is the number of terms in the vocabulary.
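A minimal sketch of the add-one estimate, using the per-class term counts that the "China" example on the following slides produces:

```python
def smoothed_prob(term_counts: dict, term: str, vocab_size: int) -> float:
    """Laplace-smoothed P(term | class) from a class's term-count dictionary."""
    total = sum(term_counts.values())
    return (term_counts.get(term, 0) + 1) / (total + vocab_size)

china_counts = {"Chinese": 5, "Beijing": 1, "Shanghai": 1, "Macao": 1}
print(smoothed_prob(china_counts, "Chinese", vocab_size=6))   # 6/14 = 3/7
print(smoothed_prob(china_counts, "Tokyo", vocab_size=6))     # 1/14
```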
9
Two topic classes: “China”, “not China”
Training set:
  docID  document                             c = China?
  1      Chinese Beijing Chinese              Yes
  2      Chinese Chinese Shanghai             Yes
  3      Chinese Macao                        Yes
  4      Tokyo Japan Chinese                  No
Test set:
  5      Chinese Chinese Chinese Tokyo Japan  ?

V = {Beijing, Chinese, Japan, Macao, Tokyo, Shanghai}, |V| = 6
N = 4 training documents
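The vocabulary and document count can be read off directly; a small sketch that reproduces them:

```python
# Training documents from the slide; True means c = China
train = [
    ("Chinese Beijing Chinese".split(), True),
    ("Chinese Chinese Shanghai".split(), True),
    ("Chinese Macao".split(), True),
    ("Tokyo Japan Chinese".split(), False),
]

vocab = sorted({t for doc, _ in train for t in doc})
print(vocab)        # ['Beijing', 'Chinese', 'Japan', 'Macao', 'Shanghai', 'Tokyo']
print(len(vocab))   # |V| = 6
print(len(train))   # N = 4
```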
10
Probability Estimation and Classification
Using the training and test documents from the previous slide with Laplace smoothing (c = China, \bar{c} = not China):

\hat{P}(c) = 3/4, \quad \hat{P}(\bar{c}) = 1/4
\hat{P}(\text{Chinese} \mid c) = (5+1)/(8+6) = 6/14 = 3/7
\hat{P}(\text{Tokyo} \mid c) = \hat{P}(\text{Japan} \mid c) = (0+1)/(8+6) = 1/14
\hat{P}(\text{Chinese} \mid \bar{c}) = (1+1)/(3+6) = 2/9
\hat{P}(\text{Tokyo} \mid \bar{c}) = \hat{P}(\text{Japan} \mid \bar{c}) = (1+1)/(3+6) = 2/9

Classification of the test document d_5:
\hat{P}(c \mid d_5) \propto 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003
\hat{P}(\bar{c} \mid d_5) \propto 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001

so d_5 is assigned to the class "China".
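A compact sketch that reproduces these numbers end to end (kept in raw probabilities for readability; a real implementation would sum logs, as the summary slide notes):

```python
from collections import Counter
from math import prod

train = [
    ("Chinese Beijing Chinese".split(), "China"),
    ("Chinese Chinese Shanghai".split(), "China"),
    ("Chinese Macao".split(), "China"),
    ("Tokyo Japan Chinese".split(), "not China"),
]
test = "Chinese Chinese Chinese Tokyo Japan".split()

vocab = {t for doc, _ in train for t in doc}

scores = {}
for c in {label for _, label in train}:
    class_docs = [doc for doc, label in train if label == c]
    prior = len(class_docs) / len(train)
    counts = Counter(t for doc in class_docs for t in doc)
    total = sum(counts.values())
    # prior times Laplace-smoothed P(t | c) for every token of the test document
    scores[c] = prior * prod(
        (counts[t] + 1) / (total + len(vocab)) for t in test
    )

print(scores)                       # China ~ 0.0003, not China ~ 0.0001
print(max(scores, key=scores.get))  # China
```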
11
Summary: Miscellaneous
Naïve Bayes is linear in the time it takes to scan the data.
When we have many terms, the product of probabilities will cause a floating-point underflow; therefore we sum log probabilities instead of multiplying:

c_{MAP} = \arg\max_{c} \left[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k \mid c) \right]

For a large training set, the vocabulary is large, and it is better to select only a subset of terms; "feature selection" is used for that. However, accuracy is not badly affected by irrelevant attributes if the data set is large.
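To make the underflow point concrete, a toy illustration (the probability values are made up):

```python
from math import log

probs = [1e-5] * 100   # many small conditional probabilities

product = 1.0
for p in probs:
    product *= p
print(product)         # 0.0 -- the raw product underflows

log_score = sum(log(p) for p in probs)
print(log_score)       # about -1151.29, perfectly representable
```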