
1 Information Retrieval, Lecture 4. Introduction to Information Retrieval (Manning et al. 2007), Chapter 13. For the MSc Computer Science Programme. Dell Zhang, Birkbeck, University of London.

2 Is this spam?

3 Text Classification/Categorization
Given:
- A document d ∈ D.
- A set of classes C = {c1, c2, …, cn}.
Determine:
- The class of d: c(d) ∈ C, where c is a classification function (a "classifier").
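(An illustration, not from the original slides.) A classifier is just a function from documents to class labels. A minimal Python sketch, using a hypothetical hand-coded rule for the spam example above:

```python
# A "document" is represented by its text; a class is a string label from C.
SPAM_WORDS = {"viagra", "lottery", "winner", "prize"}

def classify(d: str) -> str:
    """A toy classification function c(d) with C = {"spam", "ham"}."""
    return "spam" if any(w in d.lower() for w in SPAM_WORDS) else "ham"

print(classify("Congratulations, you are a lottery WINNER!"))  # spam
print(classify("Minutes of yesterday's meeting"))              # ham
```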

4 Classification Methods (1) Manual Classification. For example, Yahoo! Directory, DMOZ, Medline, etc. Very accurate when the job is done by experts. Difficult to scale up.

5 Classification Methods (2) Hand-Coded Rules. For example, CIA, Reuters, SpamAssassin, etc. Accuracy is often quite high if the rules have been carefully refined over time by experts. Expensive to build and maintain the rules.

6 Classification Methods (3) Machine Learning (ML)
For example:
- Automatic email classification: PopFile (http://popfile.sourceforge.net/)
- Automatic webpage classification: MindSet (http://mindset.research.yahoo.com/)
There is no free lunch: hand-classified training data are required. But the training data can be built up (and refined) easily by amateurs.

7 Text Classification via ML [Diagram: labelled (L) training documents go through a learning phase to produce a classifier; the classifier is then used for predicting the classes of unlabelled (U) test documents.]

8 Text Classification via ML: Example
Classes (grouped under broader areas such as AI, Programming, HCI): ML, Planning, Semantics, Garb.Coll., Multimedia, GUI, ...
Training data (characteristic terms per class):
- ML: learning, intelligence, algorithm, reinforcement, network, ...
- Planning: planning, temporal, reasoning, plan, language, ...
- Semantics: programming, semantics, language, proof, ...
- Garb.Coll.: garbage, collection, memory, optimization, region, ...
Test data: the document "planning language proof intelligence".

9 Evaluating Classification
Classification accuracy: the proportion of correct predictions.
Precision, Recall, F1 (for each class).
Macro-averaging: computes the performance measure for each class, then takes a simple average over classes.
Micro-averaging: pools per-document predictions across classes, then computes the performance measure on the pooled contingency table.
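As an illustration of the two averaging schemes (a minimal sketch, not from the slides; the gold labels and predictions are made up), per-class F1 scores are averaged directly for macro-averaging, while the counts are pooled first for micro-averaging:

```python
from collections import Counter

# Hypothetical gold labels and predictions for a 3-class problem.
gold = ["ml", "ml", "planning", "planning", "semantics", "semantics", "semantics", "ml"]
pred = ["ml", "planning", "planning", "planning", "semantics", "ml", "semantics", "ml"]
classes = sorted(set(gold))

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Per-class true positives / false positives / false negatives.
tp, fp, fn = Counter(), Counter(), Counter()
for g, p_ in zip(gold, pred):
    if g == p_:
        tp[g] += 1
    else:
        fp[p_] += 1
        fn[g] += 1

# Macro-averaging: F1 per class, then a simple mean over classes.
macro_f1 = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)

# Micro-averaging: pool the counts across classes, then compute F1 once.
micro_f1 = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))

print(f"macro-F1 = {macro_f1:.3f}, micro-F1 = {micro_f1:.3f}")
```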

10 Sample Learning Curve (Yahoo Science data) [figure not reproduced in the transcript]

11 Bayesian Methods for Classification
Before seeing the content of document d:
Classify d to the class with the maximum prior probability Pr[c].
For each class cj ∈ C, Pr[cj] can be estimated from the training data as Pr[cj] ≈ Nj / N, where Nj is the number of training documents in class cj and N is the total number of training documents.
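A small sketch of this prior estimation (not from the slides; the training labels are hypothetical):

```python
from collections import Counter

# Hypothetical training labels: one class label per training document.
train_labels = ["spam", "ham", "ham", "spam", "ham"]

N = len(train_labels)
N_j = Counter(train_labels)              # N_j: number of documents in class c_j
priors = {c: N_j[c] / N for c in N_j}    # Pr[c_j] ~= N_j / N
print(priors)                            # {'spam': 0.4, 'ham': 0.6}
```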

12 Bayesian Methods for Classification
After seeing the content of document d:
Classify d to the class with the maximum a posteriori probability Pr[c|d].
For each class cj ∈ C, Pr[cj|d] can be computed using Bayes' Theorem.

13 Bayes' Theorem
Pr[c|d] = Pr[d|c] · Pr[c] / Pr[d]
where Pr[c|d] is the a posteriori probability, Pr[d|c] is the class-conditional probability, Pr[c] is the prior probability, and Pr[d] is a constant (the same for all classes).

14 Naïve Bayes: Classification
c(d) = argmax over cj ∈ C of Pr[cj|d] = argmax over cj ∈ C of Pr[d|cj] · Pr[cj], as Pr[d] is a constant.
How can we compute Pr[d|cj]?
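A minimal sketch of this decision rule (not from the slides), assuming a prior table and a likelihood function are already available; working in log space avoids floating-point underflow:

```python
import math

def classify(d, classes, prior, likelihood):
    """Return argmax over c in classes of Pr[d|c] * Pr[c], computed in log space."""
    def log_posterior(c):
        return math.log(prior[c]) + math.log(likelihood(d, c))
    return max(classes, key=log_posterior)
```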

15 Naïve Bayes Assumptions
To facilitate the computation of Pr[d|cj], two simplifying assumptions are made.
Conditional Independence Assumption: given the document's topic, the word in one position tells us nothing about the words in other positions.
Positional Independence Assumption: each document is treated as a bag of words, i.e. the occurrence of a word does not depend on its position.
Then Pr[d|cj] is given by the class-specific unigram language model: Pr[d|cj] = the product over the word positions i in d of Pr[wi|cj]. This is essentially a multinomial distribution.

16 Unigram Language Model
Model for class cj (term probabilities): the 0.2, a 0.1, man 0.01, woman 0.01, said 0.03, likes 0.02, ...
Example document: "the man likes the woman"
Pr[d|cj] = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 (multiply the per-term probabilities).
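A quick check of the multiplication above (not from the slides):

```python
import math

probs = [0.2, 0.01, 0.02, 0.2, 0.01]  # "the man likes the woman" under the model for cj
print(math.prod(probs))               # ~8e-08
```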

17 Naïve Bayes: Learning
Given the training data:
For each class cj ∈ C, estimate Pr[cj] (as before).
For each term wi in the vocabulary V, estimate Pr[wi|cj] ≈ Tji / Σi' Tji', where Tji is the number of occurrences of term wi in documents of class cj, and the sum in the denominator runs over all terms in V.

18 Smoothing
Why not just use the MLE? If a term w (in a test document d) did not occur in the training documents of class cj, Pr[w|cj] would be 0, and then Pr[d|cj] would be 0 no matter how strongly the other terms in d are associated with class cj.
Add-One (Laplace) Smoothing: Pr[wi|cj] ≈ (Tji + 1) / (Σi' Tji' + |V|), where |V| is the size of the vocabulary.
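Putting the preceding slides together, a self-contained multinomial Naïve Bayes sketch with add-one smoothing (a minimal illustration under the slides' assumptions, not the book's reference implementation; the tiny training set is hypothetical):

```python
import math
from collections import Counter, defaultdict

# Hypothetical training set: (document text, class label).
train = [
    ("cheap viagra lottery winner", "spam"),
    ("lottery winner claim prize",  "spam"),
    ("meeting agenda for monday",   "ham"),
    ("project meeting notes",       "ham"),
]

# --- Learning ---
docs_per_class = Counter(c for _, c in train)
term_counts = defaultdict(Counter)            # term_counts[c][w] = T_jw
for text, c in train:
    term_counts[c].update(text.split())

vocab = {w for counts in term_counts.values() for w in counts}
N = len(train)
prior = {c: docs_per_class[c] / N for c in docs_per_class}   # Pr[c_j] = N_j / N

def cond_prob(w, c):
    # Add-one (Laplace) smoothing: (T_jw + 1) / (sum_w' T_jw' + |V|)
    return (term_counts[c][w] + 1) / (sum(term_counts[c].values()) + len(vocab))

# --- Classification ---
def classify(text):
    words = [w for w in text.split() if w in vocab]   # ignore terms unseen in training
    def log_posterior(c):
        return math.log(prior[c]) + sum(math.log(cond_prob(w, c)) for w in words)
    return max(prior, key=log_posterior)

print(classify("free lottery prize"))        # expected: spam
print(classify("notes from the meeting"))    # expected: ham
```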

19 Naïve Bayes is Not So Naïve
Fairly effective:
The Bayes-optimal classifier if the independence assumptions do hold.
Often performs well even when the independence assumptions are badly violated.
Usually yields highly accurate classifications (though the estimated probabilities are not so accurate).
Took 1st and 2nd place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms.
A good, dependable baseline for text classification (though not the best).

20 Naïve Bayes is Not So Naïve
Very efficient:
Linear time complexity for learning and classification.
Low storage requirements.

21 Take-Home Messages
Text classification via machine learning.
Bayes' Theorem.
Naïve Bayes: learning and classification.

