1
Information Retrieval - Lecture 4
Introduction to Information Retrieval (Manning et al. 2007), Chapter 13
For the MSc Computer Science Programme
Dell Zhang, Birkbeck, University of London
2
Is this spam?
3
Text Classification/Categorization
Given: a document $d \in D$, and a set of classes $C = \{c_1, c_2, \dots, c_n\}$.
Determine: the class of $d$, i.e. $c(d) \in C$, where $c$ is a classification function ("classifier").
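As a minimal sketch of what a classification function looks like in code (a hypothetical keyword-based spam filter, invented here for illustration; later slides replace the rule with Naïve Bayes):

```python
# Hypothetical classification function c(d): maps a document to a class.
CLASSES = {"spam", "ham"}

def classify(doc: str) -> str:
    """Return the predicted class c(d) for document d (toy keyword rule)."""
    spam_words = {"viagra", "lottery", "winner", "free"}
    tokens = doc.lower().split()
    return "spam" if any(t in spam_words for t in tokens) else "ham"

print(classify("Congratulations, you are a lottery WINNER"))  # -> spam
```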
4
Classification Methods (1): Manual Classification
For example: Yahoo! Directory, DMOZ, Medline, etc.
Very accurate when the job is done by experts.
Difficult to scale up.
5
Classification Methods (2): Hand-Coded Rules
For example: CIA, Reuters, SpamAssassin, etc.
Accuracy is often quite high, if the rules have been carefully refined over time by experts.
Expensive to build and maintain the rules.
6
Classification Methods (3): Machine Learning (ML)
For example:
Automatic email classification: PopFile (http://popfile.sourceforge.net/)
Automatic webpage classification: MindSet (http://mindset.research.yahoo.com/)
There is no free lunch: hand-classified training data are required.
But the training data can be built up (and refined) easily by amateurs.
7
Text Classification via ML
[Diagram: labelled training documents (L) are used for learning a classifier, which then predicts classes for unlabelled test documents (U).]
8
Text Classification via ML - Example
Classes (under parent categories AI, Programming, HCI): ML, Planning, Semantics, Garb.Coll., Multimedia, GUI, ...
Training data (example terms per class):
  ML: learning, intelligence, algorithm, reinforcement, network, ...
  Planning: planning, temporal, reasoning, plan, language, ...
  Semantics: programming, semantics, language, proof, ...
  Garb.Coll.: garbage, collection, memory, optimization, region, ...
Test data: "planning language proof intelligence" (which class?)
9
Evaluating Classification
Classification accuracy: the proportion of correct predictions.
Precision, Recall, F1 (for each class).
Macro-averaging: compute the performance measure for each class, then take a simple average over classes.
Micro-averaging: pool per-document predictions across classes, then compute the performance measure on the pooled contingency table.
(See the sketch below for both averaging schemes.)
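A minimal sketch (not from the slides; the labels and documents are invented) of computing accuracy and macro- vs micro-averaged F1 from per-document predictions:

```python
from collections import Counter

# Gold labels and predicted labels for six toy test documents.
gold = ["spam", "ham", "ham", "ham", "ham", "spam"]
pred = ["ham",  "ham", "ham", "ham", "ham", "spam"]

classes = sorted(set(gold))
tp, fp, fn = Counter(), Counter(), Counter()
for g, p in zip(gold, pred):
    if g == p:
        tp[g] += 1
    else:
        fp[p] += 1
        fn[g] += 1

def f1(tp_, fp_, fn_):
    prec = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
    rec = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Macro-average: F1 per class, then the simple mean over classes.
macro_f1 = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)

# Micro-average: pool the contingency counts over all classes, then one F1.
micro_f1 = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy, macro_f1, micro_f1)  # ~0.833, ~0.778, ~0.833 on this imbalanced toy data
```

On imbalanced data such as this toy set, macro- and micro-averaged F1 differ because macro-averaging weights every class equally while micro-averaging weights every document equally.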
10
Sample Learning Curve
[Figure: learning curve on the Yahoo Science data.]
11
Bayesian Methods for Classification
Before seeing the content of document d: classify d to the class with the maximum prior probability $\Pr[c]$.
For each class $c_j \in C$, $\Pr[c_j]$ can be estimated from the training data as
$\Pr[c_j] = N_j / N$,
where $N_j$ is the number of training documents in class $c_j$ and $N$ is the total number of training documents.
12
Bayesian Methods for Classification
After seeing the content of document d: classify d to the class with the maximum a posteriori probability $\Pr[c \mid d]$.
For each class $c_j \in C$, $\Pr[c_j \mid d]$ can be computed using Bayes' Theorem.
13
Bayes' Theorem
$\Pr[c_j \mid d] = \dfrac{\Pr[d \mid c_j]\,\Pr[c_j]}{\Pr[d]}$
where $\Pr[c_j \mid d]$ is the a posteriori probability, $\Pr[d \mid c_j]$ the class-conditional probability, $\Pr[c_j]$ the prior probability, and $\Pr[d]$ a constant (the same for every class).
14
Naïve Bayes: Classification
$c(d) = \arg\max_{c_j \in C} \Pr[c_j \mid d] = \arg\max_{c_j \in C} \Pr[d \mid c_j]\,\Pr[c_j]$,
as $\Pr[d]$ is a constant.
How can we compute $\Pr[d \mid c_j]$?
15
Naïve Bayes Assumptions
To facilitate the computation of $\Pr[d \mid c_j]$, two simplifying assumptions are made.
Conditional Independence Assumption: given the document's topic, the word in one position tells us nothing about the words in other positions.
Positional Independence Assumption: each document is a bag of words; the occurrence of a word does not depend on its position.
Then $\Pr[d \mid c_j]$ is given by the class-specific unigram language model:
$\Pr[d \mid c_j] = \prod_{w \in d} \Pr[w \mid c_j]$
Essentially a multinomial distribution.
16
Unigram Language Model
Model for class $c_j$ (word probabilities): the 0.2, a 0.1, man 0.01, woman 0.01, said 0.03, likes 0.02, ...
Scoring the document "the man likes the woman" (multiply the per-word probabilities):
$\Pr[d \mid c_j] = 0.2 \times 0.01 \times 0.02 \times 0.2 \times 0.01 = 8 \times 10^{-8}$
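A minimal sketch of scoring a document under the class-specific unigram language model, using the toy probabilities from the slide:

```python
import math

# Toy unigram model for class c_j, taken from the slide.
model_cj = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

doc = "the man likes the woman".split()

# Direct product of per-word probabilities: Pr[d | c_j].
prob = math.prod(model_cj[w] for w in doc)

# In practice we sum log-probabilities instead, to avoid underflow on long documents.
log_prob = sum(math.log(model_cj[w]) for w in doc)

print(prob)                # ~8e-08
print(math.exp(log_prob))  # same value, up to floating-point error
```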
17
Naïve Bayes: Learning
Given the training data, for each class $c_j \in C$:
estimate $\Pr[c_j]$ (as before);
for each term $w_i$ in the vocabulary $V$, estimate
$\Pr[w_i \mid c_j] = \dfrac{T_{ji}}{\sum_{i'} T_{ji'}}$,
where $T_{ji}$ is the number of occurrences of term $w_i$ in documents of class $c_j$.
18
Smoothing
Why not just use the MLE? If a term w (in a test document d) did not occur in the training data for class $c_j$, $\Pr[w \mid c_j]$ would be 0, and then $\Pr[d \mid c_j]$ would be 0 no matter how strongly the other terms in d are associated with class $c_j$.
Add-One (Laplace) Smoothing:
$\Pr[w_i \mid c_j] = \dfrac{T_{ji} + 1}{\sum_{i'} T_{ji'} + |V|}$
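Putting the learning, smoothing, and classification steps together, a minimal sketch (not from the slides; the training documents and class names are invented, loosely echoing the earlier example):

```python
import math
from collections import Counter, defaultdict

# Tiny labelled training set (invented for illustration).
train = [
    ("planning temporal reasoning plan language", "Planning"),
    ("learning intelligence algorithm reinforcement network", "ML"),
    ("programming semantics language proof", "Semantics"),
]

# --- Learning: estimate Pr[c_j] and T_ji (term counts per class) ---
class_doc_counts = Counter(c for _, c in train)
term_counts = defaultdict(Counter)
vocab = set()
for doc, c in train:
    for w in doc.split():
        term_counts[c][w] += 1
        vocab.add(w)

priors = {c: n / len(train) for c, n in class_doc_counts.items()}

def log_cond_prob(w, c):
    # Add-one smoothing: (T_ji + 1) / (sum_i' T_ji' + |V|)
    return math.log((term_counts[c][w] + 1) /
                    (sum(term_counts[c].values()) + len(vocab)))

# --- Classification: argmax over classes of log Pr[c] + sum of log Pr[w | c] ---
def classify(doc):
    def score(c):
        return math.log(priors[c]) + sum(log_cond_prob(w, c) for w in doc.split())
    return max(priors, key=score)

# Picks whichever class has the highest smoothed posterior on this toy data.
print(classify("planning language proof intelligence"))
```

Working in log space is the standard design choice here: summing log-probabilities avoids the numerical underflow that multiplying many small probabilities would cause.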
19
Naïve Bayes is Not So Naïve
Fairly effective:
It is the Bayes-optimal classifier if the independence assumptions do hold.
It often performs well even if the independence assumptions are badly violated.
It usually yields highly accurate classification (though the estimated probabilities are not so accurate).
It took 1st and 2nd place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms.
A good, dependable baseline for text classification (though not the best).
20
Naïve Bayes is Not So Naïve
Very efficient:
Linear time complexity for learning and classification.
Low storage requirements.
21
Take-Home Messages
Text classification via machine learning
Bayes' Theorem
Naïve Bayes: learning and classification