1
Text Classification: The Naïve Bayes Algorithm
IP notice: most slides from Chris Manning, plus some from William Cohen, Chien Chin Chen, Jason Eisner, David Yarowsky, Dan Jurafsky, P. Nakov, Marti Hearst, Barbara Rosario
2
Outline
Introduction to Text Classification (also called "text categorization")
Naïve Bayes text classification
3
Is this spam?
4
More Applications of Text Classification
Authorship identification
Age/gender identification
Language identification
Assigning topics such as Yahoo-categories, e.g., "finance," "sports," "news > world > asia > business"
Genre detection, e.g., "editorials," "movie-reviews," "news"
Opinion/sentiment analysis on a person/product, e.g., "like," "hate," "neutral"
Labels may be domain-specific, e.g., "contains adult language" vs. "doesn't"
5
Text Classification: Definition
The classifier:
Input: a document d
Output: a predicted class c from some fixed set of labels c_1, ..., c_K
The learner:
Input: a set of m hand-labeled documents (d_1, c_1), ..., (d_m, c_m)
Output: a learned classifier f: d → c
Slide from William Cohen
6
Document Classification
[Figure: an example classification task. Classes (AI and beyond) include ML, Planning, Semantics, Garbage Collection, Multimedia, GUI, (Programming), (HCI), ... Training data for each class is a set of characteristic words, e.g., ML: "learning, intelligence, algorithm, reinforcement, network"; Planning: "planning, temporal, reasoning, plan, language"; Semantics: "programming, semantics, language, proof"; Garbage Collection: "garbage, collection, memory, optimization, region". Test document: "planning language proof intelligence" — which class?]
Slide from Chris Manning
7
Classification Methods: Hand-Coded Rules
Some spam/email filters, etc.
E.g., assign a category if the document contains a given boolean combination of words
Accuracy is often very high if a rule has been carefully refined over time by a subject expert
Building and maintaining these rules is expensive
Slide from Chris Manning
8
Classification Methods: Machine Learning
Supervised machine learning: learn a function from documents (or sentences) to labels
Naive Bayes (simple, common method)
Others: k-nearest neighbors (simple, powerful), support-vector machines (newer, more powerful), plus many other methods
No free lunch: requires hand-classified training data
But the data can be built up (and refined) by amateurs
Slide from Chris Manning
9
Naïve Bayes Intuition
10
Representing Text for Classification
Example document: ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS BUENOS AIRES, Feb 26 — Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets: Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0). Maize Mar 48.0, total 48.0 (nil). Sorghum nil (nil). Oilseed export registrations were: Sunflowerseed total 15.0 (7.9). Soybean May 20.0, total 20.0 (nil). The board also detailed export registrations for subproducts, as follows....
f(d) = c ?
What is the best representation for the document d being classified? (The simplest useful one.)
Slide from William Cohen
11
Bag of Words Representation
[Same example document as on the previous slide.]
Categories: grain, wheat
Slide from William Cohen
12
Bag of Words Representation
[Figure: the same document with all but a few content words masked out as x's; the visible words include GRAIN, OILSEED, grain, grains, oilseeds, tonnes, shipments, total, wheat, Maize, Sorghum, Oilseed, Sunflowerseed, Soybean.]
Categories: grain, wheat
Slide from William Cohen
13
Bag of Words Representation
[Figure: the same masked document, now reduced to word counts.]
word        freq
grain(s)    3
oilseed(s)  2
total       3
wheat       1
maize       1
soybean     1
tonnes      1
...
Categories: grain, wheat
Slide from William Cohen
14
Formalizing Naïve Bayes
15
Bayes' Rule
Allows us to swap the direction of the conditioning.
Sometimes it is easier to estimate one kind of dependence than the other.
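In symbols, for events A and B with P(B) > 0, Bayes' rule reads

  P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}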
16
Conditional Probability
Let A and B be events.
P(B|A) = the probability of event B occurring given that event A occurs.
Definition: P(B|A) = P(A ∩ B) / P(A)
17
Deriving Bayes’ Rule
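From the definition of conditional probability, P(A ∩ B) = P(A | B) P(B) and also P(A ∩ B) = P(B | A) P(A); setting the two equal and dividing by P(B) gives

  P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}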
18
Bayes’ Rule Applied to Documents and Classes Slide from Chris Manning
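With a class c and a document d in place of A and B:

  P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}

Here P(c) is the prior probability of the class, P(d | c) the likelihood of the document given the class, and P(d) the evidence, which is the same for every class.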
19
The Text Classification Problem
Using a supervised learning method, we want to learn a classifier (or classification function) γ.
We denote the supervised learning method by Γ: Γ(T) = γ.
The learning method Γ takes the training set T as input and returns the learned classifier γ.
Once we have learned γ, we can apply it to the test set (or test data).
Slide from Chien Chin Chen
20
Naïve Bayes Text Classification
The Multinomial Naïve Bayes model (NB) is a probabilistic learning method.
In text classification, our goal is to find the "best" class for the document: the class with the highest probability P(c|d) of the document d being in class c. By Bayes' rule this can be rewritten in terms of P(d|c) and P(c), and we can ignore the denominator P(d), which is the same for every class.
Slide from Chien Chin Chen
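Written out, the decision rule described above is the maximum a posteriori (MAP) class:

  c_{\mathrm{map}} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} = \arg\max_{c \in C} P(d \mid c)\,P(c)

since the denominator P(d) does not depend on c.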
21
Naive Bayes Classifiers
We represent an instance D by a tuple of attribute values x_1, x_2, ..., x_n.
Task: classify a new instance D based on its attribute values into one of the classes c_j ∈ C.
Again, by Bayes' rule the probability of the document being in class c can be rewritten, and we can ignore the denominator.
Slide from Chris Manning
22
Naïve Bayes Classifier: Naïve Bayes Assumption
P(c_j): can be estimated from the frequency of classes in the training examples.
P(x_1, x_2, ..., x_n | c_j): O(|X|^n · |C|) parameters — could only be estimated if a very, very large number of training examples were available.
Naïve Bayes conditional independence assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(x_i | c_j).
Slide from Chris Manning
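The conditional independence assumption in symbols:

  P(x_1, x_2, \ldots, x_n \mid c_j) = \prod_{i=1}^{n} P(x_i \mid c_j)

which cuts the number of parameters from O(|X|^n · |C|) down to O(n · |X| · |C|).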
23
The Naïve Bayes Classifier
[Figure: a Naïve Bayes graphical model with class node Flu and feature nodes X_1 = fever, X_2 = sinus, X_3 = cough, X_4 = runny nose, X_5 = muscle-ache.]
Conditional independence assumption: features are independent of each other given the class.
Slide from Chris Manning
24
Using Multinomial Naive Bayes Classifiers to Classify Text
Attributes are text positions, values are words.
Still too many possibilities.
Assume that classification is independent of the positions of the words: use the same parameters for each position.
The result is the bag of words model (over tokens, not types).
Slide from Chris Manning
25
Learning the Model
[Figure: the Naïve Bayes graphical model with class node C and feature nodes X_1, ..., X_6.]
Simplest: maximum likelihood estimate — simply use the frequencies in the data.
Slide from Chris Manning
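The maximum likelihood estimates are just relative frequencies in the training data:

  \hat{P}(c_j) = \frac{N(C = c_j)}{N}, \qquad \hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j)}{N(C = c_j)}

where N counts training instances.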
26
Problem with Maximum Likelihood
[Figure: the flu model again, with features fever, sinus, cough, runny nose, muscle-ache.]
What if we have seen no training cases where the patient had no flu but did have muscle aches? Then the maximum likelihood estimate of P(muscle-ache | no flu) is zero.
Zero probabilities cannot be conditioned away, no matter the other evidence!
Slide from Chris Manning
27
Smoothing to Avoid Overfitting
Laplace smoothing: add 1 to each count, and add the number of values of X_i to the denominator.
Bayesian unigram prior: add m times the overall fraction of the data where X_i = x_i,k to the numerator, and m (the extent of "smoothing") to the denominator.
Slide from Chris Manning
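A sketch of the two estimators named on the slide, in the usual notation (the exact symbols on the original slide are not recoverable from this transcript), with k the number of values of X_i, p_{i,k} the overall fraction of the data where X_i = x_{i,k}, and m the extent of smoothing:

  \hat{P}(x_{i,k} \mid c_j) = \frac{N(X_i = x_{i,k}, C = c_j) + 1}{N(C = c_j) + k}   (Laplace)

  \hat{P}(x_{i,k} \mid c_j) = \frac{N(X_i = x_{i,k}, C = c_j) + m\,p_{i,k}}{N(C = c_j) + m}   (Bayesian unigram prior)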
28
Naïve Bayes: Learning
From the training corpus, extract the Vocabulary.
Calculate the required P(c_j) and P(w_k | c_j) terms:
For each c_j in C do
  docs_j ← subset of documents for which the target class is c_j
  Text_j ← single document containing all docs_j
  for each word w_k in Vocabulary
    n_kj ← number of occurrences of w_k in Text_j
    n_k ← number of occurrences of w_k in all docs
Slide from Chris Manning
29
Naïve Bayes: Classifying
positions ← all word positions in the current document which contain tokens found in Vocabulary
Return c_NB, where:
Slide from Chris Manning
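In the usual multinomial notation:

  c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in \mathrm{positions}} P(x_i \mid c_j)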
30
Underflow Prevention: Log Space
Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
The class with the highest final un-normalized log probability score is still the most probable.
Note that the model is now just a max of a sum of weights.
Slide from Chris Manning
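So the decision rule from the previous slide becomes a sum in log space:

  c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in \mathrm{positions}} \log P(x_i \mid c_j) \Big]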
31
Naïve Bayes Generative Model for Text
[Figure: two bags of words, one for the class spam (with words such as nude, deal, Nigeria, hot, $, Viagra, lottery, !, win) and one for ham (Friday, exam, computer, May, PM, test, March, science, homework, score), with a Category node selecting which bag a document's words are drawn from.]
Choose a class c according to P(c), then choose each word from that class with probability P(x|c).
Essentially, we model the probability of each class as a class-specific unigram language model.
Slide from Ray Mooney
32
Naïve Bayes and Language Modeling
Naïve Bayes classifiers can use any sort of features: URL, email address, dictionary, ...
But if we use only word features, and we use all of the words in the text (not a subset), then Naïve Bayes bears a close similarity to language modeling.
33
Each Class = A Unigram Language Model
Assign to each word: P(word | c)
Assign to each sentence: P(c | s) ∝ P(c) ∏ P(w_i | c)
w      P(w | c)
I      0.1
love   0.1
this   0.01
fun    0.05
film   0.1
Example: "I love this fun film"
0.1 × 0.1 × 0.01 × 0.05 × 0.1 → P(s | c) = 0.0000005
34
Naïve Bayes Language Model
Two classes: in-language, out-language
w      P(w | in)   P(w | out)
I      0.1         0.2
love   0.1         0.001
this   0.01        0.01
fun    0.05        0.005
film   0.1         0.1
Sentence: "I love this fun film"
P(s | in)  = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005
P(s | out) = 0.2 × 0.001 × 0.01 × 0.005 × 0.1 = 0.000000001
P(s | in) > P(s | out)
35
Naïve Bayes Classification
[Figure: the same spam/ham bags of words; given the test document "Win lottery $ !", which category generated it?]
Slide from Ray Mooney
36
Naïve Bayes Text Classification Example
Set    Doc  Words                                Class
Train  1    Chinese Beijing Chinese              c
Train  2    Chinese Chinese Shanghai             c
Train  3    Chinese Macao                        c
Train  4    Tokyo Japan Chinese                  ~c
Test   5    Chinese Chinese Chinese Tokyo Japan  ?
Training:
Vocabulary V = {Chinese, Beijing, Shanghai, Macao, Tokyo, Japan}, so |V| = 6.
P(c) = 3/4 and P(~c) = 1/4.
P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo|c) = P(Japan|c) = (0+1) / (8+6) = 1/14
P(Chinese|~c) = (1+1) / (3+6) = 2/9
P(Tokyo|~c) = P(Japan|~c) = (1+1) / (3+6) = 2/9
Testing:
P(c|d)  ∝ 3/4 × (3/7)^3 × 1/14 × 1/14 ≈ 0.0003
P(~c|d) ∝ 1/4 × (2/9)^3 × 2/9 × 2/9 ≈ 0.0001
So the test document is assigned to class c.
Slide from Chien Chin Chen
37
Naïve Bayes Text Classification
Naïve Bayes algorithm — training phase.
TrainMultinomialNB(C, D)
  V ← ExtractVocabulary(D)
  N ← CountDocs(D)
  for each c in C:
    N_c ← CountDocsInClass(D, c)
    prior[c] ← N_c / N
    text_c ← TextOfAllDocsInClass(D, c)
    for each t in V:
      F_tc ← CountOccurrencesOfTerm(t, text_c)
    for each t in V:
      condprob[t][c] ← (F_tc + 1) / Σ_t' (F_t'c + 1)
  return V, prior, condprob
Slide from Chien Chin Chen
38
Naïve Bayes Text Classification
Naïve Bayes algorithm — testing phase.
ApplyMultinomialNB(C, V, prior, condprob, d)
  W ← ExtractTokensFromDoc(V, d)
  for each c in C:
    score[c] ← log prior[c]
    for each t in W:
      score[c] += log condprob[t][c]
  return argmax_c score[c]
Slide from Chien Chin Chen
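A minimal Python sketch of the two procedures above, applied to the worked example from slide 36. The function names and data layout are illustrative only, not taken from the slides or from any library:

import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    # docs: list of (list_of_tokens, class_label) pairs
    vocab = {t for tokens, _ in docs for t in tokens}
    n_docs = len(docs)
    docs_in_class = defaultdict(list)
    for tokens, c in docs:
        docs_in_class[c].append(tokens)
    prior, condprob = {}, defaultdict(dict)
    for c, doc_list in docs_in_class.items():
        prior[c] = len(doc_list) / n_docs
        counts = Counter(t for tokens in doc_list for t in tokens)
        total = sum(counts.values())
        for t in vocab:
            # Laplace (add-one) smoothing, as in TrainMultinomialNB above
            condprob[t][c] = (counts[t] + 1) / (total + len(vocab))
    return vocab, prior, condprob

def apply_multinomial_nb(vocab, prior, condprob, tokens):
    # Score each class in log space and return the argmax
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in tokens:
            if t in vocab:  # ignore out-of-vocabulary tokens
                score += math.log(condprob[t][c])
        scores[c] = score
    return max(scores, key=scores.get)

# The worked example from slide 36 (classes c and ~c)
train = [
    ("Chinese Beijing Chinese".split(), "c"),
    ("Chinese Chinese Shanghai".split(), "c"),
    ("Chinese Macao".split(), "c"),
    ("Tokyo Japan Chinese".split(), "~c"),
]
V, prior, condprob = train_multinomial_nb(train)
test = "Chinese Chinese Chinese Tokyo Japan".split()
print(apply_multinomial_nb(V, prior, condprob, test))  # prints "c"

The smoothed estimates it computes match those on slide 36 (e.g., P(Chinese|c) = 6/14 = 3/7, P(Tokyo|~c) = 2/9), and the test document is assigned to class c.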
39
Evaluating Categorization
Evaluation must be done on test data that are independent of the training data — usually a disjoint set of instances.
Classification accuracy: c / n, where n is the total number of test instances and c is the number of test instances correctly classified by the system. Adequate if there is one class per document.
Results can vary based on sampling error due to different training and test sets.
Average results over multiple training and test sets (splits of the overall data) for the best results.
Slide from Chris Manning
40
Measuring Performance
Precision = good messages kept / all messages kept
Recall = good messages kept / all good messages
Trade off precision vs. recall by setting a threshold.
Measure the curve on annotated dev data (or test data).
Choose a threshold where the user is comfortable.
Slide from Jason Eisner
41
Measuring Performance
[Figure: precision-recall trade-off curve, annotated: "low threshold: keep all the good stuff, but a lot of the bad too"; "high threshold: all we keep is good, but we don't keep much"; "OK for spam filtering and legal search"; "OK for search engines (maybe)"; "would prefer to be here!"; "point where precision = recall (often reported)".]
Slide from Jason Eisner
42
The 2-by-2 Contingency Table
              Correct          Incorrect
Selected      True Positive    False Positive
Not selected  False Negative   True Negative
43
Precision and Recall
Precision: % of selected items that are correct
Recall: % of correct items that are selected
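In terms of the contingency table above:

  \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}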
44
A Combined Measure: F
The F measure assesses the precision/recall tradeoff through the weighted harmonic mean:
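With precision P, recall R, and weight α (equivalently β² = (1 − α)/α):

  F = \frac{1}{\alpha \frac{1}{P} + (1-\alpha)\frac{1}{R}} = \frac{(\beta^2 + 1)\,P R}{\beta^2 P + R}

The balanced measure (α = 1/2, β = 1) is F_1 = \frac{2PR}{P + R}.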
45
Multiclass Classification
Dealing with > 2 classes.
For each class c, build a binary classifier Y_c to distinguish c from ~c.
Given a test document d, evaluate membership in each class using Y_c, and assign d to each class c for which Y_c returns true.
46
Micro- vs. Macro-Averaging
If we have more than one class, how do we combine multiple performance measures into one quantity?
Macroaveraging: compute performance for each class, then average.
Microaveraging: collect decisions for all classes, compute one contingency table, evaluate.
47
More Complicated Cases of Measuring Performance
For multiclass classifiers (which articles are most Sports-like?):
Average accuracy (or precision or recall) of 2-way distinctions: Sports or not, News or not, etc.
Better, estimate the cost of different kinds of errors — e.g., how bad is each of the following?
  putting Sports articles in the News section
  putting Fashion articles in the News section
  putting News articles in the Fashion section
Now tune the system to minimize total cost.
For ranking systems (which articles / webpages are most relevant?):
Correlate with human rankings?
Get active feedback from the user?
Measure the user's wasted time by tracking clicks?
Slide from Jason Eisner
48
Evaluation Benchmark: Reuters-21578 Data Set
The most (over)used data set: 21,578 docs (each about 90 types, 200 tokens).
9,603 training and 3,299 test articles (the ModApte/Lewis split).
118 categories; an article can be in more than one category, so we learn 118 binary category distinctions.
The average document (with at least one category) has 1.24 classes.
Only about 10 of the 118 categories are large.
49
Training Size
The more the better! (usually)
Results for text classification*
[Figure: test error vs. training size on the newsgroups rec.sport.baseball and rec.sport.hockey.]
*From: Improving the Performance of Naive Bayes for Text Classification, Shen and Yang.
Slide from Nakov/Hearst/Rosario
50
Training Size
[Figure: test error vs. training size on the newsgroups alt.atheism and talk.religion.misc.]
*From: Improving the Performance of Naive Bayes for Text Classification, Shen and Yang.
Slide from Nakov/Hearst/Rosario
51
Training Size
*From: Improving the Performance of Naive Bayes for Text Classification, Shen and Yang.
Slide from Nakov/Hearst/Rosario
52
Training Size
Author identification:
Authorship Attribution: a Comparison of Three Methods, Matthew Care.
Slide from Nakov/Hearst/Rosario
53
Violation of NB Assumptions
Conditional independence
"Positional independence"
Examples?
Slide from Chris Manning
54
Naïve Bayes is Not So Naïve
Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms. Goal: a direct-mail response prediction model for the financial services industry — predict whether the recipient of mail will actually respond to the advertisement; 750,000 records.
Robust to irrelevant features: irrelevant features cancel each other out without affecting results. Decision trees, in contrast, can suffer heavily from this.
Very good in domains with many equally important features. Decision trees suffer from fragmentation in such cases, especially with little data.
A good, dependable baseline for text classification (but not the best)!
Slide from Chris Manning
55
Naïve Bayes is Not So Naïve
Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes optimal classifier for the problem.
Very fast: learning takes one pass of counting over the data; testing is linear in the number of attributes and the document collection size.
Low storage requirements.
Online learning algorithm: can be trained incrementally, on new examples.
56
SpamAssassin
Naïve Bayes is widely used in spam filtering.
Paul Graham's A Plan for Spam: a mutant with more mutant offspring — a Naive Bayes-like classifier with unusual parameter estimation.
But also many other things: black-hole lists, etc.
Many email topic filters also use NB classifiers.
Slide from Chris Manning
57
SpamAssassin Tests
Mentions Generic Viagra
Online Pharmacy
No prescription needed
Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)
Talks about Oprah with an exclamation!
Phrase: impress... girl
From: starts with many numbers
Subject contains "Your Family"
Subject is all capitals
HTML has a low ratio of text to image area
One hundred percent guaranteed
Claims you can be removed from the list
'Prestigious Non-Accredited Universities'
http://spamassassin.apache.org/tests_3_3_x.html
58
Naïve Bayes: Word Sense Disambiguation
w — an ambiguous word
s_1, ..., s_K — senses of the word w
v_1, ..., v_J — words in the context of w
P(s_j) — prior probability of sense s_j
P(v_j | s_k) — probability that word v_j occurs in the context of sense s_k
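With these definitions, the Naïve Bayes choice of sense follows the same pattern as text classification, picking the sense that maximizes the prior times the product of context-word probabilities:

  \hat{s} = \arg\max_{s_k} P(s_k) \prod_{j=1}^{J} P(v_j \mid s_k)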