1 Bins and Text Categorization
Carl Sable (Columbia University), Kenneth W. Church (AT&T)

2 Overview
I. Background: What is text categorization?
II. Task and Corpus: Multimedia news documents
III. Related Work:
– Naïve Bayes
– Smoothing & Speech Recognition
– Binning in Information Retrieval
IV. Our Proposal: Use bins for Text Categorization
V. Results and Evaluation: Binning helps; best to combine
VI. Using Unlabeled Data: Not helping at this time
VII. Conclusions: Robust version of Naïve Bayes

3 Text Classification Tasks
Text Categorization: assign text documents to existing, well-defined categories
Information Retrieval: retrieve text documents that match a user query
Text Filtering: retrieve documents that match a user profile
Clustering: group text documents into clusters of similar documents

4 Text Categorization
Classify each test document by assigning category labels:
– Some tasks assume mutually exclusive categories
– Binary categorization requires a yes/no decision for every document/category pair
Most techniques require training:
– Manual labels are collected to provide samples to the system and also to create a test set
– Expensive, but makes evaluation much simpler
Typically use “bag of words” approaches (see the sketch below).
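A minimal sketch of a “bag of words” representation, assuming simple whitespace tokenization (the helper name is hypothetical, not from the slides):

    from collections import Counter

    def bag_of_words(text):
        # Lowercase, split on whitespace, and count tokens; word order is discarded.
        return Counter(text.lower().split())

    print(bag_of_words("the plane crash-landed near the town"))
    # Counter({'the': 2, 'plane': 1, 'crash-landed': 1, 'near': 1, 'town': 1})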

5 [Example photos contrasting the Outdoor and Indoor categories]

6 Clues for Indoor/Outdoor: Text (as opposed to Vision)
“Denver Summit of Eight leaders begin their first official meeting in the Denver Public Library, June 21.”
“Villagers look at the broken tail-end of the Fokker 28 Biman Bangladesh Airlines jet December 23, a day after it crash-landed near the town of Sylhet, in northeastern Bangladesh.”

7 Event Categories: Politics, Struggle, Disaster, Crime, Other

8 Manual Categorization Tool [screenshot]

9 Related Work
Naïve Bayes
Jelinek, 1998:
– Smoothing techniques for Speech Recognition
– Deleted Interpolation (binning)
Umemura and Church, 2000:
– Applied binning to Information Retrieval

10 Bin System: Naïve Bayes + Smoothing
Binning: based on smoothing in speech recognition
Not enough training data to estimate weights (log likelihood ratios) for each word
– But there would be enough training data if we group words with similar “features” into a common “bin”
Estimate a single weight for each bin (see the sketch after this list)
– This weight is assigned to all words in the bin
Credible estimates even for small counts (zeros)
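A minimal sketch of the grouping step. Binning by rounded IDF and a log-spaced count bucket is an illustrative choice of features, not the paper's exact recipe:

    import math
    from collections import defaultdict

    def bin_key(category_count, idf):
        # Quantize the word's features so feature-similar words share a bin.
        return (round(idf), int(math.log2(category_count + 1)))

    # Hypothetical per-word features: (count in the category, IDF).
    word_features = {"plane": (9, 5), "earthquake": (4, 6), "bed": (0, 8)}

    bins = defaultdict(list)
    for word, (count, idf) in word_features.items():
        bins[bin_key(count, idf)].append(word)
    # All words in a bin will receive the same single weight.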

11
Intuition        Word        Indoor Freq  Outdoor Freq  IDF
Clearly Indoor   conference       14            1        4
Clearly Indoor   bed              10            0        8
Clearly Outdoor  plane             0            9        5
Clearly Outdoor  earthquake        0            4        6
Unclear          speech            2            2        6
Unclear          ceremony          3            8        5

12 “plane”: Sparse Data
First half of training set:
– “plane” appears in 9 outdoor documents, 0 indoor documents
Infinitely more likely to be outdoor???
Assign “plane” to bins of words with similar features (e.g. IDF, counts)

13 Lambdas: Weights
First half of training set: assign words to bins
Second half of training set: calibrate
– Average weights over words in bin (see the sketch below)
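A sketch of the calibration step. The slide says to average weights over the words in a bin; pooling the bin's held-out counts, as below, is one way to get a finite weight even when individual words have zero counts. Base-2 logs and add-one smoothing are assumptions, not the paper's stated choices:

    import math

    def bin_lambda(bin_words, indoor_counts, outdoor_counts, n_indoor, n_outdoor):
        # Pool held-out counts over every word in the bin, then take a
        # smoothed log likelihood ratio; zero counts no longer blow up.
        indoor = sum(indoor_counts.get(w, 0) for w in bin_words)
        outdoor = sum(outdoor_counts.get(w, 0) for w in bin_words)
        p_indoor = (indoor + 1) / (n_indoor + 2)
        p_outdoor = (outdoor + 1) / (n_outdoor + 2)
        return math.log2(p_indoor / p_outdoor)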

14 Lambdas for “plane”: 4.3 times more likely in an outdoor document
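A quick sanity check connecting this figure to the lambda shown on the next slide, assuming the lambdas are base-2 log likelihood ratios (the base is not stated on the slides): lambda(plane) = -2.11 gives 2^2.11 ≈ 4.3, i.e. “plane” is about 4.3 times more likely in an outdoor document than an indoor one.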

15 Binning → Credible Log Likelihood Ratios
Intuition        Word        Lambda  Indoor Freq  Outdoor Freq  IDF
Clearly Indoor   conference   4.84       14            1         4
Clearly Indoor   bed          1.35       10            0         8
Clearly Outdoor  plane       -2.11        0            9         5
Clearly Outdoor  earthquake               0            4         6
Unclear          speech       0.84        2            2         6
Unclear          ceremony    -0.50        3            8         5

16 Does IDF really matter? [chart]

17 System Methodology
Divide the training set into two halves:
– First half used to determine bins for words
– Second half used to determine lambdas for bins
For each test document:
– Map every word to a bin for each category
– Add lambdas, obtaining a score for each category
Switch the halves of the training set and repeat.
Combine results and assign each document to the category with the highest score (see the sketch after this list).
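A minimal sketch of the scoring step; word_to_bin and bin_lambdas are hypothetical structures built from the two training halves as described above:

    def classify(doc_words, word_to_bin, bin_lambdas):
        # Sum bin weights per category, then pick the highest-scoring category.
        scores = {}
        for category in bin_lambdas:
            score = 0.0
            for word in doc_words:
                b = word_to_bin[category].get(word)
                if b is not None:
                    score += bin_lambdas[category].get(b, 0.0)
            scores[category] = score
        return max(scores, key=scores.get)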

18 Evaluation
Mutually exclusive categories
Performance measured by overall accuracy:
accuracy = (number of correctly categorized documents) / (total number of test documents)

19 Bins: Robust Version of Naïve Bayes
Performance is often similar, but can be much better.
[Accuracy charts for the two tasks: Indoor/Outdoor and Events (Politics, Struggle, Disaster, Crime, Other)]

20 Bins: Robust Version of Naïve Bayes
Performs well against other alternatives.
[Accuracy charts for the two tasks: Indoor/Outdoor and Events (Politics, Struggle, Disaster, Crime, Other)]

21 Combine Bins and Naïve Bayes
Idea:
– Might be better to use the Naïve Bayes weight when there is enough evidence for a word
– Back off to the bin weight otherwise
System updated to allow combinations of weights based on the level of evidence (see the sketch after this list).
How can we automatically determine when to use which weights???
– Entropy
– Minimum Squared Error (MSE)
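One plausible combination rule is linear interpolation between the two weights; the form below and the example alpha mapping are assumptions, not the paper's stated formula:

    def combined_weight(nb_weight, bin_weight, evidence, alpha):
        # alpha(evidence) in [0, 1] says how much to trust the per-word
        # Naive Bayes estimate; the rest of the mass goes to the bin weight.
        a = alpha(evidence)
        return a * nb_weight + (1 - a) * bin_weight

    # Illustrative mapping: bins only for unseen words, NB only from
    # evidence 2 upward, equal weighting at evidence 1.
    alpha = lambda n: min(n / 2.0, 1.0)
    print(combined_weight(-1.8, -2.11, evidence=1, alpha=alpha))  # -1.955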

22 Can Provide File to System that Specifies How to Combine Weights
[Example weight specifications: 0, 0.5, 1 (based on entropy); 0, 0.25, 0.5, 0.75, 1 (based on MSE)]
– Use only bins for evidence of 0
– Weight bins and NB equally for evidence of 1
– Use only NB for evidence of 1 or more

23 Best Performance Yet
[Accuracy charts for the two tasks: Indoor/Outdoor and Events (Politics, Struggle, Disaster, Crime, Other)]

24 Attempts to Improve Results
One idea: label more documents!
– Usually works
– Boring
Another idea: use unlabeled documents!
– Easily obtainable
– But can this really work???
– Maybe it can…

25 Binning Using Unlabeled Documents
Apply the system to unlabeled documents
Choose documents with “confident” predictions
– Each word gets a new feature: the number of occurrences in documents predicted to belong to each category
– Probably less important than the number of occurrences in documents definitely belonging to a category
– Bins provide a natural means of weighting the new feature
Bins based on original counts (from training data) and new counts (from unlabeled data); see the sketch after this list.
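A minimal self-training-style sketch of the counting step; the classifier interface and the 0.9 threshold are hypothetical, since the slides do not specify them:

    from collections import defaultdict

    def counts_from_unlabeled(classify_with_confidence, unlabeled_docs, threshold=0.9):
        # Tally word occurrences per *predicted* category, keeping only
        # documents the current system labels confidently.
        new_counts = defaultdict(lambda: defaultdict(int))
        for doc_words in unlabeled_docs:
            category, confidence = classify_with_confidence(doc_words)
            if confidence >= threshold:
                for word in set(doc_words):
                    new_counts[category][word] += 1
        # Bins are then keyed on both the original (labeled) counts and
        # these new (predicted) counts.
        return new_counts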

26 Should the New Feature Matter? [chart]

27 Did the New Feature Help? No.
Why???
– The new features add information but make the bins smaller
– Perhaps more data isn’t needed in the first place
Should more data matter?
– Hard to accumulate more labeled data
– Easy to try out less labeled data!

28 Does Size Matter? [chart]

29 Conclusions
Binning: a robust version of Naïve Bayes
– Smoothing is good
– Reliable log likelihood ratios even for small counts: “plane” (9 outdoor docs, 0 indoor docs) → 4.3 times more likely to be outdoor than indoor
– Usually improves performance
– Best if combined with Naïve Bayes
Unlabeled data:
– Not helping with our tasks
– Same methodology might help with other tasks


