1 Bins and Text Categorization
Carl Sable (Columbia University), Kenneth W. Church (AT&T)


2 Overview
I. Background: What is text categorization?
II. Task and Corpus: Multimedia news documents
III. Related Work:
   – Naïve Bayes
   – Smoothing & Speech Recognition
   – Binning in Information Retrieval
IV. Our Proposal: Use bins for Text Categorization
V. Results and Evaluation: Binning helps; best to combine
VI. Using Unlabeled Data: Not helping at this time
VII. Conclusions: Robust version of Naïve Bayes

3 Text Classification Tasks
– Text Categorization: assign text documents to existing, well-defined categories
– Information Retrieval: retrieve text documents that match a user query
– Text Filtering: retrieve documents that match a user profile
– Clustering: group text documents into clusters of similar documents

4 Text Categorization
Classify each test document by assigning category labels:
– Some tasks assume mutually exclusive categories
– Binary categorization requires a yes/no decision for every document/category pair
Most techniques require training:
– Manual labels are collected to provide samples to the system and also to create a test set
– Expensive, but makes evaluation much simpler
Typically use "bag of words" approaches
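
As an illustration of the "bag of words" representation mentioned above, here is a minimal Python sketch; the tokenizer and the example caption are hypothetical, not taken from the paper. The point is simply that a document is reduced to unordered word counts before any classification happens.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Reduce a document to unordered word counts (a 'bag of words')."""
    tokens = re.findall(r"[a-z]+", text.lower())  # crude tokenizer, for illustration only
    return Counter(tokens)

caption = "Villagers look at the broken tail-end of the jet after it crash-landed."
print(bag_of_words(caption))
# Counter({'the': 2, 'villagers': 1, 'look': 1, 'at': 1, ...})
```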

5 [Example news photographs, labeled Outdoor and Indoor]

6 Clues for Indoor/Outdoor: Text (as opposed to Vision)
– "Denver Summit of Eight leaders begin their first official meeting in the Denver Public Library, June 21."
– "Villagers look at the broken tail-end of the Fokker 28 Biman Bangladesh Airlines jet December 23, a day after it crash-landed near the town of Sylhet, in northeastern Bangladesh."

7 Event Categories: Politics, Struggle, Disaster, Crime, Other

8 Manual Categorization Tool

9 Related Work
– Naïve Bayes
– Jelinek, 1998: smoothing techniques for speech recognition; deleted interpolation (binning)
– Umemura and Church, 2000: applied binning to Information Retrieval

10 Bin System: Naïve Bayes + Smoothing
Binning is based on smoothing in speech recognition.
– Not enough training data to estimate weights (log likelihood ratios) for each word
– But there would be enough training data if we group words with similar "features" into a common "bin"
Estimate a single weight for each bin:
– This weight is assigned to all words in the bin
Credible estimates even for small counts (zeros)
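
To make "weights (log likelihood ratios)" concrete: a common way to write the score a Naïve Bayes style classifier assigns to a document is a sum of per-word log likelihood ratios (the paper's exact parameterization may differ slightly). Binning replaces each per-word ratio with the ratio estimated for the bin the word falls in:

```latex
\text{score}_{c}(d) \;=\; \log \frac{P(c)}{P(\bar{c})}
  \;+\; \sum_{w \in d} \underbrace{\log \frac{P(w \mid c)}{P(w \mid \bar{c})}}_{\lambda_{w} \;\to\; \lambda_{\mathrm{bin}(w)}}
```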

11
Intuition        | Word       | Indoor Freq | Outdoor Freq | IDF
Clearly Indoor   | conference | 14          | 1            | 4
Clearly Indoor   | bed        | 1           | 0            | 8
Clearly Outdoor  | plane      | 0           | 9            | 5
Clearly Outdoor  | earthquake | 0           | 4            | 6
Unclear          | speech     | 2           | 2            | 6
Unclear          | ceremony   | 3           | 8            | 5
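
For reference, the IDF column presumably follows the usual inverse document frequency definition (the base and rounding used for the table values are not stated in the slides, so this is an assumption):

```latex
\mathrm{IDF}(w) = \log_{2} \frac{N}{\mathrm{df}(w)},
\qquad N = \text{number of documents}, \quad \mathrm{df}(w) = \text{documents containing } w
```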

12 "plane": Sparse Data
First half of training set:
– "plane" appears in 9 outdoor documents, 0 indoor documents
– Infinitely more likely to be outdoor???
Assign "plane" to bins of words with similar features (e.g., IDF, counts)
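
The "infinitely more likely" remark refers to the unsmoothed maximum-likelihood estimate: a zero count in one category makes the raw log likelihood ratio unbounded, which is exactly the failure binning is meant to fix (N_out and N_in denote the number of outdoor and indoor training documents):

```latex
\log \frac{P(\text{plane} \mid \text{outdoor})}{P(\text{plane} \mid \text{indoor})}
  \;\approx\; \log \frac{9 / N_{\text{out}}}{0 / N_{\text{in}}} \;=\; +\infty
```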

13 Lambdas: Weights
– First half of training set: assign words to bins
– Second half of training set: calibrate
   – Average weights over words in bin
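
The calibration formula itself appears only as an image in the original slide, so the following is a reconstruction rather than the authors' exact equation: the held-out half is used to estimate one smoothed log likelihood ratio per bin, pooling counts over all words assigned to that bin.

```latex
% One weight per bin b and category c, estimated from the second (held-out) half:
\lambda_{b,c} \;=\; \log \frac{P(w \in b \mid c)}{P(w \in b \mid \bar{c})}
  \;\approx\; \log \frac{\sum_{w \in b} \mathrm{count}_2(w, c) / N_c}
                        {\sum_{w \in b} \mathrm{count}_2(w, \bar{c}) / N_{\bar{c}}}
```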

14 Lambdas for “plane”: 4.3 times more likely in an outdoor document

15 Binning → Credible Log Likelihood Ratios
Intuition        | Word       | Lambda | Indoor Freq | Outdoor Freq | IDF
Clearly Indoor   | conference | …      | 14          | 1            | 4
Clearly Indoor   | bed        | …      | 1           | 0            | 8
Clearly Outdoor  | plane      | …      | 0           | 9            | 5
Clearly Outdoor  | earthquake | …      | 0           | 4            | 6
Unclear          | speech     | …      | 2           | 2            | 6
Unclear          | ceremony   | …      | 3           | 8            | 5
(Same words as slide 11; the lambda values are visible only in the original slide graphic.)

16 Does IDF really matter?

17 System Methodology
Divide the training set into two halves:
– First half used to determine bins for words
– Second half used to determine lambdas for bins
For each test document:
– Map every word to a bin for each category
– Add lambdas, obtaining a score for each category
Switch the halves of the training set and repeat.
Combine the results and assign each document to the category with the highest score.
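
A minimal Python sketch of this methodology, under simplifying assumptions: the binning "features" are reduced to rounded IDF plus a per-category document count, counts get add-0.5 smoothing, category sizes are not normalized, and only one direction of the half-swap is shown. Function names and interfaces are illustrative, not from the paper.

```python
import math
from collections import Counter, defaultdict

def doc_freq(docs):
    """Document frequency of each word over a list of token lists."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df

def train_bin_weights(half1, half2, categories):
    """half1, half2: dicts mapping category -> list of documents (token lists).
    Bins are determined from half1; lambdas are calibrated on half2."""
    docs1 = [d for c in categories for d in half1[c]]
    df1 = doc_freq(docs1)
    n1 = len(docs1)

    def bin_of(word, cat):
        # Bin on rounded IDF and the word's document count in the category
        # (the real system bins on richer features than this).
        idf = round(math.log2(n1 / df1[word]))
        cat_count = sum(word in d for d in half1[cat])
        return (idf, cat_count)

    # Pool half2 counts over all words in a bin, then take a log ratio
    # (add-0.5 smoothing keeps zero counts finite).
    inside, outside = defaultdict(lambda: 0.5), defaultdict(lambda: 0.5)
    for cat in categories:
        for word in df1:
            b = (cat, bin_of(word, cat))
            inside[b] += sum(word in d for d in half2[cat])
            outside[b] += sum(word in d for c in categories if c != cat
                              for d in half2[c])
    lambdas = {b: math.log2(inside[b] / outside[b]) for b in inside}

    def weight(word, cat):
        if word not in df1:
            return 0.0
        return lambdas.get((cat, bin_of(word, cat)), 0.0)
    return weight

def classify(doc, weight, categories):
    """Sum bin lambdas over the document's words; pick the best category."""
    return max(categories, key=lambda c: sum(weight(w, c) for w in set(doc)))
```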

18 Evaluation
– Mutually exclusive categories
– Performance measured by overall accuracy:
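
The accuracy formula appears only as an image in the slide; for single-label (mutually exclusive) categories it is the standard one:

```latex
\text{overall accuracy} \;=\;
  \frac{\#\{\text{test documents assigned to the correct category}\}}
       {\#\{\text{test documents}\}}
```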

19 Bins: Robust Version of Naïve Bayes
Performance is often similar, but can be much better.
[Results shown for two tasks: Indoor/Outdoor and Events (Politics, Struggle, Disaster, Crime, Other)]

20 Bins: Robust Version of Naïve Bayes
Performs well against other alternatives.
[Results shown for Indoor/Outdoor and Events (Politics, Struggle, Disaster, Crime, Other)]

21 Combine Bins and Naïve Bayes
Idea:
– Might be better to use the Naïve Bayes weight when there is enough evidence for a word
– Back off to the bin weight otherwise
System updated to allow combinations of weights based on level of evidence.
How can we automatically determine when to use which weights???
– Entropy
– Minimum Squared Error (MSE)
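
A minimal sketch of the backoff idea, assuming "evidence" is simply the number of training documents containing the word; the thresholds and the interpolation schedule here are illustrative (the actual system reads them from a configuration file, as the next slide describes):

```python
def combined_weight(word, cat, nb_weight, bin_weight, evidence):
    """Interpolate between the Naive Bayes weight and the bin weight
    according to how much evidence (training occurrences) the word has."""
    if evidence == 0:
        alpha = 0.0            # no evidence: trust the bin entirely
    elif evidence == 1:
        alpha = 0.5            # a single occurrence: split the difference
    else:
        alpha = 1.0            # enough evidence: trust the per-word NB weight
    return alpha * nb_weight(word, cat) + (1 - alpha) * bin_weight(word, cat)
```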

22 Can Provide File to System that Specifies How to Combine Weights
Based on Entropy: …
Based on MSE: …
– Use only bins for evidence of 0
– Weight bins and NB equally for evidence of 1
– Use only NB for evidence of 1 or more

23 Best Performance Yet
[Results shown for Indoor/Outdoor and Events (Politics, Struggle, Disaster, Crime, Other)]

24 Attempts to Improve Results
One idea: label more documents!
– Usually works
– Boring
Another idea: use unlabeled documents!
– Easily obtainable
– But can this really work???
– Maybe it can…

25 Binning Using Unlabeled Documents
Apply the system to unlabeled documents.
Choose documents with "confident" predictions:
– Each word gets a new feature: number of occurrences in documents predicted to belong to each category
– Probably less important than the number of occurrences in documents definitely belonging to a category
– Bins provide a natural means of weighting the new feature
Bins are based on original counts (from training data) and new counts (from unlabeled data).
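
A rough sketch of the unlabeled-data step, assuming "confident" simply means the top category's score beats the runner-up by a margin; the margin, the `score` interface, and the function name are assumptions for illustration:

```python
from collections import Counter

def counts_from_unlabeled(unlabeled_docs, score, categories, margin=2.0):
    """Predict a category for each unlabeled document, keep only confident
    predictions (top score beats the runner-up by `margin`), and return
    per-category document counts for each word.  These counts become an
    extra binning feature alongside the counts from the labeled data."""
    new_counts = {c: Counter() for c in categories}
    for doc in unlabeled_docs:
        ranked = sorted(((score(doc, c), c) for c in categories), reverse=True)
        (best, cat), (runner_up, _) = ranked[0], ranked[1]
        if best - runner_up >= margin:
            new_counts[cat].update(set(doc))   # count each word once per document
    return new_counts
```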

26 Should the New Feature Matter?

27 Did the New Feature Help? No
Why???
– New features add information but make bins smaller
– Perhaps more data isn't needed in the first place
Should more data matter?
– Hard to accumulate more labeled data
– Easy to try out less labeled data!

28 Does Size Matter?

29 Conclusions
Binning: a robust version of Naïve Bayes
– Smoothing is good
– Reliable log-likelihood ratios even for small counts: "plane" (9 outdoor docs, 0 indoor docs) → 4.3 times more likely to be outdoor than indoor
– Usually improves performance
– Best when combined with Naïve Bayes
Unlabeled data
– Not helping with our tasks
– Same methodology might help with other tasks
