Text Classification
Chapter 2 of "Learning to Classify Text Using Support Vector Machines" by Thorsten Joachims, Kluwer, 2002.

Text Classification (TC): Definition
Infer a classification rule from a sample of labelled training documents (the training set) so that it classifies new examples (the test set) with high accuracy. Using the "ModApte" split (of the Reuters-21578 collection), the ratio of training documents to test documents is about 3:1.

Three settings
- Binary setting (simplest): only two classes, e.g. "relevant" vs. "non-relevant" in IR, or "spam" vs. "legitimate" in spam filtering.
- Multi-class setting: e.g. routing at a service hotline to one out of ten customer representatives. Can be reduced to binary tasks with the "one against the rest" strategy.
- Multi-label setting: e.g. semantic topic identifiers for indexing news articles; an article can be in one, many, or no categories. Can also be split into a set of binary classification tasks, as in the sketch below.
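As a rough illustration (not from the chapter), the multi-label case can be reduced to one binary problem per category; the category names and documents below are invented.

```python
# Reduce a multi-label corpus to one binary task per category ("one against the rest").
# Toy data; any binary learner could then be trained on each (text, 0/1) set separately.
corpus = [
    ("oil prices rise on supply fears", {"crude", "money-fx"}),
    ("central bank cuts interest rates", {"money-fx"}),
    ("new wheat harvest forecasts released", {"grain"}),
]
categories = {"crude", "money-fx", "grain"}

binary_tasks = {
    c: [(text, 1 if c in labels else 0) for text, labels in corpus]
    for c in categories
}

for c, examples in binary_tasks.items():
    print(c, examples)
```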

Representing text as example vectors
- The basic building blocks for representing text are called indexing terms.
- Word-based indexing terms are the most common. Very effective in IR, even though words such as "bank" have more than one meaning.
- Advantage of simplicity: split the input text into words at white space.
- Assume the ordering of words is irrelevant: the "bag of words" model. Only the frequency of each word in the document is recorded.
- The "bag of words" model ensures that each document is represented by a vector of fixed dimensionality. Each component of the vector holds the value of one attribute (e.g. the frequency of that word in that document, TF).
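A minimal bag-of-words sketch (my own illustration): split on white space, ignore word order, and record term frequencies; a fixed vocabulary gives every document a vector of the same dimensionality.

```python
from collections import Counter

def tokenize(text):
    # Simplest possible word splitting: lower-case and split at white space.
    return text.lower().split()

docs = ["the bank raised rates", "the river bank flooded the town"]

# Fixed vocabulary built from the training collection.
vocabulary = sorted({w for d in docs for w in tokenize(d)})

def tf_vector(text):
    # One component per vocabulary word; value = term frequency (TF) in this document.
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocabulary]

print(vocabulary)
print(tf_vector(docs[1]))
```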

Other levels of text representation
More sophisticated representations than the bag-of-words model have not yet shown consistent and substantial improvements.
- Sub-word level: e.g. character n-grams, which are robust against spelling errors. See Kjell's neural network below.
- Multi-word level: may use syntactic phrase indexing such as noun phrases (e.g. adjective-noun pairs) or co-occurrence patterns (e.g. "speed limit").
- Semantic level: Latent Semantic Indexing (LSI) aims to automatically generate semantic categories from a bag-of-words representation. Another approach is to use thesauri.
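For the sub-word level, a small sketch of character n-gram extraction (my example, not Kjell's): a misspelled word still shares most of its n-grams with the correct form.

```python
def char_ngrams(word, n=3):
    # Pad so that word boundaries become part of the n-grams.
    padded = f"_{word}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("retrieval"))   # ['_re', 'ret', 'etr', ...]
print(char_ngrams("retreival"))   # shares most trigrams with the correct spelling
```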

Feature Selection
Removes irrelevant or inappropriate attributes from the representation. Advantages are protection against over-fitting and increased computational efficiency, since there are fewer dimensions to work with. The two most common strategies are:
a) Feature subset selection: use a subset of the original features.
b) Feature construction: new features are introduced by combining original features.

Feature subset selection techniques
- Stopword elimination (removes very high-frequency words).
- Document frequency thresholding (removes infrequent words, e.g. those occurring fewer than m times in the training corpus); a small sketch follows this list.
- Mutual information.
- Chi-squared test (X²).
But: an appropriate learning algorithm should be able to detect irrelevant features as part of the learning process.
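A minimal sketch of document frequency thresholding (illustration only; min_df = 2 is an arbitrary choice):

```python
from collections import Counter

def df_threshold_vocabulary(tokenized_docs, min_df=2):
    # Keep only terms that occur in at least min_df training documents.
    df = Counter()
    for doc in tokenized_docs:
        for term in set(doc):          # count each term once per document
            df[term] += 1
    return {t for t, n in df.items() if n >= min_df}

docs = [["oil", "prices", "rise"], ["oil", "futures", "fall"], ["rare", "word"]]
print(df_threshold_vocabulary(docs))   # {'oil'}
```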

Mutual Information
We consider the association between a term t and a category c: how often do they occur together, compared with how common the term is and how common membership of the category is? Using document counts:
- A = number of documents in c that contain t
- B = number of documents outside c that contain t
- C = number of documents in c that do not contain t
- D = number of documents outside c that do not contain t
- N = A + B + C + D
MI(t,c) = log( A·N / ((A + C)(A + B)) )
- If MI > 0 there is a positive association between t and c.
- If MI = 0 there is no association between t and c.
- If MI < 0 then t and c are in complementary distribution.
With log base 2, the units of MI are bits of information.
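The formula above translates directly into code; this sketch (mine) uses log base 2 so the result is in bits, and assumes A > 0.

```python
import math

def mutual_information(A, B, C, D):
    # A: docs in category c containing term t      B: docs outside c containing t
    # C: docs in c not containing t                D: docs outside c not containing t
    N = A + B + C + D
    return math.log2(A * N / ((A + C) * (A + B)))

# Term positively associated with the category: MI > 0
print(mutual_information(A=40, B=10, C=10, D=40))   # about 0.68 bits
```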

Chi-squared measure (X²)
X²(t,c) = N·(AD - CB)² / ((A+C)(B+D)(A+B)(C+D))
E.g. X² for words in US as opposed to UK English (1990s): percent 485.2; U 383.3; toward 327.0; program 324.4; Bush 319.1; Clinton 316.8; President 273.2; programs 262.0; American 224.9; S
These feature subset selection methods do not allow for dependencies between words, e.g. "click here". See Yang and Pedersen (1997), A Comparative Study on Feature Selection in Text Categorization.
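A corresponding sketch for the chi-squared statistic, using the same contingency counts (my illustration):

```python
def chi_squared(A, B, C, D):
    # Same document contingency counts as for mutual information.
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

print(chi_squared(A=40, B=10, C=10, D=40))   # 36.0
```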

Term Weighting
A "soft" form of feature selection: it does not remove attributes, but adjusts their relative influence. Three components:
- Document component, e.g. binary (present in document = 1, absent = 0) or term frequency (TF).
- Collection component, e.g. inverse document frequency, log(N / DF).
- Normalisation component, so that large and small documents can be compared on the same scale, e.g. 1 / sqrt(Σ x_j²).
The final weight is the product of the three components.
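Putting the three components together, a hedged TF-IDF sketch (the exact weighting variant used in the book may differ):

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, all_docs_tokens, vocabulary):
    N = len(all_docs_tokens)
    counts = Counter(doc_tokens)
    weights = []
    for term in vocabulary:
        tf = counts[term]                                  # document component
        df = sum(1 for d in all_docs_tokens if term in d)  # document frequency
        idf = math.log(N / df) if df else 0.0              # collection component
        weights.append(tf * idf)
    norm = math.sqrt(sum(w * w for w in weights))          # normalisation component
    return [w / norm for w in weights] if norm else weights

docs = [["oil", "prices", "rise"], ["oil", "futures", "fall"]]
vocab = sorted({t for d in docs for t in d})
print(tfidf_vector(docs[0], docs, vocab))
```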

Feature Construction
The new features should represent most of the information in the original representation while minimising the number of attributes. Examples of techniques are:
- Stemming
- Thesauri: group words into semantic categories, e.g. synonyms can be placed in equivalence classes.
- Latent Semantic Indexing
- Term clustering

Learning Methods
- Naïve Bayes classifier
- Rocchio algorithm
- k-nearest neighbours
- Decision tree classifier
- Neural nets
- Support Vector Machines

Naïve Bayesian Model (1)
Spam filter example from Sahami et al.
Odds(Rel|x) = Odds(Rel) · Pr(x|Rel) / Pr(x|NRel)
With the naïve independence assumption:
Pr("cheap" "v1agra" "NOW!" | spam) = Pr("cheap"|spam) · Pr("v1agra"|spam) · Pr("NOW!"|spam)
Only classify as spam if the odds are better than 100 to 1.
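A toy sketch of the odds computation (the probabilities and prior are invented; only the multiplicative independence assumption and the 100:1 threshold come from the slide above):

```python
def spam_odds(words, prior_odds, p_word_given_spam, p_word_given_legit):
    # Naive Bayes assumption: per-word likelihood ratios multiply independently.
    odds = prior_odds
    for w in words:
        odds *= p_word_given_spam.get(w, 1e-6) / p_word_given_legit.get(w, 1e-6)
    return odds

p_spam  = {"cheap": 0.05,  "v1agra": 0.02,    "NOW!": 0.03}
p_legit = {"cheap": 0.002, "v1agra": 0.00001, "NOW!": 0.001}

odds = spam_odds(["cheap", "v1agra", "NOW!"], prior_odds=0.25,
                 p_word_given_spam=p_spam, p_word_given_legit=p_legit)
print(odds, "-> spam" if odds > 100 else "-> keep")
```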

Naïve Bayesian model (2)
Sahami et al. use word indicators, and also the following non-word indicators:
- Phrases: "free money", "only $", "over 21"
- Punctuation: !!!!
- Domain name of sender: .edu is less likely to be spam than .com
- Time of sending: junk mail is more likely to be sent at night than legitimate mail.
- Is the recipient an individual user or a mailing list?

Our Work on the Enron Corpus: the PERC (George Ke)
- Find a centroid c_i for each category C_i.
- For each test document x:
  - Find the k nearest neighbouring training documents to x.
  - The similarity between x and a neighbouring training document d_j is added to the similarity between x and c_i (where C_i is the category of d_j).
- Sort the similarity scores sim(x, C_i) in descending order.
- The decision to assign x to C_i can be made using various thresholding strategies.
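A rough sketch of the scoring loop described above; how the centroid and k-NN contributions are combined here, and the use of cosine similarity, are my assumptions rather than the exact PERC formulation.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def perc_scores(x, centroids, training_docs, k=3):
    # centroids: {category: centroid vector}
    # training_docs: list of (vector, category) pairs
    # Start from the similarity between x and each category centroid ...
    scores = {c: cosine(x, centroid) for c, centroid in centroids.items()}
    # ... then add the similarity of each of the k nearest training documents
    # to the score of the category that neighbour belongs to.
    neighbours = sorted(training_docs, key=lambda dc: cosine(x, dc[0]), reverse=True)[:k]
    for vec, cat in neighbours:
        scores[cat] += cosine(x, vec)
    # sim(x, C_i) in descending order; a thresholding strategy makes the final decision.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

centroids = {"projects": [1.0, 0.0], "personal": [0.0, 1.0]}
training = [([0.9, 0.1], "projects"), ([0.2, 0.8], "personal"), ([0.8, 0.3], "projects")]
print(perc_scores([1.0, 0.2], centroids, training, k=2))
```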

Rationale for the PERC Hybrid Approach
- The centroid method overcomes data sparseness: emails tend to be short.
- kNN allows the topic of a folder to drift over time.
- Considering the vector space locally allows matching against features which are currently dominant.

Kjell: A Stylometric Multi-Layer Perceptron
[Figure: a multi-layer perceptron with input units for letter pairs (aa, ab, ac, ad, ae, ...), a hidden layer (h1, h2, h3), and two output units o1 (Shakespeare) and o2 (Marlowe); w11 labels one of the input-to-hidden weights.]

Performance Measures (PM)
The measures used for evaluating TC are often different from those optimised by the learning algorithms. Two families:
- Loss-based measures (error rate and cost models).
- Precision- and recall-based measures.

Error Rate and Asymmetric Cost
Error rate is the probability of the classification rule predicting the wrong class. Writing f_pt for the number of test documents with prediction p and true class t (so f+- is a false positive and f-+ a false negative):
Err = (f+- + f-+) / (f++ + f+- + f-+ + f--)
Problem: negative examples tend to outnumber positive examples, so always guessing "not in category" already gives a very low error rate. For many applications, predicting a positive example correctly is of higher utility than predicting a negative example correctly. We can incorporate this into the performance measure using a cost (or, inversely, utility) matrix:
Err = (C++·f++ + C+-·f+- + C-+·f-+ + C--·f--) / (f++ + f+- + f-+ + f--)
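The two formulas as code (my sketch; the cost values in the example matrix are arbitrary):

```python
def error_rate(f_pp, f_pn, f_np, f_nn):
    # Fraction of documents the rule classifies wrongly.
    return (f_pn + f_np) / (f_pp + f_pn + f_np + f_nn)

def cost_weighted_error(f_pp, f_pn, f_np, f_nn, c_pp=0, c_pn=1, c_np=5, c_nn=0):
    # Example cost matrix: missing a true positive costs 5 times a false alarm.
    total = f_pp + f_pn + f_np + f_nn
    return (c_pp * f_pp + c_pn * f_pn + c_np * f_np + c_nn * f_nn) / total

print(error_rate(10, 5, 15, 970))            # 0.02
print(cost_weighted_error(10, 5, 15, 970))   # 0.08
```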

Precision and Recall
The recall of a classification rule is the probability that a document that should be in the category is classified correctly:
R = f++ / (f++ + f-+)
Precision is the probability that a document classified into a category is indeed classified correctly:
P = f++ / (f++ + f+-)
If P and R are equally important, they are combined as
F = 2PR / (P + R)
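The same measures as code, using the contingency notation above (sketch):

```python
def precision_recall_f(f_pp, f_pn, f_np):
    recall = f_pp / (f_pp + f_np)       # of the true category members, how many were found
    precision = f_pp / (f_pp + f_pn)    # of the predicted members, how many are correct
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f(f_pp=10, f_pn=5, f_np=15))   # (0.667, 0.4, 0.5)
```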

Micro- and macro-averaging
Often it is useful to compute the average performance of a learning algorithm over multiple training/test sets or multiple classification tasks. In particular, for the multi-label setting one is usually interested in how well all the labels can be predicted, not only a single one. This leads to the question of how the results of m binary tasks can be averaged to get a single performance value.
- Macro-averaging: the performance measure (e.g. R or P) is computed separately for each of the m experiments; the average is the arithmetic mean of the measure over all experiments.
- Micro-averaging: instead, average the contingency tables of the m experiments to produce f++(avg), f+-(avg), f-+(avg), f--(avg). For recall, this gives R(micro) = f++(avg) / (f++(avg) + f-+(avg)).
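A sketch contrasting the two averages over m binary tasks (the contingency tables are invented for illustration):

```python
def macro_recall(tables):
    # tables: list of (f_pp, f_pn, f_np, f_nn), one per binary task
    return sum(t[0] / (t[0] + t[2]) for t in tables) / len(tables)

def micro_recall(tables):
    # Average (or equivalently sum) the tables first, then compute recall once.
    f_pp = sum(t[0] for t in tables)
    f_np = sum(t[2] for t in tables)
    return f_pp / (f_pp + f_np)

tables = [(90, 10, 10, 890), (2, 1, 8, 989)]   # one frequent, one rare category
print(macro_recall(tables))   # 0.55  (the rare category drags the macro average down)
print(micro_recall(tables))   # 0.836 (dominated by the frequent category)
```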