Proposing a New Term Weighting Scheme for Text Categorization LAN Man School of Computing National University of Singapore 12th July 2006

Synopsis –Introduction –Our Position –Survey and Analysis: Supervised and Unsupervised Term Weighting Schemes –Experiments –Concluding Remarks

Introduction: Background Explosive growth of the Internet and rapid increase in available textual information Need for organizing and accessing this information in flexible ways Text Categorization (TC) is the task of classifying natural language documents into a predefined set of semantic categories

Introduction: Applications of TC Categorize web pages by topic (directories like Yahoo!); Customize online newspapers according to a particular user’s reading preferences; Filter spam e-mails and forward incoming e-mails to the target expert by content; Word sense disambiguation can also be treated as a text categorization task once we view word occurrence contexts as documents and word senses as categories.

Introduction Two Subtasks in Text Categorization: –1. Text Representation –2. Construction of Text Classifier

Introduction: Construction of Text Classifier Approaches to learning a classifier: –No more than 20 algorithms –Borrowed from Information Retrieval: Rocchio –Machine Learning: SVM, kNN, Decision Tree, Naïve Bayes, Neural Network, Linear Regression, Decision Rule, Boosting, etc. SVM performs best for TC.

Introduction: Text Representation Texts come in various formats, such as DOC, PDF, PostScript, HTML, etc. Can a computer read them as we do? No. They must be converted into a compact format that a classifier or a classifier-building algorithm can recognize and categorize. This indexing procedure is also called text representation.

Introduction: Vector Space Model Texts are vectors in the term space. Assumption: documents that are “close together” in space are similar in meaning.
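To make this concrete, here is a minimal Python sketch (not from the slides) of the vector space model: each document becomes a term-count vector over a fixed vocabulary, and “closeness” is measured by cosine similarity. The vocabulary and tokenization are illustrative only.

```python
import math
from collections import Counter

def term_vector(tokens, vocabulary):
    """Map a tokenized document to a raw term-frequency vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocabulary = ["acquire", "stake", "payout", "dividend"]      # toy term space
d1 = term_vector("acquire stake stake".split(), vocabulary)
d2 = term_vector("stake payout dividend".split(), vocabulary)
print(cosine_similarity(d1, d2))  # documents sharing terms score closer to 1
```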

Introduction: Text Representation Two main issues in Text Representation: –1. What a term is –2. How to weight a term 1. What a term is –Sub-word level: syllables –Word level: single tokens –Multi-word level: phrases, sentences, etc. –Syntactic and semantic level: senses (meanings) Word level works best

Introduction: Text Representation 2. The weight of a term represents how much the term contributes to the semantics of the document. How to weight a term (our current focus): –Simplest method: binary –Most popular method: tf.idf –Combination with a linear classifier –Combination with information-theoretic or statistical metrics: tf.chi2, tf.ig, tf.gr, tf.rf (the scheme we propose)

Our Position Leopold (2002) stated that text representation, rather than the kernel functions of SVMs, dominates the performance of text categorization. There is little room to improve performance from the algorithm aspect: –Excellent algorithms are few. –The rationale is inherent to each algorithm, and the method is usually fixed for a given algorithm. –Tuning the parameters brings limited improvement.

Our Position Analyze a term’s discriminating power for text categorization Present a new term weighting method Compare supervised and unsupervised term weighting methods Investigate the relationship between different term weighting methods and machine learning algorithms

Survey and Analysis Salton’s three considerations for term weighting: –1. Term occurrence: binary, tf, ITF, log(tf) –2. Term’s discriminative power: idf (note: also chi^2, ig (information gain), gr (gain ratio), mi (mutual information), or (Odds Ratio), etc.) –3. Document length: cosine normalization, linear normalization
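As a sketch of how the three considerations combine, classic tf.idf with cosine normalization can be written as follows. This assumes the standard idf = log(N / df) form; the slides do not fix a particular idf variant.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one cosine-normalized tf.idf dict per document."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency per term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # consideration 1 (term occurrence: tf) x consideration 2 (discriminative power: idf)
        vec = {t: f * math.log(n / df[t]) for t, f in tf.items()}
        # consideration 3 (document length): cosine normalization
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors
```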

Survey and Analysis “Text categorization” is a form of supervised learning The prior information on the membership of training documents in predefined categories is useful for: –Feature selection –Supervised learning of the classifier

Survey and Analysis Supervised term weighting methods –Use the prior information on the membership of training documents in predefined categories Unsupervised term weighting methods –Do not use this information –binary, tf, log(1+tf), ITF –Most popular: tf.idf and its variants, logtf.idf and tf.idf-prob

Survey and Analysis: Supervised Term Weighting Methods 1. Combination with information-theoretic functions or statistical metrics –such as chi2, information gain, gain ratio, Odds ratio, etc. –Used in the feature selection step –Select the most relevant and discriminating features for the classification task, that is, the terms with higher feature selection scores –The reported results are inconsistent and/or incomplete
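For reference, a minimal sketch of the chi^2 score for one term/category pair, computed from the four contingency counts. The cell naming below is ours and may not match the slides’ a/b/c/d convention; in feature selection, terms with higher scores are kept, and multiplying the score by tf gives the tf.chi2 weight.

```python
def chi_square(pos_with, pos_without, neg_with, neg_without):
    """chi^2 statistic of the term/category contingency table.
    pos_with: positive-category docs containing the term, etc. (our naming)."""
    a, b, c, d = pos_with, pos_without, neg_with, neg_without
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0
```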

Survey and Analysis: Supervised Term Weighting Methods 2. Interaction with a supervised linear text classifier –Linear SVM and Perceptron –The text classifier separates positive test documents from negative ones by assigning different scores to the test samples; these scores are believed to be effective in assigning more appropriate weights to terms

Survey and Analysis: Analysis of term’s discriminating power Assume the six terms t1–t6 have the same tf value. t1, t2, t3 share the same idf1 value; t4, t5, t6 share the same idf2 value. Clearly, the six terms contribute differently to the semantics of documents.

Survey and Analysis: Analysis of term’s discriminating power (figure: distribution of terms t1–t6 over the positive and negative categories)

Case 1: t1 contributes more than t2 and t3; t4 contributes more than t5 and t6. Case 2: t4 contributes more than t1 although idf(t4) < idf(t1). Case 3: for t1 and t3, if a(t1)=d(t3) and b(t1)=c(t3), then chi2(t1)=chi2(t3), but t1 may contribute more than t3

Survey and Analysis: Analysis of term’s discriminating power rf – relevance frequency: rf = log2(2 + b/c) The rf value depends only on the ratio of b to c and does not involve d. The base of the log is 2; in case c = 0, c is set to 1.
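A minimal sketch of rf and tf.rf as read off the slide, where b and c denote the counts of positive- and negative-category documents containing the term (our reading of the slide’s contingency notation):

```python
import math

def rf(b, c):
    """Relevance frequency: rf = log2(2 + b/c), with c set to 1 when c = 0."""
    return math.log2(2 + b / max(1, c))

def tf_rf(tf, b, c):
    """tf.rf weight: raw term frequency scaled by relevance frequency."""
    return tf * rf(b, c)

print(tf_rf(3, b=40, c=5))   # a term concentrated in the positive category
print(tf_rf(3, b=5, c=40))   # same tf, but the term favors the negative category
```

The ratio b/c rewards terms concentrated in the positive category, so two terms with identical tf and idf can still receive different weights, which is exactly the discriminating power the preceding cases illustrate.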

Survey and Analysis: Comparison of idf, rf and chi2 values of four features (acquir, stake, payout, dividend) in two categories of the Reuters corpus, 00_acq and 03_earn (table: idf, rf and chi2 listed per feature under each category)

Experimental Methodology: eight term weighting schemes

Methods                      Denotation  Description
Unsupervised Term Weighting  binary      0: absence, 1: presence
                             tf          term frequency alone
                             tf.idf      classic tf.idf
Supervised Term Weighting    tf.rf       tf * rf
                             rf          binary * rf
                             tf.chi2     chi square
                             tf.ig       ig: information gain
                             tf.or       or: Odds Ratio

Experimental Methodology –8 commonly used term weighting schemes –2 benchmark data collections: the Reuters News Corpus (skewed category distribution) and the 20 Newsgroups Corpus (uniform category distribution) –2 popular machine learning algorithms: SVM and kNN –Micro- and macro-averaged F1 measures –Significance tests
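As a sketch of the two evaluation measures (standard definitions, not taken from the slides): micro-averaged F1 pools the contingency counts over all categories, so frequent categories dominate it on the skewed Reuters corpus, while macro-averaged F1 averages the per-category F1 scores and is more sensitive to rare categories.

```python
def f1(tp, fp, fn):
    """F1 from one category's true-positive, false-positive and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(per_category):
    """per_category: list of (tp, fp, fn) tuples, one tuple per category."""
    micro = f1(sum(t for t, _, _ in per_category),
               sum(f for _, f, _ in per_category),
               sum(n for _, _, n in per_category))
    macro = sum(f1(*c) for c in per_category) / len(per_category)
    return micro, macro
```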

Experimental Results: 1. Results on Reuters News Corpus using SVM

Experimental Results: 2. Results on Reuters News Corpus using kNN

Experimental Results: 3. Results on 20Newsgroups Corpus using SVM

Experimental Results: 4. Results on 20Newsgroups Corpus using kNN

Experimental Results: McNemar’s Significance Tests

Algorithm  Corpus         #_fea  Significance Test Results
SVM        Reuters        15937  (tf.rf, tf, rf) > tf.idf > (tf.ig, tf.chi2, binary) >> tf.or
SVM        20 Newsgroups         (rf, tf.rf, tf.idf) > tf >> binary >> tf.or >> (tf.ig, tf.chi2)
kNN        Reuters        405    (binary, tf.rf) > tf >> (tf.idf, rf, tf.ig) > tf.chi2 >> tf.or
kNN        20 Newsgroups  494    (tf.rf, binary, tf.idf, tf) >> rf >> (tf.or, tf.ig, tf.chi2)
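The slides report only the outcome of the tests; below is a minimal sketch of the standard McNemar statistic (with the usual continuity correction, our assumption), built from the two classifiers’ disagreements on the test set:

```python
def mcnemar_statistic(n01, n10):
    """McNemar's chi^2 with continuity correction.
    n01: test documents classifier A misclassifies but classifier B gets right;
    n10: the reverse. Under the null hypothesis of equal error rates the
    statistic is approximately chi^2 with 1 degree of freedom, so values
    above 3.84 indicate a significant difference at the 0.05 level."""
    if n01 + n10 == 0:
        return 0.0
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
```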

Experimental Discussion: Effects of feature set size on algorithms For SVM, almost all methods achieved their best performance when the full vocabulary was used as input For kNN, the best performance was achieved at a smaller feature set size Possible reason: different noise resistance

Experimental Discussion: Summary of different methods Generally, supervised and unsupervised methods have not shown universally consistent performance The exception is tf.rf, which shows the best performance consistently rf alone shows performance comparable to tf.rf, except on Reuters using kNN

Experimental Discussion: Summary of different methods Specifically, the three typical supervised methods based on information theory, tf.chi2 (chi square), tf.ig (information gain) and tf.or (Odds ratio), are the worst methods.

Experimental Discussion: Summary of different methods The three unsupervised methods, tf, tf.idf and binary, show mixed results with respect to each other. –Example 1: tf is better than tf.idf on Reuters using SVM, but it is the other way round on 20 Newsgroups using SVM. –Example 2: the three are comparable to each other on 20 Newsgroups using kNN

Experimental Discussion: Summary of different methods 1. kNN favors binary while SVM does not 2. The good performance of tf.idf on 20 Newsgroups using SVM and kNN may be attributed to a natural property of the corpus: its uniform category distribution 3. tf outperforms many other methods although it does not perform comparably to tf.rf

Concluding Remarks 1 Not all supervised term weighting methods are consistently superior to unsupervised term weighting methods. Specifically, the three supervised methods based on information theory, i.e. tf.chi2, tf.ig and tf.or, perform rather poorly in all experiments.

Concluding Remarks 2 On the other hand, the newly proposed supervised method, tf.rf, achieves the best performance consistently and outperforms the other methods substantially and significantly.

Concluding Remarks 3 Neither tf.idf nor binary shows consistent performance. –Specifically, tf.idf is comparable to tf.rf on the corpus with a uniform category distribution –binary is comparable to tf.rf with the kNN-based text classifier

Concluding Remarks 4 tf does not perform as well as tf.rf, but it performs consistently well and significantly outperforms most other methods

Concluding Remarks We suggest that tf.rf be used as the term weighting method for the TC task. The observations above are based on controlled experiments.

Future Work Can we observe similar results in more general experimental settings, such as different learning algorithms, different performance measures and other benchmark collections? Term weighting is the most basic component of text preprocessing; can we integrate it into other text mining tasks, such as information retrieval, text summarization, etc.?