A Comparative Study on Feature Selection in Text Categorization (Proc. 14th International Conference on Machine Learning – 1997) Paper By: Yiming Yang, CMU Jan O. Pedersen, Verity, Inc. Presented By: Prerak Sanghvi Computer Science and Engineering Department State University of New York at Buffalo

Introduction This paper is a comparative study of feature selection methods in statistical learning of text categorization. Five methods were evaluated:
– Document Frequency (DF)
– Information Gain (IG)
– Mutual Information (MI)
– χ² test (CHI)
– Term Strength (TS)

Document Frequency (DF) Document Frequency is the number of documents in which a term occurs. Terms whose document frequency is less than some predetermined threshold are removed from the feature space. The basic assumption is that rare terms are either non-informative for category prediction or not influential in global performance. However, this assumption conflicts with the common view in information retrieval that low-frequency terms can be relatively informative, so DF thresholding must be handled carefully.
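A minimal Python sketch of DF thresholding over a toy corpus (function and variable names are illustrative, not from the paper):

```python
from collections import Counter

def df_filter(tokenized_docs, min_df=2):
    """Keep terms occurring in at least min_df documents."""
    # Document frequency: count each term at most once per document.
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    return {term for term, count in df.items() if count >= min_df}

# 'rare' occurs in a single document and is removed from the feature space.
docs = [["text", "mining", "rare"], ["text", "categorization"], ["mining", "text"]]
print(sorted(df_filter(docs, min_df=2)))  # ['mining', 'text']
```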

Information Gain (IG) IG measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document. For a term t and a set of classes c_1, …, c_m:
G(t) = - Σ_{i=1..m} Pr(c_i) log Pr(c_i) + Pr(t) Σ_{i=1..m} Pr(c_i|t) log Pr(c_i|t) + Pr(¬t) Σ_{i=1..m} Pr(c_i|¬t) log Pr(c_i|¬t)

Information Gain (IG)… Given a training corpus, IG is computed for each unique term, and terms whose IG is less than some predetermined threshold are removed from the feature space.
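A rough Python sketch of G(t) as defined above, treating each document as a set of terms; an illustrative reimplementation, not the authors' code:

```python
import math
from collections import Counter

def information_gain(tokenized_docs, labels, term):
    """G(t) = -sum_i Pr(c_i) log Pr(c_i)
              + Pr(t)  sum_i Pr(c_i|t)  log Pr(c_i|t)
              + Pr(~t) sum_i Pr(c_i|~t) log Pr(c_i|~t)"""
    n = len(tokenized_docs)
    doc_sets = [set(d) for d in tokenized_docs]
    with_t = [labels[i] for i in range(n) if term in doc_sets[i]]
    without_t = [labels[i] for i in range(n) if term not in doc_sets[i]]

    def plogp(label_subset):
        # sum_i Pr(c_i) log Pr(c_i) over the given subset (0 if empty).
        total = len(label_subset)
        if total == 0:
            return 0.0
        return sum((c / total) * math.log(c / total)
                   for c in Counter(label_subset).values())

    pr_t = len(with_t) / n
    return -plogp(labels) + pr_t * plogp(with_t) + (1 - pr_t) * plogp(without_t)
```

Terms whose G(t) falls below the chosen threshold would then be dropped from the feature space.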

Mutual Information (MI) Each word is ranked according to its mutual information with respect to the class labels. The mutual information criterion is defined as:
I(t, c) = log [ Pr(t ∧ c) / (Pr(t) · Pr(c)) ]
Category-specific scores are often combined as:
I_avg(t) = Σ_{i=1..m} Pr(c_i) I(t, c_i)
I_max(t) = max_{i=1..m} I(t, c_i)
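A small sketch of I(t, c) estimated from document counts, plus the max-combination over categories (illustrative names; the paper estimates these probabilities from term–category co-occurrence counts):

```python
import math

def mutual_information(tokenized_docs, labels, term, category):
    """I(t, c) = log( Pr(t ∧ c) / (Pr(t) · Pr(c)) ), estimated from documents."""
    n = len(tokenized_docs)
    has_t = [term in set(d) for d in tokenized_docs]
    in_c = [lab == category for lab in labels]
    pr_t = sum(has_t) / n
    pr_c = sum(in_c) / n
    pr_tc = sum(1 for t, c in zip(has_t, in_c) if t and c) / n
    if pr_tc == 0:
        return float("-inf")  # t and c never co-occur
    return math.log(pr_tc / (pr_t * pr_c))

def mi_max(tokenized_docs, labels, term):
    # I_max(t): the strongest association of t with any single category.
    return max(mutual_information(tokenized_docs, labels, term, c)
               for c in set(labels))
```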

χ² statistic (CHI) The χ² statistic measures the lack of independence between a term t and a category c. The χ² statistic is known to be unreliable for low-frequency terms.
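Not the paper's own implementation, but scikit-learn's chi2 scorer applies the same criterion; a minimal sketch on a made-up toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = ["stocks fall as prices rise", "the team won the match",
          "prices and stocks recover", "the match ended in a draw"]
labels = ["finance", "sport", "finance", "sport"]

# Bag-of-words counts, then keep the k terms whose chi-square score
# against the class labels is highest.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=4).fit(X, labels)
kept_terms = vectorizer.get_feature_names_out()[selector.get_support()]
print(kept_terms)
```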

Term Strength (TS) This method estimates term importance based on how likely a term is to appear in 'closely related' documents. It uses a training set of documents to derive document pairs whose similarity is above a threshold. This criterion is based on document clustering, assuming that documents with many shared words are related, and that terms in the heavily overlapping area of related documents are relatively informative.
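A dependency-free sketch of term strength, s(t) = Pr(t appears in y | t appears in x) over related document pairs; Jaccard overlap stands in here for whatever document-similarity measure and threshold are actually used, and the names are illustrative:

```python
from itertools import combinations

def term_strength(tokenized_docs, sim_threshold=0.3):
    """s(t): of the related (x, y) pairs where t occurs in x, the fraction
    where t also occurs in y. Each related pair is counted in both orders."""
    docs = [set(d) for d in tokenized_docs]
    in_first = {}   # term -> number of related pairs (x, y) with t in x
    in_both = {}    # term -> number of those pairs with t also in y
    for x, y in combinations(docs, 2):
        union = x | y
        if not union or len(x & y) / len(union) < sim_threshold:
            continue  # not a related pair
        for a, b in ((x, y), (y, x)):
            for t in a:
                in_first[t] = in_first.get(t, 0) + 1
                in_both[t] = in_both.get(t, 0) + (t in b)
    return {t: in_both[t] / in_first[t] for t in in_first}
```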

Conclusion IG and CHI were found to be the most effective for aggressive term removal without losing categorization accuracy, in experiments with kNN and LLSF (Linear Least Squares Fit) on the Reuters and OHSUMED collections. DF was found comparable to IG and CHI with up to 90% term removal, while TS was comparable with up to 50-60% removal. MI had inferior performance.