Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.

Introduction
Text clustering is the problem of automatically grouping free text documents into categories that are not predefined. It supports:
- Effective and efficient information retrieval
- Organized presentation of results
- Generating taxonomies and ontologies
A text document is represented as a bag of words, as illustrated below.
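
As a minimal illustration of the bag-of-words representation (the toy documents below are invented for the example), each document becomes a vector of term counts over the corpus vocabulary:

```python
from collections import Counter

docs = ["feature selection for text clustering",
        "text clustering with k-means",
        "feature selection reduces dimensionality"]  # invented toy documents

tokenized = [d.split() for d in docs]
vocabulary = sorted({t for tokens in tokenized for t in tokens})

def bag_of_words(tokens, vocabulary):
    """Map a token list to a vector of term counts over the vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]

for tokens in tokenized:
    print(bag_of_words(tokens, vocabulary))
```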

Introduction
The major problem of this approach is the high dimensionality of the feature space. The feature space consists of the unique terms that occur in the documents, which can number in the tens or hundreds of thousands. This is prohibitively high for many learning algorithms.

Introduction
The high dimensionality of the feature space is a challenge for clustering algorithms because of the inherent data sparseness. The concept of proximity, and hence of clustering, may not be meaningful in a high-dimensional feature space. The solution is to reduce the dimensionality of the feature space.

Feature Selection
Feature selection methods reduce the feature space by removing non-informative terms. The focus of this presentation is the evaluation and comparison of feature selection methods for reducing a high-dimensional feature space in text clustering problems.

Feature Selection
- What are the strengths and weaknesses of existing feature selection methods when applied to text clustering?
- To what extent can feature selection improve the accuracy of a classifier?
- How much of the document vocabulary can be reduced without losing useful information for category prediction?

Feature Selection Methods
A brief introduction to several feature selection methods:
- Information Gain (IG)
- Χ2 Statistics (CHI)
- Document Frequency (DF)
- Term Strength (TS)
- Entropy-based Ranking (En)
- Term Contribution (TC)

Information Gain (IG) Information gain is frequently employed as a term-goodness criterion in the field of machine learning. It measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document.

Information Gain (IG)
Let {c_i}, i = 1, …, m, denote the set of categories in the target space. The information gain of term t is defined to be:

G(t) = -\sum_{i=1}^{m} \Pr(c_i) \log \Pr(c_i) + \Pr(t) \sum_{i=1}^{m} \Pr(c_i \mid t) \log \Pr(c_i \mid t) + \Pr(\bar{t}) \sum_{i=1}^{m} \Pr(c_i \mid \bar{t}) \log \Pr(c_i \mid \bar{t})

Information Gain (IG)
Given a training corpus, information gain is computed for each unique term, and the terms whose information gain is less than some predetermined threshold are removed from the feature space. The computation involves estimating the conditional probabilities of a category given a term, followed by entropy computations. The probability estimation has a time complexity of O(N) and a space complexity of O(VN), where N is the number of training documents and V is the vocabulary size.
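
A rough Python sketch of this procedure, assuming documents are given as token sets with known category labels (the function and variable names are illustrative, not from the paper):

```python
import math
from collections import Counter, defaultdict

def information_gain(docs, labels):
    """docs: list of token sets; labels: parallel list of category labels.
    Returns a dict term -> G(t) following the definition above."""
    n = len(docs)
    cat_counts = Counter(labels)
    prior_entropy = -sum((c / n) * math.log(c / n) for c in cat_counts.values())

    df = Counter()                     # number of documents containing t
    df_per_cat = defaultdict(Counter)  # per term: docs of each category containing t
    for tokens, label in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
            df_per_cat[t][label] += 1

    gains = {}
    for t, n_t in df.items():
        p_t = n_t / n
        cond_present = 0.0  # sum_i Pr(c_i | t) log Pr(c_i | t)
        cond_absent = 0.0   # sum_i Pr(c_i | not t) log Pr(c_i | not t)
        for c, n_c in cat_counts.items():
            n_ct = df_per_cat[t][c]
            if n_ct > 0:
                p = n_ct / n_t
                cond_present += p * math.log(p)
            n_c_without_t = n_c - n_ct
            if n_c_without_t > 0 and n > n_t:
                p = n_c_without_t / (n - n_t)
                cond_absent += p * math.log(p)
        gains[t] = prior_entropy + p_t * cond_present + (1 - p_t) * cond_absent
    return gains

# Terms whose gain falls below a chosen threshold would then be removed.
```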

Χ2 Statistics (CHI)
The Χ2 statistic measures the lack of independence between t and c and can be compared to the Χ2 distribution with one degree of freedom. Using the contingency table of a term t and a category c, where A is the number of times t and c co-occur, B is the number of times t occurs without c, C is the number of times c occurs without t, D is the number of times neither c nor t occurs, and N is the total number of documents, the term-goodness measure is:

\chi^2(t, c) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}

Χ2 Statistics (CHI)
The Χ2 statistic has a natural value of zero if t and c are independent. The Χ2 statistic is computed between each unique term in a training corpus and each category, and the category-specific scores are combined into an average:

\chi^2_{avg}(t) = \sum_{i=1}^{m} \Pr(c_i) \, \chi^2(t, c_i)
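
A small Python sketch of the Χ2 computation under the same assumptions (documents as token sets with known labels); the helper names are illustrative:

```python
def chi_square(A, B, C, D):
    """Chi-square statistic of a term/category pair from its contingency table.
    A: docs containing t and in c; B: containing t but not in c;
    C: in c but without t; D: neither."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - C * B) ** 2 / denom

def chi_square_avg(term, docs, labels):
    """Category-averaged chi-square score of a term, weighted by Pr(c_i)."""
    n = len(docs)
    score = 0.0
    for c in set(labels):
        A = sum(1 for d, l in zip(docs, labels) if term in d and l == c)
        B = sum(1 for d, l in zip(docs, labels) if term in d and l != c)
        C = sum(1 for d, l in zip(docs, labels) if term not in d and l == c)
        D = n - A - B - C
        score += (labels.count(c) / n) * chi_square(A, B, C, D)
    return score
```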

Document Frequency (DF)
Document frequency is the number of documents in which a term occurs. Document frequency is computed for each unique term in the training corpus, and the terms whose DF is less than some predetermined threshold are removed from the feature space. The underlying assumption is that rare terms are either non-informative for category prediction or not influential in global performance. This contrasts with a common assumption in information retrieval that low-DF terms are relatively informative and should not be removed aggressively.
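
A minimal sketch of DF thresholding; the threshold value here is arbitrary and only for illustration:

```python
from collections import Counter

def df_filter(docs, min_df=2):
    """Keep only the terms that occur in at least `min_df` documents.
    docs: list of token sets; the threshold is an illustrative choice."""
    df = Counter(t for tokens in docs for t in set(tokens))
    return {t for t, count in df.items() if count >= min_df}
```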

Term Strength (TS)
Term strength was originally proposed and evaluated by Wilbur and Sirotkin for vocabulary reduction in text retrieval. This method estimates term importance based on how likely a term is to appear in “closely related” documents. It uses a training set of documents to derive document pairs whose similarity is above a threshold. Term strength is then computed as the estimated conditional probability that a term occurs in the second half of a pair of related documents given that it occurs in the first half.
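
A simplified sketch of term strength; Jaccard similarity and the threshold value are stand-ins here, whereas the original method computes similarity on weighted document vectors:

```python
from itertools import combinations

def jaccard(x, y):
    """Set overlap, used here as a stand-in similarity between two documents."""
    return len(x & y) / len(x | y) if (x | y) else 0.0

def term_strength(docs, sim_threshold=0.2):
    """docs: list of token sets. TS(t) estimates Pr(t in y | t in x) over
    ordered pairs (x, y) of related documents (similarity above threshold)."""
    related = [(x, y) for x, y in combinations(docs, 2)
               if jaccard(x, y) > sim_threshold]
    pairs = related + [(y, x) for x, y in related]  # count both directions
    vocab = set().union(*docs)
    strengths = {}
    for t in vocab:
        with_t_in_first = [(x, y) for x, y in pairs if t in x]
        strengths[t] = (sum(1 for x, y in with_t_in_first if t in y) / len(with_t_in_first)
                        if with_t_in_first else 0.0)
    return strengths
```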

Entropy Based Ranking
Consider each feature F_i as a random variable and f_i as its value. From information theory, the entropy is:

E(F_1, \ldots, F_M) = -\sum_{f_1} \cdots \sum_{f_M} p(f_1, \ldots, f_M) \log p(f_1, \ldots, f_M)

where p(f_1, …, f_M) is the probability or density at the point (f_1, …, f_M). If the probability is uniformly distributed, we are most uncertain about the outcome, and the entropy is at its maximum.

Entropy Based Ranking
When the data has well-formed clusters, the uncertainty is low and so is the entropy. In real-world data, clusters are rarely this well formed. Two points belonging to the same cluster or to two clearly different clusters contribute less to the total entropy than if they were uniformly separated. The similarity S_{i1,i2} between two instances X_{i1} and X_{i2} is high if the two instances are very close and low if they are far apart. The entropy E_{i1,i2} is low if S_{i1,i2} is either very high or very low, and high otherwise.

Entropy Based Ranking where Si,i is the similarity value between document di and dj and dj * Si, j is defined as follows: Si, j = e – α x disti,j α = - ln(0.5) / dist where disti,j is the distance between the document di and dj after the term t is removed

Term Contribution (TC)
Text clustering is highly dependent on document similarity:

Sim(d_i, d_j) = \sum_{t} f(t, d_i) \times f(t, d_j)

where f(t, d_i) represents the weight of term t in document d_i. The tf * idf weight is commonly used, where tf is the term frequency and idf is the inverse document frequency.

Term Contribution (TC)
The contribution of a term is its overall contribution to the pairwise document similarities, as shown by the following equation:

TC(t) = \sum_{i, j,\; i \neq j} f(t, d_i) \times f(t, d_j)
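
A small sketch of term contribution; it uses the identity that the sum of w_i * w_j over pairs i != j equals (sum of w_i)^2 minus the sum of w_i^2, and the input format is an assumption for illustration:

```python
def term_contribution(weights):
    """weights: dict mapping each term to its list of per-document weights
    f(t, d_i), e.g. tf*idf values. Computes sum over pairs i != j of
    f(t, d_i) * f(t, d_j) without an explicit double loop."""
    return {t: sum(w) ** 2 - sum(x * x for x in w) for t, w in weights.items()}
```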

Experiments
The supervised feature selection methods evaluated:
- IG
- CHI
The unsupervised feature selection methods evaluated:
- DF
- TS
- TC
- En (entropy-based ranking)

Experiments
- The K-Means algorithm is chosen to perform the actual clustering.
- Entropy and precision measures are used to evaluate clustering performance.
- 10 sets of initial centroids are chosen randomly.
- Before performing clustering, tf * idf (with the “ltc” scheme) is used to calculate the weight of each term.
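
A rough sketch of such a setup with scikit-learn (assumed here; the presentation does not specify an implementation). TfidfVectorizer with sublinear tf and L2 normalization only approximates the SMART “ltc” weighting:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["grain prices rise after export ban",
        "wheat and corn yields improve this season",
        "new graphics card sets benchmark records",
        "gpu prices fall as supply recovers"]  # toy corpus for illustration

# Sublinear tf + idf + L2 (cosine) normalization roughly mirrors the SMART
# "ltc" weighting; it is an approximation, not an exact reproduction.
vectorizer = TfidfVectorizer(sublinear_tf=True, norm="l2")
X = vectorizer.fit_transform(docs)

# Several random centroid initializations, echoing the 10 random sets used above.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
print(km.fit_predict(X))
```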

Performance Measure: Entropy
Entropy measures the uniformity or purity of a cluster. The overall entropy is defined as the weighted sum of the per-cluster entropies:

Entropy = \sum_{j} \frac{n_j}{n} E_j, \quad E_j = -\sum_{i} p_{ij} \log p_{ij}

where p_{ij} is the proportion of documents in cluster j that belong to class i, n_j is the size of cluster j, and n is the total number of documents.

Performance Measure: Precision
For each cluster, the class label shared by the most documents in the cluster becomes its final class label. The overall precision is defined as the weighted sum of the per-cluster precisions:

Precision = \sum_{j} \frac{n_j}{n} P_j, \quad P_j = \max_{i} p_{ij}
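
A small sketch of the two cluster-quality measures, assuming the predicted cluster assignments and the true class labels are available as parallel lists:

```python
import math
from collections import Counter

def entropy_and_precision(cluster_ids, class_labels):
    """Weighted entropy and precision over clusters.
    cluster_ids: predicted cluster of each document;
    class_labels: true class of each document (parallel lists)."""
    n = len(cluster_ids)
    total_entropy, total_precision = 0.0, 0.0
    for j, size in Counter(cluster_ids).items():
        class_counts = Counter(c for k, c in zip(cluster_ids, class_labels) if k == j)
        probs = [count / size for count in class_counts.values()]
        cluster_entropy = -sum(p * math.log(p) for p in probs)
        cluster_precision = max(probs)  # share of the majority class label
        total_entropy += (size / n) * cluster_entropy
        total_precision += (size / n) * cluster_precision
    return total_entropy, total_precision

print(entropy_and_precision([0, 0, 1, 1, 1], ["a", "a", "a", "b", "b"]))
```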

Data Sets
The data sets are Reuters-21578, 20 Newsgroups (20NG), and one web directory data set (Web).

Data Set | Num of Classes | Num of Documents | Num of Terms | Avg Terms | Avg DF
Reuters  | 80 | 10733 | 18484 | 40.7  | 23.6
20NG     | 20 | 18828 | 91652 | 85.3  | 17.5
WEB      | 35 | 5035  | 56399 | 131.9 | 11.8

Results and Analysis
Supervised feature selection (IG and CHI):
- In general, feature selection makes little progress on Reuters and 20NG.
- It achieves much more improvement on the Web directory data set.
Unsupervised feature selection (DF, TS, TC, and En):
- When 90% of the terms are removed, entropy is reduced by 2% and precision is increased by 1%.
- When more terms are removed, the performance of the unsupervised methods drops quickly, while the performance of the supervised methods still improves.