Special Topics in Text Mining Manuel Montes y Gómez University of Alabama at Birmingham, Spring 2011.

Techniques to improve document clustering

Cluster ensembles

Definition of the task
Clustering ensembles combine multiple partitions, generated by different clustering algorithms, into a single clustering solution.
The goal is to improve the robustness, stability and accuracy of clustering solutions. What does this mean?
It has been observed that a random choice of an ensemble is not as hazardous as a random choice of a single clustering algorithm.

Advantages of cluster ensembles
– Robustness: better average performance across domains and datasets.
– Novelty: finding a combined solution unattainable by any single clustering algorithm.
– Stability: clustering solutions with lower sensitivity to noise, outliers, or sampling variations.
They also have the ability to integrate solutions from multiple distributed sources of data or features (multimodal applications?).

Two main issues
Clustering ensembles are two-stage algorithms: first, they store the results of several clustering algorithms; then, they use a specific consensus function to find a final partition of the data.
Two main issues:
– Design of the individual clustering algorithms so that they can potentially form an accurate ensemble
– Design of different ways to combine the outputs of the individual clustering algorithms

General architecture
Which clustering method(s) to consider? How to combine their outputs?
[Diagram: the raw data is fed to clustering algorithms 1 through N; their partitions are passed to a consensus function that produces the final partition.]

Individual clustering algorithms
Two important factors:
1. Diversity within the ensemble
2. Accuracy of the individual clustering algorithms
To achieve diverse clustering outputs (see the sketch below):
– Using different clustering algorithms
– Changing the initialization or other parameters of a clustering algorithm
– Using different features
– Partitioning different subsets of the original data
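A minimal sketch of how such a diverse ensemble could be generated, assuming scikit-learn and NumPy; the choice of k-means, the feature fraction and the number of members are illustrative, not part of the original slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_ensemble(X, n_clusters=5, n_members=10, feature_fraction=0.7, seed=0):
    """Produce diverse partitions by varying the random seed and the feature subset.
    X is a dense documents-by-features matrix."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    partitions = []
    for m in range(n_members):
        # each member sees a different random subset of the features
        cols = rng.choice(n_features, size=int(feature_fraction * n_features), replace=False)
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=m)
        partitions.append(km.fit_predict(X[:, cols]))
    return partitions  # one label vector per ensemble member
```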

Combination of partitions
A primary design element of a consensus clustering algorithm is the aggregate representation that is constructed for the ensemble of partitions.
Depending on the representation used, there are methods that:
– Induce a similarity-based structure from a given ensemble
– View the ensemble partitions in a categorical feature-space representation
– Derive a voting-based aggregated partition, which is a soft partition, to represent a given ensemble

Co-association matrix
It represents the pairwise similarities between each pair of data objects.
– It corresponds to an undirected graph where nodes represent objects and edges are weighted according to a defined measure of similarity.
In this matrix, the similarities reflect the frequency with which each pair of objects co-occurs in the same cluster throughout the ensemble.
– It is the average of the adjacency matrices of the individual clustering outputs.
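A short sketch of the construction described above (NumPy assumed); the ensemble is simply the list of label vectors produced by the individual clusterings.

```python
import numpy as np

def co_association(partitions):
    """Average of the co-membership (adjacency) matrices of the individual clusterings."""
    n = len(partitions[0])
    M = np.zeros((n, n))
    for labels in partitions:
        labels = np.asarray(labels)
        # 1 where two objects fall in the same cluster, 0 otherwise
        M += (labels[:, None] == labels[None, :]).astype(float)
    return M / len(partitions)
```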

Example of the co-association matrix
How do we build the final partition based on the co-association matrix?

Using the co-association matrix
The co-association matrix (M) has been used in a number of different ways to extract a consensus partition:
– As a similarity matrix: it can be used directly by any clustering algorithm; if distances are necessary, then use 1 – M.
– As a data representation: each of the N documents is represented by N features; this matrix is used to compute a similarity matrix, which means a second-order computation of similarities.
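A hedged sketch of the first option: the co-association matrix is used as a similarity matrix, turned into distances with 1 – M, and fed to agglomerative clustering. The scikit-learn keyword is `metric="precomputed"` in recent versions (`affinity` in older ones), and the choice of average linkage is an assumption, not something stated in the slides.

```python
from sklearn.cluster import AgglomerativeClustering

def consensus_partition(M, n_clusters=5):
    """Cluster directly on the co-association similarities by using 1 - M as distances."""
    model = AgglomerativeClustering(n_clusters=n_clusters,
                                    metric="precomputed", linkage="average")
    return model.fit_predict(1.0 - M)
```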

Using the co-association matrix (2)
[Diagram contrasting the two pipelines: in one, the N × N co-association matrix is treated as a document-feature matrix, a similarity computation produces an N × N document similarity matrix, and a clustering algorithm is applied to it; in the other, the co-association matrix is used directly as the document similarity matrix for the clustering algorithm.]

Final remarks
Of the two options, using the co-association matrix as data has reported the best results.
Some disadvantages of this approach are:
– It has quadratic complexity in the number of objects.
– There are no established guidelines concerning which clustering algorithm should be applied.
– An ensemble with a small number of clusterings may not provide a reliable estimate of the co-association values.

Final remarks (2)
Document cluster ensembles are useful:
– To handle high dimensionality: aggregate the results obtained from different subsets of features.
– To take advantage of features from different modalities, e.g., thematic and stylistic features, content and link features, etc.
– To achieve multilingual document clustering in parallel or comparable corpora.

Feature selection

Motivation
Most document clustering applications involve large and high-dimensional data:
– Many documents, thousands of features.
Because of the high dimensionality there is also a problem of data sparseness.
The problem for clustering is that the distance between similar documents is not very different from the distance between more dissimilar documents.
How can we solve this problem?

Dimensionality reduction
[Taxonomy diagram: dimensionality reduction divides into feature extraction and feature selection; feature selection divides into supervised and unsupervised approaches, and the unsupervised approaches into filter and wrapper methods.]
– Feature extraction: extracts a set of new features from the original features through some functional mapping, such as principal component analysis or word clustering; the new features do not have a clear meaning.
– Supervised approaches are for applications where class information is available.

Two variables for feature selection
Feature selection is typically carried out considering:
– Feature relevance: the ability to differentiate between instances from different clusters.
– Feature redundancy: the correlation between the values of two features.

Two approaches for feature selection
Filters
– Rank features individually using an evaluation function, and eliminate those with an inadequate score (using some threshold or number of features).
Wrappers
– Assume that the goal of clustering is to optimize some objective function which helps to obtain 'good' clusters, and use this function to estimate the quality of different feature subsets.
– This is an iterative process with high computational complexity.
– They require a search strategy, a clustering algorithm, an evaluation function, and a stop criterion. Some ideas? (a toy wrapper is sketched below)
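A toy wrapper sketch under assumptions not in the slides: greedy forward selection as the search strategy, k-means as the clustering algorithm, silhouette score as the evaluation function, and "no improvement" as the stop criterion (scikit-learn assumed).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def wrapper_forward_selection(X, n_clusters=5, max_features=20):
    """Greedily add the feature that most improves the clustering quality."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -1.0
    while remaining and len(selected) < max_features:
        scores = []
        for f in remaining:
            cols = selected + [f]
            labels = KMeans(n_clusters=n_clusters, n_init=5,
                            random_state=0).fit_predict(X[:, cols])
            scores.append((silhouette_score(X[:, cols], labels), f))
        score, f = max(scores)
        if score <= best_score:
            break  # stop criterion: no improvement over the current subset
        best_score = score
        selected.append(f)
        remaining.remove(f)
    return selected
```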

Filter methods for document clustering
– Document frequency (DF): the simplest criterion; penalizes rare terms.
– Term frequency variance: favors frequent terms; penalizes terms with uniform distributions.
Both can be computed directly from the document-term matrix, as sketched below.
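A minimal computation sketch for the two criteria (NumPy assumed; X is a dense documents-by-terms count matrix). The variance form shown here is proportional to the usual term frequency quality score.

```python
import numpy as np

def document_frequency(X):
    """Number of documents in which each term occurs."""
    return np.count_nonzero(X, axis=0)

def term_frequency_variance(X):
    """Variance of each term's frequency across documents: high for frequent,
    unevenly distributed terms; low for terms with a uniform distribution."""
    return X.var(axis=0)
```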

More filter methods
Term strength
– Indicates the conditional probability that a term occurs in a document j given that it appears in a related document i.
– The similarity threshold that defines when two documents are related is a user-specified value.
– It can be enhanced by considering the probability of a term occurring in two non-similar documents.
(a computation sketch follows below)
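A hedged sketch of the basic estimate, assuming NumPy and a binary document-term matrix; the cosine measure and the default threshold are illustrative choices, not part of the slides.

```python
import numpy as np
from itertools import combinations

def term_strength(X, threshold=0.3):
    """TS(t) = P(t occurs in one document of a related pair | t occurs in the other).
    Two documents are 'related' when their cosine similarity exceeds the threshold."""
    X = (np.asarray(X) > 0).astype(float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    sims = (X / norms) @ (X / norms).T
    occurs = np.zeros(X.shape[1])      # term present in the conditioning document
    co_occurs = np.zeros(X.shape[1])   # term present in both documents of the pair
    for i, j in combinations(range(X.shape[0]), 2):
        if sims[i, j] > threshold:
            occurs += X[i] + X[j]          # count the pair in both directions
            co_occurs += 2 * X[i] * X[j]
    return np.divide(co_occurs, occurs, out=np.zeros_like(occurs), where=occurs > 0)
```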

More filter methods
Entropy-based feature ranking
– Two documents clearly belonging to the same cluster, or clearly to different clusters, contribute less to the total entropy than if they were uniformly separated: low and high similarities are associated with small entropy values.
– The idea is to evaluate the entropy after the removal of each feature, and to eliminate the features that cause more disorder.
(a sketch of the entropy measure follows below)
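A rough sketch of the underlying entropy measure, assuming NumPy and cosine similarities; note that re-evaluating it once per removed feature is expensive, and the interpretation of the scores follows the slide's wording rather than a specific paper.

```python
import numpy as np

def similarity_entropy(X):
    """Entropy of the pairwise similarities: small when pairs are clearly similar or
    clearly dissimilar, large when similarities are spread uniformly."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    S = np.clip((X / norms) @ (X / norms).T, 1e-12, 1 - 1e-12)
    s = S[np.triu_indices_from(S, k=1)]
    return -np.sum(s * np.log(s) + (1 - s) * np.log(1 - s))

def entropy_after_removal(X):
    """Entropy of the data after removing each feature in turn; features whose removal
    leaves the least disorder are the ones that were causing it."""
    return np.array([similarity_entropy(np.delete(X, f, axis=1))
                     for f in range(X.shape[1])])
```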

More filter methods
Term contribution
– Clustering depends on document similarity, and document similarity relies on the terms the documents have in common.
– This measure indicates the contribution of a term to the overall document similarity.
– It is an extension of DF in which terms have different weights.
(a computation sketch follows below)
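A compact sketch of the usual formulation, TC(t) = sum over pairs of distinct documents of w(t, d_i) · w(t, d_j), assuming a dense tf-idf matrix.

```python
import numpy as np

def term_contribution(W):
    """Term contribution for each term; W is a dense documents-by-terms tf-idf matrix."""
    col_sums = W.sum(axis=0)
    # sum over all ordered pairs i != j of w(t, d_i) * w(t, d_j)
    return col_sums ** 2 - (W ** 2).sum(axis=0)
```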

How good are unsupervised filter methods?
Results on news clustering (the Reuters collections and 20 Newsgroups), compared against supervised approaches:
– Very good for reductions smaller than 80%; TS (term strength) showed the best performance.

Clustering short texts

Short text difficulties
– It is difficult to statistically evaluate the relevance of words: most of them have frequencies of 1 or 2.
– The BoW representations of the documents result in very sparse vectors: each document only contains a small subset of the words.
As a consequence, the distance between similar documents is not very different from the distance between more dissimilar documents.

Another difficulty: narrowness
Clustering documents from a narrow domain is difficult since they have a high level of vocabulary overlap.
– Wide domain (e.g., sports, politics, entertainment): not very high intersection in the vocabulary; simple to cluster/classify even if the texts are short.
– Narrow domain (e.g., document clustering, information retrieval, text classification): very high intersection in the vocabulary; complex to cluster/classify, and more complex if the texts are short.

How to deal with these texts?
– How do we know when texts are short?
– How do we know if they are narrow-domain?
– How can we improve the clustering of these documents? A better representation? A different similarity measure? Another clustering algorithm? Ideas?

Shortness evaluation
Are the documents of a given collection short? Three measures (sketched below):
– Arithmetic mean of the document lengths (DL)
– Arithmetic mean of the vocabulary lengths (VL)
– The ratio of the vocabulary average to the document length average (VDR)
Which measure is better? Advantages and disadvantages?
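A minimal computation of the three indicators; the acronyms follow the next slide, and `documents` is assumed to be a list of token lists.

```python
def shortness_measures(documents):
    """Corpus-level shortness indicators: mean document length (DL), mean vocabulary
    length (VL) and their ratio (VDR)."""
    n = len(documents)
    dl = sum(len(doc) for doc in documents) / n
    vl = sum(len(set(doc)) for doc in documents) / n
    return dl, vl, vl / dl
```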

Some results
– DL and VL are better for evaluating the shortness of documents.
– It seems that VDR assesses the complexity of the corpus rather than exactly its shortness.
Pinto, D., Rosso, P., Jiménez-Salazar, H.: A self-enriching methodology for clustering narrow domain short texts. The Computer Journal (2010). DOI 10.1093/comjnl/bxq069

Broadness evaluation
– If the distribution of words (n-grams) is similar across different subsets of the collection, then the collection is narrow-domain.
– In a narrow domain, documents share a greater number of vocabulary terms.
(a rough sketch of this idea follows below)
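A rough illustration of the second observation, not one of the measures (UVB, UMLB) compared on the next slide: repeatedly split the collection at random and check how much vocabulary the halves share.

```python
import random

def vocabulary_overlap(documents, n_splits=10, seed=0):
    """Average Jaccard overlap between the vocabularies of random halves of the
    collection; consistently high values suggest a narrow domain."""
    rng = random.Random(seed)
    overlaps = []
    for _ in range(n_splits):
        docs = documents[:]
        rng.shuffle(docs)
        half = len(docs) // 2
        vocab_a = {w for doc in docs[:half] for w in doc}
        vocab_b = {w for doc in docs[half:] for w in doc}
        overlaps.append(len(vocab_a & vocab_b) / len(vocab_a | vocab_b))
    return sum(overlaps) / len(overlaps)
```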

Some results
– Both measures show a strong agreement with the manual ranking (correlation of 0.56).
– For its simplicity, UVB seems a better option than UMLB.
– Both agree that scientific documents are narrow-domain whereas news reports belong to a wide domain.

Dealing with short texts
The main approach consists in expanding the documents with related information.
The related information has been extracted from:
– The document collection itself
– WordNet (or another thesaurus)
– Wikipedia
How is that done?

Self-term expansion
It consists of replacing the terms of a document with a set of co-related terms.
– Term co-occurrence is measured with the pointwise mutual information (PMI) method.
– To obtain the new documents, each original term is enriched by adding its corresponding set of related terms.
– It is important to notice that this method does not modify the dimensionality of the feature space.
(a sketch of this expansion follows below)
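A hedged sketch of the idea: PMI over document-level co-occurrence in the same collection picks, for every term, its most related terms, and each document is then enriched with them. The cut-offs (`top_k`, `min_count`) are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_related_terms(documents, top_k=3, min_count=2):
    """For each term, its top_k co-related terms by PMI over document co-occurrence."""
    n_docs = len(documents)
    term_df, pair_df = Counter(), Counter()
    for doc in documents:
        terms = set(doc)
        term_df.update(terms)
        pair_df.update(frozenset(p) for p in combinations(sorted(terms), 2))
    related = {t: [] for t in term_df}
    for pair, c in pair_df.items():
        if c < min_count:
            continue
        a, b = tuple(pair)
        pmi = math.log((c / n_docs) / ((term_df[a] / n_docs) * (term_df[b] / n_docs)))
        related[a].append((pmi, b))
        related[b].append((pmi, a))
    return {t: [w for _, w in sorted(lst, reverse=True)[:top_k]]
            for t, lst in related.items()}

def self_term_expansion(documents, related):
    """Enrich each document with the terms related to the terms it already contains."""
    return [doc + [r for t in doc for r in related.get(t, [])] for doc in documents]
```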

Expansion using WordNet
Most clustering methods only relate documents that use identical terminology, while they ignore the conceptual similarity of terms.
The idea of this approach is to represent the documents by synsets rather than by words.
– A synset is a set of synonymous words such as {car, auto, automobile, machine, motorcar}.
It seems a very good idea for reducing the dimensionality and sparseness of the representation, but... Problems? (a sketch of the mapping follows below)
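A minimal mapping sketch using NLTK's WordNet interface (assumes `nltk` is installed and the wordnet data has been downloaded). Taking the first synset of each word is exactly the naive disambiguation whose problems the next slide discusses.

```python
from nltk.corpus import wordnet as wn

def to_synsets(tokens):
    """Replace each token with the name of its most frequent WordNet synset,
    keeping the token itself when WordNet does not know the word."""
    mapped = []
    for tok in tokens:
        synsets = wn.synsets(tok)
        mapped.append(synsets[0].name() if synsets else tok)
    return mapped

# e.g. "car" and "auto" both map to the synset 'car.n.01', so documents using
# different surface words can still be related at the concept level.
```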

Disadvantages of WordNet-based expansion
– Some languages do not have a WordNet or have a very reduced version (clustering in Spanish vs. clustering in English).
– Specific domain knowledge tends to be poorly covered by WordNets (clustering news vs. clustering scientific papers).
– The main problem is achieving a correct word sense disambiguation: should "bank" be represented by the "financial institution" synset or by the "sloping land" synset?

Expansion using Wikipedia
Motivated by the fact that there are Wikipedias in more languages than WordNets.
The method is very simple (a sketch follows below):
– Index all Wikipedia pages using a BoW representation
– Search for the k most similar pages for each short document of the given collection
– Expand the documents using the words of the titles of the k retrieved pages
– Perform some unsupervised feature selection procedure to eliminate irrelevant (rare) words
Problems with this approach?
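A minimal sketch of the pipeline under clearly labeled assumptions: a tf-idf index stands in for the BoW index of Wikipedia pages, and `wiki_titles`/`wiki_texts` are placeholder inputs; the final unsupervised feature selection step (e.g., a document frequency filter from the earlier slides) is omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def expand_with_wikipedia(short_docs, wiki_titles, wiki_texts, k=5):
    """Expand each short document with the title words of its k most similar pages."""
    vectorizer = TfidfVectorizer()
    wiki_matrix = vectorizer.fit_transform(wiki_texts)   # index all Wikipedia pages
    doc_matrix = vectorizer.transform(short_docs)
    sims = cosine_similarity(doc_matrix, wiki_matrix)
    expanded = []
    for i, doc in enumerate(short_docs):
        top = sims[i].argsort()[::-1][:k]                 # k most similar pages
        expanded.append(doc + " " + " ".join(wiki_titles[j] for j in top))
    return expanded
```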