Pseudo-supervised Clustering for Text Documents
Marco Maggini, Leonardo Rigutini, Marco Turchi
Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena, Siena, Italy
(WI 2004)

Slide 2: Outline
• Document representation
• Pseudo-Supervised Clustering
• Evaluation of cluster quality
• Experimental results
• Conclusions

Slide 3: Vector Space Model
• Representation with a term-weight vector in the vocabulary space: d_i = [w_{i,1}, w_{i,2}, w_{i,3}, …, w_{i,|V|}]'
• A commonly used weighting scheme is TF-IDF
• Documents are compared using the cosine correlation
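A minimal sketch of this representation (scikit-learn and the toy corpus are illustrative choices, not part of the original work):

```python
# Sketch: TF-IDF vector space model with cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "fuzzy control of nonlinear systems",
    "clustering text documents with k-means",
    "document clustering in the vector space model",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)   # |D| x |V| sparse TF-IDF matrix

# Cosine correlation between every pair of documents.
S = cosine_similarity(X)
print(S.round(2))
```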

Slide 4: Vector Space Model: Limitations
• High dimensionality
• Each term is an independent component in the document representation
  – the semantic relationships between words are not considered
• Many irrelevant features
  – feature selection may be difficult, especially for unsupervised tasks
• Vectors are very sparse

Slide 5: Vector Space Model: Projection
• Projection to a lower-dimensional space
  – Definition of a basis for the projection
• Use of statistical properties of the word-by-document matrix on a given corpus
  – SVD decomposition (Latent Semantic Analysis)
  – Concept Matrix Decomposition [Dhillon & Modha, Machine Learning, 2001]
• Data partition + SVD/CMD for each partition
  – (Partially) supervised partitioning

Slide 6: Singular Value Decomposition
• SVD of the |V|×|D| word-by-document matrix X (|D| > |V|): X = U Σ V'
• The orthonormal matrix U represents a basis for the document representation
• The k columns of U corresponding to the largest singular values in Σ form the basis for the projected space
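A small numpy sketch of the rank-k projection basis (the random matrix stands in for a real term-document matrix):

```python
# Sketch: rank-k basis from the SVD of a word-by-document matrix X.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 200))                 # |V|=50 terms, |D|=200 documents

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 10
B = U[:, :k]      # basis: left singular vectors of the k largest singular values
Z = B.T @ X       # k x |D| projected document representations
print(Z.shape)
```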

Slide 7: Concept Matrix Decomposition (1)
• Use a basis which describes a set of k concepts, represented by k reference term distributions c_1, …, c_k collected in the concept matrix C
• The projection into the concept space is obtained by solving the least-squares problem Z* = argmin_Z ||X − C Z||_F

Slide 8: Concept Matrix Decomposition (2)
• The k concept vectors c_i can be obtained as the normalized centroids of a partition of the document collection D = {D_1, D_2, …, D_k}
• CMD exploits the prototypes of certain homogeneous sets of documents
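A sketch of CMD under these definitions; the k-means partition and the toy data are illustrative assumptions:

```python
# Sketch: Concept Matrix Decomposition in the spirit of Dhillon & Modha (2001).
# Concept vectors are the normalized centroids of a given document partition.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((50, 200))                       # |V| x |D| term-document matrix

k = 5
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X.T)

# Concept matrix C: one normalized centroid (concept vector) per cluster.
C = np.stack([X[:, labels == i].mean(axis=1) for i in range(k)], axis=1)
C /= np.linalg.norm(C, axis=0)

# Least-squares projection into the concept space: Z* = argmin ||X - C Z||_F.
Z, *_ = np.linalg.lstsq(C, X, rcond=None)
print(Z.shape)                                  # k x |D|
```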

Slide 9: Pseudo-Supervised Clustering
• Selection of the projection basis using a supervised partition of the document set:
  – Determine a partition π of a reference subset T of the document corpus
  – Select a basis B_i for each set π_i in the partition using SVD/CMD
  – Project the documents using the basis B = ∪_i B_i
  – Apply a clustering algorithm to the document corpus represented using the basis B
  – Optionally iterate, refining the choice of the reference subset
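A high-level sketch of the whole loop (the helper per_partition_basis and the SVD variant shown are assumptions; the slides allow either SVD or CMD per partition):

```python
# Sketch of the pseudo-supervised clustering pipeline. Function names and the
# parameter defaults are illustrative, not the authors' exact code.
import numpy as np
from sklearn.cluster import KMeans

def per_partition_basis(X_part, v):
    """Basis for one partition: here the top-v left singular vectors (SVD variant)."""
    U, _, _ = np.linalg.svd(X_part, full_matrices=False)
    return U[:, :v]

def pseudo_supervised_clustering(X, partition_labels, v=7, n_clusters=10):
    # 1. One basis per supervised partition of the reference subset.
    bases = [per_partition_basis(X[:, partition_labels == i], v)
             for i in np.unique(partition_labels)]
    # 2. Stack the partition bases into the global basis B.
    B = np.hstack(bases)
    # 3. Project all documents onto B (least squares, since B need not be orthogonal;
    #    for an orthogonal basis this reduces to B.T @ X).
    Z, *_ = np.linalg.lstsq(B, X, rcond=None)
    # 4. Cluster the projected documents.
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(Z.T)
```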

Slide 10: Pseudo-SVD (1)
• The SVD is computed for the documents in each subset π_i in π
• The basis B_i is composed of the first v_i left singular vectors in U_i
• The new basis B is represented by the matrix B = [B_1 | B_2 | … | B_k]

Slide 11: Pseudo-SVD (2)
• The Pseudo-SVD representation of the word-by-document matrix X of the corpus is the matrix Z* computed as Z* = argmin_Z ||X − B Z||_F
• Since B is not orthogonal in general, the projection requires the solution of a least-mean-square problem

Slide 12: Pseudo-CMD (1)
• An orthogonal basis is computed as follows:
  – Compute the centroid (concept vector) c_i of each subset π_i in π
  – Compute the word cluster for each concept vector:
    · A word belongs to the word cluster W_i of subset π_i if its weight in the concept vector c_i is greater than its weights in the other concept vectors
    · Each word is assigned to only one subset π_i
  – Represent the documents in π_i using only the features in the corresponding word cluster W_i
  – Compute the partition of π_i into v_i clusters and compute the word vectors of each centroid

Slide 13: Pseudo-CMD (2)
• Each partition π_i is represented by a set of v_i directions obtained from the concept vectors c'_{ij}
  – These vectors are orthogonal, since each word belongs to only one c_{ij}
• Document projection: the basis B being orthogonal, the least-squares solution reduces to Z* = B'X
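A sketch of this basis construction; the sub-partitioning via k-means and the exact normalization are illustrative assumptions:

```python
# Sketch of the Pseudo-CMD basis. The word-cluster rule follows the slides
# (each word goes to the concept where its weight is largest); note that in
# this simplified version only directions from different partitions are
# guaranteed orthogonal (disjoint word supports).
import numpy as np
from sklearn.cluster import KMeans

def pseudo_cmd_basis(X, partition_labels, v=3):
    parts = np.unique(partition_labels)
    # Concept vector (normalized centroid) of each supervised subset.
    C = np.stack([X[:, partition_labels == i].mean(axis=1) for i in parts], axis=1)
    C /= np.linalg.norm(C, axis=0)
    # Each word joins the word cluster whose concept weight is largest.
    word_cluster = C.argmax(axis=1)

    directions = []
    for idx, i in enumerate(parts):
        Xi = X[:, partition_labels == i].copy()
        Xi[word_cluster != idx, :] = 0.0     # keep only the features in W_i
        # Sub-partition the subset into v clusters; their centroids give v directions.
        sub = KMeans(n_clusters=v, n_init=10, random_state=0).fit_predict(Xi.T)
        for j in range(v):
            d = Xi[:, sub == j].mean(axis=1)
            directions.append(d / (np.linalg.norm(d) + 1e-12))
    return np.stack(directions, axis=1)      # |V| x (k*v) basis B
```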

Slide 14: Evaluation of cluster quality
• Experiments on pre-classified documents
• Measures of the dispersion of the classes among the clusters:
  – Contingency table: the matrix H whose element h(A_i, C_j) is the number of items with label A_i assigned to cluster C_j
  – Accuracy → "classification using majority voting"
  – Conditional Entropy → "confusion" in each cluster
  – Human evaluation
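A sketch of these measures computed from the contingency matrix (standard forms; the paper's exact normalizations may differ slightly):

```python
# Sketch: cluster-quality measures from the contingency table H, where
# H[i, j] counts items of class A_i assigned to cluster C_j.
import numpy as np

def contingency(labels_true, labels_pred):
    classes, clusters = np.unique(labels_true), np.unique(labels_pred)
    H = np.zeros((len(classes), len(clusters)), dtype=int)
    for t, p in zip(labels_true, labels_pred):
        H[np.searchsorted(classes, t), np.searchsorted(clusters, p)] += 1
    return H

def accuracy(H):
    # Majority voting: each cluster is labeled with its most frequent class.
    return H.max(axis=0).sum() / H.sum()

def conditional_entropy(H):
    # H(A|C): average "confusion" inside each cluster.
    n = H.sum()
    ce = 0.0
    for j in range(H.shape[1]):
        col = H[:, j]
        p = col[col > 0] / col.sum()
        ce += (col.sum() / n) * -(p * np.log2(p)).sum()
    return ce
```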

Slide 15: Experimental results (1)
• Data preparation:
  – Parsing of the PDF files
  – Term filtering using the Aspell library
  – Removal of stop words
  – Application of the Luhn reduction to remove common words

Slide 16: Experimental results (2)
• Data set (conference papers):

  N.   Name                             N. Files
  1    Fuzzy Control                    112
  2    Biologic. Evolutionary Comput.   240
  3    Agent Systems                    118
  4    Global Brain Models              171
  5    Wavelets Applications            68
  6    Chaotic Systems                  70
  7    Neural Networks                  134
  8    Clustering and Classification    86
  9    Image Analysis and Vision        114
  10   PCM and SVM                      104

Slide 17: Experimental results (3)
• We applied k-means using three different document representations:
  – original vocabulary basis
  – Pseudo-SVD (PSVD)
  – Pseudo-CMD (PCMD)
• Each algorithm was applied setting the number of clusters to 10
• For PSVD and PCMD, we varied the number of principal components

Slide 18: Experimental results (4)
• Entropy, Accuracy, and Human-evaluated accuracy for PCMD and PSVD (three settings of the number of components each) and for plain k-means [table values not preserved in this transcript]

Slide 19: Experimental results (5)
• Topic distribution for the Pseudo-SVD algorithm with v = 7 [figure]

Slide 20: Experimental results (6)
• Analyzing the results:
  – low accuracy
  – high entropy
• Due to:
  – the data set has many transversal topics (e.g., class 5, Wavelets Applications, cuts across the other classes)
  – we therefore also evaluated the accuracy using the human expert's judgments

Slide 21: Experimental results (7)
• Human expert's evaluation of cluster accuracy [figure]

Slide 22: Conclusions
• We have presented two clustering algorithms for text documents which use a clustering step also in the definition of the basis for the document representation
• They exploit the prior knowledge of a human expert about the data set to bias the feature-reduction step towards a more significant representation
• The results show that the PSVD algorithm performs better than both the vocabulary TF-IDF representation and PCMD

Slide 23: Thanks for your attention!

Slide 24: Appendix: Vector Space Model (2)
• Cosine correlation: cos(x_i, x_j) = (x_i' x_j) / (‖x_i‖ ‖x_j‖)
• Two vectors x_i and x_j are similar if their cosine correlation is close to 1

Slide 25: Appendix: Contingency and Confusion Matrix
• If you associate each cluster C_j with the topic A_{m(j)} for which C_j contains the maximum number of documents, and rearrange the columns of H so that j' = m(j), you obtain the confusion matrix F_m
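A sketch of this rearrangement, assuming the majority mapping m(j) happens to be a permutation of the topics:

```python
# Sketch: confusion matrix F_m from the contingency matrix H. If several
# clusters share a majority topic, their columns simply end up adjacent.
import numpy as np

def confusion_from_contingency(H):
    m = H.argmax(axis=0)       # m(j): majority topic of cluster C_j
    order = np.argsort(m)      # rearrange columns so that j' = m(j)
    return H[:, order]
```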

Slide 26: Appendix: Pseudo-CMD (2)
• For the word-by-document matrix of each cluster i, we keep only the components related to the words in the word cluster W_i
• We sub-partition each new matrix to obtain more than one direction for each original partition

Slide 27: Appendix: Evaluation of cluster quality (2)
• Accuracy (majority voting): A = (1/N) Σ_j max_i h(A_i, C_j)
• Classification error: E = 1 − A

Slide 28: Appendix: Evaluation of cluster quality (3)
• Conditional Entropy: H(A|C) = Σ_j P(C_j) H(A|C_j)
  where H(A|C_j) = −Σ_i P(A_i|C_j) log P(A_i|C_j) and P(A_i|C_j) = h(A_i, C_j) / Σ_i h(A_i, C_j)

Slide 29: Thanks for your attention!
