Pseudo-supervised Clustering for Text Documents
Marco Maggini, Leonardo Rigutini, Marco Turchi
Dipartimento di Ingegneria dell'Informazione
Università degli Studi di Siena
Siena - Italy
WI 2004

Outline
- Document representation
- Pseudo-Supervised Clustering
- Evaluation of cluster quality
- Experimental results
- Conclusions
Vector Space Model
- Representation with a term-weight vector in the vocabulary space:
  d_i = [w_i,1, w_i,2, w_i,3, ..., w_i,|V|]'
- A commonly used weighting scheme is TF-IDF
- Documents are compared using the cosine correlation
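The weighting and comparison above can be sketched in a few lines. This is a minimal illustration assuming the plain tf * log(N/df) variant of TF-IDF (the slides do not fix an exact weighting formula):

```python
import math

def tf_idf(docs):
    """docs: list of token lists -> list of sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    df = {}                                   # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        tf = {}                               # raw term frequency
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        vectors.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine correlation between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Because the vectors are sparse dicts, terms absent from both documents contribute nothing, which mirrors the sparsity noted on the next slide.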
Vector Space Model: Limitations
- High dimensionality
- Each term is an independent component in the document representation: the semantic relationships between words are not considered
- Many irrelevant features: feature selection may be difficult, especially for unsupervised tasks
- Vectors are very sparse
Vector Space Model: Projection
- Projection to a lower dimensional space requires the definition of a basis for the projection
- Use of statistical properties of the word-by-document matrix on a given corpus:
  - SVD decomposition (Latent Semantic Analysis)
  - Concept Matrix Decomposition [Dhillon & Modha, Machine Learning, 2001]
- Data partition + SVD/CMD for each partition: (partially) supervised partitioning
Singular Value Decomposition
- SVD of the |V| x |D| word-by-document matrix A (with |D| > |V|):
  A = U Σ V'
- The orthonormal matrix U represents a basis for document representation
- The k columns of U corresponding to the largest singular values in Σ form the basis for the projected space
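A minimal numpy sketch of the truncated-SVD projection described above (the matrix sizes are illustrative):

```python
import numpy as np

# Toy |V| x |D| word-by-document matrix.
rng = np.random.default_rng(0)
A = rng.random((6, 10))

# SVD: A = U @ diag(s) @ Vt; the columns of U are orthonormal and the
# singular values in s are returned in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k left singular vectors with the largest singular values.
k = 3
U_k = U[:, :k]

# Project the documents: each column of A becomes a k-dimensional vector.
Z = U_k.T @ A
```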
Concept Matrix Decomposition-1
- Use a basis which describes a set of k concepts, represented by k reference term distributions (the columns of the concept matrix C = [c_1, ..., c_k])
- The projection into the concept space is obtained by solving the least-squares problem
  Z* = argmin_Z ||A - C Z||_F
Concept Matrix Decomposition-2
- The k concept vectors c_i can be obtained as the normalized centroids of a partition of the document collection D: D = {D_1, D_2, ..., D_k}
- CMD exploits the prototypes of certain homogeneous sets of documents
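The construction of the concept vectors as normalized centroids can be sketched as follows; the label vector encoding the partition is a hypothetical input:

```python
import numpy as np

def concept_vectors(A, labels):
    """Normalized centroids of a partition of the columns of A.

    A: |V| x |D| word-by-document matrix; labels[j] is the subset of
    document j. Returns the |V| x k concept matrix C."""
    ids = np.unique(labels)
    C = np.zeros((A.shape[0], len(ids)))
    for j, i in enumerate(ids):
        centroid = A[:, labels == i].mean(axis=1)   # prototype of subset i
        C[:, j] = centroid / np.linalg.norm(centroid)
    return C
```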
Pseudo-Supervised Clustering
Selection of the projection basis using a supervised partition of the document set:
- Determine a partition of a reference subset T of the document corpus
- Select a basis B_i for each set T_i in the partition using SVD/CMD
- Project the documents using the basis B = ∪_i B_i
- Apply a clustering algorithm to the document corpus represented using the basis B
- Optionally iterate, refining the choice of the reference subset
Pseudo SVD-1
- The SVD is computed for the documents in each subset T_i of the partition
- The basis B_i is composed of the first v_i left singular vectors in U_i
- The new basis B is represented by the matrix B = [B_1 B_2 ... B_k]
Pseudo SVD-2
- The Pseudo-SVD representation of the word-by-document matrix A of the corpus is the matrix Z* computed as
  Z* = argmin_Z ||A - B Z||_F
- The projection requires the solution of a least mean squares problem
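Both steps, the per-subset SVD bases and the least-squares projection onto the stacked basis, can be sketched as follows; the partition labels are assumed given:

```python
import numpy as np

def pseudo_svd_basis(A, labels, v):
    """Stack the top-v left singular vectors of each subset's columns."""
    blocks = []
    for j in np.unique(labels):
        U, _, _ = np.linalg.svd(A[:, labels == j], full_matrices=False)
        blocks.append(U[:, :v])
    return np.hstack(blocks)                   # |V| x (k*v)

def project(A, B):
    """Least-squares projection of every document onto span(B)."""
    Z, *_ = np.linalg.lstsq(B, A, rcond=None)  # solves min ||A - B Z||_F
    return Z                                   # (k*v) x |D|
```

`lstsq` solves the least mean squares problem column by column, so no orthogonality across the stacked blocks B_i is required.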
Pseudo CMD-1
An orthogonal basis is computed as follows:
- Compute the centroid (concept vector) c_i of each subset T_i
- Compute the word cluster for each concept vector: a word belongs to the word cluster W_i of subset T_i if its weight in the concept vector c_i is greater than its weights in the other concept vectors; each word is assigned to only one subset
- Represent the documents in T_i using only the features in the corresponding word cluster W_i
- Compute the partition of T_i into v_i clusters and compute the word vectors of each centroid
Pseudo CMD-2
- Each partition T_i is represented by a set of v_i directions obtained from the concept vectors c'_ij
- These vectors are orthogonal since each word belongs to only one c'_ij
- Document projection: since the basis B = [c'_11 ... c'_k,v_k] is orthonormal, the least-squares solution reduces to Z* = B' A
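A sketch of how orthogonal directions arise from the word clusters: each word keeps its weight only in the concept it belongs to, so the resulting columns have disjoint supports and are orthogonal by construction:

```python
import numpy as np

def pseudo_cmd_directions(C):
    """C: |V| x k concept vectors. Assign each word to the concept where
    its weight is largest, zero it elsewhere, and renormalize: the
    resulting columns have disjoint supports, hence are orthogonal."""
    owner = np.argmax(C, axis=1)          # word -> concept it belongs to
    D = np.zeros_like(C)
    for j in range(C.shape[1]):
        mask = owner == j                 # word cluster W_j
        D[mask, j] = C[mask, j]
        n = np.linalg.norm(D[:, j])
        if n > 0:
            D[:, j] /= n
    return D
```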
Evaluation of cluster quality
- Experiments on pre-classified documents: measure of the dispersion of the classes among the clusters
- Contingency table: the matrix H, whose element h(A_i, C_j) is the number of items with label A_i assigned to the cluster C_j
- Accuracy: "classification using majority voting"
- Conditional Entropy: "confusion" in each cluster
- Human evaluation
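Both quantitative measures can be computed directly from the contingency table H; a minimal sketch:

```python
import numpy as np

def accuracy(H):
    """Majority-vote accuracy: each cluster is labeled with its most
    frequent class; H[i, j] = # docs of class A_i in cluster C_j."""
    return H.max(axis=0).sum() / H.sum()

def conditional_entropy(H):
    """H(A|C): average 'confusion' of the class labels inside clusters."""
    N = H.sum()
    ent = 0.0
    for j in range(H.shape[1]):
        col = H[:, j]
        n_j = col.sum()
        p = col[col > 0] / n_j            # p(A_i | C_j), skipping zeros
        ent += (n_j / N) * -(p * np.log2(p)).sum()
    return ent
```

A perfect clustering gives accuracy 1 and entropy 0; a cluster that mixes classes uniformly contributes log2 of the number of classes.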
Experimental results-1
Data preparation:
- Parsing of the PDF files
- Term filtering using the Aspell library
- Removal of the stop words
- Application of the Luhn reduction to remove common words
Experimental results-2
Data set (conference papers):

N.   Name                            N. Files
1    Fuzzy Control                   112
2    Biologic. Evolutionary Comput.  240
3    Agent Systems                   118
4    Global Brain Models             171
5    Wavelets Applications           68
6    Chaotic Systems                 70
7    Neural Networks                 134
8    Clustering and Classification   86
9    Image Analysis and Vision       114
10   PCM and SVM                     104
Experimental results-3
We applied k-means using three different document representations:
- original vocabulary basis
- Pseudo-SVD (PSVD)
- Pseudo-CMD (PCMD)
Each algorithm was applied setting the number of clusters to 10. For PSVD and PCMD, we varied the number of principal components.
Experimental results-4

Methods   Entropy   Accuracy   Human evaluated
PCMD
PCMD
PCMD
PSVD
PSVD
PSVD
K-means
Experimental results-5
[Figure: topic distribution for the Pseudo-SVD algorithm with v=7]
Experimental results-6
Analyzing the results: low accuracy and high entropy. This is due to the fact that the data set has many transversal topics (e.g. class 5, Wavelets). We have therefore also evaluated the accuracy using the expert's evaluations.
Experimental results-7
[Figure: human expert's evaluation of cluster accuracy]
Conclusions
- We have presented two clustering algorithms for text documents which also use a clustering step in the definition of the basis for the document representation
- We can exploit the prior knowledge of a human expert about the data set and bias the feature reduction step towards a more significant representation
- The results show that the PSVD algorithm performs better than both the vocabulary TF-IDF representation and PCMD
Thanks for your attention!
Appendix: Vector Space Model-2
Cosine correlation:
cos(x_i, x_j) = (x_i' x_j) / (||x_i|| ||x_j||)
Two vectors x_i and x_j are similar if their cosine correlation exceeds a given threshold.
Appendix: Contingency and Confusion Matrix
If you associate the cluster C_j with the topic A_m(j) for which C_j has the maximum number of documents, and you rearrange the columns of H such that j' = m(j), you obtain the confusion matrix F_m.
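The rearrangement can be sketched as follows. When two clusters share the same majority topic the mapping m(j) is not a permutation, so this sketch accumulates their columns rather than assuming a bijection:

```python
import numpy as np

def confusion_matrix(H):
    """Move each cluster column of H to the column of the class it
    contains most of (majority voting); clusters mapped to the same
    class are merged."""
    m = H.argmax(axis=0)          # m[j]: majority class of cluster C_j
    F = np.zeros_like(H)
    for j, mj in enumerate(m):
        F[:, mj] += H[:, j]
    return F
```

After the rearrangement the diagonal of F holds the correctly "classified" documents, so its trace equals the majority-vote counts.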
Appendix: Pseudo CMD-2
- For each word-by-document matrix of cluster T_i, we keep only the components related to the words in the word cluster W_i
- We sub-partition each new matrix to obtain more than one direction for each original partition
Appendix: Evaluation of cluster quality-2
Accuracy (N is the total number of documents):
A = (1/N) Σ_j max_i h(A_i, C_j)
Classification error:
E = 1 - A
Appendix: Evaluation of cluster quality-3
Conditional entropy:
H(A|C) = Σ_j p(C_j) H(A|C_j)
where
H(A|C_j) = -Σ_i p(A_i|C_j) log p(A_i|C_j),
p(A_i|C_j) = h(A_i, C_j) / Σ_i h(A_i, C_j),
p(C_j) = Σ_i h(A_i, C_j) / N