
1 Pseudo-supervised Clustering for Text Documents
Marco Maggini, Leonardo Rigutini, Marco Turchi
Dipartimento di Ingegneria dell'Informazione
Università degli Studi di Siena, Siena, Italy

2 Outline (WI 2004)
  Document representation
  Pseudo-Supervised Clustering
  Evaluation of cluster quality
  Experimental results
  Conclusions

3 Vector Space Model
  Representation with a term-weight vector in the vocabulary space:
    d_i = [w_i,1, w_i,2, w_i,3, ..., w_i,|V|]'
  A commonly used weighting scheme is TF-IDF
  Documents are compared using the cosine correlation
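A minimal Python sketch of this representation (the toy corpus, the whitespace tokenization, and the raw-count TF variant of TF-IDF are illustrative assumptions, not from the slides):

```python
import math
from collections import Counter

# Toy corpus standing in for a real document collection.
docs = [
    "neural networks learn representations",
    "fuzzy control systems",
    "neural networks for control",
]

vocab = sorted({w for d in docs for w in d.split()})
N = len(docs)
# document frequency of each term
df = {t: sum(1 for d in docs if t in d.split()) for t in vocab}

def tfidf(doc):
    """Term-weight vector of a document in the vocabulary space (TF-IDF)."""
    tf = Counter(doc.split())
    return [tf[t] * math.log(N / df[t]) for t in vocab]

def cosine(x, y):
    """Cosine correlation between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

vecs = [tfidf(d) for d in docs]
```

Note how documents sharing no terms get similarity 0 regardless of meaning, which is exactly the "semantic relationships are not considered" limitation discussed next.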

4 Vector Space Model: Limitations
  High dimensionality
  Each term is an independent component in the document representation: the semantic relationships between words are not considered
  Many irrelevant features: feature selection may be difficult, especially for unsupervised tasks
  Vectors are very sparse

5 Vector Space Model: Projection
  Projection to a lower-dimensional space requires the definition of a basis for the projection
  Use of statistical properties of the word-by-document matrix on a given corpus:
    SVD decomposition (Latent Semantic Analysis)
    Concept Matrix Decomposition [Dhillon & Modha, Machine Learning, 2001]
    Data partition + SVD/CMD for each partition: (partially) supervised partitioning

6 Singular Value Decomposition
  SVD of the |V| x |D| word-by-document matrix X (|D| > |V|): X = U Σ V'
  The orthonormal matrix U represents a basis for document representation
  The k columns of U corresponding to the largest singular values in Σ form the basis for the projected space
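The truncation step can be sketched with NumPy as follows (the random matrix is a stand-in for a real term-weight matrix; the sizes and k are illustrative):

```python
import numpy as np

# A |V| x |D| word-by-document matrix with |D| > |V|, as on the slide.
rng = np.random.default_rng(0)
X = rng.random((20, 50))           # |V| = 20 terms, |D| = 50 documents

# numpy returns the singular values in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 5                              # keep the k largest singular values
B = U[:, :k]                       # orthonormal basis for the projected space
Z = B.T @ X                        # k-dimensional document representations
```

Because the columns of U are orthonormal, the projection is a plain matrix product here; the pseudo-SVD variant later in the talk loses this property and needs a least-squares step instead.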

7 Concept Matrix Decomposition-1
  Use a basis which describes a set of k concepts, represented by k reference term distributions (the concept vectors c_1, ..., c_k)
  The projection into the concept space is obtained by solving the least-squares problem min_Z ||X - C Z||_F, where C = [c_1, ..., c_k]

8 Concept Matrix Decomposition-2
  The k concept vectors c_i can be obtained as the normalized centroids of a partition of the document collection D = {D_1, D_2, ..., D_k}
  CMD exploits the prototypes of certain homogeneous sets of documents
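A sketch of these two CMD steps, assuming a toy term-weight matrix and a given partition of the documents (both made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((10, 12))                              # |V| x |D| matrix
partition = [range(0, 4), range(4, 8), range(8, 12)]  # k = 3 document subsets

# concept vectors: normalized centroids of each subset
centroids = [X[:, list(idx)].mean(axis=1) for idx in partition]
C = np.column_stack([c / np.linalg.norm(c) for c in centroids])

# projection into the concept space: least-squares solution of X ~ C Z
Z, *_ = np.linalg.lstsq(C, X, rcond=None)
```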

9 Pseudo-Supervised Clustering
  Selection of the projection basis using a supervised partition of the document set:
    Determine a partition π of a reference subset T of the document corpus
    Select a basis B_i for each set π_i in the partition using SVD/CMD
    Project the documents using the basis B = ∪_i B_i
    Apply a clustering algorithm to the document corpus represented using the basis B
    Optionally iterate, refining the choice of the reference subset

10 Pseudo SVD-1
  The SVD is computed for the documents in each subset π_i in π
  The basis B_i is composed of the v_i left singular vectors in U_i
  The new basis B is represented by the block matrix B = [B_1 B_2 ... B_k]

11 Pseudo SVD-2
  The Pseudo-SVD representation of the word-by-document matrix X of the corpus is the matrix Z* such that X ≈ B Z*
  Since B is not orthogonal in general, the projection requires the solution of a least mean square problem: Z* = arg min_Z ||X - B Z||_F
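The Pseudo-SVD construction on the last two slides can be sketched as follows (the data, the reference partition, and the choice v_i = 3 for every subset are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((15, 30))                                 # |V| x |D| matrix
partition = [range(0, 10), range(10, 20), range(20, 30)]  # reference partition
v = 3                                                     # singular vectors kept per subset

# per-subset SVD bases B_i (top v left singular vectors of each subset)
bases = []
for idx in partition:
    U_i, _, _ = np.linalg.svd(X[:, list(idx)], full_matrices=False)
    bases.append(U_i[:, :v])
B = np.hstack(bases)                                      # B = [B_1 B_2 ... B_k]

# B is not orthogonal across subsets, hence the least mean square projection
Z, *_ = np.linalg.lstsq(B, X, rcond=None)                 # Z* in the slides
```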

12 Pseudo CMD-1
  An orthogonal basis is computed as follows:
    Compute the centroid (concept vector) of each subset π_i in π
    Compute the word cluster for each concept vector:
      A word belongs to the word cluster W_i of subset π_i if its weight in the concept vector c_i is greater than its weights in the other concept vectors
      Each word is assigned to only one subset π_i
    Represent the documents in π_i using only the features in the corresponding word cluster W_i
    Compute the partition of π_i into v_i clusters and compute the word vectors of each centroid
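The word-cluster step above can be sketched like this: each word goes to the single subset whose concept vector gives it the largest weight, so the restricted document matrices have disjoint word supports, which is what makes the resulting directions orthogonal (data and partition are toy values):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((12, 9))                                # |V| x |D| matrix
partition = [range(0, 3), range(3, 6), range(6, 9)]

# concept vector (centroid) of each subset
concepts = np.column_stack([X[:, list(idx)].mean(axis=1) for idx in partition])
word_cluster = concepts.argmax(axis=1)                 # one subset per word

# keep, for the documents of subset i, only the words in word cluster W_i
restricted = []
for i, idx in enumerate(partition):
    X_i = X[:, list(idx)].copy()
    X_i[word_cluster != i, :] = 0.0
    restricted.append(X_i)
```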

13 Pseudo CMD-2
  Each partition π_i is represented by a set of v_i directions obtained from the concept vectors c'_ij
  These vectors are orthogonal since each word belongs to only one c_ij
  Document projection: since the directions are orthogonal, the projection reduces to the inner products of a document with the c'_ij

14 Evaluation of cluster quality
  Experiments on pre-classified documents
  Measure of the dispersion of the classes among the clusters:
    Contingency table: the matrix H whose element h(A_i, C_j) is the number of items with label A_i assigned to the cluster C_j
    Accuracy: "classification using majority voting"
    Conditional Entropy: "confusion" in each cluster
    Human evaluation
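A small sketch of these measures (the labels and cluster assignments below are made-up illustrative values, not the paper's data):

```python
import math
from collections import Counter

labels   = ["a", "a", "a", "b", "b", "c", "c", "c"]   # true topics A_i
clusters = [0, 0, 1, 1, 1, 2, 2, 2]                   # cluster assignments C_j

H = Counter(zip(labels, clusters))    # contingency table h(A_i, C_j)
n = len(labels)
topics = sorted(set(labels))
cluster_ids = sorted(set(clusters))

# accuracy: each cluster is classified by majority voting
accuracy = sum(max(H[(a, c)] for a in topics) for c in cluster_ids) / n

# conditional entropy H(A | C): "confusion" inside each cluster
cond_entropy = 0.0
for c in cluster_ids:
    size = sum(1 for x in clusters if x == c)
    for a in topics:
        p = H[(a, c)] / size
        if p > 0:
            cond_entropy -= (size / n) * p * math.log(p)
```

Low entropy means pure clusters; accuracy rewards clusters dominated by a single topic.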

15 Experimental results-1
  Data preparation:
    Parsing of the PDF files
    Term filtering using the Aspell-0.50.4.1 library
    Removal of the stop words
    Application of the Luhn Reduction to remove common words

16 Experimental results-2
  Data set (conference papers):

  N.   Name                             N. Files
  1    Fuzzy Control                      112
  2    Biologic. Evolutionary Comput.     240
  3    Agent System                       118
  4    Global Brain Models                171
  5    Wavelets Applications               68
  6    Chaotic Systems                     70
  7    Neural Networks                    134
  8    Clustering and Classification       86
  9    Image Analysis and Vision          114
  10   PCM and SVM                        104

17 Experimental results-3
  We applied k-means using three different document representations:
    the original vocabulary basis
    Pseudo-SVD (PSVD)
    Pseudo-CMD (PCMD)
  Each algorithm was applied setting the number of clusters to 10
  For PSVD and PCMD, we varied the number of principal components

18 Experimental results-4

  Method     Entropy   Accuracy   Human evaluated
  PCMD 4     0.7105    0.3743     0.4199
  PCMD 7     0.7084    0.3387     0.4110
  PCMD 10    0.7093    0.2892     0.3358
  PSVD 4     0.6449    0.4719     0.6609
  PSVD 7     0.6229    0.4838     0.7097
  PSVD 10    0.6115    0.4691     0.6874
  K-means    0.6862    0.3895     0.5731

19 Experimental results-5
  [Figure] Topic distribution for the Pseudo-SVD algorithm with v = 7

20 Experimental results-6
  Analyzing the results: low accuracy, high entropy
  Due to: the data set has many transversal topics (e.g., class 5, Wavelets)
  We therefore also evaluated the accuracy using the expert's evaluations

21 Experimental results-7
  [Figure] Human expert's evaluation of cluster accuracy

22 Conclusions
  We have presented two clustering algorithms for text documents which also use a clustering step in the definition of the basis for the document representation
  We can exploit the prior knowledge of a human expert about the data set and bias the feature reduction step towards a more significant representation
  The results show that the PSVD algorithm performs better than both the vocabulary TF-IDF representation and PCMD

23 WI 2004 Pseudo-Supervised Clustering for Text Documents23 Thanks for your attention!!!

24 Appendix: Vector Space Model-2
  Cosine correlation: s(x_i, x_j) = (x_i' x_j) / (||x_i|| ||x_j||)
  Two vectors x_i and x_j are similar if their cosine correlation is close to 1 (i.e., above a given threshold)

25 Appendix: Contingency and Confusion Matrix
  If you associate the cluster C_j to the topic A_m(j) for which C_j has the maximum number of documents, and you rearrange the columns of H such that j' = m(j), you obtain the confusion matrix F_m

26 Appendix: Pseudo CMD-2
  For the word-by-document matrix of each cluster i, we keep only the components related to the words in the word cluster W_i
  We sub-partition each new matrix to obtain more than one direction for each original partition

27 Appendix: Evaluation of cluster quality-2
  Accuracy
  Classification Error
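The accuracy and classification-error formulas were images in the original slides; a plausible reconstruction from the majority-voting definition given earlier (N is the total number of documents) is:

```latex
\mathrm{Accuracy} = \frac{1}{N} \sum_{j} \max_{i} \, h(A_i, C_j),
\qquad
\mathrm{Error} = 1 - \mathrm{Accuracy}
```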

28 Appendix: Evaluation of cluster quality-3
  Conditional Entropy
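The conditional entropy formula was also an image; a standard reconstruction consistent with the contingency table h(A_i, C_j) defined above is:

```latex
H(A \mid C) = -\sum_{j} p(C_j) \sum_{i} p(A_i \mid C_j) \log p(A_i \mid C_j),
\quad \text{where} \quad
p(A_i \mid C_j) = \frac{h(A_i, C_j)}{\sum_{i'} h(A_{i'}, C_j)},
\qquad
p(C_j) = \frac{1}{N} \sum_{i} h(A_i, C_j)
```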


