Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology
Advisor: Dr. Hsu
Graduate: Yu Cheng Chen
Authors: Wei Xu, Xin Liu, Yihong Gong
Document Clustering Based on Non-negative Matrix Factorization
ACM SIGIR, 2003
Outline
- Motivation
- Objective
- Introduction
- The Proposed Method
- Performance Evaluations
- Conclusions
- Personal Opinion
- Review
Motivation
Traditional clustering methods make harsh simplifying assumptions about the distribution of the document corpus to be clustered. Other research performs document clustering using latent semantic indexing (LSI) or spectral clustering based on graph partitioning theory. Each of these approaches has drawbacks.
Objective
Propose a novel document clustering method, based on non-negative matrix factorization (NMF) of the term-document matrix, that addresses the drawbacks above.
Introduction
The Proposed Method
Document Representation
Let $W = \{f_1, f_2, \ldots, f_m\}$ be the complete vocabulary set of the document corpus. The term-frequency vector $X_i$ of document $d_i$ is defined as
$X_i = [x_{1i}, x_{2i}, \ldots, x_{mi}]^T$, $x_{ji} = t_{ji} \cdot \log(n / idf_j)$,
where $t_{ji}$ denotes the term frequency of word $f_j$ in document $d_i$, $idf_j$ denotes the number of documents containing word $f_j$, and $n$ denotes the total number of documents in the corpus.
Using $X_i$ as the $i$-th column, we construct the $m \times n$ term-document matrix $X$. This matrix is the input to the non-negative factorization.
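The construction of the term-document matrix can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes whitespace tokenization, the weighting tf * log(n / df) (with df the number of documents containing the word), and unit-length column normalization. The function name `term_document_matrix` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, X), where X is an m x n list of lists.

    X[j][i] is the weight of word j in document i, computed as
    tf * log(n / df). Each column is scaled to unit Euclidean length
    (an assumption of this sketch).
    """
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for doc in tokenized for word in doc})
    # df[word] = number of documents that contain the word at least once
    df = Counter(word for doc in tokenized for word in set(doc))

    columns = []
    for doc in tokenized:
        tf = Counter(doc)  # a missing word counts as 0
        col = [tf[word] * math.log(n / df[word]) for word in vocab]
        norm = math.sqrt(sum(x * x for x in col))
        columns.append([x / norm for x in col] if norm > 0 else col)

    # Transpose so that rows are terms and columns are documents.
    X = [list(row) for row in zip(*columns)]
    return vocab, X


docs = ["apple banana apple", "banana cherry", "cherry cherry date"]
vocab, X = term_document_matrix(docs)
print(vocab)
for row in X:
    print([round(x, 3) for x in row])
```

All entries of X are non-negative by construction, which is what the factorization step requires.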
Document Clustering Based on NMF
NMF is a matrix factorization algorithm that finds a non-negative factorization of a given non-negative matrix. Here the goal of NMF is to factorize $X$ into a non-negative $m \times k$ matrix $U$ and a non-negative $k \times n$ matrix $V^T$ that minimize the following objective function:
$J = \frac{1}{2} \| X - U V^T \|$,
where $\| \cdot \|$ denotes the squared sum of all the elements in the matrix.
This is a typical constrained optimization problem, and it can be solved using the Lagrange multiplier method. Let $U = [u_{ij}]$, $V = [v_{ij}]$.
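The update formulas on this slide did not survive extraction. A common way to solve this problem is the multiplicative update scheme for the squared-error objective, sketched below with NumPy. The sketch assumes the objective is half the sum of squared entries of X - U V^T; the names `nmf` and `objective`, the random initialization, the iteration count, and the small `eps` guard are choices of this example, not taken from the slides.

```python
import numpy as np


def objective(X, U, V):
    """Half the sum of squared entries of X - U V^T."""
    return 0.5 * np.linalg.norm(X - U @ V.T, "fro") ** 2


def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative m x n matrix X as U V^T.

    U is m x k and V is n x k; both stay non-negative because every
    update multiplies by a ratio of non-negative quantities.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + eps
    V = rng.random((n, k)) + eps
    for _ in range(n_iter):
        # u_ij <- u_ij * (X V)_ij / (U V^T V)_ij
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        # v_ij <- v_ij * (X^T U)_ij / (V U^T U)_ij
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)
    return U, V


# Toy matrix with two blocks of co-occurring terms and documents.
X = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.8, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.7, 1.0],
])
U, V = nmf(X, k=2)
print("objective:", objective(X, U, V))
```

The result depends on the random initialization, so in practice one would typically run several seeds and keep the factorization with the lowest objective.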
Note that the solution that minimizes the criterion function $J$ is not unique. If $U$ and $V$ are a solution, then $UD$ and $VD^{-1}$ also form a solution for any positive diagonal matrix $D$. To make the solution unique, we further require that each column vector of $U$ have Euclidean length one.
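The normalization can be sketched as below: dividing each column of U by its Euclidean length and multiplying the matching column of V by the same length corresponds to choosing D as the diagonal matrix of inverse column lengths, so the product U V^T is unchanged. The function name `normalize_factors` and the example matrices are hypothetical, and the sketch assumes no column of U is all zeros.

```python
import numpy as np


def normalize_factors(U, V):
    """Rescale so every column of U has unit Euclidean length.

    The matching column of V absorbs the scale, so U @ V.T is preserved.
    Assumes no column of U is entirely zero.
    """
    norms = np.linalg.norm(U, axis=0)  # one length per column of U
    return U / norms, V * norms


U = np.array([[3.0, 0.0],
              [4.0, 2.0]])
V = np.array([[1.0, 2.0],
              [0.5, 1.0],
              [2.0, 0.0]])
U_n, V_n = normalize_factors(U, V)
print(np.linalg.norm(U_n, axis=0))         # column lengths of the rescaled U
print(np.allclose(U_n @ V_n.T, U @ V.T))   # product is preserved
```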
Each element $u_{ij}$ of matrix $U$ represents the degree to which term $f_i$ belongs to cluster $j$.
Each element $v_{ij}$ of matrix $V$ indicates the degree to which document $i$ is associated with cluster $j$.
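Given this interpretation, a natural way to read cluster labels off V is to assign each document to the cluster with the largest entry in its row. The sketch below assumes that rule; the function name `assign_clusters` and the example values are hypothetical.

```python
def assign_clusters(V):
    """Return, for each row of V, the index of its largest entry.

    Row i of V holds the association of document i with each cluster.
    Ties go to the lowest cluster index.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in V]


# Three documents, two clusters (made-up association weights).
V = [
    [0.9, 0.1],
    [0.2, 0.7],
    [0.6, 0.3],
]
labels = assign_clusters(V)
print(labels)  # [0, 1, 0]
```

No separate clustering step such as k-means is needed in this sketch, since the labels come straight from the factor matrix.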
The algorithm is composed of the following steps:
NMF vs. SVD
Performance Evaluation
Data Corpora
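Evaluation Metrics: Accuracy (AC) vs. Mutual Information (MI)
$AC = \frac{\sum_{i=1}^{n} \delta(\alpha_i, map(l_i))}{n}$,
where $n$ denotes the total number of documents, $l_i$ and $\alpha_i$ are the cluster label obtained for document $i$ and the label provided by the document corpus, $\delta(x, y)$ equals one if $x = y$ and zero otherwise, and $map(l_i)$ maps each cluster label to the equivalent corpus label.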
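Accuracy requires matching cluster labels to corpus labels before counting agreements. The sketch below assumes accuracy is the fraction of documents whose mapped cluster label equals the corpus label, under the best one-to-one mapping. It finds that mapping by brute force over permutations, which is only practical for a handful of clusters; an assignment algorithm would be used for larger problems. The function name `clustering_accuracy` is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_labels):
    """Fraction of documents labeled correctly under the best
    one-to-one mapping from cluster labels to corpus labels."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label lists must have the same length")
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")

    best_hits = 0
    # Try every injective mapping cluster -> class and keep the best one.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[l] == a
                   for l, a in zip(cluster_labels, true_labels))
        best_hits = max(best_hits, hits)
    return best_hits / len(true_labels)


true_labels = ["a", "a", "b", "b", "c"]
cluster_labels = [1, 1, 0, 0, 0]
print(clustering_accuracy(true_labels, cluster_labels))  # 0.8
```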
Mutual information MI(C,C’) takes values between 0 and max(H(C),H(C’)) where H(C) and H(C’) are the entropies of C and C’.
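Because of this range, dividing the mutual information by max(H(C), H(C')) gives a score between 0 and 1 that is comparable across corpora. The sketch below assumes that normalization, base-2 logarithms, and a score of 0 when both entropies are zero; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in bits) between two labelings of the same items."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    # p(x, y) * log2( p(x, y) / (p(x) p(y)) ), summed over observed pairs
    return sum((c / n) * math.log2(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in count_ab.items())


def normalized_mi(a, b):
    """Mutual information divided by the larger of the two entropies."""
    denom = max(entropy(a), entropy(b))
    return mutual_information(a, b) / denom if denom > 0 else 0.0


same_partition = normalized_mi([0, 0, 1, 1], ["x", "x", "y", "y"])
independent = normalized_mi([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, independent)
```

Unlike accuracy, this score needs no mapping between cluster labels and corpus labels, since it depends only on how the two partitions overlap.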
Performance Evaluations and Comparisons
Conclusions
The main benefits of our algorithm are:
1. Each axis in the space derived by NMF corresponds much more directly to a document cluster than the axes of the space derived by SVD.
2. Document clustering results can be derived directly, without additional clustering operations.
3. Document clustering accuracy is higher than that of the other document clustering methods compared.
Personal Opinion ……
Review