1
A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining
Farial Shahnaz
2
Topics
Introduction
Algorithm
Performance
Observations
Conclusion and Future Work
3
Introduction
4
Basic Concepts
Text Mining: detection of trends or patterns in text data
Clustering: grouping or classifying documents based on similarity of content
5
Clustering
Manual vs. Automated
Supervised vs. Unsupervised
Hierarchical vs. Partitional
6
Clustering
Objective: automated, unsupervised, partitional clustering of text data (documents)
Method: Nonnegative Matrix Factorization (NMF)
7
Vector Space Model of Text Data
Documents represented as n-dimensional vectors
–n: number of terms in the dictionary
–vector component: importance of the corresponding term
Document collection represented as a term-by-document matrix
8
Term-by-Document Matrix
Terms in the dictionary, n = 9: (a, brown, dog, fox, jumped, lazy, over, quick, the)
Document 1: "a quick brown fox"
Document 2: "jumped over the lazy dog"
9
Term-by-Document Matrix

          Doc 1   Doc 2
a           1       0
brown       1       0
dog         0       1
fox         1       0
jumped      0       1
lazy        0       1
over        0       1
quick       1       0
the         0       1
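As a minimal sketch (not from the slides) of how such a matrix can be built, the NumPy snippet below constructs the 9×2 term-by-document matrix for the two example documents; the alphabetical term ordering and raw term counts are assumptions:

```python
import numpy as np

docs = ["a quick brown fox", "jumped over the lazy dog"]

# Dictionary: the unique terms across all documents, in alphabetical order
terms = sorted({word for doc in docs for word in doc.split()})

# V[i, j] = number of occurrences of term i in document j
V = np.array([[doc.split().count(term) for doc in docs] for term in terms])

print(terms)  # ['a', 'brown', 'dog', 'fox', 'jumped', 'lazy', 'over', 'quick', 'the']
print(V)      # 9x2 nonnegative matrix; column j is the vector for document j
```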
10
Clustering Method: NMF
Low-rank approximation of large sparse matrices
Preserves data nonnegativity
Introduces the concept of parts-based representation (Lee and Seung, Nature, 1999)
11
Other Methods
Other rank-reduction methods:
–Principal Component Analysis (PCA)
–Vector Quantization (VQ)
These produce basis vectors with negative entries
Additive and subtractive combinations of basis vectors yield the original document vectors
12
NMF
Produces nonnegative basis vectors
Additive combinations of basis vectors yield the original document vectors
13
Term-by-Document Matrix (all entries nonnegative)
14
NMF
Basis vectors interpreted as semantic features or topics
Documents clustered on the basis of shared features (see the sketch below)
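One common way to realize this clustering step (following the convention used by Xu et al. for NMF-based clustering; the slides do not spell out the exact rule) is to assign each document to its dominant topic, i.e., the row of H with the largest weight in that document's column:

```python
import numpy as np

def assign_clusters(H):
    """Assign document j to the topic (row of H) with the largest weight
    in column j. H is the k x n coefficient matrix from V ~ W @ H."""
    return np.argmax(H, axis=0)

# Example: 2 topics, 3 documents
H = np.array([[0.9, 0.1, 0.5],
              [0.2, 0.8, 0.4]])
print(assign_clusters(H))  # [0 1 0]
```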
15
NMF
Demonstrated by Xu et al. (2003):
–Outperforms Singular Value Decomposition (SVD)
–Comparable to graph-partitioning methods
16
Algorithm
17
NMF: Definition
Given:
–S: document collection
–V_{m×n}: term-by-document matrix
–m: number of terms in the dictionary
–n: number of documents in S
18
NMF: Definition
NMF is defined as a low-rank approximation of V_{m×n} in terms of some metric
Factor V into the product WH:
–W_{m×k}: contains the basis vectors
–H_{k×n}: contains the linear combinations (coefficients)
–k: selected number of topics or basis vectors, k ≪ min(m, n)
19
NMF: Common Approach
Minimize the objective function:

min_{W ≥ 0, H ≥ 0} ‖V − WH‖_F²

where ‖·‖_F denotes the Frobenius norm.
20
NMF: Existing Methods
Multiplicative Method (MM) [by Lee and Seung]
–Based on multiplicative update rules
–‖V − WH‖ is monotonically nonincreasing, and constant if and only if W and H are at a stationary point
–A version of the Gradient Descent (GD) optimization scheme (see the sketch below)
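As a concrete illustration of the MM scheme, here is a minimal NumPy sketch of the Lee and Seung multiplicative updates; the update rules are the standard published ones, but the random initialization, iteration count, and the small eps guard against division by zero are implementation assumptions, not details from the slides:

```python
import numpy as np

def nmf_mm(V, k, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for V ~ W @ H with V, W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H); keeps H nonnegative elementwise
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # W <- W * (V H^T) / (W H H^T); keeps W nonnegative elementwise
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```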
21
NMF: Existing Methods
Sparse Encoding [by Hoyer]
–Based on the study of neural networks
–Enforces statistical sparsity of H
–Minimizes the sum of nonzeros in H
22
NMF: Existing Methods
Sparse Encoding [by Mu, Plemmons, and Santago]
–Similar to Hoyer's method
–Enforces statistical sparsity of H using a regularization parameter
–Minimizes the number of nonzeros in H
23
NMF: Proposed Algorithm
Hybrid method, called GD-CLS:
–W approximated using the Multiplicative Method
–H calculated using a Constrained Least Squares (CLS) model as the metric
–Penalizes the number of nonzeros in H
–Similar to the method by Mu, Plemmons, and Santago
24
GD-CLS
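The algorithm listing for this slide did not survive extraction. Below is a hedged NumPy sketch of GD-CLS as the previous slide describes it: W is updated with the multiplicative rule, and H is obtained by solving the regularized normal equations (WᵀW + λI)H = WᵀV, then zeroing negative entries to enforce nonnegativity. The default λ, iteration count, eps guard, and initialization are assumptions, not the author's exact code:

```python
import numpy as np

def gd_cls(V, k, lam=0.01, iters=100, eps=1e-9, seed=0):
    """Sketch of GD-CLS: gradient-descent (multiplicative) update for W,
    constrained least squares for H with penalty term lam * ||H_j||^2."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        # CLS step: solve (W^T W + lam*I) H = W^T V for all columns at once,
        # then enforce nonnegativity by zeroing negative entries of H.
        H = np.linalg.solve(W.T @ W + lam * np.eye(k), W.T @ V)
        H[H < 0] = 0.0
        # MM step: multiplicative update keeps W nonnegative.
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```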
25
Performance
26
Text Collections Used
Two benchmark topic-detection text collections:
–Reuters: collection of documents on assorted topics
–TDT2: transcripts from news media
27
Text Collections Used
28
Accuracy Metric (AC)
Defined by:

AC = (1/n) · Σ_{i=1}^{n} δ(d_i)

–d_i: document number i
–δ(d_i) = 1 if the topic labels match, 0 otherwise
Parameter values tested: k = 2, 4, 6, 8, 10, 15, 20; λ = 0.1, 0.01, 0.001
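A minimal sketch of this metric in NumPy; it assumes the computed cluster labels have already been mapped onto the known topic labels (the slide does not say how that matching is done):

```python
import numpy as np

def accuracy(pred_labels, true_labels):
    """AC = (1/n) * sum of delta(d_i), where delta(d_i) is 1 when the
    predicted and known topic labels of document i match, else 0."""
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    return np.mean(pred == true)

print(accuracy([0, 1, 0, 1], [0, 1, 1, 1]))  # 0.75
```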
29
Results for Reuters
Results for TDT2
30
Observations
31
Observations: AC
AC inversely proportional to k
Nature of the collection affects AC
–Reuters: earn, interest, cocoa
–TDT2: Asian economic crisis, Oprah lawsuit
32
Observations: λ Parameter
AC declines as λ increases (most pronounced for homogeneous text collections)
CPU time declines as λ increases
33
Observations: Cluster Size
Imbalance in cluster sizes has an adverse effect on accuracy
34
Conclusion & Future Work
GD-CLS can be used to effectively cluster text data.
Further development involves:
–Smart updating
–Use in bioinformatics
–Developing a user interface
–Converting to C++