1
A Clustering Method Based on Nonnegative Matrix Factorization for Text Mining
Farial Shahnaz
2
Topics
Introduction
Algorithm
Performance
Observations
Conclusion and Future Work
3
Introduction
4
Basic Concepts
Text Mining: detection of trends or patterns in text data
Clustering: grouping or classifying documents based on similarity of content
5
Clustering
Manual vs. Automated
Supervised vs. Unsupervised
Hierarchical vs. Partitional
6
Clustering
Objective: automated, unsupervised, partitional clustering of text data (documents)
Method: Nonnegative Matrix Factorization (NMF)
7
Vector Space Model of Text Data
Documents represented as n-dimensional vectors
–n: number of terms in the dictionary
–vector component: importance of the corresponding term
Document collection represented as a term-by-document matrix
8
Term-by-Document Matrix
Terms in the dictionary, n = 9: (a, brown, dog, fox, jumped, lazy, over, quick, the)
Document 1: "a quick brown fox"
Document 2: "jumped over the lazy dog"
9
Term-by-Document Matrix

          Doc 1   Doc 2
a           1       0
brown       1       0
dog         0       1
fox         1       0
jumped      0       1
lazy        0       1
over        0       1
quick       1       0
the         0       1
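As a minimal sketch (not from the slides) of how such a matrix can be built, the NumPy snippet below constructs the 9×2 term-by-document matrix for the two example documents; the alphabetical term ordering and raw term counts are assumptions:

```python
import numpy as np

docs = ["a quick brown fox", "jumped over the lazy dog"]

# Dictionary: the unique terms across all documents, in alphabetical order
terms = sorted({word for doc in docs for word in doc.split()})

# V[i, j] = number of occurrences of term i in document j
V = np.array([[doc.split().count(term) for doc in docs] for term in terms])

print(terms)  # ['a', 'brown', 'dog', 'fox', 'jumped', 'lazy', 'over', 'quick', 'the']
print(V)      # 9x2 nonnegative matrix; column j is the vector for document j
```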
10
Clustering Method: NMF
Low-rank approximation of large sparse matrices
Preserves data nonnegativity
Introduces the concept of parts-based representation (Lee and Seung, Nature, 1999)
11
Other Methods
Other rank-reduction methods:
–Principal Component Analysis (PCA)
–Vector Quantization (VQ)
These produce basis vectors with negative entries
Additive and subtractive combinations of basis vectors yield the original document vectors
12
NMF
Produces nonnegative basis vectors
Additive combinations of basis vectors yield the original document vectors
13
Term-by-Document Matrix (all entries nonnegative)
14
NMF
Basis vectors interpreted as semantic features or topics
Documents clustered on the basis of shared features (see the sketch below)
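One common way to realize this clustering step (following the convention used by Xu et al. for NMF-based clustering; the slides do not spell out the exact rule) is to assign each document to its dominant topic, i.e., the row of H with the largest weight in that document's column:

```python
import numpy as np

def assign_clusters(H):
    """Assign document j to the topic (row of H) with the largest weight
    in column j. H is the k x n coefficient matrix from V ~ W @ H."""
    return np.argmax(H, axis=0)

# Example: 2 topics, 3 documents
H = np.array([[0.9, 0.1, 0.5],
              [0.2, 0.8, 0.4]])
print(assign_clusters(H))  # [0 1 0]
```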
15
NMF
Demonstrated by Xu et al. (2003):
–Outperforms Singular Value Decomposition (SVD)
–Comparable to graph-partitioning methods
16
Algorithm
17
NMF: Definition
Given:
–S: document collection
–V_{m×n}: term-by-document matrix
–m: number of terms in the dictionary
–n: number of documents in S
18
NMF: Definition
NMF is defined as a low-rank approximation of V_{m×n} in terms of some metric
Factor V into the product WH:
–W_{m×k}: contains the basis vectors
–H_{k×n}: contains the linear combinations (coefficients)
–k: selected number of topics or basis vectors, k ≪ min(m, n)
19
NMF: Common Approach
Minimize the objective function:

min_{W ≥ 0, H ≥ 0} ‖V − WH‖_F²

where ‖·‖_F denotes the Frobenius norm.
20
NMF: Existing Methods
Multiplicative Method (MM) [by Lee and Seung]
–Based on multiplicative update rules
–‖V − WH‖ is monotonically nonincreasing, and constant if and only if W and H are at a stationary point
–A version of the Gradient Descent (GD) optimization scheme (see the sketch below)
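As a concrete illustration of the MM scheme, here is a minimal NumPy sketch of the Lee and Seung multiplicative updates; the update rules are the standard published ones, but the random initialization, iteration count, and the small eps guard against division by zero are implementation assumptions, not details from the slides:

```python
import numpy as np

def nmf_mm(V, k, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for V ~ W @ H with V, W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H); keeps H nonnegative elementwise
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # W <- W * (V H^T) / (W H H^T); keeps W nonnegative elementwise
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```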
21
NMF: Existing Methods
Sparse Encoding [by Hoyer]
–Based on the study of neural networks
–Enforces statistical sparsity of H
–Minimizes the sum of nonzeros in H
22
NMF: Existing Methods
Sparse Encoding [by Mu, Plemmons, and Santago]
–Similar to Hoyer's method
–Enforces statistical sparsity of H using a regularization parameter
–Minimizes the number of nonzeros in H
23
NMF: Proposed Algorithm
Hybrid method, called GD-CLS:
–W approximated using the Multiplicative Method
–H calculated using a Constrained Least Squares (CLS) model as the metric
–Penalizes the number of nonzeros in H
–Similar to the method by Mu, Plemmons, and Santago
24
GD-CLS
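The algorithm listing for this slide did not survive extraction. Below is a hedged NumPy sketch of GD-CLS as the previous slide describes it: W is updated with the multiplicative rule, and H is obtained by solving the regularized normal equations (WᵀW + λI)H = WᵀV, then zeroing negative entries to enforce nonnegativity. The default λ, iteration count, eps guard, and initialization are assumptions, not the author's exact code:

```python
import numpy as np

def gd_cls(V, k, lam=0.01, iters=100, eps=1e-9, seed=0):
    """Sketch of GD-CLS: gradient-descent (multiplicative) update for W,
    constrained least squares for H with penalty term lam * ||H_j||^2."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        # CLS step: solve (W^T W + lam*I) H = W^T V for all columns at once,
        # then enforce nonnegativity by zeroing negative entries of H.
        H = np.linalg.solve(W.T @ W + lam * np.eye(k), W.T @ V)
        H[H < 0] = 0.0
        # MM step: multiplicative update keeps W nonnegative.
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```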
25
Performance
26
Text Collections Used
Two benchmark topic-detection text collections:
–Reuters: collection of documents on assorted topics
–TDT2: transcripts from news media
27
Text Collections Used
28
Accuracy Metric (AC)
Defined by:

AC = (1/n) · Σ_{i=1}^{n} δ(d_i)

–d_i: document number i
–δ(d_i) = 1 if the topic labels match, 0 otherwise
Parameter values tested: k = 2, 4, 6, 8, 10, 15, 20; λ = 0.1, 0.01, 0.001
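A minimal sketch of this metric in NumPy; it assumes the computed cluster labels have already been mapped onto the known topic labels (the slide does not say how that matching is done):

```python
import numpy as np

def accuracy(pred_labels, true_labels):
    """AC = (1/n) * sum of delta(d_i), where delta(d_i) is 1 when the
    predicted and known topic labels of document i match, else 0."""
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    return np.mean(pred == true)

print(accuracy([0, 1, 0, 1], [0, 1, 1, 1]))  # 0.75
```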
29
Results for Reuters
Results for TDT2
30
Observations
31
Observations: AC
AC inversely proportional to k
Nature of the collection affects AC
–Reuters: earn, interest, cocoa
–TDT2: Asian economic crisis, Oprah lawsuit
32
Observations: λ Parameter
AC declines as λ increases (most pronounced for homogeneous text collections)
CPU time declines as λ increases
33
Observations: Cluster Size
Imbalance in cluster sizes has an adverse effect on accuracy
34
Conclusion & Future Work
GD-CLS can be used to effectively cluster text data.
Further development involves:
–Smart updating
–Use in bioinformatics
–Developing a user interface
–Converting to C++