1
Concept Decompositions for Large Sparse Text Data Using Clustering
Dhillon, I. S. and Modha, D. S. Machine Learning, 42(1), 2001
Nov. 9, 2001
Summarized by Jeong-Ho Chang
2
Introduction
Study a certain spherical k-means algorithm for clustering document vectors.
Empirically demonstrate that the clusters produced have a certain "fractal-like" and "self-similar" behavior.
Matrix approximation by concept decomposition: explore intimate connections between clustering using the spherical k-means algorithm and the problem of matrix approximation for word-by-document matrices.
3
Vector Space Model for Text
Term weighting component: depends on the number of occurrences of word j in document i.
Global weighting component: depends on the number of documents that contain word j.
Normalization component: scales each document vector to unit length.
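The three components above multiply together into one weight per (document, word) entry. The sketch below is a minimal illustration of such a scheme (plain term frequency times inverse document frequency, then unit normalization); the paper's exact weighting variants may differ.

```python
import numpy as np

def tfidf_vectors(term_counts):
    """Build unit-norm tf-idf document vectors from a term-count matrix.

    term_counts: (n_docs, n_words); entry (i, j) is the count of word j
    in document i.
    """
    tf = np.asarray(term_counts, dtype=float)          # term weighting component
    df = np.count_nonzero(tf, axis=0)                  # documents containing word j
    idf = np.log(tf.shape[0] / np.maximum(df, 1))      # global weighting component
    x = tf * idf                                       # combined local * global weight
    norms = np.linalg.norm(x, axis=1, keepdims=True)   # normalization component
    return x / np.maximum(norms, 1e-12)                # each document is a unit vector

docs = [[2, 0, 1], [0, 3, 1], [1, 1, 0]]               # toy term-count matrix
X = tfidf_vectors(docs)                                # rows are unit-length vectors
```

Unit-length rows are what the spherical k-means algorithm in the following slides assumes, since it compares documents by cosine similarity.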
4
The Spherical k-means Algorithm
5
Cosine similarity: for unit-length document vectors x and y, similarity is measured by the inner product x^T y.
Concept vector: for each cluster, take the mean vector m_j of its documents and normalize it to unit length, c_j = m_j / ||m_j||.
6
Spherical k-means (1/4)
Objective function: the total cosine similarity between documents and their cluster's concept vector, Q({pi_j}) = sum_j sum_{x in pi_j} x^T c_j.
Optimal partitioning: assign each document to the cluster whose concept vector is closest in cosine similarity.
The inner sum, sum_{x in pi_j} x^T c_j, is the measurement for "coherence" or "quality" of each cluster.
7
Spherical k-means (2/4)
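The iteration on this slide alternates between the two steps above: re-partition the documents, then recompute concept vectors. A minimal sketch (toy unit vectors, fixed iteration count; not the paper's implementation):

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Spherical k-means on X (n_docs, d) with unit-norm rows.

    Alternates (1) assigning each document to the concept vector with the
    highest cosine similarity and (2) recomputing each concept vector as
    the normalized mean of its cluster.  Returns (assignments, concepts).
    """
    rng = np.random.default_rng(seed)
    concepts = X[rng.choice(len(X), size=k, replace=False)]  # init from data
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        sims = X @ concepts.T                # cosine similarities (unit rows)
        assign = sims.argmax(axis=1)         # optimal partitioning step
        for j in range(k):
            members = X[assign == j]
            if len(members):                 # keep old concept if cluster empties
                m = members.mean(axis=0)     # mean vector m_j
                concepts[j] = m / np.linalg.norm(m)  # c_j = m_j / ||m_j||
    return assign, concepts

# Two obvious directions on the unit circle
pts = np.array([[1.0, 0.05], [1.0, -0.05], [0.05, 1.0], [-0.05, 1.0]])
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
labels, C = spherical_kmeans(pts, 2)
```

Each step can only increase the objective Q, which is the monotonicity property shown on the convergence slides below.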
8
Spherical k-means (3/4): Convergence
Monotone: the objective value never decreases from one iteration to the next, Q^(t) <= Q^(t+1).
9
Spherical k-means (4/4): Convergence
Bound: the objective is bounded above (each term x^T c_j is at most 1), so the monotone sequence of objective values has a limit.
This does not imply that the underlying partitioning converges.
10
Experimental Results (1/4)
Data sets
CLASSIC3 data set: 3,893 documents — MEDLINE (1,033), CISI (1,460), CRANFIELD (1,400); 4,099 words after preprocessing; uses only term frequency.
NSF data set: 13,297 abstracts of grants awarded by the NSF; 5,298 words after preprocessing; uses term frequency and inverse document frequency.
11
Experimental Results (2/4)
Confusion matrix for CLASSIC3 data
Objective function plot
12
Experimental Results (3/4)
Intra-cluster structure
13
Experimental Results (4/4)
Inter-cluster structure
14
Relation with Euclidean k-means Algorithms
Can be thought of as a matrix approximation problem
15
Matrix Approximation using Clustering
16
Clustering as Matrix Approximation
Formulation
X = [x_1 ... x_n]: the word-by-document matrix.
Matrix approximation: the matrix whose i-th column is the concept vector closest to x_i.
How effective is the approximation? Measure the error in the Frobenius norm, ||X - X_hat||_F.
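The approximation above simply copies, for each document, its cluster's concept vector. A small sketch on toy data (random stand-in concept vectors, not vectors learned from the paper's corpora):

```python
import numpy as np

# Clustering-based approximation: replace each document (column of X) by
# the concept vector it is closest to, then measure the Frobenius error.
rng = np.random.default_rng(0)
d, n, k = 20, 30, 3
X = rng.random((d, n))
X /= np.linalg.norm(X, axis=0)            # unit-norm document columns

C = rng.random((d, k))
C /= np.linalg.norm(C, axis=0)            # stand-in concept vectors

closest = (C.T @ X).argmax(axis=0)        # nearest concept in cosine similarity
X_hat = C[:, closest]                     # i-th column = closest concept vector
err = np.linalg.norm(X - X_hat, "fro")    # quality of the approximation
```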
17
Concept Decomposition (1/2)
Formulation: the concept decomposition is the least-squares approximation of X onto the span of the concept vectors — approximate X by C_k Z*, where C_k = [c_1 ... c_k] is the concept matrix and Z* minimizes ||X - C_k Z||_F over all k-by-n matrices Z.
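The least-squares step can only improve on the cruder "copy the closest concept vector" approximation, since the latter corresponds to one particular choice of Z. A sketch on toy data (random stand-in concept vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 20, 30, 3
X = rng.random((d, n))
X /= np.linalg.norm(X, axis=0)             # unit-norm document columns
C = rng.random((d, k))
C /= np.linalg.norm(C, axis=0)             # stand-in concept-matrix columns

# Z minimizes ||X - C Z||_F, solved column by column as least squares
Z, *_ = np.linalg.lstsq(C, X, rcond=None)
D = C @ Z                                  # concept decomposition of X

# Compare against copying each column's closest concept vector
err_ls = np.linalg.norm(X - D, "fro")
err_naive = np.linalg.norm(X - C[:, (C.T @ X).argmax(axis=0)], "fro")
```

By optimality of least squares, `err_ls` is never larger than `err_naive`.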
18
Concept Decomposition (2/2)
19
Concept Vectors and Singular Vectors: A Comparison
20
Concept vectors are local and sparse (1/6)
Locality: three concept vectors for CLASSIC3 data
21
Concept vectors are local and sparse (2/6)
Three singular vectors for CLASSIC3 data
22
Concept vectors are local and sparse (3/6)
Four (among 10) concept vectors for NSF data
23
Concept vectors are local and sparse (4/6)
Four (among 10) singular vectors for NSF data
24
Concept vectors are local and sparse (5/6)
Sparsity: as the number of clusters increases, the concept vectors become progressively sparser.
25
Concept vectors are local and sparse (6/6)
Orthonormality: as the number of clusters increases, the concept vectors tend toward "orthonormality."
26
Principal Angles: Comparing Concept and Singular subspaces (1/4)
Generalize the notion of the angle between two lines to higher-dimensional subspaces of R^d.
Formulation: F and G are subspaces of R^d; the principal angles measure how far apart they are.
Similarity between the subspaces is summarized by the average cosine of the principal angles.
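The cosines of the principal angles between two subspaces can be computed as the singular values of Q_F^T Q_G, where Q_F and Q_G are orthonormal bases for the subspaces. A minimal sketch of the comparison measure used on the following slides:

```python
import numpy as np

def avg_principal_cosine(F, G):
    """Average cosine of the principal angles between span(F) and span(G).

    F: (d, p) and G: (d, q) matrices whose columns span the two subspaces.
    The cosines are the singular values of Q_F^T Q_G, where Q_F and Q_G
    are orthonormal bases (obtained here via QR).
    """
    Qf, _ = np.linalg.qr(F)
    Qg, _ = np.linalg.qr(G)
    cosines = np.linalg.svd(Qf.T @ Qg, compute_uv=False)
    return cosines.mean()

# Identical subspaces: every principal angle is 0, so the average cosine is 1
A = np.eye(4)[:, :2]
same = avg_principal_cosine(A, A)
```

In the experiments, F would be the span of the concept vectors and G the corresponding singular subspace.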
27
Principal Angles: Comparing Concept and Singular subspaces (2/4)
CLASSIC3 data set With singular subspace S3 With singular subspace S10
28
Principal Angles: Comparing Concept and Singular subspaces (3/4)
NSF data set (1/2) With singular subspace S64
29
Principal Angles: Comparing Concept and Singular subspaces (4/4)
NSF data set (2/2) With singular subspace S235
30
Conclusions
Presented the spherical k-means algorithm for text documents, which are high-dimensional and sparse.
Average cluster coherence tends to be quite low: there is a large void surrounding each concept vector. This is uncommon for low-dimensional, dense data sets.
Concept decompositions derived from the concept vectors can be used for matrix approximation, with quality comparable to that of truncated SVDs.
The concept vectors constitute a powerful, sparse, and localized "basis" for text data sets.