1
Concept Decompositions for Large Sparse Text Data Using Clustering
Dhillon, I. S. and Modha, D. S. Machine Learning, 42(1), 2001
Nov. 9, 2001
Summarized by Jeong-Ho Chang
2
Introduction
Study a certain spherical k-means algorithm for clustering document vectors.
Empirically demonstrate that the clusters produced have a certain "fractal-like" and "self-similar" behavior.
Matrix approximation by concept decomposition: explore intimate connections between clustering using the spherical k-means algorithm and the problem of matrix approximation for word-by-document matrices.
3
Vector Space Model for Text
Term weighting component: depends on the number of occurrences of word j in document i.
Global weighting component: depends on the number of documents that contain word j.
Normalization component: scales each document vector to unit length.
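The three components above multiply together into one weight per (document, word) entry. The sketch below is a minimal illustration of such a scheme (plain term frequency times inverse document frequency, then unit normalization); the paper's exact weighting variants may differ.

```python
import numpy as np

def tfidf_vectors(term_counts):
    """Build unit-norm tf-idf document vectors from a term-count matrix.

    term_counts: (n_docs, n_words); entry (i, j) is the count of word j
    in document i.
    """
    tf = np.asarray(term_counts, dtype=float)          # term weighting component
    df = np.count_nonzero(tf, axis=0)                  # documents containing word j
    idf = np.log(tf.shape[0] / np.maximum(df, 1))      # global weighting component
    x = tf * idf                                       # combined local * global weight
    norms = np.linalg.norm(x, axis=1, keepdims=True)   # normalization component
    return x / np.maximum(norms, 1e-12)                # each document is a unit vector

docs = [[2, 0, 1], [0, 3, 1], [1, 1, 0]]               # toy term-count matrix
X = tfidf_vectors(docs)                                # rows are unit-length vectors
```

Unit-length rows are what the spherical k-means algorithm in the following slides assumes, since it compares documents by cosine similarity.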
4
The Spherical k-means Algorithm
5
Cosine similarity: for unit-length document vectors x and y, similarity is measured by the inner product x^T y.
Concept vector: for each cluster, take the mean vector m_j of its documents and normalize it to unit length, c_j = m_j / ||m_j||.
6
Spherical k-means (1/4)
Objective function: the total cosine similarity between documents and their cluster's concept vector, Q({pi_j}) = sum_j sum_{x in pi_j} x^T c_j.
Optimal partitioning: assign each document to the cluster whose concept vector is closest in cosine similarity.
The inner sum, sum_{x in pi_j} x^T c_j, is the measurement for "coherence" or "quality" of each cluster.
7
Spherical k-means (2/4)
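The iteration on this slide alternates between the two steps above: re-partition the documents, then recompute concept vectors. A minimal sketch (toy unit vectors, fixed iteration count; not the paper's implementation):

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Spherical k-means on X (n_docs, d) with unit-norm rows.

    Alternates (1) assigning each document to the concept vector with the
    highest cosine similarity and (2) recomputing each concept vector as
    the normalized mean of its cluster.  Returns (assignments, concepts).
    """
    rng = np.random.default_rng(seed)
    concepts = X[rng.choice(len(X), size=k, replace=False)]  # init from data
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        sims = X @ concepts.T                # cosine similarities (unit rows)
        assign = sims.argmax(axis=1)         # optimal partitioning step
        for j in range(k):
            members = X[assign == j]
            if len(members):                 # keep old concept if cluster empties
                m = members.mean(axis=0)     # mean vector m_j
                concepts[j] = m / np.linalg.norm(m)  # c_j = m_j / ||m_j||
    return assign, concepts

# Two obvious directions on the unit circle
pts = np.array([[1.0, 0.05], [1.0, -0.05], [0.05, 1.0], [-0.05, 1.0]])
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
labels, C = spherical_kmeans(pts, 2)
```

Each step can only increase the objective Q, which is the monotonicity property shown on the convergence slides below.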
8
Spherical k-means (3/4): Convergence
Monotone: the objective value never decreases from one iteration to the next, Q^(t) <= Q^(t+1).
9
Spherical k-means (4/4): Convergence
Bound: the objective is bounded above (each term x^T c_j is at most 1), so the monotone sequence of objective values has a limit.
This does not imply that the underlying partitioning converges.
10
Experimental Results (1/4)
Data sets
CLASSIC3 data set: 3,893 documents — MEDLINE (1,033), CISI (1,460), CRANFIELD (1,400); 4,099 words after preprocessing; uses only term frequency.
NSF data set: 13,297 abstracts of grants awarded by the NSF; 5,298 words after preprocessing; uses term frequency and inverse document frequency.
11
Experimental Results (2/4)
Confusion matrix for CLASSIC3 data
Objective function plot
12
Experimental Results (3/4)
Intra-cluster structure
13
Experimental Results (4/4)
Inter-cluster structure
14
Relation with Euclidean k-means Algorithms
Can be thought of as a matrix approximation problem
15
Matrix Approximation using Clustering
16
Clustering as Matrix Approximation
Formulation
X = [x_1 ... x_n]: the word-by-document matrix.
Matrix approximation: the matrix whose i-th column is the concept vector closest to x_i.
How effective is the approximation? Measure the error in the Frobenius norm, ||X - X_hat||_F.
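The approximation above simply copies, for each document, its cluster's concept vector. A small sketch on toy data (random stand-in concept vectors, not vectors learned from the paper's corpora):

```python
import numpy as np

# Clustering-based approximation: replace each document (column of X) by
# the concept vector it is closest to, then measure the Frobenius error.
rng = np.random.default_rng(0)
d, n, k = 20, 30, 3
X = rng.random((d, n))
X /= np.linalg.norm(X, axis=0)            # unit-norm document columns

C = rng.random((d, k))
C /= np.linalg.norm(C, axis=0)            # stand-in concept vectors

closest = (C.T @ X).argmax(axis=0)        # nearest concept in cosine similarity
X_hat = C[:, closest]                     # i-th column = closest concept vector
err = np.linalg.norm(X - X_hat, "fro")    # quality of the approximation
```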
17
Concept Decomposition (1/2)
Formulation: the concept decomposition is the least-squares approximation of X onto the span of the concept vectors — approximate X by C_k Z*, where C_k = [c_1 ... c_k] is the concept matrix and Z* minimizes ||X - C_k Z||_F over all k-by-n matrices Z.
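The least-squares step can only improve on the cruder "copy the closest concept vector" approximation, since the latter corresponds to one particular choice of Z. A sketch on toy data (random stand-in concept vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 20, 30, 3
X = rng.random((d, n))
X /= np.linalg.norm(X, axis=0)             # unit-norm document columns
C = rng.random((d, k))
C /= np.linalg.norm(C, axis=0)             # stand-in concept-matrix columns

# Z minimizes ||X - C Z||_F, solved column by column as least squares
Z, *_ = np.linalg.lstsq(C, X, rcond=None)
D = C @ Z                                  # concept decomposition of X

# Compare against copying each column's closest concept vector
err_ls = np.linalg.norm(X - D, "fro")
err_naive = np.linalg.norm(X - C[:, (C.T @ X).argmax(axis=0)], "fro")
```

By optimality of least squares, `err_ls` is never larger than `err_naive`.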
18
Concept Decomposition (2/2)
19
Concept Vectors and Singular Vectors: A Comparison
20
Concept vectors are local and sparse (1/6)
Locality: three concept vectors for CLASSIC3 data
21
Concept vectors are local and sparse (2/6)
Three singular vectors for CLASSIC3 data
22
Concept vectors are local and sparse (3/6)
Four (among 10) concept vectors for NSF data
23
Concept vectors are local and sparse (4/6)
Four (among 10) singular vectors for NSF data
24
Concept vectors are local and sparse (5/6)
Sparsity: as the number of clusters increases, the concept vectors become progressively sparser.
25
Concept vectors are local and sparse (6/6)
Orthonormality: as the number of clusters increases, the concept vectors tend toward "orthonormality."
26
Principal Angles: Comparing Concept and Singular subspaces (1/4)
Generalize the notion of the angle between two lines to higher-dimensional subspaces of R^d.
Formulation: F and G are subspaces of R^d; the principal angles measure how far apart they are.
Similarity between the subspaces is summarized by the average cosine of the principal angles.
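The cosines of the principal angles between two subspaces can be computed as the singular values of Q_F^T Q_G, where Q_F and Q_G are orthonormal bases for the subspaces. A minimal sketch of the comparison measure used on the following slides:

```python
import numpy as np

def avg_principal_cosine(F, G):
    """Average cosine of the principal angles between span(F) and span(G).

    F: (d, p) and G: (d, q) matrices whose columns span the two subspaces.
    The cosines are the singular values of Q_F^T Q_G, where Q_F and Q_G
    are orthonormal bases (obtained here via QR).
    """
    Qf, _ = np.linalg.qr(F)
    Qg, _ = np.linalg.qr(G)
    cosines = np.linalg.svd(Qf.T @ Qg, compute_uv=False)
    return cosines.mean()

# Identical subspaces: every principal angle is 0, so the average cosine is 1
A = np.eye(4)[:, :2]
same = avg_principal_cosine(A, A)
```

In the experiments, F would be the span of the concept vectors and G the corresponding singular subspace.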
27
Principal Angles: Comparing Concept and Singular subspaces (2/4)
CLASSIC3 data set With singular subspace S3 With singular subspace S10
28
Principal Angles: Comparing Concept and Singular subspaces (3/4)
NSF data set (1/2) With singular subspace S64
29
Principal Angles: Comparing Concept and Singular subspaces (4/4)
NSF data set (2/2) With singular subspace S235
30
Conclusions
Presented the spherical k-means algorithm for text documents, which are high-dimensional and sparse.
Average cluster coherence tends to be quite low: there is a large void surrounding each concept vector. This is uncommon for low-dimensional, dense data sets.
Concept decompositions derived from the concept vectors can be used for matrix approximation, with quality comparable to that of truncated SVDs.
The concept vectors constitute a powerful, sparse, and localized "basis" for text data sets.