1
A Soft Subspace Clustering Method for Text Data Using a Probability-Based Feature Weighting Scheme
Abdul Wahid, Xiaoying Gao, Peter Andreae
Victoria University of Wellington, New Zealand
2
Soft subspace clustering
Clustering normally uses all features
Text data has too many features
Subspace clustering uses subsets of features (subspaces)
Soft: each feature has a weight in each subspace
3
Research questions
What are the subspaces?
How to define the weights?
Feature to subspace
LDA (Latent Dirichlet Allocation): topic modelling, automatically detects topics
Solution
Topics as subspaces
Weight: word probability in each topic
4
LDA: example by Edwin Chen
Suppose you have the following set of sentences, and you want two topics:
1. I like to eat broccoli and bananas.
2. I ate a banana and spinach smoothie for breakfast.
3. Chinchillas and kittens are cute.
4. My sister adopted a kitten yesterday.
5. Look at this cute hamster munching on a piece of broccoli.
5
LDA example by Edwin Chen
Sentences 1 and 2: 100% Topic A
Sentences 3 and 4: 100% Topic B
Sentence 5: 60% Topic A, 40% Topic B
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret Topic A to be about food)
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret Topic B to be about cute animals)
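To make this concrete, here is a minimal sketch of fitting a two-topic LDA model on the five example sentences with gensim. On such a tiny corpus the learned topics vary from run to run, so the output will not reproduce Chen's illustrative numbers; the tokenisation and parameter choices below are assumptions for demonstration only.

    # Minimal LDA sketch on the five example sentences (gensim).
    from gensim import corpora
    from gensim.models import LdaModel

    sentences = [
        "I like to eat broccoli and bananas.",
        "I ate a banana and spinach smoothie for breakfast.",
        "Chinchillas and kittens are cute.",
        "My sister adopted a kitten yesterday.",
        "Look at this cute hamster munching on a piece of broccoli.",
    ]
    texts = [s.lower().rstrip(".").split() for s in sentences]  # naive tokenisation
    dictionary = corpora.Dictionary(texts)               # word <-> id mapping
    corpus = [dictionary.doc2bow(t) for t in texts]      # bag-of-words vectors

    lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
                   passes=50, random_state=0)
    for t in range(2):
        print(lda.print_topic(t))                        # top words per topic
    for i, bow in enumerate(corpus):
        print(i + 1, lda.get_document_topics(bow))       # topic mix per sentence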
6
Apply LDA with Gibbs sampling to generate two matrices:
Document-topic matrix 𝜃
Topic-term matrix 𝜙
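Continuing the sketch above, both matrices can be pulled out of a fitted gensim model as NumPy arrays. (gensim's LdaModel is trained by variational Bayes rather than Gibbs sampling, but it produces the same 𝜃 and 𝜙 outputs the slide refers to.)

    import numpy as np

    # phi: topic-term matrix, shape (num_topics, vocab_size);
    # row t is the word distribution of topic t.
    phi = lda.get_topics()

    # theta: document-topic matrix, shape (num_docs, num_topics);
    # row i is the topic mixture of document i.
    theta = np.zeros((len(corpus), lda.num_topics))
    for i, bow in enumerate(corpus):
        for t, p in lda.get_document_topics(bow, minimum_probability=0.0):
            theta[i, t] = p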
7
Assign Initial Clusters
[Pipeline diagram: Documents → Preprocessing → LDA Gibbs sampling → 𝜃 assigns initial clusters, 𝜙 assigns weights → Refine clusters]
8
Our DWKM algorithm
A k-means-based algorithm
Use LDA to get the two matrices
Use the document-topic matrix 𝜃 to initialise the clusters
Repeat:
  Calculate the centroid of each cluster
  Assign each document to the nearest centroid, with the distance measure weighted by the topic-term matrix 𝜙
Until convergence (see the sketch below)
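A rough NumPy sketch of this loop, reusing the theta and phi arrays above. The specific weighting here (one cluster per topic, distances weighted by that topic's 𝜙 row) is an assumption made for illustration; the paper defines the exact probability-based weighting scheme.

    import numpy as np

    def dwkm_sketch(X, theta, phi, max_iter=20):
        """X: (n_docs, n_terms) document-term matrix;
        theta: (n_docs, K) document-topic matrix;
        phi: (K, n_terms) topic-term matrix."""
        K = theta.shape[1]
        labels = theta.argmax(axis=1)         # initial clusters from theta
        for _ in range(max_iter):
            # Centroid of each cluster (fall back to global mean if empty).
            centroids = np.vstack([
                X[labels == k].mean(axis=0) if np.any(labels == k) else X.mean(axis=0)
                for k in range(K)
            ])
            # Distance to each centroid, weighted by that cluster's topic weights.
            dist = np.zeros((X.shape[0], K))
            for k in range(K):
                diff = X - centroids[k]
                dist[:, k] = np.sqrt((phi[k] * diff ** 2).sum(axis=1))
            new_labels = dist.argmin(axis=1)  # reassign to nearest centroid
            if np.array_equal(new_labels, labels):
                break                         # converged
            labels = new_labels
        return labels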
9
New distance measure
Weights: word probability in a topic, 𝜙_{x,t} (one plausible form is given below)
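The slide names the weights but not the full formula. One plausible form of such a topic-weighted distance between document d_i and centroid c_k, matching the sketch above and assuming cluster k is tied to a topic t_k, is

    dist(d_i, c_k) = \sqrt{ \sum_x \phi_{x, t_k} \, (d_{i,x} - c_{k,x})^2 }

where the sum runs over terms x and \phi_{x, t_k} is the probability of word x in topic t_k. The measure actually defined in the paper may differ in detail.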
10
Hard Subspace Clustering vs. Soft Subspace Clustering

Common approach:
Randomly assign feature weights
Randomly assign documents to clusters
Refine feature weights
Refine clusters using feature weights

Our new approach:
LDA provides semantic information
Feature weighting and initial cluster estimation come from LDA
Refine clusters
11
Experiments
Data sets: 4 synthetic datasets, 6 real datasets
Evaluation parameters: Accuracy, F-measure, NMI (Normalized Mutual Information), Entropy
Compared with: K-means, LDA as a clustering method, FWKM, EWKM, FGKM
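For reference, a short sketch of computing two of the listed metrics with scikit-learn and SciPy, given gold labels y and predicted cluster labels c (the arrays below are dummy values). Clustering accuracy uses the usual Hungarian matching between clusters and classes; F-measure and entropy can be derived from the same confusion matrix.

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.metrics import confusion_matrix, normalized_mutual_info_score

    y = np.array([0, 0, 1, 1, 2, 2])   # dummy gold class labels
    c = np.array([1, 1, 0, 0, 2, 0])   # dummy predicted cluster labels

    def clustering_accuracy(y_true, y_pred):
        # Best one-to-one mapping of clusters to classes (Hungarian method).
        cm = confusion_matrix(y_true, y_pred)
        rows, cols = linear_sum_assignment(-cm)   # maximise matched counts
        return cm[rows, cols].sum() / cm.sum()

    print("Acc:", clustering_accuracy(y, c))
    print("NMI:", normalized_mutual_info_score(y, c))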
12
Results on synthetic datasets (… marks cells lost in the source):

Dataset  Metric  K-means  LDA   FWKM  EWKM  FGKM  DWKM
SD1      Acc     0.65     0.66  0.77  0.69  0.82  0.87
SD1      F-M     0.63     0.73  0.59  0.75  0.81   …
SD2       …      0.68     0.76  0.72  0.92  0.64  0.88
SD3       …      0.62     0.67  0.70  0.94  0.91   …
SD4       …      0.60     0.61  0.93  0.58  0.90   …
13
Results
14
Conclusion
A new soft subspace clustering algorithm
A new distance measure
Applies LDA to get semantic information
Improved performance
15
Future work
Non-parametric LDA model: no need to specify the number of topics
Reduce computational complexity
Use LDA to generate different candidate clustering solutions for clustering ensembles