A Soft Subspace Clustering Method for Text Data Using a Probability-based Feature Weighting Scheme
Abdul Wahid, Xiaoying Gao, Peter Andreae
Victoria University of Wellington, New Zealand
Soft subspace clustering
- Clustering normally uses all features
- Text data has too many features
- Subspace clustering uses subsets of features (subspaces)
- Soft: a feature has a weight in each subspace (see the toy sketch below)
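A toy numeric sketch of what "soft" means here; the terms, weights, and counts below are hypothetical, chosen only to show that each cluster stretches the feature space differently:

```python
import numpy as np

# Hypothetical feature weights: one row per cluster (subspace),
# one column per term. In hard subspace clustering each entry
# would be 0 or 1; in soft subspace clustering every feature
# gets a fractional weight in every subspace.
terms = ["broccoli", "banana", "kitten", "hamster"]
weights = np.array([
    [0.45, 0.40, 0.05, 0.10],   # cluster 1: a "food" subspace
    [0.05, 0.10, 0.50, 0.35],   # cluster 2: a "pets" subspace
])

# The same term counts contribute differently to each cluster's distance.
doc = np.array([2, 1, 0, 0])                 # a document about food
centroids = np.array([[1.5, 1.0, 0.1, 0.1],
                      [0.1, 0.2, 1.5, 1.0]])
dists = np.sqrt((weights * (doc - centroids) ** 2).sum(axis=1))
print(dists)  # the "food" cluster is much closer
```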
Research questions
- What are the subspaces?
- How do we map features to subspaces and define the weights?
Solution: LDA (Latent Dirichlet Allocation)
- Topic modelling: automatically detects topics
- Topics as subspaces
- Weight: word probability in each topic
LDA: example by Edwin Chen
Suppose you have the following set of sentences, and you want two topics:
1. I like to eat broccoli and bananas.
2. I ate a banana and spinach smoothie for breakfast.
3. Chinchillas and kittens are cute.
4. My sister adopted a kitten yesterday.
5. Look at this cute hamster munching on a piece of broccoli.
LDA example by Edwin Chen (continued)
- Sentences 1 and 2: 100% Topic A
- Sentences 3 and 4: 100% Topic B
- Sentence 5: 60% Topic A, 40% Topic B
- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret Topic A to be about food)
- Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret Topic B to be about cute animals)
Apply LDA (Gibbs sampling) to generate two matrices:
- Document-topic matrix 𝜃
- Topic-term matrix 𝜙
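A minimal sketch of this step in Python, using scikit-learn's LatentDirichletAllocation (which uses online variational Bayes rather than Gibbs sampling, but yields the same two matrices):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["I like to eat broccoli and bananas.",
        "I ate a banana and spinach smoothie for breakfast.",
        "Chinchillas and kittens are cute.",
        "My sister adopted a kitten yesterday.",
        "Look at this cute hamster munching on a piece of broccoli."]

# Term-frequency matrix: one row per document, one column per term.
X = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                 # document-topic matrix (rows sum to 1)
phi = lda.components_                        # topic-term pseudo-counts
phi = phi / phi.sum(axis=1, keepdims=True)   # normalise rows to get P(word | topic)
```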
Pipeline
Documents → Preprocessing → LDA Gibbs sampling → 𝜃 and 𝜙
- 𝜃 → assign initial clusters
- 𝜙 → assign feature weights
- Refine clusters
Our DWKM algorithm
A k-means-based algorithm (see the sketch after this list):
- Use LDA to get the two matrices
- Use the document-topic matrix 𝜃 to initialise the clusters
- Repeat:
  - Calculate the centroid of each cluster
  - Assign each document to the nearest centroid, where the distance measure is weighted by the topic-term matrix 𝜙
- Until convergence
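A minimal sketch of the loop, assuming a dense term-frequency matrix, one topic per cluster, and a weighted Euclidean distance; the paper gives the exact weighting scheme, and the name dwkm here is only illustrative:

```python
import numpy as np

def dwkm(X, theta, phi, n_iter=20):
    """Sketch of DWKM: LDA-initialised, topic-weighted k-means.

    X     : (n_docs, n_terms) dense term-frequency matrix
    theta : (n_docs, K) document-topic matrix from LDA
    phi   : (K, n_terms) topic-term matrix from LDA
    Assumes one topic per cluster and that no cluster empties out.
    """
    labels = theta.argmax(axis=1)   # initial clusters from theta
    K = phi.shape[0]
    for _ in range(n_iter):
        # The centroid of each cluster is the mean of its documents.
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Distance to cluster k is weighted by topic k's term
        # probabilities, so terms that matter to a topic dominate
        # that cluster's geometry.
        dists = np.array([
            np.sqrt((phi[k] * (X - centroids[k]) ** 2).sum(axis=1))
            for k in range(K)
        ])
        new_labels = dists.argmin(axis=0)
        if np.array_equal(new_labels, labels):   # converged
            return new_labels
        labels = new_labels
    return labels
```

With the matrices from the LDA sketch above: labels = dwkm(X.toarray(), theta, phi).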
New distance measure
- Weights: 𝜙ₓₜ, the probability of word x in topic t
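Spelled out, a plausible form of the weighted distance (the paper has the exact definition), where t_k is the topic associated with cluster k:

```latex
d(\mathbf{d}, \mathbf{c}_k) \;=\; \sqrt{\sum_{x} \phi_{x\,t_k}\,\bigl(d_x - c_{kx}\bigr)^2}
```

Terms with high probability in topic t_k dominate the distance for cluster k, so each cluster lives in its own soft subspace.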
Hard subspace clustering vs. soft subspace clustering
Common approach:
- Randomly assign feature weights
- Randomly assign documents to clusters
- Refine feature weights
- Refine clusters using feature weights
Our new approach:
- LDA provides semantic information
- Feature weighting
- Initial cluster estimation
- Refine clusters
Experiments
Data sets:
- 4 synthetic datasets
- 6 real datasets
Evaluation metrics (see the NMI sketch below):
- Accuracy
- F-measure
- NMI (Normalized Mutual Information)
- Entropy
Compared with: K-means, LDA as a clustering method, FWKM, EWKM, FGKM
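For example, NMI can be computed with scikit-learn; it is invariant to how cluster labels are permuted, which is what makes it suitable for clustering evaluation (toy labels below):

```python
from sklearn.metrics import normalized_mutual_info_score

# True topic labels vs. predicted cluster labels for five documents.
true_labels = [0, 0, 1, 1, 0]
pred_labels = [1, 1, 0, 0, 0]   # same grouping, swapped label names
print(normalized_mutual_info_score(true_labels, pred_labels))
```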
Results (synthetic datasets)

Dataset  Metric  K-means  LDA   FWKM  EWKM  FGKM  DWKM
SD1      Acc     0.65     0.66  0.77  0.69  0.82  0.87
SD1      F-M     0.63     0.73  0.59  0.75  –     0.81
SD2      –       0.68     0.76  0.72  0.92  0.64  0.88
SD3      –       0.62     0.67  0.70  0.94  –     0.91
SD4      –       0.60     0.61  0.93  0.58  –     0.90
Results (figure)
Conclusion
- A new soft subspace clustering algorithm
- A new distance measure
- Applies LDA to get semantic information
- Improved performance
Future work
- Non-parametric LDA model: no need to give the number of topics (see the sketch below)
- Reduce computational complexity
- Use LDA to generate different candidate clustering solutions for clustering ensembles
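Non-parametric LDA usually means a Hierarchical Dirichlet Process (HDP); a minimal sketch with gensim's HdpModel on a toy corpus, assuming default hyperparameters:

```python
from gensim.corpora import Dictionary
from gensim.models import HdpModel

texts = [["broccoli", "banana", "smoothie"],
         ["kitten", "chinchilla", "cute"],
         ["hamster", "broccoli", "cute"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# HDP infers the number of topics from the data,
# removing the need to fix it up front.
hdp = HdpModel(corpus, id2word=dictionary)
print(hdp.print_topics(num_topics=2))
```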