1
A Soft Subspace Clustering Method for Text Data Using a Probability-Based Feature Weighting Scheme
Abdul Wahid, Xiaoying Gao, Peter Andreae
Victoria University of Wellington, New Zealand
2
Soft subspace clustering
Clustering normally uses all features
Text data has too many features
Subspace clustering uses subsets of features (subspaces)
Soft: each feature has a weight in each subspace
3
Research questions
What are the subspaces?
How to define the weights?
Feature to subspace
LDA (Latent Dirichlet Allocation): topic modelling, automatically detects topics
Solution
Topics as subspaces
Weight: word probability in each topic
4
LDA: example by Edwin Chen
Suppose you have the following set of sentences, and you want two topics:
1. I like to eat broccoli and bananas.
2. I ate a banana and spinach smoothie for breakfast.
3. Chinchillas and kittens are cute.
4. My sister adopted a kitten yesterday.
5. Look at this cute hamster munching on a piece of broccoli.
5
LDA example by Edwin Chen
Sentences 1 and 2: 100% Topic A
Sentences 3 and 4: 100% Topic B
Sentence 5: 60% Topic A, 40% Topic B
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret Topic A to be about food)
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret Topic B to be about cute animals)
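To make this concrete, here is a minimal sketch of fitting a two-topic LDA model on the five example sentences with gensim. On such a tiny corpus the learned topics vary from run to run, so the output will not reproduce Chen's illustrative numbers; the tokenisation and parameter choices below are assumptions for demonstration only.

    # Minimal LDA sketch on the five example sentences (gensim).
    from gensim import corpora
    from gensim.models import LdaModel

    sentences = [
        "I like to eat broccoli and bananas.",
        "I ate a banana and spinach smoothie for breakfast.",
        "Chinchillas and kittens are cute.",
        "My sister adopted a kitten yesterday.",
        "Look at this cute hamster munching on a piece of broccoli.",
    ]
    texts = [s.lower().rstrip(".").split() for s in sentences]  # naive tokenisation
    dictionary = corpora.Dictionary(texts)               # word <-> id mapping
    corpus = [dictionary.doc2bow(t) for t in texts]      # bag-of-words vectors

    lda = LdaModel(corpus, num_topics=2, id2word=dictionary,
                   passes=50, random_state=0)
    for t in range(2):
        print(lda.print_topic(t))                        # top words per topic
    for i, bow in enumerate(corpus):
        print(i + 1, lda.get_document_topics(bow))       # topic mix per sentence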
6
Apply LDA with Gibbs sampling to generate two matrices:
Document-topic matrix 𝜃
Topic-term matrix 𝜙
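Continuing the sketch above, both matrices can be pulled out of a fitted gensim model as NumPy arrays. (gensim's LdaModel is trained by variational Bayes rather than Gibbs sampling, but it produces the same 𝜃 and 𝜙 outputs the slide refers to.)

    import numpy as np

    # phi: topic-term matrix, shape (num_topics, vocab_size);
    # row t is the word distribution of topic t.
    phi = lda.get_topics()

    # theta: document-topic matrix, shape (num_docs, num_topics);
    # row i is the topic mixture of document i.
    theta = np.zeros((len(corpus), lda.num_topics))
    for i, bow in enumerate(corpus):
        for t, p in lda.get_document_topics(bow, minimum_probability=0.0):
            theta[i, t] = p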
7
Assign Initial Clusters
[Pipeline diagram: Documents → Preprocessing → LDA Gibbs sampling → 𝜃 assigns initial clusters, 𝜙 assigns weights → Refine clusters]
8
Our DWKM algorithm
A k-means-based algorithm
Use LDA to get the two matrices
Use the document-topic matrix 𝜃 to initialise the clusters
Repeat:
  Calculate the centroid of each cluster
  Assign each document to the nearest centroid, with the distance measure weighted by the topic-term matrix 𝜙
Until convergence (see the sketch below)
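A rough NumPy sketch of this loop, reusing the theta and phi arrays above. The specific weighting here (one cluster per topic, distances weighted by that topic's 𝜙 row) is an assumption made for illustration; the paper defines the exact probability-based weighting scheme.

    import numpy as np

    def dwkm_sketch(X, theta, phi, max_iter=20):
        """X: (n_docs, n_terms) document-term matrix;
        theta: (n_docs, K) document-topic matrix;
        phi: (K, n_terms) topic-term matrix."""
        K = theta.shape[1]
        labels = theta.argmax(axis=1)         # initial clusters from theta
        for _ in range(max_iter):
            # Centroid of each cluster (fall back to global mean if empty).
            centroids = np.vstack([
                X[labels == k].mean(axis=0) if np.any(labels == k) else X.mean(axis=0)
                for k in range(K)
            ])
            # Distance to each centroid, weighted by that cluster's topic weights.
            dist = np.zeros((X.shape[0], K))
            for k in range(K):
                diff = X - centroids[k]
                dist[:, k] = np.sqrt((phi[k] * diff ** 2).sum(axis=1))
            new_labels = dist.argmin(axis=1)  # reassign to nearest centroid
            if np.array_equal(new_labels, labels):
                break                         # converged
            labels = new_labels
        return labels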
9
New distance measure
Weights: word probability in a topic, 𝜙_{x,t} (one plausible form is given below)
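The slide names the weights but not the full formula. One plausible form of such a topic-weighted distance between document d_i and centroid c_k, matching the sketch above and assuming cluster k is tied to a topic t_k, is

    dist(d_i, c_k) = \sqrt{ \sum_x \phi_{x, t_k} \, (d_{i,x} - c_{k,x})^2 }

where the sum runs over terms x and \phi_{x, t_k} is the probability of word x in topic t_k. The measure actually defined in the paper may differ in detail.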
10
Hard Subspace Clustering vs. Soft Subspace Clustering

Common approach:
Randomly assign feature weights
Randomly assign documents to clusters
Refine feature weights
Refine clusters using feature weights

Our new approach:
LDA provides semantic information
Feature weighting and initial cluster estimation come from LDA
Refine clusters
11
Experiments
Data sets: 4 synthetic datasets, 6 real datasets
Evaluation parameters: Accuracy, F-measure, NMI (Normalized Mutual Information), Entropy
Compared with: K-means, LDA as a clustering method, FWKM, EWKM, FGKM
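For reference, a short sketch of computing two of the listed metrics with scikit-learn and SciPy, given gold labels y and predicted cluster labels c (the arrays below are dummy values). Clustering accuracy uses the usual Hungarian matching between clusters and classes; F-measure and entropy can be derived from the same confusion matrix.

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.metrics import confusion_matrix, normalized_mutual_info_score

    y = np.array([0, 0, 1, 1, 2, 2])   # dummy gold class labels
    c = np.array([1, 1, 0, 0, 2, 0])   # dummy predicted cluster labels

    def clustering_accuracy(y_true, y_pred):
        # Best one-to-one mapping of clusters to classes (Hungarian method).
        cm = confusion_matrix(y_true, y_pred)
        rows, cols = linear_sum_assignment(-cm)   # maximise matched counts
        return cm[rows, cols].sum() / cm.sum()

    print("Acc:", clustering_accuracy(y, c))
    print("NMI:", normalized_mutual_info_score(y, c))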
12
Results on synthetic datasets (… marks cells lost in the source):

Dataset  Metric  K-means  LDA   FWKM  EWKM  FGKM  DWKM
SD1      Acc     0.65     0.66  0.77  0.69  0.82  0.87
SD1      F-M     0.63     0.73  0.59  0.75  0.81   …
SD2       …      0.68     0.76  0.72  0.92  0.64  0.88
SD3       …      0.62     0.67  0.70  0.94  0.91   …
SD4       …      0.60     0.61  0.93  0.58  0.90   …
13
Results
14
Conclusion
A new soft subspace clustering algorithm
A new distance measure
Applies LDA to get semantic information
Improved performance
15
Future work
Non-parametric LDA model: no need to specify the number of topics
Reduce computational complexity
Use LDA to generate different candidate clustering solutions for clustering ensembles