A Soft Subspace Clustering Method for Text Data Using a Probability-Based Feature Weighting Scheme
Abdul Wahid, Xiaoying Gao, Peter Andreae
Victoria University of Wellington, New Zealand

Soft subspace clustering
Clustering normally uses all features
Text data has too many features
Subspace clustering: use subsets of features (subspaces)
Soft: a feature has a weight in each subspace

Research questions
What are the subspaces?
How to define the weights?
How to map features to subspaces?
LDA (Latent Dirichlet Allocation)
Topic modelling that automatically detects topics
Solution
Topics as subspaces
Weights: word probability in each topic

LDA: example by Edwin Chen
Suppose you have the following set of sentences, and you want two topics:
I like to eat broccoli and bananas.
I ate a banana and spinach smoothie for breakfast.
Chinchillas and kittens are cute.
My sister adopted a kitten yesterday.
Look at this cute hamster munching on a piece of broccoli.

LDA example by Edwin Chen
Sentences 1 and 2: 100% Topic A
Sentences 3 and 4: 100% Topic B
Sentence 5: 60% Topic A, 40% Topic B
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret Topic A to be about food)
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (at which point, you could interpret Topic B to be about cute animals)

Apply LDA
Gibbs Sampling
Generate two matrices:
Document-topic matrix 𝜃
Topic-term matrix 𝜙
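As a concrete illustration (not from the slides), here is a minimal Python sketch of obtaining the two matrices, assuming scikit-learn. The slides use Gibbs sampling, whereas scikit-learn's LatentDirichletAllocation uses variational inference, so treat this only as a stand-in for the same 𝜃 and 𝜙 outputs. It reuses the Edwin Chen sentences above.

```python
# Sketch only: produce a document-topic matrix (theta) and a topic-term matrix (phi)
# with scikit-learn's LDA (variational inference, not the Gibbs sampling on the slide).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "I like to eat broccoli and bananas",
    "I ate a banana and spinach smoothie for breakfast",
    "Chinchillas and kittens are cute",
    "My sister adopted a kitten yesterday",
    "Look at this cute hamster munching on a piece of broccoli",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)              # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                    # theta: documents x topics (rows sum to 1)
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # phi: topics x terms

print(theta.round(2))                           # topic mixture of each document

# Top words per topic, analogous to the "30% broccoli, ..." interpretation above
terms = vectorizer.get_feature_names_out()
for t in range(2):
    top = phi[t].argsort()[::-1][:4]
    print("topic", t, [terms[i] for i in top])
```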

Pipeline: Documents → Preprocessing → LDA Gibbs Sampling → 𝜃 (assign initial clusters) and 𝜙 (assign weights) → Refine clusters

Our DWKM algorithm
A K-means-based algorithm
Use LDA to get the two matrices
Use the document-topic matrix to initialise the clusters
Repeat
Calculate the centroid of each cluster
Assign each document to the nearest centroid, with a distance measure weighted by the topic-term matrix
Until convergence

New distance measure
Weights: word probability in a topic, 𝜙_xt (the probability of word x in topic t)
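The two slides above outline the whole DWKM loop. The sketch below is my own reconstruction in NumPy: the slides do not give the exact distance formula, so the 𝜙-weighted squared Euclidean distance (weight of term x for cluster t is phi[t, x]) is an assumption, not the paper's definition.

```python
# Sketch of the DWKM idea: K-means with initial clusters taken from theta and a
# distance weighted by phi. The weighted distance itself is an assumed form.
import numpy as np

def dwkm(X, theta, phi, max_iter=50):
    """X: documents x terms (dense), theta: documents x topics, phi: topics x terms."""
    labels = theta.argmax(axis=1)       # initialise clusters from the document-topic matrix
    k = phi.shape[0]
    for _ in range(max_iter):
        # centroid of each cluster (fall back to the global mean if a cluster is empty)
        centroids = np.vstack([
            X[labels == t].mean(axis=0) if np.any(labels == t) else X.mean(axis=0)
            for t in range(k)
        ])
        # distance to cluster t is weighted by that topic's word probabilities phi[t]
        dists = np.stack(
            [(phi[t] * (X - centroids[t]) ** 2).sum(axis=1) for t in range(k)],
            axis=1,
        )
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # convergence: assignments stop changing
            break
        labels = new_labels
    return labels

# e.g. labels = dwkm(X.toarray(), theta, phi), with X, theta, phi from the LDA sketch above
```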

Hard subspace clustering vs. soft subspace clustering: common approach vs. our new approach
Common approach: randomly assign feature weights → randomly assign documents to clusters → refine feature weights → refine clusters using feature weights
Our new approach: LDA (semantic information) → feature weighting → initial cluster estimation → refine clusters

Experiments
Data sets: 4 synthetic datasets, 6 real datasets
Evaluation metrics: Accuracy, F-measure, NMI (Normalized Mutual Information), Entropy
Compared with: K-means, LDA as a clustering method, FWKM, EWKM, FGKM
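For reference, here is a small sketch (assuming scikit-learn and SciPy, which the slides do not name) of two of the listed metrics: NMI computed directly, and clustering accuracy via the usual best one-to-one matching between predicted clusters and true classes.

```python
# Sketch: NMI and clustering accuracy for evaluating a clustering against true labels.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(true_labels, pred_labels):
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    classes, clusters = np.unique(true_labels), np.unique(pred_labels)
    # contingency table: how many points of each class fall in each cluster
    counts = np.zeros((clusters.size, classes.size), dtype=int)
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            counts[i, j] = np.sum((pred_labels == c) & (true_labels == k))
    row, col = linear_sum_assignment(-counts)    # matching that maximises agreement
    return counts[row, col].sum() / true_labels.size

true = [0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 1]
print(clustering_accuracy(true, pred))           # 0.8
print(normalized_mutual_info_score(true, pred))  # NMI
```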

Results on synthetic datasets (columns: K-means, LDA, FWKM, EWKM, FGKM, DWKM)
SD1 Acc: 0.65 0.66 0.77 0.69 0.82 0.87
SD1 F-M: 0.63 0.73 0.59 0.75 0.81
SD2: 0.68 0.76 0.72 0.92 0.64 0.88
SD3: 0.62 0.67 0.70 0.94 0.91
SD4: 0.60 0.61 0.93 0.58 0.90

Results

Conclusion
A new soft subspace clustering algorithm
A new distance measure
Apply LDA to get semantic information
Improved performance

Future work
Non-parametric LDA model: no need to give the number of topics
Reduce computational complexity
Use LDA to generate different candidate clustering solutions for clustering ensembles