Incorporating User Provided Constraints into Document Clustering
Yanhua Chen, Manjeet Rege, Ming Dong, Jing Hua, Farshad Fotouhi
Department of Computer Science, Wayne State University, Detroit, MI 48202
{chenyanh, rege, mdong, jinghua,

Outline
Introduction
Overview of related work
Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
Theoretical result for SS-NMF
Experiments and results
Conclusion

What is clustering? Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Intra-cluster distances are minimized; inter-cluster distances are maximized.

Document Clustering: grouping of text documents into meaningful clusters in an unsupervised manner, e.g., into Government, Science, and Arts documents.

Unsupervised Clustering Example

Semi-supervised clustering: problem definition
Input:
–A set of unlabeled objects
–A small amount of domain knowledge (labels or pairwise constraints)
Output:
–A partitioning of the objects into k clusters
Objective:
–Maximum intra-cluster similarity
–Minimum inter-cluster similarity
–High consistency between the partitioning and the domain knowledge

Semi-Supervised Clustering
Depending on the domain knowledge given:
–Users provide class labels (seeded points) a priori for some of the documents, or
–Users know which few documents are related (must-link) or unrelated (cannot-link).
Both forms of supervision are easy to represent, as the sketch below shows.
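For concreteness, here is a minimal sketch of how the two kinds of domain knowledge could be represented; the variable names and document indices are invented for illustration:

```python
# Seeded points: class labels known a priori for a few documents
# (document index -> class label; indices are illustrative).
seeds = {0: "Government", 7: "Science"}

# Pairwise constraints over document indices.
must_link = [(2, 5), (3, 9)]      # pairs that must share a cluster
cannot_link = [(2, 11), (5, 14)]  # pairs that must be in different clusters
```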

Why semi-supervised clustering?
Large amounts of unlabeled data exist
–More is being produced all the time
Expensive to generate labels for data
–Usually requires human intervention
Use human input to provide labels for some of the data
–Improve existing naive clustering methods
–Use labeled data to guide clustering of unlabeled data
–End result is a better clustering of the data
Potential applications
–Document/word categorization
–Image categorization
–Bioinformatics (gene/protein clustering)

Outline
Introduction
Overview of related work
Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
Theoretical work for SS-NMF
Experiments and results
Conclusion

Clustering Algorithms
Document hierarchical clustering
–Bottom-up, agglomerative
–Top-down, divisive
Document partitioning (flat clustering)
–K-means
–Probabilistic clustering using the Naïve Bayes or Gaussian mixture model, etc.
Document clustering based on graph models

Semi-supervised Clustering Algorithms
Semi-supervised clustering with labels (partial label information is given):
–SS-Seeded-Kmeans (Sugato Basu et al., ICML 2002)
–SS-Constraint-Kmeans (Sugato Basu et al., ICML 2002)
Semi-supervised clustering with constraints (pairwise must-link and cannot-link constraints are given):
–SS-COP-Kmeans (Wagstaff et al., ICML 2001)
–SS-HMRF-Kmeans (Sugato Basu et al., ACM SIGKDD 2004)
–SS-Kernel-Kmeans (Brian Kulis et al., ICML 2005)
–SS-Spectral-Normalized-Cuts (X. Ji et al., ACM SIGIR 2006)

Overview of K-means Clustering
K-means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into k clusters.
Objective function: locally minimize the sum of squared distances between the data points and their corresponding cluster centers:

$J = \sum_{h=1}^{k} \sum_{x_i \in f_h} \lVert x_i - \mu_h \rVert^2$

where $\mu_h$ is the center of cluster $f_h$.
Algorithm: initialize k cluster centers randomly, then repeat until convergence:
–Cluster Assignment Step: assign each data point $x_i$ to the cluster $f_h$ such that the distance of $x_i$ from the center of $f_h$ is minimum
–Center Re-estimation Step: re-estimate each cluster center as the mean of the points in that cluster
A minimal implementation is sketched below.
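The following sketch implements exactly these two alternating steps in NumPy; the random initialization and convergence test are standard choices, not prescribed by the slide:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: iterative relocation minimizing the sum of squared
    distances between points and their cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Cluster Assignment Step: nearest center for each point.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Center Re-estimation Step: mean of each cluster's points.
        new_centers = np.array([X[labels == h].mean(axis=0)
                                if np.any(labels == h) else centers[h]
                                for h in range(k)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers
```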

Semi-supervised Kernel K-means (SS-KK) [Brian Kulis et al., ICML 2005]
Semi-supervised kernel k-means objective:

$J = \sum_{h=1}^{k} \sum_{x_i \in \pi_h} \lVert \phi(x_i) - m_h \rVert^2 \;-\; \sum_{(x_i, x_j) \in \mathcal{M},\, l_i = l_j} w_{ij} \;+\; \sum_{(x_i, x_j) \in \mathcal{C},\, l_i = l_j} \bar{w}_{ij}$

where $\phi$ is the kernel function mapping from input space to feature space, $m_h$ is the centroid of cluster $\pi_h$, and $w_{ij}$ ($\bar{w}_{ij}$) is the cost of violating the constraint between two points.
–First term: kernel k-means objective function
–Second term: reward function for satisfying must-link constraints
–Third term: penalty function for violating cannot-link constraints
A sketch of evaluating these three terms follows.
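To make the three terms concrete, here is a sketch that evaluates this objective for a fixed assignment, using the kernel-matrix expansion of feature-space distances and a uniform violation cost w (an assumption for illustration); it is not the authors' optimization procedure:

```python
import numpy as np

def ss_kk_objective(K, labels, must_link, cannot_link, w=1.0):
    """Evaluate the SS-KK objective for a fixed cluster assignment.
    K is the kernel matrix; ||phi(x_i) - m_h||^2 summed over a cluster
    expands to sum_i K_ii - (1/|c|) * sum_ij K_ij."""
    total = 0.0
    for h in np.unique(labels):
        idx = np.where(labels == h)[0]
        total += K[idx, idx].sum() - K[np.ix_(idx, idx)].sum() / len(idx)
    # Reward for must-link pairs placed in the same cluster.
    total -= sum(w for i, j in must_link if labels[i] == labels[j])
    # Penalty for cannot-link pairs placed in the same cluster.
    total += sum(w for i, j in cannot_link if labels[i] == labels[j])
    return total
```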

Overview of Spectral Clustering
Spectral clustering is a graph-theoretic clustering algorithm.
Represent the data as a weighted graph G = (V, E, A) and minimize the between-cluster similarities (edge weights $A_{ij}$).

Spectral Normalized Cuts
Minimize the similarity between clusters $\pi_1$ and $\pi_2$: $\mathrm{cut}(\pi_1, \pi_2) = \sum_{i \in \pi_1,\, j \in \pi_2} A_{ij}$
Balance by the cluster weights, using the degree matrix $D = \mathrm{diag}(A\mathbf{1})$
Relax the discrete cluster indicator to a real vector $q$
The graph partition becomes: $\min_q \, q^T (D - A) q \,/\, q^T D q$
The solution is an eigenvector of: $(D - A)\, q = \lambda D q$
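The relaxed problem can be solved directly as a generalized eigenproblem; a minimal two-way version (assuming a connected graph with positive degrees, so D is positive definite):

```python
import numpy as np
from scipy.linalg import eigh

def normalized_cut_2way(A):
    """Two-way normalized cut: relax the cluster indicator q and solve
    (D - A) q = lambda * D q; split on the sign of the second eigenvector."""
    D = np.diag(A.sum(axis=1))
    L = D - A                    # graph Laplacian
    vals, vecs = eigh(L, D)      # generalized eigenproblem, ascending order
    q = vecs[:, 1]               # second-smallest eigenvector (Fiedler vector)
    return (q > 0).astype(int)   # threshold to recover the partition
```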

Semi-supervised Spectral Normalized Cuts (SS-SNC) [X. Ji et al., ACM SIGIR 2006]
Semi-supervised spectral learning objective:

$J = \min_q \, \frac{q^T (D - A) q}{q^T D q} \;-\; \sum_{(d_i, d_j) \in \mathcal{M},\, l_i = l_j} W^{reward}_{ij} \;+\; \sum_{(d_i, d_j) \in \mathcal{C},\, l_i = l_j} W^{penalty}_{ij}$

–First term: spectral normalized cut objective function
–Second term: reward function for satisfying must-link constraints
–Third term: penalty function for violating cannot-link constraints

Outline
Introduction
Related work
Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
–NMF review
–Model formulation and algorithm derivation
Theoretical result for SS-NMF
Experiments and results
Conclusion

Non-negative Matrix Factorization (NMF)
NMF decomposes a matrix into two non-negative parts (D. Lee et al., Nature 1999): $X \approx FG^T$, found by solving $\min \lVert X - FG^T \rVert^2$ subject to $F, G \ge 0$.
Symmetric NMF for clustering (C. Ding et al., SIAM SDM 2005): $A \approx GSG^T$, found by solving $\min \lVert A - GSG^T \rVert^2$ subject to $G, S \ge 0$.
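As a reference point for what follows, a minimal NMF with the classic Lee-Seung multiplicative updates (these are the standard update rules for the Frobenius objective above; the initialization details are arbitrary choices):

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for min ||X - F G^T||^2, F, G >= 0."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k))
    G = rng.random((m, k))
    for _ in range(n_iter):
        F *= (X @ G) / (F @ (G.T @ G) + eps)    # update F, keeping F >= 0
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)  # update G, keeping G >= 0
    return F, G
```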

SS-NMF
Incorporate prior knowledge into an NMF-based framework for document clustering.
Users provide pairwise constraints:
–Must-link constraints C_ML: two documents d_i and d_j must belong to the same cluster.
–Cannot-link constraints C_CL: two documents d_i and d_j must belong to different clusters.
Constraints are accompanied by an associated violation-cost matrix W:
–W_reward: cost of violating a must-link constraint between documents d_i and d_j, if one exists.
–W_penalty: cost of violating a cannot-link constraint between documents d_i and d_j, if one exists.
(A sketch of building these matrices follows.)
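A small sketch of turning constraint lists into the two cost matrices; the uniform costs are an assumption (in general each pair may carry its own cost):

```python
import numpy as np

def constraint_matrices(n, must_link, cannot_link, w_reward=1.0, w_penalty=1.0):
    """Build symmetric violation-cost matrices from pairwise constraint lists."""
    W_reward = np.zeros((n, n))
    W_penalty = np.zeros((n, n))
    for i, j in must_link:
        W_reward[i, j] = W_reward[j, i] = w_reward
    for i, j in cannot_link:
        W_penalty[i, j] = W_penalty[j, i] = w_penalty
    return W_reward, W_penalty
```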

SS-NMF Algorithm
Define the objective function of SS-NMF:

$J = \min_{G \ge 0,\, S \ge 0} \lVert A - GSG^T \rVert^2 \;-\; \sum_{(d_i, d_j) \in C_{ML},\, l_i = l_j} W^{reward}_{ij} \;+\; \sum_{(d_i, d_j) \in C_{CL},\, l_i = l_j} W^{penalty}_{ij}$

where $l_i$ is the cluster label of document $d_i$.

Summary of SS-NMF Algorithm
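The algorithm summary on this slide was a figure and does not survive in the transcript. As a rough sketch of the flow it describes, the following folds the constraints into the document similarity matrix and alternates multiplicative updates for the symmetric tri-factorization; the adjustment A + W_reward - W_penalty and the exact update rules are assumptions patterned on standard tri-NMF updates, so consult the authors' ICDM'07 and KAIS papers (refs [1] and [2] on the correctness slide) for the authoritative derivation:

```python
import numpy as np

def ss_nmf(A, W_reward, W_penalty, k, n_iter=300, eps=1e-9, seed=0):
    """Sketch of SS-NMF: min ||A~ - G S G^T||^2 with G, S >= 0, where A~
    folds the pairwise constraints into the similarity matrix (assumption)."""
    A_t = np.maximum(A + W_reward - W_penalty, 0)  # constraint-adjusted similarity
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    G = rng.random((n, k))
    S = rng.random((k, k))
    S = (S + S.T) / 2                              # keep S symmetric
    for _ in range(n_iter):
        G *= np.sqrt((A_t @ G @ S) / (G @ S @ (G.T @ G) @ S + eps))
        S *= np.sqrt((G.T @ A_t @ G) / ((G.T @ G) @ S @ (G.T @ G) + eps))
    labels = G.argmax(axis=1)  # cluster label: axis with the largest projection
    return G, S, labels
```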

Outline
Introduction
Overview of related work
Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
Theoretical result for SS-NMF
Experiments and results
Conclusion

Algorithm Correctness and Convergence
Based on constrained optimization theory and an auxiliary function, we can prove for SS-NMF:
1. Correctness: the solution converges to a local minimum
2. Convergence: the iterative algorithm converges
(Details in papers [1], [2])
[1] Y. Chen, M. Rege, M. Dong and J. Hua, “Incorporating User Provided Constraints into Document Clustering”, Proc. of IEEE ICDM, Omaha, NE, October 2007. (Regular paper, acceptance rate 7.2%)
[2] Y. Chen, M. Rege, M. Dong and J. Hua, “Non-negative Matrix Factorization for Semi-supervised Data Clustering”, Journal of Knowledge and Information Systems, to appear, 2008.

SS-NMF: A General Framework for Semi-supervised Clustering
Proof sketch (equations (1)-(3) in the paper): orthogonal symmetric semi-supervised NMF is equivalent to Semi-supervised Kernel K-means (SS-KK) and Semi-supervised Spectral Normalized Cuts (SS-SNC)!
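The equations themselves do not survive in the transcript, but the hinge of this style of equivalence argument is the standard trace identity below (a paraphrase under the slide's orthogonality assumption, not the paper's exact derivation):

```latex
% For symmetric A and an orthogonal indicator (G^T G = I):
\| A - G S G^T \|^2
  = \operatorname{tr}(A^T A) - 2\operatorname{tr}(G^T A G S) + \operatorname{tr}(S^T S),
% so minimizing the factorization error is equivalent to
\min_{G^T G = I} \| A - G S G^T \|^2
  \;\Longleftrightarrow\;
  \max_{G^T G = I} \operatorname{tr}(G^T A G S),
% the trace form shared by SS-KK (with a constraint-adjusted kernel) and
% SS-SNC (with a normalized affinity matrix).
```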

Advantages of SS-NMF
Clustering indicator:
–SS-KK: hard clustering; the indicator is exactly orthogonal
–SS-SNC: requires the derived latent semantic space to be orthogonal; no direct relationship between the singular vectors and the clusters
–SS-NMF: soft clustering; maps the documents into a non-negative latent semantic space which may not be orthogonal; the cluster label can be determined by the axis with the largest projection value
Time complexity:
–SS-KK: iterative algorithm
–SS-SNC: requires solving a computationally expensive constrained eigen-decomposition
–SS-NMF: iterative algorithm that can yield a partial answer at intermediate stages of the solution by specifying a fixed number of iterations; simple basic matrix computations, easily deployed over a distributed computing environment when dealing with large document collections

Outline
Introduction
Overview of related work
Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
Theoretical result for SS-NMF
Experiments and results
–Artificial toy data
–Real data
Conclusion

Experiments on Toy Data
1. Artificial toy data, consisting of two natural clusters.

Results on Toy Data (SS-KK and SS-NMF)
Table (right): difference between the cluster indicator G of SS-KK (hard clustering) and SS-NMF (soft clustering) for the toy data.
Hard clustering: each object belongs to a single cluster.
Soft clustering: each object is probabilistically assigned to clusters.

Results on Toy Data (SS-SNC and SS-NMF) (b) Data distribution in the SS-NMF subspace of two column vectors of G. The data points from the two clusters get distributed along the two axes. (a) Data distribution in the SS-SNC subspace of the first two singular vectors. There is no relationship between the axes and the clusters.

Time Complexity Analysis
Figure (above): computational speed comparison for SS-KK, SS-SNC and SS-NMF.

Experiments on Text Data
2. Summary of the data sets [1] used in the experiments.
Evaluation metric (clustering accuracy):

$AC = \frac{1}{n} \sum_{i=1}^{n} \delta(\alpha_i, \mathrm{map}(\beta_i))$

where n is the total number of documents in the experiment, δ(x, y) is the delta function that equals one if x = y and zero otherwise, $\beta_i$ is the estimated label, $\alpha_i$ is the ground truth, and map(·) maps each cluster label to the best-matching class label. A sketch of computing this metric follows.
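A common way to realize map(·) is the Hungarian algorithm over the cluster-class contingency table; a sketch (assuming labels are integer-coded from 0):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, est_labels):
    """AC = (1/n) * sum_i delta(alpha_i, map(beta_i)): find the cluster-to-class
    mapping that maximizes agreement, then score the mapped labels."""
    true_labels = np.asarray(true_labels)
    est_labels = np.asarray(est_labels)
    k = int(max(true_labels.max(), est_labels.max())) + 1
    counts = np.zeros((k, k), dtype=int)
    for t, e in zip(true_labels, est_labels):
        counts[e, t] += 1                        # contingency table
    rows, cols = linear_sum_assignment(-counts)  # maximize matched counts
    mapping = dict(zip(rows, cols))
    mapped = np.array([mapping[e] for e in est_labels])
    return float((mapped == true_labels).mean())
```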

Results on Text Data (Comparison with Unsupervised Clustering)
(1) Comparison with unsupervised clustering approaches.
Note: SS-NMF uses constraints on only 3% of document pairs.

Results on Text Data (Before and After Clustering)
(a) Typical document-document matrix before clustering
(b) Document-document similarity matrix after clustering with SS-NMF (k=2)
(c) Document-document similarity matrix after clustering with SS-NMF (k=5)

Results on Text Data (Clustering with Different Constraints)
Table (left): comparison of the confusion matrix C and the normalized cluster centroid matrix S of SS-NMF for different percentages of pairwise-constrained documents.

Results on Text Data (Comparison with Semi-supervised Clustering)
(2) Comparison with SS-KK and SS-SNC: (a) Graft-Phos, (b) England-Heart, (c) Interest-Trade.

Results on Text Data (Comparison with Semi-supervised Clustering)
Comparison with SS-KK and SS-SNC (Fbis2, Fbis3, Fbis4, Fbis5).

Experiments on Image Data
3. Image data sets [2] used in the experiments.
Figure (above): sample images for image categorization (from top to bottom: O-Owls, R-Roses, L-Lions, E-Elephants, H-Horses).

Results on Image Data (Comparison with Unsupervised Clustering)
(1) Comparison with unsupervised clustering approaches.
Table (above): comparison of image clustering accuracy between KK, SNC, NMF and SS-NMF with pairwise constraints on only 3% of the image pairs. SS-NMF consistently outperforms the other well-established unsupervised image clustering methods.

Results on Image Data (Comparison with Semi-supervised Clustering)
(2) Comparison with SS-KK and SS-SNC.
Figure (left): comparison of image clustering accuracy between SS-KK, SS-SNC, and SS-NMF for different percentages of constrained image pairs: (a) O-R, (b) L-H, (c) R-L, (d) O-R-L.

Results on Image Data (Comparison with Semi-supervised Clustering)
(2) Comparison with SS-KK and SS-SNC.
Figure (left): comparison of image clustering accuracy between SS-KK, SS-SNC, and SS-NMF for different percentages of constrained image pairs: (e) L-E-H, (f) O-R-L-E, (g) O-L-E-H, (h) O-R-L-E-H.

Outline
Introduction
Related work
Semi-supervised Non-negative Matrix Factorization (SS-NMF) for document clustering
Theoretical result for SS-NMF
Experiments and results
Conclusion

Conclusion
Semi-supervised clustering:
–has many real-world applications
–outperforms traditional clustering algorithms
The semi-supervised NMF algorithm provides a unified mathematical framework for semi-supervised clustering.
Many existing semi-supervised clustering algorithms can be extended to achieve multi-type object co-clustering tasks.

References
[1] Y. Chen, M. Rege, M. Dong and F. Fotouhi, “Deriving Semantics for Image Clustering from Accumulated User Feedbacks”, Proc. of ACM Multimedia, Germany, 2007.
[2] Y. Chen, M. Rege, M. Dong and J. Hua, “Incorporating User Provided Constraints into Document Clustering”, Proc. of IEEE ICDM, Omaha, NE, October 2007. (Regular paper, acceptance rate 7.2%)
[3] Y. Chen, M. Rege, M. Dong and J. Hua, “Non-negative Matrix Factorization for Semi-supervised Data Clustering”, Journal of Knowledge and Information Systems, invited as a best paper of ICDM 07, to appear, 2008.