A Probabilistic Framework for Semi-Supervised Clustering

A Probabilistic Framework for Semi-Supervised Clustering
Sugato Basu, Mikhail Bilenko, Raymond J. Mooney
Department of Computer Sciences, University of Texas at Austin
Presented by Jingting Zeng

Outline
- Introduction
- Background
- Algorithm
- Experiments
- Conclusion

What is Semi-Supervised Clustering?
- Use human input to provide labels for some of the data
- Improve existing naive clustering methods
- Use the labeled data to guide the clustering of the unlabeled data
- End result: a better clustering of the data

Motivation
- Large amounts of unlabeled data exist, and more are being produced all the time
- Labels for data are expensive to generate and usually require human intervention
- Goal: use a limited amount of labeled data to guide the clustering of the entire dataset and thereby obtain a better clustering

Semi-Supervised Clustering
- Constraint-based: modify the objective function to include constraint satisfaction, enforcing the constraints during initialization and clustering
- Distance-based: train a distance function on the supervised data to satisfy the labels or constraints, then apply it to the complete dataset

Method
- Combine the constraint-based and distance-based approaches in a unified framework
- Use a Hidden Markov Random Field (HMRF) to model the constraints probabilistically
- Use the constraints for initialization and for assigning points to clusters
- Use an adaptive distance function, learning the distance measure during clustering
- Cluster the data by minimizing an objective function based on the distortion measure

Main Points of the Method
- Improved initialization: initial clusters are formed based on the constraints
- Constraint-sensitive assignment: points are assigned to clusters so as to minimize a distortion function while also minimizing the number of violated constraints
- Iterative distance learning: the distortion measure is re-estimated after each iteration

Constraints
- Pairwise constraints with must-link or cannot-link labels
- A set M of must-link constraints
- A set C of cannot-link constraints
- A list of associated costs for violating must-link or cannot-link requirements
- Class labels do not have to be known; a user can still specify relationships between points
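A minimal sketch of how such pairwise constraints and violation costs might be represented in code; the names must_link, cannot_link, w, and w_bar are illustrative assumptions, not the authors' notation or implementation.

```python
# Hypothetical representation of pairwise constraints for semi-supervised clustering.
# Indices refer to rows of the data matrix X; the weights are violation costs.

must_link = {(0, 3), (2, 7)}     # set M: pairs that should end up in the same cluster
cannot_link = {(1, 4), (3, 9)}   # set C: pairs that should end up in different clusters

# Costs for violating each must-link (w) and cannot-link (w_bar) constraint;
# here simply uniform.
w = {pair: 1.0 for pair in must_link}
w_bar = {pair: 1.0 for pair in cannot_link}
```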

HMRF

Posterior Probability
- This is an "incomplete-data" problem: both the cluster representatives and the cluster labels are unknown
- A popular method for solving this type of problem is Expectation-Maximization (EM)
- K-Means is equivalent to an EM algorithm with hard cluster assignments
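As a hedged reconstruction of the probabilistic picture (the notation below is assumed rather than taken from the slides): in the HMRF, the prior over the hidden label configuration factorizes over the constrained pairs, and the likelihood ties each point to its cluster representative, so maximizing the posterior over labels is equivalent to minimizing a sum of distortion and constraint-penalty terms.

```latex
\Pr(L \mid X) \;\propto\;
\underbrace{\prod_{x_i \in X} \exp\!\big(-D(x_i, \mu_{l_i})\big)}_{\text{likelihood}}
\;\cdot\;
\underbrace{\frac{1}{Z}\prod_{(x_i, x_j) \in M \cup C} \exp\!\big(-V(x_i, x_j)\big)}_{\text{HMRF prior over labels}}
```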

Must-Link Violation Cost Function
- Ensures that the penalty for violating a must-link constraint between two points that are far apart is higher than between two points that are close together
- Punishes distance functions under which must-link points are far apart
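One penalty with this property, written as a sketch (the symbols w_ij for the constraint weight and φ_D for the distortion-based penalty function follow the general form in the paper, but the exact definition should be checked against it):

```latex
f_{M}(x_i, x_j) \;=\; w_{ij}\,\varphi_{D}(x_i, x_j)\,\mathbb{1}\!\left[\,l_i \neq l_j\,\right]
```

The farther apart a violated must-link pair is under the current distance measure, the larger the cost.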

Cannot-Link Violation Cost Function
- Ensures that the penalty for violating a cannot-link constraint between points that are nearby according to the current distance function is higher than between distant points
- Punishes distance functions that place two cannot-link points close together
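The corresponding cannot-link penalty, again as a sketch, where φ_D^max denotes the maximum value of the penalty function over the dataset:

```latex
f_{C}(x_i, x_j) \;=\; \bar{w}_{ij}\,\big(\varphi_{D}^{\max} - \varphi_{D}(x_i, x_j)\big)\,\mathbb{1}\!\left[\,l_i = l_j\,\right]
```

Violated cannot-link pairs that sit close together under the current measure incur the largest cost.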

Objective Function
- Goal: minimize the objective function
- Supervised data is used in initialization
- Constraints are used in cluster assignments
- Distance learning takes place in the M-step
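Putting the pieces together, the objective that HMRF-KMeans minimizes has roughly the following form as a hedged sketch (the published objective also contains normalizing terms for the distortion measure that are omitted here):

```latex
J \;=\; \sum_{x_i \in X} D(x_i, \mu_{l_i})
\;+\; \sum_{\substack{(x_i, x_j) \in M \\ l_i \neq l_j}} w_{ij}\,\varphi_{D}(x_i, x_j)
\;+\; \sum_{\substack{(x_i, x_j) \in C \\ l_i = l_j}} \bar{w}_{ij}\,\big(\varphi_{D}^{\max} - \varphi_{D}(x_i, x_j)\big)
```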

Algorithm (EM Framework)
- Initialization step: use the constraints to guide the formation of the initial clusters
- E-step: minimize the objective function over the cluster assignments
- M-step: minimize the objective function over the cluster representatives, then over the parameters of the distortion measure
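To make the quantity that the E-step and M-step alternately reduce concrete, here is a hedged Python sketch of the penalized objective, assuming squared Euclidean distortion and uniform constraint weights (the function name hmrf_objective and the omission of the paper's normalizing terms are my simplifications):

```python
import numpy as np

def hmrf_objective(X, labels, centroids, must_link, cannot_link, w=1.0, w_bar=1.0):
    """Sketch of the penalized objective: distortion to the cluster representatives
    plus must-link and cannot-link violation penalties."""
    def dist(a, b):
        return float(np.sum((a - b) ** 2))   # squared Euclidean distortion (assumed)

    n = len(X)
    obj = sum(dist(X[i], centroids[labels[i]]) for i in range(n))
    phi_max = max(dist(X[i], X[j]) for i in range(n) for j in range(n))
    for (i, j) in must_link:
        if labels[i] != labels[j]:           # violated must-link: far-apart pairs cost more
            obj += w * dist(X[i], X[j])
    for (i, j) in cannot_link:
        if labels[i] == labels[j]:           # violated cannot-link: nearby pairs cost more
            obj += w_bar * (phi_max - dist(X[i], X[j]))
    return obj
```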

Initialization
- Form the transitive closure of the must-link constraints
- This gives a set of connected components (neighborhoods) of points connected by must-link constraints; let y be the number of components
- If y < K (the number of clusters), the y connected neighborhoods are used to create y initial clusters
- The remaining clusters are initialized by random perturbations of the global centroid of the data
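A self-contained sketch of the transitive-closure step using a small union-find; the function name mustlink_neighborhoods and the decision to drop singleton components are my assumptions for illustration.

```python
def mustlink_neighborhoods(n_points, must_link):
    """Compute the connected components of the must-link graph (transitive closure)."""
    parent = list(range(n_points))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for (i, j) in must_link:
        parent[find(i)] = find(j)           # union the two components

    components = {}
    for p in range(n_points):
        components.setdefault(find(p), []).append(p)
    # Keep only neighborhoods that actually contain must-linked points.
    return [sorted(members) for members in components.values() if len(members) > 1]

# Example: points 0-1-2 are chained by must-links, 4-5 are linked.
print(mustlink_neighborhoods(6, {(0, 1), (1, 2), (4, 5)}))   # [[0, 1, 2], [4, 5]]
```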

What If There Are More Neighborhoods Than Clusters?
- If y > K (the number of clusters), K initial clusters are selected using the distance measure
- Farthest-first traversal is a good heuristic; a weighted variant of it is used here
- The distance between two centroids is multiplied by their corresponding weights, where the weight of each centroid is proportional to the size of its neighborhood
- This biases the selection toward centroids that are relatively far apart and come from reasonably large neighborhoods
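A hedged sketch of one reasonable weighted farthest-first selection over the neighborhood centroids; the exact weighting and the choice to start from the largest neighborhood are my assumptions and may differ from the paper.

```python
import numpy as np

def weighted_farthest_first(centroids, weights, k):
    """Greedily pick k centroids, scoring each candidate by its distance to the
    already-selected set multiplied by its weight (neighborhood size)."""
    centroids = np.asarray(centroids, dtype=float)
    weights = np.asarray(weights, dtype=float)
    assert k <= len(centroids)
    chosen = [int(np.argmax(weights))]              # start from the largest neighborhood
    while len(chosen) < k:
        # Distance of each candidate to its nearest already-chosen centroid.
        d = np.min(
            np.linalg.norm(centroids[:, None, :] - centroids[chosen][None, :, :], axis=2),
            axis=1,
        )
        scores = d * weights
        scores[chosen] = -np.inf                    # never re-pick a chosen centroid
        chosen.append(int(np.argmax(scores)))
    return chosen
```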

Initialization (Continued)
Assuming the data is consistent with the constraints:
- Augment the set M with the must-link constraints inferred from the transitive closure
- For each pair of neighborhoods Np, Np' that have at least one cannot-link constraint between them, add cannot-link constraints between every member of Np and every member of Np'
- The goal is to learn as much about the data from the constraints as possible (see the sketch after the two diagrams below)

Augment Set M (diagram): a–b and b–c are must-link, so a–c is an inferred must-link.

Augment Set C (diagram): a–b is must-link and b–c is cannot-link, so a–c is an inferred cannot-link.
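A small sketch of the augmentation step illustrated by the two diagrams; the function name augment_constraints and the tuple-ordering convention are illustrative assumptions.

```python
def augment_constraints(neighborhoods, must_link, cannot_link):
    """Add inferred must-links within each neighborhood, and inferred cannot-links
    between every pair of members of two neighborhoods that already have at least
    one cannot-link constraint between them."""
    M, C = set(must_link), set(cannot_link)
    for nb in neighborhoods:
        for i in nb:
            for j in nb:
                if i < j:
                    M.add((i, j))                       # inferred must-link
    for p in range(len(neighborhoods)):
        for q in range(p + 1, len(neighborhoods)):
            np_, nq = neighborhoods[p], neighborhoods[q]
            if any((i, j) in C or (j, i) in C for i in np_ for j in nq):
                for i in np_:
                    for j in nq:
                        C.add((min(i, j), max(i, j)))   # inferred cannot-link
    return M, C
```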

E-Step
- Assigns data points to clusters
- Since the model incorporates interactions between points, computing the point assignments that minimize the objective function is computationally intractable
- Approximate approaches include iterated conditional modes (ICM), belief propagation, and linear programming relaxation
- ICM uses a greedy strategy: it sequentially updates the cluster assignment of each point while keeping the assignments of all other points fixed
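A hedged sketch of an ICM-style assignment sweep, assuming squared Euclidean distortion and uniform constraint weights; the function name icm_assign and the fixed update order are my choices, not the authors' code.

```python
import numpy as np

def icm_assign(X, labels, centroids, must_link, cannot_link, w=1.0, w_bar=1.0, sweeps=5):
    """Greedily update each point's cluster to reduce distortion plus constraint
    penalties, holding all other assignments fixed; repeat until no change."""
    n, K = X.shape[0], centroids.shape[0]
    phi_max = max(float(np.sum((X[i] - X[j]) ** 2)) for i in range(n) for j in range(n))
    labels = np.array(labels)
    for _ in range(sweeps):
        changed = False
        for i in range(n):
            costs = np.sum((centroids - X[i]) ** 2, axis=1)   # distortion to each centroid
            for (a, b) in must_link:
                if i in (a, b):
                    j = b if i == a else a
                    # Any cluster other than labels[j] violates the must-link.
                    costs[np.arange(K) != labels[j]] += w * float(np.sum((X[a] - X[b]) ** 2))
            for (a, b) in cannot_link:
                if i in (a, b):
                    j = b if i == a else a
                    # Assigning i to labels[j] violates the cannot-link.
                    costs[labels[j]] += w_bar * (phi_max - float(np.sum((X[a] - X[b]) ** 2)))
            best = int(np.argmin(costs))
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels
```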

M-Step
- First, the cluster representatives are re-estimated to decrease the objective function; the constraints do not factor into this part, so it is equivalent to the K-Means centroid update
- If a parameterized variant of the distance measure is used, it is updated here; the parameters can be found through partial derivatives of the distance function
- The learning step modifies the distortion measure so that similar points are brought closer together while dissimilar points are pulled apart
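As a hedged illustration of what a parameterized distance measure could look like here (the paper considers several distortion measures; this particular diagonally weighted Euclidean form and gradient update are assumptions for illustration):

```latex
d_{A}(x_i, \mu_{l_i}) = (x_i - \mu_{l_i})^{\top} A \,(x_i - \mu_{l_i}),
\qquad
a_{m} \;\leftarrow\; a_{m} - \eta \,\frac{\partial J}{\partial a_{m}},
```

where A is a diagonal matrix with entries a_m and η is a step size; adjusting the weights stretches or shrinks individual dimensions so that similar points move closer together and dissimilar points move apart.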

Results
- KMeans-I-C-D: the complete HMRF-KMeans algorithm, with supervised data used in initialization (I), cluster assignment (C), and distance learning (D)
- KMeans-I-C: the HMRF-KMeans algorithm without distance learning
- KMeans-I: the HMRF-KMeans algorithm without distance learning and without supervised cluster assignment

Results

Results

Results

Conclusion
- HMRF-KMeans performs well (compared to naive K-Means) with a limited number of constraints
- The goal of the algorithm was to provide a better clustering method using a limited number of constraints, and it learns quickly from them
- It should be applicable to datasets where we want to limit the amount of labeling done by humans and where supervision can be specified as pairwise constraints

Questions
- Can all types of constraints be captured in pairwise associations? What about hierarchical structure?
- Could other types of labels be included in this model, e.g. class labels as well as pairwise constraints?
- How does the model handle noise in the data or labels? For example: point A has a must-link constraint to point B, point B has a must-link constraint to point C, but point A has a cannot-link constraint to point C.

More Questions
- How does this apply to other types of data? The authors mention wanting to apply the method to other data types in the future, such as gene representations.
- Who provides the weights for constraint violations, and how are those weights determined?
- The evaluation only compares against the naive K-Means method; how does it compare with other semi-supervised clustering methods?

Reference
S. Basu, M. Bilenko, and R. J. Mooney, "A Probabilistic Framework for Semi-Supervised Clustering," Proc. 10th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD), Aug. 2004.

Thank you!