A Comparison of Resampling Methods for Clustering Ensembles

A Comparison of Resampling Methods for Clustering Ensembles
Behrouz Minaei, Alexander Topchy and William Punch
Department of Computer Science and Engineering
MLMTA 2004, Las Vegas, June 22nd 2004

Outline
- Clustering Ensemble
  - How to generate different partitions?
  - How to combine multiple partitions?
- Resampling Methods
  - Bootstrap vs. Subsampling
- Experimental study
  - Methods
  - Results
- Conclusion

Ensemble Benefits
- Combinations of classifiers have proved very effective in the supervised learning framework, e.g. bagging and boosting algorithms
- Distributed data mining requires efficient algorithms capable of integrating the solutions obtained from multiple sources of data and features
- Ensembles of clusterings can provide novel, robust, and stable solutions

Is Meaningful Clustering Combination Possible?
[Figure: four different partitions of the same 2-D data set, two with 2 clusters and two with 3 clusters]
"Combination" of 4 different partitions can lead to the true clusters!

Pattern Matrix, Distance Matrix
Pattern matrix (N objects x d features):
    x11  x12  ...  x1j  ...  x1d
    x21  x22  ...  x2j  ...  x2d
    ...
    xi1  xi2  ...  xij  ...  xid
    ...
    xN1  xN2  ...  xNj  ...  xNd
Distance matrix (N x N):
    d11  d12  ...  d1j  ...  d1N
    d21  d22  ...  d2j  ...  d2N
    ...
    di1  di2  ...  dij  ...  diN
    ...
    dN1  dN2  ...  dNj  ...  dNN
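For reference, the relationship between the two representations is easy to show in code. This is a minimal NumPy/SciPy illustration (ours, not from the slides):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(0).normal(size=(5, 3))   # pattern matrix: 5 objects x 3 features
D = squareform(pdist(X, metric="euclidean"))       # 5 x 5 distance matrix, D[i, j] = d_ij
```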

Representation of Multiple Partitions
- Combination of partitions can be viewed as another clustering problem, where each partition Pi provides a new feature with categorical values
- The cluster membership of a pattern in the different partitions is regarded as a new feature vector
- Combining the partitions is then equivalent to clustering these tuples
[Table: 7 objects x1..x7 clustered by 4 algorithms, giving partitions P1..P4; each column holds that partition's arbitrary, symbolic cluster labels]
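A tiny illustration of this representation (the values are made up for the example, not taken from the slide's table): each row is an object, each column a partition, and the entries are categorical cluster labels.

```python
import numpy as np

# 7 objects described by their cluster labels in 4 partitions P1..P4
# (labels are arbitrary symbols; only equality within a column matters).
partition_labels = np.array([
    [0, 0, 1, 2],   # x1
    [0, 0, 1, 2],   # x2
    [1, 1, 0, 0],   # x3
    [1, 1, 0, 0],   # x4
    [1, 2, 0, 1],   # x5
    [2, 2, 2, 1],   # x6
    [2, 2, 2, 1],   # x7
])
# Combining the partitions = clustering these 4-dimensional categorical tuples.
```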

Re-labeling and Voting
[Table: the four partitions C-1..C-4 initially use arbitrary label symbols; after re-labeling them to a common label set, a final consensus labeling FC is obtained for objects X1..X7 by majority vote]

Co-association As Consensus Function
- Similarity between objects can be estimated by the number of clusters shared by two objects across all the partitions of an ensemble
- This similarity definition expresses the strength of co-association of n objects by an n x n matrix:
  $S(x_i, x_j) = \frac{1}{N} \sum_{k=1}^{N} I\big(p_k(x_i) = p_k(x_j)\big)$
  where xi is the i-th pattern, pk(xi) is the cluster label of xi in the k-th partition, I() is the indicator function, and N is the number of partitions
- This consensus function eliminates the need to solve the label correspondence problem

Taxonomy of Clustering Combination Approaches
Generative mechanism (source of diversity):
- Different initializations of one algorithm
- Different subsets of objects (deterministic or resampling)
- Different subsets of features
- Projection to subspaces (projection to 1D, random cuts/planes)
- Different algorithms

Consensus function:
- Co-association-based (single link, complete link, average link)
- Voting approach
- Hypergraph methods (CSPA, HGPA, MCLA)
- Mixture Model (EM)
- Information Theory approach
- Others ...

Key questions:
- Diversity of clustering: how to generate different partitions? What is the source of diversity in the components?
- Consensus function: how to combine different clusterings? How to resolve the label correspondence problem? How to ensure a symmetrical and unbiased consensus with respect to all the component partitions?

Resampling Methods
- Bootstrapping (sampling with replacement)
  - Create an artificial sample by randomly drawing N elements from the original list; some elements are picked more than once
  - On average, a bootstrap sample contains only about 63% of the distinct original elements; the remaining ~37% of the draws are repeats
- Subsampling (sampling without replacement)
  - Gives direct control over the size of the subsample
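Both schemes are easy to sketch in code. The following is a minimal Python illustration (ours, not from the paper) of drawing bootstrap and subsample index sets with NumPy; the function names are our own.

```python
import numpy as np

def bootstrap_indices(n, rng):
    # Sampling with replacement: n draws, some objects repeated,
    # roughly 37% of the original objects never drawn.
    return rng.integers(0, n, size=n)

def subsample_indices(n, sample_size, rng):
    # Sampling without replacement: explicit control over the subsample size.
    return rng.choice(n, size=sample_size, replace=False)

rng = np.random.default_rng(0)
n = 1000
boot = bootstrap_indices(n, rng)
print(1 - len(np.unique(boot)) / n)                       # fraction never drawn, close to 0.37
sub = subsample_indices(n, sample_size=n // 2, rng=rng)   # a 50% subsample
```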

Experiment: Data sets

Data set      Classes   Features   Total patterns   Patterns per class
Halfrings     2         2          400              100-300
2-spirals     2         2          200              100-100
Star/Galaxy   2         14         4192             2082-2110
Wine          3         13         178              59-71-48
LON           2         6          227              64-163
Iris          3         4          150              50-50-50

Subsampling results on Star/Galaxy

Subsampling results on Halfrings

Subsampling on Halfrings

Error Rate for Individual Clustering Algorithms

Data set      k-means   Single Link   Complete Link   Average Link
Halfrings     25%       24.3%         14%             5.3%
2 Spiral      43.5%     0%            48%             48%
Iris          15.1%     32%           16%             9.3%
Wine          30.2%     56.7%         32.6%           42%
LON           27%       27.3%         25.6%
Star/Galaxy   21%       49.7%         44.1%

Subsampling results
[Table: for each data set (Halfrings, 2 Spiral, Iris, Wine, LON, Galaxy/Star), the number of clusters per partition k, the number of partitions B (ranging from 20 to 1500), and the sample size S at which the best consensus accuracy was reached, with S also given as a percentage of the entire data (from 5% up to 80%)]

Taxonomy of Clustering Methods
- Heuristic
  - Pattern matrix: prototype-based (k-means, k-medoid)
  - Proximity matrix: linkage methods (single-link, complete-link, CHAMELEON); graph-theoretic (MST, spectral clustering, Min-cut)
- Model-based: mixture models (Gaussian mixture, latent class)
- Density-based: mode seeking (mean-shift); kernel-based (DENCLUE); grid-based spatial clustering (Wave-Cluster, STING)
- Different problems call for particular clustering algorithms; hundreds of clustering algorithms have been proposed

Clustering Combination Observations
- Each clustering algorithm addresses the number of clusters and the structure it imposes on the data differently, and therefore produces different data partitions
- A single clustering algorithm can produce distinct results on the same data set depending on parameter values
- Instead of a single clustering algorithm, use multiple algorithms
- The success of classifier ensembles in supervised learning (bagging and boosting) is the motivation

Are There Any Clusters in the Data?
- Most clustering algorithms find clusters even if the data is uniform!
[Figure: 100 uniformly distributed 2-D data points and the three "clusters" returned by k-means with k=3]
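The effect is easy to reproduce. A small sketch mirroring the figure, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))                 # 100 uniform 2-D points, no structure
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))                     # k-means still reports 3 "clusters"
```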

Clustering in a New Space
- Essentially, consensus must be found by clustering objects in a new space: the space of cluster labels
- The features of this new space are extracted from the "old" features by the different clustering algorithms
- The new space can be extended by adding the original "old" features
- The problem becomes very rich, with many possible ways to solve it

How to Generate the Ensemble?
- Apply different clustering algorithms: single-link, complete-link, spectral clustering, k-means, EM
- Use random initializations of the same algorithm: the output of k-means and EM depends on the initial cluster centers
- Use different parameter settings of the same algorithm: number of clusters, "width" parameter in spectral clustering, thresholds in linkage methods
- Use different feature subsets or data projections
- Re-sample the data with or without replacement (see the sketch below)
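As a concrete illustration of the last item, here is a sketch (our own, not the authors' code) that builds an ensemble of k-means partitions on random subsamples; objects that do not appear in a subsample get the placeholder label -1.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_ensemble(X, n_partitions=20, k=5, sample_frac=0.5, seed=0):
    """Return an (n_objects, n_partitions) label matrix from k-means on subsamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = -np.ones((n, n_partitions), dtype=int)   # -1 marks "not sampled"
    for b in range(n_partitions):
        idx = rng.choice(n, size=int(sample_frac * n), replace=False)
        km = KMeans(n_clusters=k, n_init=5, random_state=b).fit(X[idx])
        labels[idx, b] = km.labels_
    return labels
```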

Consensus Functions
- Co-association matrix (Fred & Jain 2002): similarity between two patterns is estimated by counting the number of shared clusters; single-link with maximum lifetime is used to find the consensus partition
- Hypergraph methods (Strehl & Ghosh 2002): clusters in different partitions are represented by hyperedges; the consensus partition is found by a k-way min-cut of the hypergraph
- Re-labeling and voting (Fridlyand & Dudoit 2001): if the label correspondence problem is solved for the given partitions, a simple voting procedure can be used to assign objects to clusters
- Mutual information (Topchy, Jain & Punch 2003): maximize the mutual information between the individual partitions and the target consensus partition
- Finite mixture model (Topchy, Jain & Punch 2003): maximum likelihood solution to a latent class analysis problem in the space of cluster labels via EM

Consensus functions: Co-association approach (Fred 2001, Fred & Jain 2002)
- Similarity between objects can be estimated by the number of clusters shared by two objects across all the partitions of an ensemble
- This similarity definition expresses the strength of co-association of objects by a matrix containing the values
  $S(x_i, x_j) = \frac{1}{N} \sum_{k=1}^{N} I\big(p_k(x_i) = p_k(x_j)\big)$
- One can then apply any of numerous similarity-based clustering algorithms to the matrix of co-association values
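A minimal sketch of the co-association computation, assuming the label-matrix representation from the ensemble-generation sketch above (with -1 marking objects missing from a subsample):

```python
import numpy as np

def co_association(labels):
    """S[i, j] = fraction of partitions in which objects i and j share a cluster."""
    n, H = labels.shape
    shared = np.zeros((n, n))
    seen = np.zeros((n, n))
    for h in range(H):
        lab = labels[:, h]
        valid = lab >= 0
        both = valid[:, None] & valid[None, :]              # pairs present in this partition
        shared += (lab[:, None] == lab[None, :]) & both
        seen += both
    return shared / np.maximum(seen, 1)
```

A consensus partition can then be obtained by running, for example, single-link clustering on the distance matrix 1 - S.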

Consensus functions: Hypergraph approach (Strehl & Ghosh 2002)
- All the clusters in the ensemble partitions can be represented as hyperedges on a graph with N vertices
- Each hyperedge describes a set of objects belonging to the same cluster
- The consensus function is formulated as the solution to a k-way min-cut hypergraph partitioning problem; each connected component after the cut corresponds to a cluster in the consensus partition
- Hypergraph partitioning is NP-hard, but very efficient heuristics have been developed for it, with complexity proportional to the number of hyperedges, O(|E|)

Consensus functions: Mutual information (Topchy et al. 2003a)
- The mutual information (MI) between the empirical probability distribution of labels in the consensus partition and the labels in the ensemble is maximized
- Under the assumption of independence of the partitions, the mutual information can be written as a sum of pairwise MIs between the target partition and the given partitions
- An elegant solution can be obtained from a generalized definition of MI: quadratic MI can be effectively maximized by the k-means algorithm in a space of specially transformed cluster labels of the given ensemble
- The computational complexity of the algorithm is O(kNH)
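A sketch of the label-transformation idea as we read it (not the authors' code): one-hot encode each partition's labels, center the encodings, and run k-means on the concatenated features.

```python
import numpy as np
from sklearn.cluster import KMeans

def qmi_consensus(labels, n_consensus, seed=0):
    """k-means in a space of transformed cluster labels (sketch of the QMI idea)."""
    n, H = labels.shape
    blocks = []
    for h in range(H):
        lab = labels[:, h]
        valid = lab >= 0                                  # -1 marks "not sampled"
        one_hot = np.zeros((n, lab.max() + 1))
        one_hot[np.arange(n)[valid], lab[valid]] = 1.0
        blocks.append(one_hot - one_hot.mean(axis=0))     # center each label indicator
    Y = np.hstack(blocks)
    return KMeans(n_clusters=n_consensus, n_init=10, random_state=seed).fit_predict(Y)
```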

Consensus functions: Finite Mixture model (Topchy et al. 2003b)
The main assumption is that the label vectors yi (the ensemble labels of object i) are modeled as random variables drawn from a probability distribution described as a mixture of multivariate multinomial component densities:
$P(y_i \mid \Theta) = \sum_{m=1}^{M} \alpha_m P_m(y_i \mid \theta_m), \qquad P_m(y_i \mid \theta_m) = \prod_{j=1}^{H} P_m^{(j)}(y_{ij} \mid \theta_m^{(j)})$

Consensus functions: Finite Mixture model (continued)
- The objective of consensus clustering is formulated as a maximum likelihood estimation problem
- To find the best-fitting mixture density for the given data Y, we maximize the log-likelihood function with respect to the unknown parameters $\Theta$:
  $\log L(\Theta \mid Y) = \log \prod_{i=1}^{n} P(y_i \mid \Theta) = \sum_{i=1}^{n} \log \sum_{m=1}^{M} \alpha_m P_m(y_i \mid \theta_m)$
- The EM algorithm is used to solve this maximum likelihood problem
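A compact EM sketch for this mixture of multinomials in the space of cluster labels (our own illustration under the stated assumptions; labels are assumed to be 0-based integers with no missing values):

```python
import numpy as np

def mixture_consensus(labels, n_consensus, n_iter=100, seed=0):
    """EM for a mixture of multinomials over an (n_objects, H) label matrix."""
    rng = np.random.default_rng(seed)
    n, H = labels.shape
    # One-hot encode each partition's labels (assumed 0-based, no missing values).
    encoded = [np.eye(labels[:, h].max() + 1)[labels[:, h]] for h in range(H)]
    alpha = np.full(n_consensus, 1.0 / n_consensus)                  # mixture weights
    theta = [rng.dirichlet(np.ones(e.shape[1]), size=n_consensus)    # multinomial params
             for e in encoded]
    for _ in range(n_iter):
        # E-step: responsibilities r[i, m] proportional to alpha_m * prod_h theta_h[m, y_ih]
        log_r = np.tile(np.log(alpha), (n, 1))
        for e, t in zip(encoded, theta):
            log_r += e @ np.log(t).T
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights and multinomial parameters.
        alpha = r.mean(axis=0)
        theta = []
        for e in encoded:
            counts = r.T @ e + 1e-6            # small smoothing avoids log(0)
            theta.append(counts / counts.sum(axis=1, keepdims=True))
    return r.argmax(axis=1)                    # consensus cluster labels
```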

Consensus functions: Re-labeling and Voting (Fridlyand & Dudoit 2001)
- If the label correspondence problem is solved for all given partitions, a simple voting procedure can be used to assign objects to clusters
- However, label correspondence is exactly what makes unsupervised combination difficult
- A heuristic approximation to consistent labeling is possible: all the partitions in the ensemble are re-labeled according to their best agreement with some chosen reference partition
- The reference partition can be taken from the ensemble, or from a new clustering of the dataset (see the sketch below)
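A sketch of the re-labeling step using the Hungarian algorithm on the contingency table, followed by majority voting (our own illustration; it assumes every partition and the reference use 0-based labels over the same number of clusters, with no missing objects):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel(partition, reference):
    """Permute the labels of `partition` to best agree with `reference`."""
    k = int(max(partition.max(), reference.max())) + 1
    contingency = np.zeros((k, k))
    for a, b in zip(partition, reference):
        contingency[a, b] += 1
    rows, cols = linear_sum_assignment(-contingency)    # maximize total agreement
    mapping = dict(zip(rows, cols))
    return np.array([mapping[a] for a in partition])

def vote(ensemble, reference):
    """Majority vote over re-labeled partitions; ensemble is (n_objects, H)."""
    relabeled = np.stack([relabel(ensemble[:, h], reference)
                          for h in range(ensemble.shape[1])], axis=1)
    return np.array([np.bincount(row).argmax() for row in relabeled])
```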

Subsampling results on Galaxy/Star