A Cumulative Voting Consensus Method for Partitions with a Variable Number of Clusters
Hanan G. Ayad and Mohamed S. Kamel, ECE Department, University of Waterloo, Ontario, Canada
Presented by Hamdi JENZRI, MRL Seminar

Outline
- Introduction
- Consensus Clustering
- Review of consensus methods
- Contributions of the authors
- Theoretical formulation
- Algorithms
- Experimental results
- Conclusion

Introduction
- Cluster analysis: the discovery of a meaningful grouping of a set of data objects by finding a partition that optimizes an objective function.
- The number of ways of partitioning a set of n objects into k non-empty clusters is the Stirling number of the second kind S(n, k), which is of the order of k^n / k!.
- A well-known dilemma in data clustering is the multitude of models for the same set of data objects:
  - different algorithms,
  - different distance measures,
  - different features for characterizing the objects,
  - different scales (the number of clusters), ...
- This issue led to a large body of research addressing the comparison and consensus of data clusterings.
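To make the combinatorial explosion concrete, here is a minimal Python sketch (my own illustration, not from the slides) comparing the exact Stirling number of the second kind with the k^n / k! approximation:

```python
from math import factorial

def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind via the standard recurrence
    S(n, k) = k * S(n-1, k) + S(n-1, k-1), with S(0, 0) = 1."""
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]

n, k = 25, 4   # illustrative sizes
print("exact number of 4-cluster partitions of 25 objects:", stirling2(n, k))
print("k^n / k! approximation:                            ", k**n // factorial(k))
```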

Consensus Clustering
- Consensus clustering (cluster ensembles): finding a consensus partition that summarizes an ensemble of b partitions in some meaningful sense.
- [Diagram: the data set is clustered by Method 1, Method 2, ..., Method b, and the resulting b partitions are combined into a consensus partition.]

Objectives of Consensus Clustering
- Improving clustering accuracy over a single data clustering,
- Allowing the discovery of arbitrarily shaped cluster structures,
- Reducing the instability of a clustering algorithm due to noise, outliers, or randomized algorithms,
- Reusing preexisting clusterings (knowledge reuse),
- Exploring random feature subspaces or random projections for high-dimensional data,
- Exploiting weak clusterings, such as splitting the data with random hyperplanes,
- Estimating confidence in cluster assignments for individual observations,
- Clustering in distributed environments, including feature- or object-distributed clustering.

Consensus Approaches
- Axiomatic: concerned with deriving possibility / impossibility theorems on the existence and uniqueness of consensus partitions satisfying certain conditions.
- Constructive: specifies rules for constructing a consensus, such as the Pareto rule, also known as the strict consensus rule, whereby two objects occur together in the consensus if and only if they occur together in all the individual partitions.
- Combinatorial optimization: considers an objective function J measuring the remoteness of a partition from the ensemble of partitions, and searches the set of all possible partitions of the data objects for a partition that minimizes J. This approach is related to the notion of a central value in statistics.

Challenge in Consensus Clustering
- The cluster labels returned by clustering algorithms are purely symbolic.
- Example: partition a 7-object data set into k = 3 clusters, where the first 3 objects belong to a first cluster, the following 2 objects to a second cluster, and the remaining 2 objects to a third cluster.
- The vector representation of this clustering can be [1 1 1 2 2 3 3], [2 2 2 3 3 1 1], [3 3 3 1 1 2 2], or any of the k! = 3! = 6 possible label permutations.
- The matrix representation of the same partition can be
  [1 1 1 0 0 0 0]
  [0 0 0 1 1 0 0]
  [0 0 0 0 0 1 1]
  or any of the 3! = 6 possible permutations of the rows.
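A small Python sketch of this label ambiguity (the helper names are mine): two labelings describe the same partition exactly when their co-membership matrices, which are invariant under label permutation, coincide.

```python
import numpy as np

def comembership(labels) -> np.ndarray:
    """n x n binary matrix: entry (i, j) is 1 iff objects i and j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

def same_partition(a, b) -> bool:
    """True iff the two labelings describe the same grouping, regardless of label names."""
    return np.array_equal(comembership(a), comembership(b))

y1 = [1, 1, 1, 2, 2, 3, 3]
y2 = [3, 3, 3, 1, 1, 2, 2]      # same partition, permuted labels
print(same_partition(y1, y2))   # True
```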

Review of Consensus Methods
Authors: Strehl and Ghosh
Algorithms:
- Cluster-based Similarity Partitioning Algorithm (CSPA): quadratic in n
- Hyper Graph Partitioning Algorithm (HGPA): linear in n
- Meta CLustering Algorithm (MCLA)
Description: the optimal consensus is the partition that shares the most information with the ensemble of partitions, as measured by the Average Normalized Mutual Information (ANMI).
Disadvantage: these methods seek balanced-size clusters, making them unsuitable for data with highly unbalanced clusters.

Review of Consensus Methods
Authors: Fred and Jain
Algorithm: Evidence Accumulation Clustering (EAC)
Description: the cluster ensemble is mapped to a co-association matrix, whose entries can be interpreted as votes (or vote ratios) on the pairwise co-occurrences of objects; each entry counts the number of times a pair of objects co-occurs in the same cluster of a base clustering (relative to the total number of base clusterings). The final consensus clustering is extracted by applying linkage-based clustering algorithms to the co-association matrix.
Computational complexity: quadratic in n.
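The EAC co-association step can be sketched in a few lines of NumPy (helper name and toy label vectors are mine, for illustration only):

```python
import numpy as np

def coassociation(ensemble) -> np.ndarray:
    """Fraction of base clusterings in which each pair of objects shares a cluster."""
    ensemble = [np.asarray(y) for y in ensemble]
    n = len(ensemble[0])
    co = np.zeros((n, n))
    for y in ensemble:
        co += (y[:, None] == y[None, :])   # co-membership of this base clustering
    return co / len(ensemble)

ensemble = [np.array([0, 0, 1, 1, 2, 2]),
            np.array([1, 1, 1, 0, 0, 0]),
            np.array([0, 0, 1, 1, 1, 2])]
print(coassociation(ensemble))
```

A linkage-based algorithm, e.g., average linkage applied to 1 - co, would then extract the final consensus partition.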

Contributions of the Authors
- They introduce a new solution for the problem of aligning the cluster labels of a given clustering with k_i clusters with respect to a reference clustering with k_o clusters, through what they call "cumulative voting".
  - Plurality voting (winner takes all) allows each voter to vote for one option, and the option that receives the most votes wins.
  - Cumulative voting is a rated voting scheme, where each voter gives a numeric value (a rating) to each option such that the voter's ratings add up to a certain total (for example, a number of points). Cumulative voting is sometimes referred to as weighted voting.
- As proposed in this paper, cumulative voting maps an input k_i-partition into a probabilistic representation as a k_o-partition whose cluster labels correspond to the labels of the reference k_o clusters.
- They formulate the selection criterion for the reference clustering based on maximum information content, as measured by the entropy.

Contributions of the Authors
- They explore different cumulative voting models:
  - with a fixed reference partition:
    - un-normalized weighting,
    - normalized weighting;
  - with an adaptive reference partition, where the reference partition is incrementally updated so as to relax the dependence on the selected reference; these updates are performed in decreasing order of entropy so as to smooth the updates of the reference partition.
- Based on the proposed cumulative vote mapping, they define the criterion for obtaining a first summary of the ensemble as the minimum average squared distance between the mapped partitions and the optimal representation of the ensemble.

Contributions of the Authors
- Finally, they formulate the problem of extracting the optimal consensus partition as that of finding a compressed summary of the estimated distribution that preserves the maximum relevant information.
- They relate the problem to the Information Bottleneck (IB) method of Tishby et al. and propose an efficient solution using an agglomerative algorithm, similar to the agglomerative IB algorithm of Slonim and Tishby, that minimizes the within-cluster Jensen-Shannon (JS) divergence.
- They demonstrate the effectiveness of the proposed consensus algorithms on ensembles of K-Means clusterings with a randomly selected number of clusters, where the goal is to enable the discovery of arbitrary cluster structures.

Architecture of the Proposed Method
[Block diagram: selection of the reference partition (maximum entropy) feeds the cumulative vote mapping (un-normalized, normalized, or adaptive vote weighting scheme), which produces a first summary of the partitions (minimum average squared distance); the optimal consensus is then found by minimizing the Jensen-Shannon (JS) divergence.]

Theoretical Formulation
- Let X denote a set of n data objects x_j, where x_j ∈ R^d.
- A partition of X into k clusters is represented as an n-dimensional cluster labeling vector y ∈ C^n, where C = {c_1, ..., c_k}.
- Alternatively, the partition can be represented as a k x n matrix U with a row for each cluster and a column for each object x_j.
- Hard partition: U is a binary stochastic matrix, where each entry u_lj ∈ {0, 1} and each column sums to one (Σ_l u_lj = 1).
- Soft partition: let C denote a categorical random variable defined over the set of cluster labels C. A stochastic partition corresponds to a probabilistic clustering and is defined as a partition where each observation is assigned to a cluster with an estimated posterior probability p(c_l | x_j).
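A minimal NumPy sketch of the two equivalent representations (label vector vs. binary stochastic matrix); the helper name is mine:

```python
import numpy as np

def labels_to_matrix(y, k: int) -> np.ndarray:
    """k x n binary stochastic matrix U: U[l, j] = 1 iff object j is in cluster l.
    Assumes cluster labels are integers 0..k-1."""
    y = np.asarray(y)
    U = np.zeros((k, len(y)), dtype=int)
    U[y, np.arange(len(y))] = 1
    return U

y = np.array([0, 0, 0, 1, 1, 2, 2])   # 7 objects, 3 clusters
U = labels_to_matrix(y, k=3)
print(U)
print(U.sum(axis=0))                  # every column sums to 1 (hard partition)
```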

Theoretical Formulation
- Consider as input an ensemble of b partitions with a variable number of clusters, U = {U_1, ..., U_b}, such that each partition U_i is a k_i x n binary stochastic matrix (hard partitions).
- They use a center-based clustering algorithm, namely K-Means (a spherical variant for text data), to generate the cluster ensembles.
- The number of clusters of each individual partition is randomly selected within some range k_i ∈ [k_min, k_max].
- They address the problem of estimating a consensus partition of the data objects X that optimally represents the ensemble U.
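A sketch of the ensemble generation with scikit-learn's KMeans and a random number of clusters per base clustering; the data set, k_min, and k_max values here are illustrative assumptions (only the default b = 25 comes from the paper):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # toy data

b, k_min, k_max = 25, 3, 10
ensemble = []
for i in range(b):
    k_i = int(rng.integers(k_min, k_max + 1))          # random number of clusters
    labels = KMeans(n_clusters=k_i, n_init=10, random_state=i).fit_predict(X)
    ensemble.append(labels)                            # one base clustering per run
```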

Selection of the Reference Partition
- Consider the random variable C_i defined over the cluster labels of the i-th clustering, with probability distribution p(c_l^i) = n_l^i / n, where n_l^i is the number of objects assigned to cluster c_l^i.
- The Shannon entropy, a measure of the average information content (uncertainty) associated with a random outcome, is a function of its distribution: H(C_i) = -Σ_l p(c_l^i) log p(c_l^i).
- The higher the entropy, the more informative the distribution.
- Thus, the partition U_i ∈ U with the highest entropy represents the cluster label distribution with the maximum information content: U_o = argmax_{U_i ∈ U} H(C_i).
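The maximum-entropy reference selection can be sketched as follows (assuming the ensemble is given as a list of label vectors, as in the previous sketch; helper names are mine):

```python
import numpy as np

def label_entropy(y) -> float:
    """Shannon entropy of the cluster-size distribution of a labeling."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def select_reference(ensemble) -> int:
    """Index of the partition whose label distribution has maximum entropy."""
    return int(np.argmax([label_entropy(y) for y in ensemble]))

ref_idx = select_reference(ensemble)   # `ensemble` from the previous sketch
```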

Cumulative Vote Mapping
- Consider a reference partition U_o and a partition U_i ∈ U, with k_o and k_i clusters respectively, and associated random variables C_o and C_i with estimated probability distributions p(c^o) and p(c^i).
- U_i is designated as the voting partition with respect to U_o.
- Each cluster c_l^i is viewed as a voter that votes for each of the reference clusters c_q^o, with a weight denoted w_ql^{o,i}.
- The weights are represented in a k_o x k_i cumulative vote weight matrix, denoted W^{o,i}.
- Two weighting schemes are considered: un-normalized and normalized.

Un-normalized Weighting Scheme
- It is connected to the co-association matrix commonly used for summarizing a set of partitions.
- Let u_q^o and u_l^i be the q-th and l-th row vectors of U_o and U_i, respectively.
- The weight w_ql^{o,i} = u_q^o · u_l^i represents the number of objects belonging to both clusters c_q^o and c_l^i.
- The binary k_i row vectors of U_i are transformed into k_o frequency vectors represented in the mapped matrix U^{o,i} = W^{o,i} U_i.
- Members of cluster c_l^i are scaled by w_ql^{o,i} when mapped as members of cluster c_q^o.

Un-normalized Weighting Scheme
- Each entry u_qj^{o,i} of the k_o x n matrix U^{o,i} is taken as an assignment frequency of object x_j to cluster c_q^o, where u_qj^{o,i} = Σ_l w_ql^{o,i} u_lj^i.
- Example: [numerical example from the slide not reproduced in the transcript]
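A minimal NumPy sketch of the un-normalized cumulative vote mapping on the 7-object toy example from earlier (the matrices and names are my own illustration):

```python
import numpy as np

# Reference partition U_o (3 clusters) and voting partition U_i (2 clusters),
# both over the same 7 objects, as binary stochastic matrices.
U_o = np.array([[1, 1, 1, 0, 0, 0, 0],
                [0, 0, 0, 1, 1, 0, 0],
                [0, 0, 0, 0, 0, 1, 1]])
U_i = np.array([[1, 1, 0, 0, 0, 0, 0],
                [0, 0, 1, 1, 1, 1, 1]])

W = U_o @ U_i.T      # k_o x k_i weight matrix: co-occurrence counts
U_oi = W @ U_i       # k_o x n mapped partition (assignment frequencies)
print(W)
print(U_oi)
```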

Normalized Weighting Scheme
- The weights are normalized to sum to 1.
- The weight w_ql^{o,i} is computed as the average, over the data objects assigned to cluster c_l^i, of the conditional probabilities of cluster c_q^o given each such object; it is taken as an estimate of the conditional probability p(c_q^o | c_l^i). When the reference partition is represented as a binary stochastic matrix, this reduces to w_ql^{o,i} = (1 / n_l^i) u_q^o · u_l^i.
- Each of the k_i columns of W^{o,i} is a probability vector.

Normalized Weighting Scheme
- Each partition U_i is mapped using U^{o,i} = W^{o,i} U_i.
- Each entry u_qj^{o,i} of U^{o,i} is considered as an estimate of the posterior probability p(c_q^o | x_j).
- U^{o,i} is a stochastic partition representing the distributions p(c^o | x_j), for j = 1, ..., n.
- Consider the random variable C^{o,i} defined over the labels of the mapped partition U^{o,i}; the estimated priors are p(c_q^{o,i}) = (1/n) Σ_j u_qj^{o,i}.

Normalized Weighting Scheme
- This gives p(c_q^{o,i}) = Σ_l w_ql^{o,i} p(c_l^i) = p(c_q^o), i.e., the mapped partition has the same prior distribution as the reference partition.
- This ensures the entropy-preserving property H(C^{o,i}) = H(C_o).
- The normalization scheme reflects the intuition that objects belonging to a large cluster are less strongly associated with each other than objects belonging to a small cluster.

Normalized Weighting Scheme
- Example: [numerical example from the slide not reproduced in the transcript]
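Since the slide's numerical example is not reproduced, here is a sketch of the normalized scheme on the same toy matrices as above: each column of the weight matrix is divided by the size of the corresponding voting cluster, so it becomes a probability vector.

```python
import numpy as np

U_o = np.array([[1, 1, 1, 0, 0, 0, 0],
                [0, 0, 0, 1, 1, 0, 0],
                [0, 0, 0, 0, 0, 1, 1]])
U_i = np.array([[1, 1, 0, 0, 0, 0, 0],
                [0, 0, 1, 1, 1, 1, 1]])

cluster_sizes = U_i.sum(axis=1)        # n_l^i for each voting cluster
W = (U_o @ U_i.T) / cluster_sizes      # each column of W now sums to 1
U_oi = W @ U_i                         # k_o x n stochastic partition
print(W.sum(axis=0))                   # [1. 1.]
print(U_oi.sum(axis=0))                # each column sums to 1
```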

Average-Squared-Distance Criterion for Mapped Partitions
- The criterion for finding a stochastic partition Û of X that summarizes the set of b partitions is the minimum average distance between the mapped ensemble partitions and the consensus: Û = argmin_Ũ (1/b) Σ_{i=1}^{b} h(U^{o,i}, Ũ).
- Here U^{o,i} represents the mapping of partition U_i into a stochastic partition, defined with respect to the reference partition U_o.
- The dissimilarity function h(·,·) is defined as the average squared distance between the corresponding probability (column) vectors: h(U^{o,i}, Ũ) = (1/n) Σ_{j=1}^{n} || u_j^{o,i} - ũ_j ||^2.

Average-Squared-Distance Criterion for Mapped Partitions
- This minimization problem is solved directly by computing Û as the average of the mapped partitions, Û = (1/b) Σ_{i=1}^{b} U^{o,i}, i.e., each column û_j is the average of the probability vectors u_j^{o,i}.
- Note that, thanks to the cumulative vote mapping, the number of clusters of Û is preset through the selected reference partition (k_o), regardless of the number of clusters k_i of each individual partition.
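Putting the mapping and the averaging together, a sketch of the first summary Û (helper names are mine; assumes binary stochastic matrices as input):

```python
import numpy as np

def normalized_vote_mapping(U_o: np.ndarray, U_i: np.ndarray) -> np.ndarray:
    """Normalized cumulative vote mapping of the voting partition U_i onto U_o."""
    W = (U_o @ U_i.T) / U_i.sum(axis=1)   # columns of W sum to 1
    return W @ U_i                        # k_o x n stochastic partition

def ensemble_summary(U_o: np.ndarray, binary_partitions) -> np.ndarray:
    """First summary U_hat: average of the mapped ensemble partitions."""
    mapped = [normalized_vote_mapping(U_o, U_i) for U_i in binary_partitions]
    return np.mean(mapped, axis=0)        # k_o x n stochastic partition U_hat
```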

Algorithms
- Un-normalized Reference-based Cumulative Voting (URCV): un-normalized vote weighting with a fixed maximum-entropy reference partition. [Pseudocode shown on the original slide.]

Algorithms
- Reference-based Cumulative Voting (RCV): normalized vote weighting with a fixed maximum-entropy reference partition. [Pseudocode shown on the original slide.]

Algorithms
- Adaptive Cumulative Voting (ACV): normalized vote weighting with an adaptive reference partition that is incrementally updated, processing the ensemble partitions in decreasing order of entropy. [Pseudocode shown on the original slide.]
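Since the pseudocode figures are not reproduced in the transcript, the following is my own end-to-end reconstruction of an RCV-style procedure from the steps described above (maximum-entropy reference, normalized mapping, averaging); it is a sketch, not the authors' exact algorithm:

```python
import numpy as np

def labels_to_matrix(y, k):
    # assumes cluster labels are integers 0..k-1 (true for scikit-learn's KMeans)
    U = np.zeros((k, len(y)), dtype=float)
    U[np.asarray(y), np.arange(len(y))] = 1.0
    return U

def rcv_summary(ensemble_labels):
    """RCV-style sketch: k_o x n stochastic summary obtained by mapping every
    partition onto the maximum-entropy reference and averaging."""
    # 1. Select the reference partition by maximum label entropy.
    def entropy(y):
        _, c = np.unique(y, return_counts=True)
        p = c / c.sum()
        return -(p * np.log(p)).sum()
    ref = max(ensemble_labels, key=entropy)
    U_o = labels_to_matrix(ref, len(np.unique(ref)))

    # 2. Map every partition onto the reference with normalized weights, then average.
    mapped = []
    for y in ensemble_labels:
        U_i = labels_to_matrix(y, len(np.unique(y)))
        W = (U_o @ U_i.T) / U_i.sum(axis=1)
        mapped.append(W @ U_i)
    return np.mean(mapped, axis=0)
```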

Finding the Optimal Consensus
- The above summarization of the ensemble does not always capture the most "meaningful" or "relevant" information, given the arbitrary numbers of clusters k_i.
- The problem of extracting a compressed representation of stochastic data that captures only the relevant or meaningful information was addressed by the Information Bottleneck (IB) method of Tishby et al.
- The IB method addresses a trade-off between compressing the representation and preserving meaningful information.
- In this paper, they formulate a subsequent problem as that of finding an efficient representation of the random variable C, described by a random variable Z, that preserves the maximum amount of relevant information about X, based on the distribution estimated from Û.

Finding the Optimal Consensus
- The proposed solution is approximately equivalent to the agglomerative IB (AIB) algorithm but requires less computational time.
- The k_o clusters are mapped to a k_o x k_o JS divergence matrix.
- A distance-based clustering algorithm is then applied.
- The agglomerative group-average algorithm is chosen because it minimizes the average pairwise distance between members of the merged clusters, as given by its objective function d(S_1, S_2) = (1 / (|S_1| |S_2|)) Σ_{p ∈ S_1} Σ_{q ∈ S_2} D_JS(p, q), where S_1 and S_2 denote a pair of clusters of distributions, with cardinalities |S_1| and |S_2|, respectively.

Finding the Optimal Consensus
- The JS divergence is the entropy of the weighted average distribution minus the weighted average of the entropies of the individual distributions p_1 and p_2: D_JS(p_1, p_2) = H(π_1 p_1 + π_2 p_2) - (π_1 H(p_1) + π_2 H(p_2)), where π_1 and π_2 are the weights.
- It is symmetric, bounded, nonnegative, and equal to zero when p_1 = p_2.
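A minimal sketch of the JS divergence and of the k_o x k_o divergence matrix used by the agglomerative step; it assumes NumPy and natural-log entropies, and for simplicity uses equal weights π_1 = π_2 = 0.5 rather than the general weights of the definition above:

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]                       # convention: 0 * log 0 = 0
    return float(-(p * np.log(p)).sum())

def js_divergence(p, q, w1: float = 0.5, w2: float = 0.5) -> float:
    """Entropy of the weighted average minus the weighted average of entropies."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return entropy(w1 * p + w2 * q) - (w1 * entropy(p) + w2 * entropy(q))

def js_matrix(P: np.ndarray) -> np.ndarray:
    """k_o x k_o JS divergence matrix for the rows of P (each row a distribution)."""
    k = P.shape[0]
    D = np.zeros((k, k))
    for a in range(k):
        for b in range(a + 1, k):
            D[a, b] = D[b, a] = js_divergence(P[a], P[b])
    return D
```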

Finding the Optimal Consensus
- When a k-partition {S_1, ..., S_k} of the reference clusters is obtained, the consensus clustering Z_k is described by estimates of the prior probabilities of the consensus clusters, p(z_t) = Σ_{c_q ∈ S_t} p(c_q), and of the posterior probabilities, p(z_t | x_j) = Σ_{c_q ∈ S_t} p(c_q | x_j), computed from Û.

Finding the Optimal Consensus
[Figure from the slide not reproduced in the transcript.]

Case of Identical Partitions
- An essential property of consensus clustering algorithms is that when all individual partitions represent a perfect consensus, that is, they are identical up to cluster label permutations, the consensus solution should be that same partition.
- The algorithms presented in this paper satisfy this property.
- In the normalized weighting scheme, W^{o,i} becomes the identity matrix I.
- In the un-normalized weighting scheme, W^{o,i} is a diagonal matrix whose q-th diagonal element equals the size of cluster c_q^o; each nonzero entry of U^{o,i} is equal to the size of the corresponding cluster. After averaging, we recover exactly the partition we started with.

Experimental Setup
- They compare the performances of the URCV, RCV, and ACV algorithms with several recent consensus algorithms and with the average quality of the ensemble partitions.
- External validation: Adjusted Rand Index (ARI) with respect to the true clustering.
  - They report the distribution of the ARI using boxplots over r = 20 runs for small data sets and r = 5 runs for large data sets (n ≥ 1000).
- Internal validation: they measure the Normalized Mutual Information (NMI) between every pair of consensus clusterings over multiple runs of the proposed consensus algorithms (inter-consensus NMI).
  - This assesses the stability of the consensus clustering over multiple runs.
  - Variations in the consensus clustering across runs are due to the ensemble partitions being generated with a random number of clusters and with different random seeds for the K-Means algorithm.
- Default value: b = 25.
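Both validation measures are available in scikit-learn; a minimal usage sketch with hypothetical label vectors:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true      = [0, 0, 0, 1, 1, 2, 2]   # ground-truth labels (external validation)
y_consensus = [1, 1, 1, 0, 0, 2, 2]   # one consensus clustering
y_other_run = [2, 2, 2, 0, 0, 1, 1]   # consensus from another run (internal validation)

print("ARI vs. truth:      ", adjusted_rand_score(y_true, y_consensus))
print("inter-consensus NMI:", normalized_mutual_info_score(y_consensus, y_other_run))
```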

Experimental Setup
[Slide content (figure/table) not reproduced in the transcript.]

Algorithms Compared Against
- The binary one-to-one voting algorithm of Dimitriadou et al. (BVA)

Data Sets
[Table of data sets from the slide not reproduced in the transcript.]

Experimental Results
[Result figures (boxplots) from the slides not reproduced in the transcript.]

Computational Complexity

Algorithm                       | Complexity
--------------------------------|-------------
Co-association based algorithms | O(n^2 b)
CSPA                            | O(n^2 k b)
MCLA                            | O(n k^2 b^2)
HGPA                            | O(n k b)
QMI                             | O(n k b)
URCV, RCV, ACV                  | O(n k_o^2 b)

Conclusion
- Cumulative voting maps an input k_i-partition to a reference k_o-partition with probabilistic assignments:
  - un-normalized weighting,
  - normalized weighting.
- The reference partition is chosen as the one with maximum entropy.
- A minimum average squared distance criterion relates the mapped ensemble partitions to the summarizing stochastic partition Û.
- The optimal consensus partition is extracted from Û by minimizing the within-cluster JS divergence.
- Overall, the proposed methods performed better than the existing consensus methods, with lower complexity.