Consensus Partition
Liang Zheng, 5.21

Outline
- Introduction
- Problem formulation
- Optimization method
- Experiment study
- Conclusion

Introduction
W is a set of properties. It has a public alignment, e.g. an equivalence relation R on W; an equivalence relation can also be represented by a partition.
At some moments, a user might:
- see a list of a subset of W
- align elements of W (move/remove an item to/from a partition)
The problem:
- How to preserve the personal alignment of each user?
- How to improve (optimize) the public alignment according to the users' alignments?

Introduction: notations and definitions
- V = {v1, v2, ..., vn} is the set of objects (1 ≤ i ≤ n).
- P = {P1, P2, ..., Pm} is a set of partitions (1 ≤ i ≤ m), where each Pi = {Ci,1, Ci,2, ..., Ci,d} is a partition of V into d clusters; Ci,j is the jth cluster of the ith partition.
- Pi(v) denotes the label of the cluster to which object v belongs, i.e. Pi(v) = j iff v ∈ Ci,j.
- PV is the set of all possible partitions of V (each Pi ∈ PV).
- P* ∈ PV is the consensus partition, the one that best represents the properties of every partition in P.
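As a running representation for the sketches below, a partition can be stored either as a list of clusters or as the label map Pi(v). A minimal Python helper (the function name is mine, not from the slides):

def to_labels(clusters):
    # Pi(v) = j iff v belongs to cluster Ci,j (clusters are 0-indexed here).
    return {v: j for j, cluster in enumerate(clusters) for v in cluster}

P1 = [{1, 2}, {4, 5, 6, 7}, {3, 8, 9}]
labels = to_labels(P1)
print(labels[6])  # object 6 is in the second cluster -> 1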

Problem statement: clustering aggregation (cluster ensemble)
Input: m partitions P = {P1, P2, ..., Pm} over n items.
Output: a consensus partition P* that
- minimizes the total number of disagreements with the m partitions, or equivalently
- maximizes the similarity with the m partitions.

Problem formulation
What makes a good consensus partition P*? The distance between P* and the input partitions.
Partition distance (1) (Gusfield, 2002)
- Two partitions P1 and P2 of V are identical if and only if every cluster in P1 is a cluster in P2 (the converse is then forced).
- The partition distance d(P1, P2) is the minimum number of elements that must be deleted from V so that the two induced partitions (P1 and P2 restricted to the remaining elements) are identical.
- The problem of computing the partition distance can be cast naturally as a node-cover problem on a graph derived from the partitions.

Example: V = {1,2,3,4,5,6,7,8,9}; P = {P1, P2};
P1 = {{1,2}, {4,5,6,7}, {3,8,9}}; P2 = {{1,2,4,5}, {8,9}, {3,6,7}}
d(P1, P2) = N(G(P1, P2)), the size of a minimum node cover of the graph derived from the two partitions. Here d(P1, P2) = 3: delete objects 3, 4, and 5 and the two restricted partitions coincide.
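Gusfield (2002) also shows the distance equals n minus the maximum total overlap of a one-to-one matching between the clusters of the two partitions, so it can be solved as an assignment problem. A minimal sketch on the slide's example, assuming scipy is available:

import numpy as np
from scipy.optimize import linear_sum_assignment

def partition_distance(P1, P2, n):
    # Overlap matrix: entry (i, j) = number of elements shared by
    # cluster i of P1 and cluster j of P2.
    overlap = np.array([[len(c1 & c2) for c2 in P2] for c1 in P1])
    rows, cols = linear_sum_assignment(-overlap)  # maximize total overlap
    return n - overlap[rows, cols].sum()

P1 = [{1, 2}, {4, 5, 6, 7}, {3, 8, 9}]
P2 = [{1, 2, 4, 5}, {8, 9}, {3, 6, 7}]
print(partition_distance(P1, P2, n=9))  # -> 3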

Problem formulation
Partition distance (2) (Gionis, 2005): the symmetric difference distance, or Mirkin distance, d(P1, P2).
Consider two objects u and v in V. The following simple 0/1 distance function checks whether partitions P1 and P2 agree on the clustering of u and v:
du,v(P1, P2) = 1 if exactly one of P1 and P2 places u and v in the same cluster, and 0 otherwise.
Summing over all pairs gives d(P1, P2) = Σu,v du,v(P1, P2).
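A minimal sketch of the pairwise check and the resulting distance, with partitions given as label maps (to_labels above); the function names are mine:

from itertools import combinations

def d_uv(P1, P2, u, v):
    # 1 iff exactly one of the two partitions puts u and v in the same cluster.
    return int((P1[u] == P1[v]) != (P2[u] == P2[v]))

def mirkin_distance(P1, P2):
    # Sum the 0/1 disagreements over all unordered pairs of objects.
    return sum(d_uv(P1, P2, u, v) for u, v in combinations(sorted(P1), 2))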

Problem formulation
Formal definition of clustering aggregation (CA): given a set of objects V and m partitions {P1, P2, ..., Pm} on V, find a consensus partition P* that minimizes Σi d(P*, Pi).
This optimization problem is NP-complete (Barthélemy et al., 1995), so we have to use heuristic algorithms.
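The CA objective is then just the sum of the distances from a candidate consensus to all input partitions; a two-line sketch reusing mirkin_distance from above:

def ca_cost(Pstar, partitions):
    # Total disagreement of a candidate consensus P* with all inputs.
    return sum(mirkin_distance(Pstar, Pi) for Pi in partitions)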

Problem formulation
Generalization of CA: correlation clustering (CC). Given a set of objects V and distances Xuv ∈ [0,1] for all pairs u, v ∈ V, find a partition P* of the objects in V that minimizes the score function
d(P*) = Σ{u,v in the same cluster} Xuv + Σ{u,v in different clusters} (1 − Xuv).
This optimization problem is also NP-complete.
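A sketch of the CC score for a candidate partition, assuming the distances are stored in a dict X keyed by unordered pairs (as built in the example that follows):

from itertools import combinations

def cc_cost(Pstar, X):
    cost = 0.0
    for u, v in combinations(sorted(Pstar), 2):
        x = X[frozenset((u, v))]
        # Pay x for co-clustering a distant pair, 1 - x for splitting a close one.
        cost += x if Pstar[u] == Pstar[v] else 1.0 - x
    return cost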

Example: V = {1,2,3,4,5,6}; P = {P1, P2, P3, P4} (n = 6, m = 4)
P1 = {{1,2}, {3,4}, {5}, {6}}; P2 = {{1,2,4}, {3,5}, {6}};
P3 = {{1,2,6}, {3,4}, {5}}; P4 = {{1,2,5}, {4,6}, {3}}
The objects co-occurrence matrix is M = MP1 + MP2 + MP3 + MP4, where muv counts in how many of the m partitions u and v share a cluster. The input distance matrix [Xuv] is then given by xuv = (m − muv)/m. For instance, objects 1 and 4 co-occur only in P2, so x14 = 3/4; objects 3 and 4 co-occur in P1 and P3, so x34 = 1/2.
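A sketch of this construction on the running example (to_labels as defined earlier):

from itertools import combinations

partitions = [to_labels(p) for p in [
    [{1, 2}, {3, 4}, {5}, {6}],
    [{1, 2, 4}, {3, 5}, {6}],
    [{1, 2, 6}, {3, 4}, {5}],
    [{1, 2, 5}, {4, 6}, {3}],
]]
m = len(partitions)                      # m = 4
V = sorted(partitions[0])                # objects 1..6
X = {}
for u, v in combinations(V, 2):
    m_uv = sum(P[u] == P[v] for P in partitions)   # co-occurrence count
    X[frozenset((u, v))] = (m - m_uv) / m

print(X[frozenset((1, 4))], X[frozenset((3, 4))])  # -> 0.75 0.5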

Optimization Method
Clique partitioning problem: find a partition of the objects into cliques having a maximum total weight. One natural choice of weights here is wuv = 2muv − m: maximizing the total weight of within-clique edges is then equivalent to minimizing the correlation-clustering cost above, up to a constant and a positive scaling factor.
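A sketch of that objective under the weights wuv = 2muv − m suggested above (a hedged reading, since the slide does not spell the weights out):

from itertools import combinations

def clique_partition_weight(Pstar, partitions):
    # Total weight w_uv = 2*m_uv - m summed over within-cluster pairs;
    # maximizing it matches minimizing cc_cost up to a constant and scale.
    m = len(partitions)
    return sum(2 * sum(P[u] == P[v] for P in partitions) - m
               for u, v in combinations(sorted(Pstar), 2)
               if Pstar[u] == Pstar[v])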

Optimization Method
Factor-2 approximation algorithm [Filkov04]: given an instance of CA, select the partition p ∈ {P1, P2, ..., Pm} that minimizes S = Σi d(p, Pi).
This algorithm is a factor-2 approximation for CA. Its time complexity is O(m2n): computing the distance between two partitions takes O(n), and there are O(m2) pairs.
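A sketch of the algorithm (mirkin_distance as above; note this naive version spends O(n2) per distance rather than the O(n) the slide assumes):

def best_of_input(partitions):
    # Return the input partition with the smallest total distance to the others.
    return min(partitions,
               key=lambda p: sum(mirkin_distance(p, Pi) for Pi in partitions))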

Optimization Method
Agglomerative clustering (bottom-up) [FrJa02, GMT07]
- Start by placing every node in a singleton cluster.
- If the average distance of the closest pair of clusters is less than 1/2, merge the two clusters into one.
- Stop when no two clusters have average distance smaller than 1/2.
e.g. {{1},{2},{3},{4},{5},{6}} → {{1,2},{3,4},{5},{6}} (see the sketch after this slide)
Divisive clustering (top-down) [GMT07]
- Start by placing all nodes in a single cluster.
- Find the pair of nodes that are furthest apart and place them in different clusters; these two nodes become the cluster centers, and each remaining node is assigned to the center that incurs the least cost.
- Repeat iteratively; at the end of each step, compute the cost of the new solution, and continue only if it is lower than that of the previous step.
e.g. {1,2,3,4,5,6} → {{1,2,6},{3,4,5}} → {{1,2},{6},{3,4},{5}}
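A sketch of the bottom-up variant over the distance dict X built earlier (the divisive variant is analogous):

def avg_dist(A, B, X):
    # Average pairwise distance between two clusters.
    return sum(X[frozenset((u, v))] for u in A for v in B) / (len(A) * len(B))

def agglomerative(V, X):
    clusters = [{v} for v in V]                      # start from singletons
    while len(clusters) > 1:
        pairs = [(A, B) for i, A in enumerate(clusters) for B in clusters[i + 1:]]
        A, B = min(pairs, key=lambda ab: avg_dist(ab[0], ab[1], X))
        if avg_dist(A, B, X) >= 0.5:                 # closest pair too far apart
            break
        clusters.remove(A); clusters.remove(B)
        clusters.append(A | B)                       # merge the closest pair
    return clusters

print(agglomerative(V, X))
# -> [{3}, {4}, {5}, {6}, {1, 2}] under a strict "< 1/2" rule
# (the pair {3},{4} sits exactly at distance 1/2 on this data)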

Optimization Method
LocalSearch [GMT07] // one-element move [Filkov04]
The algorithm goes through the nodes and considers placing each one into a different cluster, or creating a new singleton cluster for it. The node is placed in the cluster that yields the minimum cost. The process is iterated until no move can improve the cost.
LocalSearch can be used as a clustering algorithm on its own, but also as a post-processing step to improve an existing solution.
e.g. {{1,2},{3},{4},{5},{6}} → {{1,2},{3,4},{5},{6}}
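A sketch of the one-element move over label maps, using cc_cost as the objective (recomputing the full cost per move keeps the code short but is slower than an incremental version):

def local_search(Pstar, X):
    Pstar = dict(Pstar)
    improved = True
    while improved:
        improved = False
        for v in sorted(Pstar):
            best_label, best_cost = Pstar[v], cc_cost(Pstar, X)
            # Try every existing cluster plus a brand-new singleton label.
            for lab in set(Pstar.values()) | {max(Pstar.values()) + 1}:
                Pstar[v] = lab
                c = cc_cost(Pstar, X)
                if c < best_cost:
                    best_label, best_cost, improved = lab, c, True
            Pstar[v] = best_label
    return Pstar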

Optimization Method
The Fusion-Transfer (FT) method [Guénoche 2011]
- Fusion: a hierarchical ascending method. Starting from the atomic partition P0, at each step join the two classes that maximize the score value of the resulting partition.
- Transfer: a best one-element-move method. The weight of assigning each element to each class is computed, and elements are moved accordingly.
In short: bottom-up merging + LocalSearch (see the composition sketch below).
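Since FT is, in spirit, bottom-up merging followed by one-element moves, it can be sketched by composing the two earlier routines (this glosses over Guénoche's specific score function):

# Fusion (agglomerative merging), then Transfer (local moves).
consensus = local_search(to_labels(agglomerative(V, X)), X)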

Optimization Method
Relabeling + voting [DuFr03, DWH01]: find the correspondence between the labels of the input partitions, then fuse the clusters that received the same label by voting.
(The slide shows a worked table: each partition's clusters C1, C2, C3 are relabeled onto a common label set for objects v1, ..., v6, and a majority vote per object yields the consensus labels C*.)
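A sketch of relabel-then-vote, assuming every input partition has the same number of clusters k with labels 0..k−1 (as produced by to_labels). Labels are matched to the first partition by maximum overlap (an assignment problem), then each object takes its majority label:

import numpy as np
from collections import Counter
from scipy.optimize import linear_sum_assignment

def relabel(P, ref, k):
    # overlap[a][b] = number of objects with label a in P and label b in ref.
    overlap = np.zeros((k, k), dtype=int)
    for v in P:
        overlap[P[v], ref[v]] += 1
    rows, cols = linear_sum_assignment(-overlap)   # maximize label agreement
    mapping = dict(zip(rows, cols))
    return {v: mapping[P[v]] for v in P}

def vote(partitions, k):
    ref = partitions[0]
    relabeled = [relabel(P, ref, k) for P in partitions]
    # Majority label per object across the relabeled partitions.
    return {v: Counter(P[v] for P in relabeled).most_common(1)[0][0]
            for v in ref}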

Experiment study
- Datasets: synthetic + real
- Comparison of the average sum of distances, on synthetic data generated with parameters (n, k) plus noise
- Comparison of run time

Conclusion
- Two main approaches: objects co-occurrence and median partition
- Preserving the personal partition in Sview
Future work
- Compute a consensus partition based on the various methods
- Fuse personal partitions into the consensus partition

Thanks!