A Clustering Algorithm Based on Graph Connectivity
Balakrishna Thiagarajan, Computer Science and Engineering, State University of New York at Buffalo

Topics to be Covered
- Introduction
- Important Definitions in Graphs
- HCS Algorithm
- Properties of HCS Clustering
- Modified HCS Algorithm
- Key Features of HCS Algorithm
- Summary

Introduction Cluster analysis seeks a grouping of elements into subsets based on the similarity between pairs of elements. The goal is to find disjoint subsets, called clusters. Clusters should satisfy two criteria: homogeneity (elements in a cluster are highly similar to each other) and separation (elements from different clusters have low similarity to each other).

Introduction The process of generating the subsets is called clustering. Cluster analysis is a fundamental problem in experimental science, where observations must be classified into groups. Cluster analysis has applications in biology, medicine, economics, psychology, astrophysics and numerous other fields.

Introduction Cluster analysis is most widely used in the study of gene expression in microbiology. The approach presented here is graph theoretic: similarity data is used to form a similarity graph. (Figure: genes 1, 2 and 3 drawn as vertices, with edges gene 1 - gene 2, gene 1 - gene 3 and gene 2 - gene 3 indicating pairwise similarity.)

Introduction In the similarity graph, vertices correspond to elements and edges connect elements whose similarity values are above some threshold. Clusters in the graph are highly connected subgraphs. The main challenges in finding the clusters are: large sets of data, and inaccurate and noisy measurements.
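
A minimal sketch of this construction with networkx, assuming a similarity matrix S and a hypothetical threshold; the gene triangle from the figure above is used as input:

```python
import networkx as nx
import numpy as np

def build_similarity_graph(S, threshold):
    """Connect elements i, j whenever their similarity S[i, j] exceeds threshold."""
    n = S.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if S[i, j] > threshold:
                G.add_edge(i, j)
    return G

# Three pairwise-similar "genes" as in the figure above (values hypothetical).
S = np.array([[1.0, 0.9, 0.8],
              [0.9, 1.0, 0.85],
              [0.8, 0.85, 1.0]])
G = build_similarity_graph(S, threshold=0.5)
print(sorted(G.edges()))  # [(0, 1), (0, 2), (1, 2)]
```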

Important Definitions in Graphs Edge connectivity: the minimum number of edges whose removal results in a disconnected graph, denoted k(G). For a graph G, if k(G) = l, then G is called an l-connected graph.

Important Definitions in Graphs Example: two graphs on the vertices A, B, C, D. The edge connectivity of GRAPH 1 is 2. The edge connectivity of GRAPH 2 is 3.

Important Definitions in Graphs Cut: a cut in a graph is a set of edges whose removal disconnects the graph. A minimum cut, denoted S, is a cut with a minimum number of edges. For a non-trivial graph G, a cut S is a minimum cut if and only if |S| = k(G).

Important Definitions in Graphs Example: in GRAPH 1, a minimum cut consists of the two edges around vertex B or vertex D. In GRAPH 2, a minimum cut consists of the three edges around any one of the vertices A, B, C or D.
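
The following sketch checks these values with networkx. The edge sets are an assumed reconstruction consistent with the stated numbers: GRAPH 1 as the 4-cycle A-B-C-D plus the chord A-C (so that the minimum cuts fall around B or D), and GRAPH 2 as the complete graph on A, B, C, D:

```python
import networkx as nx

G1 = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("A", "C")])
G2 = nx.complete_graph(["A", "B", "C", "D"])

print(nx.edge_connectivity(G1))  # 2
print(nx.edge_connectivity(G2))  # 3
print(nx.minimum_edge_cut(G1))   # the two edges around B (or D)
```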

Important Definitions in Graphs Distance d(u,v): The distance d(u,v) between vertices u and v in G is the minimum length of a path joining u and v. The length of a path is the number of edges in it.

Important Definitions in Graphs Diameter of a connected graph: the maximum distance between any two vertices in G, denoted diam(G). Degree of a vertex: the number of edges incident with the vertex v, denoted deg(v). The minimum degree of a vertex in G is denoted by delta(G).

Important Definitions in Graphs Example (a five-vertex graph on A, B, C, D, E): d(A,D) = 1, d(B,D) = 2, d(A,E) = 2. The diameter of the graph is 2. deg(A) = 3, deg(B) = 2, deg(E) = 1. The minimum degree of a vertex in G is 1.
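
A sketch verifying these values with networkx; the edge set below is an assumption, one five-vertex graph consistent with every value quoted above (the original figure is not available):

```python
import networkx as nx

G = nx.Graph([("A", "B"), ("A", "C"), ("A", "D"),
              ("B", "C"), ("C", "D"), ("C", "E")])

print(nx.shortest_path_length(G, "A", "D"))         # 1
print(nx.shortest_path_length(G, "B", "D"))         # 2
print(nx.shortest_path_length(G, "A", "E"))         # 2
print(nx.diameter(G))                               # 2
print(G.degree("A"), G.degree("B"), G.degree("E"))  # 3 2 1
print(min(dict(G.degree()).values()))               # delta(G) = 1
```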

Important Definitions in Graphs Highly connected graph: a graph with n > 1 vertices is highly connected if its edge connectivity satisfies k(G) > n/2. A highly connected subgraph (HCS) is an induced subgraph H of G such that H is highly connected. The HCS algorithm identifies highly connected subgraphs as clusters.

Important Definitions in Graphs Example (vertices A, B, C, D, E): number of nodes = 5, edge connectivity = 1. Since 1 < 5/2, the graph is not highly connected. Not HCS!

Important Definitions in Graphs Example continued (vertices A, B, C, D): number of nodes = 4, edge connectivity = 3. Since 3 > 4/2, this graph is highly connected. HCS!
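
A minimal sketch of the highly-connected test. The two example graphs are assumed to be K4 plus a pendant vertex E and K4 itself, a reconstruction consistent with the stated node counts and connectivities:

```python
import networkx as nx

def is_highly_connected(G):
    """A graph with n > 1 vertices is highly connected if k(G) > n/2."""
    n = G.number_of_nodes()
    return n > 1 and nx.edge_connectivity(G) > n / 2

G5 = nx.complete_graph(["A", "B", "C", "D"])
G5.add_edge("C", "E")                  # pendant vertex makes k(G5) = 1
print(is_highly_connected(G5))         # False: 1 < 5/2
print(is_highly_connected(G5.subgraph(["A", "B", "C", "D"])))  # True: 3 > 4/2
```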

HCS Algorithm

HCS(G(V,E))
begin
    (H, H', C) <- MINCUT(G)
    if G is highly connected then
        return G
    else
        HCS(H)
        HCS(H')
    end if
end

HCS Algorithm The procedure MINCUT(G) returns H, H' and C, where C is the minimum cut that separates G into the subgraphs H and H'. The procedure HCS returns a graph whenever it identifies it as a cluster. Single vertices are not considered clusters and are grouped into the singletons set S.
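
A runnable sketch of the recursion in Python with networkx, following the pseudocode above; collecting clusters in a list and leftover vertices in a singletons set is an implementation choice, not part of the original pseudocode:

```python
import networkx as nx

def is_highly_connected(G):
    return nx.edge_connectivity(G) > G.number_of_nodes() / 2

def hcs(G, clusters, singletons):
    if G.number_of_nodes() == 0:
        return
    if G.number_of_nodes() == 1:
        singletons.update(G.nodes())         # single vertices go to the set S
        return
    if not nx.is_connected(G):               # handle disconnected input up front
        for comp in nx.connected_components(G):
            hcs(G.subgraph(comp).copy(), clusters, singletons)
        return
    if is_highly_connected(G):
        clusters.append(set(G.nodes()))      # G itself is returned as a cluster
        return
    C = nx.minimum_edge_cut(G)               # MINCUT(G): cut separating H and H'
    H = G.copy()
    H.remove_edges_from(C)
    for comp in nx.connected_components(H):  # recurse on the two sides H and H'
        hcs(G.subgraph(comp).copy(), clusters, singletons)

# usage:
# clusters, singletons = [], set()
# hcs(similarity_graph, clusters, singletons)
```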

HCS Algorithm Example

HCS Algorithm Example Continued

HCS Algorithm Example Continued (Figure: the final partition of the example graph into Cluster 1, Cluster 2 and Cluster 3.)

HCS Algorithm The running time of the algorithm is bounded by 2N * f(n,m), where N is the number of clusters found and f(n,m) is the time complexity of computing a minimum cut in a graph with n vertices and m edges. The current fastest deterministic algorithms for finding a minimum cut in an unweighted graph require O(nm) steps.

Properties of HCS Clustering The diameter of every highly connected graph is at most two; that is, any two vertices are either adjacent or share one or more common neighbors. This is a strong indication of homogeneity.

Properties of HCS Clustering Each cluster is at least half as dense as a clique, which is another strong indication of homogeneity. Any non-trivial set split by the algorithm has diameter at least three. This is a strong indication of the separation property of the solution provided by the HCS algorithm.
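
A small sketch checking both homogeneity indicators on an arbitrary highly connected graph; K5 minus one edge is used here purely for illustration:

```python
import networkx as nx

G = nx.complete_graph(5)
G.remove_edge(0, 1)                 # highly connected, but not a clique
n = G.number_of_nodes()

assert nx.edge_connectivity(G) > n / 2            # k(G) = 3 > 5/2
assert nx.diameter(G) <= 2                        # property 1: diameter at most two
clique_edges = n * (n - 1) / 2
assert G.number_of_edges() >= clique_edges / 2    # property 2: half as dense as a clique
print("both homogeneity properties hold")
```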

Modified HCS Algorithm Example

Modified HCS Algorithm Example – Another possible cut

Modified HCS Algorithm Example – Another possible cut

Modified HCS Algorithm Example – Another possible cut

Modified HCS Algorithm Example – Another possible cut (Figure: with a different choice of minimum cuts, the same graph splits into only Cluster 1 and Cluster 2.)

Modified HCS Algorithm Iterated HCS: choosing different minimum cuts in a graph may result in different numbers of clusters. A possible solution is to perform several iterations of the HCS algorithm until no new cluster is found. The iterated HCS adds another O(n) factor to the running time.

Modified HCS Algorithm Singletons adoption: elements left as singletons can be adopted by clusters based on their similarity to the cluster. For each singleton element, we compute the number of neighbors it has in each cluster and in the singletons set S. If the maximum number of neighbors is in some cluster, and is sufficiently large compared to the number of neighbors in S, then the element is adopted by that cluster.
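
A minimal sketch of the adoption step. The concrete rule used below, strictly more neighbors in the best cluster than in S, is an assumption; the slide only requires the count to be "sufficiently large":

```python
import networkx as nx

def adopt_singletons(G, clusters, singletons):
    changed = True
    while changed:                               # repeat until no element moves
        changed = False
        for v in list(singletons):
            counts = [sum(1 for u in G.neighbors(v) if u in c) for c in clusters]
            in_s = sum(1 for u in G.neighbors(v) if u in singletons)
            if counts and max(counts) > in_s:    # hypothetical adoption threshold
                clusters[counts.index(max(counts))].add(v)
                singletons.remove(v)
                changed = True
```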

Modified HCS Algorithm Removing low-degree vertices: some iterations of the min-cut algorithm may simply separate a low-degree vertex from the rest of the graph, which is computationally very expensive. Removing low-degree vertices from the graph G eliminates such iterations and significantly reduces the running time.

Modified HCS Algorithm

HCS_LOOP(G(V,E))
begin
    for i = 1 to p do
        remove clustered vertices from G
        H <- G
        repeatedly remove all vertices of degree < d(i) from H
        until no new cluster is found by the HCS call do
            HCS(H)
            perform singletons adoption
            remove clustered vertices from H
        end until
    end for
end

Key Features of HCS Algorithm The HCS algorithm was implemented and tested on both simulated and real data, and it has given good results. The algorithm was applied to gene expression data: on ten different data sets, varying in size from 60 to 980 elements with 3-13 clusters and a high noise rate, HCS achieved an average Minkowski score below 0.2.

Key Features of HCS Algorithm In comparison, the greedy algorithm had an average Minkowski score of 0.4. Minkowski score: a clustering solution for a set of n elements can be represented by an n x n matrix M, where M(i,j) = 1 if i and j are in the same cluster according to the solution and M(i,j) = 0 otherwise. If T denotes the matrix of the true solution, then the Minkowski score of M is ||T - M|| / ||T||. Lower scores are better; a score of 0 is a perfect match.
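
A short sketch of the score, assuming ||.|| is the root-sum-of-squares (Frobenius) norm, which is the usual reading of this formula; the labelings below are hypothetical:

```python
import numpy as np

def comembership(labels):
    """n x n matrix with entry 1 when elements i and j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def minkowski_score(T, M):
    """||T - M|| / ||T|| for 0/1 co-membership matrices T (truth) and M (solution)."""
    return np.linalg.norm(T - M) / np.linalg.norm(T)

truth    = comembership([0, 0, 1, 1])
solution = comembership([0, 0, 0, 1])
print(minkowski_score(truth, solution))  # > 0; a perfect solution scores 0.0
```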

Key Features of HCS Algorithm HCS manifested robustness with respect to higher noise levels. Next, the algorithm was applied in a blind test to real gene expression data consisting of 2329 elements partitioned into 18 clusters. HCS identified 16 clusters with a score of 0.71, whereas the greedy algorithm got a score of 0.77.

Key Features of HCS Algorithm Comparison of the HCS algorithm with the optimal graph-theoretic approach to data clustering.

Key Features of HCS Algorithm For the graph seen previously, with the number of clusters (3) given as input, the HCS algorithm and the optimal graph-theoretic approach to data clustering are compared. The HCS algorithm finds all three clusters G1, G2 and G3. The optimal graph-theoretic approach isolates a vertex v in {a,b,c,d} and finds only two clusters: G1\{v} and (G2 U G3)\{v}.

Summary Clusters are defined as subgraphs with edge connectivity above half the number of vertices. Elements in the clusters generated by the HCS algorithm are homogeneous, and elements in different clusters have low similarity values. Possible future improvements include finding maximal highly connected subgraphs and finding a weighted minimum cut in an edge-weighted graph.

Thank You!!