1
Comparative Genomics (Network Biology)
Today's lecture will cover the following three topics:
1. On finding clusters in undirected simple graphs: application to protein complex detection
2. DPClus software tool
3. Concept of Line Graphs
2
On finding clusters in undirected simple graphs: application to protein complex detection
Outline
Introduction
Some basic concepts
The proposed algorithm
The DPClus software
Results & Discussion
Conclusions
3
Introduction There is no universal definition of a cluster. But clustering is an important issue. Consequently there are diverse definitions and various methods. The major purpose of clustering is finding cohesive groups. Here, we are going to discuss a graph clustering algorithm.
4
Introduction
Regarding a graph, a cluster is a subgraph whose nodes are densely connected with each other compared to their connections with other nodes in the graph. This is a flexible definition of a cluster. Intuitively, we can recognize two clusters in this arbitrary graph. But it is difficult to draw a big graph revealing its clusters.
5
Introduction
An E. coli protein-protein interaction network consisting of 3007 proteins and 11531 interactions (from Mori Lab, NAIST, Japan).
Some algorithm is needed to detect locally dense regions.
6
Md. Altaf-Ul-Amin, Yoko Shinbo, Kenji Mihara, Ken Kurokawa and Shigehiko Kanaya, “Development and implementation of an algorithm for detection of protein complexes in large interaction networks”, BMC Bioinformatics 7:207, April 2006. Introduction
7
Some basic concepts
Two nodes that belong to the same cluster are likely to have more common neighbors than two nodes that do not.
8
Some basic concepts
9
Some basic concepts
The density d of a cluster is the ratio of the number of edges present in it to the maximum possible number of edges in it:
d = |E| / |E|max = 2|E| / (|N| (|N| - 1))
d is a real number ranging from 0 to 1.
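As a minimal sketch of the density formula above (the function name and the handling of graphs with fewer than two nodes are assumptions made for illustration, not part of the slides):

```python
def density(num_nodes: int, num_edges: int) -> float:
    """Density d = 2|E| / (|N| (|N| - 1)) of an undirected simple graph."""
    if num_nodes < 2:
        # Assumption: with fewer than two nodes no edge is possible,
        # so the density is reported as 0.
        return 0.0
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


# A complete graph on 4 nodes has all 6 possible edges.
print(density(4, 6))   # 1.0
# 8 nodes and 14 edges use half of the 28 possible edges.
print(density(8, 14))  # 0.5
```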
10
Some basic concepts
Density of the total graph = 0.241. Complexes shown: d = 0.9 and d = 1.0.
The densities of the complexes are relatively higher.
11
Some basic concepts
Considering density alone is not enough. Both the graphs consist of 8 nodes and both are of density 0.5, but one of them seems to be a single cluster while the other is divided into two clusters. Such situations can be tackled by keeping track of the periphery.
12
Some basic concepts
The cluster property of any node n with respect to any cluster k of density d_k and size |N_k| is defined as follows:
cp_nk = |E_nk| / (d_k * |N_k|)
Here, |E_nk| is the total number of edges between the node n and the nodes of cluster k.
[Figure: two example graphs, labeled "Cluster property of node f = 0.57" and "Cluster property of node f = 0.2"]
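The cluster property can be sketched as a one-line function. The function and argument names are hypothetical; the two sample calls reuse numbers that appear in the worked example later in these slides (a node with 2 edges to a cluster of density 1 and size 4, and a node with 2 edges to a cluster of density 0.8 and size 5).

```python
def cluster_property(edges_to_cluster: int,
                     cluster_density: float,
                     cluster_size: int) -> float:
    """cp_nk = |E_nk| / (d_k * |N_k|) for node n and cluster k."""
    return edges_to_cluster / (cluster_density * cluster_size)


print(cluster_property(2, 1.0, 4))  # 0.5
print(cluster_property(2, 0.8, 5))  # 0.5
```

A node connected to every member of a complete cluster has cp = 1, and the value falls as the node touches a smaller share of the cluster.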
13
The proposed Algorithm
The proposed algorithm is a sequential constructive algorithm: it initializes the complex/cluster by choosing a seed node, and then repeatedly adds other nodes on the basis of priority and some conditions.
The major methods of the algorithm:
Choosing a seed node.
Selecting a priority node.
Checking necessary conditions before adding a node to a complex.
14
The proposed Algorithm
Inputs to the algorithm are:
The associated (adjacency) matrix of the network.
A minimum threshold density for the generated clusters.
A parameter to determine how we separate a complex from its periphery.
Outputs of the algorithm are:
Overlapping/non-overlapping complexes whose densities are greater than or equal to the given density.
15
- Flowchart of the proposed Algorithm
16
The proposed Algorithm
M =
0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 0 1 1 1 0 0 0 0 0 0 0 0
0 1 1 0 1 1 0 1 0 0 0 0 0 0
0 0 1 1 0 1 0 0 0 0 0 0 0 0
0 1 1 1 1 0 1 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 1 0 0 0
0 0 0 1 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 1 0 1 0 0 1 1
0 0 0 0 0 0 0 0 1 0 1 0 1 1
0 0 0 0 0 0 1 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 0 1 1 0 0 1 0
M_uv = 1 if there is an edge between nodes u and v, and 0 otherwise.
17
The proposed Algorithm
M^2 =
1 0 1 1 0 1 0 0 0 0 0 0 0 0
0 4 2 2 3 2 1 1 0 0 0 0 0 0
1 2 4 3 2 3 1 1 0 0 0 0 0 0
1 2 3 5 2 3 1 0 1 0 0 0 0 0
0 3 2 2 3 2 1 1 0 0 0 0 0 0
1 2 3 3 2 5 0 1 0 0 1 0 0 0
0 1 1 1 1 0 2 0 0 1 0 0 0 0
0 1 1 0 1 1 0 2 0 1 0 0 1 1
0 0 0 1 0 0 0 0 4 2 1 1 2 2
0 0 0 0 0 0 1 1 2 4 0 1 2 2
0 0 0 0 0 1 0 0 1 0 2 0 1 1
0 0 0 0 0 0 0 0 1 1 0 1 0 1
0 0 0 0 0 0 0 1 2 2 1 0 4 2
0 0 0 0 0 0 0 1 2 2 1 1 2 3
(M^2)_uv for u ≠ v represents the number of common neighbors of the nodes u and v.
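The common-neighbor interpretation of the squared matrix can be checked with a short sketch. The edge list below is read off the 14 x 14 matrix M shown on the preceding slide (nodes numbered 1 to 14 by row); the helper names are hypothetical. Edge weights are taken as the common-neighbor counts of the two endpoints, and node weights as the sum of the weights of the connecting edges, as the slides describe.

```python
# Edges read off the matrix M (1-based node numbers, one entry per edge).
EDGES = [
    (1, 2), (2, 3), (2, 4), (2, 6), (3, 4), (3, 5), (3, 6),
    (4, 5), (4, 6), (4, 8), (5, 6), (6, 7), (7, 11), (8, 9),
    (9, 10), (9, 13), (9, 14), (10, 11), (10, 13), (10, 14),
    (12, 13), (13, 14),
]
N = 14

# Build the symmetric 0/1 matrix M.
M = [[0] * N for _ in range(N)]
for u, v in EDGES:
    M[u - 1][v - 1] = 1
    M[v - 1][u - 1] = 1


def square(matrix):
    """Plain-Python matrix product of a square matrix with itself."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


M2 = square(M)

# Off-diagonal entries count common neighbors; diagonal entries are degrees.
print(M2[1][4])  # common neighbors of nodes 2 and 5 -> 3
print(M2[3][3])  # degree of node 4 -> 5

# Edge weight: common-neighbor count of the two endpoints.
edge_weight = {(u, v): M2[u - 1][v - 1] for u, v in EDGES}

# Node weight: sum of the weights of the edges touching the node.
node_weight = {n: 0 for n in range(1, N + 1)}
for (u, v), w in edge_weight.items():
    node_weight[u] += w
    node_weight[v] += w
```

For large networks a sparse representation would be more practical than a dense matrix product; the dense form is used here only because it mirrors the slides.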
19
The weights of edges are derived by squaring the associated matrix of the graph.
[Figure: example graph with each edge labeled by its weight]
20
The proposed Algorithm
The weights of nodes (sum of the weights of the connecting edges).
[Figure: example graph with edge weights and node weights]
21
The proposed Algorithm
[Figure: example graph with edge and node weights; the seed and its neighbors are marked]
Neighbor   Sum of edge weights   # of edges
P1         2                     1
P3         3                     1
P4         2                     1
P5         3                     1
22
The proposed Algorithm
[Figure: example graph with edge and node weights; neighbors sorted by priority]
Neighbor   Sum of edge weights   # of edges
P3         3                     1
P5         3                     1
P1         2                     1
P4         2                     1
cp of P3 = 1
23
The proposed Algorithm
[Figure: example graph with edge and node weights; current cluster d = 1.0]
Neighbor   Sum of edge weights   # of edges
P1         4                     2
P4         4                     2
P5         6                     2
P7         0                     1
24
The proposed Algorithm
[Figure: example graph with edge and node weights; current cluster d = 1.0; neighbors sorted by priority]
Neighbor   Sum of edge weights   # of edges
P5         6                     2
P1         4                     2
P4         4                     2
P7         0                     1
cp of P5 = 1
25
The proposed Algorithm
[Figure: example graph with edge and node weights; current cluster d = 1.0]
Neighbor   Sum of edge weights   # of edges
P1         4                     2
P4         4                     2
P6         0                     1
P7         0                     1
cp of P1 = 1
26
The proposed Algorithm
[Figure: example graph with edge and node weights; current cluster d = 1.0]
Neighbor   Sum of edge weights   # of edges
P0         0                     1
P4         4                     2
P6         0                     1
P7         0                     1
27
The proposed Algorithm
[Figure: example graph with edge and node weights; current cluster d = 1.0; neighbors sorted by priority]
Neighbor   Sum of edge weights   # of edges
P4         4                     2
P0         0                     1
P6         0                     1
P7         0                     1
cp of P4 = 0.75
28
The proposed Algorithm
[Figure: example graph with edge and node weights; current cluster d = 0.9]
Neighbor   Sum of edge weights   # of edges   cp-value
P0         0                     1            ~0.22
P6         0                     1            ~0.22
P7         0                     1            ~0.22
29
The proposed Algorithm
The remaining graph
[Figure: the remaining graph with edge and node weights; the next seed is marked]
30
The proposed Algorithm
[Figure: cluster formation in the remaining graph; d = 1.0]
33
The remaining graph
34
The proposed Algorithm Clustering by the proposed algorithm
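The walk-through above (weigh edges by common neighbors, pick the heaviest node as seed, add priority neighbors while the density and cluster-property thresholds hold, then remove the cluster and repeat) can be sketched as follows. This is a simplified illustration in non-overlapping mode, not the DPClus implementation: the function names, the tie-breaking rules, treating a lone seed as density 1, trying lower-priority neighbors when the first one fails, and the `min_size` filter are all assumptions.

```python
def build_adjacency(edges):
    """Adjacency sets of an undirected simple graph (self-loops ignored)."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj, nodes):
    """Density of the subgraph induced by `nodes`."""
    n = len(nodes)
    if n < 2:
        return 1.0  # assumption: a lone seed counts as density 1
    internal = sum(len(adj[x] & nodes) for x in nodes) // 2
    return 2 * internal / (n * (n - 1))


def dpclus_sketch(edges, d_in, cp_in, min_size=3):
    adj = build_adjacency(edges)
    clusters = []
    while any(adj.values()):  # stop when no edges remain
        # Edge weight = number of common neighbors in the remaining graph.
        def weight(u, v):
            return len(adj[u] & adj[v])

        node_weight = {u: sum(weight(u, v) for v in adj[u]) for u in adj}
        # Seed: heaviest node; ties broken by degree, then by name (assumed).
        seed = max(sorted(adj), key=lambda u: (node_weight[u], len(adj[u])))
        cluster = {seed}
        grown = True
        while grown:
            grown = False
            neighbors = set().union(*(adj[x] for x in cluster)) - cluster
            # Priority: sum of edge weights to the cluster, then edge count.
            ranked = sorted(
                sorted(neighbors),
                key=lambda c: (sum(weight(c, x) for x in adj[c] & cluster),
                               len(adj[c] & cluster)),
                reverse=True,
            )
            d_k = density(adj, cluster)
            for c in ranked:
                cp = len(adj[c] & cluster) / (d_k * len(cluster))
                if cp >= cp_in and density(adj, cluster | {c}) >= d_in:
                    cluster.add(c)
                    grown = True
                    break
        if len(cluster) >= min_size:
            clusters.append(cluster)
        # Remove the cluster from the remaining graph.
        for x in cluster:
            for y in adj.pop(x):
                if y in adj:
                    adj[y].discard(x)
    return clusters


# Hypothetical toy input: two triangles joined by the single edge c-d.
TWO_TRIANGLES = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
                 ("d", "e"), ("e", "f"), ("d", "f")]
clusters = dpclus_sketch(TWO_TRIANGLES, d_in=0.7, cp_in=0.5)
print([sorted(c) for c in clusters])  # the triangles a-b-c and d-e-f
```

On the toy input, node d is rejected from the first triangle because it has only one edge into a cluster of three (cp = 1/3) and would lower the density to 4/6, which illustrates how the periphery check keeps loosely attached nodes out.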
35
Example
[Figure (i): example graph with nodes A to L]
36
1. Input and initialization: cp_in = 0.4, d_in = 0.6
[Figure (i): example graph with nodes A to L]
37
1. Seed selection-1: Calculation of weights of edges
[Figure: example graph with each edge labeled by its weight]
38
1. Seed selection-2: Calculation of weights of nodes
Seed selection for Cluster 1
[Figure (iii): example graph with edge weights and node weights; the selected seed is marked]
39
2. Cluster formation-1: Calculation of weights of nodes
Formation of Cluster 1
Cluster 1: d_1 = 1
Candidate merged to Cluster 1
[Figure (iv): example graph with the seed and candidate weights]
40
2. Cluster formation-2
Formation of Cluster 1
Check thresholds: OK
d_1 = 1/1 = 1 > 0.6
cp_C1 = 1/(1*1) = 1 > 0.4 (cp_in)
Candidate merged to Cluster 1
[Figure (v)]
41
2. Cluster formation-3
Formation of Cluster 1
cp_A1 = 2/(1*2) = 1 > 0.4
Cluster 1: d_1 = 3/3 = 1
[Figure (vi)]
42
2. Cluster formation-4
Formation of Cluster 1
Check thresholds: OK
d_1 = 1/1 = 1 > 0.6
cp_B1 = 3/(1*3) = 1 > 0.4 (cp_in)
Candidate merged to Cluster 1
[Figure (vii)]
43
2. Cluster formation-5
Formation of Cluster 1
Check thresholds: OK
d_1 = 8/10 = 0.8 > 0.6
cp_L1 = 2/(1*4) = 0.5 > 0.4 (cp_in)
Candidate merged to Cluster 1
[Figure (viii)]
44
2. Cluster formation-6
Search for Cluster 1
Check thresholds: OK
d_1 = 10/15 = 0.67 > 0.6
cp_E1 = 2/(0.8*5) = 0.5 > 0.4 (cp_in)
Candidate merged to Cluster 1
[Figure (ix)]
45
2. Cluster formation-7
Search for Cluster 1
Check thresholds: Out
d_1 = 11/21 = 0.52 < 0.6
cp_E1 = 1/(0.52*6) = 0.32 < 0.4 (cp_in)
[Figure (ix)]
46
2. Cluster formation-8
Search for Cluster 1
Check thresholds: Out
d_1 = 11/21 = 0.52 < 0.6
cp_F1 = 1/(0.52*6) = 0.32 < 0.4 (cp_in)
[Figure (ix)]
47
2. Cluster formation-8
Search for Cluster 1
Check thresholds: Out
d_1 = 11/21 = 0.52 < 0.6
cp_F1 = 1/(0.52*6) = 0.32 < 0.4 (cp_in)
[Figure (ix)]
48
2. Cluster formation-9: Remove the edges and nodes belonging to Cluster 1
Remove Cluster 1
[Figure (x): the remaining graph with nodes F, G, H, I, J and K]
49
Results of Density Periphery Clustering
Cluster 1: d_1 = 10/15 = 0.67
Cluster 2: d_2 = 3/3 = 1
Cluster 3: d_3 = 3/3 = 1
End
[Figure (x): example graph with nodes A to L and the three clusters marked]
50
Results: Complexes in the E. coli PPI Network
The network of E. coli proteins consists of 363 interactions involving a total of 336 proteins.
DIP:339N   GroEL    DIP:1081N   PrnP
DIP:1025N  CarB     DIP:1026N   CarA
DIP:539N   MalG     DIP:508N    MalE
DIP:124N   XerD     DIP:726N    XerC
DIP:367N   PntB     DIP:366N    PntA
DIP:342N   SbcC     DIP:572N    Gam
----------------------------------------------
http://dip.mbi.ucla.edu/
51
Results: Complexes in the E. coli PPI Network
Components of RNA polymerase (RpoA, RpoB, RpoC, Rsd, RpoZ, RpoD, RpoN, FliA)
52
Results: Complexes in the E. coli PPI Network
Components of ATP synthetase (AtpA, AtpB, AtpE, AtpF, AtpG, AtpH, AtpL)
53
Results: Complexes in the E. coli PPI Network
Proteins involved in cell division (FtsQ, FtsI, FtsW, FtsN, FtsK and FtsL)
54
Results: Complexes in the E. coli PPI Network
Components of DNA polymerase (DnaX, HolA, HolB, HolD and HolC)
55
Results: Complexes in the S. cerevisiae PPI Network
We extract a set of 12487 unique binary interactions involving 4648 proteins by discarding self-interactions from the PPI data obtained from ftp://ftpmips.gsf.de/yeast/PPI/.
56
Results: Details of a Group of Predicted Complexes
Information on the complexes that are of size 6 of the set generated using d_in = 0.7, cp_in = 0.50 and non-overlapping mode.
We considered 15 functional classes: (1) Cell cycle and DNA processing, (2) Protein with binding function or cofactor requirement (structural or catalytic), (3) Protein fate (folding, modification, destination), (4) Biogenesis of cellular components, (5) Cellular transport, transport facilitation and transport routes, (6) Metabolism, (7) Interaction with the cellular environment, (8) Transcription, (9) Energy, (10) Cell rescue, defense and virulence, (11) Cell type differentiation, (12) Cellular communication/signal transduction mechanism, (13) Protein activity regulation, (14) Protein synthesis, and (15) Transposable elements, viral and plasmid proteins.
57
Results: Hypergeometric distribution
N = total number of proteins in the network
F = number of proteins of a functional group in the network
C = number of proteins in a cluster
k = number of proteins of a functional group in a cluster
The p-value of a cluster is the probability that a randomly selected set of C proteins contains at least k proteins of the functional group. The lower the p-value, the higher the statistical significance.
58
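P-value & Hypergeometric distribution
There are 3 green and 4 red balls. Put them in a box and randomly choose any 3.
P_0 (# of red balls is 0) = (4 choose 0)(3 choose 3) / (7 choose 3) = 1/35
P_1 (# of red balls is 1) = (4 choose 1)(3 choose 2) / (7 choose 3) = 12/35
P_2 (# of red balls is 2) = (4 choose 2)(3 choose 1) / (7 choose 3) = 18/35
P_3 (# of red balls is 3) = (4 choose 3)(3 choose 0) / (7 choose 3) = 4/35
Notice that P_0 + P_1 + P_2 + P_3 = 1.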
59
P-value & Hypergeometric distribution
[Figure: the probabilities P_0, P_1, P_2 and P_3 for 0, 1, 2 and 3 red balls]
60
P-value & Hypergeometric distribution
With N = 7, F = 4 and C = 3:
P(# of red balls ≤ 1) = P_0 + P_1
P(# of red balls ≥ 2) = 1 - (P_0 + P_1)
P(# of red balls ≥ k) = 1 - (P_0 + P_1 + ... + P_(k-1))
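The ball example can be reproduced with the standard hypergeometric probability. The sketch below uses the slide's symbols N, F, C and k; the function names are hypothetical, and the formula P_i = (F choose i)(N - F choose C - i) / (N choose C) is the textbook hypergeometric mass function rather than something spelled out on the slides.

```python
from math import comb


def hypergeom_pmf(N: int, F: int, C: int, i: int) -> float:
    """Probability of exactly i 'marked' items when C of N items are drawn
    without replacement and F of the N items are marked."""
    return comb(F, i) * comb(N - F, C - i) / comb(N, C)


def p_value(N: int, F: int, C: int, k: int) -> float:
    """P(# marked >= k) = 1 - (P_0 + P_1 + ... + P_(k-1))."""
    return 1.0 - sum(hypergeom_pmf(N, F, C, i) for i in range(k))


# Ball example: 7 balls, 4 red, 3 drawn.
probabilities = [hypergeom_pmf(7, 4, 3, i) for i in range(4)]
print(probabilities)        # 1/35, 12/35, 18/35, 4/35 as floats
print(p_value(7, 4, 3, 2))  # 22/35, about 0.629
```

For a predicted complex, N would be the network size, F the size of the functional class, C the complex size and k the number of complex members in that class.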
61
Results: Details of a Group of Predicted Complexes
Information on the complexes that are of size 6 of the set generated using d_in = 0.7, cp_in = 0.50 and non-overlapping mode.
Protein YDR425w of complex 19 is related to cellular transport, and YIP1, YGL198w, YGL161c and GCS1 are related to vesicular transport. Hence, we predict that the function-unknown protein YPL095c of this complex is a transport-related protein, most likely related to vesicular transport.
62
Conclusions
In this work, we present an algorithm to detect locally dense regions in undirected simple graphs. The algorithm can be used to detect protein complexes in large protein-protein interaction networks or co-expressed gene clusters based on microarray data. It can also be used for protein/gene function prediction by finding complexes/clusters in networks consisting of function-known and function-unknown proteins. DPClus can also be applied to other networks where finding cohesive groups is of interest.
The DPClus software is available at http://kanaya.naist.jp/DPClus/
63
2. The DPClus Software
Md. Altaf-Ul-Amin, Hisashi Tsuji, Ken Kurokawa, Hiroko Asahi, Yoko Shinbo, Shigehiko Kanaya, “DPClus: A Density-periphery Based Graph Clustering Software Mainly Focused on Detection of Protein Complexes in Interaction Networks”, Journal of Computer Aided Chemistry, Vol. 7, 150-156, 2006.
The DPClus software has been developed based on the proposed algorithm.
The DPClus software is available at http://kanaya.naist.jp/DPClus/
64
The main window of DPClus The DPClus Software
65
The DPClus Software
The input file format

List of edges:
AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH

Adjacency matrix:
0 0 1 0 1
0 0 0 1 1
1 0 0 0 1
0 1 0 0 1
1 1 1 1 0

Adjacency list:
AtpA: AtpB, AtpH
AtpB: AtpA, AtpH
AtpH: AtpB, AtpA, AtpG, AtpE
AtpG: AtpH, AtpE
AtpE: AtpG, AtpH

[Figure: corresponding network]
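The three representations on this slide describe the same network, which a short sketch can confirm. It assumes the edge list is stored as one whitespace-separated pair per line (the separator is not visible in the slide text), and the function names are hypothetical. The row order AtpB, AtpG, AtpA, AtpE, AtpH is also an assumption; it is the order under which the computed matrix matches the one shown on the slide.

```python
EDGE_LIST_TEXT = """\
AtpB AtpA
AtpG AtpE
AtpA AtpH
AtpB AtpH
AtpG AtpH
AtpE AtpH
"""


def parse_edge_list(text):
    """Turn 'nodeA nodeB' lines into a dict of neighbor sets."""
    adjacency = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        u, v = parts
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def adjacency_matrix(adjacency, order):
    """0/1 matrix with rows and columns in the given node order."""
    return [[1 if v in adjacency[u] else 0 for v in order] for u in order]


adjacency = parse_edge_list(EDGE_LIST_TEXT)
order = ["AtpB", "AtpG", "AtpA", "AtpE", "AtpH"]  # assumed row order
matrix = adjacency_matrix(adjacency, order)

for node in order:
    print(node, sorted(adjacency[node]))
for row in matrix:
    print(*row)
```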
66
The DPClus Software
Output file format
ClusterLength of cluster 1 is: 8
RpoA RpoB RpoC Rsd RpoZ RpoD RpoN FliA
ClusterLength of cluster 2 is: 8
AtpH AtpG AtpB AtpA AtpF AtpL AtpE AtpB(A)
ClusterLength of cluster 3 is: 5
--------------------------------------
67
The DPClus Software
Intra-cluster edges are green and inter-cluster edges are red. Nodes have been arranged by dragging.
68
Click Hierarchical graph of the clusters The DPClus Software
69
The DPClus Software
Clustering of microarray data
Sample microarray data
To apply DPClus, we need to convert this data to a network.
70
The DPClus Software
Converting microarray data to a network:
Expression data (Experiment ID x Genes)
Gene-gene correlation
Select highly correlated gene pairs
Edges of a network:
At3g10060 At3g54150
At3g10060 At3g63140
At3g10060 At5g07020
---------------------------
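The conversion from expression profiles to network edges can be sketched as below. The slides do not say which correlation measure is used, so Pearson correlation is an assumption here, as are the function names, the toy gene names and the made-up expression values; only positively correlated pairs at or above the threshold are kept.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: a constant profile correlates with nothing
    return cov / (sd_x * sd_y)


def correlation_edges(expression, threshold):
    """Gene pairs whose profiles correlate at or above the threshold."""
    return [(g1, g2)
            for g1, g2 in combinations(sorted(expression), 2)
            if pearson(expression[g1], expression[g2]) >= threshold]


# Hypothetical toy data: one expression value per experiment.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
edges = correlation_edges(expression, threshold=0.95)
print(edges)  # only geneA and geneB are strongly positively correlated
```

The resulting pairs have the same shape as the edge list shown on the slide and could be written out one pair per line.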
71
The DPClus Software
# of experiments: 626
Threshold correlation: 0.95
cp value: 0.5
Density value: 0.9
Minimum cluster size: 3
72
Ribosomal protein clusters Electron transport clusters Photosynthesis clusters The DPClus Software
73
Line Graphs
Given a graph G, its line graph L(G) is a graph such that each vertex of L(G) represents an edge of G, and two vertices of L(G) are adjacent if and only if their corresponding edges share a common endpoint ("are adjacent") in G.
[Figure panels: Graph G; Vertices in L(G) constructed from edges in G; Added edges in L(G); The line graph L(G)]
http://en.wikipedia.org/wiki/Line_graph
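The definition translates directly into a few lines of code. This is a minimal sketch with hypothetical names: each edge of G becomes a vertex of L(G), and two such vertices are joined when the underlying edges share an endpoint.

```python
from itertools import combinations


def line_graph(edges):
    """Return (vertices, edges) of the line graph L(G) of a simple graph G."""
    # Each vertex of L(G) is an edge of G, stored with sorted endpoints.
    vertices = [tuple(sorted(edge)) for edge in edges]
    # Two vertices of L(G) are adjacent if the edges share an endpoint.
    lg_edges = [(e, f) for e, f in combinations(vertices, 2)
                if set(e) & set(f)]
    return vertices, lg_edges


# A star with center "c" and three leaves: all edges share the center,
# so L(G) is a triangle (a clique on three vertices).
star = [("c", "x"), ("c", "y"), ("c", "z")]
star_vertices, star_lg_edges = line_graph(star)
print(len(star_vertices), len(star_lg_edges))  # 3 3

# A path a-b-c-d: only consecutive edges share an endpoint.
path = [("a", "b"), ("b", "c"), ("c", "d")]
path_vertices, path_lg_edges = line_graph(path)
print(path_lg_edges)
```

The pairwise comparison is quadratic in the number of edges, which is fine for illustration but would need an endpoint index for large interaction networks.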
74
Line Graphs
John W. Raymond, Eleanor J. Gardiner and Peter Willett, "RASCAL: Calculation of Graph Similarity using Maximum Common Edge Subgraphs", The Computer Journal, Vol. 45, No. 6, 2002.
The above paper introduced a new graph similarity calculation procedure for comparing labeled graphs. The chemical graphs G1 and G2 are shown in Figure a, and their respective line graphs are depicted in Figure b.
75
Line Graphs
Jose B. Pereira-Leal, Anton J. Enright and Christos A. Ouzounis, "Detection of Functional Modules From Protein Interaction Networks", PROTEINS: Structure, Function, and Bioinformatics 54:49–57 (2004).
Transforming a network of proteins to a network of interactions. a) Schematic representation illustrating a graph representation of protein interactions: nodes correspond to proteins and edges to interactions. b) Schematic representation illustrating the transformation of the protein graph connected by interactions to an interaction graph connected by proteins. Each node represents a binary interaction and edges represent shared proteins. Note that labels that are not shared correspond to terminal nodes in (a).
A star is transformed into a clique.