Functional Module Prediction in Protein Interaction Networks Ch. Eslahchi NUS-IPM Workshop 5-7 April 2011
Identifying Modules from Biological Networks
Studying the network of the interactions can help biologists to understand principles of cellular organization and biochemical phenomena.
Functional modules, as a critical level of the biological hierarchy and as relatively independent units, play a special role in biological networks. Since network modules do not occur by chance, identifying them is likely to capture biologically meaningful interactions.
Naturally, revealing modular structures in biological networks is a preliminary step for understanding how cells function and how proteins organize into a system.
Many methods based on modeling the PPI data with a graph have been developed for analyzing the network structure of PPI networks. Hierarchical clustering methods have been proven to be a good strategy for metabolic networks and PPI networks.
Ravasz et al. (2002) analyzed the hierarchical organization of modularity in metabolic networks. Brun et al. (2003), Rives and Galitski (2003), and Lu et al. (2004) applied three different clustering methods, based respectively on metrics induced by shortest distance, graphical distance, and probabilistic functions, to analyze the module structure of the yeast protein interaction network on a clustering tree.
Several papers, such as Spirin and Mirny (2003), Bader and Hogue (2003), and Bu et al. (2003), have also shown that network modules which are densely connected within themselves but sparsely connected with the rest of the network generally correspond to meaningful biological units such as protein complexes and functional modules.
Several approaches to network clustering have been used for analyzing PPI networks, including edge-betweenness clustering (Dunn et al., 2005), identification of k-cores (Bader and Hogue, 2003), restricted neighborhood search clustering (RNSC) (King et al., 2004), and the Markov clustering algorithm (MCL) (Pereira-Leal et al., 2004).
Spirin and Mirny (2003) detected about 50 network modules by combining three methods (enumeration of complete subgraphs, superparamagnetic clustering, and Monte Carlo simulation); most of these modules have been shown to be protein complexes or functional modules.
Most current methods are partition algorithms, meaning that each protein belongs to exactly one module. Such algorithms are not suitable for finding overlapping modules. Another problem is that PPI networks are very sparse, while most methods identify only strongly connected subgraphs as modules, so only a few modules are detected.
A novel network clustering method, the Clique Percolation Method (CPM) of Palla et al. (2005), can reveal the overlapping module structure of complex networks. A distinct shortcoming of its application to PPI networks is that the method may be too restrictive, since its basic element is the 3-clique. For example, spoke-like modules cannot be detected, and when the method is applied to large sparse PPI networks such as the fly and worm PPI networks, only a few modules are found.
To overcome this problem, Shi-Hua Zhang et al. (2006) introduced the line graph transformation (LGT), an important graph-theoretical technique.
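As a rough illustration of the transformation (a minimal sketch, not the implementation of Zhang et al.): each interaction becomes a node of the line graph, and two such nodes are joined when the underlying interactions share a protein. The function name `line_graph` and the toy protein labels are hypothetical, and the input is assumed to have no self-loops.

```python
from itertools import combinations

def line_graph(edges):
    """Return (nodes, edges) of the line graph of an undirected graph.

    Each original edge becomes a node; two nodes are adjacent when the
    original edges share an endpoint. Assumes no self-loops.
    """
    # Normalise edges so (u, v) and (v, u) map to the same node.
    nodes = sorted({tuple(sorted(e)) for e in edges})
    incident = {}
    for e in nodes:
        for endpoint in e:
            incident.setdefault(endpoint, []).append(e)
    lg_edges = set()
    for shared in incident.values():
        # Every pair of edges meeting at this vertex is adjacent.
        for e1, e2 in combinations(shared, 2):
            lg_edges.add((e1, e2))
    return nodes, sorted(lg_edges)

# Toy spoke-like (star) module: hub "H" with three partners.
star = [("H", "a"), ("H", "b"), ("H", "c")]
nodes, lg_edges = line_graph(star)
print(len(nodes), len(lg_edges))
```

In this toy case the three-spoke star, which contains no triangle, turns into a triangle in the line graph, which is the kind of structure a clique-based method can pick up.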
Hongwei Wu et al. (2007) introduced a computational method for predicting functional modules based on gene distribution (i.e., the existence and order of genes) across multiple microbial genomes. The method builds a gene network in which every pair of genes is associated with a score representing their functional relatedness, then applies a threshold-based clustering algorithm to this gene network to obtain modules.
Feng Luo et al. (2007) extended the concept of degree from a single vertex to a sub-graph and used it to give a formal definition of a module in a network (MoNet). Roger et al. (2010) developed MoNet into a new algorithm (dMoNet).
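One simple reading of threshold-based clustering can be sketched as follows. This is an assumption for illustration only: the sketch keeps gene pairs whose score is at least a threshold and reports the connected components, which may differ from the clustering actually used by Wu et al. The names `threshold_modules`, `scores`, and the gene labels are hypothetical.

```python
def threshold_modules(scores, threshold):
    """Group genes into connected components using only the pairs
    whose relatedness score is at least `threshold`.

    `scores` maps a (gene_a, gene_b) pair to a numeric score.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), s in scores.items():
        ra, rb = find(a), find(b)  # registers both genes
        if s >= threshold:
            parent[ra] = rb

    groups = {}
    for gene in list(parent):
        groups.setdefault(find(gene), set()).add(gene)
    # Report only groups with at least two genes.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)

scores = {("g1", "g2"): 0.9, ("g2", "g3"): 0.8,
          ("g3", "g4"): 0.2, ("g4", "g5"): 0.7}
print(threshold_modules(scores, 0.5))
```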
Most efforts focused on detecting highly connected clusters.
– Peripheral proteins are ignored.
– Modules with other topologies are not identified.
– Modules are isolated and no interrelationship is revealed.
Traditional clustering algorithms have been applied to protein interaction networks (PIN) to find biological modules.
– The PIN must first be transformed into a weighted network:
  Weight the protein interactions by the number of experiments that support the interaction (Pereira-Leal et al.).
  Weight with the shortest path length (Rives et al. and Arnau et al.).
– Drawbacks:
  Weights are artificial.
  The “tie in proximity” problem in hierarchical agglomerative clustering (HAC).
Previous Methods: Detecting highly connected protein clusters.
Problems:
1. Many peripheral proteins that connect to the core protein clusters with few links are neglected, even though these peripheral proteins may represent true interactions that have been experimentally verified.
2. Biologically meaningful protein modules that do not have highly connected topologies are ignored by these approaches.
3. Protein clusters detected by these approaches are usually isolated from each other.
Previous Methods: Clustering methods have been applied to protein interaction networks to identify biological modules.
Application of clustering analysis to protein interaction networks usually involves transforming them into weighted networks.
Weighting:
1. The number of experiments that support the interaction.
2. The length of the shortest path between the two proteins.
Problems:
1. The weighting generates many identical distances, which leads to ambiguous clustering results. The solution is to repeat the algorithm iteratively to eliminate this problem. However, repetitive hierarchical clustering may not be computationally feasible for a large protein interaction network at the whole-genome level.
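The identical-distance problem is easy to see on a toy example. Shortest-path lengths in an unweighted network are small integers, so many protein pairs end up at exactly the same distance. The sketch below (hypothetical helper names, a 6-cycle as a stand-in network) counts how many vertex pairs share each distance.

```python
from collections import Counter, deque

def shortest_path_lengths(adj, source):
    """Breadth-first search distances from `source` (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distance_histogram(adj):
    """Count how many unordered vertex pairs sit at each distance."""
    hist = Counter()
    for s in adj:
        for t, d in shortest_path_lengths(adj, s).items():
            if s < t:  # count each unordered pair once
                hist[d] += 1
    return hist

# Toy network: a cycle on six vertices (15 vertex pairs in total).
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(dict(distance_histogram(adj)))
```

Here 15 pairs share only three distinct distance values, so an agglomerative procedure has to break many ties arbitrarily.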
Previous Methods: Dividing the network into sub-networks and then identifying modules based on their topology.
Problems:
1. These methods do not include a clear definition of a module; they do not formally determine which parts of the network are modules.
Some previous module definitions do not follow the intuitive concept of module exactly.
Limitation of Global Algorithms
– Biological networks are incomplete.
– Each vertex can belong to only one module.
139 Modules Obtained from DIP Yeast core PIN
Interconnected Module Network
MoNet: Feng Luo et al. (2007)
MoNet
– A new formal definition of network modules
– A new agglomerative algorithm for assembling modules
– Application to a yeast protein interaction dataset
Degree of Subgraph
Given a graph G, let S be a subgraph of G (S ⊆ G), and let $A = (A_{ij})$ be the adjacency matrix of sub-graph S and its neighbors N.
– Indegree of S, Ind(S): $\mathrm{Ind}(S) = \sum_{i<j} A_{ij}\,\theta(i,j)$, where $\theta(i,j)$ is 1 if both vertex i and vertex j are in sub-graph S and 0 otherwise.
– Outdegree of S, Outd(S): $\mathrm{Outd}(S) = \sum_{i<j} A_{ij}\,\lambda(i,j)$, where $\lambda(i,j)$ is 1 if only one of the vertices i and j belongs to S and 0 otherwise.
Degree of Subgraph: Example
Ind(1) = 16, Outd(1) = 5
Ind(2) = 7, Outd(2) = 4
Ind(3) = 8, Outd(3) = 5
Modularity
The modularity M of a sub-graph S in a given graph G is defined as the ratio of its indegree, ind(S), to its outdegree, outd(S): $M = \mathrm{ind}(S) / \mathrm{outd}(S)$
New Network Module Definition
A subgraph S ⊆ G is a module if M > 1.
Ind(1) = 16, Outd(1) = 5, M = 3.2
Ind(2) = 7, Outd(2) = 4, M = 1.75
Ind(3) = 8, Outd(3) = 5, M = 1.6
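The definition can be checked mechanically on a small graph. The following is a minimal sketch, assuming the graph is given as a list of undirected edges and that ind(S) and outd(S) are counted as numbers of edges. The function names are hypothetical, and treating a subgraph with no outgoing edges as having infinite modularity is an assumption, since the ratio is undefined in that case.

```python
def ind(edges, S):
    """Number of edges with both endpoints in S."""
    return sum(1 for u, v in edges if u in S and v in S)

def outd(edges, S):
    """Number of edges with exactly one endpoint in S."""
    return sum(1 for u, v in edges if (u in S) != (v in S))

def modularity(edges, S):
    """Ratio ind(S) / outd(S).

    Assumption: returns infinity when S has no outgoing edges.
    """
    o = outd(edges, S)
    return float("inf") if o == 0 else ind(edges, S) / o

def is_module(edges, S):
    """A subgraph is a module when its modularity exceeds 1."""
    return modularity(edges, S) > 1

# Toy graph: two triangles joined by a single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
print(modularity(edges, {1, 2, 3}), is_module(edges, {1, 2, 3}))
print(modularity(edges, {3, 4}), is_module(edges, {3, 4}))
```

In the toy graph each triangle has three internal edges and one outgoing edge, so it qualifies as a module, while the pair {3, 4} straddling the bridge does not.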
Agglomerative Algorithm for Identifying Network Modules Flow chart of the agglomerative algorithm
The Order of Merging
Edge Betweenness (Girvan-Newman, 2002)
– Defined as the number of shortest paths between all pairs of vertices that run through the edge.
– Edges between modules have higher betweenness values.
Betweenness = 20
The Order of Merging (continued)
Gradually deleting the edge with the highest betweenness generates an order of edges.
– Edges between modules are deleted earlier.
– Edges inside modules are deleted later.
Reverse the deletion order of the edges and use it as the merging order.
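The deletion-and-reversal procedure can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code: edge betweenness is computed with a Brandes-style accumulation for unweighted graphs (shortest paths that tie are split fractionally), it is recomputed after every deletion, and ties are broken by vertex labels, which is an arbitrary choice. The names `edge_betweenness` and `merging_order` are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness for an unweighted, undirected graph."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors.
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    sigma[v] = 0.0
                    preds[v] = []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Accumulate path counts from the farthest vertices back to s.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for u in preds[w]:
                c = sigma[u] / sigma[w] * (1.0 + delta[w])
                score[frozenset((u, w))] += c
                delta[u] += c
    # Each vertex pair was counted once from each endpoint.
    return {e: b / 2.0 for e, b in score.items()}

def merging_order(adj):
    """Delete the highest-betweenness edge repeatedly, then reverse."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    deleted = []
    while any(adj.values()):
        bet = edge_betweenness(adj)
        # Ties are broken by sorted endpoint labels (arbitrary choice).
        edge = max(bet, key=lambda e: (bet[e], tuple(sorted(e))))
        u, v = tuple(edge)
        adj[u].discard(v)
        adj[v].discard(u)
        deleted.append(tuple(sorted(edge)))
    return deleted[::-1]

# Toy graph: two triangles joined by the edge (3, 4).
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
print(merging_order(adj)[-1])
```

On the toy graph the bridge between the two triangles carries all nine cross-triangle shortest paths, so it is deleted first and therefore appears last in the merging order.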
When Does Merging Occur?
– Between two non-modules
– Between a non-module and a module
– Never between two modules
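These rules can be sketched as a small agglomerative loop. The flow chart itself is not reproduced here, so the following is only an assumed reading: start from single vertices, walk through the edges in a given merging order, and join the two groups an edge connects unless both already satisfy M > 1. The names `agglomerate` and `modularity` are hypothetical, and a group with no outgoing edges is assumed to have infinite modularity.

```python
def modularity(edges, S):
    """ind(S) / outd(S), counted in edges (infinite if outd is 0)."""
    inside = sum(1 for u, v in edges if u in S and v in S)
    outside = sum(1 for u, v in edges if (u in S) != (v in S))
    return float("inf") if outside == 0 else inside / outside

def agglomerate(edges, merge_order):
    """Merge groups along `merge_order`, never joining two modules."""
    group = {}
    for u, v in edges:
        group.setdefault(u, frozenset([u]))
        group.setdefault(v, frozenset([v]))
    for u, v in merge_order:
        a, b = group[u], group[v]
        if a == b:
            continue  # endpoints are already in the same group
        if modularity(edges, a) > 1 and modularity(edges, b) > 1:
            continue  # both are modules: do not merge
        merged = a | b
        for x in merged:
            group[x] = merged
    return sorted(sorted(g) for g in set(group.values()))

# Toy graph: two triangles joined by the edge (3, 4), which comes
# last in this hand-written merging order.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
order = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
print(agglomerate(edges, order))
```

By the time the bridge is considered, both triangles have modularity 3, so the merge is refused and the two triangles are reported as separate modules.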
MF-Algorithm By M. Hbibi, M. Sharifzade and C. Eslahchi
Definitions
For a vertex subset A of G, the edges of the induced subgraph G[A] are called the internal edges of A. The edges with one end in A and the other end in V(G) − A are called the external edges of A.
For a vertex v, the internal and external degree of v with respect to A are the numbers of neighbors of v inside A and outside A, respectively.
For predicting modules in a graph, we define a module score (mscore) for A and for v: mscore(A) and mscore_A(v).
MF Algorithm
Step 1: Assign the color white to all vertices. Sort the vertices according to their degree, and divide this sorted list into four equal (or nearly equal) parts.
[Figure: parts labeled A and B]
MF Algorithm
Step 2: If the module score of A in G, mscore(A), is greater than 1, then we consider A a candidate module (similarly for B).
Step 3: For each white vertex v ∈ A (or B), we calculate mscore_A(v) (or mscore_B(v)).
MF Algorithm
Step 4: Let v ∈ A be the vertex with minimum mscore among the white vertices. If mscore(v) < 1, set X = X − v and Y = Y + v, assign the color gray to v, and go to Step 2.
Else, if |X| > 3, start the algorithm from Step 1 for G[X] (similarly for G[Y]). Otherwise, the algorithm stops.
Filtering of MF Algorithm Results
Example of Module Overlap
Testing Data Set
Yeast Core Protein Interaction Network (PIN).
– The yeast core PIN from the Database of Interacting Proteins (DIP) (version ScereCR ).
– Total: 2609 proteins; 6355 links.
– Large component: 2440 proteins, 6401 interactions.
Comparison of MF, MoNet, and MCL
The p-value shows the statistical significance of a group of genes related to a specific GO (Gene Ontology) term. More significant modules have p-values closer to zero. The percentage of proteins in each module that are related to a specific GO term is denoted by D.
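The slides do not say which statistical test produced the p-values. A common choice for GO term enrichment is the hypergeometric tail probability, and the sketch below assumes that test purely for illustration. The function names and the example counts are hypothetical.

```python
from math import comb

def go_enrichment_pvalue(N, K, n, k):
    """Hypergeometric tail probability P(X >= k).

    N: proteins in the network
    K: proteins in the network annotated with the GO term
    n: proteins in the module
    k: module proteins annotated with the GO term
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

def annotated_percentage(n, k):
    """D: percentage of module proteins annotated with the GO term."""
    return 100.0 * k / n

# Hypothetical counts: 10 proteins, 5 annotated, a module of 5 proteins
# all of which carry the annotation.
print(go_enrichment_pvalue(10, 5, 5, 5), annotated_percentage(5, 5))
```

A module whose members are all annotated with the same term gets a small p-value and D = 100, while a module with no annotated members gets a p-value of 1.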
Some Examples of MF Results
MFA predicts not only dense, highly connected modules but also linear and non-dense ones, such as stars. Three such MFA modules, with various densities and topologies, are shown in the figure:
Conclusions
– Provide a framework for decomposing the protein interaction network into functional modules.
– The modules obtained appear to be biological functional modules, based on clustering of Gene Ontology terms.
– The network of modules provides a plausible way to understand the interactions between these functional modules.
– With the increasing amount of protein interaction data available, our approach will help construct a more complete view of interconnected functional modules to better understand the organization of the whole cellular system.
Questions?
Local Optimization Algorithm