BIOLOGICAL NETWORKS Woochang Hwang
BIOLOGICAL NETWORKS. Outline: Introduction; Biological Networks (Protein-Protein Interaction Networks, Signaling & Metabolic Pathway Networks, Expression Networks); Biological Networks' Properties; Databases; Discussion; STM Clustering Model. The talk introduces basic concepts and overviews, the three major categories of biological networks, network properties and phenomena, and public databases.
Introduction
Bioinformatics. Informatics: its carrier is a set of digital codes and a language. In its manifestation in the space-time continuum, it has utility (e.g., to decrease the entropy of an open system). Bioinformatics: the essence of life is information (i.e., from digital code to the emerging properties of biosystems), and bioinformatics is the study of the information content of life. Life is tremendously complex. We get huge amounts of data from the laboratory through high-throughput assays such as DNA chips, mass spectrometry, and yeast two-hybrid (Y2H). From those data, we try to summarize or mine useful information and patterns of life.
Proteomics. Chart labels: Genomics; Proteomics; Structural Proteomics; Functional Proteomics; Structure Determination; Database / Knowledge Source; Homology Modeling; Protein-Protein Interaction & Networking; Protein Expression; Post-translational Modification. This chart shows the relationships among genomics, proteomics, and their subcategories.
From the particular to the universal. This is the pyramid structure of informatics. Going from the bottom to the top, the quantity of information and the level of complexity grow exponentially. A.-L. Barabasi & Z. Oltvai, Science, 2002
Genome Size. The table on the left shows the genomic data size of each organism; the human genome is tremendously complex. The graph on the right shows the growth of data volume in GenBank. The increase is exponential.
Proteome Size (PDB)
BIOLOGICAL NETWORK. Many real-world systems, including biological ones, can be represented as networks. Networks are found in biological systems of varying scales: 1. the evolutionary tree of life; 2. ecological networks; 3. expression networks; 4. regulatory networks (the genetic control networks of organisms); 5. the protein interaction network in cells; 6. the metabolic network in cells; … and more biological networks.
Why Study Networks? It is increasingly recognized that complex systems cannot be described in a reductionist view. Understanding the behavior of such systems starts with understanding the topology of the corresponding network, and topological information is fundamental in constructing realistic models for the function of the network. We saw the complexity and volume of the data sets we are dealing with. Because of that complexity, it is very hard to analyze or manage their properties and underlying principles directly.
Biological Network Model. A network of interconnected nodes. Nodes: proteins, peptides, or non-protein biomolecules. Edges: biological relationships, i.e., biochemical processes such as interaction, regulation, reaction, transformation, activation, and inhibition.
Biological Network Model. It is usually represented by a 2-D diagram with characteristic symbols linking the protein and non-protein entities, and it is modular and hierarchical. A circle indicates a protein or a non-protein biomolecule. A symbol in between indicates the nature of the molecule-molecule process (activation, inhibition, association, dissociation, etc.).
Protein Interaction Network
Proteins in a cell. What functional roles do proteins play in a cell? There are thousands of different active proteins in a cell, acting as: enzymes, which catalyze the chemical reactions of metabolism; components of cellular machinery (e.g., ribosomes); and regulators of gene expression. Certain proteins play specific roles in particular cellular compartments. Others move from one compartment to another as "signals".
Protein Interactions. Proteins perform a function as a complex rather than as a single protein, so to understand the function of proteins we need to understand the interactions among them. Knowing whether two proteins interact can help us discover the functions of unknown proteins: if the function of one protein is known, the functions of its binding partners are likely to be related ("guilt by association"). Thus, a good method for detecting interactions allows us to use a small number of proteins with known function to characterize new proteins (function prediction).
Protein Interactions. The yeast two-hybrid (Y2H) method detects interactions using bait and prey proteins. P. Uetz, et al. Nature, 2000; Ito et al., PNAS, 2001; …
Yeast Protein Interaction Network. The whole yeast protein interaction map. Nodes: proteins. Links: physical interactions (binding). Two-hybrid analysis has been widely used.
Pathway Networks
Signaling & Metabolic Pathway Network. A pathway can be defined as a modular unit of interacting molecules that fulfills a cellular function. Signaling pathway networks: in biology, a signal or biopotential is an electric quantity (voltage, current, or field strength) caused by chemical reactions of charged ions. The term also refers to any process by which a cell converts one kind of signal or stimulus into another, and to the transfer of information between and within cells, as in signal transduction. Metabolic pathway networks: a series of chemical reactions occurring within a cell, catalyzed by enzymes, resulting in either the formation of a metabolic product to be used or stored by the cell, or the initiation of another metabolic pathway.
A Pathway Example. MAPK signaling pathway. The mitogen-activated protein kinase (MAPK) pathways transduce a large variety of external signals, leading to a wide range of cellular responses, including growth, differentiation, inflammation, and apoptosis.
A Pathway Example. Metabolic pathway for the TCA cycle. Nodes: substrates and products. Edges: enzymes, not simple binary interactions.
A Pathway Example. This is a typical example of cellular processes. It illustrates the beginning and end of a human cell.
Regulatory Network. A collection of DNA segments (genes) in a cell that interact with each other and with other substances in the cell, thereby governing the rates at which genes in the network are transcribed into mRNA.
Regulatory Network. Relationships among genes in a cell: how they activate or inhibit each other in order to function properly.
Expression Network. A network representation of genomic data, inferred from genomic data such as microarrays. Edges reflect similarity: genes are co-expressed, or one leads or follows the other's expression.
BIOLOGICAL NETWORK PROPERTY Interaction Network Pathway Network Regulatory Network Expression Network
Biological Networks Properties. Power-law degree distribution: rich get richer. Small world: a small average path length (the mean shortest node-to-node path); any node can be reached in fewer than 5 hops. Robustness: resilient, with strong resistance to failure under random attacks, but vulnerable to targeted attacks. Hierarchical modularity: a large clustering coefficient (how many of a node's neighbors are connected to each other); more modular than random networks.
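Two of these properties can be measured directly on an adjacency structure. The sketch below is only an illustration on a made-up star graph: the function names and the toy data are assumptions, and a real PPI network would be loaded from a database such as DIP.

```python
from collections import Counter, deque

def degree_distribution(adj):
    """Fraction of nodes having each degree k."""
    n = len(adj)
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}

def bfs_distances(adj, source):
    """Hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_path_length(adj):
    """Mean shortest node-to-node path over all reachable ordered pairs."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

# Hypothetical toy network: one hub connected to four leaves.
toy = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"},
}
```

On the toy star, one hub is enough to keep every pair of nodes within two hops, which is the intuition behind the small-world property.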
Power Law Network. Preferential attachment during growth: the probability that a new vertex will be connected to vertex i depends on the connectivity of that vertex. High-degree nodes have a higher probability of being connected to a newly added node.
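A minimal sketch of growth with preferential attachment follows. It assumes the usual linear rule, in which the chance of attaching to node i is its degree divided by the sum of all degrees. The function name, the fully connected seed core, and the parameter m (edges added per new node) are illustrative choices, not taken from the slide.

```python
import random

def grow_preferential(n_nodes, m, seed=0):
    """Grow a network where each new node attaches to m existing nodes,
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    # Assumed starting point: a fully connected core of m + 1 nodes.
    adj = {i: set() for i in range(m + 1)}
    for i in range(m + 1):
        for j in range(i + 1, m + 1):
            adj[i].add(j)
            adj[j].add(i)
    # Each node appears once per edge endpoint, so a uniform draw from
    # this list is a degree-proportional draw over nodes.
    endpoints = [u for u in adj for _ in adj[u]]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            endpoints.extend([new, t])
    return adj

network = grow_preferential(200, 2)
```

Nodes that arrive early and collect links keep collecting them, which is the "rich get richer" effect that produces hubs.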
Power Law Network (Scale Free). Barabási's group proposed the power-law network model, the Barabási-Albert (BA) model. It is shown alongside the ER and WS models and real networks (actors, power grid, WWW). (a) Random networks: the probability of finding a highly connected node decreases exponentially with k. (b) Power-law networks: the probability P(k) of a node having degree k follows a power law, P(k) ~ k^(-γ).
Small World Property. A small average path length: any node can be reached within a small number of edges, typically 4 to 5 hops.
Power Law Network. Power-law degree distributions and small-world phenomena are also observed in communication networks, web graphs, research citation networks, and social networks; many real-world networks show the same behaviors. Classical Erdős-Rényi random graphs do not exhibit these properties: links are picked uniformly between pairs of a fixed set of nodes, the maximum degree is logarithmic in the network size, and there are no hubs to make short connections between nodes.
Attack Tolerance. Complex systems maintain their basic functions even under errors and failures such as node failure (cell mutations; Internet router breakdowns). They are robust to random attacks but sensitive to targeted attacks, e.g., on hubs.
Attack Tolerance. Plots: maximum cluster size and path length as nodes are removed, random attack vs. targeted attack. The maximum cluster size changes as nodes are removed. Power-law networks are robust: for a degree exponent γ < 3, removing nodes at random does not break the network into islands. They are very resistant to random attacks, but attacks targeting key nodes are more dangerous. A random network is vulnerable to random attacks; a power-law network is not.
Protein Interaction Network. The yeast protein interaction map also shows the power-law property. H. Jeong, S.P. Mason, A.-L. Barabasi & Z.N. Oltvai, Nature, 2001
Protein Interaction Network. The yeast protein interaction network seems to reveal some basic graph-theoretic properties. The frequency of proteins interacting with exactly k other proteins follows a power law. The network exhibits the small-world phenomenon: any node can be reached within a small number of hops, usually 4 or 5. Robustness: the network is resilient, with strong resistance to failure under random attacks, but vulnerable to targeted attacks. It is also more modular than random networks.
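The random vs. targeted comparison can be sketched as a small experiment that removes nodes and measures the largest remaining connected cluster. The helper names and the star-shaped toy network are assumptions for illustration; they are not the networks shown in the plots.

```python
import random

def largest_component_size(adj, removed=frozenset()):
    """Size of the largest connected cluster after deleting `removed`."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best

def attack(adj, n_remove, targeted, seed=0):
    """Remove n_remove nodes, highest degree first (targeted) or in
    random order, and report the largest surviving cluster."""
    if targeted:
        order = sorted(adj, key=lambda u: len(adj[u]), reverse=True)
    else:
        order = random.Random(seed).sample(sorted(adj), len(adj))
    return largest_component_size(adj, frozenset(order[:n_remove]))

# Hypothetical toy network: node 0 is a hub linked to nine leaves.
star = {0: set(range(1, 10)), **{i: {0} for i in range(1, 10)}}
```

On this toy network, removing the hub leaves only isolated nodes, while removing any leaf barely changes the largest cluster.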
Hierarchical Modularity E. Ravasz et al., Science, 2002
Hierarchical Modularity. Protein networks and metabolic networks show the same clustering coefficient distribution. E. Ravasz et al., Science, 2002
Implications From Observations. Biological complexity: number of states ~ 2^(number of genes). Protein hubs are critical for cells, 45%. Infections will target highly connected nodes. Cascading node failures could cause a critical problem. Developing drugs and treatments with novel strategies, such as targeting effective nodes, is indispensable. Hubs are good targets but carry more side effects due to their high connectivity.
Databases
Protein Databases. Two typical public protein databases and their sizes. Swiss-Prot (non-redundant database): Release 41.0, 11/4/2003: 124,464 entries; Release 41.5, 23/4/2002: 125,236 entries. TrEMBL (translations of EMBL nucleotide sequences not yet integrated into Swiss-Prot): Release 23.7, 17/4/2003: 863,248 entries. This number keeps growing rapidly, mainly due to large-scale sequencing projects.
Protein Interaction Databases. The databases are listed by category.
Species-specific: FlyNets (gene networks in the fruit fly); MIPS (Yeast Genome Database); RegulonDB (a database on transcriptional regulation in E. coli); SoyBase; PIMdb (Drosophila Protein Interaction Map database).
Function-specific: Biocatalysis/Biodegradation Database; BRITE (Biomolecular Relations in Information Transmission and Expression, part of KEGG); COPE (Cytokines Online Pathfinder Encyclopaedia); Dynamic Signaling Maps; EMP (the Enzymology Database); FIMM (a database of Functional Molecular Immunology); CSNDB (Cell Signaling Networks Database).
Protein Interaction Databases.
Interaction type-specific: DIP (Database of Interacting Proteins); DPInteract (DNA-protein interactions); Inter-Chain Beta-Sheets (ICBS, a database of protein-protein interactions mediated by interchain beta-sheet formation); Interact (a protein-protein interaction database); GeneNet (gene networks).
General: BIND (Biomolecular Interaction Network Database); BindingDB (the Binding Database); MINT (a database of Molecular INTeractions); PATIKA (Pathway Analysis Tool for Integration and Knowledge Acquisition); PFBP (Protein Function and Biochemical Pathways Project); PIM (Protein Interaction Map).
It is a little hard to get familiar with these databases because of the biological terms. You need to play with them and get used to them.
Pathway Databases.
KEGG (Kyoto Encyclopedia of Genes and Genomes), http://www.genome.ad.jp/kegg/, Institute for Chemical Research, Kyoto University.
PathDB, http://www.ncgr.org/pathdb/index.html, National Center for Genomic Resources.
SPAD (Signaling PAthway Database), Graduate School of Genetic Resources Technology, Kyushu University.
Cytokine Signaling Pathway DB, Dept. of Biochemistry, Kumamoto Univ.
EcoCyc and MetaCyc, Stanford Research Institute.
BIND (Biomolecular Interaction Network Database), UBC, Univ. of Toronto.
These databases also include protein information and other biological network information.
KEGG. A brief introduction to KEGG, the most popular and largest pathway database.
Pathway Database: computerizes current knowledge of molecular and cellular biology in terms of the pathways of interacting molecules or genes.
Genes Database: maintains gene catalogs of all sequenced organisms and links each gene product to a pathway component.
Ligand Database: organizes a database of all chemical compounds in living cells and links each compound to a pathway component.
Pathway Tools: develops new bioinformatics technologies for functional genomics, such as pathway comparison, pathway reconstruction, and pathway design.
This is the expanded
The data are provided in HTML and XML formats.
Discussion. Ongoing research and possible problems.
Network inference: microarrays, protein chips, and other high-throughput assay methods.
Function prediction: the function of 40-50% of new proteins is unknown. Understanding biological function is important for the study of fundamental biological processes, drug design, and genetic engineering.
Functional module detection: cluster analysis.
Topological analysis: descriptive and structural analysis; locality analysis; essential component analysis.
Dynamics analysis: signal flow analysis; metabolic flux analysis; steady state, response, and fluctuation analysis.
Evolution analysis.
Biological networks are very rich networks with very limited, noisy, and incomplete information. Discovering their underlying principles is very challenging.
Signal Transduction Model Based Functional Module Detection Algorithm for Protein-Protein Interaction Networks. Woochang Hwang (1), Young-Rae Cho (1), Aidong Zhang (1), Murali Ramanathan (2). (1) Department of Computer Science and Engineering, State University of New York at Buffalo; (2) Department of Pharmaceutical Sciences, University at Buffalo, The State University of New York. Good morning everyone. My name is Woochang Hwang. Today I will present a new statistical clustering approach for PPI networks, which we call the signal transduction model based functional module detection algorithm.
Contents. Introduction; Protein Interaction Networks; Functional Categories; Functional Module Detection Algorithm; Signal Transduction Model (STM); Experimental Results; Discussion; Future Works. Here is the outline of my talk today. First, I will present some background and an introduction to this work. Then I will talk about the properties of PPI networks and the functional categories available to the public. After that, I will introduce our new clustering model and explain the experimental results on the yeast PPI network in detail. Finally, I will discuss the performance and possible future work.
Introduction. Cellular functions are carried out in a coordinated way by groups of genes and gene products (proteins). Detecting such functional modules in a complex molecular network is one of the most challenging problems, because molecular networks have high data volume, a high noise level, sparse connectivity, and so on. PPI data: the full S. cerevisiae PPI data set in DIP has over 4900 proteins and 18000 interactions. We work on molecular network data and try to find useful information from them. PPI data sets provide very useful information about complex systems and a good opportunity to analyze and understand the underlying principles and the structure of large living systems.
Cluster Assessment. Before I go into the main part of my talk, I want to briefly introduce some metrics for graph properties and cluster quality. The structures of the clusters identified by STM and by competing approaches are assessed using several metrics, including the following. Clustering coefficient: measures the connectivity among the direct neighbors of a node; N(v) is the set of the direct neighbors of node v and d(v) is the number of the direct neighbors of node v. Betweenness centrality: measures the number of shortest paths passing through a node in a graph; σ_st is the number of shortest paths from node s to node t and σ_st(v) is the number of shortest paths from s to t that pass through node v. P-value: C is the size of the cluster, which contains k proteins with a given function; G is the size of the universal set of known proteins, which contains n proteins with the function. The p-value is the probability that a cluster would be enriched with proteins with a particular function by chance alone, so lower is better: a lower p-value means that the detected cluster is unlikely to have formed randomly. Density: measures the connectivity in a graph; n is the number of proteins and e is the number of interactions in a subgraph s of a PPI network.
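Three of these metrics can be sketched in a few lines. The formulas used here are the standard definitions that match the symbol descriptions above (local clustering coefficient, 2e / (n(n - 1)) for density, and a hypergeometric tail for the p-value). They are assumptions in that sense, and the function names and toy graph are hypothetical. Betweenness centrality is omitted to keep the sketch short.

```python
from math import comb

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's direct neighbors N(v) that are linked."""
    nbrs = adj[v]
    d = len(nbrs)
    if d < 2:
        return 0.0
    links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return 2 * links / (d * (d - 1))

def density(n, e):
    """Edges present divided by edges possible among n proteins."""
    return 2 * e / (n * (n - 1)) if n > 1 else 0.0

def enrichment_p_value(G, n, C, k):
    """Chance that a random cluster of size C, drawn from G proteins of
    which n carry the function, holds at least k proteins with it."""
    upper = min(C, n)
    tail = sum(comb(n, i) * comb(G - n, C - i) for i in range(k, upper + 1))
    return tail / comb(G, C)

# Hypothetical toy graph: a triangle a-b-c with a pendant node d on a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
```

For example, a cluster of 5 proteins that all share a function carried by only 5 of 10 known proteins has a p-value of 1/252, since there is exactly one such cluster among the 252 possible ones.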
Protein-Protein Interaction (PPI) Data & MIPS Functional Category Data. We work mainly on these two data sets in this research. DIP yeast protein interaction core data: 2521 proteins and 5949 interactions. Its average clustering coefficient is 0.069, which is considerably lower than we would expect, and its average path length is 5.47, which reveals the small-world phenomenon. MIPS functional categories: 457 hierarchical functional categories. The subgraph of each functional category is extracted from the DIP core data. Average graph density: 0.0025. Average diameter (the longest shortest path in a graph): 4.23.
MIPS functional modules in DIP Protein-Protein Interaction (PPI) Network. Two example functional categories are shown on this slide: the subgraph of the Mitochondrial Transport function on the left and Mitosis on the right. I extracted the connectivity of these MIPS functional categories from the DIP core PPI network. As you can see, singletons and isolated parts are easy to find in each MIPS functional category. Furthermore, the diameters of these functions are fairly long, close to the average path length of the whole PPI network. Figure 1. (a) Mitochondrial Transport: 19 singletons, diameter 6. (b) Mitosis: 20 singletons, diameter 3.
Topological Properties of MIPS Functional Modules in DIP Protein Interaction Data. Sparse connectivity: low density, with isolated subgraphs and singletons. Longish shape: high diameter. So we can observe that the topological connectivity of actual MIPS functions is sparse and their shapes are longish.
Related works. So far we have briefly discussed the topological properties of the PPI network and the functional categories. On this slide I briefly mention existing clustering approaches in this area, which fall into two major categories. Distance-based approaches: several distance metrics were introduced, but the majority use traditional clustering algorithms. Graph-based approaches: density-based methods (Maximal Cliques, Quasi Cliques, RNSC, HCS, MCODE) try to find clusters using the density of the modules. Maximal Clique finds fully connected subgraphs, but complete connectivity is too strict a condition, so relaxed variants such as Quasi Clique were introduced; they are basically the same in finding modules of a certain density. Statistical methods (MCL, Samantha) take a statistical approach to finding clusters in networks.
Related works. Existing methods suffer from a limited view of clustering. They concentrate on densely connected regions and identify only clusters with specific shapes, e.g., balanced, round shapes with high density. But actual functional modules are not as densely connected as these methods expect: some members of a functional category have no direct physical interaction with other members of the category they belong to, isolated subparts and singletons occur, and modules with longish shapes and long diameters are frequently observed. The incompleteness of clustering that results from overemphasis on dense regions is another distinct drawback of existing algorithms, which produce many small clusters and singletons.
Contribution: STM Clustering Model. Observations: functional categories have unexpected properties, and PPI networks are sparsely connected. A relative excess of emphasis on density in existing methods favors detecting clusters with relatively balanced, round shapes, leads to a high discarding rate, and limits performance. Effective clustering should be able to detect clusters of arbitrary shape and density if the cluster members share biological and topological similarities. To take those unexpected properties of PPI networks and actual functional modules into consideration, and to overcome the drawbacks of existing approaches, the STM clustering model uses a statistical signal transduction model to find modules whose members share common biological features even though they are sparsely connected. The STM model also incorporates the network's topological properties.
STM Clustering Model. Process 1: Simulation of dynamic statistical signal transduction behavior in the network. STM simulates dynamic signal transduction behavior, according to Equation 2, to find the proteins that are most influential on each protein in the PPI network, biologically and topologically. Process 2: Selection of the putative cluster representatives for each node. Each node selects as its representative the node that scored the highest value based on Equation 2. Process 3: Preliminary cluster formation. Preliminary clusters are formed by accumulating each node toward its chosen representatives. Process 4: Cluster merge. Up to this point, STM has considered only the biological features and topological connectivity of the network and its components, not the similarity among preliminary clusters. Clusters that have significant interconnections between them should have substantial similarity, so in Process 4 STM merges the clusters that have substantial similarity.
Statistical Signal Transduction Model. From those observations, and to overcome the drawbacks of existing methods, we introduce a signal transduction model based clustering approach. The signal transduction behavior of the network is modeled by the Erlang distribution, a special case of the Gamma distribution (Equation 1), where c > 0 is the shape parameter, b > 0 is the scale parameter, and x >= 0 is the independent variable, usually time. We use it to measure the perturbation from a source node to a target node. The Erlang distribution with x/b = 1 is used, and c is set to the number of nodes (tasks) between the source protein node and the target protein node in each node pair. Setting x/b to unity assesses the perturbation at the target protein when the perturbation reaches 1/e of its initial value at the nearest neighbor of the source protein node.
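As a rough sketch, the response term can be evaluated with the Erlang density. The density form (rather than the cumulative form) is an assumption, chosen because it gives exactly 1/e at c = 1 when x/b = 1 and b = 1, which matches the remark about the nearest neighbor. The function names and the default b = 1 are also assumptions.

```python
from math import exp, factorial

def erlang_density(x, b, c):
    """Erlang density with integer shape c >= 1 and scale b > 0."""
    if c < 1 or b <= 0 or x < 0:
        raise ValueError("need c >= 1, b > 0, x >= 0")
    return (x / b) ** (c - 1) * exp(-x / b) / (b * factorial(c - 1))

def signal_response(c, b=1.0):
    """Response after c steps, evaluated at x/b = 1 as on the slide.
    b = 1 is an assumed default so that the c = 1 value is exactly 1/e."""
    return erlang_density(b, b, c)
```

With these assumptions the response falls off factorially with c: signal_response(1) is 1/e, signal_response(3) is 1/(2e), and so on, so distant proteins receive a much weaker perturbation.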
Statistical Signal Transduction Model. Here is a simple signal transduction model based on the Erlang distribution. Statistically, the Erlang distribution represents the time required to carry out a sequence of c tasks whose durations are identical exponential probability distributions. Here c is the number of tasks to be accomplished and b is the time constant for signal transfer, which is not a fixed time. The model represents the chance that the actual time to accomplish c tasks will be less than or equal to b. Figure 2. The pharmacodynamic signal transduction model whose bolus response is an Erlang distribution. The b is the time constant for signal transfer and c is the number of compartments.
Topologically Modified Signal Transduction Model. In addition to the Erlang distribution, we incorporate the topological properties of the network by weighting it: $S(v \to w) = \frac{d(v)}{\prod_{i \in P(v,w)} d(i)} \times F(c)$ (Equation 2), where d(i) is the degree of node i, P(v,w) is the set of all visited nodes on the shortest path from node v to node w, excluding the source node v and the target node w, and F(c) is the signal transduction behavior function. The perturbation induced by the source protein node is assumed to be proportional to its degree and to follow the shortest path to the target protein node. During transduction to the target protein node, the perturbation is assumed to be dissipated at each intermediate node visited, in proportion to the reciprocal of that node's degree. Our choice of the shortest path is motivated by the finding that the majority of flux prefers the path of least resistance in many physicochemical and biological systems.
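The weighting described above can be sketched as follows. Several details are assumptions rather than the authors' implementation: the response term reuses the Erlang density at x/b = 1 with b = 1, c is taken as the hop count with a minimum of 1 (so a node also sends a signal to itself), and when several shortest paths exist the first one found by breadth-first search is used.

```python
from collections import deque
from math import exp, factorial

def shortest_path(adj, source, target):
    """One shortest path as a list of nodes, or None if unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            break
        for v in sorted(adj[u]):  # sorted for a deterministic tie-break
            if v not in parent:
                parent[v] = u
                queue.append(v)
    if target not in parent:
        return None
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

def signal(adj, v, w):
    """Perturbation from source v arriving at target w."""
    path = shortest_path(adj, v, w)
    if path is None:
        return 0.0
    c = max(len(path) - 1, 1)              # assumed: hop count, at least 1
    response = exp(-1.0) / factorial(c - 1)  # Erlang density at x/b = 1, b = 1
    dissipation = 1
    for i in path[1:-1]:                   # intermediate nodes only
        dissipation *= len(adj[i])
    return len(adj[v]) / dissipation * response

# Hypothetical toy chain a - b - c.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
```

On the toy chain, the signal from a to c is halved at the intermediate node b because b has degree 2, while b, having the higher degree, sends a stronger signal to its neighbors than a does.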
Process 1: Signal Transduction Simulation. Here is a small, simple example. Each box contains the numerical values, obtained from Equation 2, of the signals from nodes A, F, G, and H to other target nodes. For example, node G will choose nodes F and H as its representatives since they scored highest at node G, so node G will belong to the clusters formed by nodes F and H. Node A will choose itself as its representative since the highest-scoring node at node A is itself. Figure 3. Blue arrows are signals from node A and red ones are from node H. Results for other nodes are not shown.
Process 2: Representatives Selection. Figure 4. A simple network. Each box contains the numerical values, obtained from Equation 2, of the signals from source nodes A, F, G, and H to other target nodes, although signals should be propagated from every node in the network. Results for other nodes are not shown.
Process 3: Preliminary Clusters Formation. Figure 5. Three preliminary clusters, {A, B, C, D, E, F}, {F, G, L, N}, {G, H, I, J, K, M}, are obtained after Process 3.
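Processes 2 and 3 can be sketched as follows, given a table of signal scores. The scores below are made up for illustration and are not the values in the figures. Two choices are assumptions: ties produce several representatives (as with node G choosing both F and H), and a representative is always placed in its own cluster (as node F appears in the cluster it represents in Figure 5).

```python
def select_representatives(scores, tol=1e-12):
    """scores[source][target] is the signal from source arriving at target.
    Each target picks the source(s) with the highest arriving signal."""
    targets = {t for row in scores.values() for t in row}
    reps = {}
    for t in targets:
        received = {s: row[t] for s, row in scores.items() if t in row}
        best = max(received.values())
        reps[t] = {s for s, val in received.items() if best - val <= tol}
    return reps

def preliminary_clusters(reps):
    """Gather every node toward each of its chosen representatives."""
    clusters = {}
    for node, chosen in reps.items():
        for r in chosen:
            # Assumed: a representative always belongs to its own cluster.
            clusters.setdefault(r, {r}).add(node)
    return clusters

# Hypothetical scores, not taken from the slides.
scores = {
    "A": {"A": 3.0, "B": 2.0, "F": 1.5},
    "F": {"F": 1.0, "G": 1.2, "B": 0.5},
    "H": {"H": 2.0, "G": 1.2},
}
clusters = preliminary_clusters(select_representatives(scores))
```

Because a node with tied scores joins several clusters, the preliminary clusters can overlap, as they do in Figure 5.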
Cluster Merge. In the merge process, the similarity of two clusters i and j is defined as $S(i,j) = \frac{\mathrm{interconnectivity}(i,j)}{\mathrm{minsize}(i,j)}$ (Equation 3), where interconnectivity(i, j) is the number of connections between clusters i and j, and minsize(i, j) is the size of the smaller of clusters i and j. The pair of clusters with the highest similarity is merged in each iteration, and the merge process iterates until the highest similarity over all cluster pairs is less than a given threshold. Experimentally, we see that when interconnectivity(i, j) >= minsize(i, j), clusters i and j have substantial interconnections.
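The merge loop can be sketched directly from this description. The toy graph and cluster lists are hypothetical, and counting an edge once when its endpoints lie in the two clusters is an assumed reading of "connections between clusters", especially where clusters overlap.

```python
def interconnectivity(adj, ci, cj):
    """Number of distinct edges with one endpoint in ci and one in cj."""
    edges = set()
    for u in ci:
        for v in adj[u]:
            if v in cj and u != v:
                edges.add(frozenset((u, v)))
    return len(edges)

def merge_clusters(adj, clusters, threshold):
    """Repeatedly merge the most similar pair until the best similarity
    falls below the threshold."""
    clusters = [set(c) for c in clusters]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                links = interconnectivity(adj, clusters[i], clusters[j])
                sim = links / min(len(clusters[i]), len(clusters[j]))
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

# Hypothetical toy graph: two triangles joined by three edges, plus a tail.
graph = {
    1: {2, 3, 5}, 2: {1, 3, 4}, 3: {1, 2, 4},
    4: {2, 3, 5, 6}, 5: {1, 4, 6}, 6: {4, 5, 7},
    7: {6, 8}, 8: {7},
}
start = [{1, 2, 3}, {4, 5, 6}, {7, 8}]
```

With a threshold of 1.0 the two triangles merge (three connecting edges over a minimum size of three) while the tail stays separate; with a threshold of 2.0 nothing merges, mirroring the contrast between Figures 6 and 7.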
Process 4: Cluster Merge Figure 6. Two clusters, {A, B, C, D, E, F, G, L, N} and {G, H, I, J, K, M}, are obtained after the merge process when 1.0 is used as the merge threshold.
Process 4: Cluster Merge Figure 7. Three clusters, {A, B, C, D, E, F}, {F, G, L, N}, and {G, H, I, J, K, M}, are obtained after Process 4 when 2.0 is used as the merge threshold; that is, the three preliminary clusters remain unmerged.
Experimental Results Protein Interaction Data The core data of S. cerevisiae was obtained from the DIP database: 2526 proteins and 5949 filtered, reliable physical interactions. We now evaluate our method on this yeast PPI network. Species such as S. cerevisiae provide important test beds for the study of PPI networks: yeast is a well-studied organism for which most proteomics data is available, by virtue of a defined and relatively stable proteome, full genome clone libraries, established molecular biology experimental techniques, and an assortment of well-designed genomics databases.
Clustering Performance Analysis 60 clusters Average size: 40.1 Average density: 0.2145 Average P-value: 13.7 Average hit %: 51.7 Average unknown %: 5.1 The cluster list contains all 60 clusters that have more than 4 members. The function with the best p-value for each cluster is assigned to that cluster as its major function. The biggest cluster has 214 members, and the average cluster size is 40.1. The average cluster density is 0.2145, the average hit percentage on the assigned function is 51.7, and on average 5.1% of cluster members are functionally unannotated. Table 1. All 60 clusters that have more than 4 proteins.
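The slides do not state how cluster density is computed. A common choice, assumed here purely for illustration, is the fraction of possible node pairs inside the cluster that are actually connected, 2e / (n(n - 1)) for n nodes and e internal edges. The function name and toy edges below are hypothetical.

```python
def cluster_density(edges, cluster):
    """Density of a cluster as 2e / (n(n - 1)) (assumed definition).

    `edges` lists each undirected edge once; only edges with both
    endpoints inside `cluster` are counted, and self-loops are ignored.
    """
    n = len(cluster)
    if n < 2:
        return 0.0
    internal = sum(1 for u, v in edges
                   if u != v and u in cluster and v in cluster)
    return 2 * internal / (n * (n - 1))

# Hypothetical toy graph: a triangle plus one pendant node.
toy_edges = [("p", "q"), ("q", "r"), ("p", "r"), ("r", "s")]
triangle_density = cluster_density(toy_edges, {"p", "q", "r"})
chain_density = cluster_density(toy_edges, {"q", "r", "s"})
```

Under this definition a fully connected cluster has density 1.0, so an average of 0.2145 would indicate that the identified clusters include sparsely connected regions as well as dense ones.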
Comparative Analysis Table 2. Performance analysis of the clusters with more than 4 members. Our identified clusters are compared with the clusters detected by six other clustering approaches. The other methods can only detect small clusters. Their P-scores are relatively high considering their high discard rates (e.g., Maximal Clique, Quasi Clique, Samantha), due to the mass production of small clusters with fewer than 5 members, the discarding of sparsely connected proteins, and high overlaps among many small clusters that are highly enriched for the same function. As the table shows, our method outperformed the other methods: its p-values are 2.2 orders of magnitude, or approximately 125-fold, lower than those of Quasi Clique, the best-performing alternative clustering method, on biological function. It also shows good performance on localization, which is reported in the last column.
Computational Complexity We have seen the performance of our method; now we consider its computational complexity. Our signal transduction based model is fundamentally built on an all-pairs shortest path algorithm that measures the distance between all pairs of nodes. This can be implemented in O(V^2 log V + VE) using Johnson's algorithm, where V is the number of nodes and E is the number of edges in the network. The time required to find the best cluster pair, the one with the highest similarity, is O(k^2 log k) using a heap-based priority queue, where k is the number of preliminary clusters. Since k is much smaller than V in sparse networks like the yeast PPI network, the total time complexity of our algorithm is bounded by the all-pairs shortest path computation, which is O(V^2 log V + VE).
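As a rough illustration of the all-pairs shortest path step, the sketch below runs Dijkstra's algorithm from every node. It assumes non-negative edge weights, in which case Johnson's algorithm needs no reweighting pass and reduces to repeated Dijkstra. Note that Python's binary heap gives O(V (V + E) log V) overall rather than the O(V^2 log V + VE) bound quoted above, which relies on a Fibonacci heap. The adjacency format and names are hypothetical.

```python
import heapq

def dijkstra(adj, source):
    """Shortest distances from `source`; adj[u] is a list of (neighbor, weight) pairs."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a shorter path was already found
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def all_pairs_shortest_paths(adj):
    """Run Dijkstra from every node; unreachable nodes are absent from the inner dict."""
    return {u: dijkstra(adj, u) for u in adj}

# Hypothetical undirected toy network, each edge listed in both directions.
adj = {
    "A": [("B", 1), ("C", 3)],
    "B": [("A", 1), ("C", 1)],
    "C": [("A", 3), ("B", 1), ("D", 2)],
    "D": [("C", 2)],
}
dist = all_pairs_shortest_paths(adj)
```

Here the shortest route from A to C goes through B with total length 2, rather than along the direct edge of weight 3.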
Discussion In head-to-head comparisons, our algorithm outperformed competing approaches and is capable of effectively detecting both dense and sparsely connected, biologically relevant functional modules with fewer discards. The clusters identified had p-values that are 2.2 orders of magnitude, or approximately 125-fold, lower than those of Quasi Clique, the best-performing alternative clustering method, on biological function; our method also performs well on localization. Incompleteness of clustering is another distinct drawback of existing algorithms: they produce many small clusters and singletons, and they discard too many nodes because they concentrate only on dense regions. Our method discarded only about 7.8% of proteins, far lower than the 59% discarded on average by the other approaches. In conclusion, our method has strong pharmacodynamics-based underpinnings and is an effective, versatile approach for analyzing protein-protein interactions.
Thanks!