Biological networks: Theory and applications
Jianhua Ruan, Department of Computer Science, UTSA

Lecture outline
- Basic terminology and concepts in networks
- Some interesting relationships between network properties and biological functions
- Network-based approach to complex diseases

Biological networks
An abstract model of the complex relationships among molecules in the cell
- Vertices: molecules (genes, proteins, metabolites)
- Edges: relationships (regulation, interaction)
Many types of biological networks:
- Protein-protein interaction networks
- Protein-DNA (RNA) interaction networks
- Genetic interaction networks
- Metabolic networks
- Signal transduction networks
- Neural networks
- Etc.

Protein-protein interaction networks
Red: lethal Green: non-lethal Yellow: unknown

A Pathway Example

Obtaining biological networks
Direct experimental methods
- Protein-protein interaction networks
  - Yeast two-hybrid (Y2H)
  - Tandem affinity purification
  - Co-immunoprecipitation
- Protein-DNA interaction
  - Chromatin immunoprecipitation followed by microarray (ChIP-chip) or sequencing (ChIP-seq)

Yeast-2-hybrid
[Figure: Y2H overview; image courtesy Wikipedia.org]

Computational Predictions of PPIs
- Empirical predictions
- Theoretical predictions
  - Coevolution at the residue level
  - Coevolution at the full sequence level

Mirrortree

Why networks?
Studying genes/proteins at the network level allows us to:
- Assess the role of individual genes/proteins in the overall pathway
- Evaluate redundancy of network components
- Identify candidate genes involved in genetic diseases
- Set up the framework for mathematical models
For complex systems, the actual output may not be predictable by looking only at individual components: the whole is greater than the sum of its parts

Structural properties of networks
- Degree distribution
- Average shortest path length
- Clustering coefficient
- Community structure
- Degree correlation
Motivation to study structural properties:
- Structure determines function
- Functional structural properties may be shared by different types of real networks (biological or non-biological)

Degree of a node (k)
Degree of the i-th node: k_i = number of nodes linked to it

Degree of a node (k), directed network
k_in = number of nodes linking in
k_out = number of nodes linking out
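
The degree definitions above can be sketched in a few lines, assuming the network is given as a plain edge list. The function names `degrees` and `in_out_degrees` and the toy edges are illustrative choices, not part of the lecture.

```python
from collections import Counter

def degrees(edges):
    """Undirected degree k_i: number of edges touching each node."""
    k = Counter()
    for u, v in edges:
        k[u] += 1
        k[v] += 1
    return dict(k)

def in_out_degrees(arcs):
    """Directed degrees: k_in counts incoming arcs, k_out counts outgoing arcs."""
    k_in, k_out = Counter(), Counter()
    for src, dst in arcs:
        k_out[src] += 1
        k_in[dst] += 1
    return dict(k_in), dict(k_out)

# Hypothetical toy network: a triangle A-B-C with a pendant node D.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
undirected = degrees(edges)

# Hypothetical regulatory arcs (regulator, target).
k_in, k_out = in_out_degrees([("A", "B"), ("A", "C"), ("C", "B")])
```

The sketch assumes each edge is listed once and that there are no self-loops.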

Average shortest path length
l, the average shortest path length over all node pairs (i, j), measures a network's overall navigability
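
A minimal sketch of how the average shortest path length could be computed for an unweighted network, using breadth-first search from every node. The adjacency-dictionary input and the choice to average only over mutually reachable pairs are assumptions made here for illustration.

```python
from collections import deque

def average_shortest_path_length(adj):
    """Mean BFS distance over all ordered pairs of distinct, reachable nodes."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:          # first visit gives the shortest distance
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1             # exclude the source itself
    return total / pairs if pairs else 0.0

# Hypothetical example: a path 1 - 2 - 3 - 4.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
avg_len = average_shortest_path_length(path)
```

Each BFS costs O(n + m), so the whole computation is O(n(n + m)); for very large networks one would typically sample source nodes instead.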

Clustering Coefficient
The i-th node has k_i neighbors linked to it
E_i is the actual number of links among those k_i neighbors
The maximal number of links among k_i neighbors is k_i(k_i-1)/2
$C_i = \frac{2E_i}{k_i(k_i-1)}$ (= 2/9 for the node in the figure)
Interpretation: the probability that two of your friends are also friends
Clustering coefficient of a network: the average clustering coefficient of all nodes
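
As a worked illustration of the formula C_i = 2E_i / (k_i(k_i-1)), the sketch below computes node-level and network-level clustering coefficients. It assumes an undirected network stored as a dictionary of neighbor sets, and it assigns C_i = 0 to nodes with fewer than two neighbors, which is a convention chosen here rather than something stated on the slide.

```python
def clustering_coefficient(adj, i):
    """C_i = 2 * E_i / (k_i * (k_i - 1)) for node i."""
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention assumed here: undefined cases count as 0
    # E_i: number of links that exist between pairs of i's neighbors
    links = sum(1 for u in neighbors for v in neighbors if u < v and v in adj[u])
    return 2 * links / (k * (k - 1))

def network_clustering(adj):
    """Average clustering coefficient over all nodes."""
    return sum(clustering_coefficient(adj, i) for i in adj) / len(adj)

# Hypothetical example: node 0 has neighbors 1, 2, 3; only 1 and 2 are linked.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
c0 = clustering_coefficient(adj, 0)
c_net = network_clustering(adj)
```

Here node 0 has k = 3 neighbors and E = 1 link among them, so C_0 = 2/6 = 1/3.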

Qualitative properties of biological networks
- Scale-free
  - Existence of hubs
  - Many nodes with only a few connections
- Small-world (small average path length)
- High clustering coefficient

Real biological networks: scale-free
Heavy-tailed degree distribution
Power-law distribution: $P(k) \propto k^{-r}$

Erdos-Renyi random networks
- Each pair of nodes has probability p of forming an edge
- Most nodes have about the same number of connections
- Degree distribution is binomial or Poisson
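
A small simulation can make the "most nodes have about the same number of connections" point concrete. This is a sketch under assumed parameters (n = 500, p = 0.02, fixed seed); the function name `erdos_renyi_degrees` is hypothetical.

```python
import random
from collections import Counter

def erdos_renyi_degrees(n, p, seed=0):
    """Sample a G(n, p) random network and return the degree of every node."""
    rng = random.Random(seed)
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:  # each pair is linked independently with probability p
                degree[i] += 1
                degree[j] += 1
    return degree

degree = erdos_renyi_degrees(n=500, p=0.02)
mean_degree = sum(degree) / len(degree)
histogram = Counter(degree)  # degree -> number of nodes with that degree
print(f"mean degree {mean_degree:.2f}, expected {0.02 * 499:.2f}, max degree {max(degree)}")
```

The degrees concentrate around p(n-1), with no node far above the mean, in contrast to the hubs of a scale-free network.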

Comparing Random and Scale-free distribution
In the random network, the five nodes with the most links (in red) are connected to only 27% of all nodes (green). In the scale-free network, the five most connected nodes (red) are connected to 60% of all nodes (green) (source: Nature)

Robust yet fragile nature of networks

Connectivity vs essentiality
[Figure: % of essential proteins vs. number of connections] Jeong et al., Nature 2001

Community role vs essentiality
The effect of a perturbation cannot depend on the node's degree only!
- Many hub genes are not essential
- Some non-hub genes are essential
Maybe a gene's role in its community is also important
- Local leader? Global leader? Ambassador?
Guimerà and Amaral, Nature 433, 2005

Community structure

Role 1, 2, 3: non-hubs with increasing participation indices
Role 5, 6: hubs with increasing participation indices
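
The slides refer to participation indices without giving a formula. The sketch below assumes the participation coefficient used by Guimerà and Amaral, P_i = 1 - sum_s (k_is / k_i)^2, where k_is is the number of links from node i into module s. The input layout (neighbor sets plus a node-to-module map) and all names are illustrative.

```python
from collections import Counter

def participation_coefficient(adj, module, i):
    """P_i = 1 - sum_s (k_is / k_i)^2; 0 means all links stay in one module."""
    k_i = len(adj[i])
    if k_i == 0:
        return 0.0
    links_per_module = Counter(module[j] for j in adj[i])
    return 1.0 - sum((k_is / k_i) ** 2 for k_is in links_per_module.values())

# Hypothetical example: "hub" splits its links evenly between two modules,
# while "local" links only within module 1.
adj = {"hub": {"a", "b", "c", "d"}, "local": {"a", "b"}}
module = {"hub": 1, "local": 1, "a": 1, "b": 1, "c": 2, "d": 2}
p_hub = participation_coefficient(adj, module, "hub")
p_local = participation_coefficient(adj, module, "local")
```

A value near 0 corresponds to a node whose links stay inside one community, and larger values to nodes that connect several communities.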

Dynamically organized modularity in the yeast PPI network
- Protein interaction networks are static
- Two proteins cannot interact if one is not expressed
- We should look at the gene expression level
Han et al., Nature 430, 2004

Obtaining Data

Distinguish party hubs from date hubs
[Figure legend] Red curve: hubs; cyan curve: non-hubs; black curve: randomized
Partners of date hubs are significantly more diverse in spatial distribution than partners of party hubs

Effect of removal of nodes on average geodesic distance
[Figure panels] Original network; on removal of date hubs; on removal of party hubs
[Figure legend] Green: non-hub nodes; brown: hubs; red: date hubs; blue: party hubs
The "breakdown point" is the threshold after which the main component of the network starts disintegrating.

Dynamically organized modularity
[Figure legend] Red circles: date hubs; blue squares: modules

Network and disease
The challenges in studying the genetic causes of complex diseases such as cancer, autism, and diabetes:
- The same disease can be caused by different genetic perturbations
- A disease can be caused by a combinatorial effect of many mutations, so the individual effect of each mutation might be small and thus hard to discover
- Disease heterogeneity, such as different subtypes (autism is an example)

How to approach the complex diseases
The facts
- Genes, gene products, and small molecules interact with each other to form a complex interaction network
- Similar disease phenotypes despite different genetic causes suggest that these causes are not unrelated but rather dysregulate the same component of the cellular system
Therefore, we should focus on modules or subnetworks of genes

Benefits of using molecular interaction
- Identify subnetworks that include genes not significantly different between disease and control
- PPI and gene expression profiles -> active subnetworks with common transcription factors
- A module-based approach increases statistical power
- Identified network modules can provide a better understanding of the biological underpinnings of the disease

Network Modules Enriched with Genetic Alterations
1. Find candidate genes that are altered in disease
2. Map them to the interaction network
3. Extract modules: highly interconnected or within close proximity
4. Evaluate the statistical significance of the selected modules

Differentially expressed network modules
Finding modules enriched with genes that have abnormal expression
Methods:
- Scoring-based methods
- Correlation-based methods
- Set-cover-based methods

Differentially expressed network modules (Methods)

Network-based prediction of breast cancer metastasis
Chuang et al., Molecular Systems Biology, 2007

Disease subnetwork identification

Results: Correspondence to hallmarks of cancer
For two datasets of 295 and 286 patients, 149 and 243 discriminative subnetworks were found, respectively
47% and 65% of the subnetworks were enriched for a common biological process
66 and 153 subnetworks were enriched for processes involved in major events of cancer progression
- The subnetworks consist of 618 and 906 genes, respectively.
- Enrichment is for biological processes as annotated by GO (hypergeometric test with an FDR of 5%).
- To test whether functional enrichment was solely due to network topology, they extracted 1000 random subnetworks (regardless of expression). These were enriched at 25% and 26%.
- [Figure legend] The color of each node scales with the change in expression of the corresponding gene for metastatic versus non-metastatic cancer. The shape of each node indicates whether its gene is significantly differentially expressed (diamond; P < 0.05 from a two-tailed t-test) or not (circle). Known breast cancer susceptibility genes are marked by a blue asterisk.

Results: Reproducibility
Subnetwork markers are significantly more reproducible between datasets than individual gene markers
- Agreement in markers selected from the two datasets.

[Figure panels] Dataset 1; Dataset 2
- MAPK3 was reproducible as a central node in subnetworks identified from both datasets, although it is not significantly differentially expressed.

Shared network motifs with differences in differential expression
[Figure: network motifs; left-hand side is from Dataset 1, right-hand side is from Dataset 2]

Results: Subnetwork Markers as Classifiers
- Averaged expression values for each subnetwork were used as features for a classifier based on logistic regression
- For comparison, the top individual gene markers were instead used as features
- Markers from one dataset were used as predictors of metastasis on the other dataset

Results: Subnetwork Markers as Classifiers
Dataset 1 markers tested on Dataset 2, and vice versa

Results: Informative of Non-discriminative Disease Genes
- Network analyses can identify proteins that are not differentially expressed but are required to connect higher-scoring proteins in a significant subnetwork
- 85.9% and 96.7% of the significant subnetworks contained at least one protein that was not significantly differentially expressed in metastasis

Results: Informative of Non-discriminative Disease Genes
Several established prognostic markers were not present among the individual gene expression markers, but played a central, interconnecting role in discriminative subnetworks
- Examples: MYC, ERBB2

PPI networks can help explore differences between healthy and disease states
Source: Dynamic modularity in protein interaction networks predicts breast cancer outcome, Nature Biotechnology 27, 2009

Summary
- Biological networks and other real-world networks share important topological properties that likely indicate functional significance
- Complex diseases can be better understood from the perspective of dysregulated modules than at the individual gene level
- Module-based approaches are more robust and improve disease classification

Community discovery: motivations
Biological networks are modular
- Metabolic pathways
- Protein complexes
- Transcriptional regulatory modules
Communities provide a high-level overview of the networks
Gene functions can be predicted based on communities

Community discovery problem
Divide a network into relatively densely connected sub-networks
[Figure: vertex reordering]

Challenges
- How many communities?
- Is there any community at all?

Community structures
Also known as modules
- Relatively densely connected sub-networks
- Quite common in real networks: social networks, Internet, biological networks, transportation, power grid

History
Social science: clustering
- Based on affinities / similarities
- Need to give the number of clusters
- Can always find clusters
Computer science: graph partitioning
- Minimizing cut / cut ratio
- Need to give the number of partitions
- Can always produce partitions
Preferred approach: natural division
- Automatically determine the number of communities
- Do not partition if there is no community

Modularity function (Q)
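Measures the strength of community structure (Newman, Phys. Rev. E, 2003)
$Q = \sum_{i=1}^{k} \left[ \frac{e_{ii}}{M} - \left( \frac{a_i}{2M} \right)^2 \right]$
- k: number of communities
- $e_{ii}/M$: observed fraction of edges falling in community i ($e_{ij}$ is the number of edges between communities i and j; M is the total number of edges)
- $(a_i/2M)^2$: expected fraction of edges falling in community i ($a_i$ is the total degree of the vertices in community i)
- $-1 < Q < 1$
- $Q = 0$ if $k = 1$
[Figure: 2 x 2 block matrix with entries e11, e12, e21, e22]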

Goal: find the partition that has the highest Q value
But: optimizing Q is NP-hard (Brandes et al., 2006)
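
To make the definition of Q concrete, here is a minimal sketch that evaluates it for a given partition. It assumes the form Q = sum_i [e_ii/M - (a_i/2M)^2], with e_ii the number of edges inside community i, a_i the total degree of community i, and M the number of edges; the edge-list input and the name `modularity` are illustrative.

```python
def modularity(edges, community):
    """Q = sum_i [e_ii/M - (a_i/2M)^2] for an undirected, unweighted edge list."""
    m = len(edges)
    internal = {}      # e_ii: edges with both ends in community i
    total_degree = {}  # a_i: sum of degrees of the vertices in community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        total_degree[cu] = total_degree.get(cu, 0) + 1
        total_degree[cv] = total_degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (a / (2 * m)) ** 2
               for c, a in total_degree.items())

# Hypothetical example: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {v: "A" for v in range(6)}
q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

For the two-triangle example each community has e_ii = 3 and a_i = 7 with M = 7, so Q = 2(3/7 - 1/4) = 5/14, while placing everything in one community gives Q = 0. Evaluating Q for one partition is cheap; the hard part is searching over partitions.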

Heuristic algorithms
- k-way spectral partitioning approximately optimizes Q if k is known (White & Smyth, SDM 2005)
- If k is unknown: test all possible k's
[Figure: pipeline labels "eig" and "kmeans"]

k-way spectral partitioning
[Figure: example partitions with Q = 0.40, Q = 0.56, Q = 0.54]
- Good accuracy
- ~$O(n^3)$ time complexity; n: number of vertices

Recursive bi-partitioning
[Figure: example partitions labeled Q = 0.40, x, Q = 0.56, Q = 0.54]
- ~$O(m \log n)$ time complexity; m: number of edges
- Accuracy worse than k-way partitioning

Can we do better?
Objectives
- Efficiency of the recursive algorithm
- Accuracy of the k-way algorithm (or even better)
Ideas
- Flexible l-way recursive partitioning (l = 2-5): as efficient as recursive bi-partitioning, with accuracy similar to the k-way algorithm (Ruan and Zhang, ICDM 2007)
- Take the results of the recursive algorithm as the starting point and do local improvement (Ruan and Zhang, Physical Review E 2008)

Algorithm Qcut
1. Recursive partitioning until a local maximum of Q is reached
2. Refine the solution by greedy search
   - Consider two types of operations: move a vertex to a different community, or merge two communities
   - Take the one with the largest improvement of Q
   - Repeat until no improvement of Q can be made
3. Go back to step 1 if necessary
Key: quickly find the operation that gives the largest improvement of Q

Identifying candidate moves
If vertex v moves from community i to community j:
- x_i: degree of v in community i
- x: degree of v
- a_i: total degree of the vertices in community i
Compute the potential Q of every candidate move from the initial state
Each update takes almost constant time for scale-free networks
Additional heuristics improve efficiency
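
The formula for the change in Q did not survive in this transcript. Assuming Q = sum_i [e_ii/M - (a_i/2M)^2], moving v from community i to community j changes Q by (x_j - x_i)/M + x(a_i - a_j - x)/(2M^2), where x_j is the number of links from v into j. The sketch below implements that derived expression and checks it against a full recomputation; it is an illustration under that assumption, not necessarily the exact expression or data structure used in Qcut.

```python
def modularity(edges, community):
    """Q = sum_i [e_ii/M - (a_i/2M)^2] for an undirected, unweighted edge list."""
    m = len(edges)
    internal, total_degree = {}, {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        total_degree[cu] = total_degree.get(cu, 0) + 1
        total_degree[cv] = total_degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (a / (2 * m)) ** 2
               for c, a in total_degree.items())

def delta_q_move(edges, community, v, target):
    """Change in Q if v leaves its community i for a different community j."""
    m = len(edges)
    source = community[v]
    x = x_i = x_j = 0  # degree of v: total, into community i, into community j
    a = {}             # a[c]: total degree of community c
    for u, w in edges:
        a[community[u]] = a.get(community[u], 0) + 1
        a[community[w]] = a.get(community[w], 0) + 1
        if v in (u, w):
            other = w if u == v else u
            x += 1
            if community[other] == source:
                x_i += 1
            elif community[other] == target:
                x_j += 1
    a_i, a_j = a[source], a.get(target, 0)
    return (x_j - x_i) / m + x * (a_i - a_j - x) / (2 * m * m)

# Hypothetical example: two triangles joined by edge (2, 3); move vertex 2 to "B".
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
before = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
after = dict(before)
after[2] = "B"
predicted = delta_q_move(edges, before, v=2, target="B")
recomputed = modularity(edges, after) - modularity(edges, before)
```

This version rescans the edge list for clarity, so it costs O(m) per candidate move. A fast implementation would cache x_i, x_j, and a_i and update only the entries affected by each accepted move. The sketch assumes no self-loops and that the target differs from the current community.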

Results on synthetic networks
State of the art: Newman, PNAS 2006
[Figure: accuracy and relative Q, each plotted against N_out]
Relative Q = Q_found - Q_true

An example Real Structure Vertex reordered
Result of Qcut (Accuracy: 99%) Result of Newman (Accuracy: 77%)

Results on real-world networks
SA: simulated annealing (Guimerà & Amaral, Nature 2005)
Columns: network, #vertices, #edges, modularity (Newman, SA, Qcut)
Social 67 142 0.573 0.608 0.587
Neuron 297 2359 0.396 0.408 0.398
Ecoli Reg 418 519 0.766 0.752 0.776
Circuit 512 819 0.804 0.670 0.815
Yeast Reg 688 1079 0.759 0.740
Ecoli PPI 1440 5871 0.367 0.387
Internet 3015 5156 0.611 0.624 0.632
Physicists 27519 116181 -- 0.744

Running time (seconds)
Columns: network, #vertices, #edges, running time (Newman, SA, Qcut)
Social 67 142 0.0 5.4 2.0
Neuron 297 2359 0.4 139 1.9
Ecoli Reg 418 519 0.7 147 12.7
Circuit 512 819 1.8 143 6.1
Yeast Reg 688 1079 3.0 1350 13.4
Ecoli PPI 1440 5871 33.2 5868 41.5
Internet 3015 5156 253.7 11040 43.0
Physicists 27519 116181 -- 2852

Graphical user interface for biologists

A real-world example A classic social network: Karate club
Node – club member; edge – friendship Club was split due to a dispute Can we predict the split given the network?

Network of football teams
- Vertices: football teams in NCAA Division I-A
- Edges: games played in year 2000
- 110 teams, 11 conferences (excluding independents)
- Most games are within conferences
[Figure labels: Big 12, Big East]

Conference vs. Community
Communities discovered by Qcut / Newman Conferences Mountain West Pacific Ten

Communities discovered
Whose fault is it? Communities discovered by Qcut / Newman Force the two conferences to be separated: Q = Q =

Resolution limit of the Q function
[Figure: communities c1 and c2 within a large network, merged (Q1) versus separated (Q2)]
c1 and c2 are separable only if $Q_2 - Q_1 > 0$
$Q_2 - Q_1 \propto \frac{a_1 a_2}{2M} - e_{12}$
- $a_1 a_2 / 2M$: expected number of edges between c1 and c2
- $e_{12}$: actual number of edges between c1 and c2
If c1 and c2 are small relative to the network:
- The expected number of edges is < 1
- c1 and c2 are non-separable even if connected by only one edge
- But that edge may be due to noise in the data
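
A quick numeric check of the resolution limit. Under the assumed form Q = sum_i [e_ii/M - (a_i/2M)^2], the gain from keeping c1 and c2 separate works out to Q2 - Q1 = (a1 a2 / 2M - e12) / M. The numbers below are hypothetical: two 5-cliques joined by a single edge, embedded in networks of different total size M.

```python
def separation_gain(a1, a2, e12, M):
    """Q2 - Q1: modularity gained by keeping c1 and c2 as separate communities."""
    expected_between = a1 * a2 / (2 * M)  # expected number of c1-c2 edges
    return (expected_between - e12) / M

# Each 5-clique has 10 internal edges plus the bridge: total degree 21.
a1 = a2 = 21
gain_small_network = separation_gain(a1, a2, e12=1, M=30)    # expected 7.35 edges
gain_large_network = separation_gain(a1, a2, e12=1, M=1000)  # expected 0.22 edges
```

In the small network the single bridge is far fewer edges than expected, so separation increases Q. In the large network the expected number of edges between the cliques drops below 1, so merging them increases Q even though they share only one edge.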

Resolution limit
Optimizing Q
- may miss small communities
- is sensitive to false-positive edges
- cannot reveal hierarchical structures (a community containing some sub-communities)
Real-world networks
- contain both large and small communities
- may have false-positive edges (biological data are extremely noisy)
- have hierarchies

A solution: HQcut (Ruan & Zhang, Physical Review E 2008)
- Apply Qcut to get the communities with the largest Q
- Recursively search for sub-communities within each community
- When to stop?
  - The Q value of the sub-network is small, or
  - Q is not statistically significant (estimated by a Monte Carlo method)

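Z-score = (Q - mean of randQ) / (standard deviation of randQ), where randQ is obtained from randomized networks
- Example 1: randQ = 0.15 ± 0.016, Q = 0.49, Z-score = 21
- Example 2: randQ = 0.15 ± 0.016, Q = 0.18, Z-score = 1.9
- Example 3: randQ = 0.52 ± 0.031, Q = 0.49, Z-score = -1.3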
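
The significance test can be sketched as a z-score of the observed Q against Q values from randomized networks. The randomization step itself is not shown; `q_random` is assumed to already hold modularity values obtained from randomized versions of the sub-network, and the function names are hypothetical.

```python
from statistics import mean, stdev

def modularity_z_score(q_observed, q_random):
    """Z-score of the observed Q against a sample of Q values from randomized networks."""
    return (q_observed - mean(q_random)) / stdev(q_random)

def z_from_summary(q_observed, rand_mean, rand_std):
    """Same score when only the mean and standard deviation of randQ are known."""
    return (q_observed - rand_mean) / rand_std

# Hypothetical sample of randomized Q values.
z_sample = modularity_z_score(0.49, [0.14, 0.15, 0.16])

# Summary statistics as reported on the slide for the first two examples.
z_first = z_from_summary(0.49, 0.15, 0.016)
z_second = z_from_summary(0.18, 0.15, 0.016)
```

With the reported summary values, the first example gives about 21 and the second about 1.9, matching the slide. A sub-network whose z-score falls below a chosen cutoff would not be split further.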

Large network Q = 0.49 Z-score = -1.3 Q = 0.49 Z-score = 21 Q = 0.18

Test on synthetic networks
Network: 1000 vertices Community sizes vary from 15 to 100

Accuracy

Example communities Discovered by Qcut Discovered by HQcut

Results for the NCAA teams
Communities by Qcut/Newman Communities by HQcut Mountain West Pacific Ten

Applications to a PPI network
Protein-protein interaction (PPI) network
- Vertices: proteins
- Edges: interactions detected by experiments
Motivation: community = protein complex?
Protein complex
- Group of proteins associated via interactions
- Elementary functional unit in the cell
- Prediction from the PPI network is important

Experiments
Data set
- A yeast protein-protein interaction network (Krogan et al., Nature 2006)
- 2708 proteins, 7123 interactions
Algorithms
- Qcut, HQcut, Newman
Evaluation
- ~300 known protein complexes in MIPS
- How well does a community match a known protein complex?

Results
                                       Newman    Qcut      HQcut
# of communities                       56        93        316
Max community size                     312       264       60
# of matched communities               53        52        216
Communities with matching score = 1    5 (9%)    7 (13%)   43 (20%)
Average matching score                 0.56      0.55      0.70
# of novel predictions                 3         41        100

Communities found by HQcut
- Small ribosomal subunit (90%)
- RNA poly II mediator (83%)
- Proteasome core (90%)
- Exosome (94%)
- Gamma-tubulin (77%)
- Respiratory chain complex IV (82%)

Example hierarchical community

Microarray data
Data organized into a matrix
- Rows are genes
- Columns are samples representing different time points, conditions, tissues, etc.
Analysis techniques
- Differential expression analysis
- Classification and clustering
- Regulatory network construction
- Enrichment analysis
Characteristics of microarray data
- High dimensionality and noise
- Underlying topology unknown, often of irregular shape
[Figure: heat map with genes as rows and samples as columns; red: high activity, green: low activity]

Microarray data clustering
[Figure: heat map with genes as rows and samples as columns; red: high activity, green: low activity]
Analyze the genes in each cluster
- Common functions?
- Common regulation?
- Predict functions for unknown genes?
Many clustering algorithms are available
- K-means
- Hierarchical
- Self-organizing maps
Their parameters are hard to tune, and they do not consider network topology

Network-based data analysis
[Figure: expression matrix (genes x samples) used to construct a co-expression network over genes i and j]
Genes i and j are connected if their expression patterns are "sufficiently similar"
- Similarity > threshold (long list of references)
- K nearest neighbors
Recently became popular
Many interesting applications beyond clustering
Focus here is clustering
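
A minimal sketch of the threshold-based construction, assuming Pearson correlation as the similarity measure (the slide does not name one) and a dictionary mapping gene names to expression profiles. The gene names, values, and threshold are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def coexpression_edges(expr, threshold):
    """Connect genes i and j when their similarity exceeds the threshold."""
    genes = sorted(expr)
    return [(g, h)
            for idx, g in enumerate(genes)
            for h in genes[idx + 1:]
            if pearson(expr[g], expr[h]) > threshold]

# Hypothetical expression profiles over four samples.
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.1, 3.9, 6.2, 8.0],   # rises with g1
    "g3": [4.0, 3.0, 2.0, 1.0],   # falls as g1 rises
}
network = coexpression_edges(expr, threshold=0.9)
```

Only g1 and g2 end up connected. Whether strongly anti-correlated pairs such as g1 and g3 should also be linked (for example by thresholding the absolute correlation) is a design decision this sketch leaves out.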

Motivation
Can we use the idea of community finding for clustering microarray data?
Advantages:
- Parameter free
- Network topology considered
- Constructed network may have other uses

Network-based microarray data analysis
[Figure: expression matrix (genes x samples) used to construct a co-expression network]
How to get the networks?
- Threshold-based
- Nearest neighbors
Can we use a complete weight matrix (a complete graph with weighted edges)?
- In general, no, since Q is ill-defined on weighted networks
How to determine the right cutoff?

Network-based microarray data analysis
There is an implicit network structure
Motivation: the true network should be naturally modular
- Modularity can be measured by Q
- If constructed right, the network should have the highest Q
[Figure labels: condition, clustering, gene]

Method overview
[Figure: microarray data -> similarity matrix -> network series from Net_1 (most dense) to Net_m (most sparse); Qcut is run on each network]

Method overview (cont’d)
[Figure: modularity of the true network and of a random network, and their difference, plotted against network density]
Therefore, use ∆Q to determine the best network parameter and obtain the best community structure
We actually run HQcut, a variant of Qcut, in order to avoid the resolution limit (Ruan & Zhang, Phys Rev E 2008)

Network construction methods
Value-based method
- Remove edges with similarities < ε
- Fixed ε for all vertices
- May have problems detecting weakly correlated modules
Asymmetric k-nearest neighbors (aKNN)
- Connect each vertex to k other vertices
- Fixed k for all vertices (k < 10 is good enough)
- Minimum degree = k; maximum degree = ?
- Sensitive to outliers
Mutual k-nearest neighbors (mKNN)
- Association confirmed by both ends
- Maximum degree = k; minimum degree = 0 (k is larger than in aKNN)
- Outliers can be detected
Ruan, ICDM 2009
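
The difference between aKNN and mKNN can be seen on a tiny example. This sketch assumes a symmetric similarity matrix stored as nested lists; the function names and the matrix values are hypothetical, and ties are broken by index order.

```python
def knn_lists(sim, k):
    """For each vertex, the set of its k most similar other vertices."""
    n = len(sim)
    return [set(sorted((j for j in range(n) if j != i),
                       key=lambda j: sim[i][j], reverse=True)[:k])
            for i in range(n)]

def aknn_edges(sim, k):
    """Asymmetric kNN: keep edge i-j if j is among i's k nearest neighbors."""
    nbrs = knn_lists(sim, k)
    return {frozenset((i, j)) for i in range(len(sim)) for j in nbrs[i]}

def mknn_edges(sim, k):
    """Mutual kNN: keep edge i-j only if each is among the other's k nearest."""
    nbrs = knn_lists(sim, k)
    return {frozenset((i, j))
            for i in range(len(sim)) for j in nbrs[i] if i in nbrs[j]}

# Hypothetical similarities: vertices 0-2 are close; vertex 3 is an outlier.
sim = [
    [1.00, 0.90, 0.80, 0.20],
    [0.90, 1.00, 0.85, 0.10],
    [0.80, 0.85, 1.00, 0.15],
    [0.20, 0.10, 0.15, 1.00],
]
aknn = aknn_edges(sim, k=1)
mknn = mknn_edges(sim, k=1)
```

With k = 1, aKNN attaches the outlier (vertex 3) to vertex 0 because every vertex must pick a neighbor, whereas mKNN leaves it isolated since vertex 0 does not pick it back. This is the sense in which mKNN lets outliers be detected.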

Results: synthetic data set 1
High-dimensional data generated by synDeca
- 20 clusters of high-dimensional points, plus some scatter points
- Clusters are of various shapes: ellipse, rectangle, random
[Figure: accuracy and ∆Q]

Comparison mKNN-HQcut with the optimum k
mKNN-HQcut with automatically determined k

Results: synthetic data set 2
Gene expression data (Thalamuthu et al., 2006)
- 600 data sets
- ~600 genes, 50 conditions, 15 clusters
- 0 or 1x outliers
[Figure panels: without outliers; with outliers. Curves: mKNN-HQcut with optimal k; mKNN-HQcut with auto k]

Comparison with other methods

Results on yeast stress response data
3000 genes, 173 samples Best k = 140. Resulting in 75 clusters

Results on yeast stress response data
Enrichment of common functions (accumulative hypergeometric test)
- Protein biosynthesis ($p < 10^{-96}$)
- Peroxisome ($p < 10^{-13}$)
- Nuclear transport ($p < 10^{-50}$)
- mt ribosome ($p < 10^{-63}$)
- DNA repair ($p < 10^{-66}$)
- RNA splicing (p < )
- Nitrogen compound metabolism ($p < 10^{-37}$)
[Figure axes: gene, GO function terms]
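
The enrichment p-values above come from a cumulative hypergeometric test. A minimal sketch of that tail probability follows; the parameter names (N genes in total, K annotated with the term, n in the cluster, x annotated in the cluster) and the toy numbers are assumptions for illustration.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, x):
    """P(X >= x): chance of seeing at least x annotated genes in a cluster of
    n genes drawn from N genes, of which K carry the annotation."""
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1))
    return tail / comb(N, n)

# Hypothetical toy case: 10 genes, 5 annotated, a cluster of 5 genes.
p_all_annotated = hypergeom_enrichment_p(N=10, K=5, n=5, x=5)
p_at_least_zero = hypergeom_enrichment_p(N=10, K=5, n=5, x=0)
```

Exact integer arithmetic is used for the binomial coefficients, so very small p-values lose precision only in the final division. When many GO terms are tested per cluster, a multiple-testing correction would normally be applied on top of this.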

Comparison with k-means
Using automatically determined k = 140
[Figure: overall function coherence of mKNN-HQcut vs. k-means]

Application to Arabidopsis data
- ~22000 genes, 1138 samples
- 1150 singletons
- 800 (300) modules of size >= 10 (20)
- > 80% (90%) of modules have enriched functions
- Much more significant than all five existing studies on the same data set
[Figure: top 40 most significant modules]

Cis-regulatory network of Arabidopsis
Motif Module

Beyond gene clusters (1)
Gene-specific studies
- Collaborator is interested in gibberellins
  - A hormone important for the growth and development of plants
  - Commercially important
  - Biosynthesis and signaling well studied
  - Transcriptional regulation of biosynthesis and signaling not yet clear
- 3 important gene families for biosynthesis: GA20ox, GA3ox, and GA2ox
- Receptor gene family: GID1A, GID1B, GID1C
- Analyze the co-expression network around these genes

[Figure: co-expression network with node labels 20ox, GID1C, GID1A, 3ox, GA3, 20ox5, GID1B, 2ox, 2ox6, 2ox4, 2ox8, 2ox2, 20ox1, 3ox2, 3ox4, 3ox3, 2ox3, 20ox3, 20ox4, 20ox2, 2ox7, 2ox1, 3ox1]

Beyond gene clusters (2)
Cancer classification
- Sample: tumor/normal cells
[Figure: expression matrix (genes x samples) -> network of samples -> Qcut]
Alizadeh et al., Nature 2000

Network of cell samples
[Figure legend] Black: normal cells; blue: tumor cells
[Figure labels] Follicular lymphoma (FL); transformed cell lines; activated blood B; DLBCL; resting blood B; blood T; diffuse large B-cell lymphoma (DLBCL); chronic lymphocytic leukemia (CLL)

Survival rate after chemotherapy
Median survival time: 22.3 months Survival rate: 73% Median survival time: 71.3 months DLBCL-2 DLBCL-1 DLBCL-3 Survival rate: 20% Median survival time: 12.5 months

Beyond gene clustering (3)
Topology vs function
[Figure: % of essential proteins vs. number of connections] Jeong et al., Nature 2001

Community participation vs. essentiality
[Figure: % essential vs. number of connections and vs. community participation, for hubs and non-hubs, with participation < 0.2 and >= 0.2]
Key: how to systematically search for such relationships?
