Biological Network Analysis Kimberly Glass BIO508 April 9, 2014
Outline
Network models: network properties, network paths, network motifs, information flow, graph clustering
Biological networks: relational networks, correlative networks, causative/regulatory networks
Applications: biological data integration, function prediction
Resources and tools
The Internet colored by IP address http://www.jeffkennedyassociates.com:16080/connections/concept/image.html
Co-authorship of scientific articles http://www.jeffkennedyassociates.com:16080/connections/concept/image.html
Networks in Molecular Biology: protein-protein interactions, protein-DNA interactions, genetic interactions, metabolic reactions, co-expression interactions, text mining interactions, association networks, etc. (Barabasi & Oltvai, Nature Reviews, 2004)
Graphs
A graph G = (V, E) is a set of vertices V and edges E. Example: V = {v1, v2, v3, v4, v5}, E = {(v1, v2), (v1, v3), (v2, v4), (v2, v5), (v3, v5)}.
A subgraph G’ of G is induced by some V’ ⊆ V and E’ ⊆ E. For example, V’ = {v1, v2, v3} and E’ = {(v1, v2), (v1, v3)}.
Graph properties: directed vs. undirected; weighted vs. unweighted; cyclic vs. acyclic; connectivity (node degree, paths).
[Figure: the example graph and its subgraph.]
Networks and Graphs: Terminology
Formally, a network is a graph G = (V, E), an ordered tuple of two sets: V = {v1, …, vn}, a set of unique nodes, and E = {(vi, vj), …}, a set of ordered or unordered node tuples.
[Figure panels: undirected, directed, weighted (example edge weights 0.5, 1.2, 6, -2), loops (self-connections), multigraph, bipartite, cyclic, acyclic (DAG).]
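Sparse vs Dense
G(V, E), where |V| = n and |E| = m are the numbers of vertices and edges.
A graph is sparse if m ~ n and dense if m ~ n^2.
A graph is complete when every pair of vertices is joined by an edge: m = n(n-1)/2 for a simple undirected graph (m = n^2 only if edge direction and self-loops are counted).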
Connected Components G(V,E) |V| = 69 |E| = 71 6 connected components
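A minimal sketch of how connected components can be counted, assuming an undirected graph given as a node list and an edge list. The toy graph and the function name `connected_components` are illustrative assumptions; this is not the 69-node network shown on the slide.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # A depth-first traversal from an unvisited node collects one component.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical toy graph: one triangle-free chain, one pair, one isolated node.
toy_nodes = ["a", "b", "c", "d", "e", "f"]
toy_edges = [("a", "b"), ("b", "c"), ("d", "e")]
toy_components = connected_components(toy_nodes, toy_edges)
print(len(toy_components))  # 3 components: {a, b, c}, {d, e}, {f}
```

Isolated nodes count as their own component, which is why the node list is passed separately from the edge list.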
Paths
A path is a sequence {x_1, x_2, …, x_n} such that (x_1, x_2), (x_2, x_3), …, (x_{n-1}, x_n) are edges of the graph. A closed path (x_n = x_1) on a graph is called a graph cycle or circuit.
Shortest-Path between nodes
Longest Shortest-Path
Network paths and diameter
Shortest path: connects two nodes by as few edges as possible.
Network diameter: the longest shortest path in the network.
The network diameter is often very short: ‘small world network’.
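A short sketch of both definitions, assuming an unweighted, undirected, connected graph. It uses breadth-first search on the five-node example graph from the Graphs slide; the function names are illustrative, not taken from any particular library.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Adjacency sets for an undirected graph given as a list of node pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def shortest_path_lengths(adjacency, source):
    """Breadth-first search: edges on the shortest path from source to each reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def diameter(adjacency):
    """Longest shortest path, taken over all pairs of mutually reachable nodes."""
    return max(
        max(shortest_path_lengths(adjacency, node).values())
        for node in list(adjacency)
    )


# Edge set E from the Graphs slide.
example_edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v2", "v5"), ("v3", "v5")]
example_adjacency = build_adjacency(example_edges)
print(shortest_path_lengths(example_adjacency, "v4")["v3"])  # 3 (e.g. v4-v2-v1-v3)
print(diameter(example_adjacency))  # 3
```

Running one search per node is fine for small graphs; large networks usually call for a library implementation.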
Network Motifs: Simple Building Blocks of Complex Networks Milo, Alon, et al. Science. 2002 Oct 25;298(5594):824-7
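Network Motifs
[Figure: motif types and the functions associated with them. Motifs shown: feedback, positive auto-regulation, negative auto-regulation, coherent feed-forward, incoherent feed-forward, bi-fan. Function labels shown: memory, delay, speed + stability, filter, pulse, whole genome duplication and evolvability.]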
Network Motifs: Simple Building Blocks of Complex Networks Shen-Orr, Alon et al. Nature Genetics, 2002 May;31(1):64-8.
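As an illustration of what searching for a motif involves, the sketch below enumerates feed-forward loops (X regulates Y, Y regulates Z, and X also regulates Z) in a directed edge list. The edge list and function name are hypothetical. Counting occurrences is only the first step: calling a pattern a motif also requires comparing its count against randomized networks, which this sketch does not attempt.

```python
from collections import defaultdict


def find_feed_forward_loops(edges):
    """Return all (x, y, z) with directed edges x->y, y->z and x->z."""
    targets = defaultdict(set)
    for regulator, target in edges:
        targets[regulator].add(target)

    loops = []
    for x in list(targets):
        for y in targets[x]:
            if y == x:
                continue  # ignore self-loops (auto-regulation)
            for z in targets.get(y, ()):
                if z != x and z != y and z in targets[x]:
                    loops.append((x, y, z))
    return loops


# Hypothetical regulatory edges: one feed-forward loop plus one extra edge.
toy_regulation = [("X", "Y"), ("Y", "Z"), ("X", "Z"), ("Z", "W")]
toy_loops = find_feed_forward_loops(toy_regulation)
print(toy_loops)  # [('X', 'Y', 'Z')]
```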
Degree or connectivity
Random vs scale-free networks
P(k) is the probability of each degree k, i.e., the fraction of nodes having that degree.
For random networks, P(k) is peaked around the mean degree (binomial, approximately Poisson for large sparse networks).
For real networks the distribution is often a power law: P(k) ~ k^(-γ).
Such networks are said to be scale-free.
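A minimal sketch of computing the empirical P(k) as the fraction of nodes with each degree, assuming an undirected graph. It is applied to the five-node example graph from the Graphs slide; the function name is illustrative.

```python
from collections import Counter


def degree_distribution(nodes, edges):
    """Map each observed degree k to the fraction of nodes having that degree."""
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    n = len(degree)
    return {k: counts[k] / n for k in sorted(counts)}


# Vertex and edge sets from the Graphs slide.
graph_nodes = ["v1", "v2", "v3", "v4", "v5"]
graph_edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v4"), ("v2", "v5"), ("v3", "v5")]
p_k = degree_distribution(graph_nodes, graph_edges)
print(p_k)  # degrees 1, 2 and 3 occur for 1, 3 and 1 of the 5 nodes
```

A graph this small says nothing about scale-free behavior; assessing a power law needs many nodes and is usually done on a log-log plot of P(k).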
Knock-out lethality and connectivity
Clustering coefficient
The density of the network surrounding node I, characterized by the number of triangles through I. Related to network modularity.
C_I = 2 n_I / (k (k - 1)), where k is the number of neighbors of I and n_I is the number of edges between node I’s neighbors.
Example: the center node has 8 (grey) neighbors and there are 4 edges between the neighbors, so C = 2*4 / (8*(8-1)) = 8/56 = 1/7.
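The worked example can be checked with a short sketch of the local clustering coefficient, C = 2 n / (k (k - 1)), where k is the number of neighbors and n the number of edges among them. The graph below is an assumed reconstruction with the same counts as the slide (8 neighbors, 4 edges between them); which neighbors are linked in the actual figure is unknown.

```python
from itertools import combinations


def clustering_coefficient(adjacency, node):
    """Fraction of possible links among a node's neighbors that actually exist."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no neighbor pairs, so no triangles are possible
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adjacency[a])
    return 2 * links / (k * (k - 1))


# Assumed layout: center "c" with neighbors n1..n8 and 4 edges among the neighbors.
neighbor_names = [f"n{i}" for i in range(1, 9)]
neighbor_links = [("n1", "n2"), ("n3", "n4"), ("n5", "n6"), ("n7", "n8")]

toy_adjacency = {"c": set(neighbor_names)}
for name in neighbor_names:
    toy_adjacency[name] = {"c"}
for a, b in neighbor_links:
    toy_adjacency[a].add(b)
    toy_adjacency[b].add(a)

center_c = clustering_coefficient(toy_adjacency, "c")
print(center_c)  # 2*4 / (8*7) = 1/7
```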
Mixing Properties of Networks
Assortative network: nodes tend to connect to other nodes of similar degree.
Disassortative network: nodes tend to connect to other nodes of dissimilar degree.
http://en.wikipedia.org/wiki/Assortativity
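One common way to quantify mixing is the Pearson correlation between the degrees at the two ends of each edge: positive values indicate an assortative network, negative values a disassortative one. The sketch below assumes an undirected, unweighted edge list; the star graph is a hypothetical example chosen because it is maximally disassortative.

```python
def degree_assortativity(edges):
    """Pearson correlation of the degrees found at the two ends of each edge."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Each undirected edge is counted in both orientations so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    mean = sum(xs) / n  # xs and ys hold the same values, so they share a mean
    covariance = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    variance = sum((x - mean) ** 2 for x in xs) / n
    if variance == 0:
        return None  # undefined when every node has the same degree
    return covariance / variance


# Hypothetical star graph: one hub joined to four leaves.
star_edges = [("hub", leaf) for leaf in ("a", "b", "c", "d")]
star_r = degree_assortativity(star_edges)
print(star_r)  # -1.0: the high-degree hub connects only to degree-1 leaves
```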
Network Structure: Hubs, Bottlenecks, and Information Flow
Network Structure: Cliques and Clusters
Clique: fully connected subgraph
Quasi-clique: nearly fully connected subgraph (a near-miss)
k-clique: clique of size exactly k
Maximal clique: clique that cannot be extended by adding another node (the largest clique in the graph is the maximum clique)
http://science.cancerresearchuk.org/sci/lightmicro/images/116771 http://scienceblogs.com/goodmath/upload/2007/07/maximal-cliques.jpg http://en.wikipedia.org/wiki/Community_structure
How is biological data represented in networks?
Data sources: gene expression, physical PPIs, genetic interactions, colocalization, sequence, protein domains, regulatory binding sites, …
[Figure: data sources combined into a network; edge legend: correlation from high to low.]
Building and Interpreting Biological Networks
How we build a biological network depends on what data we have AND what we want the edges in the network to represent.
The meaning of the edges in a biological network depends on the method used to generate those edges. This influences how we interpret the interactions in a network.
node: an object in the network (e.g. genes)
edge: indicates a relationship between two nodes
Interpreting the “edges” in Biological Networks
Relational networks: generally undirected (non-causal relationships); nodes all of the same “type”; generally no “signs” on edges. Example: Protein A is a dimerization partner of protein B.
Correlation networks: undirected (non-causal relationships); nodes all of the same “type”; edges can have “signs”. Example: when the expression of Gene A changes, so does the expression of Gene B. Correlation is not causation.
Regulatory networks: directed (causal relationships); can have “types” of nodes; edges can have “signs”. Example: TF A regulates Gene B.
Types of Protein Interactions
Physical protein interactions: edge between proteins if they physically interact.
Synthetic lethality: edge between proteins if mutating both causes lethality.
[Figure labels: wild type, viable, cell death.]
Functional Associations Between Processes
Edges: associations between processes. [Edge legend: very strong to moderate association.]
Gene Ontology: structured as a directed acyclic graph (DAG).
Ashburner et al. Gene Ontology: tool for the unification of biology. Nature Genetics 2000.
Functional Associations Between Genes
Edges: level of shared function between genes. There is an edge between two genes if they are involved in many of the same biological processes.
Network inference from expression data (Margolin and Califano, Ann. N.Y. Acad. Sci. 1115: 51–72 (2007))
Approaches: differential equations, Boolean networks, linear regression, Bayesian networks, information theoretic models, latent variable networks.
[Figure: genes × conditions expression matrix.]
Focusing on gene expression is a simplification, but it lets us get a handle on the problem.
Correlation is the simplest metric for co-expression
[Figure: a genes × conditions expression matrix is converted into a genes × genes correlation matrix.]
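A minimal sketch of turning a genes × conditions matrix into a co-expression network: compute the Pearson correlation for every gene pair and keep pairs whose absolute correlation passes a cutoff. The expression values, gene names, and the 0.8 threshold are all made up for illustration, and every gene is assumed to vary across conditions (a constant profile would give a zero denominator).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.8):
    """Signed edges between gene pairs with |correlation| >= threshold."""
    edges = {}
    for gene1, gene2 in combinations(sorted(expression), 2):
        r = pearson(expression[gene1], expression[gene2])
        if abs(r) >= threshold:
            edges[(gene1, gene2)] = r
    return edges


# Hypothetical expression profiles: rows are genes, columns are conditions.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 1.0, 3.0],
}
toy_network = coexpression_edges(toy_expression)
print(sorted(toy_network))  # geneD is only weakly correlated and gets no edges
```

Keeping the sign of the correlation on each edge matches the point that correlation networks can have signed edges.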
Mutual Information is a Measure of Non-linear Correlation
[Figure: scatter plots of variable pairs labeled with their Pearson correlation value.]
Source: http://en.wikipedia.org/wiki/Correlation_and_dependence
Mutual Information (MI)
Definition: measures how much knowing one of two variables reduces uncertainty about the other.
Properties: non-negative and symmetric; invariant under invertible (including nonlinear) transformations of the variables.
Network reconstruction algorithms that use MI: ARACNE, CLR.
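A small sketch of estimating MI from paired observations, using the standard textbook definition I(X;Y) = sum over (x, y) of p(x,y) log2[ p(x,y) / (p(x) p(y)) ] with probabilities replaced by observed frequencies. It assumes the values are already discrete; real expression data are continuous and need binning or a kernel estimator first, which is not shown. The example data are made up: y is the square of x, so the two are fully dependent even though their Pearson correlation is zero.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in bits) for paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y).
        mi += (count / n) * log2(count * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical non-linear relationship: y = x**2.
x_values = [-2, -1, 1, 2]
y_values = [4, 1, 1, 4]
mi_nonlinear = mutual_information(x_values, y_values)
print(mi_nonlinear)  # 1.0 bit

# Two independent binary variables share no information.
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(mi_independent)  # 0.0
```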
ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks) Margolin, Califano et al. BMC Bioinformatics. 2005 Mar 20;7 Suppl 1:S7. Key Idea: Remove indirect relationships.
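A simplified sketch of the pruning idea: in every fully connected gene triplet, the edge with the lowest MI is treated as indirect and removed (the data processing inequality). This is an illustration under stated assumptions, not the published implementation: the MI values are invented, the threshold is arbitrary, and the tolerance parameter used in practice is omitted.

```python
from itertools import combinations


def prune_indirect_edges(mi, threshold=0.0):
    """Drop the weakest edge of every triangle in a thresholded MI network.

    mi maps frozenset({gene1, gene2}) to a mutual information value.
    """
    kept = {pair: value for pair, value in mi.items() if value > threshold}
    genes = sorted({gene for pair in kept for gene in pair})

    # Triangles are evaluated on the thresholded network before any removal,
    # so the result does not depend on the order in which triplets are visited.
    to_remove = set()
    for a, b, c in combinations(genes, 3):
        triangle = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if all(edge in kept for edge in triangle):
            to_remove.add(min(triangle, key=lambda edge: kept[edge]))
    return {pair: value for pair, value in kept.items() if pair not in to_remove}


# Hypothetical chain TF -> G1 -> G2: the TF-G2 dependence is indirect and weakest.
toy_mi = {
    frozenset(("TF", "G1")): 0.9,
    frozenset(("G1", "G2")): 0.8,
    frozenset(("TF", "G2")): 0.3,
}
pruned = prune_indirect_edges(toy_mi)
print(sorted(sorted(pair) for pair in pruned))  # the TF-G2 edge is gone
```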
CLR (Context Likelihood of Relatedness) Faith, Gardner et al. PLoS Biol. 2007 Jan;5(1):e8. Key Idea: Normalize the MI for each gene pair against its corresponding background.
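A sketch of the normalization idea: each MI value is converted into a z-score against the distribution of MI values for gene i and again for gene j, negative z-scores are clipped to zero, and the two are combined as sqrt(z_i^2 + z_j^2). The MI matrix is invented, and the use of the population standard deviation is an assumption of this sketch.

```python
from math import sqrt
from statistics import mean, pstdev


def clr_scores(mi):
    """Background-corrected scores from a symmetric MI table.

    mi maps gene -> {other gene -> MI}, without self entries.
    """
    background = {
        gene: (mean(list(row.values())), pstdev(list(row.values())))
        for gene, row in mi.items()
    }

    def z_score(gene, value):
        mu, sigma = background[gene]
        if sigma == 0:
            return 0.0  # a flat background gives no evidence either way
        return max(0.0, (value - mu) / sigma)

    scores = {}
    genes = sorted(mi)
    for index, gene1 in enumerate(genes):
        for gene2 in genes[index + 1:]:
            value = mi[gene1][gene2]
            scores[(gene1, gene2)] = sqrt(
                z_score(gene1, value) ** 2 + z_score(gene2, value) ** 2
            )
    return scores


# Hypothetical MI table: only the A-B pair stands out from its background.
toy_mi_table = {
    "A": {"B": 0.9, "C": 0.1, "D": 0.1},
    "B": {"A": 0.9, "C": 0.1, "D": 0.1},
    "C": {"A": 0.1, "B": 0.1, "D": 0.1},
    "D": {"A": 0.1, "B": 0.1, "C": 0.1},
}
toy_clr = clr_scores(toy_mi_table)
print(toy_clr[("A", "B")] > toy_clr[("A", "C")])  # True
```

The same raw MI can therefore score high for one gene pair and low for another, depending on how each gene relates to everything else.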
Thinking of Gene Regulation As a Network
Nodes are genes; edges indicate causal relationships between genes (“TF A regulates gene B”).
Networks are directed, from transcription factors to target genes (some of which are also transcription factors).
Edges in gene regulatory networks can have signs corresponding to target gene activation (increased transcription) and gene repression (prevention of transcription). Note that edge signs are hard to measure in practice.
[Figure: transcription factor and target gene, shown once for “TF A activates gene B” and once for “TF A represses gene B”.]
How Can We Model GRNs in Human Systems?
Two main ways to produce TF-gene regulation data:
Experimentally. Technique: ChIP-chip. Strengths: high quality, environment-specific. Limitations: very expensive, limited number of ChIP antibodies.
Computationally. Technique: DNA sequence scan for TF binding sites. Strengths: cheap. Limitations: recognition sequences are known for only 10-20% of TFs, prone to false positives, not environment-specific.
[Figure: TF-gene regulation network with nodes TF1-TF4 and G1-G6.]
Incorporating Epigenetic Information With TF Sequence-motif Data
All potential interactions: motif found within the gene’s promoter.
Interactions with epigenetic evidence: motif found in the gene’s promoter and located in a region of open chromatin (DNase hypersensitivity site).
[Figure: TF1 motif matches in the promoters of Gene1-Gene4, overlaid with epigenetic data.]
Relationship between Expression Information and Gene Regulation
Experimental (ChIP-chip): limited antibodies (sparse); non-functional targets. “Good quality, sparse, expensive”.
Computational (motif): quality of PWM; not environment specific; non-functional sequences. “Poor quality, dense, cheap”.
Gene expression: large amount of data; environment specific; correlation is not causation.
[Figure: the three data sources are combined into a regulatory network.]
Relationship between Expression Information and Gene Regulation
Gene expression: large amount of data; environment specific; correlation is not causation.
Correlation of expression might occur when one gene regulates another, or when two genes are regulated by the same TF.
[Figure: a TF is expressed; some time later, its target genes are expressed.]
Correlation in two genes’ expression patterns is actually more often a measure of co-regulation.
Relationship between Expression Information and Gene Regulation
Example (G2): the expression of G1 and G2 is highly correlated. Since TF1 targets G1, there is a higher probability that TF1 also regulates G2.
[Figure: TF1 targets G1; G1 and G2 have correlated expression; the TF1-G2 edge is marked “?”.]
Protein Interaction Is Related to Regulation Some transcription factors don’t bind a particular DNA sequence. TFs can regulate a gene: Through direct interaction with the control (promoter) region of that gene. By forming a complex with other TFs which directly interact with the promoter region of that gene. We can model protein interactions as a network.
Relationship between Protein Interaction Information and Gene Regulation
[Figure: protein-protein interaction data shown alongside TF-gene regulation data; node labels TF1-TF5 and G1-G5; TFs with a known recognition sequence are marked.]
Relationship between Protein Interaction Information and Gene Regulation
Integrated network example (G3): TF1 and TF2 are potential regulators. Since TF5 interacts with both TF1 and TF2, there is a higher probability that TF5 is also involved in the regulation of G3.
[Figure: integrated network of TF-gene regulation and protein-protein interaction edges among TF1-TF5 and G1-G5.]
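The G3 example can be expressed as a simple rule: propose a protein as an additional regulator of a gene when it physically interacts with several of that gene's known regulators. The sketch below is only an illustration of that reasoning, not a published integration method. The input contains just the relationships stated in the example, and the function name and `min_partners` parameter are assumptions.

```python
def propose_cofactor_regulators(tf_targets, ppi, min_partners=2):
    """Suggest extra regulators per gene from protein-protein interactions.

    tf_targets maps each TF to the set of genes it is known to regulate.
    ppi is a list of undirected protein pairs.
    """
    regulators = {}
    for tf, genes in tf_targets.items():
        for gene in genes:
            regulators.setdefault(gene, set()).add(tf)

    partners = {}
    for a, b in ppi:
        partners.setdefault(a, set()).add(b)
        partners.setdefault(b, set()).add(a)

    proposals = {}
    for gene, known in regulators.items():
        # A candidate must not already be a known regulator and must
        # interact with at least min_partners of the known regulators.
        candidates = {
            protein
            for protein, linked in partners.items()
            if protein not in known and len(linked & known) >= min_partners
        }
        if candidates:
            proposals[gene] = candidates
    return proposals


# Only the relationships stated in the G3 example.
example_tf_targets = {"TF1": {"G3"}, "TF2": {"G3"}}
example_ppi = [("TF5", "TF1"), ("TF5", "TF2")]
example_proposals = propose_cofactor_regulators(example_tf_targets, example_ppi)
print(example_proposals)  # {'G3': {'TF5'}}
```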
Functional mapping: mining biological networks
[Figure: predicted relationships between cell cycle genes; edge confidence ranges from low to high.]
The strength of these relationships indicates how cohesive a process is.
Functional mapping: mining biological networks
[Figure: predicted relationships between cell cycle genes and DNA replication genes; edge confidence ranges from low to high.]
The strength of these relationships indicates how associated two processes are.
Predicting gene function
[Figure: predicted relationships between genes, with cell cycle genes highlighted; edge confidence ranges from low to high.]
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
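One simple way to turn this idea into a score is to average the confidence of a candidate gene's edges to the genes already annotated to the process, counting missing edges as zero. This is a hedged sketch with invented gene names and weights. It does not capture specificity, which would also need a comparison against the gene's connectivity to genes outside the process.

```python
def process_membership_scores(weights, process_genes):
    """Mean edge confidence between each unannotated gene and the process genes.

    weights maps frozenset({gene1, gene2}) to a confidence in [0, 1].
    """
    process_genes = set(process_genes)
    all_genes = {gene for pair in weights for gene in pair}
    scores = {}
    for candidate in all_genes - process_genes:
        total = sum(
            weights.get(frozenset((candidate, member)), 0.0)
            for member in process_genes
        )
        scores[candidate] = total / len(process_genes)
    return scores


# Hypothetical network: cc1 and cc2 stand in for known cell cycle genes.
toy_weights = {
    frozenset(("cc1", "cc2")): 0.9,
    frozenset(("x", "cc1")): 0.8,
    frozenset(("x", "cc2")): 0.6,
    frozenset(("y", "cc1")): 0.1,
}
toy_scores = process_membership_scores(toy_weights, ["cc1", "cc2"])
print(toy_scores["x"] > toy_scores["y"])  # True: x is the stronger candidate
```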
Known Gene Regulatory Network: E. coli
E. coli is a single-celled organism with a circular DNA structure encoding approximately 4000 genes (about 2500 “operons”).
It probably has the most complete experimentally-constructed gene regulatory network and was used for many early investigations into GRN structure.
http://regulondb.ccg.unam.mx/
Human Regulatory Information: ENCODE https://genome.ucsc.edu/ENCODE/
Protein Interaction Information: StringDB http://string-db.org/
Pathway Information http://www.biocarta.com/ http://www.genome.jp/kegg/ http://www.geneontology.org/
Network Analysis and Visualization http://www.cytoscape.org/ http://igraph.sourceforge.net/ http://www.graphviz.org/