341: Introduction to Bioinformatics


341: Introduction to Bioinformatics Dr. Nataša Pržulj Department of Computing Imperial College London natasha@imperial.ac.uk Winter 2011

Topics
- Introduction to biology (cell, DNA, RNA, genes, proteins)
- Sequencing and genomics (sequencing technology, sequence alignment algorithms)
- Functional genomics and microarray analysis (array technology, statistics, clustering and classification)
- Introduction to biological networks
- Introduction to graph theory
- Network properties
- Network/node centralities
- Network motifs
- Network models
- Network/node clustering
- Network comparison/alignment
- Software tools for network analysis
- Interplay between topology and biology

Network Comparisons: Properties of Large Networks
Comparing large networks is computationally hard because the underlying subgraph isomorphism problem is NP-complete: given two graphs G and H as input, determine whether G contains a subgraph that is isomorphic to H.
Thus, network comparisons rely on easily computable heuristics (approximate solutions), called “network properties.”
Network properties can roughly, and historically, be divided into two categories:
- Global network properties: give an overall view of the network, but might not be detailed enough to capture complex topological characteristics of large networks.
- Local network properties: more detailed network descriptors that usually encompass a larger number of constraints, thus reducing the degrees of freedom in which the networks being compared can vary.

1. Global Network Properties
Readings: Chapter 3 of “Analysis of Biological Networks” by Junker and Schreiber
Global network properties:
1) Degree distribution
2) Average clustering coefficient
3) Clustering spectrum
4) Average diameter
5) Spectrum of shortest path lengths
6) Centralities

1. Global Network Properties 1) Degree Distribution Definitions: The degree of a node is the number of edges incident to the node. The average degree of a network is the average of the degrees over all nodes in the network. However, the average degree might not be representative, since the distribution of degrees might be skewed. Example (figure): a node x with deg(x) = 5.

1. Global Network Properties 1) Degree Distribution Let P(k) be the fraction of nodes of degree k in the network. The degree distribution is the distribution of P(k) over all k. P(k) can be understood as the probability that a randomly chosen node has degree k.
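The definition of P(k) can be tried out on a small graph. The following is a minimal sketch, assuming the network is stored as an adjacency dictionary; the function name `degree_distribution` and the five-node example graph are illustrative assumptions, not part of the course material.

```python
from collections import Counter

def degree_distribution(adj):
    """Return {k: P(k)}, the fraction of nodes that have degree k."""
    n = len(adj)
    counts = Counter(len(neighbors) for neighbors in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical undirected graph: node "a" is linked to everyone, plus edge b-c.
adj = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
    "e": {"a"},
}

P = degree_distribution(adj)
# The average degree can be recovered from P(k) as the sum of k * P(k).
avg_degree = sum(k * p for k, p in P.items())
```

Here two nodes have degree 1, two have degree 2, and one has degree 4, so P(1) = P(2) = 0.4 and P(4) = 0.2.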

1. Global Network Properties 1) Degree Distribution Example (log-log plot): here $P(k) \sim k^{-\gamma}$, where often $2 \le \gamma < 3$. This is a power-law, heavy-tailed distribution. Networks with power-law degree distributions are called scale-free networks. In them, most of the nodes are of low degree, but there is a small number of highly linked nodes (nodes of high degree) called “hubs.”
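A power law appears as a straight line on a log-log plot, because log P(k) = -γ log k + const, so the slope of that line is -γ. The sketch below illustrates this on an exact, synthetic power law with γ = 2.5; the function name `loglog_slope` and the synthetic data are assumptions made for illustration. A least-squares fit on log-log axes is only a rough estimator when applied to noisy empirical data.

```python
import math

def loglog_slope(P):
    """Least-squares slope of log P(k) versus log k, using only k > 0 with P(k) > 0."""
    pts = [(math.log(k), math.log(p)) for k, p in P.items() if k > 0 and p > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx

# Synthetic degree distribution P(k) proportional to k^(-gamma), for k = 1..50.
gamma = 2.5
raw = {k: k ** -gamma for k in range(1, 51)}
total = sum(raw.values())
P = {k: value / total for k, value in raw.items()}

estimated_gamma = -loglog_slope(P)
```

Because the synthetic data follow the power law exactly, the fitted slope recovers γ up to floating-point error.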

1. Global Network Properties 1) Degree Distribution Another example, in which the average degree is meaningful: here P(k) is a Poisson distribution.

1. Global Network Properties 1) Degree Distribution However, the degree distribution (and global properties in general) is a weak predictor of network structure. Illustration: G and H are of the same size (i.e., |G| = |H|: they have the same number of nodes and edges) and they have the same degree distribution, but G and H have very different topologies (i.e., graph structure).
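A small concrete case makes the point. The sketch below uses two hypothetical graphs chosen for illustration (they are not the G and H drawn on the slide): a single 6-cycle and two disjoint triangles. Both have 6 nodes, 6 edges, and every node of degree 2, yet one is connected and the other is not.

```python
def from_edges(edges):
    """Build an undirected adjacency dict {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree_sequence(adj):
    """Sorted list of node degrees."""
    return sorted(len(neighbors) for neighbors in adj.values())

def count_components(adj):
    """Number of connected components, found by depth-first search."""
    seen, components = set(), 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return components

G = from_edges([(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])  # one 6-cycle
H = from_edges([(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)])  # two triangles

same_degrees = degree_sequence(G) == degree_sequence(H)
components = (count_components(G), count_components(H))
```

The degree sequences are identical, so the degree distributions are too, but the component counts differ (1 versus 2).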


Research debates…
Assortative vs. disassortative mixing of degrees: do high-degree nodes interact with high-degree nodes? Measured by:
- Pearson correlation coefficient between the degrees of adjacent vertices
- Average neighbor degree, then averaged over all nodes of degree k
Structural robustness and attack tolerance: “Robust, yet fragile”
Scale-free degree distribution:
- “Party” vs. “date” hubs: J.D. Han et al., Nature, 430:88-93, 2004
- Bias in the data collection (sampling)? M. Stumpf et al., PNAS, 102:4221-4224, 2005; J. Han et al., Nature Biotechnology, 23:839-844, 2005
High-degree nodes:
- Essential genes: H. Jeong et al., Nature 411, 2001
- Disease/cancer genes: Jonsson and Bates, Bioinformatics, 22(18), 2006; Goh et al., PNAS, 104(21), 2007
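The two degree-mixing measures named on this slide can be sketched as follows. This is an illustrative implementation under stated assumptions: the graph is undirected and stored as an adjacency dictionary, each edge contributes both (u, v) and (v, u) to the correlation, and the function names and the star-graph example are hypothetical.

```python
import math
from collections import defaultdict

def degree_correlation(adj):
    """Pearson correlation between the degrees at the two ends of every edge."""
    xs, ys = [], []
    for u in adj:
        for v in adj[u]:  # each undirected edge is visited in both directions
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when all degrees are equal
    return cov / math.sqrt(var_x * var_y)

def average_neighbor_degree_by_k(adj):
    """For each degree k, the mean over degree-k nodes of their average neighbor degree."""
    per_k = defaultdict(list)
    for u, neighbors in adj.items():
        if neighbors:
            avg_nbr = sum(len(adj[w]) for w in neighbors) / len(neighbors)
            per_k[len(neighbors)].append(avg_nbr)
    return {k: sum(vals) / len(vals) for k, vals in sorted(per_k.items())}

# Hypothetical star graph: one hub linked to four leaves (strongly disassortative).
star = {"hub": {"a", "b", "c", "d"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}, "d": {"hub"}}

r = degree_correlation(star)
knn = average_neighbor_degree_by_k(star)
```

In the star, every edge joins a degree-4 node to a degree-1 node, so the correlation is -1 and the average neighbor degree decreases with k, which is the signature of disassortative mixing.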

1. Global Network Properties 2) Average Clustering Coefficient Definition: the clustering coefficient Cv of a node v is Cv = |E(N(v))| / (max possible number of edges in N(v)), where N(v) is the neighborhood of v, i.e., all nodes adjacent to v, and E(N(v)) is the set of edges between nodes in N(v). Cv can be viewed as the probability that two neighbors of v are connected; thus 0 ≤ Cv ≤ 1. By definition, Cv = 0 for a vertex v of degree 0 or 1.

1. Global Network Properties 2) Average Clustering Coefficient Example: |N(v)| = 4, since there are 4 nodes in N(v), i.e., N(v) = {1, 2, 3, 4}. |E(N(v))| = 3, since there are 3 edges between nodes in N(v). The maximum possible number of edges between nodes in N(v) is choose(4,2) = 6. Therefore Cv = 3/6 = 1/2.
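The same arithmetic can be checked in code. In the sketch below, node v has the four neighbors 1 to 4 with three edges among them; which three edges those are is an assumption (1-2, 2-3, 3-4), since the slide's figure is not available, and the function name is illustrative.

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """Cv = (edges among neighbors of v) / choose(deg(v), 2); 0 if deg(v) < 2."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    edges_between = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return edges_between / (k * (k - 1) / 2)

# Assumed layout: v is adjacent to 1, 2, 3, 4; edges among neighbors are 1-2, 2-3, 3-4.
adj = {
    "v": {1, 2, 3, 4},
    1: {"v", 2},
    2: {"v", 1, 3},
    3: {"v", 2, 4},
    4: {"v", 3},
}

c_v = clustering_coefficient(adj, "v")  # 3 edges out of choose(4, 2) = 6 possible
```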

1. Global Network Properties 2) Average Clustering Coefficient Definition: average clustering coefficient, C, of a network is the average Cv over all the nodes v∈ V.

1. Global Network Properties 3) Clustering Spectrum Definition: clustering spectrum, C(k), is the distribution of the average clustering coefficients of all nodes of degree k in the network, over all k. Example:

2) and 3) Clustering Coefficient and Spectrum
Example (figure: graph G):
Cv: clustering coefficient of node v
CA = 1/1 = 1
CB = 1/3 = 0.33
CC = 0
CD = 2/10 = 0.2
…
C: average clustering coefficient of the whole network = avg {Cv over all nodes v of G}
C(k): average clustering coefficient of all nodes of degree k, e.g., C(2) = (CA + CC)/2 = (1 + 0)/2 = 0.5 => clustering spectrum (the plotted example spectrum is not for G)
Need to evaluate whether the value of C (or any other property) is statistically significant.
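The average clustering coefficient C and the clustering spectrum C(k) follow directly from the per-node values. The sketch below uses a hypothetical four-node graph (a triangle a-b-c with a pendant node d attached to c), not the graph G from the slide, and the function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def clustering_coefficient(adj, v):
    """Cv = (edges among neighbors of v) / choose(deg(v), 2); 0 if deg(v) < 2."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0
    edges_between = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return edges_between / (k * (k - 1) / 2)

def average_clustering(adj):
    """C: mean of Cv over all nodes."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

def clustering_spectrum(adj):
    """C(k): mean of Cv over all nodes of degree k, for every degree k present."""
    by_degree = defaultdict(list)
    for v in adj:
        by_degree[len(adj[v])].append(clustering_coefficient(adj, v))
    return {k: sum(vals) / len(vals) for k, vals in sorted(by_degree.items())}

# Hypothetical graph: triangle a-b-c plus pendant node d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

C = average_clustering(adj)          # (1 + 1 + 1/3 + 0) / 4
spectrum = clustering_spectrum(adj)  # C(1) = 0, C(2) = 1, C(3) = 1/3
```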

1. Global Network Properties 4) Average Diameter Definition: the distance between two nodes is the smallest number of links that have to be traversed to get from one node to the other. Definition: the shortest path is the path that achieves that distance. Definition: the average network diameter is the average of shortest path lengths over all pairs of nodes in a network.

1. Global Network Properties 5) Spectrum of shortest path lengths Definition: Let S(d) be the percentage of node pairs that are at distance d. The spectrum of shortest path lengths is the distribution of S(d) over d. Example:

4) and 5) Average Diameter and Spectrum of Shortest Path Lengths
Distance between a pair of nodes u and v (figure: graph G): Du,v = min {lengths of all paths between u and v} = min {3, 4, 3, 2} = 2 = dist(u,v)
Average diameter of the whole network: D = avg {Du,v for all pairs of nodes {u,v} in G}
Spectrum of the shortest path lengths: example plot (not for G)
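For unweighted graphs, shortest path lengths can be computed with breadth-first search. The sketch below computes all pairwise distances, the average diameter D, and the spectrum S(d) for a hypothetical four-node path graph; the function names are illustrative, and unreachable pairs are simply skipped, which is an assumption about how disconnected networks are handled.

```python
from collections import Counter, deque

def bfs_distances(adj, source):
    """Shortest-path length (number of links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def pairwise_distances(adj):
    """Distances for all unordered pairs of distinct, mutually reachable nodes."""
    nodes = list(adj)
    lengths = []
    for i, u in enumerate(nodes):
        dist = bfs_distances(adj, u)
        for v in nodes[i + 1:]:
            if v in dist:
                lengths.append(dist[v])
    return lengths

# Hypothetical path graph a - b - c - d.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}

lengths = pairwise_distances(adj)
average_diameter = sum(lengths) / len(lengths)
# S(d): fraction of node pairs that are at distance d.
spectrum = {d: c / len(lengths) for d, c in sorted(Counter(lengths).items())}
```

The six node pairs have distances 1, 1, 1, 2, 2, 3, so D = 10/6 and S(1) = 1/2, S(2) = 1/3, S(3) = 1/6.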

1. Global Network Properties 6) Node Centralities (Readings: Chapter 3 of “Analysis of Biological Networks” by Junker and Schreiber)
Rank nodes according to their “topological importance.”
Definition: centrality quantifies the topological importance of a node (or edge) in a network.
There are many different types of centralities:
- Degree centrality
- Closeness centrality
- Eccentricity centrality
- Betweenness centrality
- Subgraph centrality
- Eigenvector centrality
Software tools: Visone (social nets) and CentiBiN (biological nets)

1. Global Network Properties 6) Node Centralities Definitions: 1. Degree centrality, Cd(v): nodes with a large number of neighbors (i.e., edges) have high centrality. Therefore, we have Cd(v) = deg(v). Example of a use of degree centrality: in PPI networks, nodes with high degree centrality are considered to be “biologically important.” We will learn later in the course what this means. 2. Closeness centrality, Cc(v): nodes with short paths to all other nodes in the network have high closeness centrality. Cc(v) = [formula on slide]

1. Global Network Properties 6) Node Centralities Definitions: 3. Betweenness centrality, Cb(v): nodes (or edges) that occur in many of the shortest paths have high betweenness centrality. Cb(v) = [formula on slide]. In the formula, there is a sum on the top and on the bottom of the fraction. σst(v) = the number of shortest paths from s to t that pass through v. σst = the number of all shortest paths from s to t (they may or may not pass through node v).

1. Global Network Properties 6) Node Centralities Definitions: 4. Eccentricity centrality, Ce(v): nodes with short paths to any other node have high eccentricity centrality. The eccentricity of a node v is defined as $ecc(v) = \max_{u \in V} dist(v, u)$, i.e., the maximum shortest path length from node v to any other node u in V. Eccentricity centrality of a node v: $C_e(v) = 1/ecc(v)$. Thus, central nodes have higher Ce since they have lower ecc. There exist many other definitions of node centralities.
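The four centralities can be computed by hand-sized code on small connected graphs. The sketch below rests on assumptions that should be checked against the course formulas: closeness is taken as 1 / (sum of distances to all other nodes), eccentricity centrality as 1 / ecc(v), and betweenness as the sum over unordered pairs {s, t} of σst(v) / σst, which are common textbook forms and may be normalized differently on the slides. The function names and the three-node path graph are illustrative, and the graph is assumed to be connected with at least two nodes.

```python
from collections import deque
from itertools import combinations

def bfs(adj, s):
    """Return (dist, sigma): distances from s and the number of shortest paths from s."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:  # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma

def centralities(adj):
    """Degree, closeness, eccentricity, and betweenness centrality for every node."""
    info = {v: bfs(adj, v) for v in adj}
    degree = {v: len(adj[v]) for v in adj}
    closeness = {v: 1 / sum(info[v][0].values()) for v in adj}
    eccentricity = {v: 1 / max(info[v][0].values()) for v in adj}
    betweenness = {v: 0.0 for v in adj}
    for s, t in combinations(adj, 2):
        dist_s, sigma_s = info[s]
        dist_t, sigma_t = info[t]
        for v in adj:
            if v in (s, t):
                continue
            # v lies on a shortest s-t path iff dist(s, v) + dist(v, t) = dist(s, t)
            if dist_s[v] + dist_t[v] == dist_s[t]:
                betweenness[v] += sigma_s[v] * sigma_t[v] / sigma_s[t]
    return degree, closeness, eccentricity, betweenness

# Hypothetical path graph a - b - c: node b is the most central by every measure.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
degree, closeness, eccentricity, betweenness = centralities(path)
```

For the path graph, b has degree 2, closeness 1/2, eccentricity centrality 1, and betweenness 1, while the end nodes have degree 1, closeness 1/3, eccentricity centrality 1/2, and betweenness 0.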

1. Global Network Properties 6) Node Centralities Example: Degree Closeness Betweenness From highest D F, G H D, H to A, B I C, E, H C, E lowest J C, D, J

1. Global Network Properties 6) Node Centralities You need to know how to compute these centralities (and all other network properties) by hand on small networks. For large real-world networks, you could use software, e.g., CentiBiN. http://centibin.ipk-gatersleben.de/

Network Properties 2. Local Network Properties (Chapter 5 of the course textbook “Analysis of Biological Networks” by Junker and Schreiber) They encompass a larger number of constraints, thus reducing the degrees of freedom in which the networks being compared can vary. How do we show that two networks are different? How do we show that they are the same? How do we quantify the level of similarity?

Network Properties 2. Local Network Properties
1) Network motifs
2) Graphlets
Two network comparison measures based on graphlets:
2.1) Relative Graphlet Frequency Distance between two networks
2.2) Graphlet Degree Distribution Agreement between two networks

2. Local Network Properties 1) Network Motifs (Uri Alon’s group, 2002-2004) Definition: a network motif is a small, over-represented partial subgraph of a real network. Here, over-represented means over-represented when compared to networks coming from a random graph model. Problem: what is expected at random, i.e., which network “null model” should be used to identify motifs?

2. Local Network Properties 1) Network Motifs Example of a random graph model: Erdős-Rényi (ER) random graphs. Definition: a graph on n nodes (for some positive integer n) in which an edge is added between each pair of nodes independently at random with the same probability p. ER graphs usually have only a small number of dense (in terms of number of edges) subgraphs: there will be no regions in the network that have a large density of edges. Why?
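The ER definition translates almost directly into code. The sketch below is a minimal generator; the function name `erdos_renyi` and the `seed` parameter are assumptions made for illustration. It also hints at the answer to the question above: every pair of nodes is linked with the same probability p, independently of all others, so any given set of k nodes forms a clique only with probability p^(k(k-1)/2), which shrinks very quickly as k grows.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=None):
    """Undirected G(n, p) graph: each of the choose(n, 2) pairs is linked with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def edge_count(adj):
    """Number of undirected edges (each edge is stored at both endpoints)."""
    return sum(len(neighbors) for neighbors in adj.values()) // 2

n, p = 100, 0.05
graph = erdos_renyi(n, p, seed=42)
expected_edges = p * n * (n - 1) / 2           # 247.5 for these parameters
clique_probability = p ** (5 * (5 - 1) // 2)   # chance that 5 given nodes form a clique
```

Comparing `edge_count(graph)` with `expected_edges` shows the overall density, while `clique_probability` (about 1e-13 here) illustrates why dense regions are essentially absent.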

2. Local Network Properties 1) Network Motifs Example: If motifs are identified when comparing the data with ER model networks, every dense subgraph would come up as a motif because they do not exist in our ER model networks.


1) Network motifs (Uri Alon’s group, ’02-’04)
Small subgraphs that are overrepresented in a network when compared to randomized networks.
Network motifs:
- Reflect the underlying evolutionary processes that generated the network
- Carry functional information
- Define superfamilies of networks (Zi is the statistical significance of subgraph i; SPi is a vector of numbers in 0-1)
But:
- Functionally important but not statistically significant patterns could be missed
- The choice of the appropriate null model is crucial, especially across “families”
- Random graphs with the same in- and out-degree distribution as the data might not be the best network null model
- Motifs are partial subgraphs, while we use induced ones to understand network structure
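Over-representation is usually quantified with a Z-score: the subgraph count in the data minus the mean count in randomized networks, divided by the standard deviation of the randomized counts. The sketch below illustrates the idea under loud assumptions: it counts undirected triangles rather than the directed subgraphs used in the motif literature, and it uses ER graphs with matching density as the null model purely for brevity, even though the slide warns that the choice of null model is crucial. The function names and the example network are hypothetical.

```python
import random
import statistics
from itertools import combinations

def count_triangles(adj):
    """Number of node triples that are mutually connected."""
    return sum(1 for a, b, c in combinations(sorted(adj), 3)
               if b in adj[a] and c in adj[a] and c in adj[b])

def random_graph(n, p, rng):
    """ER graph on nodes 0..n-1 (assumed null model, chosen for simplicity)."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def z_score(real_count, random_counts):
    """(real - mean of random counts) / standard deviation of random counts."""
    mean = statistics.mean(random_counts)
    std = statistics.pstdev(random_counts)
    if std == 0:
        raise ValueError("random counts have zero variance; Z-score is undefined")
    return (real_count - mean) / std

# Hypothetical "data" network: two disjoint 5-cliques (20 triangles in total).
data = {v: set() for v in range(10)}
for block in (range(0, 5), range(5, 10)):
    for u, v in combinations(block, 2):
        data[u].add(v)
        data[v].add(u)

n = len(data)
m = sum(len(neighbors) for neighbors in data.values()) // 2
p = m / (n * (n - 1) / 2)  # match the edge density of the data
rng = random.Random(0)
random_counts = [count_triangles(random_graph(n, p, rng)) for _ in range(200)]
z = z_score(count_triangles(data), random_counts)
```

A clearly positive `z` indicates that triangles are more frequent in the data than in the chosen null model; with a different null model the same subgraph might not stand out.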

2. Local Network Properties 1) Network Motifs Example: Feed-forward loop Shen-Orr, Milo, Mangan, and Alon, “Network motifs in the transcriptional regulation network of Escherichia coli,” Nature Genetics, 2002

1) Network motifs (Uri Alon’s group, ’02-’04) http://www.weizmann.ac.il/mcb/UriAlon/ Also, see Pajek, MAVisto, and FANMOD

2) Graphlets (Przulj group, ’04-’10) Different from network motifs: they are induced subgraphs, and they can be of any frequency (they don’t need to be over-represented). N. Przulj, D. G. Corneil, and I. Jurisica, “Modeling Interactome: Scale Free or Geometric?,” Bioinformatics, vol. 20, num. 18, pg. 3508-3515, 2004.


2.1) Relative Graphlet Frequency (RGF) distance between networks G and H: N. Przulj, D. G. Corneil, and I. Jurisica, “Modeling Interactome: Scale Free or Geometric?,” Bioinformatics, vol. 20, num. 18, pg. 3508-3515, 2004.

2.2) Graphlet Degree Distributions: generalize node degree

N. Przulj, “Biological Network Comparison Using Graphlet Degree Distribution,” ECCB, Bioinformatics, vol. 23, pg. e177-e183, 2007.


Network structure vs. biological function & disease: Graphlet Degree (GD) vectors, or “node signatures.” T. Milenkovic and N. Przulj, “Uncovering Biological Network Function via Graphlet Degree Signatures,” Cancer Informatics, vol. 6, pg. 257-273, 2008.

Similarity measure between “node signature” vectors

Signature Similarity Measure between nodes u and v

Figure: SMD1, YBR095C, and PMA1: signature similarity 40%.

Figure: SMD1, RPO26, and SMB1: signature similarity 90%* (*statistically significant threshold at ~85%). T. Milenković and N. Pržulj, “Uncovering Biological Network Function via Graphlet Degree Signatures,” Cancer Informatics, 2008:6 257-273, 2008 (Highly Visible).

Later we will see how to use this and other techniques to link network structure with biological function

Generalize the degree distribution of a network. The degree distribution measures the number of nodes “touching” k edges, for each value of k. N. Przulj, “Biological Network Comparison Using Graphlet Degree Distribution,” Bioinformatics, vol. 23, pg. e177-e183, 2007.


(Formula annotation: the division by sqrt(2) makes the value lie between 0 and 1.) This is called the Graphlet Degree Distribution (GDD) Agreement between networks G and H.

Software that implements many of these network properties and compares networks with respect to them: GraphCrunch http://bio-nets.doc.ic.ac.uk/graphcrunch/

GraphCrunch 2: http://bio-nets.doc.ic.ac.uk/graphcrunch2/
