Introduction to Bioinformatics, Lecture 16: Intracellular Networks, Graph theory. Centre for Integrative Bioinformatics VU.

High-throughput Biological Data
Enormous amounts of biological data are being generated by high-throughput capabilities; even more are coming:
– genomic sequences
– gene expression data
– mass spectrometry data
– protein-protein interaction data
– protein structures
Hidden in these data is information that reflects the existence, organization, activity, functionality and so on of biological machineries at different levels in living organisms.

Bio-Data Analysis and Data Mining
Existing/emerging bio-data analysis and mining tools for:
– DNA sequence assembly
– Genetic map construction
– Sequence comparison and database search
– Gene finding
– ….
– Gene expression data analysis
– Phylogenetic tree analysis to infer horizontally-transferred genes
– Mass spec. data analysis for protein complex characterization
– ……
Current prevailing mode of work: developing ad hoc tools for each individual application.

Bio-Data Analysis and Data Mining
As the amount and types of data grow, and with them the need to establish connections across multiple data sources, the number of analysis tools needed will go up “exponentially”:
– blast, blastp, blastx, blastn, … from the BLAST family of tools
– gene finding tools for human, mouse, fly, rice, cyanobacteria, …..
– tools for finding various signals in genomic sequences: protein-binding sites, splice junction sites, translation start sites, …..
Many of these data analysis problems are fundamentally the same problem(s) and can be solved using the same set of tools.
Developing ad hoc tools for each application problem (by each group of individual researchers) may soon become inadequate as bio-data production capabilities ramp up further.

Data Clustering
Many biological data analysis problems can be formulated as clustering problems:
– microarray gene expression data analysis
– arrayCGH data (chromosomal gains and losses)
– identification of regulatory binding sites (similarly, splice junction sites, translation start sites, ......)
– (yeast) two-hybrid data analysis (for inference of protein complexes)
– phylogenetic tree clustering (for inference of horizontally transferred genes)
– protein domain identification
– identification of structural motifs
– prediction reliability assessment of protein structures
– NMR peak assignments
– ......

Data Clustering: an example
Regulatory binding sites are short conserved sequence fragments in promoter regions.
Solving binding-site identification as a clustering problem:
– Project all fragments into Euclidean space so that similar fragments are projected to nearby positions and dissimilar fragments to distant positions
– Observation: conserved fragments form “clusters” in a noisy background
Example fragments: acgtttataatggcg ggctttatattcgtc ccgaatataatcta

Data Clustering Problems
Clustering: partition a data set into clusters so that data points of the same cluster are “similar” and points of different clusters are “dissimilar”.
Cluster identification: identify clusters whose features differ significantly from the background.

Multivariate statistics – Cluster analysis
[Figure: raw table with columns C1, C2, C3, C4, C5, C6, ..; any set of numbers per column.]
Multi-dimensional problems: objects can be viewed as a cloud of points in a multidimensional space. We need ways to group the data.

Multivariate statistics – Cluster analysis
[Figure: workflow from the raw table (columns C1 to C6, any set of numbers per column), via a similarity criterion, to a similarity matrix of scores, and via a cluster criterion to a dendrogram.]

Cluster analysis – data normalisation/weighting
[Figure: a normalisation criterion turns the raw table (columns C1 to C6) into a normalised table.]
– Column normalisation: x/max
– Column range normalisation: (x-min)/(max-min)
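The two normalisation criteria can be sketched as follows. This is a minimal illustration, assuming the table is a list of rows, that every column maximum is non-zero, and that no column is constant; the function names and the example values are hypothetical.

```python
def column_normalise(table):
    """Column normalisation: divide each value by its column maximum (x/max)."""
    maxima = [max(col) for col in zip(*table)]
    return [[x / m for x, m in zip(row, maxima)] for row in table]


def range_normalise(table):
    """Column range normalisation: (x - min) / (max - min) per column."""
    lows = [min(col) for col in zip(*table)]
    highs = [max(col) for col in zip(*table)]
    return [
        [(x - lo) / (hi - lo) for x, lo, hi in zip(row, lows, highs)]
        for row in table
    ]


# Hypothetical raw table: 3 objects (rows), 2 columns.
raw = [[2.0, 10.0], [4.0, 30.0], [8.0, 20.0]]
by_max = column_normalise(raw)
by_range = range_normalise(raw)
```

After range normalisation every column spans exactly 0 to 1, so no single column dominates the distance calculation that follows.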

Cluster analysis – (dis)similarity matrix
[Figure: a similarity criterion turns the raw table (columns C1 to C6) into a similarity matrix of scores.]
Minkowski metrics: $D_{i,j} = \left( \sum_k |x_{ik} - x_{jk}|^r \right)^{1/r}$
– r = 2: Euclidean distance
– r = 1: City block distance
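A short sketch of the Minkowski metric and the resulting distance matrix, assuming each object is a sequence of numbers of equal length. The function names and example points are illustrative only.

```python
def minkowski(x, y, r=2):
    """Minkowski distance: (sum_k |x_k - y_k|^r)^(1/r)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def distance_matrix(objects, r=2):
    """All-against-all (dis)similarity matrix for a list of objects."""
    return [[minkowski(p, q, r) for q in objects] for p in objects]


euclidean = minkowski((0.0, 0.0), (3.0, 4.0), r=2)   # r = 2
city_block = minkowski((0.0, 0.0), (3.0, 4.0), r=1)  # r = 1
matrix = distance_matrix([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
```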

Cluster analysis – Clustering criteria
[Figure: a cluster criterion turns the 5×5 similarity matrix of scores into a dendrogram (tree).]
Cluster criteria:
– Single linkage: nearest neighbour
– Complete linkage: furthest neighbour
– Group averaging: UPGMA (phylogeny)
– Ward
– Neighbour joining: global measure (phylogeny)

Cluster analysis – Clustering criteria
1. Start with N clusters of 1 object each
2. Apply the clustering distance criterion iteratively until you have 1 cluster of N objects
3. The most interesting clustering is somewhere in between
[Figure: dendrogram (tree) with a distance axis running from N clusters to 1 cluster.]

Single linkage clustering (nearest neighbour)
[Figure sequence: scatter plot of objects with axes Char 1 and Char 2, clustered step by step.]
Distance from point to cluster is defined as the smallest distance between that point and any point in the cluster.

Single linkage clustering (nearest neighbour)
Single linkage dendrograms typically show chaining behaviour (i.e., again and again a single object is added to an existing cluster).
Let $C_i$ and $C_j$ be two disjoint clusters: $d_{i,j} = \min(d_{p,q})$, where $p \in C_i$ and $q \in C_j$

Complete linkage clustering (furthest neighbour)
[Figure sequence: scatter plot of objects with axes Char 1 and Char 2, clustered step by step.]
Distance from point to cluster is defined as the largest distance between that point and any point in the cluster.

Complete linkage clustering (furthest neighbour)
More ‘structured’ clusters than with single linkage clustering.
Let $C_i$ and $C_j$ be two disjoint clusters: $d_{i,j} = \max(d_{p,q})$, where $p \in C_i$ and $q \in C_j$

Clustering algorithm
1. Initialise the (dis)similarity matrix
2. Take the two points with the smallest distance as the first cluster
3. Merge the corresponding rows/columns in the (dis)similarity matrix
4. Repeat steps 2 and 3, using the appropriate cluster measure, until the last two clusters are merged
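The steps above can be sketched in a few lines. This is a simplified illustration rather than the matrix-merging procedure itself: instead of merging rows and columns it recomputes cluster distances from the points each round, which gives the same merges but is slower. The function name `agglomerate` and the example points are assumptions made for the sketch; passing `min` gives single linkage and `max` gives complete linkage.

```python
import math


def agglomerate(points, linkage=min):
    """Agglomerative clustering; returns the merge history.

    Each entry is (cluster_a, cluster_b, distance), where clusters are
    tuples of point indices. linkage=min: single linkage; linkage=max:
    complete linkage.
    """
    clusters = [(i,) for i in range(len(points))]  # N clusters of 1 object

    def dist(a, b):
        return linkage(math.dist(points[p], points[q]) for p in a for q in b)

    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen criterion.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: dist(*pair),
        )
        merges.append((a, b, dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (7.0, 0.0)]
single = agglomerate(pts, linkage=min)
complete = agglomerate(pts, linkage=max)
```

The merge distances are the heights at which branches join in the dendrogram; only the final merge differs between the two criteria in this small example.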

Average linkage clustering (Unweighted Pair Group Method with Arithmetic Mean, UPGMA)
[Figure: scatter plot of objects with axes Char 1 and Char 2.]
Distance from cluster to cluster is defined as the average over all distances between a point in one cluster and a point in the other.

UPGMA
Let $C_i$ and $C_j$ be two disjoint clusters: $d_{i,j} = \frac{1}{|C_i| \times |C_j|} \sum_{p} \sum_{q} d_{p,q}$, where $p \in C_i$ and $q \in C_j$
In words: calculate the average over all pairwise inter-cluster distances.
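For comparison, the three cluster-to-cluster distances (minimum, maximum and average over all inter-cluster pairs) can be written side by side. This is a minimal sketch that assumes clusters are lists of coordinate tuples and uses Euclidean distance; the function names and the two example clusters are hypothetical.

```python
import math


def single_linkage(ci, cj):
    """d_ij = min d_pq over p in Ci, q in Cj (nearest neighbour)."""
    return min(math.dist(p, q) for p in ci for q in cj)


def complete_linkage(ci, cj):
    """d_ij = max d_pq over p in Ci, q in Cj (furthest neighbour)."""
    return max(math.dist(p, q) for p in ci for q in cj)


def average_linkage(ci, cj):
    """d_ij = sum of all d_pq divided by |Ci| x |Cj| (UPGMA)."""
    total = sum(math.dist(p, q) for p in ci for q in cj)
    return total / (len(ci) * len(cj))


ci = [(0.0, 0.0), (1.0, 0.0)]
cj = [(4.0, 0.0), (9.0, 0.0)]
d_single = single_linkage(ci, cj)      # closest pair: (1,0) and (4,0)
d_complete = complete_linkage(ci, cj)  # furthest pair: (0,0) and (9,0)
d_average = average_linkage(ci, cj)    # mean of the four pair distances
```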

Multivariate statistics – Cluster analysis
[Figure: workflow from the data table (columns C1 to C6), via a similarity criterion, to a similarity matrix of scores, and via a cluster criterion to a phylogenetic tree.]

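Multivariate statistics – Cluster analysis
[Figure: the same table (columns C1 to C6) is clustered twice, once through a matrix of scores between rows and once through a 6×6 matrix of scores between columns, each with its own similarity criterion and cluster criterion.]
Make a two-way ordered table using the dendrograms.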

Multivariate statistics – Two-way cluster analysis
[Figure: table with columns reordered as C4, C3, C6, C1, C2, C5.]
Make a two-way (rows, columns) ordered table using the dendrograms. This shows ‘blocks’ of numbers that are similar.

Graph theory
The river Pregel in Königsberg: the Königsberg bridge problem and Euler’s graph.
Can you start at some land area (S1, S2, I1, I2) and walk each bridge exactly once, returning to the starting land area?
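Euler's answer rests on vertex degrees: a connected graph has a closed walk that uses every edge exactly once only if every vertex has even degree. The sketch below counts degrees for the seven historical bridges. Which of the slide's labels (S1, S2 for the river banks, I1, I2 for the islands) corresponds to which land area is an assumption made here for illustration; the degree pattern does not depend on it.

```python
from collections import Counter

# The seven bridges as edges of a multigraph (label assignment assumed):
# I1 is taken to be the island with five bridges.
bridges = [
    ("I1", "S1"), ("I1", "S1"),
    ("I1", "S2"), ("I1", "S2"),
    ("I1", "I2"),
    ("I2", "S1"),
    ("I2", "S2"),
]

degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

# A closed walk over every bridge exactly once needs all degrees even
# (the graph must also be connected, which it is here).
has_euler_circuit = all(d % 2 == 0 for d in degree.values())
```

Every land area has an odd number of bridges, so the walk asked for in the question does not exist.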

Graphs - definition
– Digraphs: directed graphs
– Complete graphs: have all possible edges
– Planar graphs: can be drawn in 2D with no crossing edges (e.g. chip design)

Graphs - definition
[Figure: a graph and its adjacency matrix.]
An undirected graph has a symmetric adjacency matrix. A digraph typically has a non-symmetric adjacency matrix.
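The relationship between a graph and its adjacency matrix can be illustrated with a small sketch, assuming vertices are numbered 0 to n-1 and edges are given as pairs. The helper names are hypothetical.

```python
def adjacency_matrix(n, edges, directed=False):
    """Build an n x n 0/1 adjacency matrix from a list of (u, v) edges."""
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = 1
        if not directed:
            a[v][u] = 1  # an undirected edge is recorded in both directions
    return a


def is_symmetric(a):
    """True if a[i][j] == a[j][i] for all i, j."""
    n = len(a)
    return all(a[i][j] == a[j][i] for i in range(n) for j in range(n))


edges = [(0, 1), (1, 2)]
undirected = adjacency_matrix(3, edges)
digraph = adjacency_matrix(3, edges, directed=True)
```

The same edge list gives a symmetric matrix when read as undirected edges and a non-symmetric one when read as arcs of a digraph.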

Example application – OBSTRUCT: creating non-redundant datasets of protein structures
– Based on all-against-all global sequence alignment
– Create an all-against-all sequence similarity matrix
– Filter the matrix based on the desired similarity range (convert to ‘0’ and ‘1’ values)
– Form a maximum clique (largest complete subgraph) by ordering rows and columns
– Finding a maximum clique is NP-hard (its decision version is NP-complete, where NP stands for nondeterministic polynomial time): no polynomial-time algorithm is known, and exact methods scale exponentially with the number of vertices (proteins) in the worst case

Example application 1 – OBSTRUCT: creating non-redundant datasets of protein structures
Statistical research on protein structures typically requires a database of a maximum number of non-redundant (i.e. non-homologous) structures.
Often, two structures that have a sequence identity of less than 25% are taken as non-redundant.
Given an initial set of N structures (with corresponding sequences) and all-against-all pair-wise alignments: find the largest possible subset in which each sequence has <25% sequence identity with every other sequence in the subset.
Heringa, J., Sommerfeldt, H., Higgins, D., and Argos, P. (1992). Obstruct: a program to obtain largest cliques from a protein sequence set according to structural resolution and sequence similarity. Comp. Appl. Biosci. (CABIOS) 8,

Example application 1 – OBSTRUCT: creating non-redundant datasets of protein structures (cont.)
The problem can now be formalised as follows:
– Make a graph containing all sequences as vertices (nodes)
– Connect two nodes with an edge if their sequence identity is < 25%
– Make an adjacency matrix following the above rules
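The formalisation above amounts to thresholding the identity matrix. The sketch below is an illustration only, not the OBSTRUCT program: the function name, the made-up identity percentages and the choice to leave the diagonal at 0 are all assumptions.

```python
def nonredundancy_graph(identity, threshold=25.0):
    """Adjacency matrix with a 1 wherever two different sequences
    share less than `threshold` percent sequence identity."""
    n = len(identity)
    return [
        [1 if i != j and identity[i][j] < threshold else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical all-against-all percent identities for four sequences.
identity = [
    [100.0, 20.0, 60.0, 10.0],
    [20.0, 100.0, 15.0, 30.0],
    [60.0, 15.0, 100.0, 22.0],
    [10.0, 30.0, 22.0, 100.0],
]
adj = nonredundancy_graph(identity)
```

An edge here means "these two proteins may appear together in a non-redundant set", so a non-redundant set is a group of vertices that are all mutually connected.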

Example application 1 – OBSTRUCT: creating non-redundant datasets of protein structures (cont.)
The algorithm: try to reorder the rows (and the columns in the same way) such that we get a square consisting only of 1’s in the upper left corner. This square corresponds to a complete subgraph (also called a clique) containing a set of non-redundant proteins.

Example application 1 – OBSTRUCT: creating non-redundant datasets of protein structures (cont.)
Starting from the adjacency matrix:
1. Order the sum array and reorder rows and columns accordingly…
2. Estimate the largest possible clique and take the subset of the adjacency matrix containing only rows with enough 1s
3. For a clique of size N, a subset of M rows (and columns), where M ≥ N, with at least N 1s is selected
4. Go to step 1
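To make the clique requirement concrete, here is a brute-force search for a largest clique in a small adjacency matrix. This is explicitly not the OBSTRUCT heuristic of ordering and pruning rows; it simply tries every vertex subset from largest to smallest, which is only feasible for a handful of vertices and shows why the problem becomes expensive. All names and the example matrix are hypothetical.

```python
from itertools import combinations


def is_clique(adj, nodes):
    """True if every pair of the given nodes is connected."""
    return all(adj[u][v] for u, v in combinations(nodes, 2))


def largest_clique(adj):
    """Exhaustive search, largest subsets first (exponential time)."""
    n = len(adj)
    for size in range(n, 0, -1):
        for nodes in combinations(range(n), size):
            if is_clique(adj, nodes):
                return list(nodes)
    return []


# Hypothetical graph: vertices 0, 1, 2 are mutually connected; 3 only to 2.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
clique = largest_clique(adj)
```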

Some books call a graph that may contain multiple edges or loops a multigraph, and reserve the word graph for one that has neither. Other books allow multiple edges or loops in a graph, and then call a graph without multiple edges and loops a simple graph.

Remarks
– A multigraph might have no multiple edges or loops.
– Every (simple) graph is a multigraph, but not every multigraph is a (simple) graph.
– Every graph considered here is finite.
– Sometimes even “multigraph” folks talk about a “simple graph” to emphasize that there are no multiple edges and loops.

Further definitions
[Figure: the complete bipartite graph $K_{3,3}$.]
A graph is bipartite if its vertices can be partitioned into two disjoint subsets U and V such that each edge connects a vertex from U to one from V.
A bipartite graph is a complete bipartite graph if every vertex in U is connected to every vertex in V. If U has n elements and V has m, then we denote the resulting complete bipartite graph by $K_{n,m}$.
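Whether a graph is bipartite can be tested by trying to colour it with two colours so that no edge joins equal colours. The sketch below does this with a breadth-first search over an adjacency-list dictionary; the function name and the vertex labels are assumptions for illustration.

```python
from collections import deque


def bipartition(adj_list):
    """Return (U, V) if the graph is bipartite, otherwise None."""
    colour = {}
    for start in adj_list:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj_list[u]:
                if v not in colour:
                    colour[v] = 1 - colour[u]  # opposite side of u
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return None  # an edge inside one side: not bipartite
    side_u = {v for v, c in colour.items() if c == 0}
    side_v = {v for v, c in colour.items() if c == 1}
    return side_u, side_v


# K_{3,3}: every u is connected to every v.
left, right = ["u1", "u2", "u3"], ["v1", "v2", "v3"]
k33 = {u: list(right) for u in left}
k33.update({v: list(left) for v in right})
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}

parts = bipartition(k33)
```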

The Stable Marriage Algorithm
In mathematics, the stable marriage problem (SMP) is the problem of finding a stable matching: a matching in which no element of the first set and element of the second set both prefer each other to their current partners. It is commonly stated as:
– Given n men and n women, where each person has ranked all members of the opposite sex with a unique number between 1 and n in order of preference, marry the men and women off such that there are no two people of opposite sex who would both rather have each other than their current partners. If there are no such people, all the marriages are "stable".
In 1962, David Gale and Lloyd Shapley proved that, for any equal number of men and women, it is always possible to solve the SMP and make all marriages stable.

The Stable Marriage Algorithm
Also called the Gale-Shapley algorithm (see preceding slide).
– Given two non-overlapping, equally sized sets of men (A, B, C, ..) and women (a, b, c, …), where each man and woman has a preference list ranking the persons of the opposite sex
– A pairing denotes a 1-to-1 correspondence between men and women (each man marries one woman)
– A pairing is unstable if there are couples X-x and Y-y such that X prefers y to x and y prefers X to Y; if this happens, the pair X-y is called unsatisfied
– A pairing in which there are no unsatisfied couples is called a stable pairing or stable marriage
– The Stable Marriage Algorithm finds a stable pairing, i.e. a stable perfect matching in the bipartite graph of men and women

A: abcd denotes the preferences of A (likes a the most, then b, then c, while d is liked least)
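A compact sketch of the Gale-Shapley procedure with men proposing, using the same notation for preference lists (a string such as "abc" lists partners from most to least liked). The function names and the example preferences are hypothetical; `is_stable` checks the definition of an unsatisfied pair directly.

```python
def gale_shapley(men_prefs, women_prefs):
    """Men propose in order of preference; returns {man: woman}."""
    rank = {w: {m: i for i, m in enumerate(p)} for w, p in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}  # index of next woman to ask
    engaged = {}                             # woman -> man
    free = list(men_prefs)
    while free:
        m = free.pop()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        if w not in engaged:
            engaged[w] = m
        elif rank[w][m] < rank[w][engaged[w]]:
            free.append(engaged[w])  # w trades up; her old partner is free
            engaged[w] = m
        else:
            free.append(m)           # w rejects m; he asks the next woman
    return {m: w for w, m in engaged.items()}


def is_stable(matching, men_prefs, women_prefs):
    """True if no man and woman prefer each other to their partners."""
    husband = {w: m for m, w in matching.items()}
    for m, prefs in men_prefs.items():
        for w in prefs:
            if w == matching[m]:
                break  # the remaining women rank below his wife
            if women_prefs[w].index(m) < women_prefs[w].index(husband[w]):
                return False  # m and w form an unsatisfied pair
    return True


men = {"A": "abc", "B": "bac", "C": "abc"}
women = {"a": "BAC", "b": "ABC", "c": "ABC"}
pairing = gale_shapley(men, women)
```

The result does not depend on the order in which free men are processed; a list used as a stack is simply the shortest way to hold them.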

The Stable Marriage Algorithm
The Gale-Shapley pairing, in the form presented here, is male-optimal and female-pessimal (it would be the reverse, of course, if the roles of "male" and "female" participants in the algorithm were interchanged).
To see this, consider the definition of a feasible marriage. We say that the marriage between man A and woman B is feasible if there exists a stable pairing in which A and B are married. When we say a pairing is male-optimal, we mean that every man is paired with his highest ranked feasible partner. Similarly, a female-pessimal pairing is one in which each female is paired with her lowest ranked feasible partner.
This means that, on average, men pair with women who are higher in their preference list than these men are in the preference lists of the women paired to them.

A Theoretical Framework
Representation of a set of n-dimensional (n-D) points as a graph:
– each data point is represented as a node
– each pair of points is represented as an edge with a weight defined by the “distance” between the two points
[Figure: n-D data points, their graph representation and the distance matrix.]
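The graph representation can be sketched as a dictionary of weighted edges, one per pair of points. Euclidean distance is assumed as the “distance”, and the function name and example points are illustrative.

```python
import math
from itertools import combinations


def complete_graph(points):
    """Map each node pair (i, j), i < j, to the distance between the points."""
    return {
        (i, j): math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
graph = complete_graph(points)  # n points give n*(n-1)/2 weighted edges
```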

A Theoretical Framework
– Spanning tree: a sub-graph that has all nodes connected and has no cycles
– Minimum spanning tree: a spanning tree with the minimum total distance
[Figure: panels (a), (b) and (c).]

Spanning tree: Prim’s algorithm (graph, tree)
– step 1: select an arbitrary node as the current tree
– step 2: find an external node that is closest to the tree, and add it with its corresponding edge into the tree
– step 3: repeat step 2 until all nodes are connected in the tree
[Figure: panels (a) to (e) showing the tree growing one edge at a time.]
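The three steps can be sketched as follows, assuming the graph is the complete graph on a set of points with Euclidean edge weights. The function name and the example points are hypothetical; for each node outside the tree the sketch keeps its cheapest known connection to the tree.

```python
import math


def prim(points, start=0):
    """Return the MST as a list of (tree_node, new_node, weight) edges."""
    # best[v] = (distance to the tree, closest tree node) for external nodes
    best = {
        v: (math.dist(points[start], points[v]), start)
        for v in range(len(points))
        if v != start
    }
    edges = []
    while best:
        v = min(best, key=lambda u: best[u][0])  # closest external node
        weight, parent = best.pop(v)
        edges.append((parent, v, weight))
        for u in best:  # v is now in the tree; update remaining nodes
            d = math.dist(points[v], points[u])
            if d < best[u][0]:
                best[u] = (d, v)
    return edges


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
mst = prim(points)
total = sum(w for _, _, w in mst)
```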

Spanning tree: Kruskal’s algorithm
– step 1: consider edges in non-decreasing order of weight
– step 2: if the selected edge does not form a cycle, add it to the tree; otherwise reject it
– step 3: repeat step 2 with the next edge until all nodes are connected in the tree
[Figure: panels (a) to (f) showing edges considered in order of weight, with cycle-forming edges rejected.]
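A sketch of Kruskal's steps, using a union-find structure to detect whether an edge would close a cycle (that choice of data structure is an assumption of this sketch, not something stated on the slide). Edges are given as (weight, u, v) tuples and the example weights are made up.

```python
def kruskal(n, edges):
    """Return MST edges (u, v, weight) for nodes 0..n-1."""
    parent = list(range(n))

    def find(x):
        # Root of x's component, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):  # non-decreasing order of weight
        root_u, root_v = find(u), find(v)
        if root_u != root_v:            # different components: no cycle
            parent[root_u] = root_v
            tree.append((u, v, weight))
            if len(tree) == n - 1:
                break                   # all nodes are connected
        # otherwise the edge would form a cycle and is rejected
    return tree


edges = [(3, 0, 1), (4, 1, 2), (5, 0, 2), (3, 2, 3), (6, 1, 3)]
tree = kruskal(4, edges)
```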

A Theoretical Framework
A formal definition of a cluster:
– C forms a cluster in D only if, for any partition C = C1 ∪ C2 into two non-empty parts, the point of D − C1 that is closest to C1 belongs to C2. In other words, the closest connection to C1 from the points outside C1 must come from within C2.
Key result: for any data set D, each of its clusters is represented by a sub-tree of its MST.

A Theoretical Framework
We can use the result on the preceding slide for clustering using Prim’s algorithm:
– The order of nodes as selected by Prim’s algorithm defines a linear representation, L(D), of a data set D
– We can plot the node distances in Prim’s minimum spanning tree against the order of the nodes as they were added by Prim’s algorithm (see earlier slide)
Any contiguous block in L(D) represents a cluster if and only if its elements form a sub-tree of the MST, plus some minor additional conditions (each cluster forms a valley).
[Figure: example graph with nodes A to E, and a plot of edge weight (distance) against the order in which Prim’s algorithm adds the nodes.]
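The linear representation can be sketched by recording, for each node Prim's algorithm adds, the weight of the edge that attached it. Cutting the sequence wherever that weight exceeds a fixed threshold is a deliberate simplification introduced here; it stands in for the valley criterion and is not the exact condition referred to above. The function names, the threshold and the example points are all assumptions.

```python
import math


def prim_order(points, start=0):
    """Return (order, weights): nodes in the order Prim adds them, and
    the weight of the edge that attached each node after the first."""
    order, weights = [start], []
    best = {
        v: math.dist(points[start], points[v])
        for v in range(len(points))
        if v != start
    }
    while best:
        v = min(best, key=best.get)  # external node closest to the tree
        weights.append(best.pop(v))
        order.append(v)
        for u in best:
            best[u] = min(best[u], math.dist(points[v], points[u]))
    return order, weights


def split_at_peaks(order, weights, threshold):
    """Cut L(D) wherever the attaching edge is longer than threshold."""
    clusters, current = [], [order[0]]
    for node, weight in zip(order[1:], weights):
        if weight > threshold:  # a peak separates two valleys
            clusters.append(current)
            current = []
        current.append(node)
    clusters.append(current)
    return clusters


# Two well-separated groups of points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0),
          (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
order, weights = prim_order(points)
clusters = split_at_peaks(order, weights, threshold=5.0)
```

Plotting `weights` against position in `order` gives the kind of profile described above: runs of small values (valleys) separated by a large value where the tree jumps from one group to the next.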

A Theoretical Framework
Valleys correspond to clusters (red bars).
So, if we first calculate all pairwise distances between the dots in the plots below, convert the dots to a (complete) graph, use Prim’s algorithm to calculate an MST, and then make the plot as on the preceding slide, we get, for real-size data, a nice clustering algorithm that can also be used for signal/noise filtering.

Take home messages
– Revise agglomerative clustering (e.g. nearest/furthest neighbour, group averaging)
– Learn about graphs (complete, (un)directed, adjacency matrix, bipartite, etc.)
– Understand the relationship between a graph and its adjacency matrix
– Understand the Stable Marriage Algorithm
– Understand Prim’s and Kruskal’s algorithms
– Make sure you understand the clustering method based on the minimum spanning tree made by Prim’s algorithm (preceding slide)