Approximate Labelled Subtree Homeomorphism Based on:  “Approximate Labelled Subtree Homeomorphism” R. Y. Pinter, O.Rokhlenko, D. Tsur, M. Ziv-Ukelson.


Approximate Labelled Subtree Homeomorphism
Based on:
• “Approximate Labelled Subtree Homeomorphism”, R. Y. Pinter, O. Rokhlenko, D. Tsur, M. Ziv-Ukelson
• “Alignment of Metabolic Pathways”, R. Y. Pinter, O. Rokhlenko, E. Yeger-Lotem, M. Ziv-Ukelson

The general idea
• Biological problem
• Converting it into a computer science problem
• Finding the solution
• Reverting back to biological terms

Metabolism

Signal transduction
[Figure: signal transduction pathway; labels include IL-2, Th1, TNF, IFN, Proliferation, IL-12, Ag Stimuli, Thnp, IL-12R, T-Bet, Stat 4]

Why pathways?
• Metabolic and regulatory pathways have biological importance.
• These pathways are evolutionarily conserved.

What do we want to do?
• Compare a metabolic pathway of a certain organism against the same metabolic pathway in other organisms.
• Compare a metabolic pathway against other metabolic pathways in the same organism.

How do we do (it)?

The subtree homeomorphism problem: Given a pattern tree P and a text tree T, find a subtree t of T that becomes isomorphic to P after degree-2 nodes are deleted from t (the two edges of each deleted node merge into one), or decide that no such subtree exists. Nodes may be deleted from the text tree only.

Graph homeomorphism
[Figure, shown in three steps: a text tree and a pattern tree, compared first by node colors and then by labels (similarity) and topology]

Back to 2nd semester…
• An unrooted tree is an undirected, acyclic, connected graph $T = (V_T, E_T)$.
• A rooted tree is a triple $T_r = (V_T, E_T, r)$, where $(V_T, E_T)$ is an unrooted tree and $r$ is a vertex in $V_T$ called the root. The root determines the direction of every edge in the tree.
• A multi-source tree is an acyclic, directed graph whose underlying undirected graph is a tree.

Back to 2nd semester…
A tree is said to be ordered if the relative order of the subtrees at each node is fixed. Otherwise the tree is unordered.
[Figure: an example that poses a problem for “ordered” trees]

What are we allowed to do?
• We take into account both label similarity and topology.
• We are permitted to delete vertices from the text tree.
• We are NOT permitted to delete vertices from the pattern tree.

What we “gonna” see today:
• Rooted unordered: $O(m^2 n + mn \log n)$
• Unrooted unordered: $O(m^2 n + mn \log n)$
• Directed multi-source unordered: $O(m^2 n + mn \log n)$
• Rooted ordered: $O(mn)$

Some definitions:
• Let Δ denote a predefined node-to-node similarity score table.
• Let D denote a predefined score for deleting a node from a tree (usually a penalty).
• A mapping M from $T_1$ to $T_2$ is a partial one-to-one map from the nodes of $T_1$ to the nodes of $T_2$ that preserves the ancestor relations of the nodes.

Our problem:
Let $M$ be a mapping from $T_1$ to $T_2$. The Labelled Subtree Homeomorphic Similarity Score of $M[T_1, T_2]$ is:
$LSH(M[T_1, T_2]) = D \cdot (|T_1| - |T_2|) + \sum_{(u,v) \in M} \Delta[u,v]$
Given two undirected labelled trees $P$ and $T$, we want to find a mapping $M$ and a subtree $t$ of $T$ such that $LSH(M[t, P])$ is maximal.
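The score formula above can be evaluated directly once a mapping is fixed. The following is a minimal sketch, not the authors' code: the function name `lsh_score`, the dictionary layout of Δ and the toy values are all assumptions made for illustration.

```python
def lsh_score(mapping, size_t1, size_t2, delta, deletion_penalty):
    """LSH(M[T1, T2]) = D * (|T1| - |T2|) + sum of delta[u, v] over (u, v) in M.

    mapping          -- list of (u, v) pairs, u a node of T1, v a node of T2
    size_t1, size_t2 -- number of nodes in T1 and T2
    delta            -- dict {(u, v): similarity}; hypothetical layout
    deletion_penalty -- the score D charged per deleted node
    """
    deleted_nodes = size_t1 - size_t2
    similarity = sum(delta[(u, v)] for u, v in mapping)
    return deletion_penalty * deleted_nodes + similarity


# Toy instance (assumed values): a 3-node text subtree w-v-u is mapped onto a
# 2-node pattern b-a, so exactly one text node (v) is deleted.
toy_delta = {("w", "b"): 3, ("u", "a"): 10}
print(lsh_score([("w", "b"), ("u", "a")], 3, 2, toy_delta, -1))  # prints 12
```

The deletion term only depends on the size difference, so the search for the best mapping is really a search over which text nodes to keep.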

Scoring
[Figure: a text tree and a pattern tree with three example alignments, scored 2, 2 and 5]

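Dynamic programming
[Figure: text tree T with node v and children y_1, y_2, y_3; pattern tree P with node u and children x_1, x_2]
Score table (rows: nodes of T, columns: nodes of P):
        x_1     x_2     …     u
y_1     w_11    w_12    …     w_1m
y_2     w_21    w_22    …     w_2m
y_3     w_31    w_32    …     w_3m
…
v       w_n1    w_n2    …     w_nm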

RScore[v,u] is the maximum of two terms:
• The node-to-node similarity value Δ[v,u] plus the sum of the weights of the matched edges in the maximal assignment over G, the bipartite graph between the children of u and the children of v. This term is only computed if c(u) ≤ c(v) (otherwise it is -∞).
• The weight RScore[$y_i$,u] for the comparison of u and the best-scoring child $y_i$ of v, updated with the penalty for deleting v.
Here c(u) is the number of children of u.

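RScore[v,u] example (deletion penalty = -1)
Pattern: b is the parent of a. Text: w is the parent of v, and v is the parent of u.
Score matrix Δ (rows: text nodes, columns: pattern nodes):
        a       b
u       10      (not shown)
v       5       -2
w       3       3
RScore table:
        a                 b
u       10                -∞
v       9 (deletion)      8
w       8 (deletion)      12
RScore[v,a] = max{5, 10 - 1} = 9
RScore[w,a] = max{3, 9 - 1} = 8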

The assignment problem
Let G = (X ∪ Y, E) be a bipartite graph with a weight w(x,y) on every edge. The assignment problem is to compute a matching M (a set of pairs in which every node appears at most once) such that:
• The size of M is maximal among all matchings.
• Among all matchings of that size, the sum of the weights is maximal.
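To make the two-level objective concrete (size first, weight second), here is a brute-force sketch that simply enumerates matchings. It is exponential and only meant for tiny inputs; the function name `best_assignment` and the edge-dictionary layout are assumptions for illustration, not part of the presented algorithm.

```python
def best_assignment(xs, ys, w):
    """Return (size, weight, pairs) of a maximum-size matching of maximum weight.

    xs, ys -- the two sides of the bipartite graph
    w      -- dict {(x, y): weight}; a missing key means "no edge"
    """
    best = (0, 0, [])

    def extend(i, used, size, weight, pairs):
        nonlocal best
        if i == len(xs):
            # Tuples compare lexicographically: size first, then weight.
            if (size, weight) > best[:2]:
                best = (size, weight, list(pairs))
            return
        x = xs[i]
        extend(i + 1, used, size, weight, pairs)  # leave x unmatched
        for y in ys:
            if y not in used and (x, y) in w:
                used.add(y)
                pairs.append((x, y))
                extend(i + 1, used, size + 1, weight + w[(x, y)], pairs)
                pairs.pop()
                used.remove(y)

    extend(0, set(), 0, 0, [])
    return best


# Size beats weight: matching x1-y1 alone weighs 10, but the size-2 matching
# {x1-y1, x2-y2} with weight 5 is the answer.
print(best_assignment(["x1", "x2"], ["y1", "y2"],
                      {("x1", "y1"): 10, ("x2", "y1"): 1, ("x2", "y2"): -5}))
```

In practice the enumeration would be replaced by a polynomial algorithm such as a min cost max flow computation.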

Solving the assignment problem
• Reduction from the assignment problem to the min cost max flow problem (among all maximum flows, we choose the cheapest).
• We construct G', which contains G = (V, E) with the following changes:
  • Two more vertices: s and t
  • Edges from s to every node of X and from every node of Y to t, with w(s,x) = 0 and w(y,t) = 0
  • The cost of every other edge (x,y) in E is -w(x,y)
  • The capacity of every edge is 1
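The reduction can be sketched in a few lines. The version below uses successive shortest augmenting paths with Bellman-Ford, which is a simple stand-in chosen for readability and not one of the faster algorithms named in the talk; the function name and the node encoding are assumptions.

```python
def assignment_by_min_cost_flow(xs, ys, w):
    """Solve the assignment problem on G' = G + {s, t} by min cost max flow.

    w maps (x, y) -> weight. Returns (matching size, total weight, pairs).
    """
    inf = float("inf")
    nodes = ["s", "t"] + [("x", x) for x in xs] + [("y", y) for y in ys]
    # Residual edges are stored as [target, capacity, cost, index of reverse edge].
    graph = {v: [] for v in nodes}

    def add_edge(a, b, cost):
        graph[a].append([b, 1, cost, len(graph[b])])       # capacity 1
        graph[b].append([a, 0, -cost, len(graph[a]) - 1])  # residual edge

    for x in xs:
        add_edge("s", ("x", x), 0)
    for y in ys:
        add_edge(("y", y), "t", 0)
    for (x, y), weight in w.items():
        add_edge(("x", x), ("y", y), -weight)               # cost is -w(x, y)

    flow, cost = 0, 0
    while True:
        # Bellman-Ford: cheapest s-t path in the residual graph.
        dist = {v: inf for v in nodes}
        dist["s"] = 0
        prev = {}
        for _ in range(len(nodes) - 1):
            changed = False
            for a in nodes:
                if dist[a] == inf:
                    continue
                for i, (b, cap, c, _rev) in enumerate(graph[a]):
                    if cap > 0 and dist[a] + c < dist[b]:
                        dist[b] = dist[a] + c
                        prev[b] = (a, i)
                        changed = True
            if not changed:
                break
        if dist["t"] == inf:
            break                                           # flow is maximum
        v = "t"
        while v != "s":                                     # push one unit
            a, i = prev[v]
            edge = graph[a][i]
            edge[1] -= 1
            graph[v][edge[3]][1] += 1
            v = a
        flow += 1
        cost += dist["t"]

    # A saturated forward edge x -> y means x is matched to y.
    pairs = [(x, e[0][1]) for x in xs for e in graph[("x", x)]
             if e[0] != "s" and e[1] == 0]
    return flow, -cost, pairs
```

Because augmentation continues until no s-t path remains, the flow value (matching size) is maximal, and the negated cost is the maximal weight among matchings of that size.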

From assignment to matching
[Figure: the children x_1, x_2 of u and the children y_1, y_2, y_3 of v form the two sides of a bipartite graph, with the source s and the sink t added]

Time complexity analysis
• Edmonds and Karp's algorithm: $O(EV \log V)$
• Fredman and Tarjan: $O(VE + V^2 \log V)$ (independent of the edge costs)
• Gabow and Tarjan: $O(\sqrt{V} E \log(VC))$, where the input costs are integers in the range $[-C, \ldots, C]$ (the similarity assumption)

Reminder…

What did we have so far?
• Motivation
• “Advanced” homeomorphism: labels and topology
• Scoring and deletion
• Dynamic programming
• Matching
• Questions?

The algorithm for rooted unordered trees:
• Input: rooted trees $T = (V_T, E_T, r)$ and $P = (V_P, E_P, r')$.
• Output: the root of the subtree t of T that has the highest similarity score to P (and is homeomorphic to P).

The main loop (dynamic programming):
for each node u of P in postorder do
    for each node v of T in postorder do
        if u is a leaf then
            if v is a leaf then
                RScore(v,u) = Δ[v,u]                  (node-to-node score)
            else
                RScore(v,u) = ComputeScores(v,u)
            end if
        else
            if Level(u) > Level(v) then
                RScore(v,u) = -∞                      (would require deleting from the pattern)
            else
                RScore(v,u) = ComputeScores(v,u)
            end if
        end if
    end for
end for

Procedure ComputeScores(v,u)
Let k denote the out-degree of node u and l the out-degree of node v.
if k > l then
    AssignmentScore(G) = -∞
else
    Construct a bipartite graph G with node bipartition X and Y such that:
        X is the set of children {x_1, …, x_k} of u,
        Y is the set of children {y_1, …, y_l} of v,
        node x_i ∈ X is connected to node y_j ∈ Y via an edge whose weight w(x_i, y_j) is set to RScore(y_j, x_i).
    AssignmentScore(G) = max over assignments M of ∑_{(i,j) ∈ M} RScore[y_j, x_i]
end if
Find, among all children of v, the node whose ALSH score with u is highest:
    BestChild(v,u) = max_{j = 1..l} RScore(y_j, u)
return max{Δ[v,u] + AssignmentScore(G), BestChild(v,u) + δ}     (δ is the deletion penalty)
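The main loop and ComputeScores can be combined into one small runnable sketch. Two simplifications are assumptions of this sketch and not part of the presented algorithm: the assignment step enumerates permutations instead of using min cost max flow, and the leaf and Level special cases are folded into the general formula (an empty assignment scores 0, an impossible one scores -∞). Trees are given as child-list dictionaries, and all names are hypothetical.

```python
from itertools import permutations

NEG_INF = float("-inf")


def postorder(children, root):
    """Nodes of a rooted tree (dict: node -> list of children), children first."""
    order, stack = [], [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            order.append(node)
        else:
            stack.append((node, True))
            stack.extend((child, False) for child in children.get(node, []))
    return order


def assignment_score(xs, ys, rscore):
    """Best weight when every pattern child in xs gets a distinct text child in ys."""
    if len(xs) > len(ys):
        return NEG_INF
    return max(sum(rscore[(y, x)] for x, y in zip(xs, chosen))
               for chosen in permutations(ys, len(xs)))


def rooted_alsh(text, text_root, pattern, pattern_root, delta, penalty):
    """Return (best text node, RScore table) for rooted unordered trees."""
    text_order = postorder(text, text_root)
    rscore = {}
    for u in postorder(pattern, pattern_root):        # pattern node
        xs = pattern.get(u, [])
        for v in text_order:                          # text node
            ys = text.get(v, [])
            match = delta[(v, u)] + assignment_score(xs, ys, rscore)
            # Delete v and hand u over to the best-scoring child of v.
            skip = max((rscore[(y, u)] for y in ys), default=NEG_INF) + penalty
            rscore[(v, u)] = max(match, skip)
    best = max(text_order, key=lambda v: rscore[(v, pattern_root)])
    return best, rscore


# Toy instance (assumed values): text chain w -> v -> u, pattern b -> a,
# deletion penalty -1. Note that "u" here is a text node name.
delta = {("u", "a"): 10, ("v", "a"): 5, ("w", "a"): 3,
         ("u", "b"): 0, ("v", "b"): -2, ("w", "b"): 3}
best, table = rooted_alsh({"w": ["v"], "v": ["u"]}, "w", {"b": ["a"]}, "b", delta, -1)
print(best, table[(best, "b")])  # prints: w 12
```

Processing both trees in postorder guarantees that every RScore entry needed for the children is already in the table when a pair (v, u) is evaluated.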

Time complexity analysis
Observation 1: $\sum_{u=1}^{m} c(u) = m - 1$ and $\sum_{v=1}^{n} c(v) = n - 1$, where $m$ is the number of vertices in the pattern and $n$ is the number of vertices in the text.
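Observation 1 holds because every node except the root is the child of exactly one node. A quick numeric check, assuming a tree stored as a child-list dictionary (the function name is hypothetical):

```python
def child_count_and_size(children, root):
    """Return (sum of c(u) over all nodes, number of nodes) for a rooted tree."""
    nodes = [root]
    for node in nodes:                 # the list grows while we walk it (BFS)
        nodes.extend(children.get(node, []))
    total_children = sum(len(children.get(node, [])) for node in nodes)
    return total_children, len(nodes)


total, size = child_count_and_size({"r": ["a", "b"], "a": ["c", "d", "e"]}, "r")
print(total, size)  # prints: 5 6
```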

Time complexity analysis
The weighted assignment is computed once for each pair $(u, v)$ with $u \in P$ and $v \in T$. The bipartite graph has $c(v) + c(u)$ nodes and $c(v) \cdot c(u)$ edges. Based on Fredman and Tarjan, the time complexity is:
$O\left(\sum_{u=1}^{m} \sum_{v=1}^{n} \left(c(u)^2 c(v) + c(u) c(v) \log c(v)\right)\right)$
$= O\left(\sum_{u=1}^{m} \left(c(u)^2 n + c(u) n \log n\right)\right)$ (Observation 1)
$= O(m^2 n + mn \log n)$ (Observation 1, using $\sum_u c(u)^2 \le \left(\sum_u c(u)\right)^2$)

Unrooted unordered trees:
• The problem: each vertex in both the text tree and the pattern tree can be the root.
• The naïve solution: choose an arbitrary node r of T to get a rooted tree $T^r$. Next, for each $u \in P$, compute the rooted ALSH between $P^u$ and $T^r$.
• Time complexity: $O(m^3 n + m^2 n \log n)$
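The naïve solution needs every rooting of the pattern. A minimal sketch of that step, assuming the unrooted tree is given as an adjacency dictionary (the function name `root_at` is hypothetical); each rooting would then be passed to the rooted procedure, which is not repeated here:

```python
def root_at(adjacency, root):
    """Turn an unrooted tree (dict: node -> neighbours) into child lists for `root`."""
    children = {node: [] for node in adjacency}
    stack = [(root, None)]
    while stack:
        node, parent = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour != parent:    # every neighbour except the parent is a child
                children[node].append(neighbour)
                stack.append((neighbour, node))
    return children


# Path a - b - c: the m = 3 rootings considered by the naive solution.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
rootings = {u: root_at(path, u) for u in path}
print(rootings["b"])  # prints: {'a': [], 'b': ['a', 'c'], 'c': []}
```

Running the rooted algorithm once per rooting is what multiplies the rooted bound by m.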

2nd try:
• Select an arbitrary node r in T as the root.
• For each internal node in T (in postorder) and for each node in P, compute an “improved” matching problem.
[Figure, shown in three steps: nodes u and v]


General idea for keeping the time complexity
• Find the best match between the neighbors $\{x_1, \ldots, x_{d(u)}\}$ of $u \in P$ and the children $\{y_1, \ldots, y_{c(v)}\}$ of $v \in T$.
• After computing the best match and removing a node $x_i$ (which acts as the parent of u), there is a way to find the optimal matching between $\{x_1, \ldots, x_{d(u)}\} \setminus \{x_i\}$ and $\{y_1, \ldots, y_{c(v)}\}$ in $O(d(u) c(v) + c(v) \log c(v))$.
• The total time complexity for computing all assignments between v and u: $O(d(u)^2 c(v) + d(u) c(v) \log c(v))$

Time complexity
Observation 2: the sum of vertex degrees in an unrooted tree P is $\sum_{u=1}^{m} d(u) = 2m - 2$. (We studied this in Combinatorics.)

Time complexity (continued)
$O\left(\sum_{u=1}^{m} \sum_{v=1}^{n} \left(d(u)^2 c(v) + d(u) c(v) \log c(v)\right)\right)$
$= O\left(\sum_{u=1}^{m} \left(d(u)^2 n + d(u) n \log n\right)\right)$ (Observation 1)
$= O(m^2 n + mn \log n)$ (Observation 2)

Up the tree…
For each vertex $v \in T$, $u \in P$ and $x_i \in neighbors(u)$, UScore[$v,u,x_i$] is the maximal LSH between the subtree $p_{u,x_i}$ of P and a corresponding homeomorphic subtree of $t_{v,r}$, if one exists. Otherwise, UScore[$v,u,x_i$] is set to -∞.
Here $p_{u,x_i}$ is the subtree of P whose root is u, where the root's parent is $x_i$.

UScore[$v,u,x_i$] is the maximum of two terms:
• The node-to-node similarity value Δ[v,u] plus the sum of the weights of the matched edges in the maximal assignment over $G_i$. This term is only computed if $d(u) - 1 \le c(v)$ (otherwise it is -∞).
• The weight UScore[$y_j,u,x_i$] for the comparison of u and the best-scoring child $y_j$ of v, updated with the penalty for deleting v.
Here d(u) is the degree of u.

And if ‘u’ is the root…
• We have to compute an additional entry UScore[v,u,Φ].
• This entry represents the fact that u might be the root of P.
• The root of P will be the node u for which UScore[v,u,Φ] is maximal.

Multi-source graphs
• DAG = Directed Acyclic Graph.
• A multi-source tree is a DAG whose underlying undirected structure is an unrooted, unordered tree.

Multi-source graph - example
[Figure: a pattern and a text with nodes r', r, u and v; in this example UScore[u,v,r'] = -∞]

Multi-source graphs & alignment
• We'll use the algorithm for unrooted unordered trees.
• We'll filter out subtree alignments that map together edges of conflicting direction.
• We'll split the bipartite graph G = (X ∪ Y, E) into two different graphs: one corresponds to matching the incoming-edge neighbors of u and v, and the other to matching their outgoing-edge neighbors.

ALSH for ordered rooted trees

Solving ALSH for ordered rooted trees
• This is a maximum weighted matching problem on ordered bipartite graphs, where no two edges are allowed to cross.
• Given a pattern string $X$, a source string $Y$, and a character-to-character similarity table $\Delta[\Sigma_X, \Sigma_Y]$, find among all $|X|$-sized subsequences of $Y$ the subsequence $Q$ that is most similar to $X$, that is, the one for which $\sum_{i=1}^{|X|} \Delta[Q_i, X_i]$ is maximized.
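The non-crossing matching described above has a standard quadratic dynamic program. The sketch below is an assumed formulation for illustration (names and the dictionary layout of Δ are hypothetical): row 0 is 0 because skipping source characters is free, and column 0 is -∞ for non-empty pattern prefixes because pattern characters cannot be dropped.

```python
def best_subsequence_score(x, y, delta):
    """Max of sum(delta[(Q_i, X_i)]) over all |x|-sized subsequences Q of y.

    delta is a dict keyed by (source character, pattern character).
    Returns -inf when y is shorter than x.
    """
    neg_inf = float("-inf")
    table = [[neg_inf] * (len(y) + 1) for _ in range(len(x) + 1)]
    table[0] = [0] * (len(y) + 1)          # empty pattern: nothing to match
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            skip = table[i][j - 1]                                   # y[j-1] unused
            match = table[i - 1][j - 1] + delta[(y[j - 1], x[i - 1])]
            table[i][j] = max(skip, match)
    return table[len(x)][len(y)]


# Toy similarity table (assumed values): 2 for equal characters, -1 otherwise.
toy = {(p, q): (2 if p == q else -1) for p in "abc" for q in "abc"}
print(best_subsequence_score("ab", "cacb", toy))  # prints 4
```

Each cell is filled in constant time, so one call costs O(|X| · |Y|) time.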

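String alignment
[Figure: dynamic programming table aligning the children x_1, x_2 of the pattern node with the children y_1, y_2, y_3 of the text node. Inner cells use Δ; boundary cells are initialized to -∞ and 0. Callouts: “We can't delete nodes from the pattern tree” (the -∞ cells) and “This is NOT the deletion penalty” (the 0 cells).]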

Time complexity for rooted ordered
For each node pair $(v \in T, u \in P)$, the time complexity of the assignment is $O(c(u) \cdot c(v))$ (dynamic programming).
$\sum_{u=1}^{m} \sum_{v=1}^{n} O(c(v) \cdot c(u)) = \sum_{v=1}^{n} O(m \cdot c(v)) = O(m \cdot n)$ (Observation 1)

The tool: MetaPathwayHunter

What can it do?
• A pathway against a pathway: the 5 best alignments.
• A pathway against a directory of pathways: the 5 best alignments for each pathway in the directory (sorted by score).

Two extreme cases of deletion penalty
Assuming the similarity score is non-positive (≤ 0):
• Deletion penalty 0: always worth deleting
• Deletion penalty -∞: never worth deleting

Deletion penalty 0: what does it mean?

Deletion penalty -∞: what does it mean?

About the similarity score
• MetaPathwayHunter uses the EC (Enzyme Commission) classification.
• Four sets of numbers that categorize the type of the catalyzed chemical reaction (e.g. ).
• For an enzyme class h, C(h) denotes the number of enzymes whose classes are included under h.
• For two enzymes $e_i$ and $e_j$, if their lowest common upper class is $h_{ij}$, then the similarity between them is $-\log_2 C(h_{ij})$.
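The scoring rule can be sketched as follows. This is an illustration under stated assumptions, not MetaPathwayHunter's code: EC numbers are treated as dot-separated strings, the lowest common upper class is taken to be their longest common prefix, and C(h) is counted over a small made-up list of enzymes that must contain both arguments.

```python
import math


def common_class(ec_a, ec_b):
    """Longest common prefix of two EC numbers, as a tuple of levels."""
    prefix = []
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b:
            break
        prefix.append(a)
    return tuple(prefix)


def similarity(ec_a, ec_b, all_enzymes):
    """-log2 C(h), where h is the lowest common upper class of the two enzymes."""
    h = common_class(ec_a, ec_b)
    count = sum(1 for ec in all_enzymes if tuple(ec.split(".")[:len(h)]) == h)
    return -math.log2(count)


# Toy universe of enzymes (assumed, for illustration only).
universe = ["1.1.1.1", "1.1.1.2", "1.1.2.1", "2.3.1.1"]
print(similarity("1.1.1.1", "1.1.1.2", universe))  # prints -1.0
```

Identical enzymes share a class containing a single entry, so they score 0, the maximum; the broader the common class, the more negative the score.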

Similarity score - example
Δ[ , ] = $-\log_2 C( )$ = $-\log_2(14)$ ≈ -3.81
Δ[ , ] = $-\log_2 C( )$ = $-\log_2(20)$ ≈ -4.32
(Callout: these are not enzymes)

Is the result statistically significant?
• Statistical significance is based on the p-value.
• The p-value of an alignment with score s is calculated by aligning the same query against 100 random pathway graphs and counting the fraction of graphs that contain an alignment with score s or higher.
• A random pathway is a graph containing the same set of nodes and the same number of edges for each node, with the nodes randomly shuffled.
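The p-value computation itself is a simple counting step once the random alignments have been scored. A minimal sketch, assuming the best alignment score against each random pathway is already available in a list (generating the random pathways is not shown, and the function name and numbers are hypothetical):

```python
def empirical_p_value(observed_score, random_scores):
    """Fraction of random pathways whose best alignment scores >= the observed one."""
    hits = sum(1 for score in random_scores if score >= observed_score)
    return hits / len(random_scores)


# Made-up scores for 100 random pathways: only 3 reach the observed score.
random_scores = [-10.0] * 97 + [-4.28, -3.0, 0.0]
print(empirical_p_value(-4.28, random_scores))  # prints 0.03
```

With 100 random graphs the smallest non-zero p-value this procedure can report is 0.01.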

Inter species alignment
• 113 E. coli pathways and 151 S. cerevisiae pathways.
• 610 pathway pairs had at least one statistically significant alignment between them.
• 63% of the E. coli pathways and 66% of the S. cerevisiae pathways had at least one statistically significantly aligned pair-mate from the other species.

Inter species alignment
E. coli & S. cerevisiae: phenylalanine, tyrosine and tryptophan pathway (score: -4.28), from [1]

Inter species alignment
• What is the single mismatch?
• In E. coli: the enzyme uses NAD+
• In S. cerevisiae: the enzyme uses NADP+
• These two enzymes don't have a significant sequence similarity.
• ⇒ Two functional orthologs.

A meta-pathway query
• E. coli allantoin degradation (score = 0)
• S. cerevisiae ureide degradation (score = 0)

Summary
• Biological motivation
• Homeomorphism
• Scoring and deleting
• From assignment to matching
• The algorithm for rooted unordered trees
• How to keep the time complexity for unrooted unordered trees

Summary
• How to deal with multi-source graphs
• The algorithm for rooted ordered trees
• MetaPathwayHunter and its properties
• Results of alignments

THE END