Approximate Labelled Subtree Homeomorphism. Based on: “Approximate Labelled Subtree Homeomorphism”, R. Y. Pinter, O. Rokhlenko, D. Tsur, M. Ziv-Ukelson; “Alignment of Metabolic Pathways”, R. Y. Pinter, O. Rokhlenko, E. Yeger-Lotem, M. Ziv-Ukelson
The general idea: state the biological problem; convert it into a computer science problem; find the solution; translate the solution back into biological terms.
Metabolism
[Figure: a signal transduction pathway; labels include Ag stimuli, Thnp, Th1, IL-12, IL-12R, Stat 4, T-Bet, IL-2, TNF, IFN and proliferation]
Why pathways? Metabolic and regulatory pathways have biological importance. These pathways are evolutionarily conserved.
What do we want to do? Compare a metabolic pathway of a certain organism against the same metabolic pathway in other organisms. Compare a metabolic pathway against other metabolic pathways in the same organism.
How do we do (it)?
The subtree homeomorphism problem: Given a pattern tree P and a text tree T, find a subtree of T that becomes isomorphic to P after some of its degree-2 nodes are deleted, or decide that there is no such subtree. Only degree-2 nodes of the text tree can be deleted.
Graph homeomorphism [Figure: a text graph and a pattern graph with colored nodes; the comparison looks at both the labels (similarity) and the topology]
Back to 2nd semester… An unrooted tree is an undirected, acyclic, connected graph $T = (V_T, E_T)$. A rooted tree is a triple $T_r = (V_T, E_T, r)$, where $(V_T, E_T)$ is an unrooted tree and r is some vertex in $V_T$, called the root. The root of the tree implies a direction for all the edges in the graph. A multi-source tree is an acyclic, directed graph whose underlying undirected graph is a tree.
Back to 2nd semester… A tree is said to be ordered if the relative order of the subtrees at each node is fixed. Otherwise the tree is unordered.
What are we allowed to do? We take into account both label similarity and topology. We are permitted to delete vertices from the text tree. We are NOT permitted to delete vertices from the pattern tree.
What we “gonna” see today:
Rooted unordered: $O(m^2 n + mn \log n)$
Unrooted unordered: $O(m^2 n + mn \log n)$
Directed multi-source unordered: $O(m^2 n + mn \log n)$
Rooted ordered: $O(mn)$
Some definitions: Let Δ denote a predefined node-to-node similarity score table. Let D denote a predefined score for deleting a node from a tree (usually a penalty). A mapping M from $T_1$ to $T_2$ is a partial one-to-one map from the nodes of $T_1$ to the nodes of $T_2$ that preserves the ancestor relations of the nodes.
Our problem: Let M be a mapping from $T_1$ to $T_2$. The Labelled Subtree Homeomorphic Similarity Score of $M[T_1,T_2]$ is: $\mathrm{LSH}(M[T_1,T_2]) = D \cdot (|T_1| - |T_2|) + \sum_{(u,v) \in M} \Delta[u,v]$. Given two undirected labeled trees P and T, we want to find a mapping M and a subtree t of T such that $\mathrm{LSH}(M[t,P])$ is maximal.
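The score definition above can be evaluated directly for a given mapping. The following is a minimal sketch, not the authors' code: the function name `lsh_score`, the dictionary layout of Δ and the example values are assumptions made for illustration. The numbers are taken from the small RScore example later in the talk (text chain w, v, u; pattern b with child a; deletion score -1).

```python
def lsh_score(mapping, size_t1, size_t2, delta, deletion_score):
    """LSH(M[T1,T2]) = D * (|T1| - |T2|) + sum of delta[u, v] over (u, v) in M.

    mapping: list of (node of T1, node of T2) pairs.
    delta: dict keyed by (node of T1, node of T2); an assumed layout.
    """
    deleted_nodes = size_t1 - size_t2          # nodes of T1 left unmapped
    similarity = sum(delta[(u, v)] for u, v in mapping)
    return deletion_score * deleted_nodes + similarity


# Assumed example: t is the chain w - v - u (3 nodes), P is b - a (2 nodes).
delta = {("w", "b"): 3, ("u", "a"): 10}
mapping = [("w", "b"), ("u", "a")]             # v is deleted from the text
score = lsh_score(mapping, size_t1=3, size_t2=2, delta=delta, deletion_score=-1)
print(score)
```

One deleted node costs -1 and the two mapped pairs contribute 3 + 10, so the mapping scores 12.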
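Scoring [Figure: a text tree and a pattern tree with three candidate alignments, scored 2, 2 and 5]
Dynamic programming [Figure: text node v with children y1, y2, y3 and pattern node u with children x1, x2. The score table has one row per text node (y1, y2, y3, …, v) and one column per pattern node (x1, x2, …, u); entry w_ij is the score of text node i against pattern node j]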
RScore[v,u] is the maximum of two terms: (1) The node-to-node similarity value Δ[v,u] plus the sum of the weights of the matched edges in the maximum assignment over G. This term is only computed if c(u) ≤ c(v) (otherwise it is -∞). (2) The score RScore[y_i,u] for the comparison of u with the best-scoring child y_i of v, updated with the penalty for deleting v. Here c(u) is the number of children of u.
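RScore[v,u] - example (deletion = -1). Pattern: node b with child a. Text: the chain w, v, u (w is the root, u is the leaf).
Score matrix Δ (rows: text nodes; columns: pattern nodes a, b): u: 10, (not shown); v: 5, -2; w: 3, 3.
RScore (same layout): u: 10, -∞; v: 9 (by deletion), 8; w: 8 (by deletion), 12.
RScore[v,a] = Max {5, 10-1} = 9; RScore[w,a] = Max {3, 9-1} = 8.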
The assignment problem: Let G = (V = X ∪ Y, E) be a bipartite graph with a weight w(x,y) on every edge. The assignment problem is to compute a matching M (a list of monogamous pairs) such that: the size of M is maximal among all matchings, and among all matchings of that size, the sum of the weights is maximal.
Solving the assignment problem: Reduction from the assignment problem to the min cost max flow problem. We construct G’, which contains G(V,E) with the following changes: two more vertices, s and t; edges from s to every node of X and from every node of Y to t, with w(s,x) = 0 and w(y,t) = 0; the cost of every other edge in E is -w(x,y); the capacity of all edges is 1. Min cost max flow: among all the maximum flows, we choose the cheapest.
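The reduction can be sketched in a few lines. This is an illustrative implementation only: it uses successive cheapest augmenting paths found with Bellman-Ford, which is simpler but slower than the algorithms whose bounds are quoted in the talk. The function name `assignment_via_min_cost_flow` and the matrix input format (with `None` for a missing edge) are assumptions.

```python
from math import inf


def assignment_via_min_cost_flow(weights):
    """Maximum-size, then maximum-weight matching between X and Y.

    weights[i][j] is w(x_i, y_j), or None when there is no edge.
    Returns (matching size, total weight, list of (i, j) pairs).
    """
    nx = len(weights)
    ny = len(weights[0]) if nx else 0
    s, t, n = nx + ny, nx + ny + 1, nx + ny + 2
    # Residual graph: each edge is [target, capacity, cost, index of reverse edge].
    graph = [[] for _ in range(n)]

    def add_edge(a, b, cost):
        graph[a].append([b, 1, cost, len(graph[b])])
        graph[b].append([a, 0, -cost, len(graph[a]) - 1])

    for i in range(nx):
        add_edge(s, i, 0)                        # s -> x_i, cost 0
    for j in range(ny):
        add_edge(nx + j, t, 0)                   # y_j -> t, cost 0
    for i in range(nx):
        for j in range(ny):
            if weights[i][j] is not None:
                add_edge(i, nx + j, -weights[i][j])  # cost is -w(x_i, y_j)

    size, total_cost = 0, 0
    while True:
        # Bellman-Ford: cheapest s-t path in the residual graph (costs may be negative).
        dist, prev = [inf] * n, [None] * n
        dist[s] = 0
        for _ in range(n - 1):
            changed = False
            for a in range(n):
                if dist[a] == inf:
                    continue
                for k, (b, cap, c, _rev) in enumerate(graph[a]):
                    if cap > 0 and dist[a] + c < dist[b]:
                        dist[b], prev[b], changed = dist[a] + c, (a, k), True
            if not changed:
                break
        if dist[t] == inf:
            break                                # no augmenting path: flow is maximum
        node = t
        while node != s:                         # push one unit of flow along the path
            a, k = prev[node]
            graph[a][k][1] -= 1
            graph[node][graph[a][k][3]][1] += 1
            node = a
        size, total_cost = size + 1, total_cost + dist[t]

    # Saturated x -> y edges are the matched pairs.
    pairs = [(i, e[0] - nx) for i in range(nx) for e in graph[i]
             if nx <= e[0] < nx + ny and e[1] == 0]
    return size, -total_cost, pairs


print(assignment_via_min_cost_flow([[5, 4], [4, 1]]))
```

In the example a greedy choice of the heaviest edge (weight 5) would lead to a total of 6, while the cheapest maximum flow matches x_0 with y_1 and x_1 with y_0 for a total of 8.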
From assignment to matching [Figure: the children x1, x2 of u and the children y1, y2, y3 of v form a bipartite graph, extended with a source s and a sink t]
Time complexity analysis: Edmonds and Karp’s algorithm: $O(EV \log V)$. Fredman and Tarjan: $O(VE + V^2 \log V)$ (independent of the edge costs). Gabow and Tarjan: $O(\sqrt{V} E \log(VC))$, where the input costs are integers in the range [-C,…,C] (the similarity assumption).
Reminder…
What did we have so far? Motivation; “advanced” homeomorphism: labels and topology; scoring and deletion; dynamic programming; matching. Questions?
The algorithm for rooted unordered trees: Input: rooted trees $T = (V_T, E_T, r)$ and $P = (V_P, E_P, r')$. Output: the root of the subtree t of T that is homeomorphic to P and has the highest similarity score to P.
Dynamic programming:
for each node u of P in postorder do
  for each node v of T in postorder do
    if u is a leaf then
      if v is a leaf then
        RScore(v,u) = Δ[v,u]    (node-to-node score)
      else
        RScore(v,u) = ComputeScores(v,u)
      end if
    else
      if Level(u) > Level(v) then
        RScore(v,u) = -∞    (this would mean deleting from the pattern)
      else
        RScore(v,u) = ComputeScores(v,u)
      end if
    end if
  end for
end for
Procedure ComputeScores(v,u):
Let k denote the out-degree of node u and l denote the out-degree of node v.
if k > l then
  AssignmentScore(G) = -∞
else
  Construct a bipartite graph G with node bipartition X and Y such that:
    X is the set of children {x_1, …, x_k} of u,
    Y is the set of children {y_1, …, y_l} of v,
    node x_i ∈ X is connected to node y_j ∈ Y via an edge whose weight w(x_i,y_j) is set to RScore(y_j,x_i).
  AssignmentScore(G) = max over matchings M of ∑_{(i,j) ∈ M} RScore(y_j,x_i)
end if
Find, among all children of v, the score BestChild(v,u) of the child whose ALSH score with u is highest:
  BestChild(v,u) = max_{j=1..l} RScore(y_j,u)
return max {Δ[v,u] + AssignmentScore(G), BestChild(v,u) + δ}    (δ is the deletion penalty)
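The two loops and ComputeScores can be put together in a short runnable sketch. It is an illustration under stated assumptions, not the authors' implementation: trees are given as child-list dictionaries, Δ is a dictionary keyed by (text node, pattern node), and the assignment step is solved by brute force over permutations as a stand-in for the min cost max flow solver, which is only practical for small degrees. The Level test is omitted because the assignment already returns -∞ whenever the pattern node has more children than the text node. The data follow the small RScore example with deletion = -1; Δ[u,b] is never used by the recurrence and is set to 0 as a placeholder.

```python
from itertools import permutations

NEG_INF = float("-inf")


def postorder(children, root):
    """Nodes of a rooted tree, children before parents."""
    order, stack = [], [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            order.append(node)
        else:
            stack.append((node, True))
            stack.extend((child, False) for child in children.get(node, []))
    return order


def assignment_score(xs, ys, rscore):
    """Best total score matching every pattern child x to a distinct text child y."""
    if len(xs) > len(ys):
        return NEG_INF                            # pattern nodes cannot be deleted
    return max(sum(rscore[(y, x)] for x, y in zip(xs, chosen))
               for chosen in permutations(ys, len(xs)))


def rooted_alsh(t_children, t_root, p_children, p_root, delta, penalty):
    rscore = {}
    text_nodes = postorder(t_children, t_root)
    for u in postorder(p_children, p_root):
        xs = p_children.get(u, [])
        for v in text_nodes:
            ys = t_children.get(v, [])
            match_here = delta[(v, u)] + assignment_score(xs, ys, rscore)
            best_child = max((rscore[(y, u)] for y in ys), default=NEG_INF)
            rscore[(v, u)] = max(match_here, best_child + penalty)
    best_root = max(text_nodes, key=lambda v: rscore[(v, p_root)])
    return best_root, rscore


# Text: chain w -> v -> u. Pattern: b -> a.
t_children = {"w": ["v"], "v": ["u"]}
p_children = {"b": ["a"]}
delta = {("u", "a"): 10, ("u", "b"): 0,           # ("u", "b") is a placeholder
         ("v", "a"): 5, ("v", "b"): -2,
         ("w", "a"): 3, ("w", "b"): 3}
root, rscore = rooted_alsh(t_children, "w", p_children, "b", delta, penalty=-1)
print(root, rscore[(root, "b")])
```

The best subtree is rooted at w with score 12: w is matched to b, v is deleted at a cost of 1, and u is matched to a.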
Time complexity analysis: Observation 1: $\sum_{u=1}^{m} c(u) = m-1$ and $\sum_{v=1}^{n} c(v) = n-1$, where m is the number of vertices in the pattern and n is the number of vertices in the text.
Time complexity analysis: The weighted assignment is computed once for each pair u, v with u ∈ P, v ∈ T. The bipartite graph has c(v)+c(u) nodes and c(v)c(u) edges. Based on Fredman and Tarjan, the time complexity is: $O\left(\sum_{u=1}^{m} \sum_{v=1}^{n} \left(c(u)^2 c(v) + c(u) c(v) \log c(v)\right)\right) = O\left(\sum_{u=1}^{m} \left(c(u)^2 n + c(u) n \log n\right)\right) = O(m^2 n + mn \log n)$, where each equality uses Observation 1.
Unrooted unordered trees: The problem: each vertex in both the text tree and the pattern tree can be the root. The naïve solution: choose an arbitrary node r of T to get a rooted tree. Next, for each u ∈ P compute the rooted ALSH between $P_u$ and $T_r$. Time complexity: $O(m^3 n + m^2 n \log n)$
2nd try: Select an arbitrary node r in T as the root. For each internal node of T (in postorder) and for each node of P, compute an “improved” matching problem.
General idea for keeping the time complexity: Find the best match between the children {y_1,…,y_{c(v)}} of v ∈ T and the neighbors {x_1,…,x_{d(u)}} of u ∈ P. After computing the best match and removing a node x_i (which acts as the parent of u), there is a way to find the optimal matching between the children of v and {x_1,…,x_{d(u)}} \ {x_i} in $O(d(u) c(v) + c(v) \log c(v))$. The total time complexity for computing all assignments between v and u: $O(d(u)^2 c(v) + d(u) c(v) \log c(v))$
Time complexity: Observation 2: The sum of the vertex degrees in an unrooted tree P is $\sum_{u=1}^{m} d(u) = 2m-2$. We studied this in Combinatorics.
Time complexity, continued: $O\left(\sum_{u=1}^{m} \sum_{v=1}^{n} \left(d(u)^2 c(v) + d(u) c(v) \log c(v)\right)\right) = O\left(\sum_{u=1}^{m} \left(d(u)^2 n + d(u) n \log n\right)\right)$ (by Observation 1) $= O(m^2 n + mn \log n)$ (by Observation 2)
Up the tree… For each vertex v ∈ T, u ∈ P and x_i ∈ neighbors(u), UScore[v,u,x_i] is the maximal LSH between the subtree $p_{u,x_i}$ of P and a corresponding homeomorphic subtree of $t_{v,r}$, if one exists. Otherwise, UScore[v,u,x_i] is set to -∞. Here $p_{u,x_i}$ is the subtree of P whose root is u when the root’s parent is x_i.
UScore[v,u,x_i] is the maximum of two terms: (1) The node-to-node similarity value Δ[v,u] plus the sum of the weights of the matched edges in the maximum assignment over G_i. This term is only computed if d(u) - 1 ≤ c(v) (otherwise it is -∞). (2) The score UScore[y_j,u,x_i] for the comparison of u with the best-scoring child y_j of v, updated with the penalty for deleting v. Here d(u) is the degree of u.
And if ‘u’ is the root… We have to compute an additional entry UScore[v,u,Φ]. This entry represents the fact that u might be the root of P. The root of P will be the node u for which UScore[v,u,Φ] is maximal.
Multi-source graphs: DAG = Directed Acyclic Graph. A multi-source tree is a DAG whose underlying structure is an unrooted, unordered tree.
Multi-source graph - example [Figure: a pattern (root r’, node u) and a text (root r, node v) for which UScore[u,v,r’] = -∞]
Multi-source graphs & alignment: We’ll use the algorithm for unrooted unordered trees. We’ll filter out subtree alignments that map together edges of conflicting direction. We’ll split the bipartite graph G = (X ∪ Y, E) into two different graphs: one corresponds to matching the incoming-edge neighbors of u and v, and the other to matching their outgoing-edge neighbors.
ALSH for ordered rooted trees
Solving ALSH for ordered rooted trees: Maximum weighted matching problem on ordered bipartite graphs, where no edges are allowed to cross. Given a pattern string X, a source Y, and a character-to-character similarity table $\Delta[\Sigma_X, \Sigma_Y]$, find among all |X|-sized subsequences of Y the subsequence Q which is most similar to X, that is, the sum $\sum_{i=1}^{|X|} \Delta[Q_i, X_i]$ is maximized.
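The non-crossing matching can be computed with a standard string-alignment style table. The sketch below is an assumption about how such a table could be filled, with hypothetical names (`best_subsequence_score`, a callable `delta`): skipping a character of Y costs nothing, while leaving a character of X unmatched is impossible and is encoded as -∞.

```python
NEG_INF = float("-inf")


def best_subsequence_score(x, y, delta):
    """Max over |x|-sized subsequences q of y of sum(delta(q[i], x[i])).

    s[i][j] is the best score for matching all of x[:i] inside y[:j].
    """
    k, l = len(x), len(y)
    # Row 0: the empty pattern scores 0. Other rows start at -infinity,
    # because every pattern character must be matched.
    s = [[0] * (l + 1)] + [[NEG_INF] * (l + 1) for _ in range(k)]
    for i in range(1, k + 1):
        for j in range(i, l + 1):
            skip_y = s[i][j - 1]                                   # y[j-1] unused
            match = s[i - 1][j - 1] + delta(y[j - 1], x[i - 1])    # y[j-1] ~ x[i-1]
            s[i][j] = max(skip_y, match)
    return s[k][l]


# Hypothetical similarity table: +1 for equal characters, -1 otherwise.
def toy_delta(a, b):
    return 1 if a == b else -1


print(best_subsequence_score("ac", "abc", toy_delta))
```

The table has (|X|+1)(|Y|+1) cells and each cell takes constant time, which matches the O(c(u) c(v)) cost per node pair used for the rooted ordered case.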
String alignment [Figure: the children y1, y2, y3 of v (l children) and x1, x2 of u (k children), with a dynamic programming table of size (k+1) × (l+1) filled from ∆. The -∞ entries: we can’t delete nodes from the pattern tree. The 0 entries: this is NOT the deletion penalty]
Time complexity for rooted ordered: For each node pair (v ∈ T, u ∈ P), the time complexity of the assignment is $O(c(u) c(v))$ (dynamic programming). $\sum_{u=1}^{m} \sum_{v=1}^{n} O(c(v) c(u)) = \sum_{v=1}^{n} O(m \, c(v)) = O(mn)$, where each equality uses Observation 1.
The tool: MetaPathwayHunter
What can it do? A pathway against a pathway: the 5 best alignments. A pathway against a directory of pathways: the 5 best alignments for each pathway in the directory (sorted by score).
Two extreme cases of deletion penalty: Assuming the similarity score is non-positive (≤ 0). Deletion penalty 0: always worth deleting. Deletion penalty -∞: never worth deleting.
Deletion penalty 0 What does it mean?
Deletion penalty -∞ What does it mean?
About the similarity score: MetaPathwayHunter uses the EC (Enzyme Commission) classification: four sets of numbers that categorize the type of the catalyzed chemical reaction (e.g ). For an enzyme class h, C(h) denotes the number of enzymes whose classes are included under h. For two enzymes $e_i$ and $e_j$, if their lowest common upper class is $h_{ij}$, then the similarity between them is $-\log_2 C(h_{ij})$.
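A small sketch shows how such a score could be computed from EC numbers. Everything concrete here is an assumption for illustration: the function names, the use of the empty string for the top of the hierarchy, and above all the class sizes, which are made-up numbers and not real EC statistics.

```python
import math


def lowest_common_class(ec_a, ec_b):
    """Longest shared prefix of two dotted EC numbers ('' means no shared class)."""
    prefix = []
    for part_a, part_b in zip(ec_a.split("."), ec_b.split(".")):
        if part_a != part_b:
            break
        prefix.append(part_a)
    return ".".join(prefix)


def ec_similarity(ec_a, ec_b, class_sizes):
    """Similarity -log2 C(h), where h is the lowest common upper class."""
    h = lowest_common_class(ec_a, ec_b)
    return -math.log2(class_sizes[h])


# Hypothetical class sizes C(h); '' stands for the whole hierarchy.
class_sizes = {"": 64, "1": 16, "1.1": 8, "1.1.1": 4,
               "1.1.1.1": 1, "1.1.1.2": 1}

print(ec_similarity("1.1.1.1", "1.1.1.2", class_sizes))
print(ec_similarity("1.1.1.1", "2.3.4.5", class_sizes))
```

With these made-up sizes, two enzymes that share the class 1.1.1 score -log2(4) = -2, two unrelated enzymes score -log2(64) = -6, and an enzyme compared with itself scores 0, so the score is never positive and identical enzymes are the best possible match.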
Similarity score - example: $\Delta[\ ,\ ] = -\log_2 C(\ ) = -\log_2(14) \approx -3.81$; $\Delta[\ ,\ ] = -\log_2 C(\ ) = -\log_2(20) \approx -4.32$. These are not enzymes.
Is the result statistically significant? Statistical significance is based on the p-value. The p-value of an alignment (with score s) is calculated by aligning the same query against 100 random pathway graphs and counting the fraction of graphs containing an alignment that receives score s or higher. A random pathway is a graph containing the same set of nodes and the same number of edges for each node, with random switch of the nodes.
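The counting step is easy to express in code. The sketch below covers only that step and assumes the best alignment score against each random pathway graph has already been computed; the function name `empirical_p_value` and the sample scores are hypothetical.

```python
def empirical_p_value(score, random_scores):
    """Fraction of random pathway graphs whose best alignment scores >= score."""
    at_least_as_good = sum(1 for r in random_scores if r >= score)
    return at_least_as_good / len(random_scores)


# Hypothetical best scores of the same query against four random graphs.
random_scores = [-10.0, -5.0, -4.0, -3.0]
print(empirical_p_value(-4.0, random_scores))
```

With 100 random graphs, as in the talk, the smallest nonzero p-value this estimate can report is 0.01.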
Inter-species alignment: 113 E. coli pathways and 151 S. cerevisiae pathways. 610 pathway pairs had at least one statistically significant alignment between them. 63% of the E. coli pathways and 66% of the S. cerevisiae pathways had at least one statistically significantly aligned pair-mate from the other species.
Inter-species alignment: E. coli & S. cerevisiae: phenylalanine, tyrosine and tryptophan pathway (score: -4.28), from [1]
Inter-species alignment: What is the single mismatch? In E. coli the enzyme uses NAD+; in S. cerevisiae the enzyme uses NADP+. These two enzymes do not have a significant sequence similarity. == Two functional orthologs.
A meta-pathway query: E. coli allantoin degradation (score = 0); S. cerevisiae ureide degradation (score = 0)
Summary: biological motivation; homeomorphism; scoring and deleting; from assignment to matching; the algorithm for rooted unordered trees; how to keep the time complexity for unrooted unordered trees
Summary: how to deal with multi-source graphs; the algorithm for rooted ordered trees; MetaPathwayHunter and its properties; results of alignments
THE END