Generalized Tree Alignment: The Deferred Path Heuristic. Stinus Lindgreen


Slide 1. Generalized Tree Alignment: The Deferred Path Heuristic
Stinus Lindgreen, stinus@diku.dk

Slide 2. Overview:
- What is a phylogeny?
- The Generalized Tree Alignment problem
- Sequence graphs and their algorithms
- The Deferred Path Heuristic

Slide 3. Phylogeny:
Describes an evolutionary model
- Common ancestor
- Mutations happen all the time
- Insertions, deletions, substitutions, translocations, inversions, duplications, …
Most mutations happen during DNA replication
- Usually corrected by cell mechanisms
Mutations accumulate → new species diverge
Only mutations in sex cells are inherited (obviously)

Slide 4. Phylogeny:
Phylogenetic inference: Given n sequences, build a phylogenetic tree T
- Most methods base T on a multiple alignment
- Likewise, multiple alignments are often based on guide trees
Can we solve both problems at the same time?

Slide 5. Phylogeny: Describes the evolutionary relationships between species. Notice the root.

Slide 6. Phylogeny: ... or within a single taxon (here, human enterovirus 71)

Slide 7. The Problem:
Given n sequences s_1, …, s_n …
- Multiple Alignment: Build an alignment A of the sequences by inserting gaps such that homologous bases are placed in the same column
- Phylogenetic Inference: Build a (binary) tree T with s_1, …, s_n at the leaves and possible ancestors s_{n+1}, …, s_{n+k} at the internal nodes, describing their evolutionary relationships

Slide 8. Generalized Tree Alignment:
Combines the two. The problem we want to solve is:
- Given: A set of n sequences s_1, …, s_n from n different species (could be DNA, RNA or protein; for simplicity we focus on DNA)
- Problem: Generate an unrooted phylogenetic tree T with the sequences s_1, …, s_n at the leaves and a multiple alignment A of these sequences
Placing the root is not trivial and is best left to biologists.

Slide 9. The given problem has been proven MAX SNP-hard (Wang and Jiang, 1994) → no polynomial-time approximation scheme exists unless P = NP
Exact solutions to NP-hard problems are intractable in practice → the best we can hope for is a heuristic
The given algorithm runs in time O(n^2 · l^n)
- n: the number of sequences
- l: their maximum length

Slide 10. Sequence graphs (Hein, 1989): Recall pairwise alignment. The traceback "spells" the possible optimal alignments:

Slide 11. Sequence graphs:
Make a graph with alignment columns as edge labels → it represents all optimal alignments
We will get back to that shortly …
Right now, we want to represent sequences, so let us introduce sequence graphs. For instance, s = ACTGTA is represented by:

Slide 12. Sequence graphs:
More formally: a directed, acyclic graph
- Edge labels l from an alphabet Σ. Here, Σ = {A, C, G, T, -}
- Source s: the unique node with no incoming edges
- Sink t: the unique node with no outgoing edges
- Each path from s to t spells a sequence

Slide 13. Sequence graphs: A sequence graph represents the set of sequences spelled by all paths from s to t:

Slide 14. Sequence graphs:
- Any single sequence can be represented by a linear sequence graph
- Any set of k sequences can be represented by making k paths from s to t
- A given sequence s' can be represented by more than one path
We can now represent sequences, but can we align them?
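Before turning to alignment, the representation can be made concrete. The following is a minimal sketch, not the author's implementation: the class name `SequenceGraph`, the node names "s" and "t", and the methods `from_sequences` and `spelled` are all assumptions chosen for illustration. It builds one path per sequence and enumerates the sequences spelled by all source-to-sink paths. Empty sequences are not handled.

```python
from collections import defaultdict


class SequenceGraph:
    """Hypothetical DAG with labeled edges; each s-to-t path spells a sequence."""

    def __init__(self, source="s", sink="t"):
        self.source = source
        self.sink = sink
        self.edges = defaultdict(list)  # node -> list of (label, next_node)

    def add_edge(self, u, label, v):
        self.edges[u].append((label, v))

    @classmethod
    def from_sequences(cls, sequences):
        """Represent k non-empty sequences by k separate paths from s to t."""
        graph = cls()
        for k, seq in enumerate(sequences):
            prev = graph.source
            for pos, letter in enumerate(seq):
                is_last = pos == len(seq) - 1
                nxt = graph.sink if is_last else (k, pos)
                graph.add_edge(prev, letter, nxt)
                prev = nxt
        return graph

    def spelled(self):
        """Return the set of sequences spelled by all source-to-sink paths."""
        result = set()
        stack = [(self.source, "")]
        while stack:
            node, prefix = stack.pop()
            if node == self.sink:
                result.add(prefix)
                continue
            for label, nxt in self.edges.get(node, []):
                stack.append((nxt, prefix + label))
        return result


# A linear graph for a single sequence.
linear = SequenceGraph.from_sequences(["ACTGTA"])

# A shared graph: two parallel edges (C and T) between the same nodes.
shared = SequenceGraph()
shared.add_edge("s", "A", 1)
shared.add_edge(1, "C", 2)
shared.add_edge(1, "T", 2)
shared.add_edge(2, "G", "t")
```

Here `shared` spells both ACG and ATG with four edges, which is the compactness the slides rely on: sequences that agree in most positions share most of their path.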

Slide 15. Aligning sequence graphs:
Dynamic programming algorithm inspired by basic pairwise alignment
- Pairwise alignment: Given two sequences p and q, move one letter in p and move through q, finding the optimal "partial alignments"
- Sequence graphs: Given two sequence graphs G_1 and G_2, there can be many outgoing edges to choose from at each node

Slide 16. Aligning sequence graphs:
Fill in a |V_1| × |V_2| score matrix. For each pair of nodes, i from G_1 and j from G_2, should we:
- Align the two characters we got by following e_1 into i and e_2 into j?
- Stay in G_1 and only move in G_2?
- Stay in G_2 and only move in G_1?
- Or have we already found a better path into i and j?
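The matrix fill described here can be sketched as follows. This is an illustration under stated assumptions, not the presented implementation: nodes are integers numbered in topological order (source 0, sink n - 1, every edge goes from a lower to a higher number), costs are unit costs (match 0, mismatch 1, gap 1), and the names `align_sequence_graphs` and `linear_graph` are hypothetical.

```python
import math


def linear_graph(seq):
    """Linear sequence graph for one sequence: (node count, edge list)."""
    return len(seq) + 1, [(k, letter, k + 1) for k, letter in enumerate(seq)]


def align_sequence_graphs(n1, edges1, n2, edges2, mismatch=1, gap=1):
    """Fill the |V_1| x |V_2| matrix; D[i][j] is the cheapest way to reach (i, j)."""
    incoming1 = [[] for _ in range(n1)]
    for u, label, v in edges1:
        incoming1[v].append((u, label))
    incoming2 = [[] for _ in range(n2)]
    for u, label, v in edges2:
        incoming2[v].append((u, label))

    D = [[math.inf] * n2 for _ in range(n1)]
    D[0][0] = 0  # both graphs at their source
    for i in range(n1):
        for j in range(n2):
            best = D[i][j]
            # Align the characters on an edge e_1 into i and an edge e_2 into j.
            for u, a in incoming1[i]:
                for v, b in incoming2[j]:
                    best = min(best, D[u][v] + (0 if a == b else mismatch))
            # Move only in G_1 (gap in G_2).
            for u, _ in incoming1[i]:
                best = min(best, D[u][j] + gap)
            # Move only in G_2 (gap in G_1).
            for v, _ in incoming2[j]:
                best = min(best, D[i][v] + gap)
            D[i][j] = best
    return D


# A graph spelling {ACG, ATG}, aligned against the single sequence AGG.
branching = (4, [(0, "A", 1), (1, "C", 2), (1, "T", 2), (2, "G", 3)])
cost = align_sequence_graphs(*branching, *linear_graph("AGG"))[-1][-1]
```

For two linear graphs this reduces to ordinary pairwise alignment. The only difference in the general case is that each cell looks at every incoming edge rather than at one fixed predecessor per direction.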

Slide 17. Optimal Alignment Graphs:
Now we need a way to remember the optimal alignments. Recall the graphs from before:
- Directed, acyclic graphs
- Nodes s and t defined as before
- Edge labels of the form [l_a, l_b] where l_a, l_b ∈ Σ
Backtrack through the matrix and consider each possible combination of edges.
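For the simplest case, two plain sequences, the backtracking step can be sketched like this. The function name `optimal_alignment_graph`, the unit costs, and the use of matrix cells (i, j) as graph nodes are assumptions made for illustration. Every edge that lies on some optimal traceback path is kept, labeled with the alignment column it contributes.

```python
def optimal_alignment_graph(p, q, mismatch=1, gap=1):
    """Return (optimal cost, set of edges (from_cell, (l_a, l_b), to_cell))."""
    n, m = len(p), len(q)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if p[i - 1] == q[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub, D[i - 1][j] + gap, D[i][j - 1] + gap)

    # Backtrack from the sink cell (n, m), keeping every optimal predecessor.
    edges = set()
    seen = {(n, m)}
    stack = [(n, m)]
    while stack:
        i, j = stack.pop()
        preds = []
        if i > 0 and j > 0:
            sub = 0 if p[i - 1] == q[j - 1] else mismatch
            if D[i][j] == D[i - 1][j - 1] + sub:
                preds.append(((i - 1, j - 1), (p[i - 1], q[j - 1])))
        if i > 0 and D[i][j] == D[i - 1][j] + gap:
            preds.append(((i - 1, j), (p[i - 1], "-")))
        if j > 0 and D[i][j] == D[i][j - 1] + gap:
            preds.append(((i, j - 1), ("-", q[j - 1])))
        for cell, label in preds:
            edges.add((cell, label, (i, j)))
            if cell not in seen:
                seen.add(cell)
                stack.append(cell)
    return D[n][m], edges


cost, oag = optimal_alignment_graph("AC", "A")
```

In this sketch the source is cell (0, 0) and the sink is cell (n, m), so every path between them spells one optimal alignment column by column.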

Slide 18. Optimal Alignment Graphs:
An example of an optimal alignment graph (OAG):
This one represents the alignments:
We denote such a graph A*
We have to convert the OAGs back to sequence graphs (SGs)

Slide 19. Optimal Alignment Graphs:
The conversion from OAG to SG is done easily by considering the edge labels [l_a, l_b]:
- If l_a = l_b: make a single edge in the SG with label l_a
- If l_a ≠ l_b: make two edges in the SG, one with label l_a and one with label l_b
The graph from before turns into the SG:
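The conversion rule itself is small enough to write out. A minimal sketch, assuming edges are stored as (from_node, label, to_node) tuples and using the hypothetical name `oag_to_sequence_graph`; gap labels are kept as ordinary labels, since the alphabet on the slides includes "-".

```python
def oag_to_sequence_graph(oag_edges):
    """Turn OAG edges (u, (l_a, l_b), v) into SG edges (u, label, v)."""
    sg_edges = []
    for u, (l_a, l_b), v in oag_edges:
        # Equal labels collapse to one edge; different labels give two parallel edges.
        labels = [l_a] if l_a == l_b else [l_a, l_b]
        for label in labels:
            edge = (u, label, v)
            if edge not in sg_edges:  # avoid duplicate parallel edges
                sg_edges.append(edge)
    return sg_edges


example_oag = [(0, ("A", "A"), 1), (1, ("C", "T"), 2), (2, ("G", "-"), 3)]
example_sg = oag_to_sequence_graph(example_oag)
```

The node set is left untouched; only the edge labels change, so the resulting SG has the same source and sink as the OAG.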

Slide 20. Summing up Sequence Graphs:
The final graph represents all sequences giving an optimal alignment between G_1 and G_2
We can:
- Represent a set of sequences by a sequence graph
- Align two such graphs, producing a new SG
We can now get on with the main algorithm

Slide 21. The basic idea:
- Start by comparing all sequences and find a closest pair
- Represent all sequences giving the optimal solution, deferring the choice of a single sequence
- Repeat, but this time include the set of sequences
- In the end: choose a single sequence and backtrack
This shows a need for:
- A compact representation of many sequences
- An algorithm for aligning sets of sequences

Slide 22. The Deferred Path Heuristic:
Similar to Kruskal's algorithm for finding minimum spanning trees (MSTs):
- From the sequences s_1, …, s_n, initialize n SGs G_1, …, G_n
- Until only two SGs remain:
  - Align all pairs and choose a closest pair G_i and G_j
  - Create A*(G_i, G_j) and convert A* into an SG G_k
  - Replace G_i and G_j with G_k
Note that we remember all candidate sequences
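The shape of this loop can be sketched with a deliberately simplified stand-in. In the code below, a plain set of candidate strings takes the place of a sequence graph, the distance between two sets is the smallest edit distance between their members, and merging keeps the members that achieve that distance. None of this is the A* construction from the slides; it only illustrates the "closest pair, merge, repeat until two remain" control flow. All function names are hypothetical.

```python
from itertools import combinations


def edit_distance(p, q):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, 1):
        cur = [i]
        for j, b in enumerate(q, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def set_distance(A, B):
    """Stand-in for aligning two sequence graphs: best pair of members."""
    return min(edit_distance(p, q) for p in A for q in B)


def merge(A, B):
    """Stand-in for A*(G_i, G_j): keep every member of a closest pair."""
    d = set_distance(A, B)
    return frozenset(s for p in A for q in B if edit_distance(p, q) == d for s in (p, q))


def deferred_path_tree(sequences):
    """Join closest clusters until two remain; return the unrooted tree as nested tuples."""
    clusters = {i: frozenset([s]) for i, s in enumerate(sequences)}
    tree = {i: s for i, s in enumerate(sequences)}
    next_id = len(sequences)
    while len(clusters) > 2:
        i, j = min(
            combinations(sorted(clusters), 2),
            key=lambda ij: set_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        # Replace G_i and G_j by the merged candidate set G_k.
        clusters[next_id] = merge(clusters.pop(i), clusters.pop(j))
        tree[next_id] = (tree.pop(i), tree.pop(j))
        next_id += 1
    a, b = sorted(clusters)
    return (tree[a], tree[b])


topology = deferred_path_tree(["ACGT", "ACGA", "TTTT", "TTTA"])
```

The point mirrored from the slides is that a merged cluster keeps several candidate sequences rather than committing to one, so later comparisons can still pick whichever candidate fits best.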

Slide 23. The Deferred Path Heuristic:
When only two SGs G_i and G_j remain:
- Align them and connect them in T
- Choose some optimal alignment; this gives s_i and s_j at the roots of the two subtrees
- Backtrack through the subtrees. At each step:
  - Align s_k to the underlying SGs
  - Choose some optimal alignment

Slide 24. The Deferred Path Heuristic: We defer our choice of actual sequences until the last moment, thereby enlarging our solution space:

