Using traveling salesman problem algorithms for evolutionary tree construction Chantal Korostensky and Gaston H. Gonnet Presentation by: Ben Snider
The problem… Construction of optimal evolutionary trees is NP-complete. We want heuristics!
Definition 1 A phylogenetic tree T = (V, E) ‘V’ are the vertices or nodes ‘E’ are the edges T(S) is a tree with leaf set S
Let’s break it down A tree defined by T(S) contains a set of sequences. The set of sequences is defined as S = {s_1, …, s_n}. The root of the tree has no relevance in our context. A phylogenetic tree T = (V, E) The internal nodes in ‘V’ represent (usually unknown) ancestor sequences.
Scoring schemes: Parsimony and compatibility methods Distance based methods Maximum likelihood methods
Parsimony Count the number of amino acid or nucleotide substitutions in a weighted or unweighted manner. Take a multiple sequence alignment (MSA) as input and minimize the number of changes needed to explain the evolutionary tree. Constructing an optimal MSA is also NP-complete. RATS!
Parsimony drawback Many algorithms for calculating an MSA need an evolutionary tree as input. You are only as good as your last model. DOUBLE RATS!
Distance Matrix Methods Fit a tree to a matrix of pairwise distances between the sequences. Usually use some form of weighted or unweighted least squares measure.
Distance Matrix Drawbacks The goal is to find distances such that the score of the tree is minimized. To be truly assured of a minimum value you must try all tree topologies. The number of possible tree topologies grows super-exponentially as you add additional nodes to your tree.
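To get a feel for how quickly the number of topologies grows, here is a minimal sketch. It assumes unrooted, fully binary trees with labeled leaves, for which the standard count is (2n-5)!!; that formula and the function name `num_unrooted_topologies` are not from the slides.

```python
def num_unrooted_topologies(n_leaves: int) -> int:
    """Count unrooted binary tree topologies on n labeled leaves: (2n-5)!!.

    Assumption: trees are unrooted and every internal node has degree 3.
    """
    if n_leaves < 3:
        return 1  # one or two leaves admit only a single shape
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (the double factorial).
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_topologies(n))
```

Ten leaves already give about two million topologies, which is why exhaustive search is not practical.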
Maximum Likelihood Method Choose a tree which maximizes the probability that the observed data would have occurred. Generate all possible topologies and use the lengths of the edges that maximize the likelihood.
And the salesman chooses… Maximum Likelihood Method Input: A set of unaligned amino acid sequences Output: A tree with a minimum score Error checking: The tree is correct if the error in each distance measurement is no greater than x/2, where x is the length of the shortest edge in the tree.
Definition 3 Let T be the set of all possible trees that can be generated for a given set of sequences ‘S’ = {s_1, …, s_n}. The optimal tree ‘t*’ is a tree such that F(t*) = min F(t) over all t in T. Think golf. The lower the score the better. If your function gives a higher score to better alignments, then multiply it by -1.
Definition 4 The optimal pairwise alignment of two sequences (s_1, s_2) is an alignment with the maximum score where a probabilistic scoring method is used. Use PAM distances.
Definition 5 One PAM unit is the amount of evolution that changes, on average, 1% of the amino acids. The function PAM(s_1, s_2) gives the PAM distance that maximizes the score of the optimal pairwise alignment of s_1 and s_2.
Why not use the Sum of Pairs? Sum of Pairs is a well-known scoring function for MSAs. If we were to add ‘ticks’ to the edges as we calculate the sum over all pairs, we would get this…
Sum of Pairs Example
Sum of Pairs Drawbacks There is no theoretical justification to weigh some branches more than others. It is not simply the root that is weighted more than others. Sum of Pairs methods are intrinsically problematic from an evolutionary perspective for scoring MSAs.
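The uneven weighting can be made concrete by counting ticks. The sketch below walks the tree path between every pair of leaves and adds one tick to each edge on that path. The 5-leaf tree and the internal node names u, v, w are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical tree: leaves A..E, internal nodes u, v, w.
EDGES = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"),
         ("v", "w"), ("D", "w"), ("E", "w")]
LEAVES = ["A", "B", "C", "D", "E"]


def build_adjacency(edges):
    """Turn an edge list into an undirected adjacency map."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    return adj


def path_edges(adj, start, goal):
    """Return the edges on the unique tree path from start to goal."""
    stack = [(start, None, [])]
    while stack:
        node, parent, so_far = stack.pop()
        if node == goal:
            return so_far
        for nxt in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, so_far + [frozenset((node, nxt))]))
    raise ValueError("goal not reachable from start")


def sum_of_pairs_ticks(edges, leaves):
    """Tick every edge once for each leaf pair whose path crosses it."""
    adj = build_adjacency(edges)
    ticks = Counter()
    for a, b in combinations(leaves, 2):
        ticks.update(path_edges(adj, a, b))
    return ticks


ticks = sum_of_pairs_ticks(EDGES, LEAVES)
for a, b in EDGES:
    print(f"{a}-{b}: {ticks[frozenset((a, b))]} ticks")
```

In this tree the two internal edges collect 6 ticks each while every leaf edge collects 4, so the same amount of evolution counts more or less depending on where it sits in the tree.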
So we grab the salesman… (Definition 6) A circular order C(T) of a set of sequences (S) is a tour through the tree T(S) where each edge is traversed exactly twice and each leaf is visited only once. More pictures…
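One way to obtain a circular order from a known tree is a depth-first walk: record the leaves in the order they are reached. The sketch below does that on a hypothetical 5-leaf tree (internal nodes u, v, w are made up) and counts how often each edge is crossed, to check the "exactly twice" property of Definition 6.

```python
from collections import Counter

# Hypothetical tree as an adjacency map; u, v, w are internal nodes.
TREE = {
    "A": ["u"], "B": ["u"], "C": ["v"], "D": ["w"], "E": ["w"],
    "u": ["A", "B", "v"],
    "v": ["u", "C", "w"],
    "w": ["v", "D", "E"],
}


def circular_order(tree, start):
    """Depth-first walk; returns leaves in visiting order and edge ticks."""
    order = []
    ticks = Counter()

    def walk(node, parent):
        if len(tree[node]) == 1:  # a leaf has exactly one neighbor
            order.append(node)
        for nxt in tree[node]:
            if nxt != parent:
                edge = frozenset((node, nxt))
                ticks[edge] += 1  # walking away from the start
                walk(nxt, node)
                ticks[edge] += 1  # walking back

    walk(start, None)
    return order, ticks


order, ticks = circular_order(TREE, "u")
print(order)                # ['A', 'B', 'C', 'D', 'E']
print(set(ticks.values()))  # {2}
```

Changing the order of neighbors in `TREE` gives other valid circular orders for the same tree; a tree usually has several.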
The Tour
Therefore we score our tree… The scoring function is based on the circular order. Add all the PAM distances (represented by the edges) from our circular path. Divide by two, because we want to count each edge only once.
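A minimal sketch of this score, assuming the pairwise PAM distances are already available. The numbers in `DIST` are hypothetical: they were derived from a made-up 4-leaf tree ((A,B),(C,D)) with edge lengths 10, 20, 5, 15, 25, whose total length is 75.

```python
# Hypothetical pairwise PAM distances, keyed by unordered pair.
DIST = {
    frozenset("AB"): 30, frozenset("AC"): 30, frozenset("AD"): 40,
    frozenset("BC"): 40, frozenset("BD"): 50, frozenset("CD"): 40,
}


def circular_score(order, dist):
    """Half the sum of distances between cyclically adjacent sequences."""
    n = len(order)
    total = sum(dist[frozenset((order[i], order[(i + 1) % n]))]
                for i in range(n))
    return total / 2  # every edge was walked twice, so halve the sum


print(circular_score(["A", "B", "C", "D"], DIST))  # 75.0
print(circular_score(["A", "C", "B", "D"], DIST))  # 80.0
```

The correct circular order recovers the total edge length of the tree (75). The second order is not a circular order of that tree: it crosses the internal edge four times and scores higher.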
Does this save time? The problem is basically a symmetric Traveling Salesman Problem (TSP). The problem is to find the shortest route where each city is visited exactly once. *Our cities are amino acid sequences and our distances are the PAM distances of the pairwise alignments.* TSP optimal solutions can be calculated in a few hours for up to 1000 cities. For up to 100 cities it only takes a few seconds. We will rarely have more than 100 amino acid sequences to compare at any single time.
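To show the mapping from sequences to cities, here is a brute-force sketch that tries every tour over a small distance table. It is for illustration only and is not the solver behind the timings quoted on this slide; exhaustive search is hopeless beyond roughly ten cities. The distances in `DIST` are hypothetical.

```python
from itertools import permutations

# Hypothetical pairwise PAM distances, keyed by unordered pair.
DIST = {
    frozenset("AB"): 30, frozenset("AC"): 30, frozenset("AD"): 40,
    frozenset("BC"): 40, frozenset("BD"): 50, frozenset("CD"): 40,
}


def tour_length(order, dist):
    """Length of the closed tour that visits `order` and returns to start."""
    n = len(order)
    return sum(dist[frozenset((order[i], order[(i + 1) % n]))]
               for i in range(n))


def shortest_tour(names, dist):
    """Exhaustive search; the first city is fixed to skip rotations."""
    first, rest = names[0], names[1:]
    best_order, best_len = None, float("inf")
    for perm in permutations(rest):
        order = [first, *perm]
        length = tour_length(order, dist)
        if length < best_len:
            best_order, best_len = order, length
    return best_order, best_len


best_order, best_len = shortest_tour(["A", "B", "C", "D"], DIST)
print(best_order, best_len, best_len / 2)
```

The shortest tour has length 150, so the tree score is 75. Several tours tie at that length, and each of them is a valid circular order for the same tree.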
What about error? Determine how large the distance measurement error may be, such that we still get a correct order. Do the opposite and determine the smallest possible error such that we get a wrong circular order. This means at least one edge was traversed more than twice. That edge is the smallest edge because we want to find the smallest possible error.
What about error?
If the output of the TSP algorithm is a wrong circular order, then the following inequality must be satisfied…
What about error?
Figure 4
Conclusion: The tree is correct if the error in each distance measurement is no greater than x/2, where x is the length of the shortest edge in the tree. Using the TSP as a heuristic saves us some time, but is not always correct. But it’s better than looking at every possible tree topology!
The End Questions? Comments? Concerns?