Presentation is loading. Please wait.

Presentation is loading. Please wait.

O PTIMALITY OF THE N EIGHBOR J OINING A LGORITHM AND F ACES OF THE B ALANCED M INIMUM E VOLUTION P OLYTOPE David Haws Joint work with Ruriko Yoshida and.

Similar presentations


Presentation on theme: "O PTIMALITY OF THE N EIGHBOR J OINING A LGORITHM AND F ACES OF THE B ALANCED M INIMUM E VOLUTION P OLYTOPE David Haws Joint work with Ruriko Yoshida and."— Presentation transcript:

1 O PTIMALITY OF THE N EIGHBOR J OINING A LGORITHM AND F ACES OF THE B ALANCED M INIMUM E VOLUTION P OLYTOPE David Haws Joint work with Ruriko Yoshida and Terrell Hodge To appear in Bulletin of Mathematical Biology

2 Figure 19.1 Genomes 3 (© Garland Science 2007) O RIGINS OF S PECIES

3 G ENE TREE IN A S PECIES TREE Maddison WP (1997) Gene trees in species trees. Systematic Biology 46: 523-536

4 P HYLOGENETIC R ECONSTRUCTION Observe alignment of DNA for n species. 1 AGCCCGTCGC… 2 AGCTCGTCCC… 3 GGCTCGACCC… n AGCCGGATCC… Find binary tree that best describes the evolutionary history of the n species.

5 P HYLOGENETIC R ECONSTRUCTION Maximum likelihood estimation methods (MLE): These methods describe evolution in terms of discrete-state continuous-time Markov process. Bayesian inference methods: Use Bayes Theorem and MCMC to estimate the posterior distribution rather than obtaining a point estimation. And distance based methods…

6 D ISTANCE B ASED M ETHODS Observe alignment of DNA for n species. 1 AGCCCGTCGC… 2 AGCTCGTCCC… 3 GGCTCGACCC… … n AGCCGGATCC… Compute an “evolutionary” distance between each pair of DNA sequences. 12…n 1 00.3…0.5 2 0.30…0.4 … ………… n 0.50.4…0 NOTE: Still need to find binary tree that fits best with this distance matrix.

7 D ISTANCE B ASED M ETHOD O VERVIEW 1 AGCCCGTCGC… 2 AGCTCGTCCC… 3 GGCTCGACCC… … n AGCCGGATCC… 12…n 1 00. 3 …0. 5 2 0. 3 0…0. 4 … ………… n 0. 5 0. 4 …0 Find binary tree T that “best” describes the distance matrix D. I.e., consider D fixed and explore all binary trees to find best tree T. Align DNACompute distance matrix D Find binary tree T given D Binary tree here means bifurcating tree.

8 D ISTANCE M ATRIX (F ROM A T REE ) A distance matrix for a tree T is a matrix D where D ij is the mutation distance between species i and j. 123456 1 06891211 2 6067109 3 860365 4 973054 5 12106505 6 1195450

9 B ALANCED M INIMUM E VOLUTION BME is a weighted least squares distance based method which puts more emphasis on the shorter distances. Given a distance matrix D, the BME method can assign edge lengths to any binary tree topology T with n leaves. Goal of BME, given fixed D, is to find the binary tree T with the smallest sum of total branch lengths ∆ D (T) (assigned by BME). min ∆ D (T) for all (2n-5)!! tree topologies. += 1234…6 1 0689…11 2 6067…9 3 8603…5 4 9730…4 5 ……………5 6 95450 BME

10 P AUPLIN ’ S F ORMULA If ∆ D (T) is the sum of branch lengths of the tree topology T estimated by BME given D, then Pauplin’s formula is where W ij (T) = (2) (1−# of branches between i and j in T) for a particular tree topology T.

11 E XAMPLE For the tree topology above, we have W (T) = (1/2, 1/4, 1/8, 1/8, 1/4, 1/8, 1/8, 1/4, 1/4, 1/2). Index is lexicographic: 01,02,03,04,12,13,…,34.

12 BME AS A LINEAR PROGRAM Given Pauplin’s formula, the BME method is thus given by the following linear program: such that where We call P n the BME polytope. The set of all objectives D such that T t is minimal is the normal cone at the vertex W(T t ). We call this cone the BME cone of T t.

13 BME POLYTOPE W. Day (87) showed that finding the tree topology minimizing ∆ D (T) is NP-hard. Current BME software uses hill-climbing heuristics. BME polytope lies in R n(n-1)/2 and is dimension n(n-1)/2 – n. Lemma [Eickmeyer,Yoshida,2008] Vertices of P n are the BME vectors of unrooted binary trees with n leaves. The star phylogeny lies in the interior of the BME polytope, and all other BME vectors lie on the boundary of the BME polytope.

14 C OMBINATORICS OF THE BME POLYTOPES For up to n = 7 taxa, Eickmeyer et. al. computed BME polytopes and studied their structure. nDimensionF-vector 42(3, 3) 55(15, 105, 250, 210, 52) 69(105, 5460, ?, ?, ?, 90262) 714(945, 445410, ?, ?, ?, ?, ?) All pairs of binary tree topologies T 1, T 2 on n ≤ 6 taxa can be cooptimal. For n = 7, there is one combinatorial type of non-edge.

15 C OMBINATORIAL TYPE OF NON - EDGE n = 7.

16 E DGES OF THE BME POLYTOPE We still do not understand all pairs of trees which will form edges on the BME polytope. If we understand the edges, we might be able to devise a competitive alternative to FastME (current software) that improves trees by walking along edges on the BME polytope, rather than performing nearest-neighbor interchange (NNI), or subtree-prune-regraft (SPR) moves. Edge-walking (known as the simplex algorithm in linear programming) works very well in practice.

17 S UBTREE P RUNE R EGRAFT (SPR) MOVE 1Select a subtree. 2Detach the selected subtree. 3Attempt to regraft it onto another branch of the remaining tree, in such a way that a new tree is formed.

18 SPR MOVE ADJACENCY This means that a pair of binary tree topologies T 1, T 2 on n taxa adjacent by an SPR move are adjacent by an edge on the BME polytope. Theorem [H, Hodge, and Yoshida, 2010] If a pair of binary tree topologies T 1, T 2 on n taxa are adjacent by a subtree prune regraft (SPR) move then they can be cooptimal in terms of Pauplin’s formula for BME. Theorem [H, Hodge, and Yoshida, 2010] If a pair of binary tree topologies T 1, T 2 on n taxa are adjacent by a subtree prune regraft (SPR) move then they can be cooptimal in terms of Pauplin’s formula for BME.

19 C OMPARING BME TO N EIGHBOR J OINING Neighbor Joining (NJ) method: A highly popular distance based method used in phylogenetics. [Saito, Nei 1987],[Studier, Keppler 1988]. Given a fixed distance matrix D, NJ computes a tree topology by recursively joining two nodes which are ‘close’. Specifically NJ joins nodes a and b which have minimal Q-value:

20 NJ: F AST AND C ONSISTENT Nodes a,b are then replaced by a single new node z which is the root of the cherry (a,b), and distances D zk are defined as D zk = D ak + D bk − 2D ab. Neighbor joining is then applied recursively on the remaining nodes, until a binary tree is obtained. Neighbor joining based on elements of the matrix Q is consistent: Given a tree metric D = D T as input, NJ will correctly output tree T.

21 N EIGHBOR JOINING CONES Elements of Q are linear in the distances. So picking a cherry (a,b) means the distances satisfy linear inequalities. After picking cherry (a,b) and replacing it with a new node z, the new distances D zk are linear in the old distances: D zk = D ak + D bk − 2D ab.

22 N EIGHBOR J OINING C ONES Thus NJ will output a particular tree topology T, and pick cherries in a particular order, the original distances D ij satisfy certain linear inequalities. These inequalities define a cone (apex 0) in R n(n-1)/2, called a NJ cone. NJ will output a particular tree topology T iff the pairwise distances lies in a union of NJ cones.

23 I SSUES WITH NEIGHBOR JOINING Neighbor joining is fast and consistent, but it isn’t based on a model of speciation. The NJ algorithm is a greedy algorithm optimizing the BME criteria [Gascuel, Steel 2006] Neighbor joining outputs a tree topology T iff the data lies in a union of cones. The union of these cones need not be convex. In fact NJ is not convex: There are distance matrices D, D’, such that NJ produces the same tree T 1 when run on input D or D’, but NJ produces a different tree T 2 not equal to T 1 when run on the input (D + D’)/2

24 NJ AND BME CONES This result is particularly important in phylogenetics since this shows that even though the NJ Algorithm is a greedy algorithm, with any order to pick leaf pairs, the NJ Algorithm will return the BME tree for some dissimilarity map. Theorem [H., Hodge, Yoshida (2010)] Given a tree T with any number of taxa, and any particular order σ of picking its pairs of leaves, the BME cone of T and the NJ cone of T and σ has intersection of positive measure. Theorem [H., Hodge, Yoshida (2010)] Given a tree T with any number of taxa, and any particular order σ of picking its pairs of leaves, the BME cone of T and the NJ cone of T and σ has intersection of positive measure.

25 F ACES OF BME P OLYTOPE A clade of a binary tree T is the subtree given by an internal node and all its decendents. Blue and red boxes are clades, while green is not a clade.

26 C LADE -F ACES OF THE BME P OLYTOPE Theorem [H., Hodge, Yoshida (2010)] Every disjoint collection of clades C 1,C 2,…,C k gives a face of the BME polytope, Theorem [H., Hodge, Yoshida (2010)] Every disjoint collection of clades C 1,C 2,…,C k gives a face of the BME polytope, We can now describe a large class of faces of the BME polytope. Note: Clade-face is a smaller dimensional BME polytope.

27 S UMMARY BME is a consistent distance based phylogenetic reconstruction method with strong biological interpretation. BME method is equivalent to LP over the BME polytope. Until recently, nothing was known about this polytope in general. SPR moves are edges of the BME polytope and disjoint clades are faces! We hope to exploit this new knowledge of the BME polytope to develop new algorithms. We strengthened the connection between the hugely popular NJ method and the BME method.

28 O PTIMALITY OF THE N EIGHBOR J OINING A LGORITHM AND F ACES OF THE B ALANCED M INIMUM E VOLUTION P OLYTOPE Thank you! To appear in the Bulletin of Mathematical Biology Available: http://arxiv.org/abs/1004.2073http://arxiv.org/abs/1004.2073 David Haws, University of Kentucky www.davidhaws.net


Download ppt "O PTIMALITY OF THE N EIGHBOR J OINING A LGORITHM AND F ACES OF THE B ALANCED M INIMUM E VOLUTION P OLYTOPE David Haws Joint work with Ruriko Yoshida and."

Similar presentations


Ads by Google