OPTIMALITY OF THE NEIGHBOR JOINING ALGORITHM AND FACES OF THE BALANCED MINIMUM EVOLUTION POLYTOPE
David Haws. Joint work with Ruriko Yoshida and Terrell Hodge.


OPTIMALITY OF THE NEIGHBOR JOINING ALGORITHM AND FACES OF THE BALANCED MINIMUM EVOLUTION POLYTOPE
David Haws
Joint work with Ruriko Yoshida and Terrell Hodge
To appear in Bulletin of Mathematical Biology

ORIGINS OF SPECIES
(Figure 19.1, Genomes 3, © Garland Science 2007)

GENE TREE IN A SPECIES TREE
Maddison WP (1997) Gene trees in species trees. Systematic Biology 46:

PHYLOGENETIC RECONSTRUCTION
Observe an alignment of DNA for n species:
1 AGCCCGTCGC…
2 AGCTCGTCCC…
3 GGCTCGACCC…
…
n AGCCGGATCC…
Find the binary tree that best describes the evolutionary history of the n species.

PHYLOGENETIC RECONSTRUCTION
Maximum likelihood estimation (MLE) methods: these describe evolution as a discrete-state, continuous-time Markov process.
Bayesian inference methods: these use Bayes' theorem and MCMC to estimate the posterior distribution rather than a single point estimate.
And distance-based methods…

DISTANCE BASED METHODS
Observe an alignment of DNA for n species:
1 AGCCCGTCGC…
2 AGCTCGTCCC…
3 GGCTCGACCC…
…
n AGCCGGATCC…
Compute an "evolutionary" distance between each pair of DNA sequences.
(Figure: n × n matrix of pairwise distances, rows and columns indexed 1, 2, …, n, with zeros on the diagonal.)
NOTE: We still need to find the binary tree that fits best with this distance matrix.
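The slide does not say which evolutionary distance is used, so the following is only a minimal sketch under an assumed choice: the proportion of differing sites (p-distance), optionally corrected with the Jukes-Cantor formula $d = -\tfrac{3}{4}\ln(1 - \tfrac{4}{3}p)$. The function names are hypothetical, and the slide's sequence n stands in as a fourth sequence.

```python
import math

def p_distance(s, t):
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t)) / len(s)

def jukes_cantor(p):
    """Jukes-Cantor corrected distance (assumed model); infinite once p reaches 3/4."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(seqs, correct=True):
    """Symmetric matrix of pairwise distances with zeros on the diagonal."""
    n = len(seqs)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_distance(seqs[i], seqs[j])
            D[i][j] = D[j][i] = jukes_cantor(p) if correct else p
    return D

# First ten sites of the sequences shown on the slide (sequence n used as the fourth).
seqs = ["AGCCCGTCGC", "AGCTCGTCCC", "GGCTCGACCC", "AGCCGGATCC"]
P = distance_matrix(seqs, correct=False)  # raw p-distances
D = distance_matrix(seqs)                 # Jukes-Cantor corrected distances
```

The corrected distance is always at least the raw p-distance, since it accounts for multiple substitutions at the same site.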

DISTANCE BASED METHOD OVERVIEW
Pipeline: align DNA, then compute the distance matrix D, then find a binary tree T given D.
(Figure: the aligned sequences 1, 2, 3, …, n and the resulting n × n distance matrix.)
Find the binary tree T that "best" describes the distance matrix D, i.e., consider D fixed and explore all binary trees to find the best tree T.
Binary tree here means bifurcating tree.

DISTANCE MATRIX (FROM A TREE)
A distance matrix for a tree T is a matrix D where $D_{ij}$ is the mutation distance between species i and j, that is, the sum of the branch lengths on the path between leaves i and j in T.
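As an illustration, a distance matrix can be read off a tree by summing branch lengths along the path between each pair of leaves. The 5-leaf tree and its branch lengths below are hypothetical, chosen only to make the sketch concrete.

```python
from collections import defaultdict

def tree_metric(edges, leaves):
    """Return D[i][j] = sum of branch lengths on the path between leaves i and j."""
    adj = defaultdict(list)
    for u, v, length in edges:
        adj[u].append((v, length))
        adj[v].append((u, length))

    def distances_from(source):
        # Depth-first traversal; in a tree every node is reached by a unique path.
        dist = {source: 0.0}
        stack = [source]
        while stack:
            node = stack.pop()
            for nbr, length in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + length
                    stack.append(nbr)
        return dist

    D = {}
    for i in leaves:
        dist = distances_from(i)
        D[i] = {j: dist[j] for j in leaves}
    return D

# Hypothetical tree: cherries (0, 1) and (3, 4), leaf 2 attached in the middle.
edges = [
    (0, "u", 1.0), (1, "u", 2.0),
    ("u", "v", 0.5), (2, "v", 1.5),
    ("v", "w", 0.5), (3, "w", 1.0), (4, "w", 3.0),
]
D = tree_metric(edges, leaves=[0, 1, 2, 3, 4])
```

A matrix built this way satisfies the four-point condition: among the three pairwise sums for any four leaves, the two largest are equal.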

BALANCED MINIMUM EVOLUTION
BME is a weighted least squares distance-based method that puts more emphasis on the shorter distances.
Given a distance matrix D, the BME method can assign edge lengths to any binary tree topology T with n leaves.
The goal of BME, given a fixed D, is to find the binary tree T with the smallest total branch length $\Delta_D(T)$ (as assigned by BME):
$$\min_T \Delta_D(T) \quad \text{over all } (2n-5)!! \text{ tree topologies.}$$
(Figure: a distance matrix and a tree topology, with BME assigning the edge lengths.)

PAUPLIN'S FORMULA
If $\Delta_D(T)$ is the sum of branch lengths of the tree topology T estimated by BME given D, then Pauplin's formula is
$$\Delta_D(T) = \sum_{i<j} W_{ij}(T)\, D_{ij},$$
where $W_{ij}(T) = 2^{\,1 - (\text{number of branches between } i \text{ and } j \text{ in } T)}$ for a particular tree topology T.

EXAMPLE
For the tree topology above, we have
$$W(T) = (1/2,\ 1/4,\ 1/8,\ 1/8,\ 1/4,\ 1/8,\ 1/8,\ 1/4,\ 1/4,\ 1/2).$$
The index is lexicographic: 01, 02, 03, 04, 12, 13, …, 34.
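The weights can be computed by counting branches on the path between each pair of leaves. The sketch below assumes the pictured tree has cherries (0, 1) and (3, 4) with leaf 2 attached between them, which is the topology implied by the vector in the example. The internal node names "u", "v", "w" and the function names are hypothetical.

```python
from collections import deque
from fractions import Fraction
from itertools import combinations

def edge_counts_from(adj, source):
    """Breadth-first search: number of branches from source to every node."""
    depth = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)
    return depth

def bme_vector(edges, leaves):
    """W_ij = 2^(1 - number of branches between i and j), pairs in lexicographic order."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    depth = {i: edge_counts_from(adj, i) for i in leaves}
    return [Fraction(2) ** (1 - depth[i][j]) for i, j in combinations(leaves, 2)]

# Assumed topology: cherries (0, 1) and (3, 4), leaf 2 in the middle.
edges = [(0, "u"), (1, "u"), ("u", "v"), (2, "v"), ("v", "w"), (3, "w"), (4, "w")]
W = bme_vector(edges, leaves=[0, 1, 2, 3, 4])
print([str(w) for w in W])
```

Exact fractions are used so the result can be compared directly with the vector on the slide. For this topology, the weights of the pairs involving any fixed leaf sum to 1.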

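BME AS A LINEAR PROGRAM
Given Pauplin's formula, the BME method is thus given by the following linear program:
$$\min \sum_{i<j} W_{ij}\, D_{ij} \quad \text{such that } W \in P_n,$$
where
$$P_n = \operatorname{conv}\{\, W(T) : T \text{ is a binary tree topology with } n \text{ leaves} \,\}.$$
We call $P_n$ the BME polytope.
The set of all objectives D such that $T_t$ is minimal is the normal cone at the vertex $W(T_t)$. We call this cone the BME cone of $T_t$.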
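Because a linear objective over a polytope attains its minimum at a vertex, the program can be solved for very small n by enumerating every tree topology and evaluating Pauplin's formula at each one. The following is a brute-force sketch only, not how BME software works. The function names, the stepwise-insertion enumeration, and the example matrix (a tree metric for a hypothetical 5-leaf tree with total branch length 9.5) are all assumptions made for illustration.

```python
from collections import deque
from itertools import combinations

def all_binary_trees(n):
    """Yield every unrooted binary tree on leaves 0..n-1 (n >= 3) as an edge list.

    Leaves are inserted one at a time onto each existing edge, which yields
    each of the (2n-5)!! topologies exactly once. Internal nodes are n, n+1, ...
    """
    def grow(edges, next_leaf, next_internal):
        if next_leaf == n:
            yield edges
            return
        for idx, (u, v) in enumerate(edges):
            rest = edges[:idx] + edges[idx + 1:]
            new = rest + [(u, next_internal), (next_internal, v),
                          (next_internal, next_leaf)]
            yield from grow(new, next_leaf + 1, next_internal + 1)

    yield from grow([(0, n), (1, n), (2, n)], 3, n + 1)

def pauplin_weights(edges, n):
    """W[i, j] = 2^(1 - number of branches between leaves i and j)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    W = {}
    for i in range(n):
        depth = {i: 0}
        queue = deque([i])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in depth:
                    depth[nbr] = depth[node] + 1
                    queue.append(nbr)
        for j in range(i + 1, n):
            W[i, j] = 2.0 ** (1 - depth[j])
    return W

def bme_brute_force(D, n):
    """Return (minimum Pauplin length, edge list of a minimizing topology)."""
    best = None
    for edges in all_binary_trees(n):
        W = pauplin_weights(edges, n)
        length = sum(W[i, j] * D[i][j] for i, j in combinations(range(n), 2))
        if best is None or length < best[0]:
            best = (length, edges)
    return best

# Hypothetical tree metric: cherries (0, 1) and (3, 4), total branch length 9.5.
D = [
    [0, 3, 3, 3, 5],
    [3, 0, 4, 4, 6],
    [3, 4, 0, 3, 5],
    [3, 4, 3, 0, 4],
    [5, 6, 5, 4, 0],
]
best_length, best_edges = bme_brute_force(D, 5)
```

The enumeration grows as (2n-5)!!, so this is only practical for a handful of taxa, which is why heuristics are used in practice.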

BME POLYTOPE
W. Day (1987) showed that finding the tree topology minimizing $\Delta_D(T)$ is NP-hard.
Current BME software uses hill-climbing heuristics.
The BME polytope lies in $\mathbb{R}^{n(n-1)/2}$ and has dimension $n(n-1)/2 - n$.
Lemma [Eickmeyer, Yoshida, 2008]: The vertices of $P_n$ are the BME vectors of unrooted binary trees with n leaves. The star phylogeny lies in the interior of the BME polytope, and all other BME vectors lie on the boundary of the BME polytope.

COMBINATORICS OF THE BME POLYTOPES
For up to n = 7 taxa, Eickmeyer et al. computed BME polytopes and studied their structure.

n    Dimension    f-vector
4    2            (3, 3)
5    5            (15, 105, 250, 210, 52)
6    9            (105, 5460, ?, ?, ?, 90262)
7    14           (945, , ?, ?, ?, ?, ?)

All pairs of binary tree topologies $T_1$, $T_2$ on n ≤ 6 taxa can be cooptimal.
For n = 7, there is one combinatorial type of non-edge.
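The dimension column and the first entry of each f-vector (the number of vertices) follow from formulas stated earlier in the talk: dimension $n(n-1)/2 - n$ and $(2n-5)!!$ tree topologies. A small sketch to reproduce them, with hypothetical function names:

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 or 2."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def bme_polytope_stats(n):
    """Ambient dimension, polytope dimension and vertex count of the BME polytope P_n."""
    ambient = n * (n - 1) // 2
    return {
        "ambient": ambient,
        "dimension": ambient - n,
        "vertices": double_factorial(2 * n - 5),
    }

for n in range(4, 8):
    print(n, bme_polytope_stats(n))
```

The vertex count grows super-exponentially in n, which is consistent with only partial f-vectors being known for n = 6 and n = 7.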

COMBINATORIAL TYPE OF NON-EDGE
n = 7.

EDGES OF THE BME POLYTOPE
We still do not know all pairs of trees that form edges of the BME polytope.
If we understood the edges, we might be able to devise a competitive alternative to FastME (the current software) that improves trees by walking along edges of the BME polytope rather than by performing nearest-neighbor interchange (NNI) or subtree-prune-regraft (SPR) moves.
Edge-walking (known as the simplex algorithm in linear programming) works very well in practice.

SUBTREE PRUNE REGRAFT (SPR) MOVE
1. Select a subtree.
2. Detach the selected subtree.
3. Attempt to regraft it onto another branch of the remaining tree, in such a way that a new tree is formed.

SPR MOVE ADJACENCY
Theorem [H., Hodge, and Yoshida, 2010]: If a pair of binary tree topologies $T_1$, $T_2$ on n taxa are adjacent by a subtree-prune-regraft (SPR) move, then they can be cooptimal in terms of Pauplin's formula for BME.
This means that a pair of binary tree topologies $T_1$, $T_2$ on n taxa that are adjacent by an SPR move are also adjacent by an edge on the BME polytope.

COMPARING BME TO NEIGHBOR JOINING
Neighbor Joining (NJ) method: a highly popular distance-based method used in phylogenetics [Saitou, Nei 1987], [Studier, Keppler 1988].
Given a fixed distance matrix D, NJ computes a tree topology by recursively joining two nodes which are "close".
Specifically, NJ joins the nodes a and b which have minimal Q-value:
$$Q_{ab} = (n-2)\, D_{ab} - \sum_{k=1}^{n} D_{ak} - \sum_{k=1}^{n} D_{bk}.$$

NJ: FAST AND CONSISTENT
Nodes a, b are then replaced by a single new node z, the parent of the cherry (a, b), and the distances $D_{zk}$ are defined as
$$D_{zk} = \tfrac{1}{2}\,(D_{ak} + D_{bk} - D_{ab}).$$
Neighbor joining is then applied recursively on the remaining nodes until a binary tree is obtained.
Neighbor joining based on the elements of the matrix Q is consistent: given a tree metric $D = D_T$ as input, NJ will correctly output the tree T.
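The recursion can be sketched in a few lines. This is a minimal illustration of the topology-only version of NJ, assuming the standard Q-criterion $Q_{ab} = (n-2)D_{ab} - \sum_k D_{ak} - \sum_k D_{bk}$ and the standard reduction $D_{zk} = \tfrac{1}{2}(D_{ak} + D_{bk} - D_{ab})$. The function name, the nested-tuple tree encoding, and the example matrix (a tree metric for a hypothetical 5-leaf tree with cherries (0, 1) and (3, 4)) are assumptions.

```python
from itertools import combinations

def neighbor_joining(D, labels):
    """Return (cherries joined in order, final three nodes as nested tuples)."""
    # Dict-of-dicts copy so that nodes can be added and retired freely.
    dist = {a: {b: D[i][j] for j, b in enumerate(labels)}
            for i, a in enumerate(labels)}
    nodes = list(labels)
    joins = []
    while len(nodes) > 3:
        n = len(nodes)
        row_sum = {a: sum(dist[a][k] for k in nodes) for a in nodes}

        def q_value(pair):
            a, b = pair
            return (n - 2) * dist[a][b] - row_sum[a] - row_sum[b]

        # Pick the pair with minimal Q-value (ties broken by enumeration order).
        a, b = min(combinations(nodes, 2), key=q_value)
        z = (a, b)  # the new node is encoded as the pair it replaces
        dist[z] = {z: 0.0}
        for k in nodes:
            if k != a and k != b:
                d = 0.5 * (dist[a][k] + dist[b][k] - dist[a][b])
                dist[z][k] = d
                dist[k][z] = d
        nodes = [k for k in nodes if k != a and k != b] + [z]
        joins.append((a, b))
    return joins, tuple(nodes)

# Hypothetical tree metric: cherries (0, 1) and (3, 4), leaf 2 in the middle.
D = [
    [0, 3, 3, 3, 5],
    [3, 0, 4, 4, 6],
    [3, 4, 0, 3, 5],
    [3, 4, 3, 0, 4],
    [5, 6, 5, 4, 0],
]
joins, tree = neighbor_joining(D, labels=[0, 1, 2, 3, 4])
print(joins, tree)
```

Stopping at three nodes is a design choice: three nodes determine a unique unrooted topology, so no further Q-values need to be compared. Branch lengths are omitted because only the topology matters for the comparison with BME.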

NEIGHBOR JOINING CONES
The elements of Q are linear in the distances, so picking a cherry (a, b) means that the distances satisfy the linear inequalities $Q_{ab} \le Q_{ij}$ for all pairs (i, j).
After picking the cherry (a, b) and replacing it with a new node z, the new distances $D_{zk}$ are linear in the old distances:
$$D_{zk} = \tfrac{1}{2}\,(D_{ak} + D_{bk} - D_{ab}).$$

NEIGHBOR JOINING CONES
Thus, if NJ outputs a particular tree topology T and picks cherries in a particular order, the original distances $D_{ij}$ satisfy certain linear inequalities.
These inequalities define a cone (with apex 0) in $\mathbb{R}^{n(n-1)/2}$, called an NJ cone.
NJ will output a particular tree topology T iff the pairwise distances lie in a union of NJ cones.

ISSUES WITH NEIGHBOR JOINING
Neighbor joining is fast and consistent, but it isn't based on a model of speciation.
The NJ algorithm is a greedy algorithm optimizing the BME criterion [Gascuel, Steel 2006].
Neighbor joining outputs a tree topology T iff the data lie in a union of cones. The union of these cones need not be convex.
In fact NJ is not convex: there are distance matrices D, D' such that NJ produces the same tree $T_1$ when run on input D or D', but produces a different tree $T_2 \neq T_1$ when run on the input (D + D')/2.

NJ AND BME CONES
Theorem [H., Hodge, Yoshida (2010)]: Given a tree T with any number of taxa, and any particular order σ of picking its pairs of leaves, the BME cone of T and the NJ cone of T and σ have an intersection of positive measure.
This result is particularly important in phylogenetics: it shows that even though the NJ algorithm is a greedy algorithm, for any order of picking leaf pairs, the NJ algorithm will return the BME tree for some dissimilarity map.

FACES OF BME POLYTOPE
A clade of a binary tree T is the subtree given by an internal node and all its descendants.
(Figure: the blue and red boxes are clades, while the green box is not a clade.)

CLADE-FACES OF THE BME POLYTOPE
We can now describe a large class of faces of the BME polytope.
Theorem [H., Hodge, Yoshida (2010)]: Every disjoint collection of clades $C_1, C_2, \ldots, C_k$ gives a face of the BME polytope.
Note: a clade-face is a BME polytope of smaller dimension.

SUMMARY
BME is a consistent distance-based phylogenetic reconstruction method with a strong biological interpretation.
The BME method is equivalent to a linear program over the BME polytope.
Until recently, nothing was known about this polytope in general.
Pairs of trees related by an SPR move give edges of the BME polytope, and disjoint collections of clades give faces!
We hope to exploit this new knowledge of the BME polytope to develop new algorithms.
We strengthened the connection between the hugely popular NJ method and the BME method.

OPTIMALITY OF THE NEIGHBOR JOINING ALGORITHM AND FACES OF THE BALANCED MINIMUM EVOLUTION POLYTOPE
Thank you!
To appear in the Bulletin of Mathematical Biology
David Haws, University of Kentucky