Fabio Pardi PhD student in Goldman Group European Bioinformatics Institute and University of Cambridge, UK Joint work with: Barbara Holland, Mike Hendy, Nick Goldman The BME criterion for tree reconstruction and a Branch and Bound algorithm for BME-optimal trees.

Balanced Minimum Evolution: What is BME?
BME stands for Balanced Minimum Evolution and is a (new) criterion for distance-based tree reconstruction.
It is based on Pauplin's formula, $\Lambda_D(T)$, which estimates the total length of a tree from (1) its topology $T$ and (2) an estimated distance matrix $D = (d_{ij})$. [Pauplin 2000 J Mol Evol 51]
The objective, as for any other Minimum Evolution (ME) method, is to find a $T$ that minimises $\Lambda_D(T)$ (= "BME score").

Balanced Minimum Evolution: Pauplin's formula.
$\Lambda_D(T) = \sum_{ij} w_{ij}(T)\, d_{ij}$, where $w_{ij}(T) = 1 / 2^{\,\#\text{branches between } i \text{ and } j}$.
How to get it: take a circular ordering $o(1), o(2), o(3), o(4), o(5)$ of the taxa as they appear around the tree. A reasonable estimate of the tree length is
$\Lambda_o = \tfrac{1}{2} \left( d_{o(1)o(2)} + d_{o(2)o(3)} + d_{o(3)o(4)} + d_{o(4)o(5)} + d_{o(5)o(1)} \right) = \tfrac{1}{2} \sum_i d_{o(i),o(i+1)}$ (indices taken cyclically).
But $\Lambda_o$ depends on the ordering $o$... Pauplin's formula can be obtained by averaging over all such $o$'s. [Semple & Steel 2004 Adv Appl Math 32]
It can also be generalised to multifurcating trees, but that is not relevant here, as it can be proven that BME-optimal trees are always bifurcating.
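As a rough illustration of Pauplin's formula, here is a minimal Python sketch. It is not taken from the talk: the function names, the adjacency-dictionary tree representation and the example tree are all assumptions made for this example. It uses the standard unordered-pair form of the formula, in which each pair of taxa is counted once with weight $2^{1 - p_{ij}}$, where $p_{ij}$ is the number of branches on the path between $i$ and $j$.

```python
from collections import deque
from itertools import combinations


def path_edge_counts(adj, leaves):
    """Number of branches on the path between every pair of leaves (BFS from each leaf)."""
    counts = {}
    for src in leaves:
        depth = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in depth:
                    depth[v] = depth[u] + 1
                    queue.append(v)
        for dst in leaves:
            counts[src, dst] = depth[dst]
    return counts


def bme_score(adj, leaves, d):
    """Pauplin's formula: each unordered pair {i, j} contributes 2^(1 - p_ij) * d_ij."""
    p = path_edge_counts(adj, leaves)
    return sum(2.0 ** (1 - p[i, j]) * d[i][j] for i, j in combinations(leaves, 2))


# Hypothetical quartet tree ((A,B),(C,D)) with internal nodes u and v.
edges = [("A", "u"), ("B", "u"), ("u", "v"), ("C", "v"), ("D", "v")]
adj = {}
for a, b in edges:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)
leaves = ["A", "B", "C", "D"]

# Additive distances for branch lengths A-u = 1, B-u = 2, u-v = 3, C-v = 4, D-v = 5.
d = {"A": {"B": 3, "C": 8, "D": 9}, "B": {"C": 9, "D": 10}, "C": {"D": 9}}

print(bme_score(adj, leaves, d))  # 15.0, the sum of the five branch lengths
```

When the distances are exactly additive on the tree, as in this example, the formula returns the true total tree length.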

Balanced Minimum Evolution: Neighbor Joining revealed! [Gascuel & Steel 2006 MBE 23]
Until recently it was unclear whether NJ implicitly aimed at optimising some criterion: "NJ has some relation to unweighted least squares and some to minimum evolution, without being definable as an approximate algorithm for either" [Felsenstein's textbook]
Recently it was shown that NJ can be seen as a greedy algorithm that aims to minimise the BME score. [Desper & Gascuel 2005 (in MEP book)]

Balanced Minimum Evolution Since NJ tries to (but usually does not) minimise the BME criterion, what about better algorithms for this? Desper and Gascuel’s program FASTME implements: (1) A sequential addition strategy (which I will call Sadd). (2) A hill-climbing search where NNIs are the possible moves (BNNI).

Balanced Minimum Evolution
The results are very good (2 datasets of 2000 simulated 24-taxon distance matrices each, replicated from Desper and Gascuel 2002 J. Comp. Biol.):

Method        | d_RF(T, true T)   freq. T opt.   | d_RF(T, true T)   freq. T opt.
NJ            | …%                61.0%          | …%                61.0%
BIONJ         | …%                44.6%          | …%                48.7%
Sadd          | …%                36.0%          | …%                35.5%
NJ+BNNI       | …%                97.9%          | …%                98.1%
BIONJ+BNNI    | …%                98.0%          | …%                97.9%
Sadd+BNNI     | …%                97.7%          | …%                97.8%
BBBME         | …%                100%           | …%                100%

Other papers [e.g. Vinh & von Haeseler 2005 BMC Bio] also confirm that X + BNNI outperforms most (all?) existing distance methods.

Balanced Minimum Evolution
BNNI performs very well, but it may get stuck in local minima... and constructing low-BME trees is good!
What about an exact algorithm for this problem? Branch and Bound: explore the "meta-tree", in which the trees below a partial tree T are those obtained from it by adding further taxa.
Every time you enter a new node T, you assess whether you should go back or continue, based on a lower bound LB on the score of the trees below it: for every T* below T, $\Lambda(T^*) \ge LB$.
If LB > current best score, then no optimal tree is below T, so go back.

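Balanced Minimum Evolution: A B&B approach to find BME trees: the bound.
If the score can only increase along each root-to-leaf path of the meta-tree, then the score of the current tree T is a lower bound: for every T* below T, $\Lambda(T^*) \ge \Lambda(T) = LB$.
Parsimony has this property but BME doesn't, unless we assume the triangle inequality... Why?
$\Lambda(T) = \operatorname{avg}_o \Lambda_o = \operatorname{avg}_o \tfrac{1}{2} \sum_i d_{o(i),o(i+1)}$
Inserting a new taxon $k$ between consecutive taxa $i$ and $j$ of an ordering $o$ gives
$\Lambda'_o - \Lambda_o = \tfrac{1}{2} \left( d_{ik} + d_{kj} - d_{ij} \right) \ge 0$,
so that
$\Lambda(T \cup k) - \Lambda(T) = \operatorname{avg}_o \left( \Lambda'_o - \Lambda_o \right) \ge 0$.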
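The per-ordering step of this argument can be checked numerically. The sketch below is only an illustration: the function name and the 4-taxon distance matrix are made up for this example, and the matrix happens to satisfy the triangle inequality.

```python
def circular_length(order, d):
    """Lambda_o: half the sum of distances between cyclically consecutive taxa."""
    n = len(order)
    return 0.5 * sum(d[order[i]][order[(i + 1) % n]] for i in range(n))


# Hypothetical symmetric distance matrix on taxa 0..3.
d = [
    [0.0, 0.3, 0.5, 0.6],
    [0.3, 0.0, 0.4, 0.7],
    [0.5, 0.4, 0.0, 0.2],
    [0.6, 0.7, 0.2, 0.0],
]

before = circular_length([0, 1, 2], d)    # ordering o of the current tree
after = circular_length([0, 3, 1, 2], d)  # taxon k = 3 inserted between i = 0 and j = 1
change = 0.5 * (d[0][3] + d[3][1] - d[0][1])

# The two quantities agree up to floating-point rounding.
print(after - before, change)
```

Only the pair (i, j) that is split by k changes its contribution, which is why the difference reduces to the single term ½ (d_ik + d_kj − d_ij).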

Balanced Minimum Evolution: A B&B approach to find BME trees: the bound.
Taking that idea further, we can drop the triangle inequality assumption and have that
$\Lambda(T \cup k) - \Lambda(T) \ge \tfrac{1}{2} \beta_k$, where $\beta_k = \min \{ d_{ik} + d_{jk} - d_{ij} : i, j \text{ added before } k \}$.
So for every T* below T:
$\Lambda(T^*) \ge \Lambda(T) + \tfrac{1}{2} \sum_{k \notin T} \beta_k = LB$.
Which is good because:
1) The triangle inequality often does not hold.
2) The $\sum \beta_k$ above is usually positive, so this is a better bound than simply requiring an increase, $\Lambda(T^*) \ge \Lambda(T)$.
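To make the bound concrete, here is a small Python sketch of a branch and bound search that uses it. This is not the BBBME program: all names are invented for the example, taxa are assumed to be added in the fixed order 0, 1, 2, ..., the score of each partial tree is recomputed naively from Pauplin's formula, and no heuristic tree is used to initialise the best score.

```python
from collections import deque
from itertools import combinations


def bme_score(edges, taxa, d):
    """Pauplin's formula computed naively: sum over pairs of 2^(1 - p_ij) * d_ij."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    total = 0.0
    for i in taxa:
        depth = {i: 0}
        queue = deque([i])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in depth:
                    depth[v] = depth[u] + 1
                    queue.append(v)
        total += sum(2.0 ** (1 - depth[j]) * d[i][j] for j in taxa if j > i)
    return total


def insert_taxon(edges, edge, k):
    """Attach taxon k to the middle of `edge` through a new internal node."""
    u, v = edge
    w = ("node", k)
    return [e for e in edges if e != edge] + [(u, w), (w, v), (w, k)]


def branch_and_bound(d):
    n = len(d)
    # beta[k] = min over pairs i < j added before k of d_ik + d_jk - d_ij
    beta = [0.0] * n
    for k in range(3, n):
        beta[k] = min(d[i][k] + d[j][k] - d[i][j]
                      for i, j in combinations(range(k), 2))
    # tail[k] = 1/2 * sum of beta[m] over the taxa m >= k still to be added
    tail = [0.0] * (n + 1)
    for k in range(n - 1, 2, -1):
        tail[k] = tail[k + 1] + 0.5 * beta[k]
    best = {"score": float("inf"), "edges": None}

    def explore(edges, k):
        score = bme_score(edges, range(k), d)
        if k == n:  # complete tree: keep it if it improves on the best so far
            if score < best["score"]:
                best["score"], best["edges"] = score, edges
            return
        if score + tail[k] > best["score"]:  # LB > current best: go back
            return
        for edge in edges:  # children in the meta-tree: add taxon k on each branch
            explore(insert_taxon(edges, edge, k), k + 1)

    centre = ("node", 2)
    explore([(0, centre), (1, centre), (2, centre)], 3)
    return best["score"], best["edges"]


# Hypothetical 5-taxon distance matrix.
D = [
    [0.0, 0.3, 0.5, 0.6, 0.8],
    [0.3, 0.0, 0.4, 0.7, 0.9],
    [0.5, 0.4, 0.0, 0.2, 0.6],
    [0.6, 0.7, 0.2, 0.0, 0.5],
    [0.8, 0.9, 0.6, 0.5, 0.0],
]
best_score, best_edges = branch_and_bound(D)
print(best_score, len(best_edges))
```

A practical implementation would start from a good heuristic tree (for example one produced by BNNI) so that pruning begins immediately, and would update the score incrementally instead of recomputing it for every partial tree.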

Balanced Minimum Evolution: A B&B approach to find BME trees: results and conclusions.
I implemented the algorithm in a program called BBBME. This allows us to see how far the heuristics in FASTME are from the optimum.

              | Dataset 'small'                  | Dataset 'moderate'
Method        | d_RF(T, true T)   freq. T opt.   | d_RF(T, true T)   freq. T opt.
NJ            | …%                61.0%          | …%                61.0%
BIONJ         | …%                44.6%          | …%                48.7%
Sadd          | …%                36.0%          | …%                35.5%
NJ+BNNI       | …%                97.9%          | …%                98.1%
BIONJ+BNNI    | …%                98.0%          | …%                97.9%
Sadd+BNNI     | …%                97.7%          | …%                97.8%
BBBME         | …%                100%           | …%                100%

FASTME's heuristics are very good... The suboptimal trees produced by BNNI seem as good as the optimal trees.
Will these results also hold for larger distance matrices (≥ 24 taxa)? Unfortunately, experimenting with larger distance matrices is hard.

Thanks: Mike Hendy Barbara Holland Nick Goldman David Penny Mike Steel Rick Desper Olivier Gascuel

Running time on 24-taxon distance matrices: each run typically takes only a few seconds (on 2.80 GHz CPUs with 1.5 GB RAM). But the running time still increases exponentially with the number of taxa: the B&B approach seems applicable up to ~40 taxa…
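For context on why the search space grows so quickly, the number of unrooted binary trees on n labelled taxa is (2n − 5)!! = 3 · 5 · ... · (2n − 5). This is a standard counting result, not a figure from the talk, and the function name below is made up for the example.

```python
def num_unrooted_binary_trees(n):
    """(2n - 5)!! = 3 * 5 * ... * (2n - 5), for n >= 3 labelled taxa."""
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


for n in (5, 6, 24, 40):
    print(n, num_unrooted_binary_trees(n))
```

These are the complete trees at the bottom of the meta-tree; the bound matters because it lets the search skip all but a small part of them.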

Balanced Minimum Evolution: A Branch and Bound approach to find BME trees: computational aspects.
[Figure: a tree T with k leaves and a tree T' with k+1 leaves below it in the meta-tree; cost labels O(k^2) and O(k^3) for the naïve approach, O(1) and O(k) for the incremental one.]
If we are naïve, calculating the BME score $\Lambda(T')$ will take $O(k^2)$. However, one can use $\Lambda(T)$, and it turns out that $\Lambda(T')$ can then be calculated in $O(1)$:
$\Lambda(T') = \Lambda(T) + f(\Delta_T)$,
where $\Delta_T$ is a data structure, of $O(k^2)$ size, that needs to be updated for each new $T$. This takes $O(k \operatorname{diam} T) = O(k \log k)$. [Desper and Gascuel 2002 J. Comp. Biol.]