20 years and 22 papers with Bernard Moret

Tandy Warnow The University of Illinois at Urbana-Champaign

Brief history
We met in 1992, when I was working in the Discrete Algorithms Group at Sandia National Labs. I moved to the University of Pennsylvania in 1993, then to the University of Texas at Austin.
No papers yet… we’re just friends.

In 1999, Bob Jansen needed help with 13 Campanulaceae genomes

22 papers with Bernard
Genome rearrangement phylogeny estimation
Phylogenetic network estimation
Comparing phylogenetic methods on large datasets
Absolute fast converging methods

Highlights (this talk)
Genome rearrangement phylogeny estimation
Phylogenetic network estimation
Comparing phylogenetic methods on large datasets
Absolute fast converging methods

Highlight #1 Genome rearrangement phylogeny estimation
Too many papers to list

In 1999, Bob Jansen needed help with 13 Campanulaceae genomes

Genome Rearrangement Phylogeny
[Figure: gene-order rearrangement diagram]
Sturtevant and Dobzhansky, late 1920s or early 1930s

Breakpoint Phylogeny
Proposed by Sankoff and Blanchette in J. Comp. Biol. 1998
Input: Chromosomes given as signed gene orders, one copy of each gene in each chromosome
Output: Tree with the minimum number of breakpoints
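
To make "number of breakpoints" concrete, here is a minimal sketch of a breakpoint count between two signed gene orders. It assumes linear chromosomes without end caps, and the function names are illustrative rather than taken from BPAnalysis. An adjacency (a, b) counts as conserved if the other genome contains a followed by b, or -b followed by -a.

```python
def adjacencies(genome):
    """Return all adjacencies of a signed linear gene order, in both reading directions."""
    adj = set()
    for a, b in zip(genome, genome[1:]):
        adj.add((a, b))
        adj.add((-b, -a))  # the same adjacency read from the opposite strand
    return adj


def breakpoint_distance(g1, g2):
    """Count adjacencies of g1 that are not conserved in g2."""
    conserved = adjacencies(g2)
    return sum(1 for a, b in zip(g1, g1[1:]) if (a, b) not in conserved)


if __name__ == "__main__":
    original = [1, 2, 3, 4, 5]
    inverted = [1, -3, -2, 4, 5]  # genes 2 and 3 inverted
    print(breakpoint_distance(original, inverted))
```

A single inversion disturbs at most two adjacencies, which is why the breakpoint count tends to underestimate the number of events once rearrangements start to overlap.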

BPAnalysis: Heuristic for the Breakpoint Phylogeny
Sankoff and Blanchette, 1998

Sankoff and Blanchette, 1998
(Its’ik Pe’er for NP-hardness, or Caprara, Sankoff and Blanchette for TSP)
Finding the breakpoint median of three genomes is NP-hard (Pe’er and Shamir 1998), but can be solved using TSP (Travelling Salesman Problem) solvers (Blanchette and Sankoff 1997).

Sankoff and Blanchette, 1998 Bernard and I estimated that BPAnalysis would take ~200 CPU years to complete on Bob’s dataset.

MPBE Maximum Parsimony on Binary Encoding
Character for every possible adjacency (is oriented gene x followed by oriented gene y?)
Mary Cosner PhD dissertation 1993
Fast because maximum parsimony heuristics are relatively efficient
Can find infeasible solutions
MPBE on Bob’s dataset suggested transpositions, which was highly surprising.
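
The binary encoding can be sketched as follows. This is an illustration under stated assumptions, not Cosner's implementation: genomes are treated as linear signed gene orders, only adjacencies observed in at least one genome become characters (an adjacency seen nowhere would be an all-zero column), and the names `canonical` and `mpbe_encode` are hypothetical.

```python
def canonical(a, b):
    """(a, b) and (-b, -a) describe the same adjacency; pick one representative."""
    return min((a, b), (-b, -a))


def mpbe_encode(genomes):
    """Map each named signed gene order to a 0/1 vector over observed adjacencies."""
    per_genome = {
        name: {canonical(a, b) for a, b in zip(order, order[1:])}
        for name, order in genomes.items()
    }
    characters = sorted(set().union(*per_genome.values()))
    matrix = {
        name: [1 if c in present else 0 for c in characters]
        for name, present in per_genome.items()
    }
    return characters, matrix


if __name__ == "__main__":
    chars, rows = mpbe_encode({"A": [1, 2, 3], "B": [1, -2, 3]})
    print(chars)
    for name, row in rows.items():
        print(name, row)
```

The resulting 0/1 matrix can be handed to any maximum parsimony heuristic. Nothing forces an inferred ancestral 0/1 vector to correspond to a real gene order, which is how infeasible solutions arise.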

Neighbor Joining on Breakpoint Distances
Find out who’s the person who did this

Phylogeny reconstruction in 1999
Distance-based: Breakpoint (BP) distances [Blanchette, Kunisawa, Sankoff 1999]: fast but high error
Breakpoint tree (NP-hard, even for three taxa): BPAnalysis [Sankoff & Blanchette 1998]: exhaustive search through treespace to find the minimum breakpoint length (the number of breakpoints on the tree): too slow
MPBE [Cosner 1993]: maximum parsimony on binary encoding: can find infeasible ancestors
These were the earlier attempts to build phylogenies from gene order data. Distance-based: NJ(BP). MP: the programs try to solve the problem exactly, but either of them can generally handle only up to 15 taxa, and both have problems when branch lengths are long.

The challenges!
Find all the best genome trees for Bob’s dataset, and determine if inversions suffice (or if we really do need transpositions).
Design statistically rigorous methods for genome rearrangement phylogeny.
Design efficient techniques to enable genome-scale phylogeny for large datasets that are difficult to analyze.

Genomes Evolve by Rearrangements
[Figure: signed gene orders before and after each event type]
Inversion (Reversal), Transposition, Inverted Transposition

From Moret and Warnow, Methods in Enzymology 2005
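
The three event types can be illustrated on a signed gene order stored as a Python list. This is a sketch only: indices are 0-based slice positions, the insertion point `k` indexes into the gene order after the segment has been removed, and the function names are illustrative.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of every gene in it."""
    return genome[:i] + [-x for x in reversed(genome[i:j])] + genome[j:]


def transposition(genome, i, j, k):
    """Cut out genome[i:j] and reinsert it, unchanged, at position k of the remainder."""
    segment, rest = genome[i:j], genome[:i] + genome[j:]
    return rest[:k] + segment + rest[k:]


def inverted_transposition(genome, i, j, k):
    """Cut out genome[i:j], invert it, and reinsert it at position k of the remainder."""
    segment = [-x for x in reversed(genome[i:j])]
    rest = genome[:i] + genome[j:]
    return rest[:k] + segment + rest[k:]


if __name__ == "__main__":
    g = [1, 2, 3, 4, 5]
    print(inversion(g, 1, 3))
    print(transposition(g, 1, 3, 3))
    print(inverted_transposition(g, 1, 3, 3))
```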

Generalized Nadeau-Taylor (GNT) Model
Proposed in Wang and Warnow, STOC 2001.
Each type of event (inversion, transposition, and inverted transposition) has a probability of occurring, specified by GNT(a,b,c) with a+b+c=1.
All events of the same type are equiprobable.
The tree has branch lengths indicating the expected number of events on each branch.
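
A rough simulation of one branch under GNT(a,b,c) might look like the sketch below. Several simplifications are assumptions of this sketch, not part of the model as published: the number of events is passed in directly (the model only fixes its expectation), segment endpoints and insertion points are drawn uniformly as a stand-in for "equiprobable", a transposition may occasionally reinsert a segment where it was, and the names `apply_event` and `evolve_gnt` are hypothetical.

```python
import random

EVENT_TYPES = ["inversion", "transposition", "inverted_transposition"]


def apply_event(genome, kind, rng):
    """Apply one random event of the given kind to a signed gene order."""
    i, j = sorted(rng.sample(range(len(genome) + 1), 2))  # non-empty segment genome[i:j]
    segment, rest = genome[i:j], genome[:i] + genome[j:]
    if kind == "inversion":
        return genome[:i] + [-x for x in reversed(segment)] + genome[j:]
    if kind == "inverted_transposition":
        segment = [-x for x in reversed(segment)]
    k = rng.randrange(len(rest) + 1)  # insertion point in the remainder
    return rest[:k] + segment + rest[k:]


def evolve_gnt(genome, num_events, a, b, c, seed=0):
    """Apply num_events events whose types are drawn with probabilities (a, b, c)."""
    assert abs(a + b + c - 1.0) < 1e-9, "GNT parameters must sum to 1"
    rng = random.Random(seed)
    genome = list(genome)
    for kind in rng.choices(EVENT_TYPES, weights=[a, b, c], k=num_events):
        genome = apply_event(genome, kind, rng)
    return genome


if __name__ == "__main__":
    print(evolve_gnt(list(range(1, 11)), 5, 0.5, 0.25, 0.25, seed=1))
```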

Distance-based methods
Breakpoint distances (Blanchette, Bourque, and Sankoff 1997)
Inversion distances (Bader, Moret, and Yan 2001)
EDE (Empirically-Derived Estimator of true evolutionary distance), Moret et al., ISMB 2001: derived from inversion-only model
IEBP (Wang and Warnow STOC 2001): estimates true evolutionary distance but needs to know or estimate the GNT parameters.

Figure 3 from Moret and Warnow, 2004

All these methods are polynomial time.
40 taxa, 120 genes; Inv.:Transp.:InvTransp = 2:1:1; birth-death trees, expected deviation from ultrametricity = 2
[Figure: error rate of NJ(BP), NJ(INV), and NJ(EDE) vs. amount of evolution]
BP = breakpoint distance; INV = inversion distance; EDE: statistically-based estimator [Wang et al. ‘01], highly robust.
Here is the simulation result for neighbor joining using breakpoint distance. The model tree has 40 gene orders; each gene order consists of 120 genes, the typical number of genes in plant chloroplasts. The y-axis is the error of the NJ tree, i.e. the false negative rate; the x-axis is the maximum pairwise inversion distance of the 40 taxa, normalized so the value is between 0 and 1. This is the diameter of the data, and it is also a good indicator of the amount of evolution in the dataset. The higher it is, the greater the amount of evolution, and usually the more difficult it is to reconstruct the tree. We notice that for NJ(BP), the error exceeds 10% when the diameter exceeds 0.7. If we use NJ(INV), the error is improved; but when the diameter is above 0.8, the error is above 10%. If we switch to a new distance called EDE, which I’ll discuss later, we see the accuracy is further improved. Compared with NJ(BP), the error is essentially halved. Now let me show you how EDE is derived, as well as other methods that yield more accurate trees.

Benchmark gene order dataset: Campanulaceae
12 genomes + 1 outgroup (Tobacco), 105 gene segments
NP-hard optimization problems: breakpoint and inversion phylogenies (techniques score every tree)
Joint work with Bob Jansen, Linda Raubeson, Jijun Tang, and Li-San Wang
1997: BPAnalysis (Blanchette and Sankoff): 200 years (est.)
2000: Using GRAPPA v1.1 on the 512-processor Los Lobos Supercluster machine: 2 minutes (200,000-fold speedup per processor)
2003: Using latest version of GRAPPA: 2 minutes on a single processor (1-billion-fold speedup per processor)

Moret et al. breakpoint phylogeny approach: 2,000,000-fold speedup over BPAnalysis as serial codes (parallelism brings it higher) From Moret, Tang, and Warnow, 2004

Bounding
Upper bound on the best score: score the NJ(EDE) tree using an improved implementation of BPAnalysis.
Lower bound on a given tree T using the circular ordering on leaves: a greedy technique finds the planar embedding that achieves the highest breakpoint score for its circular ordering; half of that score is a lower bound on T’s breakpoint score.
By the way, Bernard and I had a big argument about using this lower bound… he didn’t think it would work, but it did!
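
The lower bound rests on a simple fact: walking around the leaves of T in the circular order of a planar embedding traverses every edge of T twice, so by the triangle inequality the sum of distances between circularly adjacent leaves is at most twice the tree length. The sketch below evaluates that bound for one given leaf ordering. It does not implement the greedy search over embeddings, it assumes linear signed gene orders without end caps, and the function names are illustrative rather than GRAPPA's.

```python
def breakpoint_distance(g1, g2):
    """Count adjacencies of g1 that are not conserved in g2 (signed, linear, no caps)."""
    conserved = set()
    for a, b in zip(g2, g2[1:]):
        conserved.add((a, b))
        conserved.add((-b, -a))
    return sum(1 for a, b in zip(g1, g1[1:]) if (a, b) not in conserved)


def circular_lower_bound(leaves_in_circular_order, dist=breakpoint_distance):
    """Half the sum of distances between circularly adjacent leaves."""
    n = len(leaves_in_circular_order)
    total = sum(
        dist(leaves_in_circular_order[i], leaves_in_circular_order[(i + 1) % n])
        for i in range(n)
    )
    return total / 2


if __name__ == "__main__":
    leaves = [[1, 2, 3, 4, 5], [1, -3, -2, 4, 5], [1, 2, 3, 4, 5]]
    print(circular_lower_bound(leaves))
```

A tree whose lower bound already exceeds the current upper bound can be discarded without ever being scored, which is where the savings come from.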

GRAPPA
Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms
Heuristics for NP-hard optimization problems
Uses high-level algorithmic ideas with low-level algorithm engineering to dramatically speed up the searches for the breakpoint and inversion phylogenies.
Project leader: Bernard Moret
Now I’d like to talk about some applications to real data. GRAPPA is software developed by my collaborators Bernard Moret and David Bader and their students at U. New Mexico. The software is an improved version of BPAnalysis, originally developed by Sankoff et al. It uses a parsimony-based approach, which means it is very slow. The aim is to find a tree topology that explains the dataset with the minimum sum of distances on the tree. The sum of distances on the tree is also called the tree length; I will explain it more in the next slide. There are two levels of difficulty: given a tree topology, finding the minimal way of embedding the dataset is NP-hard; and exploring all possible tree candidates is NP-hard, since the number of candidates is exponential in the number of taxa. To speed up this process we use a heuristic called the circular lower bound technique that is simple yet very useful.

In 1999, Bob needed help with 13 Campanulaceae genomes

The challenges!
Find all the best genome trees for Bob’s dataset, and determine if inversions suffice (or if we really do need transpositions).
Design statistically rigorous methods for genome rearrangement phylogeny.
Design efficient techniques to enable genome-scale phylogeny for large datasets that are difficult to analyze.

What we found
Optimal solutions for the breakpoint phylogeny, for the inversion-only phylogeny, and for a weighted sum of inversions and transpositions.
67 inversions suffice for this dataset, and no transpositions are needed!

This was just the beginning…
Other events, such as
Duplications, Insertions, and Deletions
Fissions and Fusions
Other models
HP (Hannenhalli and Pevzner)
Double Cut-and-Join (Yancopoulos et al.)
Other techniques
DCM-boosting to scale GRAPPA to large datasets (1000 species)
Estimating true evolutionary distances under these complex models
Inferring ancestral genomes under these complex models
Bernard has made progress on all these things!! (But without me, so someone else can talk about this…)

Highlight #2 Absolute fast converging methods
Warnow, Moret and St. John, SODA 2001 Nakhleh, Moret, etc. PSB 2002 Moret, Roshan, and Warnow, WABI 2002 Moret, Wang, and Warnow 2002 (IEEE Computer)

Markov Model of Site Evolution
Simplest (Jukes-Cantor, 1969):
The model tree T is binary and has substitution probabilities p(e) on each edge e.
The state at the root is randomly drawn from {A,C,T,G} (nucleotides).
If a site (position) changes on an edge, it changes with equal probability to each of the remaining states.
The evolutionary process is Markovian.
The different sites are assumed to evolve independently and identically down the tree (with rates that are drawn from a gamma distribution).
More complex models (such as the General Markov model) are also considered, often with little change to the theory.
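
The process along a single edge can be sketched as below. This is a minimal illustration assuming every site shares the same substitution probability p(e), so the gamma-distributed rate variation is left out; the function names are hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"


def random_root_sequence(length, rng):
    """Draw each root site uniformly at random from the four nucleotides."""
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))


def evolve_along_edge(sequence, p, rng):
    """Each site changes independently with probability p to one of the other three states."""
    child = []
    for state in sequence:
        if rng.random() < p:
            child.append(rng.choice([s for s in NUCLEOTIDES if s != state]))
        else:
            child.append(state)
    return "".join(child)


if __name__ == "__main__":
    rng = random.Random(0)
    root = random_root_sequence(20, rng)
    print(root)
    print(evolve_along_edge(root, 0.1, rng))
```

Simulating a whole model tree amounts to applying `evolve_along_edge` from the root down to every leaf, one edge at a time.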

Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: true tree and estimated tree with FN and FP edges marked; 50% error rate]
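
The two error rates can be computed by comparing the internal edges of the true tree and the estimated tree. The sketch assumes each internal edge is represented by the set of leaves on one side of it, chosen consistently in both trees (for example, the side that does not contain a fixed reference leaf); the function name is illustrative.

```python
def error_rates(true_edges, estimated_edges):
    """Return (FN rate, FP rate) for two non-empty sets of edges given as leaf sets."""
    false_negatives = true_edges - estimated_edges   # true edges missing from the estimate
    false_positives = estimated_edges - true_edges   # estimated edges not in the true tree
    return (
        len(false_negatives) / len(true_edges),
        len(false_positives) / len(estimated_edges),
    )


if __name__ == "__main__":
    true_tree = {frozenset("AB"), frozenset("ABC")}
    estimate = {frozenset("AB"), frozenset("ABD")}
    print(error_rates(true_tree, estimate))
```

When both trees are fully resolved they have the same number of internal edges, so the two rates coincide.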

Statistical consistency
[Figure: error vs. amount of data]

Distance-based estimation

Neighbor Joining (NJ) on large trees
Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. [Nakhleh et al. ISMB 2001]
[Figure: NJ error rate vs. number of taxa]

In other words…
[Figure: error vs. amount of data]
Statistical consistency doesn’t guarantee accuracy w.h.p. unless the sequences are long enough.

Sequence length requirements
The sequence length (number of sites) that a phylogeny reconstruction method M needs to reconstruct the true tree with probability at least $1-\varepsilon$ depends on M (the method), $\varepsilon$, f = min p(e), g = max p(e), and n, the number of leaves.
We fix everything but n.

Neighbor Joining’s sequence length requirement is exponential!
Atteson 1999: Let T be a Jukes-Cantor model tree defining additive matrix D. Then Neighbor Joining will reconstruct the true tree with high probability from sequences that are of length $O(\lg n \cdot e^{\max D_{ij}})$.
Lacey and Chang 2009: Matching lower bound

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

“Boosting” phylogeny reconstruction methods
DCMs “boost” the performance of phylogeny reconstruction methods.
[Diagram: DCM applied to base method M yields DCM-M]

Distance-based estimation

Divide-and-conquer for phylogeny estimation
Construct subset trees, Supertree Step, Refinement Step

DCM1 Decompositions
Input: Set S of sequences, distance matrix d, threshold value
1. Compute threshold graph
2. Perform minimum weight triangulation (note: if d is an additive matrix, then the threshold graph is provably triangulated).
DCM1 decomposition: Compute maximal cliques
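
The decomposition can be sketched as follows, under clearly stated assumptions. The threshold is called `q` here (a name chosen for this sketch), an edge joins two taxa when their distance is at most `q`, and the triangulation step is skipped, which is only justified when d is additive. Maximal cliques are enumerated with a plain Bron-Kerbosch search, a general-purpose choice rather than the method used in the DCM1 papers.

```python
from itertools import combinations


def threshold_graph(taxa, d, q):
    """Connect two taxa whenever their distance is at most the threshold q."""
    adj = {t: set() for t in taxa}
    for u, v in combinations(taxa, 2):
        if d[u][v] <= q:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion (no pivoting)."""
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


if __name__ == "__main__":
    taxa = list("ABCD")
    # Additive distances from a path A - B - C - D with unit edge lengths.
    d = {u: {v: abs(i - j) for j, v in enumerate(taxa)} for i, u in enumerate(taxa)}
    for subset in maximal_cliques(threshold_graph(taxa, d, q=2)):
        print(sorted(subset))
```

Each maximal clique is one overlapping subset of taxa; the base method builds a tree on each subset, and the overlaps are what allow the subset trees to be merged afterwards.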

DCM1-boosting: Warnow, St. John, and Moret, SODA 2001
[Diagram: exponentially converging (base) method, passed through DCM1 and then SQS, gives an absolute fast converging (DCM1-boosted) method]
The DCM1 phase produces a collection of trees (one for each threshold), and the SQS phase picks the “best” tree.
For a given threshold, the base method is used to construct trees on small subsets (defined by the threshold) of the taxa. These small trees are then combined into a tree on the full set of taxa.

Neighbor Joining on large diameter trees
Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. [Nakhleh et al. ISMB 2001]
[Figure: NJ error rate vs. number of taxa]

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001]
Theorem (Warnow, Moret, and St. John, SODA 2001): DCM1-NJ converges to the true tree from polynomial length sequences
[Figure: error rate of NJ and DCM1-NJ vs. number of taxa]

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001]
Theorem (Warnow, Moret, and St. John, SODA 2001): DCM1-boosting: reducing sequence length requirements for gene tree accuracy w.h.p. from exponential to polynomial
[Figure: error rate of NJ and DCM1-NJ vs. number of taxa]

Highlight #3

Building the Tree of Life: A National Resource for Phyloinformatics and Computational Phylogenetics
NSF Large ITR grant ($11.6 Million), Directors: Bernard Moret ( ) and Tandy Warnow ( ) Other key personnel: Mark Holder, Junhyong Kim, Wayne Maddison, Mark Miller, Brent Mishler, Satish Rao, David Swofford, and Val Tannen.

CIPRES: graduate students and postdocs
Francois Barbancon, Nicholas Bray, Kevin Chen, Shirley Cohen, Costis Daskalakis, Nick Eriksson, Yu Fan, Kirsen Fisher, Ganesh Ganapathy, Sheng Guo, Tracy Heath, Cameron Hill, David Kysela, Ruth Kirkpatrick, Henry Lin, Kevin Lin, Wenguo Lin, Andrew McGregor, Frank Mannino, Rui Mao, Radu Mihaescu, Eric Miller, Luay Nakhleh, Manikandan Narayanan, Serita Nelesen, Smriti Ramakrishinan, Samantha Riesenfeld, Sebastien Roch, Usman Roshan, Ariel Schwartz, Stephen Smith, Errol Strain, Jeet Sukumaran, Shel Swenson, Kunal Talwar, Andres Varon, Rutger Vos, Yifeng Zheng, and Derrick Zwickl
Michael Alfaro, Mark Holder, Peter Midford, Sagi Snir, Shel Swenson, Rutger Vos
And many others who were funded by other grants
225 papers and dissertations were supported by CIPRES, many by CIPRES students and postdocs

The CIPRES Gateway
“SDSC Supercomputers, CIPRES Gateway Help Define New ‘Tree of Life’”, SDSC Press Release, April 25, 2016 (Figure courtesy of Laura Hug, Jill Banfield, and Nature Microbiology)

Bernard isn’t really retiring
He’s just going to be working from a warmer place… We’ll keep him busy with collaborations via internet and phone… And if we have to, we’ll go visit him in his new home.

Bon Voyage, Bernard!
