New methods for simultaneous estimation of trees and alignments

Tandy Warnow
The University of Texas at Austin
Joint work with K. Liu, S. Raghavan, S. Nelesen, and C.R. Linder

Phylogeny (evolutionary tree)
[Figure: phylogeny of Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona]

Evolutionary History Helps Us
predict gene function
develop drugs and vaccines
understand disease spread
understand human origins
Phylogenetics: estimating evolutionary histories
All organisms on earth are related through their evolutionary history. Understanding these relationships is critical to answering many biological and biomedical questions. For example, evolutionary histories can be used to develop effective vaccines, and to understand the spread of a disease through a population.
The field responsible for estimating these evolutionary histories is phylogenetics, and the histories themselves are called phylogenies or phylogenetic trees. One of the big goals in phylogenetics is to construct the tree of life, a phylogenetic tree that relates all organisms on earth.
Supertree methods are one tool for reconstructing very large phylogenies, and will be critical in progressing towards constructing the tree of life. My focus has been to develop supertree methods that can help us get closer to this goal. Before we see how supertree methods can be used to construct large-scale phylogenies, let’s look at a small example of a phylogeny.
[Figure: Tree of Life. From AToL website]

Many, Many Trees
Number of unrooted binary trees on n species: (2n-5)!! = (2n-5)! / (2^(n-3) (n-3)!)

# of Species    # of Unrooted Trees
4               3
5               15
6               105
7               945
8               10,395
9               135,135
10              2,027,025
20              2.2 x 10^20
100             1.7 x 10^182
1000            1.9 x 10^2860
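As a rough illustration of how quickly these counts grow, the following Python sketch evaluates the double factorial (2n-5)!! directly. The function name num_unrooted_trees is ours, not from the talk, and the count assumes unrooted binary trees on n labeled species.

```python
def num_unrooted_trees(n: int) -> int:
    """Number of unrooted binary trees on n labeled leaves: (2n-5)!! for n >= 3."""
    if n < 3:
        raise ValueError("need at least 3 species")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n-5 (empty product when n == 3).
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20, 100, 1000):
    trees = num_unrooted_trees(n)
    # Python integers are exact, so we report the number of decimal digits.
    print(n, "species:", len(str(trees)), "digits")
```

Because the count is a product of roughly n odd numbers, it grows faster than exponentially, which is why exhaustive search over tree space is out of reach even for modest n.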

Computational phylogenetics
Heuristics for NP-hard problems
Statistical aspects of estimation methods under Markov models of evolution
Detection and reconstruction of reticulate evolution
Comparative genomics
Datamining sets of trees and alignments
Visualization of large trees and alignments
Supertree methods
Simultaneous estimation of trees and alignments
(plus similar issues for historical linguistics)

This talk
SATé (Simultaneous Alignment and Tree Estimation), Science 2009, Liu et al.
Standard methods for estimating sequence alignments and phylogenetic trees have unacceptably high error
SATé produces high accuracy with 24 hour analyses on datasets with up to 1000 taxa
New methods (unpublished), both more accurate and faster than SATé
  SATé-II
  BLuTGEN (only produces a tree)
Potential for major impact for large-scale phylogeny estimation

DNA Sequence Evolution
[Figure: animation of DNA sequences evolving along the branches of a tree. Sequences shown: AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, AGCGCTT, AGCACAA, TAGACTT, TAGCCCA]

[Figure: rooted and unrooted trees on the taxa U, V, W, X, Y, with leaf sequences AGGGCAT, TAGCCCA, TAGACTT, TGCACAA, TGCGCTT]
An unrooted tree is equivalent to a rooted tree with the branch lengths and the root node omitted. Rooting is a separate problem, which is not discussed in this talk.

Deletion Mutation
…ACGGTGCAGTTACCA…
…ACCAGTCACCA…

The true multiple alignment
…ACGGTGCAGTTACCA…
…AC----CAGTCACCA…
Reflects historical substitution, insertion, and deletion events
Defined using transitive closure of pairwise alignments computed on edges of the true tree
Homology = nucleotides lined up since they come from a common ancestor. Indel = dash.
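To make "nucleotides lined up" concrete, here is a minimal Python sketch that reads the homologous position pairs off a pairwise alignment, using the two sequences from this slide. The function name homologous_pairs and the 0-based position convention are our own choices for illustration.

```python
def homologous_pairs(row_a: str, row_b: str, gap: str = "-") -> list[tuple[int, int]]:
    """Return (i, j) pairs: position i of sequence A is aligned to position j of B."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    pairs = []
    i = j = 0  # 0-based positions in the ungapped sequences
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.append((i, j))  # both rows hold a nucleotide in this column
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


ancestor = "ACGGTGCAGTTACCA"
descendant = "AC----CAGTCACCA"
pairs = homologous_pairs(ancestor, descendant)

seq_a = ancestor.replace("-", "")
seq_b = descendant.replace("-", "")
# Aligned positions whose nucleotides differ are substitutions.
substitutions = [(i, j) for i, j in pairs if seq_a[i] != seq_b[j]]
print(len(pairs), "homologous pairs; substitutions at", substitutions)
```

The four gap columns contribute no pairs (the deletion), and the single mismatching pair is the mutation marked on the slide.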

Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Phase 1: Multiple Sequence Alignment
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

Phase 2: Construct tree
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
[Figure: tree on S1, S2, S4, S3]

Many methods
Phylogeny methods
  Bayesian MCMC
  Maximum parsimony
  Maximum likelihood
  Neighbor joining
  FastME
  UPGMA
  Quartet puzzling
  Etc.
Alignment methods
  Clustal
  POY (and POY*)
  Probcons (and Probtree)
  MAFFT
  Prank
  Muscle
  Di-align
  T-Coffee
  Opal
  Etc.
RAxML: best heuristic for large-scale ML optimization

Question: How well do two-phase methods perform?
ROSE simulation:
  1000, 500, and 100 sequences
  Evolution with substitutions and indels
  Varied gap lengths, rates of evolution
Estimated alignments using leading methods
Used RAxML to compute trees
Recorded tree error (missing branch rate)
Recorded alignment error
Liu et al., Science 2009

Simulation Studies
Unaligned Sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
[Figure: two alignment/tree pairs, captioned "True tree and alignment" and "Estimated tree and alignment", are compared. Alignments and trees shown, in slide order:]
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-C--T-----GACCGC--
S4 = T---C-A-CGACCGA----CA
[Trees with leaf orders S1 S2 S3 S4 and S1 S4 S3 S2]

Quantifying Error
[Figure: a true tree and an estimated tree on taxa A, B, C, D, E; one edge of the true tree is marked FN]
False negative (FN), aka “missing branch”: an edge in the true tree missing from the estimated tree
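The missing branch rate can be sketched in Python by comparing the bipartitions (splits) that the internal edges of each tree induce on the taxa. This is only an illustration: the nested-tuple tree representation, the function names, and the two five-taxon example trees are assumptions of ours, not the topologies drawn on the slide.

```python
def leaves_and_clades(tree, clades):
    """Return the leaf set under `tree`, recording the leaf set of every internal node."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaves_and_clades(child, clades) for child in tree))
    clades.append(leaves)
    return leaves


def bipartitions(tree):
    """Nontrivial splits of an unrooted tree given as nested tuples of taxon names."""
    clades = []
    all_leaves = leaves_and_clades(tree, clades)
    anchor = min(all_leaves)  # name each split by the side that lacks this taxon
    splits = set()
    for clade in clades:
        side = clade if anchor not in clade else all_leaves - clade
        # Keep internal edges only: both sides must contain at least two taxa.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
    return splits


def missing_branch_rate(true_tree, estimated_tree):
    """Fraction of internal edges of the true tree absent from the estimated tree."""
    true_splits = bipartitions(true_tree)
    if not true_splits:
        return 0.0
    return len(true_splits - bipartitions(estimated_tree)) / len(true_splits)


# Hypothetical example trees on five taxa.
true_tree = (("A", "B"), "C", ("D", "E"))
estimated_tree = (("A", "C"), "B", ("D", "E"))
fn_rate = missing_branch_rate(true_tree, estimated_tree)
print(fn_rate)
```

Comparing splits rather than drawings makes the measure independent of how either tree happens to be rooted or laid out.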

1000 taxon models, ordered by difficulty
1. 2 classes of model conditions (MC): easy, moderate-to-difficult
2. true alignment
3. 2 classes: ClustalW, everything else
Alignment error, measured this way, isn't a perfect predictor of tree error, measured this way.

Problems with the two-phase approach
Manual alignment is time consuming and subjective.
Current alignment methods fail to return reasonable alignments on large datasets with high rates of indels and substitutions.
Systematists discard potentially useful markers if they are difficult to align.
These issues seriously impact large-scale phylogeny estimation (and Tree of Life projects).

[Figure: tree on S1, S4, S2, S3]
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
and
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
Statistical simultaneous estimation methods (BALiPhy, Alifritz, Statalign) are not scalable.
POY and related methods are not more accurate than standard two-phase methods.

SATé: (Simultaneous Alignment and Tree Estimation)
Liu, Nelesen, Raghavan, Linder, and Warnow
Search strategy: search through tree space, and realign sequences on each tree using a novel divide-and-conquer approach.
Optimization criterion: alignment/tree pair that optimizes maximum likelihood under GTR+Gamma (RAxML GTRMIX, treating gaps as missing data).
Science, 19 June 2009, pp

SATé Algorithm
1. Obtain initial alignment and estimated ML tree
2. Use tree to compute new alignment
3. Estimate ML tree on new alignment
4. If new alignment/tree pair has worse ML score, realign using a different decomposition
Repeat until termination condition (typically, 24 hours)
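The control flow of this loop can be sketched in Python as follows. This is a schematic under stated assumptions, not the SATé implementation: the aligner, the ML tree estimator and the decomposition choices are passed in as hypothetical callables, decompositions are identified by an integer index, and stopping once every decomposition has failed in a row is our own addition to the time limit mentioned on the slide. The toy stand-ins at the bottom only exercise the loop; they do no real alignment or tree estimation.

```python
import time
from typing import Any, Callable, Sequence


def sate_like_search(
    sequences: Sequence[str],
    initial_alignment: Callable[[Sequence[str]], Any],
    estimate_ml_tree: Callable[[Any], tuple[Any, float]],
    realign: Callable[[Sequence[str], Any, int], Any],
    num_decompositions: int,
    time_budget_s: float,
) -> tuple[Any, Any, float]:
    """Alternate re-alignment and ML tree estimation, keeping the best-scoring pair."""
    alignment = initial_alignment(sequences)
    tree, score = estimate_ml_tree(alignment)
    deadline = time.monotonic() + time_budget_s
    decomposition = 0
    failures = 0  # consecutive decompositions that gave a worse ML score
    while time.monotonic() < deadline and failures < num_decompositions:
        new_alignment = realign(sequences, tree, decomposition)
        new_tree, new_score = estimate_ml_tree(new_alignment)
        if new_score < score:
            # Worse pair: keep the current one and try a different decomposition.
            decomposition = (decomposition + 1) % num_decompositions
            failures += 1
        else:
            alignment, tree, score = new_alignment, new_tree, new_score
            failures = 0
    return alignment, tree, score


# Toy stand-ins (hypothetical): an "alignment" is just a number, and the
# "ML score" is higher the closer that number is to 10.
steps = [3, 1]
result = sate_like_search(
    sequences=["ACGT"],
    initial_alignment=lambda seqs: 0,
    estimate_ml_tree=lambda aln: (aln, float(-abs(10 - aln))),
    realign=lambda seqs, tree, d: tree + steps[d],
    num_decompositions=len(steps),
    time_budget_s=5.0,
)
print(result)
```

Passing the alignment and tree estimators as parameters keeps the sketch independent of any particular tool.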

SATé iteration (Actual decomposition produces 32 subproblems)
[Figure: one SATé iteration as a cycle: decompose based on input tree, align subproblems, merge subproblems, estimate ML tree on merged alignment (ABCD)]

Say the only thing changing here is the alignment method.

For moderate-to-difficult datasets, SATé gets better trees and alignments than all other estimation methods, close to what you might get if you had access to the true alignment.
This opens up a new realm of possibility: datasets currently considered “unalignable” can in fact be aligned reasonably well, which makes accurate estimation of deep evolutionary histories feasible using a wider range of markers.
TRANSITION: can we do better? What about smaller simulated datasets? And what about biological datasets?
24 hour SATé analysis, on desktop machines
(Similar improvements for biological datasets)

Why is SATé so accurate?
Not because it’s good to optimize ML under a model treating gaps as missing data: we prove that this optimization problem is a bad approach.
Instead, the key is the divide-and-conquer strategy (with ML giving a small but statistically significant additional improvement).

Since the Science paper…
Two new (unpublished) methods, both more accurate than the Science SATé:
SATé-II: different re-alignment strategy, but same general algorithm design. Fast version gives highly accurate results in just a few hours on datasets with 1000 taxa.
BLuTGEN: designed for very large datasets, or those where an alignment on the entire dataset is not possible. Only produces a tree.

SATé-II: same as SATé-I except for decomposition
[Figure: one SATé-II iteration on taxa A, B, C, D: decompose based on input tree, align subproblems, merge subproblems, estimate ML tree on merged alignment (ABCD)]

SATé-II vs. SATé-I: only difference is the division into subsets
Recursive decomposition: an edge is removed from the tree, dividing the taxa into two subsets. This is repeated until each subset is “small enough”.
Each subset is aligned, and then the alignments are merged into an alignment on the entire dataset.
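The recursive decomposition can be sketched in Python on a tree stored as an adjacency map. The slide does not say which edge is removed at each step; this sketch assumes the edge that splits the taxa most evenly, and the function names, the eight-taxon caterpillar tree and the size threshold of 2 are likewise hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency map (node -> set of neighbours) from a list of edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def side_nodes(adj, start, banned):
    """Nodes reachable from `start` without crossing the edge {start, banned}."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt in seen or (node == start and nxt == banned):
                continue
            seen.add(nxt)
            stack.append(nxt)
    return seen


def decompose(adj, taxa, max_size):
    """Split the taxa of a tree into subsets of at most `max_size` by removing edges."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    taxa_here = taxa & set(adj)
    if len(taxa_here) <= max_size:
        return [taxa_here]
    best_gap, best_side = None, None
    for u in adj:
        for v in adj[u]:
            side = side_nodes(adj, u, v)
            gap = abs(len(taxa_here & side) - len(taxa_here - side))
            if best_gap is None or gap < best_gap:  # assumed rule: most balanced edge
                best_gap, best_side = gap, side
    subsets = []
    for part in (best_side, set(adj) - best_side):
        sub_adj = {node: adj[node] & part for node in part}  # drops the removed edge
        subsets.extend(decompose(sub_adj, taxa_here, max_size))
    return subsets


# Hypothetical caterpillar tree on taxa t1..t8 with internal nodes i1..i6.
edges = [("t1", "i1"), ("t2", "i1"), ("i1", "i2"), ("t3", "i2"), ("i2", "i3"),
         ("t4", "i3"), ("i3", "i4"), ("t5", "i4"), ("i4", "i5"), ("t6", "i5"),
         ("i5", "i6"), ("t7", "i6"), ("t8", "i6")]
taxa = {f"t{k}" for k in range(1, 9)}
subsets = decompose(build_adjacency(edges), taxa, max_size=2)
print(sorted(sorted(s) for s in subsets))
```

Each subset consists of taxa that are close together in the input tree, which is what makes the subproblems easier to align than the full dataset.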

1000 taxon models ranked by difficulty

SATé-I vs. SATé-II
SATé-II is faster and more accurate than SATé-I
Longer analyses, or use of ML to select the tree/alignment pair, give slightly better results

Limitations of SATé-I and -II
[Figure: the SATé iteration on taxa A, B, C, D: decompose dataset, align subproblems, merge sub-alignments, estimate ML tree on merged alignment (ABCD)]

BLuTGEN
Input: unaligned sequences
Output: tree on the sequence dataset
Note: BLuTGEN does not produce a multiple alignment on the full dataset.

BLuTGEN
[Figure: BLuTGEN pipeline. Unaligned sequences -> overlapping subsets (BLAST-based or pRecDCM3) -> a tree for each subset (existing method: RAxML(MAFFT)) -> a tree for the entire dataset (new supertree method: SuperFine)]

1000 taxon models ranked by difficulty
We examine BLuTGEN on the two hardest model conditions: 1000S1 = M14 and 1000M1 = M15

Using pRecDCM3
[Plot: missing edge rate]

pRecDCM3 vs. SATé
[Plot: missing edge rate]

Further Improvements with Iteration
[Plot: missing edge rate]

BLuTGEN
Results shown based upon three iterations, starting with RAxML(MAFFT).
Same levels of accuracy obtained when using the BLAST-based approach.
Nelesen, Linder, and Warnow. In preparation.

Analyses of Gutell rRNA datasets
BLuTGEN yields dramatic improvements in topological accuracy over two-phase methods, and slight improvements over SATé-I, with just a few iterations.
Two largest datasets:
  7,350 16S sequences spanning the Tree of Life (all 3 domains)
  27,643 16S bacterial sequences

BLuTGEN vs. SATé
BLuTGEN iterations:
  1-2.5 hours for 1000 taxon (simulated) datasets
  25 hours on the 7350 taxon (16S.T) dataset
  68 hours on the 27,643 taxon (16S.B.ALL) dataset
SATé-II iterations:
  2-12 hours for 1000 taxon (simulated) datasets
  75 hours on the 7350 taxon (16S.T) dataset
  253 days on the 27,643 taxon (16S.B.ALL) dataset:
    Alignment production: 121 hours
    RAxML tree estimation: 248 days (one month with 8 processors, fastest version of RAxML)

BLuTGEN on 16S.B.ALL
6 iterations of BLuTGEN on a 27,000 taxon dataset; each iteration takes 3 days

Summary
Both SATé-I and SATé-II produce more accurate trees and alignments than standard methods, and SATé-II is much faster and more accurate than SATé-I.
BLuTGEN is designed for very large-scale phylogeny estimation problems, especially where SATé-II cannot be used (or where its use would require extreme computational resources).
Together, these methods make it possible to analyze a much larger range of genes in systematic studies, and may transform Tree of Life efforts.

Related Research
SuperFine: new supertree method
Divide-and-conquer methods to speed up
Constructing species phylogenies from gene trees
Whole genome phylogenetics
Historical Linguistics, including reticulate evolution detection and reconstruction
Metagenomic analyses, esp. phylogenetic placement

Acknowledgments
National Science Foundation: Assembling the Tree of Life (ATOL), ITR, and IGERT grants ( , , )
Students: Kevin Liu, Serita Nelesen, Sindhu Raghavan, and Shel Swenson
Randy Linder, Integrative Biology, UT-Austin
