Ultra-Large Phylogeny Estimation Using SATé and DACTAL

Presentation transcript:

Ultra-Large Phylogeny Estimation Using SATé and DACTAL Tandy Warnow Department of Computer Science The University of Texas at Austin

Phylogeny (evolutionary tree). [Figure: tree relating Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life Website, University of Arizona.]

How did life evolve on earth? An international effort to understand how life evolved on earth. Biomedical applications: drug design, protein structure and function prediction, biodiversity. (Courtesy of the Tree of Life project.)

DNA Sequence Evolution. [Figure: sequences evolving down a tree along a timeline (-3 mil yrs, -2 mil yrs, -1 mil yrs, today). Node labels: AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, AGCGCTT, AGCACAA, TAGACTT, TAGCCCA.]

Input sequences: U = AGGGCAT, V = TAGCCCA, W = TAGACTT, X = TGCACAA, Y = TGCGCTT. [Figure: tree with leaves X, U, Y, V, W.]

Markov Model of Site Evolution. Simplest (Jukes-Cantor): The model tree T is binary and has substitution probabilities p(e) on each edge e. The state at the root is randomly drawn from {A,C,T,G} (nucleotides). If a site (position) changes on an edge, it changes with equal probability to each of the remaining states. The evolutionary process is Markovian. More complex models (such as the General Markov model) are also considered, often with little change to the theory.
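The Jukes-Cantor process described above can be sketched as a small simulation. This is only an illustration under stated assumptions: the tree encoding (a dict from parent to children), the `edge_prob` dict, and the function names are hypothetical and not taken from the talk, and sites are assumed to evolve independently.

```python
import random

NUCLEOTIDES = "ACGT"

def evolve_site(state, p, rng):
    """With probability p the site changes, uniformly to one of the 3 other states."""
    if rng.random() < p:
        return rng.choice([s for s in NUCLEOTIDES if s != state])
    return state

def simulate_jc(tree, edge_prob, root, n_sites, rng):
    """Evolve sequences down a rooted tree.

    tree:      dict mapping a parent name to a list of child names (assumed encoding)
    edge_prob: dict mapping (parent, child) to the substitution probability p(e)
    """
    # The root state of every site is drawn uniformly from the four nucleotides.
    seqs = {root: [rng.choice(NUCLEOTIDES) for _ in range(n_sites)]}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in tree.get(parent, []):
            p = edge_prob[(parent, child)]
            # Markov property: the child depends only on its parent's sequence.
            seqs[child] = [evolve_site(s, p, rng) for s in seqs[parent]]
            stack.append(child)
    return {name: "".join(sites) for name, sites in seqs.items()}

# Example: a 4-leaf model tree with the same p(e) on every edge.
tree = {"root": ["x", "y"], "x": ["a", "b"], "y": ["c", "d"]}
probs = {(u, v): 0.1 for u, kids in tree.items() for v in kids}
sequences = simulate_jc(tree, probs, "root", 20, random.Random(0))
```

Only the leaf sequences ("a" to "d" here) would be available to a tree estimation method; the internal sequences are returned for inspection.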

Quantifying Error. FN: false negative (missing edge). FP: false positive (incorrect edge). [Figure: a true tree and an estimated tree, with the FN and FP edges marked; 50% error rate.]
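A minimal sketch of how FN and FP rates can be computed, assuming each tree is represented by the set of bipartitions (splits) of the taxa induced by its internal edges. The representation and the function names are assumptions for illustration, not the talk's implementation.

```python
def normalize(split, taxa):
    """Represent a bipartition by the side that does not contain the smallest taxon."""
    side = frozenset(split)
    reference = min(taxa)
    return side if reference not in side else frozenset(taxa) - side

def error_rates(true_splits, est_splits, taxa):
    """Return (FN rate, FP rate) over the internal edges of the two trees."""
    true_set = {normalize(s, taxa) for s in true_splits}
    est_set = {normalize(s, taxa) for s in est_splits}
    fn = len(true_set - est_set) / len(true_set)  # true edges missing from the estimate
    fp = len(est_set - true_set) / len(est_set)   # estimated edges not in the true tree
    return fn, fp

taxa = {"a", "b", "c", "d", "e"}
true_tree = [{"a", "b"}, {"d", "e"}]   # splits ab|cde and abc|de
estimated = [{"a", "b"}, {"c", "e"}]   # splits ab|cde and abd|ce
fn_rate, fp_rate = error_rates(true_tree, estimated, taxa)
```

In this toy case one of the two true edges is missing and one of the two estimated edges is wrong, so both rates are 0.5.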

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

Distance-based estimation

Performance on large diameter trees. Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. [Nakhleh et al. ISMB 2001] [Figure: error rate of NJ (y-axis, 0 to 0.8) against number of taxa (x-axis, 400 to 1600).]

Theorem (Erdos et al., Atteson): Neighbor joining (and some other methods) will return the true tree w.h.p. provided sequence lengths are exponential in the evolutionary diameter of the tree. Sketch of proof: NJ (and other distance methods) are guaranteed correct if all entries in the estimated distance matrix have low error. Estimations of large distances require long sequences to have low error w.h.p.
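The second step of the proof sketch can be illustrated numerically. The slide does not give a distance formula; the sketch below assumes the standard Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the observed fraction of differing sites. Function names are hypothetical.

```python
import math

def p_distance(s1, s2):
    """Observed fraction of sites at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)

def jc_distance(p):
    """Jukes-Cantor corrected distance; undefined (infinite) once p reaches 3/4."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)

# The same 0.01 sampling error in p matters far more for distant pairs.
small_shift = jc_distance(0.11) - jc_distance(0.10)
large_shift = jc_distance(0.70) - jc_distance(0.69)
```

Here `large_shift` is roughly ten times `small_shift`: near saturation, a small sampling error in p produces a large error in the estimated distance, which is why long sequences are needed for trees of large evolutionary diameter.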

DCM1-boosting: Warnow, St. John, and Moret, SODA 2001. [Diagram: an exponentially converging (base) method passes through the DCM1 and SQS phases to give an absolute fast converging (DCM1-boosted) method.] The DCM1 phase produces a collection of trees (one for each threshold), and the SQS phase picks the “best” tree. For a given threshold, the base method is used to construct trees on small subsets (defined by the threshold) of the taxa. These small trees are then combined into a tree on the full set of taxa.

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001]. Theorem (Warnow et al., SODA 2001): DCM1-NJ converges to the true tree from polynomial length sequences. [Figure: error rates of NJ and DCM1-NJ (y-axis, 0 to 0.8) against number of taxa (x-axis, 400 to 1600).]

Summary and Open Questions. DCM-NJ has better accuracy than NJ, and DCM-boosting of other distance-based methods also produces very big improvements in accuracy. Other afc methods have been developed with even better theoretical performance (see work by Daskalakis and Roch, and others). Roch and collaborators have established a threshold for branch lengths, below which logarithmic sequence lengths can suffice for accuracy. Still to be developed: other afc methods with improved empirical performance compared to NJ and other methods. Interesting open problem: the sequence length requirement for maximum likelihood (though see Szekely and Steel’s work).

What about more complex models? These results only apply when sequences evolve under these nice substitution-only models. What can we say about estimating trees when sequences evolve with insertions and deletions (“indels”)?

Today’s talk: some theory, some empirical performance. SATé: Simultaneous Alignment and Tree Estimation (Liu et al., Science 2009, and Liu et al., Systematic Biology, in press), and DACTAL: Divide-and-Conquer Trees without alignments (Nelesen et al., submitted).

[Figure: the sequence …ACGGTGCAGTTACCA… evolves into …ACCAGTCACCTA… through a deletion, a substitution, and an insertion.]
The true multiple alignment:
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
It reflects historical substitution, insertion, and deletion events, and is defined using the transitive closure of pairwise alignments computed on the edges of the true tree. Homology = nucleotides lined up since they come from a common ancestor. Indel = dash.

Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Phase 1: Multiple Sequence Alignment
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA

Phase 2: Construct tree. [Figure: a tree with leaves S1, S2, S4, S3, estimated from the alignment produced in Phase 1.]
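As a small check on the two-phase example, the sketch below verifies that the Phase 1 alignment is consistent with the input (removing the dashes must give back each unaligned sequence) and computes a gap-aware pairwise distance of the kind a distance-based Phase 2 method could consume. The function names are hypothetical, and skipping gapped columns is only one possible convention for handling indels.

```python
def degap(row):
    """Remove indel characters from an aligned row."""
    return row.replace("-", "")

def is_valid_alignment(unaligned, aligned):
    """All rows have equal length, and degapping each row recovers its input sequence."""
    same_length = len({len(row) for row in aligned.values()}) == 1
    return same_length and all(degap(aligned[n]) == unaligned[n] for n in unaligned)

def gap_aware_p_distance(row1, row2):
    """Fraction of mismatches over columns where neither row has a gap."""
    pairs = [(a, b) for a, b in zip(row1, row2) if a != "-" and b != "-"]
    if not pairs:
        return None  # no shared columns, so the distance is undefined
    return sum(a != b for a, b in pairs) / len(pairs)

unaligned = {
    "S1": "AGGCTATCACCTGACCTCCA",
    "S2": "TAGCTATCACGACCGC",
    "S3": "TAGCTGACCGC",
    "S4": "TCACGACCGACA",
}
aligned = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}
alignment_ok = is_valid_alignment(unaligned, aligned)
d_s2_s3 = gap_aware_p_distance(aligned["S2"], aligned["S3"])
```

Under this convention S2 and S3 are at distance 0, because they agree in every column where both have a nucleotide; the deletion that separates them is invisible to the distance, which is one reason the treatment of gaps matters.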

Many methods. Phylogeny methods: Bayesian MCMC, maximum parsimony, maximum likelihood, neighbor joining, FastME, UPGMA, quartet puzzling, etc. Alignment methods: Clustal, POY (and POY*), Probcons (and Probtree), MAFFT, Prank, Muscle, Di-align, T-Coffee, Opal, FSA (new method), Infernal (new method), etc. RAxML: best heuristic for large-scale ML optimization.

1000 taxon models, ordered by difficulty
1. 2 classes of MC: easy, moderate-to-difficult
2. true alignment
3. 2 classes: ClustalW, everything else
Alignment error, measured this way, isn't a perfect predictor of tree error, measured this way.

Problems with the two-phase approach. Current alignment methods fail to return reasonable alignments on large datasets with high rates of indels and substitutions. Manual alignment is time consuming and subjective. Systematists discard potentially useful markers if they are difficult to align. These issues seriously impact large-scale phylogeny estimation (and Tree of Life projects).

SATé. Liu, Nelesen, Raghavan, Linder, and Warnow, Science, 19 June 2009, pp. 1561-1564. Kansas SATé software developers: Mark Holder, Jiaye Yu, Jeet Sukumaran, and Siavash Mirarab. Downloadable software for various platforms, with an easy-to-use GUI: http://phylo.bio.ku.edu/software/sate/sate.html

SATé Algorithm
1. Obtain initial alignment and estimated ML tree.
2. Use tree to compute new alignment.
3. Estimate ML tree on new alignment.
4. If new alignment/tree pair has worse ML score, realign using a different decomposition.
Repeat until termination condition (typically, 24 hours).
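The control flow of the SATé algorithm can be sketched as a generic search loop. This is not the SATé software: every callable passed in (`initial_align`, `estimate_tree`, `realign`, `score`) is a hypothetical placeholder for an external tool, the `decomposition` counter is an assumed way to request "a different decomposition", and an iteration cap stands in for the 24-hour time limit.

```python
def sate_like_search(sequences, initial_align, estimate_tree, realign, score, max_iters):
    """Alternate between re-alignment and tree estimation, keeping the best-scoring pair."""
    alignment = initial_align(sequences)
    tree = estimate_tree(alignment)
    best_score = score(alignment, tree)
    decomposition = 0  # which decomposition strategy the re-aligner should use
    for _ in range(max_iters):
        cand_alignment = realign(sequences, tree, decomposition)
        cand_tree = estimate_tree(cand_alignment)
        cand_score = score(cand_alignment, cand_tree)
        if cand_score > best_score:
            # Accept the improved alignment/tree pair and continue from it.
            alignment, tree, best_score = cand_alignment, cand_tree, cand_score
        else:
            # No improvement: keep the current pair, try a different decomposition next.
            decomposition += 1
    return alignment, tree, best_score

# Toy stand-ins so the loop can be exercised without any external software.
def toy_initial_align(seqs):
    return 0

def toy_estimate_tree(alignment):
    return alignment

def toy_realign(seqs, tree, decomposition):
    return tree + 1

def toy_score(alignment, tree):
    return -abs(alignment - 3)

result = sate_like_search(["s1", "s2"], toy_initial_align, toy_estimate_tree,
                          toy_realign, toy_score, 10)
```

In a real run the score would be the ML score of the alignment/tree pair. One design choice in this sketch is that a tie counts as "not better", which avoids cycling between equally scored pairs.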

One SATé iteration (really 32 subsets). [Diagram: a cycle of four steps. Decompose based on input tree (into subsets A, B, C, D); align subproblems; merge subproblems; estimate ML tree on merged alignment.]
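The slide does not say how the input tree is decomposed. The sketch below assumes one plausible rule, a recursive split at the edge that divides the taxa most evenly (a centroid edge), and stops once every subset is small enough. The adjacency-dict tree encoding and all names are hypothetical, and the quadratic edge search is chosen for clarity rather than speed.

```python
def side(adj, start, avoid):
    """Nodes reachable from `start` without passing through `avoid` (adj must be a tree)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbor in adj[node]:
            if neighbor != avoid and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen

def decompose(adj, taxa, max_size):
    """Recursively split the tree at its most balanced edge until subsets fit max_size."""
    nodes = set(adj)
    my_taxa = taxa & nodes
    if len(my_taxa) <= max_size:
        return [my_taxa]
    best_gap, best_side = None, None
    for u in adj:
        for v in adj[u]:
            part = side(adj, u, v)  # the component containing u once edge (u, v) is cut
            gap = abs(len(part & taxa) - len(my_taxa) / 2)
            if best_gap is None or gap < best_gap:
                best_gap, best_side = gap, part
    other_side = nodes - best_side
    sub_a = {n: [m for m in adj[n] if m in best_side] for n in best_side}
    sub_b = {n: [m for m in adj[n] if m in other_side] for n in other_side}
    return decompose(sub_a, taxa, max_size) + decompose(sub_b, taxa, max_size)

# Example: the unrooted quartet tree ((A,B),(C,D)) with internal nodes x and y.
quartet = {"A": ["x"], "B": ["x"], "C": ["y"], "D": ["y"],
           "x": ["A", "B", "y"], "y": ["C", "D", "x"]}
subsets = decompose(quartet, {"A", "B", "C", "D"}, 2)
```

Each returned subset would then be aligned separately and the sub-alignments merged, following the cycle in the diagram.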

1000 taxon models, ordered by difficulty. The only thing changing here is the alignment method.

1000 taxon models, ordered by difficulty. 24 hour SATé analysis, on desktop machines. (Similar improvements for biological datasets.) For moderate-to-difficult datasets, SATé gets better trees and alignments than all other estimated methods, close to what you might get if you had access to the true alignment. This opens up a new realm of possibility: datasets currently considered “unalignable” can in fact be aligned reasonably well, which makes accurate estimation of deep evolutionary histories feasible using a wider range of markers. Can we do better? What about smaller simulated datasets? And what about biological datasets?

1000 taxon models ranked by difficulty

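DACTAL. [Diagram: an iterative cycle. Unaligned sequences are divided into overlapping subsets (BLAST-based, or pRecDCM3); an existing method, RAxML(MAFFT), produces a tree for each subset; a new supertree method, SuperFine, combines these into a tree for the entire dataset; pRecDCM3 then divides that tree into overlapping subsets again.]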

DACTAL vs. SATé 16S.T, 7350 rRNA sequences

Average of Three Largest CRW Datasets. Datasets with curated alignments based upon secondary structure, with 6323 to 27,643 sequences (16S.B.ALL, 16S.T, and 16S.3). Reference trees are 75% RAxML bootstrap trees. DACTAL was run with at most 5 iterations, starting from FastTree(PartTree). Observations: Quicktree and PartTree are the only alignment methods that run on all three datasets. DACTAL is robust to the starting tree (same final accuracy results from worse starting trees).

Observations: SATé and DACTAL outperform two-phase methods with respect to topological accuracy on large, hard-to-align datasets. DACTAL outperforms SATé on the largest datasets. We do not have any theoretical explanation for why these methods perform well.

DACTAL, generalized. [Diagram: the same cycle with interchangeable parts. Unaligned sequences are divided into overlapping subsets by any decomposition; any tree estimation method produces a tree for each subset; any supertree method produces a tree for the entire dataset, which any decomposition can divide again.]

Implications: Divide-and-conquer methods can greatly improve the accuracy and speed of phylogeny and alignment estimation. Theoretical performance doesn’t predict empirical performance. Many open questions result from considering phylogeny estimation with indels.

Some open questions: What is the sequence length requirement for maximum likelihood? Are trees identifiable under models including long gaps? Why do SATé and DACTAL perform well? Under standard implementations of ML, gaps are treated as missing data: what are the consequences?

Acknowledgments. Microsoft Research New England. National Science Foundation: Assembling the Tree of Life (ATOL), ITR, and IGERT grants (0733029, 0331453, 0114387). The John P. Simon Guggenheim Foundation. Collaborators: Randy Linder, Bernard Moret, Mark Holder, Jiaye Yu, Alexis Stamatakis, Mike Steel, Katherine St. John, Peter Erdos, Laszlo Szekely, Kevin Liu, Luay Nakhleh, Serita Nelesen, Sindhu Raghavan, Usman Roshan, Jerry Sun, Rahul Suri, Shel Swenson, and Li-San Wang.

DACTAL vs. 2-phase methods