Advances in Ultra-large Phylogeny Estimation

Slides:



Advertisements
Similar presentations
CS 598AGB What simulations can tell us. Questions that simulations cannot answer Simulations are on finite data. Some questions (e.g., whether a method.
Advertisements

Computer Science and Reconstructing Evolutionary Trees Tandy Warnow Department of Computer Science University of Illinois at Urbana-Champaign.
Multiple sequence alignment methods: evidence from data CS/BioE 598 Tandy Warnow.
Computational and mathematical challenges involved in very large-scale phylogenetics Tandy Warnow The University of Texas at Austin.
Complexity and The Tree of Life Tandy Warnow The University of Texas at Austin.
Barking Up the Wrong Treelength Kevin Liu, Serita Nelesen, Sindhu Raghavan, C. Randal Linder, and Tandy Warnow IEEE TCCB 2009.
Software for Scientists Tandy Warnow Department of Computer Science University of Texas at Austin.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
CS 394C Algorithms for Computational Biology Tandy Warnow Spring 2012.
TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science.
Challenge and novel aproaches for multiple sequence alignment and phylogenetic estimation Tandy Warnow Department of Computer Science The University of.
Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.
TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science The University of Texas at Austin.
Constructing the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign.
Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.
Family of HMMs Nam Nguyen University of Texas at Austin.
Three approaches to large- scale phylogeny estimation: SATé, DACTAL, and SEPP Tandy Warnow Department of Computer Science The University of Texas at Austin.
CS 598 AGB Supertrees Tandy Warnow. Today’s Material Supertree construction: given set of trees on subsets of S (the full set of taxa), construct tree.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.
TIPP: Taxon Identification using Phylogeny-Aware Profiles Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign.
Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.
Algorithms for Ultra-large Multiple Sequence Alignment and Phylogeny Estimation Tandy Warnow Department of Computer Science The University of Texas at.
Approaching multiple sequence alignment from a phylogenetic perspective Tandy Warnow Department of Computer Sciences The University of Texas at Austin.
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.
Simultaneous alignment and tree reconstruction Collaborative grant: Texas, Nebraska, Georgia, Kansas Penn State University, Huston-Tillotson, NJIT, and.
Ensembles of HMMs and their use in biomolecular sequence analysis Nam-phuong Nguyen Carl R. Woese Institute for Genomic Biology University of Illinois.
394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.
Scaling BAli-Phy to Large Datasets June 16, 2016 Michael Nute 1.
CS 466 and BIOE 498: Introduction to Bioinformatics
Constrained Exact Optimization in Phylogenetics
Distance-based phylogeny estimation
Phylogenetic basis of systematics
New Approaches for Inferring the Tree of Life
394C, Spring 2012 Jan 23, 2012 Tandy Warnow.
CS 581 / BIOE 540: Algorithmic Computational Genomics
Chalk Talk Tandy Warnow
Distance based phylogenetics
TIPP: Taxon Identification using Phylogeny-Aware Profiles
Multiple Sequence Alignment Methods
Tandy Warnow Department of Computer Sciences
Techniques for MSA Tandy Warnow.
Algorithm Design and Phylogenomics
Mathematical and Computational Challenges in Reconstructing Evolution
New methods for simultaneous estimation of trees and alignments
Large-Scale Multiple Sequence Alignment
Mathematical and Computational Challenges in Reconstructing Evolution
TIPP and SEPP: Metagenomic Analysis using Phylogeny-Aware Profiles
Absolute Fast Converging Methods
CS 581 Tandy Warnow.
CS 581 Algorithmic Computational Genomics
TIPP: Taxon Identification using Phylogeny-Aware Profiles
Tandy Warnow Founder Professor of Engineering
New methods for simultaneous estimation of trees and alignments
Texas, Nebraska, Georgia, Kansas
Ultra-Large Phylogeny Estimation Using SATé and DACTAL
Recent Breakthroughs in Mathematical and Computational Phylogenetics
CS 394C: Computational Biology Algorithms
September 1, 2009 Tandy Warnow
Taxonomic identification and phylogenetic profiling
Algorithms for Inferring the Tree of Life
Sequence alignment CS 394C Tandy Warnow Feb 15, 2012.
Tandy Warnow The University of Texas at Austin
Tandy Warnow The University of Texas at Austin
New methods for simultaneous estimation of trees and alignments
Ultra-large Multiple Sequence Alignment
New methods for estimating species trees from gene trees
TIPP and SEPP (plus PASTA)
Presentation transcript:

Advances in Ultra-large Phylogeny Estimation Tandy Warnow Department of Computer Science University of Texas

Phylogeny (evolutionary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

How did life evolve on earth? Courtesy of the Tree of Life project

Where did humans come from, and how did they move throughout the globe? The 1000 Genome Project: using human genetic variation to better treat diseases

Metagenomics: C. Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

Major Challenges Current phylogenetic datasets contain hundreds to thousands of taxa, with multiple genes. Future datasets will be substantially larger (e.g., iPlant plans to construct a tree on 500,000 plant species) Current methods have poor accuracy or cannot run on large datasets.

Computational Phylogenetics Current methods can use months to estimate trees on 1000 DNA sequences Our objective: More accurate trees and alignments on 500,000 sequences in under a week We prove theorems using graph theory and probability theory, and our algorithms are studied on real and simulated data. Courtesy of the Tree of Life project

DNA Sequence Evolution -3 mil yrs -2 mil yrs -1 mil yrs today AAGACTT TGGACTT AAGGCCT AGGGCAT TAGCCCT AGCACTT AGCGCTT AGCACAA TAGACTT TAGCCCA AAGACTT TGGACTT AAGGCCT AGGGCAT TAGCCCT AGCACTT AAGGCCT TGGACTT TAGCCCA TAGACTT AGCGCTT AGCACAA AGGGCAT TAGCCCT AGCACTT

…ACGGTGCAGTTACCA… …ACGGTGCAGTTACC-A… …AC----CAGTCACCTA… …ACCAGTCACCTA… Deletion Substitution …ACGGTGCAGTTACCA… Insertion …ACGGTGCAGTTACC-A… …AC----CAGTCACCTA… …ACCAGTCACCTA… The true multiple alignment Reflects historical substitution, insertion, and deletion events Defined using transitive closure of pairwise alignments computed on edges of the true tree Homology = nucleotides lined up since they come from a common ancestor. Indel = dash. 9

U V W X Y AGGGCATGA AGAT TAGACTT TGCACAA TGCGCTT X U Y V W

Input: unaligned sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 1: Multiple Sequence Alignment S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

Phase 2: Construct tree S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S4 S3

Simulation Studies Unaligned Sequences Compare True tree and alignment S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA Unaligned Sequences S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-C--T-----GACCGC-- S4 = T---C-A-CGACCGA----CA S1 S2 S3 S4 S1 S4 S3 S2 Compare True tree and alignment Estimated tree and alignment

The neighbor joining method has high error rates on large trees Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. [Nakhleh et al. ISMB 2001] 0.8 NJ 0.6 Error Rate 0.4 0.2 400 800 1200 1600 No. Taxa

1000 taxon models, ordered by difficulty (Liu et al., 2009) 1. 2 classes of MC: easy, moderate-to-difficult 2. true alignment 3. 2 classes: ClustalW, everything else Alignment error, measured this way, isn't a perfect predictor of tree error, measured this way. 1000 taxon models, ordered by difficulty (Liu et al., 2009) 16

Problems Large datasets with high rates of evolution are hard to align accurately, and phylogeny estimation methods produce poor trees when alignments are poor. Many phylogeny estimation methods have poor accuracy on large datasets (even if given correct alignments) Potentially useful genes are often discarded if they are difficult to align. These issues seriously impact large-scale phylogeny estimation (and Tree of Life projects)

Major Challenges Current phylogenetic datasets contain hundreds to thousands of taxa, with multiple genes. Future datasets will be substantially larger (e.g., iPlant plans to construct a tree on 500,000 plant species) Current methods have poor accuracy or cannot run on large datasets.

Phylogenetic “boosters” (meta-methods) Goal: improve accuracy, speed, robustness, or theoretical guarantees of base methods Examples: DCM-boosting for distance-based methods (1999) DCM-boosting for heuristics for NP-hard problems (1999) SATé-boosting for alignment methods (2009) SuperFine-boosting for supertree methods (2011) DACTAL-boosting for all phylogeny estimation methods (2011) SEPP-boosting for metagenomic analyses (2011)

Disk-Covering Methods (DCMs) (starting in 1998)

DCMs “boost” the performance of phylogeny reconstruction methods. Base method M DCM-M

The neighbor joining method has high error rates on large trees Simulation study based upon fixed edge lengths, K2P model of evolution, sequence lengths fixed to 1000 nucleotides. Error rates reflect proportion of incorrect edges in inferred trees. [Nakhleh et al. ISMB 2001] 0.8 NJ 0.6 Error Rate 0.4 0.2 400 800 1200 1600 No. Taxa

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001] Theorem: DCM1-NJ converges to the true tree from polynomial length sequences 0.8 NJ DCM1-NJ 0.6 Error Rate 0.4 0.2 400 800 1200 1600 No. Taxa

Today’s Talk SATé: Simultaneous Alignment and Tree Estimation (Liu et al., Science 2009, and Liu et al. Systematic Biology, in press) DACTAL: Divide-and-Conquer Trees without alignments (Nelesen et al., submitted)

Part 1: SATé Liu, Nelesen, Raghavan, Linder, and Warnow, Science, 19 June 2009, pp. 1561-1564. Liu et al., Systematic Biology (in press) Public software distribution (open source) through the University of Kansas, in use, world-wide

SATé Algorithm Tree Obtain initial alignment and estimated ML tree

SATé Algorithm Obtain initial alignment and estimated ML tree Tree Use tree to compute new alignment Alignment

SATé Algorithm Obtain initial alignment and estimated ML tree Tree Estimate ML tree on new alignment Use tree to compute new alignment Alignment

SATé Algorithm Obtain initial alignment and estimated ML tree Tree Estimate ML tree on new alignment Use tree to compute new alignment Alignment If new alignment/tree pair has worse ML score, realign using a different decomposition Repeat until termination condition (typically, 24 hours)

One SATé iteration (really 32 subsets) D C Merge subproblems Estimate ML tree on merged alignment Decompose based on input tree Align subproblems ABCD e

1000 taxon models, ordered by difficulty Say the only thing changing here is the alignment method. 31

1000 taxon models, ordered by difficulty For moderate-to-difficult datasets, SATe gets better trees and alignments than all other estimated methods. Close to what you might get if you had access to true alignment. Opens up a new realm of possibility: Datasets currently considered “unalignable” can in fact be aligned reasonably well. This opens up the feasibility of accurate estimations of deep evolutionary histories using a wider range of markers. TRANSITION: can we do better? What about smaller simulated datasets? And what about biological datasets? 24 hour SATé analysis, on desktop machines (Similar improvements for biological datasets)

1000 taxon models ranked by difficulty

Limitations of SATé-I and -II B D C A B Decompose dataset C D Align subproblems A B C D Estimate ML tree on merged alignment ABCD Merge sub-alignments

Part II: DACTAL (Divide-And-Conquer Trees (Almost) without alignments) Input: set S of unaligned sequences Output: tree on S (but no alignment) (Nelesen, Liu, Wang, Linder, and Warnow, submitted)

DACTAL BLAST-based Existing Method: RAxML(MAFFT) pRecDCM3 Unaligned Sequences Overlapping subsets pRecDCM3 A tree for each subset New supertree method: SuperFine A tree for the entire dataset

Average of 3 Largest CRW Datasets CRW: Comparative RNA database, Three 16S datasets with 6,323 to 27,643 sequences Reference alignments based on secondary structure Reference trees are 75% RAxML bootstrap trees DACTAL (shown in red) run for 5 iterations starting from FT(Part) FastTree (FT) and RAxML are ML methods

Observations DACTAL gives more accurate trees than all other methods on the largest datasets DACTAL is much faster than SATé DACTAL is robust to starting trees and other algorithmic parameters

Summary Standard alignment and phylogeny estimation methods do not provide adequate accuracy on large datasets, and NGS data present novel challenges When markers tend to yield poor alignments and trees, develop better methods - don’t throw out the data.

Current Research Projects Method development: Large-scale multiple sequence alignment and phylogeny estimation Metagenomics Comparative genomics Estimating species trees from gene trees Supertree methods Phylogenetic estimation under statistical models Dataset analyses (multi-institutional collaborations): Avian Phylogeny (and brain evolution) Human Microbiome Thousand Transcriptome (1KP) Project Conifer evolution

Acknowledgments Guggenheim Foundation Fellowship, Microsoft Research New England, National Science Foundation: Assembling the Tree of Life (ATOL), ITR, and IGERT grants, and David Bruton Jr. Professorship Collaborators: SATé: Kevin Liu, Serita Nelesen, Sindhu Raghavan, and Randy Linder DACTAL: Serita Nelesen, Li-San Wang, and Randy Linder

Part III: SEPP SEPP: SATé-enabled Phylogenetic Placement, by Mirarab, Nguyen, and Warnow To appear, Pacific Symposium on Biocomputing, 2012 (special session on the Human Microbiome)

Metagenomic data analysis NGS data produce fragmentary sequence data Metagenomic analyses include unknown species Taxon identification: given short sequences, identify the species for each fragment Applications: Human Microbiome Issues: accuracy and speed

Metagenomics Input: set of sequences Output: a tree on the set of sequences, indicating the species identification of each sequence Issue: the sequences are not globally alignable, and there are often thousands (or more) of the sequences

Phylogenetic Placement Input: Backbone alignment and tree on full- length sequences, and a set of query sequences (short fragments) Output: Placement of query sequences on backbone tree Phylogenetic placement can be used for taxon identification, but it has general applications for phylogenetic analyses of NGS data. The two basic steps in phylogenetic placement is similar to the two phase methods we typically use, align and then place the fragment. The differences are we align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a hidden markov model profile for the backbone alignment, and then aligns the query sequence to the model, and PaPaRa, which estimates ancestoral parsimony state vectors for each each in the reference tree and aligns the query sequence against each state vector. It selects the best scoring alignment. Two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment

Phylogenetic Placement Align each query sequence to backbone alignment Place each query sequence into backbone tree, using extended alignment The two basic steps in phylogenetic placement is similar to the two phase methods we typically use, align and then place the fragment. The differences are we align each fragment independently into the backbone tree. We then take each extended alignment and place each fragment independently into the backbone tree. Two methods to align the query sequences are HMMALIGN, which estimates a hidden markov model on backbone alignemnt, then aligns query sequence. PaPaRa, which estimates ancestoral parsimony state vectors for each each in the reference tree and aligns the query sequence against each state vector. It selects the best scoring alignment. Two methods that place the query sequence try to optimize the same criterion: find the ML placement given the extended alignment

Align Sequence S1 S2 S3 S4 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC S1 S2 S3 S4

Align Sequence S1 S2 S3 S4 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- S1 S2 S3 S4

Place Sequence S1 S2 S3 S4 S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC-------- S1 S2 S3 S4 Q1

Phylogenetic Placement Align each query sequence to backbone alignment HMMALIGN (Eddy, Bioinformatics 1998) PaPaRa (Berger and Stamatakis, Bioinformatics 2011) Place each query sequence into backbone tree Pplacer (Matsen et al., BMC Bioinformatics, 2011) EPA (Berger and Stamatakis, Systematic Biology 2011) Note: pplacer and EPA use maximum likelihood

HMMER vs. PaPaRa Alignments 0.0 Increasing rate of evolution

Insights from SATé

Insights from SATé

Insights from SATé

Insights from SATé

Insights from SATé

SEPP Parameter Exploration Alignment subset size and placement subset size impact the accuracy, running time, and memory of SEPP 10% rule (subset sizes 10% of backbone) had best overall performance

SEPP (10%-rule) on simulated data 0.0 0.0 Increasing rate of evolution

SEPP (10%) on Biological Data 16S.B.ALL dataset, 13k curated backbone tree, 13k total fragments

SEPP (10%) on Biological Data 16S.B.ALL dataset, 13k curated backbone tree, 13k total fragments For 1 million fragments: PaPaRa+pplacer: ~133 days HMMALIGN+pplacer: ~30 days SEPP 1000/1000: ~6 days

Three “Boosters” SATé: co-estimation of alignments and trees DACTAL: large trees without full alignments SEPP: phylogenetic analysis of fragmentary data Algorithmic strategies: divide-and-conquer and iteration to improve the accuracy and scalability of a base method

Red gene tree ≠ species tree (green gene tree okay)

Multi-marker species tree estimation Species phylogenies are estimated using multiple gene trees. Most methods assume that all gene trees are identical to the species tree. This is known to be unrealistic in some situations, due to processes such as Deep Coalescence Gene duplication and loss Horizontal gene transfer MDC problem: Given set of gene trees, find a species tree that minimizes the total number of “deep coalescences”.

Yu, Warnow and Nakhleh, 2011 Previous software for MDC assumed all gene trees are correct, completely resolved, and rooted. Our methods allow for error in estimated gene trees. We provide exact algorithms and heuristics to find an optimal species tree with respect to a given set of partially resolved, unrooted gene trees, minimizing the total number of deep coalescences. Software at http://bioinfo.cs.rice.edu/phylonet/ To appear, RECOMB 2011 and J. Computational Biology, special issue for RECOMB 2011. Talk about this topic today at 2 PM in OEB.

Markov Model of Site Evolution Simplest (Jukes-Cantor): The model tree T is binary and has substitution probabilities p(e) on each edge e. The state at the root is randomly drawn from {A,C,T,G} (nucleotides) If a site (position) changes on an edge, it changes with equal probability to each of the remaining states. The evolutionary process is Markovian. More complex models (such as the General Markov model) are also considered, often with little change to the theory.

SATé-I vs. SATé-II SATé-II Faster and more accurate than SATé-I Longer analyses or use of ML to select tree/alignment pair slightly better results

Divergence & Information Content Average Pairwise Sequence Divergence Percent Informative Sites 100 50 25 75 0.00 0.05 0.10 0.15 0.20 0.25 0.30 Exons UTRs Introns The rate of evolution is important, because as sequence divergence increases, so does the amount of phylogenetic information the sequences can contain. Protein-coding exons, which rarely reach average pairwise sequence divergence above 10% in our data, generally have between 15 and 30% of sites that are phylogenetically informative. For introns, with 10-30% sequence divergence, the percentage of informative sites is much higher. The UTRs we have looked at tend to fall in between. Analysis and figure provided by Mike Braun Smithsonian Institution 67

Reticulate evolution Not all evolution is tree-like: Horizontal gene transfer Hybrid speciation How can we detect reticulate evolution?

U V W X Y AGGGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT X U Y V W

Two-phase estimation Phylogeny methods Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining FastME UPGMA Quartet puzzling Etc. Alignment methods Clustal POY (and POY*) Probcons (and Probtree) Probalign MAFFT Muscle Di-align T-Coffee Prank (PNAS 2005, Science 2008) Opal (ISMB and Bioinf. 2007) FSA (PLoS Comp. Bio. 2009) Infernal (Bioinf. 2009) Etc. RAxML: heuristic for large-scale ML optimization 71

Software In use by research groups around the world Kansas SATé software developers: Mark Holder, Jiaye Yu, and Jeet Sukumaran Downloadable software for various platforms Easy-to-use GUI http://phylo.bio.ku.edu/software/sate/sate.html

Quantifying Error FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FP

Understanding SATé Observations: (1) subsets of taxa that are small enough, closely related, and densely sampled are aligned more accurately than others. SATé-1 produces subsets that are closely related and densely sampled, but not small enough. SATé-2 (“next SATé”) changes the design to produce smaller subproblems. The next iteration starts with a more accurate tree. This leads to a better alignment, and a better tree.

Biology: 21st Century Science! “When the human genome was sequenced seven years ago, scientists knew that most of the major scientific discoveries of the 21st century would be in biology.” January 1, 2008, guardian.co.uk

Genome Sequencing Projects: Started with the Human Genome Project

Whole Genome Sequencing: Graph Algorithms and Combinatorial Optimization!

Other Genome Projects! (Neandertals, Wooly Mammoths, and more ordinary creatures…)

Phylogeny (evolutionary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona