New methods for simultaneous estimation of trees and alignments

Presentation transcript:

New methods for simultaneous estimation of trees and alignments. Tandy Warnow, The University of Texas at Austin

How did life evolve on earth? An international effort to understand how life evolved on earth. Biomedical applications: drug design, protein structure and function prediction, biodiversity. (Courtesy of the Tree of Life project.)

DNA Sequence Evolution. [Figure: DNA sequences evolving down a tree by substitutions, with a time axis marked -3 mil yrs, -2 mil yrs, -1 mil yrs, today. Sequences shown: AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, AGCGCTT, AGCACAA, TAGACTT, TAGCCCA.]

[Figure: leaf sequences U = AGGGCAT, V = TAGCCCA, W = TAGACTT, X = TGCACAA, Y = TGCGCTT, and a tree with leaves drawn in the order X, U, Y, V, W.]

Standard Markov models: Sequences evolve just with substitutions. Sites (i.e., positions) evolve identically and independently, and have “rates of evolution” that are drawn from a common distribution (typically gamma). Numerical parameters describe the probability of substitutions of each type on each edge of the tree.

Quantifying Error. FN: false negative (missing edge). FP: false positive (incorrect edge). [Figure: a true tree and an estimated tree, with a 50% error rate.]
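The FN and FP rates above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used in the talk: it assumes each tree is given as its set of internal edges, with every edge written as the set of leaves on one side of it (a bipartition), and the names `normalize` and `error_rates` are made up for this example.

```python
def normalize(split, leaves):
    """Represent a bipartition by the side that excludes the smallest leaf name,
    so the same edge compares equal whichever side was written down."""
    side = frozenset(split)
    anchor = min(leaves)
    return frozenset(leaves) - side if anchor in side else side


def error_rates(true_splits, est_splits, leaves):
    """Return (FN rate, FP rate) for an estimated tree against the true tree.

    FN rate: fraction of true-tree edges missing from the estimated tree.
    FP rate: fraction of estimated-tree edges not present in the true tree.
    Both trees are assumed to have at least one internal edge.
    """
    true_set = {normalize(s, leaves) for s in true_splits}
    est_set = {normalize(s, leaves) for s in est_splits}
    fn = len(true_set - est_set) / len(true_set)
    fp = len(est_set - true_set) / len(est_set)
    return fn, fp


leaves = ["U", "V", "W", "X", "Y"]
true_tree = [{"U", "V"}, {"X", "Y"}]   # ((U,V),W,(X,Y))
est_tree = [{"U", "V"}, {"W", "X"}]    # ((U,V),Y,(W,X))
print(error_rates(true_tree, est_tree, leaves))  # (0.5, 0.5)
```

In this toy case one of the two true edges is missing and one of the two estimated edges is wrong, which gives the 50% error rate in both directions.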

DCM1-boosting distance-based methods [Nakhleh et al. ISMB 2001]. Theorem: DCM1-NJ converges to the true tree from polynomial length sequences. [Figure: error rate (axis from 0 to 0.8) against number of taxa (400 to 1600) for NJ and DCM1-NJ.]

Maximum Likelihood (ML). Given: a set S of aligned DNA sequences, and a parametric model of sequence evolution. Objective: find a tree T and numerical parameter values (e.g., substitution probabilities) so as to maximize the probability of the data. The problem is NP-hard. ML is statistically consistent for standard models if solved exactly.

But solving this problem exactly is … unlikely

# of Taxa    # of Unrooted Trees
4            3
5            15
6            105
7            945
8            10395
9            135135
10           2027025
20           2.2 x 10^20
100          4.5 x 10^190
1000         2.7 x 10^2900
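The small entries in the table can be reproduced with the standard count of unrooted binary (fully resolved) leaf-labeled trees, (2n-5)!! = 3 x 5 x ... x (2n-5). The sketch below assumes that is the quantity being tabulated; the function name is made up for this example.

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary leaf-labeled trees on n >= 3 taxa: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count


for n in range(4, 11):
    print(n, num_unrooted_binary_trees(n))
```

The loop prints the rows for 4 through 10 taxa; the count grows faster than exponentially, which is why exhaustive search over tree space is out of reach.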

Fast ML heuristics: RAxML (Stamatakis), with bootstrapping; GARLI (Zwickl); Rec-I-DCM3 boosting (Roshan et al.) of RAxML, to allow analyses of datasets with thousands of sequences. All are available on the CIPRES portal (http://www.phylo.org).

We have excellent maximum likelihood software, and we have excellent mathematical theory about estimation under Markov models of evolution. Is phylogenetic estimation “solved”?

[Figure: leaf sequences of unequal length, U = AGTGGAT, V = TATGCCCA, W = TATGACTT, X = AGCCCTA, Y = AGCCCGCTT, and a tree on the leaves U, V, W, X, Y.]

How can we reconstruct phylogenies from sequences of unequal length? Phylogenetic reconstruction methods assume the sequences all have the same length. Standard models of sequence evolution used in maximum likelihood and Bayesian analyses assume sequences evolve only via substitutions, producing sequences of equal length. And yet, almost all nucleotide datasets evolve with insertions and deletions (“indels”), producing datasets that violate these models and methods.

Roadmap for Today: How it’s currently done. How it might be done. How we’re doing it (and how well). Where we’re going with it.

Indels and substitutions at the DNA level. [Figure: a deletion and a mutation (substitution) turn …ACGGTGCAGTTACCA… into …ACCAGTCACCA…]

After the deletion and the mutation, …ACGGTGCAGTTACCA… has become …ACCAGTCACCA…. The true pairwise alignment is:
…ACGGTGCAGTTACCA…
…AC----CAGTCACCA…
The true multiple alignment on a set of homologous sequences is obtained by tracing their evolutionary history, and extending the pairwise alignments on the edges to a multiple alignment on the leaf sequences.


Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Phase 1: Multiple Sequence Alignment
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
is aligned as
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
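A multiple alignment only inserts gaps: every row must have the same length, and deleting the gaps from a row must give back the input sequence. The check below is a small sketch of that property applied to the four sequences on this slide; the function name `is_alignment_of` is made up for this example.

```python
GAP = "-"


def is_alignment_of(unaligned, aligned):
    """True if `aligned` is a valid multiple alignment of `unaligned`.

    Both arguments map sequence names to strings. The rows must cover the same
    names, share one common length, and reduce to the inputs once gaps are removed.
    """
    if set(unaligned) != set(aligned):
        return False
    if len({len(row) for row in aligned.values()}) != 1:
        return False
    return all(aligned[name].replace(GAP, "") == unaligned[name] for name in unaligned)


unaligned = {
    "S1": "AGGCTATCACCTGACCTCCA",
    "S2": "TAGCTATCACGACCGC",
    "S3": "TAGCTGACCGC",
    "S4": "TCACGACCGACA",
}
aligned = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}
print(is_alignment_of(unaligned, aligned))  # True
```

The check says nothing about whether the alignment is accurate; it only confirms that the rows are a gapped version of the inputs.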

Phase 2: Construct tree
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
[Figure: the tree estimated from this alignment, with leaves drawn in the order S1, S2, S4, S3.]

So many methods!!!
Alignment methods: Clustal, POY (and POY*), Probcons (and Probtree), MAFFT, Prank, Muscle, Di-align, T-Coffee, Satchmo, etc.
Phylogeny methods: Bayesian MCMC, maximum parsimony, maximum likelihood, neighbor joining, UPGMA, quartet puzzling, etc.
Color key on the slide: blue = used by systematists; purple = recommended by the protein research community (Edgar and Batzoglou, for protein alignments).

Basic Questions Does improving the alignment lead to an improved phylogeny? Are we getting good enough alignments from MSA methods? (In particular, is ClustalW - the usual method used by systematists - good enough?) Are we getting good enough trees from the phylogeny reconstruction methods? Can we improve these estimations, perhaps through simultaneous estimation of trees and alignments?

Easy Sequence Alignment
B_WEAU160    ATGGAAAACAGATGGCAGGTGATGATTGTGTGGCAAGTAGACAGG 45
A_U455       .............................A.....G......... 45
A_IFA86      ...................................G......... 45
A_92UG037    ...................................G......... 45
A_Q23        ...................C...............G......... 45
B_SF2        ............................................. 45
B_LAI        ............................................. 45
B_F12        ............................................. 45
B_HXB2R      ............................................. 45
B_LW123      ............................................. 45
B_NL43       ............................................. 45
B_NY5        ............................................. 45
B_MN         ............C........................C....... 45
B_JRCSF      ............................................. 45
B_JRFL       ............................................. 45
B_NH52       ........................G.................... 45
B_OYI        ............................................. 45
B_CAM1       ............................................. 45

Harder Sequence Alignment
B_WEAU160    ATGAGAGTGAAGGGGATCAGGAAGAATTATCAGCACTTG 39
A_U455       ..........T......ACA..G........CTTG.... 39
A_SF1703     ..........T......ACA..T...C.G...AA....A 39
A_92RW020.5  ......G......ACA..C..G..GG..AA..... 35
A_92UG031.7  ......G.A....ACA..G.....GG........A 35
A_92UG037.8  ......T......AGA..G........CTTG..G. 35
A_TZ017      ..........G..A...G.A..G............A..A 39
A_UG275A     ....A..C..T.....CACA..T.....G...AA...G. 39
A_UG273A     .................ACA..G.....GG......... 39
A_DJ258A     ..........T......ACA...........CA.T...A 39
A_KENYA      ..........T.....CACA..G.....G.........A 39
A_CARGAN     ..........T......ACA............A...... 39
A_CARSAS     ................CACA.........CTCT.C.... 39
A_CAR4054    .............A..CACA..G.....GG..CA..... 39
A_CAR286A    ................CACA..G.....GG..AA..... 39
A_CAR4023    .............A.---------..A............ 30
A_CAR423A    .............A.---------..A............ 30
A_VI191A     .................ACA..T.....GG..A...... 39

Simulation study 100 taxon model trees (generated by r8s and then modified, so as to deviate from the molecular clock). DNA sequences evolved under ROSE (indel events of blocks of nucleotides, plus HKY site evolution). The root sequence has 1000 sites. We varied the gap length distribution, probability of gaps, and probability of substitutions, to produce 8 model conditions: models 1-4 have “long gaps” and 5-8 have “short gaps”. We estimated maximum likelihood trees (using RAxML) on various alignments (including the true alignment). We evaluated estimated trees for topological accuracy using the Missing Edge rate.

DNA sequence evolution Simulation using ROSE: 100 taxon model trees, models 1-4 have “long gaps”, and 5-8 have “short gaps”, site substitution is HKY+Gamma


Two problems with two-phase methods: (1) All current methods for multiple alignment have high error rates when sequences evolve with many indels and substitutions. (2) All current methods for phylogeny estimation treat indel events inadequately (either treating them as missing data, or giving too much weight to each gap).

What about “simultaneous estimation”? [Figure: the unaligned sequences AGTGGAT, TATGCCCA, TATGACTT, AGCCCTA, AGCCCGCTT and a tree on the leaves U, V, W, X, Y.]

Simultaneous Estimation. Statistical methods (e.g., AliFritz and BaliPhy) cannot be applied to datasets above ~20 sequences. POY attempts to solve the NP-hard “minimum treelength” problem, and can be applied to larger datasets. POY is somewhat equivalent to maximum parsimony. It is sensitive to gap treatment, but even with very good gap treatments it is only comparable to good two-phase methods in accuracy (and not as accurate as the better ones), and it takes a long time to reach local optima.

Goals. Current: methods for simultaneous estimation of trees and alignments which produce more accurate phylogenies and multiple alignments on difficult-to-align markers, which can analyze large datasets (tens of thousands of sequences) quickly, and which run on a desktop computer. As a consequence, increase the set of markers that can be used in phylogenetic studies. Long term: develop a maximum likelihood method for simultaneous estimation of alignments and trees, incorporating insertions and deletions in the model.

SATé (Simultaneous Alignment and Tree Estimation). Developers: Liu, Nelesen, Raghavan, Linder, and Warnow. Search strategy: searches through tree space, and realigns the sequences on each tree using a novel divide-and-conquer approach. Optimization criterion: the alignment/tree pair that optimizes maximum likelihood under GTR+Gamma+I. Unpublished (but to be submitted shortly).

SATé Algorithm (unpublished). SATé keeps track of the maximum likelihood scores of the tree/alignment pairs it generates, and returns the best pair it finds. [Figure: a cycle of steps. Obtain an initial alignment and an estimated ML tree T. Use the tree T to compute a new alignment A. Estimate an ML tree on the new alignment A.]
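The control flow described on this slide can be sketched as a loop that alternates realignment and tree estimation while remembering the best-scoring pair. This is only an outline under stated assumptions, not SATé itself: `realign`, `estimate_tree`, and `score` are hypothetical callables supplied by the caller (in practice they would wrap the divide-and-conquer realignment, an ML tree search, and the ML score), the loop always continues from the most recent pair, and the fixed iteration count stands in for whatever stopping rule the real software uses.

```python
def iterate_tree_and_alignment(alignment, tree, realign, estimate_tree, score, iterations):
    """Alternate 'realign on the current tree' and 'estimate a tree on the new
    alignment', and return the best-scoring (score, alignment, tree) seen."""
    best = (score(alignment, tree), alignment, tree)
    for _ in range(iterations):
        alignment = realign(tree)          # use the current tree to compute a new alignment
        tree = estimate_tree(alignment)    # estimate a tree on the new alignment
        current = score(alignment, tree)
        if current > best[0]:              # keep track of the best pair found so far
            best = (current, alignment, tree)
    return best


# Toy stand-ins: integers play the role of alignments and trees,
# and the score peaks when the "tree" equals 10.
result = iterate_tree_and_alignment(
    alignment=0,
    tree=1,
    realign=lambda t: t + 1,
    estimate_tree=lambda a: a * 2,
    score=lambda a, t: -abs(t - 10),
    iterations=3,
)
print(result)  # (0, 5, 10)
```

In the toy run the third iteration scores worse than the second, so the pair from the second iteration is the one returned, which mirrors the "returns the best pair it finds" behaviour.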

Simulation study using ROSE: 100, 500, and 1000 sequences. The sequence at the root has 1000 sites. The model of evolution is GTR+Gamma+indels. Three gap length distributions (short, medium, and long). Varying rates of substitution and indels.

Results: 100 taxon simulated datasets. [Figure panels: missing edge rates; alignment error rates (SP-FN); empirical statistics.]

Results: 500 taxon simulated datasets. [Figure panels: missing edge rates; alignment error rates (SP-FN); empirical statistics.]

Results: 1000 taxon simulated datasets. [Figure panels: missing edge rates; alignment error rates (SP-FN); empirical statistics.]
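The slides do not define SP-FN. The sketch below reads it as the sum-of-pairs false-negative rate: the fraction of residue pairs placed in the same column of the true alignment that are not placed in the same column of the estimated alignment. That reading, and the function names, are assumptions made for this example.

```python
from itertools import combinations


def homologous_pairs(alignment, gap="-"):
    """Set of residue pairs that share a column.

    `alignment` maps names to equal-length gapped rows; a residue is identified
    by (name, index of the residue in the ungapped sequence).
    """
    names = sorted(alignment)
    next_index = {name: 0 for name in names}
    pairs = set()
    width = len(alignment[names[0]])
    for col in range(width):
        present = []
        for name in names:
            if alignment[name][col] != gap:
                present.append((name, next_index[name]))
                next_index[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sp_fn(true_alignment, estimated_alignment):
    """Fraction of true homologous pairs missing from the estimated alignment."""
    true_pairs = homologous_pairs(true_alignment)
    est_pairs = homologous_pairs(estimated_alignment)
    return len(true_pairs - est_pairs) / len(true_pairs)


true_aln = {"a": "AC-GT", "b": "ACCGT"}
est_aln = {"a": "ACGT-", "b": "ACCGT"}
print(sp_fn(true_aln, est_aln))  # 0.5
```

Here the estimated alignment puts the gap at the end instead of the middle, so two of the four true residue pairs are lost.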

Biological datasets. We used ML analyses of curated alignments (8 produced by Robin Gutell, others from the Early Bird ATOL project, and some from UT faculty). We computed several alignments and maximum likelihood trees on each alignment, as well as SATé trees and alignments. We compared the alignments and trees to the curated alignment and to the reference tree (the 75% bootstrap ML tree on the curated alignment).

Asteraceae ITS. The curated alignment consists of 328 ITS sequences drawn from the Asteraceae family (Goertzen et al. 2003). Empirical statistics: 36% ANHD, 79% MNHD, 23% gapped.

Conclusions. SATé produces trees and alignments that improve upon the best two-phase methods for “hard to align” datasets, and can do so in reasonable time frames (24 hours) on desktop computers. Further improvement is obtained with longer analyses. We conjecture that better results would be obtained by ML under models that include indel processes (ongoing work).

Acknowledgements. Funding: NSF, The Program in Evolutionary Dynamics at Harvard, and The Institute for Cellular and Molecular Biology at UT-Austin. Collaborators: Randy Linder (Integrative Biology, UT-Austin). Students: Kevin Liu, Serita Nelesen, and Sindhu Raghavan.