Mathematical and Computational Challenges in Reconstructing Evolution

Slides:



Advertisements
Similar presentations
CS 598AGB What simulations can tell us. Questions that simulations cannot answer Simulations are on finite data. Some questions (e.g., whether a method.
Advertisements

A Separate Analysis Approach to the Reconstruction of Phylogenetic Networks Luay Nakhleh Department of Computer Sciences UT Austin.
Lecture 16: Wrap-Up COMP 538 Introduction of Bayesian networks.
Estimating species trees from multiple gene trees in the presence of ILS Tandy Warnow Joint work with Siavash Mirarab, Md. S. Bayzid, and others.
Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.
From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.
Benjamin Loyle 2004 Cse 397 Solving Phylogenetic Trees Benjamin Loyle March 16, 2004 Cse 397 : Intro to MBIO.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
Statistical stuff: models, methods, and performance issues CS 394C September 16, 2013.
Orangutan GorillaChimpanzee Human From the Tree of the Life Website, University of Arizona Species Tree.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
BBCA: Improving the scalability of *BEAST using random binning Tandy Warnow The University of Illinois at Urbana-Champaign Co-authors: Theo Zimmermann.
Constructing the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.
598AGB Basics Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.
The Mathematics of Estimating the Tree of Life Tandy Warnow The University of Illinois.
CS 598 AGB Supertrees Tandy Warnow. Today’s Material Supertree construction: given set of trees on subsets of S (the full set of taxa), construct tree.
Bioinf.cs.auckland.ac.nz Juin 2008 Uncorrelated and Autocorrelated relaxed phylogenetics Michaël Defoin-Platel and Alexei Drummond.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
Statistical stuff: models, methods, and performance issues CS 394C September 3, 2009.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.
CS 466 and BIOE 498: Introduction to Bioinformatics
Constrained Exact Optimization in Phylogenetics
Advances in Ultra-large Phylogeny Estimation
Phylogenetic basis of systematics
New Approaches for Inferring the Tree of Life
394C, Spring 2012 Jan 23, 2012 Tandy Warnow.
CS 581 / BIOE 540: Algorithmic Computational Genomics
Statistical tree estimation
Chalk Talk Tandy Warnow
Distance based phylogenetics
Multiple Sequence Alignment Methods
Tandy Warnow Department of Computer Sciences
Challenges in constructing very large evolutionary trees
Algorithm Design and Phylogenomics
CIPRES: Enabling Tree of Life Projects
Tandy Warnow The University of Illinois
Large-Scale Multiple Sequence Alignment
Mathematical and Computational Challenges in Reconstructing Evolution
Absolute Fast Converging Methods
CS 581 Tandy Warnow.
CS 581 Tandy Warnow.
CS 581 Algorithmic Computational Genomics
Tandy Warnow Founder Professor of Engineering
Tandy Warnow Department of Computer Sciences
New methods for simultaneous estimation of trees and alignments
Texas, Nebraska, Georgia, Kansas
Ultra-Large Phylogeny Estimation Using SATé and DACTAL
CS 394C: Computational Biology Algorithms
September 1, 2009 Tandy Warnow
Taxonomic identification and phylogenetic profiling
Algorithms for Inferring the Tree of Life
Tandy Warnow The University of Texas at Austin
Tandy Warnow The University of Texas at Austin
New methods for simultaneous estimation of trees and alignments
New methods for estimating species trees from gene trees
Advances in Phylogenomic Estimation
Advances in Phylogenomic Estimation
Scaling Species Tree Estimation to Large Datasets
Presentation transcript:

Mathematical and Computational Challenges in Reconstructing Evolution Tandy Warnow The University of Illinois

Phylogeny (evolutionary tree) Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

“Nothing in biology makes sense except in the light of evolution” Theodosius Dobzhansky, 1973 essay in the American Biology Teacher, vol. 35, pp 125-129 “…... nothing in evolution makes sense except in the light of phylogeny ...” Society of Systematic Biologists, http://systbio.org/teachevolution.html

This Talk Models of evolution, identifiability, statistical consistency Genome-scale phylogeny Open questions Future directions

DNA Sequence Evolution (Idealized) -3 mil yrs -2 mil yrs -1 mil yrs today AAGACTT TGGACTT AAGGCCT AGGGCAT TAGCCCT AGCACTT AGCGCTT AGCACAA TAGACTT TAGCCCA AAGACTT TGGACTT AAGGCCT AGGGCAT TAGCCCT AGCACTT AAGGCCT TGGACTT TAGCCCA TAGACTT AGCGCTT AGCACAA AGGGCAT TAGCCCT AGCACTT

Markov Models of Sequence Evolution The different sites are assumed to evolve i.i.d. down the model tree (with rates that are drawn from a gamma distribution).

Markov Models of Sequence Evolution The different sites are assumed to evolve i.i.d. down the model tree (with rates that are drawn from a gamma distribution). Simplest site evolution model (Jukes-Cantor, 1969): The model tree T is binary and has substitution probabilities p(e) on each edge e, with 0<p(e)<3/4. The state at the root is randomly drawn from {A,C,T,G} (nucleotides) If a site (position) changes on an edge, it changes with equal probability to each of the remaining states. The evolutionary process is Markovian. More complex models (such as the General Markov model) are also considered, often with little change to the theory.

Statistical Consistency error Data

Questions Is the model tree identifiable? Which estimation methods are statistically consistent under this model? How much data does the method need to estimate the model tree correctly (with high probability)?

Answers? We know a lot about which site evolution models are identifiable, and which methods are statistically consistent. We know a little bit about the sequence length requirements for standard methods. Take home message: need to limit (or not allow) heterogeneity to get identifiability!

Genome-scale data? error Data

Phylogeny + genomics = genome-scale phylogeny estimation .

Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity

Gene trees inside the species tree (Coalescent Process) Past Present Courtesy James Degnan Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.

Gene trees inside the species tree (Coalescent Process) Deep coalescence = INCOMPLETE LINEAGE SORTING (ILS): gene tree can be different from the species tree Past Present Courtesy James Degnan Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.

1KP: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iPlant T. Warnow, S. Mirarab, N. Nguyen UT-Austin UT-Austin UT-Austin 103 plant transcriptomes, 400-800 single copy “genes” Next phase will be much bigger Wickett, Mirarab et al., PNAS 2014 Major Challenge: Massive gene tree heterogeneity consistent with ILS

Major challenge: Massive gene tree heterogeneity consistent with ILS.

Statistically consistent methods Coalescent-based summary methods: Estimate gene trees, and then combine together (ASTRAL, ASTRID, MP-EST, NJst, and others) Co-estimation methods: Co-estimate gene trees and species trees (TOO EXPENSIVE) Site-based methods: estimate the species tree from the concatenated alignment, and do not estimate gene trees (NOT WELL STUDIED)

Main competing approaches gene 1 gene 2 . . . gene k . . . Species Concatenation . . . Analyze separately point out that supertree methods take overlaping trees and produce a tree, and that the whole process of first generating small trees and then applying a supertree method is often referred to as the “supertree approach”. Summary Method

What about summary methods? . . . point out that supertree methods take overlaping trees and produce a tree, and that the whole process of first generating small trees and then applying a supertree method is often referred to as the “supertree approach”.

What about summary methods? . . . Techniques: Most frequent gene tree? Consensus of gene trees? Other? point out that supertree methods take overlaping trees and produce a tree, and that the whole process of first generating small trees and then applying a supertree method is often referred to as the “supertree approach”.

Species tree estimation from unrooted gene trees Theorem (Allman et al.): Under the multi-species coalescent model, for any four taxa A, B, C, and D, the most probable unrooted gene tree on {A,B,C,D} is identical to the unrooted species tree induced on {A,B,C,D}. point out that supertree methods take overlaping trees and produce a tree, and that the whole process of first generating small trees and then applying a supertree method is often referred to as the “supertree approach”.

Species tree estimation from unrooted gene trees Theorem (Allman et al.): Under the multi-species coalescent model, for any four taxa A, B, C, and D, the most probable unrooted gene tree on {A,B,C,D} is identical to the unrooted species tree induced on {A,B,C,D}. Proof: For every four species, select most frequently observed tree as the species tree. Then combine quartet trees! point out that supertree methods take overlaping trees and produce a tree, and that the whole process of first generating small trees and then applying a supertree method is often referred to as the “supertree approach”.

Species tree estimation from unrooted gene trees Theorem (Allman et al.): Under the multi-species coalescent model, for any four taxa A, B, C, and D, the most probable unrooted gene tree on {A,B,C,D} is identical to the unrooted species tree induced on {A,B,C,D}. Proof: For every four species, select most frequently observed tree as the species tree. Then combine quartet trees! point out that supertree methods take overlaping trees and produce a tree, and that the whole process of first generating small trees and then applying a supertree method is often referred to as the “supertree approach”.

Constrained Maximum Quartet Support Tree Input: Set T = {t1,t2,…,tk} of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipartitions Output: Unrooted tree T on leafset S, maximizing the total quartet tree similarity to T, subject to T drawing its bipartitions from X. Theorems (Mirarab et al., 2014): If X contains the bipartitions from the input gene trees (and perhaps others), then an exact solution to this problem is statistically consistent under the MSC. The constrained MQST problem can be solved in O(|X|2nk) time. (We use dynamic programming, and build the unrooted tree from the bottom-up, based on “allowed clades” – halves of the allowed bipartitions.) Conjecture: MQST is NP-hard

Constrained Maximum Quartet Support Tree Input: Set T = {t1,t2,…,tk} of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipartitions Output: Unrooted tree T on leafset S, maximizing the total quartet tree similarity to T, subject to T drawing its bipartitions from X. Theorems (Mirarab et al., 2014): If X contains the bipartitions from the input gene trees (and perhaps others), then an exact solution to this problem is statistically consistent under the MSC. The constrained MQST problem can be solved in O(|X|2nk) time. (We use dynamic programming, and build the unrooted tree from the bottom-up, based on “allowed clades” – halves of the allowed bipartitions.) Conjecture: MQST is NP-hard

Constrained Maximum Quartet Support Tree Input: Set T = {t1,t2,…,tk} of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipartitions Output: Unrooted tree T on leafset S, maximizing the total quartet tree similarity to T, subject to T drawing its bipartitions from X. Theorems (Mirarab et al., 2014): If X contains the bipartitions from the input gene trees (and perhaps others), then an exact solution to this problem is statistically consistent under the MSC. The constrained MQST problem can be solved in O(|X|2nk) time. (We use dynamic programming, and build the unrooted tree from the bottom-up, based on “allowed clades” – halves of the allowed bipartitions.)

Used SimPhy, Mallo and Posada, 2015

Accuracy in the presence of HGT + ILS Davidson et al., RECOMB-CG, BMC Genomics 2015

Impact of Gene Tree Estimation Error (from Molloy and Warnow 2017) Error is fraction of bipartitions that are not recovered Summary Methods Site-based Method

Impact of Gene Tree Estimation Error (from Molloy and Warnow 2017) Note: Summary methods better than CA-ML for low GTEE, then worse! Error is fraction of bipartitions that are not recovered Summary Methods Site-based Method

Statistical Consistency for summary methods error Data Data are gene trees, presumed to be randomly sampled true gene trees.

Gene tree estimation error: key issue in the debate Multiple studies show that summary methods can be less accurate than concatenation in the presence of high gene tree estimation error. Genome-scale data includes a range of markers, not all of which have substantial signal. Furthermore, removing sites due to model violations reduces signal. Some researchers also argue that “gene trees” should be based on very short alignments, to avoid intra-locus recombination.

What about performance on bounded number of sites? Question #1: Do any summary methods converge to the species tree as the number of loci increase, but where each locus has only a constant number of sites? Answer #1: Roch & Warnow, Syst Biol, March 2015: Strict molecular clock: Yes for some new methods, even for a single site per locus No clock: Unknown for all methods, including MP-EST, ASTRAL, etc. S. Roch and T. Warnow. "On the robustness to gene tree estimation error (or lack thereof) of coalescent-based species tree methods", Systematic Biology, 64(4):663-676, 2015

What about performance on bounded number of sites? Question #1: Do any summary methods converge to the species tree as the number of loci increase, but where each locus has only a constant number of sites? Answer #2: Roch, Nute, & Warnow, accepted to Syst Biol. No! Summary methods are not only not consistent, they can be positively misleading! (Felsenstein Zone) S. Roch, M. Nute, and T. Warnow. "Long-branch attraction in species tree estimation: inconsistency of partitioned likelihood and topology-based summary methods.” arXiv:1803.02800

What about performance on bounded number of sites? Question #2: What about concatenation using maximum likelihood? Answer: Roch, Nute, & Warnow, accepted to Syst Biol. Not if fully partitioned! Concatenation using maximum likelihood, if fully partitioned is also not consistent and can be positively misleading (even if there is NO ILS)! (Felsenstein Zone) S. Roch, M. Nute, and T. Warnow. "Long-branch attraction in species tree estimation: inconsistency of partitioned likelihood and topology-based summary methods.” arXiv:1803.02800

Statistically consistent methods Coalescent-based summary methods: Estimate gene trees, and then combine together (ASTRAL, ASTRID, MP-EST, NJst, and others) Co-estimation methods: Co-estimate gene trees and species trees (TOO EXPENSIVE) Site-based methods: estimate the species tree from the concatenated alignment, and do not estimate gene trees (NOT WELL STUDIED)

Statistically consistent methods (??) Coalescent-based summary methods: Estimate gene trees, and then combine together (ASTRAL, ASTRID, MP-EST, NJst, and others) Co-estimation methods: Co-estimate gene trees and species trees (TOO EXPENSIVE) Site-based methods: estimate the species tree from the concatenated alignment, and do not estimate gene trees (NOT WELL STUDIED)

Future Directions Theory: Determine which species tree estimation methods are statistically consistent when the number of sites per locus is bounded, and heterogeneity between loci is not constrained Practice: Understand why concatenation performs well, and develop better coalescent-based methods

NP-hard optimization problems and large datasets Statistical estimation under stochastic models of evolution Probabilistic analysis of algorithms Graph-theoretic divide-and-conquer Chordal graph theory Combinatorial optimization

Acknowledgments Roch and Warnow, Systematic Biology 2014 Mirarab and Warnow, Bioinformatics 2015 Molloy and Warnow, Systematic Biology 2017 Papers available at http://tandy.cs.illinois.edu/papers.html Funding: NSF, Grainger Foundation, and HHMI (to SM).