New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.

Slides:



Advertisements
Similar presentations
CS 598AGB What simulations can tell us. Questions that simulations cannot answer Simulations are on finite data. Some questions (e.g., whether a method.
Advertisements

A Separate Analysis Approach to the Reconstruction of Phylogenetic Networks Luay Nakhleh Department of Computer Sciences UT Austin.
Computer Science and Reconstructing Evolutionary Trees Tandy Warnow Department of Computer Science University of Illinois at Urbana-Champaign.
Computational biology and computational biologists Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular and Molecular Biology.
Phylogenomics Symposium and Software School Tandy Warnow Departments of Computer Science and Bioengineering The University of Illinois at Urbana-Champaign.
Estimating species trees from multiple gene trees in the presence of ILS Tandy Warnow Joint work with Siavash Mirarab, Md. S. Bayzid, and others.
Recent breakthroughs in mathematical and computational phylogenetics
Supertrees and the Tree of Life
CS/BIOE 598: Algorithmic Computational Genomics Tandy Warnow Departments of Bioengineering and Computer Science
New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.
Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance Tandy Warnow Department of Computer Sciences University of Texas.
From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.
Introduction to Phylogenomics and Metagenomics Tandy Warnow The Department of Computer Science The University of Texas at Austin.
Phylogenomics Symposium and Software School Co-Sponsored by the SSB and NSF grant
Computational Phylogenomics and Metagenomics Tandy Warnow Departments of Bioengineering and Computer Science The University of Illinois at Urbana-Champaign.
Software for Scientists Tandy Warnow Department of Computer Science University of Texas at Austin.
From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.
Ultra-large Multiple Sequence Alignment Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
Estimating Species Tree from Gene Trees by Minimizing Duplications
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science.
Orangutan GorillaChimpanzee Human From the Tree of the Life Website, University of Arizona Species Tree.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.
Understanding sets of trees CS 394C September 10, 2009.
Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
394C, October 2, 2013 Topics: Multiple Sequence Alignment
BBCA: Improving the scalability of *BEAST using random binning Tandy Warnow The University of Illinois at Urbana-Champaign Co-authors: Theo Zimmermann.
TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science The University of Texas at Austin.
Constructing the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.
Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.
SupreFine, a new supertree method Shel Swenson September 17th 2009.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
The Big Issues in Phylogenetic Reconstruction Randy Linder Integrative Biology, University of Texas
The Mathematics of Estimating the Tree of Life Tandy Warnow The University of Illinois.
Three approaches to large- scale phylogeny estimation: SATé, DACTAL, and SEPP Tandy Warnow Department of Computer Science The University of Texas at Austin.
CS 598 AGB Supertrees Tandy Warnow. Today’s Material Supertree construction: given set of trees on subsets of S (the full set of taxa), construct tree.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
TIPP: Taxon Identification using Phylogeny-Aware Profiles Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign.
CS/BIOE 598: Algorithmic Computational Genomics Tandy Warnow Departments of Bioengineering and Computer Science
Algorithms for Ultra-large Multiple Sequence Alignment and Phylogeny Estimation Tandy Warnow Department of Computer Science The University of Texas at.
Progress and Challenges for Large-Scale Phylogeny Estimation Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.
Advancing Genome-Scale Phylogenomic Analysis Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for Genomic Biology.
394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.
CS 466 and BIOE 498: Introduction to Bioinformatics
Constrained Exact Optimization in Phylogenetics
CS 581 / BIOE 540: Algorithmic Computational Genomics
Chalk Talk Tandy Warnow
TIPP: Taxon Identification using Phylogeny-Aware Profiles
Multiple Sequence Alignment Methods
Mathematical and Computational Challenges in Reconstructing Evolution
Tandy Warnow The University of Illinois
Large-Scale Multiple Sequence Alignment
Mathematical and Computational Challenges in Reconstructing Evolution
Tandy Warnow Founder Professor of Engineering
CS 394C: Computational Biology Algorithms
Taxonomic identification and phylogenetic profiling
Algorithms for Inferring the Tree of Life
Advances in Phylogenomic Estimation
Advances in Phylogenomic Estimation
Scaling Species Tree Estimation to Large Datasets
Presentation transcript:

New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois

Orangutan GorillaChimpanzee Human From the Tree of the Life Website, University of Arizona Phylogeny (evolutionary tree)

Phylogenomics = genome-scale phylogeny estimation Note: Jonathan Eisen coined this term, but used it to mean something else.

Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly identify orthologs Compute multiple sequence alignment and “gene tree” for each locus Compute species tree or phylogenetic network from the gene trees or alignments Get statistical support on each branch (e.g., bootstrapping) Estimate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

Avian Phylogenomics Project E Jarvis, HHMI G Zhang, BGI Approx. 50 species, whole genomes genes, UCEs MTP Gilbert, Copenhagen S. Mirarab Md. S. Bayzid, UT-Austin UT-Austin T. Warnow UT-Austin Plus many many other people… Jarvis, Mirarab, et al., Science 2014

1KP: Thousand Transcriptome Project 1200 plant transcriptomes More than 13,000 gene families (most not single copy) G. Ka-Shu Wong U Alberta N. Wickett Northwestern J. Leebens-Mack U Georgia N. Matasci iPlant T. Warnow, S. Mirarab, N. Nguyen UT-Austin UT-Austin UT-Austin Wickett, Mirarab, et al., Proceedings of the National Academy of Sciences, 2014

Alternate title: “A Tale of Two Studies…” Tandy Warnow The University of Illinois

This talk Gene tree estimation and statistical consistency Gene tree conflict due to incomplete lineage sorting The multi-species coalescent model – Identifiability and statistical consistency The challenge of gene tree estimation error New methods for species tree estimation – Statistical Binning (Science 2014) – ASTRAL (ECCB 2014 and ISMB 2015)

Orangutan GorillaChimpanzee Human From the Tree of the Life Website, University of Arizona Phylogeny (evolutionary tree)

Orangutan GorillaChimpanzee Human From the Tree of the Life Website, University of Arizona Sampling multiple genes from multiple species

Gene Tree Incongruence Gene trees can differ from the species tree due to: – Duplication and loss – Horizontal gene transfer – Incomplete lineage sorting (ILS)

Gene trees inside the species tree (Coalescent Process) Present Past Courtesy James Degnan Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.

Incomplete Lineage Sorting (ILS) papers in 2013 alone Confounds phylogenetic analysis for many groups: – Hominids – Birds – Yeast – Animals – Toads – Fish – Fungi There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS.

... Analyze separately Summary Method Two competing approaches gene 1 gene 2... gene k... Concatenation Species

Statistically consistent under ILS? CA-ML (Concatenation using maximum likelihood) - NO MDC – NO GC (Greedy Consensus) – NO MRP (supertree method) – NO Coalescent-based summary methods: – MP-EST (Liu et al. 2010): maximum likelihood estimation of rooted species tree – YES – BUCKy-pop (Ané and Larget 2010): quartet-based Bayesian species tree estimation –YES Co-estimation methods: *BEAST (Heled and Drummond 2009): Bayesian co-estimation of gene trees and species trees - YES

Results on 11-taxon datasets with weak ILS * BEAST more accurate than summary methods (MP-EST, BUCKy, etc) CA-ML (concatenated analysis) most accurate Datasets from Chung and Ané, 2011 Bayzid & Warnow, Bioinformatics 2013

Impact of Gene Tree Estimation Error on MP-EST MP-EST has no error on true gene trees, but MP-EST has 9% error on estimated gene trees Datasets: 11-taxon strongILS conditions with 50 genes Similar results for other summary methods (MDC, Greedy, etc.).

Problem: poor gene trees Summary methods combine estimated gene trees, not true gene trees. The individual gene sequence alignments in the 11-taxon datasets have poor phylogenetic signal, and result in poorly estimated gene trees. Species trees obtained by combining poorly estimated gene trees have poor accuracy.

Problem: poor gene trees Summary methods combine estimated gene trees, not true gene trees. The individual gene sequence alignments in the 11-taxon datasets have poor phylogenetic signal, and result in poorly estimated gene trees. Species trees obtained by combining poorly estimated gene trees have poor accuracy.

Problem: poor gene trees Summary methods combine estimated gene trees, not true gene trees. The individual gene sequence alignments in the 11-taxon datasets have poor phylogenetic signal, and result in poorly estimated gene trees. Species trees obtained by combining poorly estimated gene trees have poor accuracy.

Summary methods combine estimated gene trees, not true gene trees. The individual gene sequence alignments in the 11-taxon datasets have poor phylogenetic signal, and result in poorly estimated gene trees. Species trees obtained by combining poorly estimated gene trees have poor accuracy. TYPICAL PHYLOGENOMICS PROBLEM: many poor gene trees

Summary methods combine estimated gene trees, not true gene trees. The individual gene sequence alignments in the 11-taxon datasets have poor phylogenetic signal, and result in poorly estimated gene trees. Species trees obtained by combining poorly estimated gene trees have poor accuracy. THIS IS A KEY ISSUE IN THE DEBATE ABOUT HOW TO COMPUTE SPECIES TREES

Avian Phylogenomics Project E Jarvis, HHMI G Zhang, BGI Approx. 50 species, whole genomes, 14,000 loci MTP Gilbert, Copenhagen S. Mirarab Md. S. Bayzid, UT-Austin UT-Austin T. Warnow UT-Austin Plus many many other people… Challenges: Concatenation analysis used >200 CPU years But also Massive gene tree conflict suggestive of ILS, and most gene trees had very low bootstrap support Coalescent-based analysis using MP-EST produced tree that conflicted with concatenation analysis

Avian Phylogenomics Project E Jarvis, HHMI G Zhang, BGI Approx. 50 species, whole genomes, 14,000 loci MTP Gilbert, Copenhagen S. Mirarab Md. S. Bayzid, UT-Austin UT-Austin T. Warnow UT-Austin Plus many many other people… Solution: New method to improve coalescent-based analysis: Statistical Binning (Mirarab, Bayzid, Boussau, and Warnow, Science 2014) Species tree estimated using Statistical Binning with MP-EST (Jarvis, Mirarab, et al., Science 2014)

Statistical binning vs. unbinned Mirarab, et al., Science 2014 Binning produces bins with approximate 5 to 7 genes each Datasets: 11-taxon strongILS datasets with 50 genes, Chung and Ané, Systematic Biology

Comparing Binned and Un-binned MP-EST on the Avian Dataset Unbinned MP-EST strongly rejects Columbea, a major finding by Jarvis, Mirarab,et al. Binned MP-EST is largely consistent with the ML concatenation analysis. The trees presented in Science 2014 were the ML concatenation and Binned MP-EST

Statistical Binning Statistical binning uses bootstrap support to group loci into “supergene” datasets, and computes supergene trees on these bins. These supergene trees can be used by a coalescent-based method to construct a tree. Weighted statistical binning is statistically consistent under the MSC; unweighted statistical binning is not! Simulations show both types of statistical binning improve accuracy of species tree topologies and branch lengths, under nearly all conditions we tested. (Exceptions: very small numbers of taxa with very high ILS.)

1KP: Thousand Transcriptome Project 103 plant transcriptomes (1200 in the next phase) single copy “genes” (more than 13,000 gene families, most not single copy, in the next phase) G. Ka-Shu Wong U Alberta N. Wickett Northwestern J. Leebens-Mack U Georgia N. Matasci iPlant T. Warnow, S. Mirarab, N. Nguyen UT-Austin UT-Austin UT-Austin Wickett, Mirarab, et al., Proceedings of the National Academy of Sciences, 2014 Challenges: Massive gene tree heterogeneity consistent with ILS Could not use MP-EST due to missing data (many gene trees could not be rooted) and large number of species

1KP: Thousand Transcriptome Project 103 plant transcriptomes (1200 in the next phase) single copy “genes” (more than 13,000 gene families, most not single copy, in the next phase) G. Ka-Shu Wong U Alberta N. Wickett Northwestern J. Leebens-Mack U Georgia N. Matasci iPlant T. Warnow, S. Mirarab, N. Nguyen UT-Austin UT-Austin UT-Austin Wickett, Mirarab, et al., Proceedings of the National Academy of Sciences, 2014 Solution: New coalescent-based method ASTRAL ASTRAL is statistically consistent, polynomial time, and uses unrooted gene trees.

Maximum Quartet Support Tree Input: Set T ={t 1,t 2,…,t k } of unrooted gene trees, with each tree on taxon set S, and set X of allowed bipartitions Output: Unrooted tree T on leafset S, maximizing the total quartet tree similarity to T Theorem: An exact solution to this problem is statistically consistent under the MSC. Conjecture: MQST is NP-hard

Scientific challenges: Ultra-large multiple-sequence alignment Alignment-free phylogeny estimation Supertree estimation Estimating species trees from many gene trees Genome rearrangement phylogeny Reticulate evolution Visualization of large trees and alignments Data mining techniques to explore multiple optima Theoretical guarantees under Markov models of evolution Applications to metagenomics, protein structure and function prediction, etc. Techniques: machine learning, applied probability theory, graph theory, combinatorial optimization, supercomputing, and heuristics The Tree of Life: Multiple Challenges

Acknowledgments PhD students: Siavash Mirarab* and Md. S. Bayzid** Funding: Guggenheim Foundation, NSF, David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), and Founder Professorship in Bioengineering. TACC and UTCS computational resources * Supported by HHMI Predoctoral Fellowship ** Supported by Fulbright Foundation Predoctoral Fellowship