Phylogenomics Symposium and Software School Tandy Warnow Departments of Computer Science and Bioengineering The University of Illinois at Urbana-Champaign.

Slides:

Advertisements

Similar presentations

Computer Science and Reconstructing Evolutionary Trees Tandy Warnow Department of Computer Science University of Illinois at Urbana-Champaign.

Advertisements

Computational biology and computational biologists Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular and Molecular Biology.

Multiple sequence alignment methods: evidence from data CS/BioE 598 Tandy Warnow.

Estimating species trees from multiple gene trees in the presence of ILS Tandy Warnow Joint work with Siavash Mirarab, Md. S. Bayzid, and others.

Recent breakthroughs in mathematical and computational phylogenetics

CS/BIOE 598: Algorithmic Computational Genomics Tandy Warnow Departments of Bioengineering and Computer Science

New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.

IPAM Workshop on Multiple Sequence Alignment Tandy Warnow The University of Illinois at Urbana-Champaign.

Complexity and The Tree of Life Tandy Warnow The University of Texas at Austin.

From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.

Introduction to Phylogenomics and Metagenomics Tandy Warnow The Department of Computer Science The University of Texas at Austin.

Phylogenomics Symposium and Software School Co-Sponsored by the SSB and NSF grant

Computational Phylogenomics and Metagenomics Tandy Warnow Departments of Bioengineering and Computer Science The University of Illinois at Urbana-Champaign.

Software for Scientists Tandy Warnow Department of Computer Science University of Texas at Austin.

New techniques that “boost” methods for large-scale multiple sequence alignment and phylogenetic estimation Tandy Warnow Department of Computer Science.

New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.

From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.

Ultra-large Multiple Sequence Alignment Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign

New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.

CS 394C Algorithms for Computational Biology Tandy Warnow Spring 2012.

TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science.

Challenge and novel aproaches for multiple sequence alignment and phylogenetic estimation Tandy Warnow Department of Computer Science The University of.

Introduction to Phylogenetic Estimation Algorithms Tandy Warnow.

Orangutan GorillaChimpanzee Human From the Tree of the Life Website, University of Arizona Species Tree.

New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.

New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.

Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.

394C, October 2, 2013 Topics: Multiple Sequence Alignment

SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.

BBCA: Improving the scalability of *BEAST using random binning Tandy Warnow The University of Illinois at Urbana-Champaign Co-authors: Theo Zimmermann.

TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science The University of Texas at Austin.

Constructing the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign.

New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.

New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.

Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.

Family of HMMs Nam Nguyen University of Texas at Austin.

The Mathematics of Estimating the Tree of Life Tandy Warnow The University of Illinois.

Three approaches to large- scale phylogeny estimation: SATé, DACTAL, and SEPP Tandy Warnow Department of Computer Science The University of Texas at Austin.

CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.

TIPP: Taxon Identification using Phylogeny-Aware Profiles Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign.

CS/BIOE 598: Algorithmic Computational Genomics Tandy Warnow Departments of Bioengineering and Computer Science

Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.

Algorithms for Ultra-large Multiple Sequence Alignment and Phylogeny Estimation Tandy Warnow Department of Computer Science The University of Texas at.

Approaching multiple sequence alignment from a phylogenetic perspective Tandy Warnow Department of Computer Sciences The University of Texas at Austin.

Ultra-large alignments using Ensembles of HMMs Nam-phuong Nguyen Institute for Genomic Biology University of Illinois at Urbana-Champaign.

SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.

Progress and Challenges for Large-Scale Phylogeny Estimation Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for.

Simultaneous alignment and tree reconstruction Collaborative grant: Texas, Nebraska, Georgia, Kansas Penn State University, Huston-Tillotson, NJIT, and.

New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.

New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.

New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.

Ensembles of HMMs and their use in biomolecular sequence analysis Nam-phuong Nguyen Carl R. Woese Institute for Genomic Biology University of Illinois.

Advancing Genome-Scale Phylogenomic Analysis Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for Genomic Biology.

CS 466 and BIOE 498: Introduction to Bioinformatics

TIPP: Taxon Identification using Phylogeny-Aware Profiles

Multiple Sequence Alignment Methods

Techniques for MSA Tandy Warnow.

Algorithm Design and Phylogenomics

Tandy Warnow The University of Illinois

Large-Scale Multiple Sequence Alignment

CS 581 Algorithmic Computational Genomics

TIPP: Taxon Identification using Phylogeny-Aware Profiles

Tandy Warnow Founder Professor of Engineering

New methods for simultaneous estimation of trees and alignments

Taxonomic identification and phylogenetic profiling

New methods for simultaneous estimation of trees and alignments

Ultra-large Multiple Sequence Alignment

Advances in Phylogenomic Estimation

Advances in Phylogenomic Estimation

Presentation transcript:

Phylogenomics Symposium and Software School Tandy Warnow Departments of Computer Science and Bioengineering The University of Illinois at Urbana-Champaign Supported by NSF grant DBI

Orangutan GorillaChimpanzee Human From the Tree of the Life Website, University of Arizona Species Tree

Phylogenomics = genome-scale phylogeny estimation Note: Jonathan Eisen coined this term, but used it to mean something else.

Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly identify orthologs Compute multiple sequence alignments for each locus Compute species tree or network: – Compute gene trees on the alignments and combine the estimated gene trees, OR – Estimate a tree from a single site from each locus – Estimate a tree from a concatenation of the multiple sequence alignments, OR Get statistical support on each branch (e.g., bootstrapping) Estimate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly identify orthologs Compute multiple sequence alignments for each locus Compute species tree or network: – Compute gene trees on the alignments and combine the estimated gene trees, OR – Estimate a tree from a single site from each locus – Estimate a tree from a concatenation of the multiple sequence alignments, OR Get statistical support on each branch (e.g., bootstrapping) Estimate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

Gene trees inside the species tree (Coalescent Process) Present Past Courtesy James Degnan Gorilla and Orangutan are not siblings in the species tree, but they are in the gene tree.

Incomplete Lineage Sorting (ILS) Two (or more) lineages fail to coalesce in the first ancestral population Chance of ILS depends on – Time: shorter branches make ILS likelier – Population size: wider branches increase ILS The most likely gene tree is not necessarily same as the species tree [Degnan, Rosenberg, 2006, PLOS genetics] JH Degnan, NA Rosenberg – Trends in ecology & evolution, 2009

New Coalescent-based Methods New summary methods: – ASTRAL, BUCKy-pop, NJst, Phylonet, STAR, STEM, etc. “Binning” genes – Naïve binning, Statistical Binning, Weighted Statistical Binning Site-based methods: – SNAPP, SVDquartets Co-estimation methods: – BEST, *BEAST, BBCA

New Coalescent-based Methods New summary methods: – ASTRAL, BUCKy-pop, NJst, Phylonet, STAR, STEM, etc. “Binning” genes – Naïve binning, Statistical Binning, Weighted Statistical Binning Site-based methods: – SNAPP, SVDquartets Co-estimation methods: – BEST, *BEAST, BBCA

Phylogenetic Network From L. Nakhleh, “Evolutionary Phylogenetic Networks: Models and Issues”, in Problem Solving Handbook in Computational Biology and Bioinformatics, Springer.

Metagenomic Taxon Identification Objective: classify short reads in a metagenomic sample

Scientific challenges: Ultra-large multiple-sequence alignment Alignment-free phylogeny estimation Supertree estimation Estimating species trees from many gene trees Genome rearrangement phylogeny Reticulate evolution Visualization of large trees and alignments Data mining techniques to explore multiple optima Theoretical guarantees under Markov models of evolution Techniques: machine learning, applied probability theory, graph theory, combinatorial optimization, supercomputing, and heuristics The Tree of Life: Multiple Challenges

Schedule Monday, 1:00-1:10 Introductions 1:10-1:50 Tandy Warnow (Multiple sequence alignment)Tandy Warnow (Multiple sequence alignment) 1:50-2:30 Laura Kubatko (Phylogenomics: SVDquartets)Laura Kubatko (Phylogenomics: SVDquartets) 2:30-3:10 Siavash Mirarab (Phylogenomics: ASTRAL)Siavash Mirarab (Phylogenomics: ASTRAL) 3:10-3:25 Break 3:25-4:05 Nam Nguyen (TIPP: metagenomic data analysis)Nam Nguyen (TIPP: metagenomic data analysis) 4:05-4:45 Luay Nakhleh (Phylogenetic Networks)Luay Nakhleh (Phylogenetic Networks) 4:45-5:15 Downloading and installing software

Software School Schedule May 19, 2015: Software School (USB 2244 and USB 1250) 9-10 AM: ASTRAL in USB 1250 (taught by Siavash Mirarab)Siavash Mirarab) TIPP in USB 2244 (taught by Nam-Phuong Nguyen)Nam-Phuong Nguyen) AM: UPP in USB 2244 (taught by Nam-Phuong Nguyen)Nam-Phuong Nguyen) Phylonet in USB 1250 (taught by Yun Yu and Luay Nakhleh)Yun Yu and Luay Nakhleh) 11 AM - 12 noon: PASTA in USB 1250 (taught by Siavash Mirarab).Siavash Mirarab). 12 noon - 1:30 PM: lunch break 1:30 PM - 2:30 PM: SVDquartets in USB 1250 (taught by Laura Kubatko and Dave SwoffordLaura Kubatko and Dave Swoffo

PASTA and UPP: two new methods for multiple sequence alignment Tandy Warnow Departments of Computer Science and Bioengineering The University of Illinois at Urbana-Champaign

Orangutan Gorilla ChimpanzeeHuman From the Tree of the Life Website, University of Arizona Phylogeny (evolutionary tree)

DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT TAGCCCATAGACTTAGCGCTTAGCACAAAGGGCAT TAGCCCTAGCACTT AAGACTT TGGACTTAAGGCCT AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT

Phylogeny Problem TAGCCCATAGACTTTGCACAATGCGCTTAGGGCAT UVWXY U VW X Y

AGAT TAGACTTTGCACAATGCGCTT AGGGCATGA UVWXY U VW X Y The “real” problem

…ACGGTGCAGTTACCA… MutationDeletion …ACCAGTCACCA… Indels (insertions and deletions)

…ACGGTGCAGTTACC-A… …AC----CAGTCACCTA… The true multiple alignment – Reflects historical substitution, insertion, and deletion events – Defined using transitive closure of pairwise alignments computed on edges of the true tree … ACGGTGCAGTTACCA … Substitution Deletion … ACCAGTCACCTA … Insertion

Input: unaligned sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 1: Alignment S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 2: Construct tree S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 S4 S2 S3

First Align, then Compute the Tree S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 S4 S2 S3

Multiple Sequence Alignment (MSA): another grand challenge 1 S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- … Sn = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC … Sn = TCACGACCGACA Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation 1 Frontiers in Massive Data Analysis, National Academies Press, 2013

This talk Simulation studies comparing two-phase methods SATé (Science 2009, Systematic Biology 2012) PASTA (RECOMB 2014 and JCB 2014), co-estimation of alignments and trees (improved version of SATé) UPP (RECOMB 2015 and Genome Biology, in press), designed for datasets with fragmentary sequences, or with structural alignments PASTA and UPP can analyze datasets with 1,000,000 sequences, and are highly accurate All methods are available in open-source form, on github. Tutorials tomorrow.

Simulation Studies S1S2 S3S4 S1 = - AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = - AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-C--T-----GACCGC-- S4 = T---C-A-CGACCGA----CA Compare True tree and alignment S1S4 S3S2 Estimated tree and alignment Unaligned Sequences

Quantifying Error FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FN FP

Two-phase estimation Alignment methods Clustal POY (and POY*) Probcons (and Probtree) Probalign MAFFT Muscle Di-align T-Coffee Prank (PNAS 2005, Science 2008) Opal (ISMB and Bioinf. 2007) FSA (PLoS Comp. Bio. 2009) Infernal (Bioinf. 2009) Etc. Phylogeny methods Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining FastME UPGMA Quartet puzzling Etc. RAxML: heuristic for large-scale ML optimization

1000-taxon models, ordered by difficulty (Liu et al., 2009)

Large-scale Alignment Estimation Many genes are considered unalignable due to high rates of evolution Only a few methods can analyze large datasets iPlant (NSF Plant Biology Collaborative) and other projects planning to construct phylogenies with 500,000 taxa

1kp: Thousand Transcriptome Project First study (Wickett, Mirarab, et al., PNAS 2014) had ~100 species and ~800 genes, gene trees and alignments estimated using SATe, and a coalescent-based species tree estimated using ASTRAL Second study: Plant Tree of Life based on transcriptomes of ~1200 species, and more than 13,000 gene families (most not single copy) Gene Tree Incongruence G. Ka-Shu Wong U Alberta N. Wickett Northwestern J. Leebens-Mack U Georgia N. Matasci iPlant T. Warnow, S. Mirarab, N. Nguyen, Md. S.Bayzid UIUC UT-Austin UT-Austin UT-Austin Upcoming Challenges: Species tree estimation from conflicting gene trees Alignment of datasets with > 100,000 sequences Plus many many other people…

Re-aligning on a tree A B D C Merge sub- alignments Estimate ML tree on merged alignment Decompose dataset AB CD Align subproblem s AB CD ABCD

SATé and PASTA Algorithms Estimate ML tree on new alignment Tree Obtain initial alignment and estimated ML tree Use tree to compute new alignment Alignment If new alignment/tree pair has worse ML score, realign using a different decomposition Repeat until termination condition (typically, 24 hours)

1000 taxon models, ordered by difficulty SATé-1 24 hour analysis, on desktop machines (Similar improvements for biological datasets) SATé-1 can analyze up to about 30,000 sequences. SATé-1 (Science 2009) performance

1000 taxon models ranked by difficulty SATé-1 and SATé-2 (Systematic Biology, 2012)

PASTA (2014): even better than SATé-2 PASTA vs. SATé-2 (a) Faster, (b) Can analyze larger datasets (up to 1,000,000 sequences – SATé-2 can analyze 50,000 sequences) (c)More accurate!

1kp: Thousand Transcriptome Project Plant Tree of Life based on transcriptomes of ~1200 species More than 13,000 gene families (most not single copy) Gene Tree Incongruence G. Ka-Shu Wong U Alberta N. Wickett Northwestern J. Leebens-Mack U Georgia N. Matasci iPlant T. Warnow, S. Mirarab, N. Nguyen, Md. S.Bayzid UIUC UT-Austin UT-Austin UT-Austin Challenge: Alignment of datasets with > 100,000 sequences Plus many many other people…

1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary

1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary All standard multiple sequence alignment methods we tested performed poorly on datasets with fragments.

1kp: Thousand Transcriptome Project Plant Tree of Life based on transcriptomes of ~1200 species More than 13,000 gene families (most not single copy) Gene Tree Incongruence G. Ka-Shu Wong U Alberta N. Wickett Northwestern J. Leebens-Mack U Georgia N. Matasci iPlant T. Warnow, S. Mirarab, N. Nguyen, Md. S.Bayzid UIUC UT-Austin UT-Austin UT-Austin Challenge: Alignment of datasets with > 100,000 sequences with many fragmentary sequences Plus many many other people…

UPP: large-scale MSA estimation UPP = “Ultra-large multiple sequence alignment using Phylogeny-aware Profiles” Nguyen, Mirarab, and Warnow. In press, Genome Biology Available in open source form on github

Profile HMMs

A simple idea Select random subset of sequences, and build “backbone alignment” Construct a profile Hidden Markov Model (HMM) to represent the backbone alignment Add all remaining sequences to the backbone alignment using the HMM Fast! But is it accurate?

A simple idea Select random subset of sequences, and build “backbone alignment” Construct a profile Hidden Markov Model (HMM) to represent the backbone alignment Add all remaining sequences to the backbone alignment using the HMM Fast! But is it accurate?

Simple technique: 1) build one HMM for the entire alignment 2) Align fragment to the HMM, and insert into alignment

One Hidden Markov Model for the entire alignment?

Or 2 HMMs?

Or 4 HMMs?

Or 7 HMMs?

UPP Algorithmic Approach Select random subset of sequences, and build “backbone alignment” Construct an “Ensemble of Hidden Markov Models” on the backbone alignment (the family has HMMs on overlapping subsets of different sizes) Add all remaining sequences to the backbone alignment using the Ensemble of HMMs

RNASim: alignment error Note: Mafft was run under default settings for 10K and 50K sequences and under Parttree for 100K sequences, and fails to complete under any setting For 200K sequences. Clustal-Omega only completes on 10K dataset. All methods given 24 hrs on a 12-core machine

RNASim: tree error Note: Mafft was run under default settings for 10K and 50K sequences and under Parttree for 100K sequences, and fails to complete under any setting For 200K sequences. Clustal-Omega only completes on 10K dataset. All methods given 24 hrs on a 12-core machine

RNASim Million Sequences: alignment error Notes: We show alignment error using average of SP-FN and SP-FP. UPP variants have better scores than PASTA. But for the Total Column (TC) scores, PASTA is better than UPP: it recovered 10% of the columns compared to less than 0.04% for UPP variants. PASTA under-aligns: its alignment is 43 times wider than true alignment (~900 Gb of disk space). UPP alignments were closer in length to true alignment (0.93 to 1.38 wider).

RNASim Million Sequences: tree error Notes: UPP(Fast,NoDecomp) took 2.2 days, UPP(Fast) took 11.9 days, and PASTA took 10.3 days (all using 12 processors).

Performance on fragmentary datasets of the 1000M2 model condition UPP vs. PASTA: impact of fragmentation Under high rates of evolution, PASTA is badly impacted by fragmentary sequences (the same is true for other methods). Under low rates of evolution, PASTA can still be highly accurate (data not shown). UPP continues to have good accuracy even on datasets with many fragments under all rates of evolution.

Running Time Wall-clock time used (in hours) given 12 processors

Summary SATé-1 (Science 2009), SATé-2 (Systematic Biology 2012), and PASTA (RECOMB 2014): methods for co-estimating gene trees and multiple sequence alignments. PASTA can analyze up to 1,000,000 sequences, and is highly accurate for full-length sequences. But none of these methods are robust to fragmentary sequences. PASTA TUTORIAL TOMORROW HMM Ensemble technique: uses a collection of HMMs to represent a “backbone alignment”. HMM ensembles improve accuracy, especially in the presence of high rates of evolution. Applications of HMM Ensembles in: – UPP (ultra-large multiple sequence alignment), Genome Biology (in press) – TUTORIAL TOMORROW – SEPP (phylogenetic placement), PSB 2012 (not shown) – TIPP (metagenomic taxon identification and abundance profiling), Bioinformatics 2014 (not shown) – TUTORIAL TOMORROW

Acknowledgments PhD students: Nam Nguyen* and Siavash Mirarab** Undergrad: Keerthana Kumar Lab Website: Personal Website: Write to me: (I am recruiting Funding: Guggenheim Foundation, NSF, Microsoft Research New England, David Bruton Jr. Centennial Professorship, TACC (Texas Advanced Computing Center), and the University of Alberta (Canada) TACC and UTCS computational resources * Now a postdoc with Becky Stumpf and Bryan White (ICB, UIUC) ** Supported by HHMI Predoctoral Fellowship

Tutorial Downloads PASTA: tutorial.mdhttps://github.com/smirarab/pasta/blob/master/pasta-doc/pasta- tutorial.md UPP: ASTRAL Phylonet TIPP SVDquartets: will be run within PAUP*, Please go to these sites to obtain the software, and to get instructions on how to download, install, and run the software. (Do this today, in advance of the software school tomorrow.)

Scientific challenges: Ultra-large multiple-sequence alignment Alignment-free phylogeny estimation Supertree estimation Estimating species trees from many gene trees Genome rearrangement phylogeny Reticulate evolution Visualization of large trees and alignments Data mining techniques to explore multiple optima Theoretical guarantees under Markov models of evolution Techniques: machine learning, applied probability theory, graph theory, combinatorial optimization, supercomputing, and heuristics The Tree of Life: Multiple Challenges