Ultra-large Multiple Sequence Alignment


Ultra-large Multiple Sequence Alignment Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu

Phylogeny (evolutionary tree). Figure: tree on Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life website, University of Arizona.

Phylogenies and Applications. Basic biology: How did life evolve? Applications of phylogenies to: protein structure and function, population genetics, human migrations, metagenomics.

Hard Computational Problems. NP-hard problems. Large datasets: 100,000+ sequences, thousands of genes. “Big data” complexity: model misspecification, fragmentary sequences, errors in input data, streaming data.

Phylogeny Problem. Figure: sequences U = AGGGCAT, V = TAGCCCA, W = TAGACTT, X = TGCACAA, Y = TGCGCTT, and a tree on the leaves U, V, W, X, Y.

Much is known about this problem from a mathematical and empirical viewpoint. Figure: sequences U = AGGGCAT, V = TAGCCCA, W = TAGACTT, X = TGCACAA, Y = TGCGCTT, and a tree with leaves in the order X, U, Y, V, W.

However… Figure: the sequences now differ in length (U = AGGGCATGA, V = AGAT, W = TAGACTT, X = TGCACAA, Y = TGCGCTT); tree with leaves in the order X, U, Y, V, W.

Indels (insertions and deletions). Mutation: …ACGGTGCAGTTACCA… becomes …ACCAGTCACCA…

Deletion, substitution, and insertion events relate …ACGGTGCAGTTACCA… and …ACCAGTCACCTA…:
…ACGGTGCAGTTACC-A…
…AC----CAGTCACCTA…
The true multiple alignment: reflects historical substitution, insertion, and deletion events; is defined using the transitive closure of pairwise alignments computed on edges of the true tree. Homology = nucleotides lined up since they come from a common ancestor. Indel = dash.

Input: unaligned sequences
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA

Phase 1: Alignment
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
is aligned to
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
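The Phase 1 output can be sanity-checked mechanically. The following is a minimal sketch (the helper names `degap` and `is_valid_alignment` are illustrative and do not come from the slides): an alignment is well-formed if all rows have the same length and deleting the dashes from each row gives back the corresponding input sequence.

```python
def degap(row: str) -> str:
    """Remove gap characters from one alignment row."""
    return row.replace("-", "")


def is_valid_alignment(unaligned: dict, aligned: dict) -> bool:
    """True if all rows have equal length and degap to the input sequences."""
    lengths = {len(row) for row in aligned.values()}
    if len(lengths) != 1:
        return False
    return all(degap(aligned[name]) == seq for name, seq in unaligned.items())


# Sequences and alignment copied from the slide.
unaligned = {
    "S1": "AGGCTATCACCTGACCTCCA",
    "S2": "TAGCTATCACGACCGC",
    "S3": "TAGCTGACCGC",
    "S4": "TCACGACCGACA",
}
aligned = {
    "S1": "-AGGCTATCACCTGACCTCCA",
    "S2": "TAG-CTATCAC--GACCGC--",
    "S3": "TAG-CT-------GACCGC--",
    "S4": "-------TCAC--GACCGACA",
}

print(is_valid_alignment(unaligned, aligned))
```

This checks only that the alignment is structurally consistent with the input; it says nothing about whether the homology statements are correct.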

Phase 2: Construct tree
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
Figure: tree on S1, S2, S4, S3 computed from the alignment.

Phylogenomic pipeline
Select taxon set and markers.
Gather and screen sequence data, possibly identify orthologs.
Compute multiple sequence alignments for each locus, and construct gene trees.
Compute species tree or network: combine the estimated gene trees, OR estimate a tree from a concatenation of the multiple sequence alignments.
Get statistical support on each branch (e.g., bootstrapping).
Estimate dates on the nodes of the phylogeny.
Use species tree with branch support and dates to understand biology.


Multiple Sequence Alignment (MSA): a scientific grand challenge (1)
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
…
Sn = TCACGACCGACA
is aligned to
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
…
Sn = -------TCAC--GACCGACA
Novel techniques needed for scalability and accuracy: NP-hard problems and large datasets; current methods do not provide good accuracy; few methods can analyze even moderately large datasets. Many important applications besides phylogenetic estimation.
(1) Frontiers in Massive Data Analysis, National Academies Press, 2013

This talk. “Big data” multiple sequence alignment: SATé (Science 2009, Systematic Biology 2012) and PASTA (RECOMB and J Comp Biol 2015), methods for co-estimation of alignments and trees. UPP (Genome Biology 2015): ultra-large multiple sequence alignment, using the “Ensemble of HMMs” technique. Evaluating BAli-Phy on biological and simulated datasets.

First Align, then Compute the Tree
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
is aligned to
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
Figure: tree on S1, S2, S4, S3.

Simulation Studies. Compare the true tree and alignment with the tree and alignment estimated from the unaligned sequences.
Unaligned sequences:
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
True tree (leaves S1, S2, S3, S4) and alignment:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
Estimated tree (leaves S1, S4, S3, S2) and alignment:
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-C--T-----GACCGC--
S4 = T---C-A-CGACCGA----CA

FN: false negative (missing edge). FP: false positive (incorrect edge). Example shown: 50% error rate.
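As a rough illustration of how FN and FP rates can be computed, the sketch below represents each internal edge of an unrooted tree by the bipartition it induces on the taxa. The five-taxon trees, the taxon names, and the helper names are made up for this example and are not the trees from the slide.

```python
def canonical(side, taxa):
    """Represent a bipartition by the side that contains the smallest taxon name."""
    side = frozenset(side)
    return side if min(taxa) in side else frozenset(taxa) - side


def error_rates(true_splits, est_splits, taxa):
    """Return (FN rate, FP rate) over internal-edge bipartitions."""
    true_set = {canonical(s, taxa) for s in true_splits}
    est_set = {canonical(s, taxa) for s in est_splits}
    fn = len(true_set - est_set) / len(true_set)  # true edges missing from the estimate
    fp = len(est_set - true_set) / len(est_set)   # estimated edges not in the true tree
    return fn, fp


# Hypothetical trees: true ((A,B),C,(D,E)) versus estimated ((A,B),D,(C,E)).
taxa = {"A", "B", "C", "D", "E"}
true_splits = [{"A", "B"}, {"D", "E"}]
est_splits = [{"A", "B"}, {"C", "E"}]

fn_rate, fp_rate = error_rates(true_splits, est_splits, taxa)
print(fn_rate, fp_rate)
```

In this made-up case one of the two true internal edges is recovered, so both rates come out to one half.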

Two-phase estimation
Alignment methods: Clustal, POY (and POY*), Probcons (and Probtree), Probalign, MAFFT, Muscle, Di-align, T-Coffee, Prank (PNAS 2005, Science 2008), Opal (ISMB and Bioinf. 2007), FSA (PLoS Comp. Bio. 2009), Infernal (Bioinf. 2009), etc.
Phylogeny methods: Bayesian MCMC, maximum parsimony, maximum likelihood, neighbor joining, FastME, UPGMA, quartet puzzling, etc.
RAxML: heuristic for large-scale ML optimization

1000-taxon models, ordered by difficulty (Liu et al., 2009). Notes: (1) 2 classes of model conditions: easy, moderate-to-difficult; (2) true alignment; (3) 2 classes: ClustalW, everything else. Alignment error, measured this way, isn't a perfect predictor of tree error, measured this way.

SATé “Family” of methods. Iterative divide-and-conquer methods. Each iteration re-aligns the sequences using the current tree, running preferred MSA methods on small local subsets and merging subset alignments. Each iteration computes an ML tree on the current alignment, under the GTR (Generalized Time Reversible) Markov model of evolution. Note: these methods are “MSA boosters”, designed to improve accuracy and/or scalability of the base method. We show results using MAFFT L-INS-i to align subsets.

Re-aligning on a tree. Figure: a cycle of four steps on a tree with subsets A, B, C, D: decompose dataset; align subsets; merge sub-alignments (ABCD); estimate ML tree on merged alignment. (Comment on subset size.)

SATé and PASTA Algorithms. Obtain initial alignment and estimated ML tree. Then alternate: use tree to compute new alignment; estimate ML tree on new alignment. Repeat until termination condition, and return the alignment/tree pair with the best ML score.
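The control flow of this loop can be sketched as follows. This is only a skeleton under stated assumptions: `realign`, `estimate_tree`, and `ml_score` are hypothetical callables standing in for the real alignment, ML tree search, and likelihood steps, and a fixed iteration count stands in for the termination condition.

```python
def iterate_alignment_and_tree(initial_alignment, initial_tree,
                               realign, estimate_tree, ml_score,
                               max_iterations=3):
    """Alternate re-alignment and tree estimation; return the best-scoring pair."""
    best_score = ml_score(initial_alignment, initial_tree)
    best_pair = (initial_alignment, initial_tree)
    tree = initial_tree
    for _ in range(max_iterations):
        alignment = realign(tree)          # use the tree to compute a new alignment
        tree = estimate_tree(alignment)    # estimate a tree on the new alignment
        score = ml_score(alignment, tree)
        if score > best_score:             # keep the pair with the best score so far
            best_score = score
            best_pair = (alignment, tree)
    return best_pair


# Toy stand-ins (integers instead of real alignments and trees) to show the flow.
result = iterate_alignment_and_tree(
    0, 0,
    realign=lambda tree: tree + 1,
    estimate_tree=lambda alignment: alignment,
    ml_score=lambda alignment, tree: -(alignment - 2) ** 2,
)
print(result)
```

Returning the best-scoring pair rather than the last one matters because later iterations are not guaranteed to improve the score.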

SATé-1 (Science 2009) performance. 1000-taxon models, ordered by difficulty; rate of evolution generally increases from left to right. SATé-1 24-hour analysis, on desktop machines (similar improvements for biological datasets). SATé-1 can analyze up to about 8,000 sequences. Notes: For moderate-to-difficult datasets, SATé gets better trees and alignments than all other estimation methods, close to what you might get if you had access to the true alignment. This opens up a new realm of possibility: datasets currently considered “unalignable” can in fact be aligned reasonably well, which makes accurate estimation of deep evolutionary histories feasible using a wider range of markers. Transition: can we do better? What about smaller simulated datasets? And what about biological datasets?

SATé-1 and SATé-2 (Systematic Biology, 2012). 1000-taxon models ranked by difficulty. SATé-1: up to 8K sequences. SATé-2: up to ~50K sequences.

SATé variants differ only in the decomposition strategy. Figure: decompose dataset (subsets A, B, C, D); align subsets; merge sub-alignments (ABCD); estimate ML tree on merged alignment.

PASTA merging: Step 1. Compute a spanning tree connecting alignment subsets (figure: subsets A, B, C, D, E).

PASTA merging: Step 2. Use Opal (or Muscle) to merge adjacent subset alignments in the spanning tree (figure: merged pairs AB, BD, CD, DE).

PASTA merging: Step 3. Use transitivity to merge all pairwise-merged alignments from Step 2 into a final alignment on the entire dataset: AB + BD = ABD; ABD + CD = ABCD; ABCD + DE = ABCDE. Overall: O(n log(n) + L)
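A small sketch may help make "merging by transitivity" concrete. It is not PASTA's implementation and does not aim for the running time quoted on the slide; the function names and the toy sequences are invented for illustration. The idea: residues that share a column in either input alignment are placed in the same class (union-find), and the classes are then ordered so that every sequence reads left to right (a topological sort).

```python
import heapq
from collections import defaultdict


def residue_columns(aln):
    """Yield each column as a list of (name, residue_index) pairs, gaps skipped."""
    next_index = {name: 0 for name in aln}
    width = len(next(iter(aln.values())))
    for c in range(width):
        col = []
        for name, row in aln.items():
            if row[c] != "-":
                col.append((name, next_index[name]))
                next_index[name] += 1
        yield col


def merge_by_transitivity(aln1, aln2):
    """Merge two alignments that overlap on at least one sequence."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Residues in the same column of either input end up in the same class.
    for aln in (aln1, aln2):
        for col in residue_columns(aln):
            for res in col[1:]:
                parent[find(res)] = find(col[0])

    seqs = {}
    for aln in (aln1, aln2):
        for name, row in aln.items():
            seqs[name] = row.replace("-", "")

    members = defaultdict(dict)  # class -> {sequence name: character}
    succ = defaultdict(set)      # class -> classes that must come later
    indeg = defaultdict(int)
    rank = {}                    # first-seen order, used to break ties
    for name, seq in seqs.items():
        prev = None
        for i, ch in enumerate(seq):
            k = find((name, i))
            rank.setdefault(k, len(rank))
            if name in members[k]:
                raise ValueError("inputs are inconsistent on " + name)
            members[k][name] = ch
            if prev is not None and k not in succ[prev]:
                succ[prev].add(k)
                indeg[k] += 1
            prev = k

    # Kahn's algorithm: output a class once all of its predecessors are placed.
    heap = [(rank[k], k) for k in rank if indeg[k] == 0]
    heapq.heapify(heap)
    order = []
    while heap:
        _, k = heapq.heappop(heap)
        order.append(k)
        for nxt in succ[k]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                heapq.heappush(heap, (rank[nxt], nxt))
    if len(order) != len(rank):
        raise ValueError("inputs imply a cyclic column order")

    return {name: "".join(members[k].get(name, "-") for k in order) for name in seqs}


# Toy inputs: "AB" and "BD" overlap on sequence B.
aln_ab = {"A": "ACGT", "B": "A-GT"}
aln_bd = {"B": "AG-T", "D": "AGCT"}
merged = merge_by_transitivity(aln_ab, aln_bd)
for name, row in merged.items():
    print(name, row)
```

Columns that appear in only one input (for example an insertion seen only in D) are kept as separate columns, which is why transitivity merging tends to produce wide alignments rather than force unrelated residues together.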

1kp: Thousand Transcriptome Project. G. Ka-Shu Wong (U Alberta), J. Leebens-Mack (U Georgia), N. Wickett (Northwestern), N. Matasci (iPlant), T. Warnow (UIUC), S. Mirarab (UT-Austin), N. Nguyen (UT-Austin), plus many, many other people. First study (Wickett, Mirarab, et al., PNAS 2014) had ~100 species and ~800 genes; gene trees and alignments were estimated using SATé, and a coalescent-based species tree was estimated using ASTRAL. Second study: Plant Tree of Life based on transcriptomes of ~1200 species, and more than 13,000 gene families (most not single copy). Gene tree incongruence. Challenges: species tree estimation from conflicting gene trees; gene tree estimation on datasets with > 100,000 sequences.

1KP dataset: more than 100,000 p450 amino-acid sequences, many fragmentary. All standard multiple sequence alignment methods we tested performed poorly on datasets with fragments.

1kp: Thousand Transcriptome Project. G. Ka-Shu Wong (U Alberta), J. Leebens-Mack (U Georgia), N. Wickett (Northwestern), N. Matasci (iPlant), T. Warnow (UIUC), S. Mirarab (UT-Austin), N. Nguyen (UT-Austin), plus many, many other people. Plant Tree of Life based on transcriptomes of ~1200 species. More than 13,000 gene families (most not single copy). Gene tree incongruence. Challenge: alignment of datasets with > 100,000 sequences, many of them fragmentary.

UPP = “Ultra-large multiple sequence alignment using Phylogeny-aware Profiles”. Nguyen, Mirarab, and Warnow. Genome Biology, 2015. Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences. Uses an ensemble of HMMs.

Simple idea (not UPP). Select random subset of sequences, and build “backbone alignment”. Construct a Hidden Markov Model (HMM) on the backbone alignment. Add all remaining sequences to the backbone alignment using the HMM.


This approach works well if the dataset is small and has low evolutionary rates, but is not very accurate otherwise.

One Hidden Markov Model for the entire alignment? HMM 1

Or 2 HMMs? HMM 1 HMM 2

Or 4 HMMs? (HMM 1, HMM 2, HMM 3, HMM 4.) Note: the bit score doesn’t depend on the size of the sequence database, only on the profile HMM and the target sequence.

Or all 7 HMMs? (HMM 1 through HMM 7.) Note: the bit score doesn’t depend on the size of the sequence database, only on the profile HMM and the target sequence.

UPP Algorithmic Approach. Select random subset of full-length sequences, and build “backbone alignment”. Construct an “Ensemble of Hidden Markov Models” on the backbone alignment. Add all remaining sequences to the backbone alignment using the Ensemble of HMMs.
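The progression from 1 to 2 to 4 to "all 7" HMMs suggests a nested decomposition in which every level is kept. The sketch below illustrates only that counting idea and the "pick the best-scoring model" step. It is not UPP's code: halving an ordered list stands in for whatever decomposition the real method applies to the backbone, and the bit scores are made-up numbers that a profile HMM tool would normally supply.

```python
def nested_subsets(ordered_names, max_size):
    """Return every subset produced by repeated halving, from the full set down."""
    subsets = [list(ordered_names)]
    if len(ordered_names) > max_size:
        mid = len(ordered_names) // 2
        subsets += nested_subsets(ordered_names[:mid], max_size)
        subsets += nested_subsets(ordered_names[mid:], max_size)
    return subsets


def best_model(bit_scores):
    """Index of the subset whose HMM gives the query its highest bit score."""
    return max(range(len(bit_scores)), key=lambda i: bit_scores[i])


# Hypothetical backbone of 8 sequences, split until subsets have at most 2.
backbone = [f"seq{i}" for i in range(1, 9)]
ensemble = nested_subsets(backbone, max_size=2)
print(len(ensemble))  # 1 + 2 + 4 = 7 subsets, one HMM each

# Made-up bit scores of one query sequence against the 7 HMMs.
scores = [12.5, 20.1, 8.0, 31.7, 15.2, 7.9, 6.3]
print(ensemble[best_model(scores)])
```

Because a bit score depends only on the profile HMM and the query, scores from HMMs built on subsets of different sizes can be compared directly, which is what makes the "pick the best" step reasonable.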

Evaluation. Simulated datasets (some have fragmentary sequences): 10K to 1,000,000 sequences in RNASim (complex RNA sequence evolution simulation); 1000-sequence nucleotide datasets from SATé papers; 5000-sequence AA datasets (from FastTree paper); 10,000-sequence Indelible nucleotide simulation. Biological datasets: proteins (largest BAliBASE and HomFam); RNA (3 CRW datasets with up to 28,000 sequences).

RNASim: alignment error. All methods given 24 hrs on a 12-core machine. Note: MAFFT was run under default settings for 10K and 50K sequences and under PartTree for 100K sequences, and fails to complete under any setting for 200K sequences. Clustal-Omega only completes on the 10K dataset.

RNASim: tree error. All methods given 24 hrs on a 12-core machine. Note: MAFFT was run under default settings for 10K and 50K sequences and under PartTree for 100K sequences, and fails to complete under any setting for 200K sequences. Clustal-Omega only completes on the 10K dataset.

RNASim Million Sequences: alignment error. Notes: We show alignment error using the average of SP-FN and SP-FP. UPP variants have better alignment scores than PASTA. (Not shown: Total Column Scores, where PASTA is more accurate than UPP.) No other methods tested could complete on these data. PASTA under-aligns: its alignment is 43 times wider than the true alignment (~900 Gb of disk space). UPP alignments were closer in length to the true alignment (0.93 to 1.38 times its width).
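For readers unfamiliar with these measures, here is a minimal sketch of SP-FN, SP-FP, and the width ratio, assuming the usual pairwise definitions: SP-FN is the fraction of homologous residue pairs in the true alignment that the estimate misses, and SP-FP is the fraction of pairs in the estimate that are not in the true alignment. The function names and the two-sequence toy alignments are illustrative only.

```python
from itertools import combinations


def homologous_pairs(aln):
    """Set of residue pairs ((name, index), (name, index)) sharing a column."""
    names = sorted(aln)
    index = {n: 0 for n in names}
    pairs = set()
    for c in range(len(aln[names[0]])):
        col = []
        for n in names:
            if aln[n][c] != "-":
                col.append((n, index[n]))
                index[n] += 1
        pairs.update(combinations(col, 2))
    return pairs


def sp_errors(true_aln, est_aln):
    """Return (SP-FN, SP-FP) for an estimated alignment."""
    t, e = homologous_pairs(true_aln), homologous_pairs(est_aln)
    sp_fn = len(t - e) / len(t) if t else 0.0
    sp_fp = len(e - t) / len(e) if e else 0.0
    return sp_fn, sp_fp


def expansion_ratio(true_aln, est_aln):
    """Estimated alignment width divided by true alignment width."""
    width = lambda aln: len(next(iter(aln.values())))
    return width(est_aln) / width(true_aln)


# Toy example: the estimate leaves the last residues unaligned (under-alignment).
true_aln = {"a": "AC-T", "b": "ACGT"}
est_aln = {"a": "ACT--", "b": "AC-GT"}

sp_fn, sp_fp = sp_errors(true_aln, est_aln)
print(sp_fn, sp_fp, expansion_ratio(true_aln, est_aln))
```

The toy estimate shows the pattern described on the slide: an under-aligned result is wider than the truth and loses true pairs (raising SP-FN) without necessarily adding false ones.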

RNASim Million Sequences: tree error Using 12 processors: UPP(Fast,NoDecomp) took 2.2 days, UPP(Fast) took 11.9 days, and PASTA took 10.3 days

UPP vs. PASTA: impact of fragmentation. Under high rates of evolution, PASTA is badly impacted by fragmentary sequences (the same is true for other methods). Under low rates of evolution, PASTA can still be highly accurate (data not shown). UPP continues to have good accuracy even on datasets with many fragments under all rates of evolution. Figure: performance on fragmentary datasets of the 1000M2 model condition.

UPP Running Time Wall-clock time used (in hours) given 12 processors

Co-estimation would be much better!!!
S1 = AGGCTATCACCTGACCTCCA
S2 = TAGCTATCACGACCGC
S3 = TAGCTGACCGC
S4 = TCACGACCGACA
is aligned to
S1 = -AGGCTATCACCTGACCTCCA
S2 = TAG-CTATCAC--GACCGC--
S3 = TAG-CT-------GACCGC--
S4 = -------TCAC--GACCGACA
Figure: tree on S1, S2, S4, S3.

What about BAli-Phy? BAli-Phy (Redelings and Suchard): leading method for statistical co-estimation of alignments and trees. Like Bayesian phylogeny estimation, it is expected to be the most rigorous and accurate technique for estimating trees and alignments!

BAli-Phy: Better than PASTA! Figure: alignment accuracy (Total-Column score, axis from 0% to 40%) for MAFFT, PASTA, and BAli-Phy on simulated nucleotide datasets with 100 or 200 sequences, from two simulators: Indelible (DNA) and RNASim (RNA). Averages over 10 replicates (unpublished data from Mike Nute’s PhD dissertation).
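The Total-Column (TC) score in this figure can be sketched as the fraction of columns of the true alignment that reappear, residue for residue, in the estimated alignment. Conventions differ on details; this sketch assumes that only true columns with at least two residues are counted, and the names and toy alignments are illustrative.

```python
def columns(aln):
    """List of columns, each a frozenset of (name, residue_index) pairs."""
    names = sorted(aln)
    index = {n: 0 for n in names}
    cols = []
    for c in range(len(aln[names[0]])):
        col = set()
        for n in names:
            if aln[n][c] != "-":
                col.add((n, index[n]))
                index[n] += 1
        cols.append(frozenset(col))
    return cols


def tc_score(true_aln, est_aln):
    """Fraction of true columns (2+ residues) found identically in the estimate."""
    true_cols = [c for c in columns(true_aln) if len(c) >= 2]
    est_cols = set(columns(est_aln))
    if not true_cols:
        return 0.0
    return sum(c in est_cols for c in true_cols) / len(true_cols)


# Toy example: the estimate gets two of the three multi-residue columns right.
true_aln = {"a": "AC-T", "b": "ACGT"}
est_aln = {"a": "ACT-", "b": "ACGT"}
print(tc_score(true_aln, est_aln))
```

A column counts only if every residue in it is placed correctly, so TC is a much stricter measure than pairwise scores, which helps explain the low percentages on the axis.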

But: BAli-Phy is limited to small datasets From www.bali-phy.org/README.html, 5.2.1. Too many taxa? “BAli-Phy is quite CPU intensive, and so we recommend using 50 or fewer taxa in order to limit the time required to accumulate enough MCMC samples. (Despite this recommendation, data sets with more than 100 taxa have occasionally been known to converge.) We recommend initially pruning as many taxa as possible from your data set, then adding some back if the MCMC is not too slow.”

Re-aligning on a tree. Figure: a cycle of four steps on a tree with subsets A, B, C, D: decompose dataset; align subsets with MAFFT; merge sub-alignments (ABCD); estimate ML tree on merged alignment. (Comment on subset size.)

Re-aligning on a tree. Figure: the same cycle, with the subset aligner replaced: decompose dataset; align subsets with BAli-Phy??; merge sub-alignments (ABCD); estimate ML tree on merged alignment. (Comment on subset size.)

Results on 1000-sequence datasets (Comparing default PASTA to PASTA+BAli-Phy) Decomposition to 100-sequence subsets, one iteration of PASTA+BAli-Phy

Results on 10,000-sequence datasets (Comparing UPP variants where the backbone alignment is computed using either default PASTA or PASTA+BAli-Phy)

Benchmarking Statistical Multiple Sequence Alignment. Nute, Saleh, and Warnow. Systematic Biology, 2018, syy068, doi:10.1093/sysbio/syy068.

Study design. Goal: evaluate BAli-Phy (Redelings and Suchard) on both biological and simulated datasets, in comparison to leading alignment methods, on small protein sequence datasets (at most 27 sequences). Metrics: Modeler score (precision), SP-score (recall), expansion ratio (normalized alignment length), and running time. Datasets: 120 simulated datasets (6 model conditions) and 1192 biological datasets (4 biological benchmarks). Specific note: for each dataset, BAli-Phy was run independently on 32 processors for 48 hours, the burn-in was discarded, and the posterior decoding (PD) alignment was then computed. These BAli-Phy analyses used 230 CPU years on Blue Waters (supercomputer at NCSA).

Modeler vs SP-Score on 120 Simulated Datasets. BAli-Phy is best! Figure 6. Modeler score (i.e., precision) versus SP-Score (i.e., recall) for MSA methods on simulated amino acid data sets with 27 sequences for 6 different model conditions that vary by the substitution rate and indel rate; averages over 20 replicates are shown. See Supplementary Excel File available on Dryad for actual numeric values. © The Author(s) 2018. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com. Systematic Biology, Volume 68, Issue 3, May 2019, Pages 396–411, https://doi.org/10.1093/sysbio/syy068

Expansion Ratios on 120 Simulated Datasets BAli-Phy is best!

Modeler score vs SP-score on 1192 biological datasets. T-Coffee and PROMALS are best! BAli-Phy is good for Modeler score, but not so good for SP-score (e.g., MAFFT is better). Figure 2. Average Modeler Score (i.e., precision) versus SP-Score (i.e., recall) of all alignment methods on the individual biological benchmarks. Results shown are for 1192 data sets from the four benchmark collections (658 from BAliBase, 231 from Homstrad, 202 from Mattbench, and 101 from Sisyphus). See Supplementary Table S2 and Excel File available on Dryad for actual numeric values. Systematic Biology, Volume 68, Issue 3, May 2019, Pages 396–411, https://doi.org/10.1093/sysbio/syy068

Modeler Score on 1192 Biological datasets BAli-Phy has the best modeler score

SP-score on 1192 Biological datasets BAli-Phy not competitive for SP-score (but best method depends on % ID)

Expansion Ratio on 1192 Biological datasets BAli-Phy under-aligns

Running Time on 4 biological datasets with 17 sequences each BAli-Phy benefits from a long running time. Therefore, we used >2 months for each dataset.

Observations. BAli-Phy is much more accurate than all other methods on simulated datasets. BAli-Phy is generally less accurate than the top half of these methods on biological datasets, especially with respect to SP-score (recall). Average percent pairwise ID impacts all the measures of accuracy for all methods, and changes relative performance.

We do not know why there is a difference in accuracy. Most likely not an issue of failure of the MCMC analyses to converge (48 hours, 32 processors, small numbers of sequences). Possible explanations: model misspecification (proteins don’t evolve under the BAli-Phy model); structural alignments and evolutionary alignments are different; the structural alignments are not correct (perhaps over-aligned). All these explanations are likely true, but the relative contributions are unknown.

Final comments. MSA is challenging, but algorithmic techniques can improve accuracy and scalability: dataset size can be addressed using good divide-and-conquer approaches; heterogeneity in sequence length can be addressed using “local alignment” approaches, such as profile HMMs, with ensembles of profile HMMs providing improved accuracy. Yet the difference between performance on biological and simulated datasets is troubling.

The Tree of Life: Multiple Challenges. Scientific challenges: ultra-large multiple-sequence alignment; gene tree estimation; metagenomic classification; alignment-free phylogeny estimation; supertree estimation; estimating species trees from many gene trees; genome rearrangement phylogeny; reticulate evolution; visualization of large trees and alignments; data mining techniques to explore multiple optima; theoretical guarantees under Markov models of evolution. Techniques: applied probability theory, graph theory, supercomputing, and heuristics. Testing: simulations and real data.

Acknowledgments. PASTA and UPP: Nam Nguyen (now postdoc at UIUC) and Siavash Mirarab (now faculty at UCSD); undergrad: Keerthana Kumar (at UT-Austin). PASTA+BAli-Phy: Mike Nute (PhD student at UIUC). Evaluating BAli-Phy: Mike Nute and Ehsan Saleh (PhD students at UIUC). Current NSF grants: ABI-1458652 (multiple sequence alignment). Grainger Foundation (at UIUC), and UIUC. TACC, UTCS, Blue Waters, and UIUC campus cluster. PASTA, UPP, SEPP, and TIPP are available on GitHub at https://github.com/smirarab/; see also PASTA+BAli-Phy at http://github.com/MGNute/pasta. Papers available at http://tandy.cs.illinois.edu/MSAproject.html