Advancing Genome-Scale Phylogenomic Analysis Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for Genomic Biology.

Slides:



Advertisements
Similar presentations
Computer Science and Reconstructing Evolutionary Trees Tandy Warnow Department of Computer Science University of Illinois at Urbana-Champaign.
Advertisements

Computational biology and computational biologists Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular and Molecular Biology.
Phylogenomics Symposium and Software School Tandy Warnow Departments of Computer Science and Bioengineering The University of Illinois at Urbana-Champaign.
Multiple sequence alignment methods: evidence from data CS/BioE 598 Tandy Warnow.
Estimating species trees from multiple gene trees in the presence of ILS Tandy Warnow Joint work with Siavash Mirarab, Md. S. Bayzid, and others.
Recent breakthroughs in mathematical and computational phylogenetics
CS/BIOE 598: Algorithmic Computational Genomics Tandy Warnow Departments of Bioengineering and Computer Science
New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.
IPAM Workshop on Multiple Sequence Alignment Tandy Warnow The University of Illinois at Urbana-Champaign.
Complexity and The Tree of Life Tandy Warnow The University of Texas at Austin.
Introduction to Phylogenomics and Metagenomics Tandy Warnow The Department of Computer Science The University of Texas at Austin.
Phylogenomics Symposium and Software School Co-Sponsored by the SSB and NSF grant
Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Computational Phylogenomics and Metagenomics Tandy Warnow Departments of Bioengineering and Computer Science The University of Illinois at Urbana-Champaign.
Software for Scientists Tandy Warnow Department of Computer Science University of Texas at Austin.
New techniques that “boost” methods for large-scale multiple sequence alignment and phylogenetic estimation Tandy Warnow Department of Computer Science.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.
Ultra-large Multiple Sequence Alignment Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign
CS 394C Algorithms for Computational Biology Tandy Warnow Spring 2012.
Challenge and novel aproaches for multiple sequence alignment and phylogenetic estimation Tandy Warnow Department of Computer Science The University of.
Introduction to Phylogenetic Estimation Algorithms Tandy Warnow.
Orangutan GorillaChimpanzee Human From the Tree of the Life Website, University of Arizona Species Tree.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
394C, October 2, 2013 Topics: Multiple Sequence Alignment
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.
BBCA: Improving the scalability of *BEAST using random binning Tandy Warnow The University of Illinois at Urbana-Champaign Co-authors: Theo Zimmermann.
TIPP: Taxon Identification and Phylogenetic Profiling Tandy Warnow The Department of Computer Science The University of Texas at Austin.
Constructing the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.
Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.
Family of HMMs Nam Nguyen University of Texas at Austin.
New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.
The Mathematics of Estimating the Tree of Life Tandy Warnow The University of Illinois.
Three approaches to large- scale phylogeny estimation: SATé, DACTAL, and SEPP Tandy Warnow Department of Computer Science The University of Texas at Austin.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
TIPP: Taxon Identification using Phylogeny-Aware Profiles Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign.
CS/BIOE 598: Algorithmic Computational Genomics Tandy Warnow Departments of Bioengineering and Computer Science
Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.
Algorithms for Ultra-large Multiple Sequence Alignment and Phylogeny Estimation Tandy Warnow Department of Computer Science The University of Texas at.
Approaching multiple sequence alignment from a phylogenetic perspective Tandy Warnow Department of Computer Sciences The University of Texas at Austin.
Ultra-large alignments using Ensembles of HMMs Nam-phuong Nguyen Institute for Genomic Biology University of Illinois at Urbana-Champaign.
SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.
Progress and Challenges for Large-Scale Phylogeny Estimation Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for.
Simultaneous alignment and tree reconstruction Collaborative grant: Texas, Nebraska, Georgia, Kansas Penn State University, Huston-Tillotson, NJIT, and.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.
New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation Tandy Warnow Departments of Bioengineering and Computer Science The University.
Ensembles of HMMs and their use in biomolecular sequence analysis Nam-phuong Nguyen Carl R. Woese Institute for Genomic Biology University of Illinois.
Scaling BAli-Phy to Large Datasets June 16, 2016 Michael Nute 1.
CS 466 and BIOE 498: Introduction to Bioinformatics
CS 581 / BIOE 540: Algorithmic Computational Genomics
Chalk Talk Tandy Warnow
Multiple Sequence Alignment Methods
Techniques for MSA Tandy Warnow.
Algorithm Design and Phylogenomics
Large-Scale Multiple Sequence Alignment
CS 581 Algorithmic Computational Genomics
Tandy Warnow Founder Professor of Engineering
New methods for simultaneous estimation of trees and alignments
CS 394C: Computational Biology Algorithms
Algorithms for Inferring the Tree of Life
Sequence alignment CS 394C Tandy Warnow Feb 15, 2012.
New methods for simultaneous estimation of trees and alignments
Ultra-large Multiple Sequence Alignment
Advances in Phylogenomic Estimation
Advances in Phylogenomic Estimation
Presentation transcript:

Advancing Genome-Scale Phylogenomic Analysis Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for Genomic Biology The University of Illinois at Urbana-Champaign

Orangutan GorillaChimpanzee Human From the Tree of the Life Website, University of Arizona Species Tree

Phylogenies and Applications Basic Biology: How did life evolve? Applications of phylogenies to: protein structure and function population genetics human migrations metagenomics

“Nothing in biology makes sense except in the light of evolution” – Theodosius Dobzhansky, 1973 essay in the American Biology Teacher, vol. 35, pp “…... nothing in evolution makes sense except in the light of phylogeny...” – Society of Systematic Biologists,

DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT TAGCCCATAGACTTAGCGCTTAGCACAAAGGGCAT TAGCCCTAGCACTT AAGACTT TGGACTTAAGGCCT AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT

Phylogeny Estimation TAGCCCATAGACTTTGCACAATGCGCTTAGGGCAT UVWXY U VW X Y

Maximum Likelihood Phylogeny Estimation: A Grand Challenge TAGCCCATAGACTTTGCACAATGCGCTTAGGGCAT UVWXY U VW X Y

Phylogeny + genomics = genome-scale phylogeny estimation

Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity

Phylogenomic Pipelines Finding related genomic regions (homology detection) Multiple sequence alignment Maximum Likelihood phylogeny estimation for single genes Species tree estimation from multiple conflicting genes Answer biological questions

Computational Grand Challenges NP-hard problems Exact solutions infeasible – heuristics necessary Parallelism helpful but does not address scalability with number of species Large datasets: 100,000+ sequences 10,000+ genes “BigData” complexity

Avian Phylogenomics Project E Jarvis, HHMI G Zhang, BGI Approx. 50 species, whole genomes, 14,000 loci Jarvis, Mirarab, et al., Science 2014 MTP Gilbert, Copenhagen S. Mirarab Md. S. Bayzid, UT-Austin UT-Austin T. Warnow UT-Austin Plus many many other people… Challenge #1: Maximum likelihood analysis of concatenated multiple sequence alignments took 250 CPU years using supercomputers around the globe. Challenge #2: Massive gene tree heterogeneity

1kp: Thousand Transcriptome Project PNAS 2014 (about 100 species and 800 genes) Second analysis underway, much larger dataset (~1200 species and ~1000 loci) Gene Tree Incongruence G. Ka-Shu Wong U Alberta N. Wickett Northwestern J. Leebens-Mack U Georgia N. Matasci iPlant T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin Challenge #1: Construct multiple sequence alignment and tree for more than 100,000 sequences Challenge #2: Massive gene tree heterogeneity Plus many many other people…

Three of my favorite Grand Challenges Multiple Sequence Alignment: Methods for large-scale MSA (up to 1,000,000 sequences, including fragments) Phylogenomics: Methods for multi-locus species tree estimation that are robust to gene tree incongruence due to incomplete lineage sorting (ILS) and horizontal gene transfer (HGT) Supertree estimation: Methods that combine smaller trees into larger trees (useful for divide-and-conquer)

Multiple Sequence Alignment (MSA): an important grand challenge 1 S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- … Sn = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC … Sn = TCACGACCGACA Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation 1 Frontiers in Massive Data Analysis, National Academies Press, 2013

Consequences of MSA difficulties Biologists restrict dataset size or use inadequate methods to be able to analyze their data. Alignment accuracy is reduced, which impacts downstream analyses: Protein structure and function prediction Metagenomic taxon identification Phylogeny estimation Detection of positive selection Scientific discoveries are jeapordized.

Today’s talk: MSA estimation PASTA: divide-and-conquer method to “boost” a MSA method Default PASTA (using MAFFT): can align 1,000,000 sequences with high accuracy and speed PASTA+BAli-Phy: integrating Bayesian statistical alignment method into PASTA, scaling to 10,000 sequences

Today’s talk: MSA estimation PASTA: divide-and-conquer method to “boost” a MSA method Default PASTA (using MAFFT): can align 1,000,000 sequences with high accuracy and speed PASTA+BAli-Phy: integrating Bayesian statistical alignment method into PASTA, scaling to 10,000 sequences

…ACGGTGCAGTTACC-A… …AC----CAGTCACCTA… The true multiple alignment – Reflects historical substitution, insertion, and deletion events – Defined using transitive closure of pairwise alignments computed on edges of the true tree … ACGGTGCAGTTACCA … Substitution Deletion … ACCAGTCACCTA … Insertion

Input: unaligned sequences S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 1: Alignment S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Phase 2: Construct tree S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 S4 S2 S3

S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 S4 S2 S3 Statistical co-estimation would be much better!!!

Simulation Studies S1S2 S3S4 S1 = - AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT GACCGC-- S4 = TCAC--GACCGACA S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA S1 = - AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-C--T-----GACCGC-- S4 = T---C-A-CGACCGA----CA Compare True tree and alignment S1S4 S3S2 Estimated tree and alignment Unaligned Sequences

Quantifying Error FN: false negative (missing edge) FP: false positive (incorrect edge) FN FP 50% error rate

Two-phase estimation Alignment methods Clustal POY (and POY*) Probcons (and Probtree) Probalign MAFFT Muscle Di-align T-Coffee Prank (PNAS 2005, Science 2008) Opal (ISMB and Bioinf. 2007) FSA (PLoS Comp. Bio. 2009) Infernal (Bioinf. 2009) Etc. Phylogeny methods Bayesian MCMC Maximum parsimony Maximum likelihood Neighbor joining FastME UPGMA Quartet puzzling Etc. RAxML: heuristic for large-scale ML optimization

1000-taxon models, ordered by difficulty (Liu et al., 2009)

Key Observations Datasets that are large and have high rates of evolution are difficult to align accurately. However, datasets with slow rates of evolution can be aligned with high accuracy. Not all MSA methods can run on large datasets (and some cannot even run on moderate-sized datasets). These observations suggest divide-and-conquer to boost MSA methods to larger datasets.

Re-aligning on a tree (boosting an MSA method) A B D C Merge sub- alignments Estimate ML tree on merged alignment Decompose dataset AB CD Align subsets AB CD ABCD

SATé and PASTA Algorithms Estimate ML tree on new alignment Tree Obtain initial alignment and estimated ML tree Use tree to compute new alignment Alignment Repeat until termination condition, and return the alignment/tree pair with the best ML score

1000-taxon models, ordered by difficulty – rate of evolution generally increases from left to right SATé-1 24 hour analysis, on desktop machines (Similar improvements for biological datasets) SATé-1 can analyze up to about 8,000 sequences. SATé-1 (Science 2009) performance

1000-taxon models ranked by difficulty SATé-1 vs. SATé-2 (Systematic Biology, 2012) SATé-1: up to 8K SATé-2: up to ~50K

Simulated RNASim datasets from 10K to 200K taxa Limited to 24 hours using 12 CPUs Not all methods could run (missing bars could not finish) PASTA: even better than SATé-2

UPP UPP = “Ultra-large multiple sequence alignment using Phylogeny-aware Profiles” Nguyen, Mirarab, and Warnow. Genome Biology, Purpose: highly accurate large-scale multiple sequence alignments, even in the presence of fragmentary sequences.

UPP Algorithmic Approach 1.Select small random subset of full-length sequences, and build “backbone alignment” 2.Construct an “Ensemble of Hidden Markov Models” on the backbone alignment 3.Add all remaining sequences to the backbone alignment using the Ensemble of HMMs

RNASim Million Sequences: alignment error Notes: We show alignment error using average of SP-FN and SP-FP. UPP variants have better alignment scores than PASTA. (Not shown: Total Column Scores – PASTA more accurate than UPP) No other methods tested could complete on these data PASTA under-aligns: its alignment is 43 times wider than true alignment (~900 Gb of disk space). UPP alignments were closer in length to true alignment (0.93 to 1.38 wider).

RNASim Million Sequences: tree error Using 12 TACC processors: UPP(Fast,NoDecomp) took 2.2 days, UPP(Fast) took 11.9 days, and PASTA took 10.3 days

PASTA Running Time and Scalability One iteration Using 12 cpus 1 node on Lonestar TACC Maximum 24 GB memory Showing wall clock running time ~ 1 hour for 10k taxa ~ 17 hours for 200k taxa

PASTA and UPP: boosters of MSA methods PASTA – Combines iteration and divide-and-conquer to “boost” a preferred MSA method to large datasets; we showed results based on MAFFT UPP – Step 1: Constructs a “backbone” tree and an alignment on a small random subset of the sequences – Step 2: Aligns all the remaining sequences to the backbone alignment – We showed results where default PASTA computed the backbone alignment and tree. Challenge: Can we boost statistical alignment methods?

PASTA and UPP: boosters of MSA methods PASTA – Combines iteration and divide-and-conquer to “boost” a preferred MSA method to large datasets; we showed results based on MAFFT UPP – Step 1: Constructs a “backbone” tree and an alignment on a small random subset of the sequences – Step 2: Aligns all the remaining sequences to the backbone alignment – We showed results where default PASTA computed the backbone alignment and tree. Challenge: Can we boost statistical alignment methods?

BAli-Phy: leading statistical co-estimation method BAli-Phy (Redelings and Suchard, 2005): – Statistical co-estimation of the sequence alignment and the tree, using Bayesian MCMC. – Output can be a multiple sequence alignment, a phylogeny, or both, and can give estimate of uncertainty in each one. 41

BAli-Phy: better than PASTA # Taxa: SimulatorIndelible (DNA)RNAsim (RNA) Total-Column Score Alignment Accuracy (TC score) *Averages over 10 replicates Each dataset has 100 or 200 sequences. To run BAli-Phy on a single dataset, we used 32 Blue Waters processors and ran BAli-Phy on each processor for 24 hours. We then collected all the samples from all the processors, and computed the posterior decoding (PD) alignment.

But: BAli-Phy is limited to small datasets BAli-Phy is computationally intensive: – 63 sequence dataset (Gaya et al., 2011) took 3 weeks – Largest dataset analyzed had 117 sequences (McKenzie et al., 2014) BAli-Phy is not scalable: – Our study shows it breaks somewhere before 500 sequences (numerical issues possibly) From – Too many taxa? – “BAli-Phy is quite CPU intensive, and so we recommend using 50 or fewer taxa in order to limit the time required to accumulate enough MCMC samples. (Despite this recommendation, data sets with more than 100 taxa have occasionally been known to converge.) We recommend initially pruning as many taxa as possible from your data set, then adding some back if the MCMC is not too slow.”

Use BAli-Phy as the subset aligner A B D C Merge sub- alignments Estimate ML tree on merged alignment Decompose dataset AB CD Align subsets AB CD ABCD

Scaling BAli-Phy to 1K sequences (using PASTA) 45

UPP+BAli-Phy better than default UPP (Scaling BAli-Phy to 10K sequences)

PASTA+BAli-Phy on Blue Waters BAli-Phy Job on BW (Approx )... PASTA Postprocessing Each PASTA job with 1,000 sequences creates BAli-Phy jobs BAli-Phy jobs run asynchronously for 24 hours on single node Results are downloaded to separate server and processed to yield a single alignment. Each individual PASTA job takes node-hours that can run using the low-priority backfill queue on blue waters. Data: 5 Model Conditions 10 Replicates each 17.8k node-hours Additional exploratory data used to develop the method in the paper required an additional 2,164 BAli-Phy processes and 52k node-hours. PASTA Input: 1,000 Sequences Unused capacity on Blue Waters allows the divide-and-conquer alignment algorithm to use the large number of nodes for easily parallelized jobs.

Other projects on Blue Waters Coalescent-based species tree estimation Supertree methods Extensions of UPP (using ensembles of Hidden Markov Models): – Metagenomic taxon identification (collaboration with Bill Gropp and Mihai Pop) – Protein family classification (collaboration with Jian Peng) Blue Waters is essential for method development, testing, and refinement. We also use Blue Waters to analyze biological datasets – when the otherwise available infrastructure is inadequate.

Scientific challenges: Ultra-large multiple-sequence alignment Gene tree estimation Metagenomic classification Alignment-free phylogeny estimation Supertree estimation Estimating species trees from many gene trees Genome rearrangement phylogeny Reticulate evolution Visualization of large trees and alignments Data mining techniques to explore multiple optima Theoretical guarantees under Markov models of evolution Techniques: applied probability theory, graph theory, supercomputing, and heuristics Testing: simulations and real data The Tree of Life: Multiple Challenges

Acknowledgments Papers available at PASTA and UPP at Funding: NSF ABI and III:AF: , a Founder Professorship from the Grainger Foundation, and HHMI (to S.M.) Computational support: BlueWaters (PASTA+BAliPhy) and TACC (PASTA and UPP)