Scaling BAli-Phy to Large Datasets (Michael Nute, June 16, 2016)

Presentation transcript:


BAli-Phy: Brief Summary
What is BAli-Phy? (Redelings & Suchard, 2005)
– Software from 2005 that takes unaligned sequences as input and co-estimates the alignment and the phylogeny in a way that accounts for indels.
– Output can be a multiple sequence alignment, a phylogeny, or both, with an estimate of the uncertainty in each.
Why is it interesting?
– The statistical model is unique and detailed, so given enough time it might find a better optimum than other methods.
– Experimental evidence has shown that it gives more accurate multiple sequence alignments than more common methods (Liu et al., 2012).

BAli-Phy: Brief Summary
What are the disadvantages?
– The software cannot handle more than ≈200 taxa, due to suspected numerical instability.
– Computation is very slow: most publications using it report runs of several weeks.
  – Gaya et al. (2011): 68 sequences ran in 3 weeks.
  – The largest data set we have found has 117 sequences (McKenzie et al., 2014).

BAli-Phy: Quick Look at Results (1 of 2)
[Figure: alignment error by number of taxa, for the Indelible (DNA) and RNAsim (RNA) simulators. Panels: False Negative % (1 – SP-Score) and False Positive % (1 – Modeler Score). Values are averages over 10 replicates.]

BAli-Phy: Quick Look at Results (2 of 2)
[Figure: alignment accuracy (Total-Column Score) by number of taxa, for the Indelible (DNA) and RNAsim (RNA) simulators. Values are averages over 10 replicates.]
Total-Column Score: percentage of columns from the reference alignment that are fully reproduced by the estimated alignment.
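
The three scores named on these two slides can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for the talk. It assumes alignments are given as dictionaries from sequence name to gapped row with `-` as the gap character, that the SP-Score is the fraction of reference residue pairs recovered, that the Modeler Score is the fraction of estimated residue pairs that are correct, and that every non-empty reference column counts toward the Total-Column Score. The function names are hypothetical.

```python
from itertools import combinations

def columns(alignment):
    """Convert {name: gapped_row} into a list of columns, each a frozenset of
    (name, residue_index) for the non-gap characters in that column."""
    names = sorted(alignment)
    length = len(alignment[names[0]])
    next_index = {name: 0 for name in names}
    cols = []
    for i in range(length):
        col = set()
        for name in names:
            if alignment[name][i] != "-":
                col.add((name, next_index[name]))
                next_index[name] += 1
        if col:
            cols.append(frozenset(col))
    return cols

def homologous_pairs(cols):
    """All pairs of residues that share a column."""
    pairs = set()
    for col in cols:
        pairs.update(combinations(sorted(col), 2))
    return pairs

def alignment_scores(reference, estimated):
    """Return SP-Score, Modeler Score, Total-Column Score and the two error rates."""
    ref_cols, est_cols = columns(reference), columns(estimated)
    ref_pairs, est_pairs = homologous_pairs(ref_cols), homologous_pairs(est_cols)
    shared = ref_pairs & est_pairs
    sp = len(shared) / len(ref_pairs) if ref_pairs else 0.0
    modeler = len(shared) / len(est_pairs) if est_pairs else 0.0
    tc = len(set(ref_cols) & set(est_cols)) / len(ref_cols) if ref_cols else 0.0
    return {"sp": sp, "modeler": modeler, "tc": tc,
            "false_negative": 1 - sp, "false_positive": 1 - modeler}

reference = {"a": "AC-G", "b": "ACTG", "c": "A-TG"}
estimated = {"a": "ACG-", "b": "ACTG", "c": "AT-G"}
scores = alignment_scores(reference, estimated)
```

In this toy case 5 of the 8 reference pairs are recovered and only the first reference column is fully reproduced.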

SATé and PASTA Algorithms
1) Obtain an initial alignment and an estimated ML tree.
2) Use the tree to compute a new alignment.
3) Estimate an ML tree on the new alignment.
Repeat steps 2 and 3 until a termination condition is met, and return the alignment/tree pair with the best ML score.
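
The iterate-and-keep-the-best loop on this slide can be sketched as follows. This is only a skeleton: the four callables (`initial_align`, `estimate_tree`, `realign`, `ml_score`) are hypothetical stand-ins for an external aligner, an ML tree estimator and a likelihood scorer, and a fixed iteration count is assumed as the termination condition.

```python
def iterate_alignment_and_tree(sequences, initial_align, estimate_tree,
                               realign, ml_score, max_iterations=3):
    """Alternate between re-aligning on the current tree and re-estimating
    the tree, and return the best-scoring (score, alignment, tree) triple."""
    alignment = initial_align(sequences)
    tree = estimate_tree(alignment)
    best = (ml_score(alignment, tree), alignment, tree)
    for _ in range(max_iterations):          # assumed termination condition
        alignment = realign(sequences, tree)  # use tree to compute new alignment
        tree = estimate_tree(alignment)       # estimate tree on new alignment
        score = ml_score(alignment, tree)
        if score > best[0]:
            best = (score, alignment, tree)
    return best

# Toy stand-ins: the "alignment" is an integer and the score peaks at 2.
result = iterate_alignment_and_tree(
    sequences=["x"],
    initial_align=lambda seqs: 0,
    estimate_tree=lambda aln: aln,
    realign=lambda seqs, tree: tree + 1,
    ml_score=lambda aln, tree: -(aln - 2) ** 2,
)
```

Keeping the best pair rather than the last one matters because the score is not guaranteed to improve on every iteration.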

PASTA Algorithm
Input: unaligned sequences
1) Get an initial alignment.
2) Estimate a tree on the current alignment.
3) Break the sequences into subsets according to the tree.
4) Use an external aligner to align the subsets.
5) Use an external profile aligner to merge pairs of subset alignments.
6) Use transitivity to merge the subset pairs into a full alignment. Scrap the old tree and repeat from step 2.
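
Step 3 can be illustrated with a centroid-edge decomposition: repeatedly delete the edge that splits the taxa most evenly until every piece is small enough. The sketch below is an illustration under assumptions, not PASTA's own code: the tree is a plain list of undirected edges, `taxa` is the set of leaf labels, and the function names are hypothetical. It uses a simple quadratic search for the best edge, which is adequate for a small example.

```python
from collections import defaultdict

def component(edges, start):
    """Set of nodes reachable from `start` using the given undirected edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def decompose(edges, taxa, max_size):
    """Split `taxa` into subsets of at most `max_size` leaves by recursively
    removing the edge that balances the two sides best."""
    if len(taxa) <= max_size or not edges:
        return [sorted(taxa)]
    best = None
    for edge in edges:
        rest = [e for e in edges if e != edge]
        side = component(rest, edge[0])
        imbalance = abs(len(taxa) - 2 * len(taxa & side))
        if best is None or imbalance < best[0]:
            best = (imbalance, edge, side)
    _, edge, side = best
    rest = [e for e in edges if e != edge]
    left_edges = [e for e in rest if e[0] in side]
    right_edges = [e for e in rest if e[0] not in side]
    return (decompose(left_edges, taxa & side, max_size)
            + decompose(right_edges, taxa - side, max_size))

# Toy caterpillar tree with 8 leaves (A..H) and internal nodes i1..i6.
tree_edges = [("A", "i1"), ("B", "i1"), ("i1", "i2"), ("C", "i2"),
              ("i2", "i3"), ("D", "i3"), ("i3", "i4"), ("E", "i4"),
              ("i4", "i5"), ("F", "i5"), ("i5", "i6"), ("G", "i6"), ("H", "i6")]
leaves = set("ABCDEFGH")
subsets_of_4 = decompose(tree_edges, leaves, max_size=4)
subsets_of_2 = decompose(tree_edges, leaves, max_size=2)
```

Because the subsets follow the tree, each one tends to contain closely related sequences, which is what makes the subset alignments in step 4 easier.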

Divide and Conquer with BAli-Phy
QUESTION: Since we saw that BAli-Phy gives better alignments on small numbers of taxa, could we get better alignments on large data sets by using BAli-Phy on subsets?
ANSWER: Yes, but it takes a lot of computing resources.
To get the most out of PASTA+BAli-Phy, start with the tree from the LAST iteration of default PASTA.

Methods to Compare
PASTA:
– Iterations 1-3: MAFFT (Subset Size 200)
– All default settings.
PASTA+BAli-Phy:
– Iterations 1-3: MAFFT (Subset Size 200)
– Iteration 4: BAli-Phy (Subset Size 100)
– Takes advantage of faster MAFFT on early iterations, where subsets are more diverse.
PASTA+MAFFT:
– Iterations 1-3: MAFFT (Subset Size 200)
– Iteration 4: MAFFT (Subset Size 100)
– Helpful for identifying whether any gain comes from the extra iteration or from using BAli-Phy.
MAFFT L-INS-i:
– Default MAFFT (v7.273) using the mafft-linsi command
– Useful comparison to benchmark the difficulty of the alignment. MAFFT is popular and L-INS-i is the most accurate version.

Error Reduction (1000 Sequences)
[Figure: False Negative % (i.e. 1 – SP-Score) and False Positive % (i.e. 1 – Modeler Score). Smaller is better.]

Accuracy Gain (1000 Sequences)
[Figure: Total-Column Score.]

Tree Error Relative to ML(Reference Alignment)
[Figure: Delta-RF (RAxML).]

Accuracy Gain (1000 Sequences, Detail)
[Figures only, two slides.]

Scaling to 10,000 Sequences
We can use UPP (Nguyen et al., 2015) to extend an alignment to larger numbers of sequences:
– Take a random “backbone” subset (i.e. 1,000 sequences from previous slides).
– Align the backbone.
– Align all remaining sequences to the backbone via HMMs.
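
The first step, choosing the backbone, amounts to a random split of the sequence names. The sketch below shows only that split. It is not UPP's code, the function name and the fixed seed are assumptions made for reproducibility, and the two alignment steps (backbone alignment and HMM-based insertion of the remaining sequences) are left to the external tools.

```python
import random

def split_backbone(names, backbone_size, seed=0):
    """Randomly pick `backbone_size` names as the backbone and return
    (backbone, queries), where queries are all remaining names."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    backbone = set(rng.sample(sorted(names), backbone_size))
    queries = [name for name in names if name not in backbone]
    return sorted(backbone), queries

all_names = [f"seq{i}" for i in range(10000)]
backbone, queries = split_backbone(all_names, backbone_size=1000)
```

With 10,000 input sequences and a backbone of 1,000, the expensive alignment method only ever sees the backbone, and the other 9,000 sequences are added afterwards.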

Scaling to 10,000 Sequences
Accuracy of the full alignment tends to track the accuracy of the backbone.
[Figures only, two slides.]

Hypothetical Running Time Comparison
Data: 1,000 taxa
Goal: Run PASTA (1 iteration, maximum subset size 100)
Resource: 1 server, 32 cores
Question: How long does each subset-alignment method take?
Answer: With 1,000 taxa and subsets no larger than 100, we’ll have approximately 15 subsets (between 10 and 16).
MAFFT:
– Takes 1 core approx. […] minutes to do 1 subset. Can do all subsets in parallel.
– Total: […] minutes
BAli-Phy:
– Takes 24 hours for all 32 cores to do 1 subset. Can’t run subsets in parallel since we only have 32 cores.
– Total: 15 days
These calculations are hypothetical but representative. If we have multiple servers, we can run the BAli-Phy subsets in parallel and finish in less time. Still, if we want to run BAli-Phy, it makes the most sense to run several iterations with MAFFT first.
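
The BAli-Phy total is simple arithmetic, and the same calculation shows what extra servers would buy. The sketch below uses the slide's hypothetical figures (15 subsets, 24 hours per subset on one 32-core server). The function name and the assumption that each server handles one subset at a time are illustrative.

```python
import math

def bali_phy_days(num_subsets=15, hours_per_subset=24, servers=1):
    """Wall-clock days to align all subsets when each server works on one
    subset at a time and subsets are processed in rounds."""
    rounds = math.ceil(num_subsets / servers)
    return rounds * hours_per_subset / 24

min_subsets = math.ceil(1000 / 100)     # lower bound on the number of subsets
one_server = bali_phy_days(servers=1)   # 15 rounds of 24 hours
four_servers = bali_phy_days(servers=4) # 4 rounds of 24 hours
one_per_subset = bali_phy_days(servers=15)
```

With one server per subset the whole run takes as long as a single subset, which is the sense in which the divide-and-conquer approach lets BAli-Phy scale in parallel.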

Summary
– BAli-Phy provides more accurate alignments than MAFFT on small data, which can translate to more accurate alignments on up to 1,000 taxa by boosting with PASTA.
– Although the running time is longer, this allows BAli-Phy to be scaled in a parallel way, so we can align 1,000 sequences in the time it takes to align 100.
– Alignment accuracy translates to improved tree accuracy on this data.
– Alignments can be further extended to 10,000 sequences using UPP.
Code at:

Acknowledgements
Special Thanks: This is joint work with my advisor, Prof. Tandy Warnow. Special thanks to Nam Nguyen for early collaboration on this project and for teaching me the codebase for PASTA and UPP, and to Erin Molloy for several helpful discussions and suggestions over the course of this research.
NSF: This work was funded by NSF grant III:AF:
Blue Waters: This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI and ACI) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.

References
[1] Redelings, B. D., & Suchard, M. A. (2005). Joint Bayesian estimation of alignment and phylogeny. Systematic Biology, 54(3), 401–418.
[2] Liu, K., Raghavan, S., Nelesen, S., Linder, C. R., & Warnow, T. (2009). Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science (New York, N.Y.), 324(5934), 1561–1564.
[3] Katoh, K., Misawa, K., Kuma, K., & Miyata, T. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 30(14), 3059–3066.
[4] Mirarab, S., Nguyen, N., Guo, S., Wang, L.-S., Kim, J., & Warnow, T. (2015). PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences. Journal of Computational Biology, 22(5), 377–386.
[5] McKenzie, S. K., Oxley, P. R., & Kronauer, D. J. C. (2014). Comparative genomics and transcriptomics in ants provide new insights into the evolution and function of odorant binding and chemosensory proteins. BMC Genomics, 15(1), 718.
[6] Liu, K., Warnow, T. J., Holder, M. T., Nelesen, S. M., Yu, J., Stamatakis, A. P., & Linder, C. R. (2012). SATe-II: Very Fast and Accurate Simultaneous Estimation of Multiple Sequence Alignments and Phylogenetic Trees. Systematic Biology, 61(1), 90–106.
[7] Nguyen, N. D., Mirarab, S., Kumar, K., & Warnow, T. (2015). Ultra-large alignments using phylogeny-aware profiles. Genome Biology, 16(1),