Ultra-large alignments using Ensembles of HMMs
Nam-phuong Nguyen
Institute for Genomic Biology
University of Illinois at Urbana-Champaign
UPP: Ultra-large alignment
UPP: Ultra-large alignments using Phylogeny-aware Profiles
Objective: Estimate accurate alignments on large datasets, which may be evolutionarily divergent and contain fragmentary sequences.
Nguyen N., Mirarab S., Kumar K., and Warnow T. RECOMB 2015.
UPP Algorithmic Strategy
RNASim: alignment error
Note: All methods were given 24 hours on a 12-core machine. MAFFT fails to complete on 200K sequences. Clustal-Omega completes only on the 10K dataset.
1 Million RNASim: UPP(Fast) generated an alignment in 12 days, compared to 15 days for PASTA. UPP(Fast) produced a better alignment (5.7% lower error), but PASTA produced a better tree (1.5% lower error).
Running Time
Wall-clock time used (in hours), given 12 processors
Ensemble of HMMs
Use a collection of HMMs instead of a single HMM to represent a backbone alignment.
Improves alignment accuracy, which can lead to better downstream analyses:
– Phylogenetic placement (SEPP; PSB 2012)
– Taxonomic identification (TIPP; Bioinformatics 2014)
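The ensemble idea can be illustrated with a toy sketch. It is not UPP's implementation. It assumes ungapped per-column frequency profiles in place of real profile HMMs (no insert or delete states), a hand-made split of the backbone into two subsets, and invented sequences. The names `build_profile`, `score`, and `best_profile` are hypothetical. The point is only that a query from one part of the backbone may score better against a profile built on a subset than against a single profile built on everything.

```python
import math
from collections import Counter

ALPHABET = "ACGU"  # RNA alphabet; gap characters are ignored when counting


def build_profile(aligned_seqs, pseudocount=1.0):
    """Toy stand-in for a profile HMM: per-column log-probabilities."""
    length = len(aligned_seqs[0])
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in aligned_seqs if s[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append(
            {ch: math.log((counts[ch] + pseudocount) / total) for ch in ALPHABET}
        )
    return profile


def score(profile, query):
    """Log-likelihood of an ungapped query under a toy profile."""
    return sum(column[ch] for column, ch in zip(profile, query))


def best_profile(profiles, query):
    """Return the index of the best-scoring profile and all scores."""
    scores = [score(p, query) for p in profiles]
    best = max(range(len(profiles)), key=scores.__getitem__)
    return best, scores


# Hypothetical backbone alignment, split by hand into two subsets.
backbone_subsets = [
    ["ACGUAC", "ACGUAC", "ACGAAC"],
    ["UUGCAA", "UUGCAA", "UUGCGA"],
]

# Ensemble: one profile per subset. Single: one profile for the whole backbone.
ensemble = [build_profile(subset) for subset in backbone_subsets]
single = build_profile([s for subset in backbone_subsets for s in subset])

query = "UUGCAA"
idx, scores = best_profile(ensemble, query)
print("best subset profile:", idx)
print("ensemble score:", round(scores[idx], 3))
print("single-profile score:", round(score(single, query), 3))
```

In this toy case the query matches the second subset's profile more closely than the single pooled profile, because pooling dilutes the column frequencies that distinguish the two groups.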